Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in Making With Code (MWC), a high school computer science curriculum designed around the constructionist principle that students learn most durably by building things they care about. In MWC's games unit, students design and implement their own games using the ``retro-games`` framework: a Python library for building terminal-based, character-grid games in the style of early arcade software. Students start from concept, work through design, implement agents and game logic in Python, and end with a complete, playable game.

The games unit gives students deep familiarity with one particular game and its code. They know which characters appear on the board, what the state dictionary contains, how reward accumulates, and what strategies tend to work. This knowledge is ordinarily tacit, embedded in how they play, but it is exactly the kind of knowledge that ``retro-gamer`` asks students to make explicit. The act of writing metadata that accurately describes your game to a learning algorithm (the ``[tool.retro-gamer]`` section of ``pyproject.toml``, discussed below) is a form of structured reflection: you have to articulate, in precise terms, what you know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert introduced the concept of *objects to think with*: concrete artifacts that serve as anchors for otherwise abstract ideas (Papert 1980). A gear, for Papert, was an object to think with about mathematics. The turtle in Logo was an object to think with about procedural thinking. In each case, the learner's embodied, intuitive knowledge of the object (how gears mesh, how the turtle moves) provides traction on abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich object to think with.
The student knows the game's behavior intimately: they have watched characters interact, experienced the score signal as meaningful, and developed intuitions about what makes a good move. These intuitions are not merely useful; they are *translatable* into the language of reinforcement learning. The reward signal the student experiences as a player is the same signal the trainer uses to evaluate actions. The patterns the student recognizes as meaningful on the board are precisely the patterns a convolutional neural network is designed to detect. The exploration-exploitation tradeoff the trainer navigates (trying new things versus sticking with what has worked) is analogous to the choices a student makes when learning a new game.

``retro-gamer`` is designed to make these translations visible. When the student reads the training log and sees that the trainer chose a CNN because the game is spatial, they can connect that decision to their own knowledge of how the board works. When they see the reward increasing episode by episode, they can reason about *why* (what the agent is learning to do) rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does not make explicit. They know which characters matter: which ones indicate danger, opportunity, or neutral terrain. They know what game state changes signal success. They know whether the arrangement of pieces on the board is meaningful or incidental. This knowledge is usually tacit: embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by writing a ``[tool.retro-gamer]`` section in their game's ``pyproject.toml``. The choice of location is deliberate: placing game metadata in the game's own project file frames it as *a property of the game*, not as a configuration setting for the training tool.
The student is not giving hints to the trainer; they are accurately describing what they built.

This framing matters for how students reason about the relationship between description and performance. A student who omits a character from the character set and then notices degraded training performance is not observing a failure of their trainer configuration; they are observing the consequence of having described the game inaccurately. The fix is not to adjust a hyperparameter; it is to write a more accurate description. The question "is my description of the game correct?" is precisely the kind of structured reflection that produces conceptual understanding, because it requires the student to connect what they know about the game to the representations the learning algorithm uses.

Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding of reinforcement learning. Students may engage deeply with the implementation details of their game while remaining unable to articulate the big ideas that ``retro-gamer`` is meant to make salient. Research in the knowledge-building tradition (Scardamalia and Bereiter 2006) suggests that conceptual understanding deepens substantially when students discuss their ideas with others: explaining, questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific, grounded questions that productive discussion requires. "What happens if I leave a character out of the character set?" is not an abstract question; it is a question about a specific game the student knows well, and it has a specific, reasoned answer. "Why does training improve faster with prioritized experience replay?" connects a hyperparameter setting to a mechanism. These are better starting points for discussion than the generic questions that arise from reading about reinforcement learning without a concrete artifact to refer to.


Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being evaluated in a research study conducted in the context of MWC's games unit. The study investigates how two interventions (using ``retro-gamer`` to train an agent, and discussing reinforcement learning with a large language model) interact to support conceptual understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified character set. If you forgot to include one of the characters which is used in the game, how would it affect the trained agent's performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a specified character set. You realize that only half of the specified characters are actually used in the game. If you change the character set to include only the characters that actually appear, how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and partial success has no value (for example, a game where the goal is to escape a maze). What would be the effect on agent training of adding artificial rewards for completing sub-goals such as reaching a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual understanding, even where specific misconceptions remain.

Participants all receive a traditional classroom lesson on reinforcement learning before the study begins, ensuring that the same conceptual vocabulary is available to everyone. They then complete a pretest of the conceptual questions. Participants are randomly assigned to one of four conditions in a 2×2 design: the first factor is whether they use ``retro-gamer`` to train an agent on their game; the second is whether they discuss reinforcement learning with a large language model.
One week later, participants complete the posttest. We hypothesize that the combination of ``retro-gamer`` and LLM discussion will produce the largest gains, mediated by more specific and more numerous questions to the LLM, a sign that students are reasoning more deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying ``retro-gamer``. It is intended to be accessible to students who have not studied machine learning before, while also connecting each concept to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent* to make good decisions by interacting with an *environment*. At every moment, the environment is in some *state*, and the agent observes something about that state. The agent chooses an *action*, the environment transitions to a new state in response, and the agent receives a *reward* signal: a number that indicates how well it is doing. The agent's goal is to learn a *policy*: a rule for choosing actions that maximizes the total reward it accumulates over time.

In ``retro-gamer``, the game is the environment, the character grid and state dictionary are what the agent observes, pressing a key is an action, and the change in score is the reward.

A distinctive feature of reinforcement learning, one that distinguishes it from supervised learning (where a model is trained on labeled examples), is that the agent must discover what good behavior looks like through experience. There is no teacher providing correct answers. The reward signal is all the agent has to go on. This makes reinforcement learning both powerful (it can find solutions no human designer would think to specify) and tricky (poorly chosen reward signals can produce strange or unintended behavior).

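
The agent-environment loop described above can be sketched in a few lines of Python. This is a minimal illustration, not ``retro-gamer``'s implementation: the ``CorridorGame`` class, its ``reset`` and ``step`` methods, and the random policy are hypothetical names invented for this example.

```python
import random


class CorridorGame:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (next_state, reward, done)."""
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        reward = 1.0 if done else 0.0  # reward only for reaching the goal
        return self.position, reward, done


def random_policy(state):
    """A policy maps states to actions; this one ignores the state entirely."""
    return random.choice([-1, +1])


def run_episode(env, policy, max_turns=200):
    """Run one episode and return the total reward the agent accumulated."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # the agent receives a reward
        if done:
            break
    return total_reward


print(run_episode(CorridorGame(), random_policy))
```

Learning amounts to replacing ``random_policy`` with a policy that improves based on the rewards it observes.
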

The total reward the agent receives from a given state onward, if it acts according to its current policy, is called the *return*. Because rewards in the far future are harder to predict and plan for, RL algorithms typically *discount* future rewards: a reward received ``t`` turns from now is worth only ``γ^t`` times its face value, where ``γ`` (gamma) is a number between 0 and 1, usually slightly less than 1. The ``gamma`` hyperparameter in ``retro-gamer`` controls this discount. A value close to 1 means the agent values the distant future almost as much as the immediate present; a smaller value makes the agent more myopic.

Q-learning
~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function* (or *Q-value*): Q(s, a) is the expected total discounted reward the agent will receive if it is in state ``s``, takes action ``a``, and then follows its policy from that point on. If the agent knew the Q-function of the best possible policy, it could act optimally simply by choosing the action with the highest Q-value in each state.

Q-learning is an algorithm for learning that Q-function from experience. Starting from an arbitrary initial estimate, the agent uses the *Bellman equation* to update its Q-estimates after each transition. The key insight is that the Q-value of taking action ``a`` in state ``s`` is related to the immediate reward and the best Q-value achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

Here ``r`` is the immediate reward, ``s'`` is the next state, and ``a'`` ranges over the actions available in ``s'``. After each turn, the agent computes the *temporal difference* (TD) error (the gap between its current Q-estimate and what the Bellman equation says it should be) and adjusts its estimates to reduce the error. Over many iterations, the Q-estimates tend to converge toward their true values.

Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for every possible (state, action) pair. This is feasible only when the number of possible states is small.
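
The Bellman update and the table-based representation can be sketched as follows. This is an illustrative sketch on a hypothetical five-cell corridor, not the code ``retro-gamer`` runs: the function names, the learning rate ``alpha``, and the toy environment are assumptions made for this example. Rather than replacing the estimate outright, the sketch moves it a fraction ``alpha`` of the way toward the Bellman target, as Q-learning implementations commonly do.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)  # move left or right along the corridor
GOAL = 4            # reaching cell 4 ends the episode with reward 1


def step(state, action):
    """Hypothetical toy environment: return (next_state, reward, done)."""
    next_state = max(0, min(GOAL, state + action))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train_q_table(episodes=500, gamma=0.9, alpha=0.5, epsilon=0.2, seed=0):
    """Learn a Q-table for the corridor with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = defaultdict(float)  # the table: one entry per (state, action) pair
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit, breaking ties at random so early episodes still move.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Bellman target: r + gamma * max_a' Q(s', a'), with no future
            # value once the episode has ended.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            td_error = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td_error
            state = next_state
    return q


q = train_q_table()
for s in range(GOAL):
    print(s, {a: round(q[(s, a)], 3) for a in ACTIONS})
```

The table here has only eight entries (four non-terminal cells times two actions), which is what makes the tabular approach workable for this toy problem.
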
For a game board with even modest dimensions (say 32×16 cells, each displaying one of a handful of characters) the number of possible board configurations is astronomically large. Storing a table of Q-values for every configuration is not practical.

*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this problem by approximating the Q-function with a neural network. Instead of a table, the network takes the current state as input and outputs Q-value estimates for all possible actions simultaneously. The network *generalizes*: having learned that moving right is a good idea when the apple is to the right and nothing is in the way, it applies that knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At each turn, the agent uses its current network to estimate Q-values and selects an action. It stores the experience, a tuple (state, action, reward, next state), in a *replay buffer*. Periodically, it samples a random batch of experiences from the buffer and uses them to compute TD errors, then adjusts the network weights to reduce those errors. This process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training on experiences as they arrive (which would mean training on correlated, sequential transitions), the agent stores experiences in a buffer and samples them randomly for training. This has two benefits. First, each experience is potentially used many times for training, making data use more efficient. Second, random sampling breaks the correlations between consecutive transitions, which would otherwise cause the network's weight updates to interfere with each other.

``retro-gamer`` offers a standard replay buffer and an optional *prioritized* replay buffer (PER). In PER, experiences with larger TD errors (cases where the agent's prediction was most wrong) are sampled more often.
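
Both sampling strategies can be sketched with the standard library. The ``ReplayBuffer`` class and its method names are hypothetical, and the prioritized sampler is a deliberately simplified version of the proportional scheme in Schaul et al. (2015): it samples in proportion to the absolute TD error and leaves out the priority exponent and any correction for the bias this introduces.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest experiences are discarded when full."""

    def __init__(self, capacity):
        self.experiences = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, experience, td_error=1.0):
        self.experiences.append(experience)
        # A small constant keeps zero-error experiences from never being sampled.
        self.priorities.append(abs(td_error) + 1e-3)

    def sample_uniform(self, batch_size, rng=random):
        """Standard replay: every stored experience is equally likely."""
        return rng.sample(list(self.experiences), batch_size)

    def sample_prioritized(self, batch_size, rng=random):
        """Simplified PER: sampling probability proportional to |TD error|."""
        return rng.choices(
            list(self.experiences), weights=list(self.priorities), k=batch_size
        )


buffer = ReplayBuffer(capacity=3)
buffer.add("expected outcome", td_error=0.1)
buffer.add("big surprise", td_error=5.0)
buffer.add("another expected outcome", td_error=0.1)
print(buffer.sample_uniform(2))
print(buffer.sample_prioritized(5))
```
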
The intuition is that surprising transitions are more informative. Prioritized replay often improves training efficiency but introduces a bias that must be corrected with *importance sampling weights* (Schaul et al. 2015).

The ``memory_capacity`` hyperparameter sets how many experiences the buffer can hold. When the buffer is full, old experiences are discarded. A larger buffer provides more diverse training data but uses more memory.

Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the Q-values computed by the Bellman equation depend on the network's own estimates of the next state's Q-values. If the network is updated constantly, its Q-value estimates keep shifting, making the training target a moving one. This can cause instability.

DQN addresses this with a *target network*: a copy of the main network that is updated only every ``target_update_freq`` steps. The Bellman target is computed using the target network, while the main network is updated by gradient descent. Because the target network changes slowly, training targets remain stable long enough for the main network to make progress.

Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it *exploit* what it already knows (taking the action with the highest estimated Q-value) or *explore* (trying actions it is less certain about, in case they lead to better outcomes it has not yet discovered)? Exploiting too much early in training means the agent never discovers better strategies; exploring too much later means the agent wastes time on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε (epsilon), the agent chooses a random action; with probability 1 − ε, it exploits its current Q-function. ε starts at 1 (pure exploration) and decays over training according to ``epsilon_decay``, reaching a floor of ``epsilon_min``.
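
A minimal sketch of ε-greedy action selection and a decay schedule follows. The text specifies only that ε starts at 1, decays according to ``epsilon_decay``, and stops at ``epsilon_min``; the multiplicative per-episode decay, the default values, and the function names below are assumptions chosen for illustration.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def epsilon_schedule(episodes, epsilon_decay=0.99, epsilon_min=0.05):
    """Assumed schedule: multiply epsilon by epsilon_decay once per episode."""
    epsilon = 1.0
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return values


schedule = epsilon_schedule(500)
for episode in (0, 100, 200, 300, 499):
    print(episode, round(schedule[episode], 3))
```
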
Reading the ``epsilon`` column in the training log shows how exploration decreases as training progresses.

Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the game board can be fed to the Q-network, it must be converted to a numerical representation.

``retro-gamer`` uses *one-hot encoding*. For a character set of ``n`` distinct characters, each cell on the board is represented by a vector of ``n`` numbers, all zero except for the one position corresponding to the character in that cell, which is set to 1. For example, with character set ``['@', '*', '>']``, the character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape (H, W, C), where H is the board height, W is the board width, and C is the number of characters in the character set. The total number of numbers in this array (H × W × C) is the size of the board part of the observation. For a 32×16 board with 6 characters, this is 32 × 16 × 6 = 3,072 numbers.

The ``character_set`` field in the game description determines which characters the agent can distinguish. A character not in the set appears as an all-zero vector, indistinguishable from an empty cell. If the character set is not specified, ``retro-gamer`` runs a brief exploration phase before training to observe which characters actually appear.

In addition to the board, the agent can observe numerical values from the game's state dictionary via ``observe_state``. These are appended to the end of the observation vector. The reward key must not be included in ``observe_state``: it would give the agent direct access to its own performance signal, which is not a realistic observation in most game contexts and can cause training pathologies.

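
The encoding described above can be sketched in plain Python. The function names are hypothetical, and the sketch assumes that the board arrives as a list of equal-length strings, that a blank space marks an empty cell, and that the board is flattened row by row; ``retro-gamer``'s internal representation may differ.

```python
def encode_board(rows, character_set):
    """One-hot encode a board given as a list of equal-length strings.

    Returns a nested list of shape (H, W, C). Characters missing from
    character_set, including blank cells, become all-zero vectors.
    """
    index = {char: i for i, char in enumerate(character_set)}
    encoded = []
    for row in rows:
        encoded_row = []
        for char in row:
            cell = [0] * len(character_set)
            if char in index:
                cell[index[char]] = 1
            encoded_row.append(cell)
        encoded.append(encoded_row)
    return encoded


def build_observation(rows, character_set, state, observe_state):
    """Flatten the encoded board and append the selected state values."""
    board = encode_board(rows, character_set)
    flat = [n for row in board for cell in row for n in cell]
    return flat + [float(state[key]) for key in observe_state]


board = ["@ *", " > "]
encoded = encode_board(board, ["@", "*", ">"])
print(encoded[1][1])  # the '>' cell: [0, 0, 1]
print(encoded[0][1])  # an empty cell: [0, 0, 0]

# Leaving '>' out of the character set makes it look like an empty cell.
print(encode_board(board, ["@", "*"])[1][1])  # [0, 0]
```
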

Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network (the number and arrangement of its layers) is one of the most consequential choices in DQN training. ``retro-gamer`` selects an architecture based on the ``spatial`` field in the game description and generates a plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the *multilayer perceptron* (MLP). An MLP is a sequence of *fully connected layers*: every unit in one layer is connected to every unit in the next. Each connection has a learnable *weight*; a unit computes a weighted sum of its inputs, passes it through a nonlinear *activation function* (``retro-gamer`` uses the rectified linear unit, or ReLU: ``max(0, x)``), and sends the result to the next layer. The final layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size 3,072 and 5 possible actions, would have about 410,000 trainable parameters: 3,072 × 128 + 128 × 128 + 128 × 5 = 410,240 weights, plus 261 biases. Training adjusts all of these parameters simultaneously to reduce the TD error.

An MLP treats its input as a flat list of numbers. It does not know that these numbers were arranged in a 2D grid, or that spatially adjacent cells are related. This is appropriate when the game's observation is better understood as a collection of independent readings (a set of meters or status indicators) rather than as a spatial scene. Set ``spatial = false`` in the game description to use this architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial (that is, when the relative positions of characters matter) a *convolutional neural network* (CNN) is a much better fit. A CNN applies a set of learnable *filters* (small weight matrices) across the board, computing a dot product of each filter with every overlapping patch of the input.
The result is a set of *feature maps*: each feature map highlights where in the board a particular pattern appears.

This is efficient for two reasons. First, the same filter is applied at every board position: a filter that detects "apple to the right of snake head" works the same way whether the apple is at position (10,5) or (20,12). This property, usually called *translational invariance* (more precisely, translation equivariance), means the network can generalize across positions without learning a separate rule for each one. Second, each filter needs only a small number of parameters (the filter size), far fewer than the equivalent fully connected connections.

``retro-gamer`` uses two convolutional layers (with 32 and 64 output channels respectively, kernel size 3, padding 1) followed by a flattening step and an MLP head. The padding ensures that the spatial dimensions are preserved through the convolution, so the output of the second conv layer has shape (64, H, W), with channels first, which is then flattened and passed to the MLP. Set ``spatial = true`` (the default) to use this architecture.

Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they follow from the game description you provide. This connection is worth making explicit, because understanding it is one of the main paths into understanding why neural network architecture matters.

- If ``spatial = true``, the CNN can detect local patterns (which characters are adjacent to which) without needing to see every possible arrangement. This is appropriate for games like Snake, where the snake's direction and the apple's relative position are spatially encoded.
- If ``spatial = false``, the MLP treats the board as a flat vector. This may be appropriate for games that use the character grid primarily as a display rather than a spatial field; for example, a game where characters appear in fixed, non-interacting positions as status indicators.
- The ``character_set`` determines the depth (C) of the board tensor. More characters mean more numbers per cell and a larger input to the network. A character set that includes characters the game never uses wastes capacity; a character set that omits relevant characters forces the agent to treat different things as the same.
- The ``observe_state`` fields are appended to the flattened CNN output before the MLP head. This allows the agent to use explicit state variables (a timer, a lives count) alongside the visual board representation.

These relationships are not incidental features of the implementation. They are the reason the game description matters: every field you fill in shapes what the agent can perceive and therefore what it can learn.
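
To make the link between the game description and network size concrete, the following sketch counts trainable parameters as the character set grows. It is a back-of-the-envelope calculation under stated assumptions, not a readout of ``retro-gamer``'s actual networks: the helper names are hypothetical, every layer is assumed to have bias terms, and the MLP head of the CNN is left uncounted because its width is not specified here. The MLP count uses the earlier example of two hidden layers of width 128, a 3,072-number observation, and 5 actions.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2D convolution with square kernels."""
    return c_in * c_out * kernel * kernel + c_out


def mlp_params(obs_size, n_actions, hidden=(128, 128)):
    """Parameter count of an MLP Q-network on a flat observation."""
    sizes = [obs_size, *hidden, n_actions]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


def cnn_summary(height, width, n_chars, n_state=0):
    """Conv-layer parameters and the size of the input to the MLP head."""
    conv = conv_params(n_chars, 32) + conv_params(32, 64)
    # Padding 1 with kernel size 3 preserves H and W, so the flattened
    # output has 64 * H * W numbers, plus any observe_state values.
    head_input = 64 * height * width + n_state
    return conv, head_input


H, W = 16, 32
for n_chars in (6, 12):
    print(
        n_chars,
        mlp_params(H * W * n_chars, 5),
        cnn_summary(H, W, n_chars, n_state=2),
    )
```

Doubling the character set roughly doubles the MLP's parameter count, because its first layer connects to every input number, while the convolutional layers grow only slightly, because each filter covers a 3 × 3 patch regardless of board size.
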