Initial commit

2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions
--- a/docs/background.rst
+++ b/docs/background.rst
@@ -0,0 +1,442 @@
+Background
+==========
+
+Pedagogical framework
+---------------------
+
+Making With Code and the games unit
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is developed for use in
+`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
+computer science curriculum designed around the constructionist
+principle that students learn most durably by building things they care
+about. In MWC's games unit, students design and implement their own
+games using the ``retro-games`` framework: a Python library for
+building terminal-based, character-grid games in the style of early
+arcade software. Students start from concept, work through design,
+implement agents and game logic in Python, and end with a complete,
+playable game.
+
+The games unit gives students deep familiarity with one particular
+game and its code. They know which characters appear on the board,
+what the state dictionary contains, how reward accumulates, and what
+strategies tend to work. This knowledge is ordinarily tacit—embedded
+in how they play—but it is exactly the kind of knowledge that
+``retro-gamer`` asks students to make explicit. The act of writing a
+``[tool.retro-gamer]`` section in ``pyproject.toml`` that accurately
+describes your game to a learning algorithm is a form of structured
+reflection: you have to articulate, in precise terms, what you know.
+
+Objects to think with
+~~~~~~~~~~~~~~~~~~~~~
+
+The educational psychologist and mathematician Seymour Papert
+introduced the concept of *objects to think with*: concrete artifacts
+that serve as anchors for otherwise abstract ideas (Papert 1980). A
+gear, for Papert, was an object to think with about mathematics. The
+turtle in Logo was an object to think with about procedural thinking.
+In each case, the learner's embodied, intuitive knowledge of the
+object—how gears mesh, how the turtle moves—provides traction on
+abstract relationships that might otherwise remain inaccessible.
+
+A game that a student has built and played is a particularly rich
+object to think with. The student knows the game's behavior
+intimately: they have watched characters interact, experienced the
+score signal as meaningful, and developed intuitions about what makes
+a good move. These intuitions are not merely useful—they are
+*translatable* into the language of reinforcement learning. The reward
+signal the student experiences as a player is the same signal the
+trainer uses to evaluate actions. The patterns the student recognizes
+as meaningful on the board are precisely the patterns a convolutional
+neural network is designed to detect. The exploration-exploitation
+tradeoff the trainer navigates—trying new things versus sticking with
+what has worked—is analogous to the choices a student makes when
+learning a new game.
+
+``retro-gamer`` is designed to make these translations visible. When
+the student reads the training log and sees that the trainer chose a
+CNN because the game is spatial, they can connect that decision to
+their own knowledge of how the board works. When they see the reward
+increasing episode by episode, they can reason about *why*—what the
+agent is learning to do—rather than watching an opaque number change.
+
+Metadata as structured reflection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A student who has built a game knows things about it that its code does
+not make explicit. They know which characters matter—which ones indicate
+danger, opportunity, or neutral terrain. They know what game state
+changes signal success. They know whether the arrangement of pieces on
+the board is meaningful or incidental. This knowledge is usually tacit:
+embedded in how they play, not in anything they have written down.
+
+``retro-gamer`` asks students to make this tacit knowledge explicit by
+writing a ``[tool.retro-gamer]`` section in their game's
+``pyproject.toml``. The choice of location is deliberate: placing game
+metadata in the game's own project file frames it as *a property of the
+game*, not as a configuration setting for the training tool. The student
+is not giving hints to the trainer; they are accurately describing what
+they built.
+
+This framing matters for how students reason about the relationship
+between description and performance. A student who omits a character
+from the character set and then notices degraded training performance is
+not observing a failure of their trainer configuration—they are
+observing the consequence of having described the game inaccurately.
+The fix is not to adjust a hyperparameter; it is to write a more
+accurate description. The question "is my description of the game
+correct?" is precisely the kind of structured reflection that produces
+conceptual understanding, because it requires the student to connect
+what they know about the game to the representations the learning
+algorithm uses.
+
+Knowledge building and discussion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Making a game does not, by itself, guarantee conceptual understanding
+of reinforcement learning. Students may engage deeply with the
+implementation details of their game while remaining unable to
+articulate the big ideas that ``retro-gamer`` is meant to make
+salient. Research in the knowledge-building tradition (Scardamalia and
+Bereiter 2006) suggests that conceptual understanding deepens
+substantially when students discuss their ideas with others—explaining,
+questioning, and revising their understanding in dialogue.
+
+``retro-gamer`` is designed to generate the kind of specific,
+grounded questions that productive discussion requires. "What happens
+if I leave a character out of the character set?" is not an abstract
+question; it is a question about a specific game the student knows
+well, and it has a specific, reasoned answer. "Why does training
+improve faster with prioritized experience replay?" connects a
+hyperparameter setting to a mechanism. These are better starting
+points for discussion than the generic questions that arise from
+reading about reinforcement learning without a concrete artifact to
+refer to.
+
+Research design
+~~~~~~~~~~~~~~~
+
+The pedagogical hypothesis underlying ``retro-gamer`` is being
+evaluated in a research study conducted in the context of MWC's games
+unit. The study investigates how two interventions—using
+``retro-gamer`` to train an agent, and discussing reinforcement
+learning with a large language model—interact to support conceptual
+understanding of reinforcement learning.
+
+The key outcome is measured by a set of scenario-based conceptual
+questions. Representative examples include:
+
+- *Imagine you were training an agent to play a game with a specified
+  character set. If you forgot to include one of the characters which
+  is used in the game, how would it affect the trained agent's
+  performance? Explain your reasoning.*
+- *Imagine you are training an agent to play a game which has a
+  specified character set. You realize that only half of the specified
+  characters are actually used in the game. If you change the
+  character set to include only the characters that actually appear,
+  how would the training process change? Explain your reasoning.*
+- *Imagine you are creating a game where the goal is to win, and
+  partial success has no value—for example, a game where the goal is
+  to escape a maze. What would be the effect on agent training of
+  adding artificial rewards for completing sub-goals such as reaching
+  a milestone halfway to the exit? Explain your reasoning.*
+
+Each question is evaluated using a rubric that rewards conceptual
+understanding, even where specific misconceptions remain.
+
+Participants all receive a traditional classroom lesson on
+reinforcement learning before the study begins, ensuring that the same
+conceptual vocabulary is available to everyone. They then complete a
+pretest of the conceptual questions. Participants are randomly assigned
+to one of four conditions in a 2×2 design: the first factor is whether
+they use ``retro-gamer`` to train an agent on their game; the second
+is whether they discuss reinforcement learning with a large language
+model. One week later, participants complete the posttest. We
+hypothesize that the combination of ``retro-gamer`` and LLM discussion
+will produce the largest gains, mediated by more specific and more
+numerous questions to the LLM—a sign that students are reasoning more
+deeply about the underlying concepts.
+
+Technical background
+--------------------
+
+This section provides a conceptual introduction to the ideas underlying
+``retro-gamer``. It is intended to be accessible to students who have
+not studied machine learning before, while also connecting each concept
+to the specific choices you make when using the tool.
+
+Reinforcement learning
+~~~~~~~~~~~~~~~~~~~~~~
+
+*Reinforcement learning* (RL) is a framework for training an *agent*
+to make good decisions by interacting with an *environment*.
+
+At every moment, the environment is in some *state*, and the agent
+observes something about that state. The agent chooses an *action*,
+the environment transitions to a new state in response, and the agent
+receives a *reward* signal—a number that indicates how well it is
+doing. The agent's goal is to learn a *policy*: a rule for choosing
+actions that maximizes the total reward it accumulates over time. In
+``retro-gamer``, the game is the environment, the character grid and
+state dictionary are what the agent observes, pressing a key is an
+action, and the change in score is the reward.
+
+A distinctive feature of reinforcement learning—distinguishing it from
+supervised learning, where a model is trained on labeled examples—is
+that the agent must discover what good behavior looks like through
+experience. There is no teacher providing correct answers. The reward
+signal is all the agent has to go on. This makes reinforcement
+learning both powerful (it can find solutions no human designer would
+think to specify) and tricky (poorly chosen reward signals can produce
+strange or unintended behavior).
+
+The total reward the agent receives from a given state onward—if it
+acts according to its current policy—is called the *return*. Because
+rewards in the far future are harder to predict and plan for, RL
+algorithms typically *discount* future rewards: a reward received
+``t`` turns from now is worth only ``γ^t`` times its face value, where
+``γ`` (gamma) is a number between 0 and 1, usually slightly less than
+1. The ``gamma`` hyperparameter in ``retro-gamer`` controls this
+discount. A value
+close to 1 means the agent values the distant future almost as much
+as the immediate present; a smaller value makes the agent more
+myopic.
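+
+Written out, the discounted return from turn ``t`` is the sum of
+future rewards, each weighted by a power of ``γ``. The notation below
+(``G_t`` for the return, ``r_{t+k}`` for the reward received ``k``
+turns later) is introduced here for illustration and is not taken
+from ``retro-gamer`` itself:
+
+.. math::
+
+   G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots
+       = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
+
+For example, a reward of 10 received 50 turns from now contributes
+about :math:`0.99^{50} \times 10 \approx 6.05` when ``γ = 0.99``, but
+only about :math:`0.9^{50} \times 10 \approx 0.05` when ``γ = 0.9``.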
+
+Q-learning
+~~~~~~~~~~~
+
+A natural way to formalize the agent's goal is to define the *Q-function*
+(or *Q-value*): Q(s, a) is the expected total discounted reward the
+agent will receive if it is in state ``s``, takes action ``a``, and
+then follows a given policy from that point on. If the agent knew the
+Q-function of the best possible policy (the *optimal* Q-function), it
+could act optimally simply by choosing the action with the highest
+Q-value in each state.
+
+Q-learning is an algorithm for learning the Q-function by experience.
+Starting from an arbitrary initial estimate, the agent uses the
+*Bellman equation* to update its Q-estimates after each transition.
+The key insight is that the Q-value of taking action ``a`` in state
+``s`` is related to the immediate reward and the best Q-value
+achievable from the next state:
+
+.. math::
+
+   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')
+
+After each turn, the agent computes the *temporal difference* (TD)
+error, the gap between its current Q-estimate and the value on the
+right-hand side of the Bellman equation, and nudges its estimate
+toward that value to reduce the error. Under suitable conditions,
+repeating this over many iterations makes the Q-estimates converge
+toward their true values.
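+
+In symbols, the TD error and the resulting adjustment can be sketched
+as follows. The step size ``η`` (a learning rate between 0 and 1) is a
+symbol introduced here for illustration; it controls how far each
+estimate moves toward its target:
+
+.. math::
+
+   \delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)
+
+   Q(s, a) \leftarrow Q(s, a) + \eta \, \delta
+
+Setting :math:`\eta = 1` replaces the old estimate with the target
+outright; smaller values average over many noisy transitions.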
+
+Deep Q-networks
+~~~~~~~~~~~~~~~
+
+Classical Q-learning stores the Q-function in a table: one entry for
+every possible (state, action) pair. This is feasible only when the
+number of possible states is small. For a game board with even modest
+dimensions—say 32×16 cells, each displaying one of a handful of
+characters—the number of possible board configurations is astronomically
+large. Storing a table of Q-values for every configuration is not
+practical.
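+
+To get a sense of scale, suppose each of the 32 × 16 = 512 cells can
+show one of 6 symbols (an assumed figure, chosen for illustration).
+The number of distinct boards is then:
+
+.. math::
+
+   6^{512} \approx 10^{398}
+
+A table with one row per board would need vastly more entries than
+the roughly :math:`10^{80}` atoms commonly estimated to exist in the
+observable universe.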
+
+*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
+problem by approximating the Q-function with a neural network. Instead
+of a table, the network takes the current state as input and outputs
+Q-value estimates for all possible actions simultaneously. The network
+*generalizes*: having learned that moving right is a good idea when
+the apple is to the right and nothing is in the way, it applies that
+knowledge to board configurations it has never seen before.
+
+The training process in ``retro-gamer`` follows the DQN algorithm. At
+each turn, the agent uses its current network to estimate Q-values and
+selects an action. It stores the experience—(state, action, reward,
+next state)—in a *replay buffer*. Periodically, it samples a random
+batch of experiences from the buffer and uses them to compute TD
+errors, then adjusts the network weights to reduce those errors. This
+process continues for many episodes.
+
+Experience replay
+~~~~~~~~~~~~~~~~~
+
+A key ingredient of DQN is *experience replay*. Rather than training
+on experiences as they arrive—which would mean training on correlated,
+sequential transitions—the agent stores experiences in a buffer and
+samples them randomly for training. This has two benefits. First, each
+experience is potentially used many times for training, making data
+use more efficient. Second, random sampling breaks the correlations
+between consecutive transitions, which would otherwise cause the
+network's weight updates to interfere with each other.
+
+``retro-gamer`` offers a standard replay buffer and an optional
+*prioritized* replay buffer (PER). In PER, experiences with larger TD
+errors—cases where the agent's prediction was most wrong—are sampled
+more often. The intuition is that surprising transitions are more
+informative. Prioritized replay often improves training efficiency but
+introduces a bias that must be corrected with *importance sampling
+weights* (Schaul et al. 2015).
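+
+The sampling rule and its correction, as described by Schaul et al.,
+can be summarized as follows. Here :math:`\delta_i` is the TD error of
+experience ``i``, ``N`` is the number of experiences in the buffer,
+``c`` is a small positive constant that keeps every priority above
+zero, and :math:`\alpha` and :math:`\beta` are tuning exponents.
+Whether and how ``retro-gamer`` exposes these exponents is not covered
+here:
+
+.. math::
+
+   p_i = |\delta_i| + c
+
+   P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
+
+   w_i = \bigl( N \cdot P(i) \bigr)^{-\beta}
+
+The weights are then divided by the largest weight in the batch so
+that they only ever scale updates down. With :math:`\alpha = 0` every
+experience is equally likely, which recovers the standard buffer.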
+
+The ``memory_capacity`` hyperparameter sets how many experiences the
+buffer can hold. When the buffer is full, old experiences are
+discarded. A larger buffer provides more diverse training data but
+uses more memory.
+
+Target networks
+~~~~~~~~~~~~~~~
+
+A subtle challenge in DQN training is that the Q-values computed by the
+Bellman equation depend on the network's own estimates of the next
+state's Q-values. If the network is updated constantly, its Q-value
+estimates keep shifting, making the training target a moving one. This
+can cause instability.
+
+DQN addresses this with a *target network*: a copy of the main network
+that is updated only every ``target_update_freq`` steps. The Bellman
+target is computed using the target network, while the main network is
+updated by gradient descent. Because the target network changes slowly,
+training targets remain stable long enough for the main network to
+make progress.
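+
+Using :math:`\theta` for the main network's weights and
+:math:`\theta^{-}` for the target network's weights (notation
+introduced here, following common usage), the training target for one
+experience can be written as below. The squared-error loss shown is
+one common choice, given as an assumption; the loss ``retro-gamer``
+actually uses may differ:
+
+.. math::
+
+   y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})
+
+   L(\theta) = \bigl( y - Q(s, a; \theta) \bigr)^2
+
+Gradient descent changes only :math:`\theta`; the target ``y`` is
+treated as a fixed number until the target network is next updated.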
+
+Exploration vs. exploitation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A reinforcement learning agent faces a fundamental dilemma: should it
+*exploit* what it already knows (taking the action with the highest
+estimated Q-value) or *explore* (trying actions it is less certain
+about, in case they lead to better outcomes it has not yet discovered)?
+Exploiting too much early in training means the agent never discovers
+better strategies; exploring too much later means the agent wastes time
+on random behavior when it already knows what to do.
+
+``retro-gamer`` uses *ε-greedy exploration*: with probability ε
+(epsilon), the agent chooses a random action; with probability 1 − ε,
+it exploits its current Q-function. ε starts at 1 (pure exploration)
+and decays over training according to ``epsilon_decay``, reaching
+a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
+training log shows how exploration decreases as training progresses.
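+
+The selection rule can be written compactly. The decay schedule shown
+second is one common multiplicative form, given here as an assumption;
+the exact schedule ``retro-gamer`` applies is determined by its
+implementation. ``A`` denotes the set of available actions, ``d``
+stands for the decay factor, and ``k`` counts decay steps:
+
+.. math::
+
+   a =
+   \begin{cases}
+   \text{a uniformly random action from } A
+     & \text{with probability } \varepsilon \\
+   \arg\max_{a'} Q(s, a')
+     & \text{with probability } 1 - \varepsilon
+   \end{cases}
+
+   \varepsilon_k = \max\bigl( \varepsilon_{\min},\; d^{\,k} \bigr)
+
+Under that assumed schedule, with illustrative values
+:math:`d = 0.995` and :math:`\varepsilon_{\min} = 0.05`, exploration
+reaches its floor after about 600 decay steps, since
+:math:`0.995^{598} \approx 0.05`.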
+
+Representing the game board
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A neural network operates on numbers, not characters. Before the
+game board can be fed to the Q-network, it must be converted to a
+numerical representation. ``retro-gamer`` uses *one-hot encoding*.
+
+For a character set of ``n`` distinct characters, each cell on the
+board is represented by a vector of ``n`` numbers, all zero except for
+the one position corresponding to the character in that cell, which is
+set to 1. For example, with character set ``['@', '*', '>']``, the
+character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
+encoded as ``[0, 0, 0]``.
+
+The full board representation is a three-dimensional array of shape
+(H, W, C), where H is the board height, W is the board width, and
+C is the number of characters in the character set. The total number
+of numbers in this array—H × W × C—is the size of the board part of
+the observation. For a 32×16 board with 6 characters, this is
+32 × 16 × 6 = 3,072 numbers.
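+
+As a small worked example, take the character set
+``['@', '*', '>']`` and a hypothetical one-row board of three cells
+containing ``@``, an empty cell, and ``>``. Each cell becomes one row
+of the array below:
+
+.. math::
+
+   \begin{bmatrix}
+   1 & 0 & 0 \\
+   0 & 0 & 0 \\
+   0 & 0 & 1
+   \end{bmatrix}
+
+Here H = 1, W = 3, and C = 3, so the board contributes
+1 × 3 × 3 = 9 numbers to the observation.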
+
+The ``character_set`` field in the game description determines which
+characters the agent can distinguish. A character not in the set
+appears as an all-zero vector—indistinguishable from an empty cell.
+If the character set is not specified, ``retro-gamer`` runs a brief
+exploration phase before training to observe which characters actually
+appear.
+
+In addition to the board, the agent can observe numerical values from
+the game's state dictionary via ``observe_state``. These are
+appended to the end of the observation vector. The reward key must
+not be included in ``observe_state``: it would give the agent direct
+access to its own performance signal, which is not a realistic observation
+in most game contexts and can cause training pathologies.
+
+Neural network architectures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The architecture of the Q-network—the number and arrangement of its
+layers—is one of the most consequential choices in DQN training.
+``retro-gamer`` selects an architecture based on the ``spatial``
+field in the game description and generates a plain-language rationale.
+
+**Multilayer perceptrons (MLP)**
+
+The simplest neural network architecture for fixed-size input is the
+*multilayer perceptron* (MLP). An MLP is a sequence of *fully
+connected layers*: every unit in one layer is connected to every unit
+in the next. Each connection has a learnable *weight*; a unit computes
+a weighted sum of its inputs, passes it through a nonlinear *activation
+function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
+``max(0, x)``), and sends the result to the next layer. The final
+layer has one unit per action, producing Q-value estimates.
+
+An MLP with two hidden layers of width 128, for an observation of size
+3,072 and 5 possible actions, would have approximately 400,000 trainable
+parameters. Training adjusts all of these parameters simultaneously to
+reduce the TD error.
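+
+The figure can be checked by counting one weight per connection plus
+one bias per unit in each layer. This assumes the layers are exactly
+3,072 → 128 → 128 → 5, with biases:
+
+.. math::
+
+   (3072 \times 128 + 128) + (128 \times 128 + 128) + (128 \times 5 + 5)
+   = 393{,}344 + 16{,}512 + 645 = 410{,}501
+
+Nearly all of the parameters sit in the first layer, which is why the
+size of the observation dominates the size of the network.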
+
+An MLP treats its input as a flat list of numbers. It does not know
+that these numbers were arranged in a 2D grid, or that spatially
+adjacent cells are related. This is appropriate when the game's
+observation is better understood as a collection of independent
+readings—a set of meters or status indicators—rather than as a spatial
+scene. Set ``spatial = false`` in the game description to use this
+architecture.
+
+**Convolutional neural networks (CNN)**
+
+When the game board is genuinely spatial—when the relative positions
+of characters matter—a *convolutional neural network* (CNN) is a much
+better fit. A CNN applies a set of learnable *filters* (small weight
+matrices) across the board, computing a dot product of each filter with
+every overlapping patch of the input. The result is a set of *feature
+maps*: each feature map highlights where in the board a particular
+pattern appears.
+
+This is efficient for two reasons. First, the same filter is applied
+at every board position: a filter that detects "apple to the right of
+snake head" works the same way whether the apple is at position (10,5)
+or (20,12). This *weight sharing* (often loosely called *translational
+invariance*) means the network can generalize across positions without
+learning a separate rule for each one. Second, each filter needs only
+a small number of parameters (its height × width × the number of input
+channels), far fewer than a fully connected layer over the same input.
+
+``retro-gamer`` uses two convolutional layers (with 32 and 64 output
+channels respectively, kernel size 3, padding 1) followed by a
+flattening step and an MLP head. The padding ensures that the spatial
+dimensions are preserved through each convolution, so the output of
+the second conv layer has shape (64, H, W), written here with the
+channel dimension first. This output is then flattened and passed to
+the MLP. Set ``spatial = true`` (the default) to use this
+architecture.
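+
+The sizes involved can be worked out by hand. Assuming a 6-character
+set on a 32 × 16 board, one bias per output channel, and a stride of 1
+(all assumptions made for illustration), the parameter counts and the
+flattened output size are:
+
+.. math::
+
+   \text{conv 1: } 3 \times 3 \times 6 \times 32 + 32 = 1{,}760
+
+   \text{conv 2: } 3 \times 3 \times 32 \times 64 + 64 = 18{,}496
+
+   \text{flattened output: } 64 \times 32 \times 16 = 32{,}768
+
+The two convolutional layers together hold about 20,000 parameters
+regardless of board size; the board dimensions affect only the size of
+the flattened output passed to the MLP head.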
+
+Connecting architecture to game metadata
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The architectural choices ``retro-gamer`` makes are not arbitrary: they
+follow from the game description you provide. This connection is worth
+making explicit, because understanding it is one of the main paths into
+understanding why neural network architecture matters.
+
+- If ``spatial = true``, the CNN can detect local patterns—which characters
+  are adjacent to which—without needing to see every possible arrangement.
+  This is appropriate for games like Snake, where the snake's direction
+  and the apple's relative position are spatially encoded.
+
+- If ``spatial = false``, the MLP treats the board as a flat vector. This
+  may be appropriate for games that use the character grid primarily as a
+  display rather than a spatial field—for example, a game where characters
+  appear in fixed, non-interacting positions as status indicators.
+
+- The ``character_set`` determines the depth (C) of the board tensor.
+  More characters mean more numbers per cell and a larger input to the
+  network. A character set that includes characters the game never uses
+  wastes capacity; a character set that omits relevant characters forces
+  the agent to treat different things as the same.
+
+- The ``observe_state`` fields are appended to the flattened CNN output
+  before the MLP head. This allows the agent to use explicit state
+  variables—a timer, a lives count—alongside the visual board
+  representation.
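+
+As a concrete illustration, a game description using the three fields
+above might look like the following sketch. The section name and
+field names come from this document; the value formats and the example
+state keys (``timer``, ``lives``) are assumptions chosen for
+illustration, not a specification of what ``retro-gamer`` accepts.
+
+.. code-block:: toml
+
+   # Hypothetical fragment of a game's pyproject.toml
+   [tool.retro-gamer]
+   # Characters the agent can distinguish on the board
+   character_set = ["@", "*", ">"]
+   # Relative positions matter, so the CNN is used
+   spatial = true
+   # Numeric state values to observe; the reward key is left out
+   observe_state = ["timer", "lives"]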
+
+These relationships are not incidental features of the implementation.
+They are the reason the game description matters: every field you fill
+in shapes what the agent can perceive and therefore what it can learn.