
Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.

The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit—embedded
in how they play—but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
``[tool.retro-gamer]`` section that accurately describes your game to
a learning algorithm is a form of structured reflection: you have to
articulate, in precise terms, what you know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object—how gears mesh, how the turtle moves—provides traction on
abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful—they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates—trying new things versus sticking with
what has worked—is analogous to the choices a student makes when
learning a new game.

``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why*—what the
agent is learning to do—rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter—which ones indicate
danger, opportunity, or neutral terrain. They know what game state
changes signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.

This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration—they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.

Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others—explaining,
questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.

Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions—using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model—interact to support conceptual
understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified
  character set. If you forgot to include one of the characters which
  is used in the game, how would it affect the trained agent's
  performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
  specified character set. You realize that only half of the specified
  characters are actually used in the game. If you change the
  character set to include only the characters that actually appear,
  how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
  partial success has no value—for example, a game where the goal is
  to escape a maze. What would be the effect on agent training of
  adding artificial rewards for completing sub-goals such as reaching
  a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.

All participants receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM—a sign that students are reasoning more
deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.

At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal—a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.

A distinctive feature of reinforcement learning—distinguishing it from
supervised learning, where a model is trained on labeled examples—is
that the agent must discover what good behavior looks like through
experience. There is no teacher providing correct answers. The reward
signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).

The total reward the agent receives from a given state onward—if it
acts according to its current policy—is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number slightly less than 1. The ``gamma``
hyperparameter in ``retro-gamer`` controls this discount. A value
close to 1 means the agent values the distant future almost as much
as the immediate present; a smaller value makes the agent more
myopic.
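
As a small worked illustration of discounting, the sketch below compares
how much a single delayed reward is worth under two values of ``gamma``.
The function name ``discounted_return`` and the example values are
illustrative assumptions, not part of ``retro-gamer``:

.. code-block:: python

   def discounted_return(rewards, gamma):
       """Sum rewards, weighting the reward t turns ahead by gamma**t."""
       return sum(gamma ** t * r for t, r in enumerate(rewards))

   # A single reward of 10 that arrives five turns from now.
   rewards = [0, 0, 0, 0, 0, 10]
   far_sighted = discounted_return(rewards, gamma=0.99)   # about 9.51
   myopic = discounted_return(rewards, gamma=0.5)         # about 0.31

With ``gamma = 0.99`` the delayed reward keeps most of its value; with
``gamma = 0.5`` it is almost invisible to the agent.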

Q-learning
~~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then acts optimally from that point on. If the agent knew the true
Q-function, it could act optimally simply by choosing the action with
the highest Q-value in each state.

Q-learning is an algorithm for learning the Q-function by experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

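After each turn, the agent computes the *temporal difference* (TD)
error, which is the gap between its current estimate Q(s, a) and the
right-hand side of the update above, and adjusts its estimate to reduce
that error. In practice the estimate moves only part of the way toward
the target on each update, so that no single noisy transition
dominates. Over many iterations, the Q-estimates converge toward their
true values.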
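
The update can be sketched for the tabular case as follows. This is a
minimal illustration rather than ``retro-gamer``'s code: the function
``q_update``, the state and action names, and the step size ``alpha``
(which controls how far each estimate moves toward its target) are all
assumptions made for the example:

.. code-block:: python

   from collections import defaultdict

   def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
       """Apply one Q-learning update and return the TD error."""
       target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
       td_error = target - Q[(s, a)]
       Q[(s, a)] += alpha * td_error   # move part of the way to the target
       return td_error

   Q = defaultdict(float)              # every estimate starts at 0.0
   actions = ["left", "right"]
   Q[("s1", "right")] = 1.0            # pretend this move is already valued
   err = q_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions)

Even though the transition itself earned no reward, the estimate for
``("s0", "right")`` rises, because it leads to a state from which a
valuable action is available.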

Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions—say 32×16 cells, each displaying one of a handful of
characters—the number of possible board configurations is astronomically
large. Storing a table of Q-values for every configuration is not
practical.
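
To get a feel for the scale, the short calculation below counts the
possible boards for a 32×16 grid, assuming (purely for illustration)
that each cell can show one of 6 symbols:

.. code-block:: python

   height, width, n_symbols = 16, 32, 6   # 6 symbols is an illustrative choice
   n_boards = n_symbols ** (height * width)
   n_digits = len(str(n_boards))          # number of decimal digits

The resulting count has hundreds of digits, and a Q-table would need
one row per board.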

*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience—(state, action, reward,
next state)—in a *replay buffer*. Periodically, it samples a random
batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive—which would mean training on correlated,
sequential transitions—the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions. Without it, a run of near-identical
updates would pull the network toward whatever situation the agent is
currently in, at the expense of what it learned earlier.
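
A uniform replay buffer can be sketched in a few lines. The class below
is a hypothetical illustration, not the buffer ``retro-gamer`` ships
with; the name ``ReplayBuffer`` and its ``capacity`` argument are
assumptions:

.. code-block:: python

   import random
   from collections import deque

   class ReplayBuffer:
       def __init__(self, capacity):
           # A deque with maxlen drops the oldest item once it is full.
           self.buffer = deque(maxlen=capacity)

       def push(self, state, action, reward, next_state):
           self.buffer.append((state, action, reward, next_state))

       def sample(self, batch_size):
           # Uniform sampling without replacement.
           return random.sample(list(self.buffer), batch_size)

       def __len__(self):
           return len(self.buffer)

   buffer = ReplayBuffer(capacity=3)
   for step in range(5):
       buffer.push(step, "right", 0.0, step + 1)
   batch = buffer.sample(2)

After five pushes into a buffer of capacity 3, only the three most
recent transitions remain available for sampling.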

``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors—cases where the agent's prediction was most wrong—are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).
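
The sampling rule can be sketched as follows, using the proportional
variant of prioritization. This is an illustration under stated
assumptions, not ``retro-gamer``'s implementation: the function name
and the values of ``alpha``, ``beta``, and ``eps`` are chosen only for
the example:

.. code-block:: python

   import random

   def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-3):
       """Return sampled indices and their importance-sampling weights."""
       # Larger TD error -> larger priority; eps keeps every priority positive.
       priorities = [(abs(e) + eps) ** alpha for e in td_errors]
       total = sum(priorities)
       probs = [p / total for p in priorities]
       n = len(td_errors)
       indices = random.choices(range(n), weights=probs, k=batch_size)
       # Frequently sampled transitions get smaller weights to offset the bias.
       weights = [(n * probs[i]) ** -beta for i in indices]
       max_w = max(weights)
       return indices, [w / max_w for w in weights]

   random.seed(0)
   td_errors = [0.1, 0.1, 5.0, 0.1]     # transition 2 was the most surprising
   indices, weights = sample_prioritized(td_errors, batch_size=1000)
   share = indices.count(2) / len(indices)

The surprising transition makes up most of the samples, and the
importance-sampling weights scale its updates down to compensate.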

The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.

Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the targets computed from
the Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, those estimates
keep shifting, so the network is chasing a moving target. This can
cause instability.

DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.
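
The synchronization schedule can be sketched without any neural network
at all. In the illustration below, a dictionary stands in for the
network weights and a simple increment stands in for a gradient update;
both are assumptions made to keep the example self-contained:

.. code-block:: python

   import copy

   target_update_freq = 4                  # illustrative value
   main_weights = {"w": 0.0}               # stand-in for the main network
   target_weights = copy.deepcopy(main_weights)
   sync_steps = []

   for step in range(1, 11):
       main_weights["w"] += 0.1            # stand-in for one gradient update
       if step % target_update_freq == 0:
           # Copy the main network into the target network.
           target_weights = copy.deepcopy(main_weights)
           sync_steps.append(step)

After ten steps the main weights have changed ten times, but the target
weights have changed only twice, at steps 4 and 8.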

Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay``, reaching
a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
training log shows how exploration decreases as training progresses.
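
A minimal sketch of ε-greedy selection and decay is shown below. It
assumes that ``epsilon_decay`` is a multiplicative factor applied once
per step of the schedule; the source does not specify the exact decay
rule, so treat that as an assumption, along with the function names:

.. code-block:: python

   import random

   def choose_action(q_values, epsilon):
       """Pick a random action with probability epsilon, else the greedy one."""
       if random.random() < epsilon:
           return random.randrange(len(q_values))
       return max(range(len(q_values)), key=lambda a: q_values[a])

   def decay(epsilon, epsilon_decay, epsilon_min):
       """Shrink epsilon by a constant factor, never below the floor."""
       return max(epsilon_min, epsilon * epsilon_decay)

   epsilon = 1.0
   for _ in range(1000):
       epsilon = decay(epsilon, epsilon_decay=0.99, epsilon_min=0.05)
   greedy = choose_action([0.1, 0.7, 0.2], epsilon=0.0)

After enough decay steps ε sits at its floor, and with ε set to 0 the
agent always picks the action with the highest Q-value.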

Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.

For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The array holds
H × W × C values in total, and that product is the size of the board
part of the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 values.

The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector—indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.
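
The encoding described above can be sketched as follows. The function
``one_hot_board`` is a hypothetical helper written for this
illustration; it represents the board as a list of strings and treats a
space as an empty cell, which are assumptions rather than details of
``retro-gamer``:

.. code-block:: python

   def one_hot_board(rows, character_set):
       """Encode a list of equal-length strings as an H x W x C nested list."""
       index = {ch: i for i, ch in enumerate(character_set)}
       encoded = []
       for row in rows:
           encoded_row = []
           for ch in row:
               cell = [0] * len(character_set)
               if ch in index:              # unknown characters stay all-zero
                   cell[index[ch]] = 1
               encoded_row.append(cell)
           encoded.append(encoded_row)
       return encoded

   board = ["@ >",
            " #*"]
   encoded = one_hot_board(board, ['@', '*', '>'])

Here ``'#'`` is not in the character set, so its cell is encoded
exactly like the empty cells around it.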

In addition to the board, the agent can observe extra computed values
from ``game.state``. Listing keys in the ``observe_state`` option of
``[preprocessing]`` causes those values to be appended to the
observation vector after the board encoding. This is where feature
engineering decisions live: what derived quantities should the agent
see, and does giving it those values give it an advantage a human
player would not have?
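
One way to picture the resulting observation is as a flattened board
followed by the selected state values. The sketch below is illustrative
only: ``build_observation`` and the state keys ``apple_dx`` and
``apple_dy`` are hypothetical names, not part of any real game:

.. code-block:: python

   def build_observation(encoded_board, state, observe_state):
       """Flatten the board encoding, then append the selected state values."""
       flat = [x for row in encoded_board for cell in row for x in cell]
       extras = [float(state[key]) for key in observe_state]
       return flat + extras

   encoded_board = [[[1, 0], [0, 0]],
                    [[0, 0], [0, 1]]]        # a 2 x 2 board, 2 characters
   state = {"score": 3, "apple_dx": -1, "apple_dy": 2}   # hypothetical keys
   obs = build_observation(encoded_board, state, ["apple_dx", "apple_dy"])

Only the listed keys are appended; ``score`` is present in the state
but is not part of what the agent observes in this example.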

Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network—the number and arrangement of its
layers—is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
option in ``[preprocessing]`` of ``config.toml`` and generates a
plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have approximately 400,000 trainable
parameters. Training adjusts all of these parameters simultaneously to
reduce the TD error.
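
The parameter count can be checked with a short calculation. The sketch
assumes every fully connected layer has one bias per output unit, which
is a common default but is not stated in the text above:

.. code-block:: python

   def mlp_parameter_count(layer_sizes):
       """Count weights and biases for consecutive fully connected layers."""
       return sum(n_in * n_out + n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

   # Input of 3,072 values, two hidden layers of width 128, 5 actions.
   total = mlp_parameter_count([3072, 128, 128, 5])   # 410,501

Almost all of these parameters (393,344 of them) sit in the first
layer, which connects every input value to every hidden unit.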

An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings—a set of meters or status indicators—rather than as a spatial
scene. ``spatial = false`` (the default) selects this architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial—when the relative positions
of characters matter—a *convolutional neural network* (CNN) is a much
better fit. A CNN applies a set of learnable *filters* (small weight
matrices) across the board, computing a dot product of each filter with
every overlapping patch of the input. The result is a set of *feature
maps*: each feature map highlights where in the board a particular
pattern appears.

This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" responds the same way whether the pattern appears at
position (10, 5) or (20, 12). This weight sharing, often loosely called
*translational invariance*, means the network can generalize across
positions without learning a separate rule for each one. Second, each
filter needs only a small number of parameters (the filter size), far
fewer than a fully connected layer covering the same input.
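
The sliding-filter idea can be sketched in plain Python. The example
below uses a single channel, a hand-written 1×2 filter, and no padding;
these are simplifications chosen for illustration, and the filter
values are not learned:

.. code-block:: python

   def cross_correlate(grid, kernel):
       """Slide the kernel over every full patch of the grid (no padding)."""
       kh, kw = len(kernel), len(kernel[0])
       out = []
       for i in range(len(grid) - kh + 1):
           row = []
           for j in range(len(grid[0]) - kw + 1):
               row.append(sum(grid[i + di][j + dj] * kernel[di][dj]
                              for di in range(kh) for dj in range(kw)))
           out.append(row)
       return out

   # One channel of a tiny board: 1 marks a cell of interest.
   grid = [[1, 1, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 1, 1]]
   kernel = [[1, 1]]            # responds to two marked cells side by side
   feature_map = cross_correlate(grid, kernel)

The feature map reaches its peak value of 2 at both places where the
pattern occurs, top left and bottom right, using the same two filter
weights.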

``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
use this architecture.
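
The shapes and sizes involved can be worked out with the standard
convolution output formula. The sketch below assumes a stride of 1 and
one bias per output channel, neither of which is stated above, and uses
a 6-character board purely as an example:

.. code-block:: python

   def conv_output_size(size, kernel=3, padding=1, stride=1):
       return (size + 2 * padding - kernel) // stride + 1

   def conv_parameter_count(in_channels, out_channels, kernel=3):
       # One kernel x kernel filter per input channel, plus one bias,
       # for each output channel.
       return out_channels * (in_channels * kernel * kernel + 1)

   H, W, C = 16, 32, 6      # illustrative board with 6 characters
   out_h, out_w = conv_output_size(H), conv_output_size(W)
   conv_params = conv_parameter_count(C, 32) + conv_parameter_count(32, 64)
   flattened = 64 * out_h * out_w       # input size of the MLP head

Under these assumptions the two convolutional layers together hold
20,256 parameters, and the board's height and width pass through
unchanged, so the MLP head receives 64 × 16 × 32 = 32,768 values.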

Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.

- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
  local patterns—which characters are adjacent to which—without needing
  to see every possible arrangement. This is appropriate for games like
  Snake, where the snake's direction and the apple's relative position
  are spatially encoded.

- If ``spatial = false`` (the default), the MLP treats the board as a
  flat vector. This may be appropriate for games that use the character
  grid primarily as a display rather than a spatial field—for example,
  a game where characters appear in fixed, non-interacting positions as
  status indicators.

- The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
  network. A character set that includes characters the game never uses
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
  to the flattened board output before the MLP head. This allows the
  agent to use computed values—a direction to the goal, a distance, a
  timer—alongside the visual board representation.

These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.

Design rationale
----------------

This section explains the reasoning behind several design decisions in
``retro-gamer`` that go beyond technical necessity. Each choice was
made with a specific pedagogical goal: to create a tool that not only
trains agents, but also helps students build genuine understanding of
how and why the training process works.

Checkpoint compatibility and the "start fresh" workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a student changes the game description or network architecture
mid-training, ``retro-gamer`` refuses to resume and explains exactly
which fields changed and why they are incompatible. This behavior is
deliberate.

The immediate practical reason is correctness: if the character set
changes, the network's input layer changes size, and the saved weights
no longer correspond to any meaningful function. Loading them would
produce garbage behavior. If the reward signal changes, the Q-values
the network has accumulated are estimates of a *different* objective;
resuming would mislead the network, not help it.

But the deeper reason is pedagogical. The incompatibility check is a
moment of forced reflection. When a student sees::

   character_set
     was : ['@', '*', '>', '<', '^', 'v']
     now : ['@', '*', '>', '<', '^', 'v', '#']
     why : the set of board characters (changes input layer size)

they are confronted with the concrete consequence of a description
change. The character set is not a label; it determines the shape of
the tensor the network operates on. Changing it invalidates the
network the same way changing the rules of chess would invalidate a
chess engine. The error message is designed to make this connection
legible, not just to block a problematic action.
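
The kind of comparison behind such a message can be sketched as a
field-by-field diff between the description saved with a checkpoint and
the current one. This is a hypothetical illustration, not
``retro-gamer``'s actual check; the function name and the ``"score"``
value are assumptions:

.. code-block:: python

   def changed_fields(saved, current):
       """Return {field: (was, now)} for every field whose value differs."""
       keys = sorted(set(saved) | set(current))
       return {k: (saved.get(k), current.get(k))
               for k in keys if saved.get(k) != current.get(k)}

   saved = {"character_set": ['@', '*', '>'], "reward": "score"}
   current = {"character_set": ['@', '*', '>', '#'], "reward": "score"}
   diff = changed_fields(saved, current)
   report = [f"{field}\n  was : {was}\n  now : {now}"
             for field, (was, now) in diff.items()]

Only ``character_set`` appears in the diff, because the ``reward``
field is identical in both descriptions.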

The ``retro-gamer clean`` command exists to make the recovery path
explicit: you can start fresh, and you should. There is no partial
salvage. This mirrors an important truth about RL training: some
decisions are foundational, and changing them means starting over.
Students who encounter this—who have to decide whether a change is
worth the cost of retraining—are reasoning about the architecture in
a way that purely reading about it does not produce.

The distinction between incompatible changes (game description,
network architecture) and safe changes (hyperparameters like learning
rate and epsilon) is also pedagogically useful. It encodes, in the
tool itself, the distinction between *what the agent is learning* and
*how it is learning*. Students who ask "can I change the learning rate
without retraining?" are asking a question with a precise answer, and
answering it correctly requires understanding why the learning rate is
different in kind from the character set.

Checkpoint-level logging
~~~~~~~~~~~~~~~~~~~~~~~~~

Early versions of ``retro-gamer`` logged one line per episode. This
was accurate but not very useful: a run of 1,000 episodes produces
1,000 log lines, most of which are noise. Individual episodes vary
widely due to randomness in both the game and the agent's exploration,
making it hard to see the underlying trend.

The current format logs one line per checkpoint—once every 100
episodes—using averages over that window. This design serves several
goals:

**Noise reduction.** Single-episode rewards are highly variable,
especially when epsilon is high and the agent is behaving randomly.
Averaging over 100 episodes smooths out this variance and makes
genuine trends visible.

**Interpretive scaffolding.** The log line includes ``epsilon``
alongside ``avg_reward``, so students can directly see the
relationship between exploration rate and performance. Early entries
with low ``avg_reward`` and high ``epsilon`` invite the question:
"is this bad performance, or just exploration?" The answer—that random
behavior is expected when epsilon is near 1—is readable from the log
itself.

**Timing information.** Each log line records both the elapsed time
for that 100-episode interval and the total training time accumulated
across all sessions. This serves two purposes. Practically, it lets
students estimate how long continued training will take. Conceptually,
it makes the cost of training tangible: RL is not instant, and the
log makes the time investment visible.

**Session continuity.** When training resumes from a checkpoint, a
header line marks the break (``=== Resumed from ep_0500.pt ===``).
This lets the full log tell the story of a run across multiple
sessions, preserving the history of when training happened even if the
student stops and restarts many times.
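
The noise-reduction effect of windowed averaging can be seen in a small
sketch. The function ``checkpoint_averages`` and the synthetic reward
sequence are assumptions made for illustration; real episode rewards
are noisier and less regular:

.. code-block:: python

   def checkpoint_averages(episode_rewards, window=100):
       """Average rewards over consecutive, complete windows of episodes."""
       return [sum(episode_rewards[i:i + window]) / window
               for i in range(0, len(episode_rewards) - window + 1, window)]

   # 300 episode rewards that jump up and down around a rising trend.
   rewards = [(ep // 100) + (1 if ep % 2 else -1) for ep in range(300)]
   averages = checkpoint_averages(rewards)

Individual rewards swing by two points from one episode to the next,
but the three checkpoint averages (0.0, 1.0, 2.0) show the underlying
trend directly.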

The stop-watch-adjust-resume workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is designed around a workflow that the log format and
checkpoint system both support: stop training, watch the agent play,
decide what to change, and resume.

This workflow is pedagogically productive because it gives students
a *reason* to look at the log and a *reason* to think about
hyperparameters. Watching the agent at episode 100 play erratically,
then watching the agent at episode 500 navigate toward the apple more
consistently, is not just satisfying—it raises concrete questions.
Why did the agent improve? What changed between those two checkpoints?
What would happen if we gave it more time, or adjusted the reward?

These questions are best answered by consulting the log. The log in
turn connects the behavior the student observed to numbers they can
reason about: a decreasing loss, a declining epsilon, a rising average
reward. The three—visual observation, log interpretation, and
conceptual understanding—form a feedback loop that is much harder to
close if training is treated as a black box that produces only a final
model.

The fact that training can be stopped and resumed freely, with no
penalty and no extra flags, removes friction from this cycle. Students
who feel they can experiment—stop, look, think, resume—are more
likely to do so than students who feel they have to commit to a full
training run before seeing results.

Reward design as game description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
the game's state dictionary, not a function or a formula. This is
another deliberate design choice. The reward signal is defined in the
game code—in how the score changes when certain events occur—not in
the training configuration.

This forces students to engage with the reward where it lives: in the
game logic. If a student wants to change the reward structure, they
must change the game. This connects the RL concept of reward shaping
to the concrete act of writing Python code that updates a score. The
question "what reward should the agent get for moving toward the
apple?" becomes "what code should run when the snake moves?"—and
answering it requires reasoning about what behavior you want to
encourage and how a small, frequent signal compares to a large,
infrequent one.

The distinction between reward-signal design (a pedagogically rich
question with many possible answers) and reward-field specification
(a technical detail) is preserved in the interface. Students configure
the *key* to track; they design the *signal* in the game itself.
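
The relationship between the configured key and the reward signal can
be sketched as follows, taking the reward for a turn to be the change
in the tracked value. The function name and the example state
dictionaries are hypothetical:

.. code-block:: python

   def reward_from_state(previous_state, new_state, reward_key):
       """Reward for one turn: how much the tracked value changed."""
       return new_state[reward_key] - previous_state[reward_key]

   before = {"score": 10, "lives": 3}     # hypothetical state dictionaries
   after = {"score": 15, "lives": 3}
   reward = reward_from_state(before, after, reward_key="score")

The trainer only reads the value; how much ``score`` changes, and when,
is decided entirely by the game code.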

Metadata as game description, not training configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The game description lives in ``[tool.retro-gamer]`` inside the
game's own ``pyproject.toml``, not in a separate training
configuration file. This placement encodes a claim: the character set,
the action space, and the reward signal are *properties of the game*,
not settings for the trainer.

A student who edits the character set is not tweaking the trainer;
they are more accurately describing their game. This framing matters
because it positions the student as the expert on the game—which they
are—and the trainer as a tool that depends on the accuracy of that
description. Errors in the description are not configuration mistakes;
they are inaccurate descriptions of something the student knows.

When a student omits a character from the character set and the agent
fails to notice that character on the board, the diagnostic question
is not "what went wrong with training?" but "is my description of the
game correct?" This is a more productive question, because it connects
the student's domain knowledge (they know what characters appear and
why they matter) to the technical representation (one-hot encoding
requires knowing in advance which characters to encode). The fix is
not to adjust a hyperparameter; it is to describe the game more
accurately.