Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.

The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit—embedded
in how they play—but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
game description (the ``[tool.retro-gamer]`` section of
``pyproject.toml``) that accurately describes your game to a learning
algorithm is a form of structured reflection: you have to articulate,
in precise terms, what you know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object—how gears mesh, how the turtle moves—provides traction on
abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful—they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates—trying new things versus sticking with
what has worked—is analogous to the choices a student makes when
learning a new game.

``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why*—what the
agent is learning to do—rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter—which ones indicate
danger, opportunity, or neutral terrain. They know what game state
changes signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.

This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration—they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.

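One way to picture the question "is my description of the game
correct?" is as a comparison between the description and a board. The
sketch below is illustrative only: the dictionary stands in for a parsed
``[tool.retro-gamer]`` table, the helper ``undescribed_characters`` is
hypothetical rather than part of ``retro-gamer``, and it assumes that an
empty cell is drawn as a space. Only the field names ``character_set``
and ``reward`` come from this document.

```python
# Hypothetical stand-in for a parsed [tool.retro-gamer] table.
# The values are made up; only the field names appear in this document.
description = {
    "character_set": ["@", "*", ">", "<", "^", "v"],
    "reward": "score",
}


def undescribed_characters(board_rows, character_set, empty=" "):
    """Return board characters that the description does not mention."""
    seen = {ch for row in board_rows for ch in row if ch != empty}
    return sorted(seen - set(character_set))


# A made-up board that draws walls with '#', which the description omits.
board = [
    "#####",
    "# @*#",
    "#####",
]
missing = undescribed_characters(board, description["character_set"])
print(missing)  # ['#']
```

A non-empty result means the description and the game disagree, which is
a statement about the game rather than about the trainer.
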
Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others—explaining,
questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.

Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions—using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model—interact to support conceptual
understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified
  character set. If you forgot to include one of the characters which
  is used in the game, how would it affect the trained agent's
  performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
  specified character set. You realize that only half of the specified
  characters are actually used in the game. If you change the
  character set to include only the characters that actually appear,
  how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
  partial success has no value—for example, a game where the goal is
  to escape a maze. What would be the effect on agent training of
  adding artificial rewards for completing sub-goals such as reaching
  a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.

Participants all receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM—a sign that students are reasoning more
deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.

At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal—a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.

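The loop described above can be sketched in a few lines. This is a toy
illustration, not ``retro-gamer`` code: the ``CoinGame`` environment, its
``step`` method, and the ``random_policy`` function are all invented for
the example. The one detail taken from this document is that the reward
is the change in score from one turn to the next.

```python
import random


class CoinGame:
    """Toy environment: the state is a position 0..4; reaching 4 scores a point."""

    def __init__(self):
        self.position = 0
        self.score = 0

    def step(self, action):
        """Apply 'left' or 'right'; return (state, reward, done)."""
        old_score = self.score
        move = 1 if action == "right" else -1
        self.position = max(0, min(4, self.position + move))
        if self.position == 4:
            self.score += 1
        reward = self.score - old_score  # reward is the change in score
        return self.position, reward, self.position == 4


def random_policy(state, rng):
    """A placeholder policy that ignores the state entirely."""
    return rng.choice(["left", "right"])


def run_episode(policy, rng, max_turns=50):
    """Observe, act, collect reward, repeat until done or out of turns."""
    env = CoinGame()
    state, total = env.position, 0
    for _ in range(max_turns):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


total_reward = run_episode(random_policy, random.Random(0))
print(total_reward)
```

Learning a policy means replacing ``random_policy`` with a rule that
uses the observed state to pick actions that earn more reward.
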
A distinctive feature of reinforcement learning—distinguishing it from
supervised learning, where a model is trained on labeled examples—is
that the agent must discover what good behavior looks like through
experience. There is no teacher providing correct answers. The reward
signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).

The total reward the agent receives from a given state onward—if it
acts according to its current policy—is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number slightly less than 1. The ``gamma``
hyperparameter in ``retro-gamer`` controls this discount. A value
close to 1 means the agent values the distant future almost as much
as the immediate present; a smaller value makes the agent more
myopic.

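A short calculation makes the effect of the discount concrete. The
function name ``discounted_return`` and the reward sequence are made up
for illustration; the formula is the one stated above, with the reward
``t`` turns from now scaled by ``γ^t``.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward t turns from now is scaled by gamma**t."""
    return sum(reward * gamma**t for t, reward in enumerate(rewards))


rewards = [0, 0, 0, 10]  # a single reward arrives three turns from now
far_sighted = discounted_return(rewards, gamma=0.99)
myopic = discounted_return(rewards, gamma=0.5)
print(round(far_sighted, 3))  # 9.703
print(myopic)                 # 1.25
```

With ``gamma=0.99`` the delayed reward keeps almost all of its value;
with ``gamma=0.5`` it is worth an eighth of its face value.
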
Q-learning
~~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then follows its current policy from that point on. If the agent knew
the Q-function of the best possible policy, it could act optimally
simply by choosing the action with the highest Q-value in each state.

Q-learning is an algorithm for learning the Q-function from experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

After each turn, the agent computes the *temporal difference* (TD)
error—the gap between its current Q-estimate and what the Bellman
equation says it should be—and adjusts its estimates to reduce the
error. Over many iterations, the Q-estimates converge toward their
true values.

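The update can be sketched for the tabular case as follows. Two things
here are assumptions rather than details from this document: the
estimate moves only part of the way toward the Bellman target,
controlled by a learning rate ``lr``, and the transition used in the
example is invented.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, gamma=0.9, lr=0.5):
    """Move Q(state, action) part of the way toward the Bellman target.

    Returns the TD error: the target minus the previous estimate.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    td_error = target - q[(state, action)]
    q[(state, action)] += lr * td_error
    return td_error


actions = ["left", "right"]
q = defaultdict(float)  # every estimate starts at 0.0

# One made-up transition, seen three times: moving right from state 3
# reaches state 4 and earns a reward of 1.
errors = [q_update(q, 3, "right", 1.0, 4, actions) for _ in range(3)]
print(errors)           # [1.0, 0.5, 0.25]
print(q[(3, "right")])  # 0.875
```

Each repetition halves the TD error, so the estimate approaches the
target of 1.0 without jumping to it in one step.
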
Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions—say 32×16 cells, each displaying one of a handful of
characters—the number of possible board configurations is astronomically
large. Storing a table of Q-values for every configuration is not
practical.

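To get a sense of the scale, the following back-of-the-envelope count
assumes that each cell can show one of six symbols (five characters plus
"empty"). That number is an assumption chosen for illustration; the
board size is the 32×16 example above.

```python
cells = 32 * 16        # board positions
symbols_per_cell = 6   # assumed: five characters plus "empty"

# Every cell varies independently, so the counts multiply.
configurations = symbols_per_cell ** cells
digits = len(str(configurations))
print(cells)   # 512
print(digits)  # 399
```

A table would need one row for each of these configurations, and the
count has 399 decimal digits.
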
*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience—(state, action, reward,
next state)—in a *replay buffer*. Periodically, it samples a random
batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive—which would mean training on correlated,
sequential transitions—the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions, which would otherwise cause the
network's weight updates to interfere with each other.

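A minimal replay buffer might look like the sketch below. The class and
method names are hypothetical and are not taken from ``retro-gamer``;
the example relies only on the standard library, using a ``deque`` with
``maxlen`` to drop the oldest experiences once the buffer is full.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Pick a random batch without replacement, ignoring arrival order."""
        return rng.sample(list(self.items), batch_size)


buffer = ReplayBuffer(capacity=5)
for turn in range(8):  # eight consecutive transitions
    buffer.add(turn, "right", 0.0, turn + 1)

print(len(buffer.items))   # 5
print(buffer.items[0][0])  # 3
batch = buffer.sample(3, random.Random(0))
print(batch)
```

The sampled batch mixes transitions from different turns, which is what
breaks the correlation between consecutive experiences.
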
``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors—cases where the agent's prediction was most wrong—are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).

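The following sketch shows the proportional form of prioritized
sampling. It is an illustration under stated assumptions, not the
``retro-gamer`` implementation: the function name is hypothetical, the
exponents ``alpha`` and ``beta`` are example values, and a real buffer
would use a more efficient data structure than a plain list.

```python
import random


def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Sample indices in proportion to |TD error|**alpha.

    Returns (indices, weights), where the weights are importance-sampling
    corrections scaled so that the largest weight in the batch is 1.0.
    """
    priorities = [(abs(e) + 1e-6) ** alpha for e in td_errors]
    total = sum(priorities)
    probabilities = [p / total for p in priorities]

    indices = rng.choices(range(len(td_errors)), weights=probabilities, k=batch_size)
    n = len(td_errors)
    # Frequently sampled experiences get smaller weights to offset the bias.
    raw = [(n * probabilities[i]) ** -beta for i in indices]
    largest = max(raw)
    return indices, [w / largest for w in raw]


td_errors = [0.1, 0.1, 2.0, 0.1]  # the third experience was most surprising
indices, weights = sample_prioritized(td_errors, 1000, rng=random.Random(0))
share = indices.count(2) / len(indices)
print(round(share, 2))
```

The surprising experience is drawn far more often than the other three,
and its smaller importance weight scales down each of its updates to
compensate.
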
The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.

Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the Q-values computed by the
Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, its Q-value
estimates keep shifting, making the training target a moving one. This
can cause instability.

DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.

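The periodic copy can be sketched with plain dictionaries standing in
for network weights. The helper ``maybe_sync`` is hypothetical, and the
"gradient step" is a placeholder that simply adds 1 to a single weight;
only the ``target_update_freq`` name comes from this document.

```python
import copy


def maybe_sync(step, main_weights, target_weights, target_update_freq):
    """Return the weights the target network should hold after this step."""
    if step % target_update_freq == 0:
        return copy.deepcopy(main_weights)  # hard copy of the main network
    return target_weights


main = {"w": [0.0]}
target = copy.deepcopy(main)
history = []
for step in range(1, 7):
    main["w"][0] += 1.0  # stand-in for one gradient-descent update
    target = maybe_sync(step, main, target, target_update_freq=3)
    history.append((main["w"][0], target["w"][0]))

print(history)
# [(1.0, 0.0), (2.0, 0.0), (3.0, 3.0), (4.0, 3.0), (5.0, 3.0), (6.0, 6.0)]
```

The main weights change on every step, while the target weights stay
fixed between copies.
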
Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay``, reaching
a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
training log shows how exploration decreases as training progresses.

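A compact sketch of ε-greedy selection follows. The selection rule is
the one described above; the decay schedule is an assumption, since this
document does not say whether ``epsilon_decay`` is applied
multiplicatively, per episode, or in some other way. The decay values
are exaggerated so the floor is reached quickly.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, epsilon_decay, epsilon_min):
    """Assumed schedule: multiply by epsilon_decay, never going below the floor."""
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0
schedule = []
for episode in range(5):
    schedule.append(round(epsilon, 3))
    epsilon = decay(epsilon, epsilon_decay=0.5, epsilon_min=0.2)

print(schedule)                                     # [1.0, 0.5, 0.25, 0.2, 0.2]
print(choose_action([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

With ``epsilon=0.0`` the agent always exploits, so it picks the action
with the highest estimated Q-value.
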
Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.

For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The total number
of numbers in this array—H × W × C—is the size of the board part of
the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 numbers.

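The encoding can be written out directly with nested lists. The function
names below are hypothetical, and the two-row board is made up; the
character set and the encodings of ``'>'`` and the empty cell match the
example above.

```python
def one_hot(char, character_set):
    """Encode one cell; characters outside the set become all zeros."""
    return [1 if char == known else 0 for known in character_set]


def encode_board(rows, character_set):
    """Encode equal-length strings as nested lists of shape (H, W, C)."""
    return [[one_hot(ch, character_set) for ch in row] for row in rows]


character_set = ["@", "*", ">"]
print(one_hot(">", character_set))  # [0, 0, 1]
print(one_hot(" ", character_set))  # [0, 0, 0]
print(one_hot("#", character_set))  # [0, 0, 0]  (not in the set)

board = encode_board(["@ >", " * "], character_set)
shape = (len(board), len(board[0]), len(board[0][0]))
print(shape)                        # (2, 3, 3)
print(shape[0] * shape[1] * shape[2])  # 18
```

Note that ``'#'`` and the empty cell produce the same vector, because
``'#'`` is not in the character set.
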
The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector—indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.

In addition to the board, the agent can observe extra computed values
from ``game.state``. Listing keys in the ``observe_state`` option of
``[preprocessing]`` causes those values to be appended to the
observation vector after the board encoding. This is where feature
engineering decisions live: what derived quantities should the agent
see, and does giving it those values give it an advantage a human
player would not have?

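As a rough sketch of how the pieces fit together, the function below
flattens an encoded board and appends selected state values. The
function name and the state keys (``apple_dx``, ``apple_dy``) are
invented for illustration; only the idea of appending the
``observe_state`` values after the board encoding comes from this
document.

```python
def build_observation(encoded_board, state, observe_state):
    """Flatten an (H, W, C) board and append the selected state values."""
    flat = [value for row in encoded_board for cell in row for value in cell]
    extras = [float(state[key]) for key in observe_state]
    return flat + extras


# A 1x2 board with a two-character set, already one-hot encoded.
encoded_board = [[[1, 0], [0, 1]]]
# Hypothetical state dictionary; the key names are made up.
state = {"score": 3, "apple_dx": 1, "apple_dy": 0}

observation = build_observation(encoded_board, state, ["apple_dx", "apple_dy"])
print(observation)  # [1, 0, 0, 1, 1.0, 0.0]
```

Keys that are not listed, such as ``score`` here, stay invisible to the
agent.
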
Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network—the number and arrangement of its
layers—is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
option in ``[preprocessing]`` of ``config.toml`` and generates a
plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have roughly 410,000 trainable
parameters. Training adjusts all of these parameters simultaneously to
reduce the TD error.

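The parameter count can be checked with a few lines of arithmetic. The
sketch assumes that every layer has one bias per unit in addition to its
weights; the layer sizes are the ones given above.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(
        inputs * outputs + outputs  # weight matrix plus one bias per unit
        for inputs, outputs in zip(layer_sizes, layer_sizes[1:])
    )


sizes = [3072, 128, 128, 5]  # observation, two hidden layers, five actions
total = mlp_parameter_count(sizes)
print(total)  # 410501
```

Almost all of the parameters sit in the first layer, which connects each
of the 3,072 inputs to each of the 128 hidden units.
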
An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings—a set of meters or status indicators—rather than as a spatial
scene. ``spatial = false`` (the default) selects this architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial—when the relative positions
of characters matter—a *convolutional neural network* (CNN) is a much
better fit. A CNN applies a set of learnable *filters* (small weight
matrices) across the board, computing a dot product of each filter with
every overlapping patch of the input. The result is a set of *feature
maps*: each feature map highlights where in the board a particular
pattern appears.

This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" works the same way whether the apple is at position (10, 5)
or (20, 12). This property, often called *translational invariance*
(more precisely, translation equivariance), means the network can
generalize across positions without learning a separate rule for each
one. Second, each filter needs only a small number of parameters (the
filter size)—far fewer than the equivalent fully connected connections.

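The "apple to the right of snake head" filter can be imitated by hand.
In this sketch the filter weights are fixed at 1 rather than learned,
the filter covers a 1×2 patch instead of the 3×3 kernels a real network
would use, and ``'>'`` is assumed to be the snake head. All function
names are hypothetical.

```python
def filter_response(head, apple, row, col):
    """Response of a tiny 1x2 filter: head on the left, apple on the right."""
    return head[row][col] + apple[row][col + 1]


def feature_map(head, apple):
    """Slide the same filter over every horizontal pair of cells."""
    height, width = len(head), len(head[0])
    return [
        [filter_response(head, apple, r, c) for c in range(width - 1)]
        for r in range(height)
    ]


def channels(rows):
    """Split a text board into two one-hot channels: '>' heads and '*' apples."""
    head = [[int(ch == ">") for ch in row] for row in rows]
    apple = [[int(ch == "*") for ch in row] for row in rows]
    return head, apple


top_left = feature_map(*channels([">*  ", "    "]))
bottom_right = feature_map(*channels(["    ", "  >*"]))
print(top_left)      # [[2, 0, 0], [0, 0, 0]]
print(bottom_right)  # [[0, 0, 0], [0, 0, 2]]
```

The same two weights produce the peak response of 2 wherever the pattern
occurs; only the location of the peak in the feature map changes.
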
``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
use this architecture.

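The sizes involved can be worked out with standard convolution
arithmetic. The sketch assumes a stride of 1, one bias per output
channel, and a board that is 32 cells wide and 16 cells tall with 6
characters; none of these details is confirmed by this document beyond
the channel counts, kernel size, and padding given above.

```python
def conv_parameters(in_channels, out_channels, kernel_size):
    """Weights plus biases for one 2D convolution with square kernels."""
    return in_channels * out_channels * kernel_size**2 + out_channels


def conv_output_size(size, kernel_size, padding, stride=1):
    """Output length along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


height, width, characters = 16, 32, 6  # assumed orientation of the 32x16 board

first = conv_parameters(characters, 32, kernel_size=3)
second = conv_parameters(32, 64, kernel_size=3)
out_h = conv_output_size(height, kernel_size=3, padding=1)
out_w = conv_output_size(width, kernel_size=3, padding=1)
flattened = 64 * out_h * out_w

print(first, second)  # 1760 18496
print(out_h, out_w)   # 16 32
print(flattened)      # 32768
```

The two convolutional layers together hold about 20,000 parameters,
while the flattened output passed to the MLP head has 32,768 values.
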
Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.

- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
  local patterns—which characters are adjacent to which—without needing
  to see every possible arrangement. This is appropriate for games like
  Snake, where the snake's direction and the apple's relative position
  are spatially encoded.

- If ``spatial = false`` (the default), the MLP treats the board as a
  flat vector. This may be appropriate for games that use the character
  grid primarily as a display rather than a spatial field—for example,
  a game where characters appear in fixed, non-interacting positions as
  status indicators.

- The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
  network. A character set that includes characters the game never uses
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
  to the flattened board output before the MLP head. This allows the
  agent to use computed values—a direction to the goal, a distance, a
  timer—alongside the visual board representation.

These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.

Design rationale
----------------

This section explains the reasoning behind several design decisions in
``retro-gamer`` that go beyond technical necessity. Each choice was
made with a specific pedagogical goal: to create a tool that not only
trains agents, but also helps students build genuine understanding of
how and why the training process works.

Checkpoint compatibility and the "start fresh" workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a student changes the game description or network architecture
mid-training, ``retro-gamer`` refuses to resume and explains exactly
which fields changed and why they are incompatible. This behavior is
deliberate.

The immediate practical reason is correctness: if the character set
changes, the network's input layer changes size, and the saved weights
no longer correspond to any meaningful function. Loading them would
produce garbage behavior. If the reward signal changes, the Q-values
the network has accumulated are estimates of a *different* objective;
resuming would mislead the network, not help it.

But the deeper reason is pedagogical. The incompatibility check is a
moment of forced reflection. When a student sees::

    character_set
    was : ['@', '*', '>', '<', '^', 'v']
    now : ['@', '*', '>', '<', '^', 'v', '#']
    why : the set of board characters (changes input layer size)

they are confronted with the concrete consequence of a description
change. The character set is not a label; it determines the shape of
the tensor the network operates on. Changing it invalidates the
network the same way changing the rules of chess would invalidate a
chess engine. The error message is designed to make this connection
legible, not just to block a problematic action.

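A check of this kind could be sketched as a comparison between the
description saved with a checkpoint and the one currently on disk. The
function ``changed_fields``, the dictionary layout, and the ``reward``
value are hypothetical; this is not how ``retro-gamer`` is known to
store or compare descriptions.

```python
def changed_fields(saved, current, fields):
    """Return {field: (was, now)} for every listed field whose value differs."""
    return {
        name: (saved.get(name), current.get(name))
        for name in fields
        if saved.get(name) != current.get(name)
    }


# Hypothetical description stored with a checkpoint vs. the one on disk now.
saved = {"character_set": ["@", "*", ">", "<", "^", "v"], "reward": "score"}
current = {"character_set": ["@", "*", ">", "<", "^", "v", "#"], "reward": "score"}

changes = changed_fields(saved, current, fields=["character_set", "reward"])
for name, (was, now) in changes.items():
    print(name)
    print("  was :", was)
    print("  now :", now)

# Six channels per cell became seven, so the old input layer no longer fits.
input_sizes = (len(saved["character_set"]), len(current["character_set"]))
print(input_sizes)  # (6, 7)
```

Only ``character_set`` is reported, because the ``reward`` field is the
same in both descriptions.
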
The ``retro-gamer clean`` command exists to make the recovery path
explicit: you can start fresh, and you should. There is no partial
salvage. This mirrors an important truth about RL training: some
decisions are foundational, and changing them means starting over.
Students who encounter this—who have to decide whether a change is
worth the cost of retraining—are reasoning about the architecture in
a way that purely reading about it does not produce.

The distinction between incompatible changes (game description,
network architecture) and safe changes (hyperparameters like learning
rate and epsilon) is also pedagogically useful. It encodes, in the
tool itself, the distinction between *what the agent is learning* and
*how it is learning*. Students who ask "can I change the learning rate
without retraining?" are asking a question with a precise answer, and
answering it correctly requires understanding why the learning rate is
different in kind from the character set.

Checkpoint-level logging
~~~~~~~~~~~~~~~~~~~~~~~~~

Early versions of ``retro-gamer`` logged one line per episode. This
was accurate but not very useful: a run of 1,000 episodes produces
1,000 log lines, most of which are noise. Individual episodes vary
widely due to randomness in both the game and the agent's exploration,
making it hard to see the underlying trend.

The current format logs one line per checkpoint—once every 100
episodes—using averages over that window. This design serves several
goals:

**Noise reduction.** Single-episode rewards are highly variable,
especially when epsilon is high and the agent is behaving randomly.
Averaging over 100 episodes smooths out this variance and makes
genuine trends visible.

**Interpretive scaffolding.** The log line includes ``epsilon``
alongside ``avg_reward``, so students can directly see the
relationship between exploration rate and performance. Early entries
with low ``avg_reward`` and high ``epsilon`` invite the question:
"is this bad performance, or just exploration?" The answer—that random
behavior is expected when epsilon is near 1—is readable from the log
itself.

**Timing information.** Each log line records both the elapsed time
for that 100-episode interval and the total training time accumulated
across all sessions. This serves two purposes. Practically, it lets
students estimate how long continued training will take. Conceptually,
it makes the cost of training tangible: RL is not instant, and the
log makes the time investment visible.

**Session continuity.** When training resumes from a checkpoint, a
header line marks the break (``=== Resumed from ep_0500.pt ===``).
This lets the full log tell the story of a run across multiple
sessions, preserving the history of when training happened even if the
student stops and restarts many times.

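The noise-reduction effect of window averaging is easy to reproduce with
synthetic data. In the sketch below the rewards are invented (a slow
upward trend plus uniform noise) and the function name is hypothetical;
only the window of 100 episodes comes from this document.

```python
import random


def checkpoint_averages(episode_rewards, window=100):
    """Average rewards over consecutive windows, one value per checkpoint."""
    return [
        sum(episode_rewards[i:i + window]) / window
        for i in range(0, len(episode_rewards) - window + 1, window)
    ]


rng = random.Random(0)
# Made-up data: a slow upward trend buried under large per-episode noise.
rewards = [episode / 100 + rng.uniform(-5, 5) for episode in range(1000)]

averages = checkpoint_averages(rewards)
print(len(rewards), len(averages))  # 1000 10
print([round(a, 1) for a in averages])
```

Individual rewards swing by several points from one episode to the next,
while the ten averages rise steadily enough for the trend to be read off
directly.
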
The stop-watch-adjust-resume workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is designed around a workflow that the log format and
checkpoint system both support: stop training, watch the agent play,
decide what to change, and resume.

This workflow is pedagogically productive because it gives students
a *reason* to look at the log and a *reason* to think about
hyperparameters. Watching the agent at episode 100 play erratically,
then watching the agent at episode 500 navigate toward the apple more
consistently, is not just satisfying—it raises concrete questions.
Why did the agent improve? What changed between those two checkpoints?
What would happen if we gave it more time, or adjusted the reward?

These questions are best answered by consulting the log. The log in
turn connects the behavior the student observed to numbers they can
reason about: a decreasing loss, a declining epsilon, a rising average
reward. The three—visual observation, log interpretation, and
conceptual understanding—form a feedback loop that is much harder to
close if training is treated as a black box that produces only a final
model.

The fact that training can be stopped and resumed freely, with no
penalty and no extra flags, removes friction from this cycle. Students
who feel they can experiment—stop, look, think, resume—are more
likely to do so than students who feel they have to commit to a full
training run before seeing results.

Reward design as game description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
the game's state dictionary, not a function or a formula. This is
another deliberate design choice. The reward signal is defined in the
game code—in how the score changes when certain events occur—not in
the training configuration.

This forces students to engage with the reward where it lives: in the
game logic. If a student wants to change the reward structure, they
must change the game. This connects the RL concept of reward shaping
to the concrete act of writing Python code that updates a score. The
question "what reward should the agent get for moving toward the
apple?" becomes "what code should run when the snake moves?"—and
answering it requires reasoning about what behavior you want to
encourage and how a small, frequent signal compares to a large,
infrequent one.

The distinction between reward-signal design (a pedagogically rich
question with many possible answers) and reward-field specification
(a technical detail) is preserved in the interface. Students configure
the *key* to track; they design the *signal* in the game itself.

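The split between the key and the signal can be illustrated with a toy
game fragment. Everything here is hypothetical: the ``SnakeState``
class, the point values, and the ``reward_for_turn`` helper are invented
for the example and do not reflect the ``retro-games`` API. The sketch
relies only on the statement above that the reward is the change in a
tracked state key.

```python
class SnakeState:
    """Hypothetical fragment of game logic; only the score handling is shown."""

    def __init__(self):
        self.state = {"score": 0}

    def move(self, ate_apple, moved_closer):
        # Large, infrequent signal: eating the apple.
        if ate_apple:
            self.state["score"] += 10
        # Small, frequent signal: an optional shaping bonus.
        elif moved_closer:
            self.state["score"] += 1


def reward_for_turn(state_before, state_after, reward_key="score"):
    """Reward seen by the trainer: the change in the tracked state key."""
    return state_after[reward_key] - state_before[reward_key]


game = SnakeState()
rewards = []
for ate_apple, moved_closer in [(False, True), (False, False), (True, True)]:
    before = dict(game.state)  # snapshot of the state before the move
    game.move(ate_apple, moved_closer)
    rewards.append(reward_for_turn(before, game.state))

print(rewards)  # [1, 0, 10]
```

Removing the shaping bonus means deleting two lines of game code; the
``reward`` key in the description stays the same.
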
Metadata as game description, not training configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The game description lives in ``[tool.retro-gamer]`` inside the
game's own ``pyproject.toml``, not in a separate training
configuration file. This placement encodes a claim: the character set,
the action space, and the reward signal are *properties of the game*,
not settings for the trainer.

A student who edits the character set is not tweaking the trainer;
they are more accurately describing their game. This framing matters
because it positions the student as the expert on the game—which they
are—and the trainer as a tool that depends on the accuracy of that
description. Errors in the description are not configuration mistakes;
they are inaccurate descriptions of something the student knows.

When a student omits a character from the character set and the agent
fails to notice that character on the board, the diagnostic question
is not "what went wrong with training?" but "is my description of the
game correct?" This is a more productive question, because it connects
the student's domain knowledge (they know what characters appear and
why they matter) to the technical representation (one-hot encoding
requires knowing in advance which characters to encode). The fix is
not to adjust a hyperparameter; it is to describe the game more
accurately.