Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.

The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit, embedded
in how they play, but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
``[tool.retro-gamer]`` section in ``pyproject.toml`` that accurately
describes the game to a learning algorithm is a form of structured
reflection: students have to articulate, in precise terms, what they
know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object (how gears mesh, how the turtle moves) provides traction on
abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful: they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates (trying new things versus sticking with
what has worked) is analogous to the choices a student makes when
learning a new game.

``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why* (what the
agent is learning to do) rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter: which ones indicate
danger, opportunity, or neutral terrain. They know which changes in game
state signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.

This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration; they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.

Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others: explaining,
questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.

Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions (using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model) interact to support conceptual
understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified
  character set. If you forgot to include one of the characters which
  is used in the game, how would it affect the trained agent's
  performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
  specified character set. You realize that only half of the specified
  characters are actually used in the game. If you change the
  character set to include only the characters that actually appear,
  how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
  partial success has no value (for example, a game where the goal is
  to escape a maze). What would be the effect on agent training of
  adding artificial rewards for completing sub-goals such as reaching
  a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.

All participants receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM, a sign that students are reasoning more
deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.

At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal, a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.

A distinctive feature of reinforcement learning, and what distinguishes
it from supervised learning (where a model is trained on labeled
examples), is that the agent must discover what good behavior looks like
through experience. There is no teacher providing correct answers. The
reward signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).

The total reward the agent receives from a given state onward, if it
acts according to its current policy, is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number between 0 and 1, usually slightly less than 1.
The ``gamma`` hyperparameter in ``retro-gamer`` controls this discount.
A value close to 1 means the agent values the distant future almost as
much as the immediate present; a smaller value makes the agent more
myopic.

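The effect of the discount can be checked with a few lines of Python.
This is a minimal sketch rather than code from ``retro-gamer``: the
function name ``discounted_return`` and the three-turn reward sequence
are made up for illustration.

.. code-block:: python

   def discounted_return(rewards, gamma):
       """Sum rewards, weighting the reward t turns ahead by gamma ** t."""
       total = 0.0
       # Work backwards so each step is one multiply and one add:
       # G_t = r_t + gamma * G_{t+1}
       for reward in reversed(rewards):
           total = reward + gamma * total
       return total

   rewards = [0, 0, 10]  # hypothetical episode: 10 points on the third turn
   far_sighted = discounted_return(rewards, gamma=0.99)
   myopic = discounted_return(rewards, gamma=0.5)
   print(far_sighted, myopic)

With ``gamma=0.99`` the delayed reward keeps about 98% of its value;
with ``gamma=0.5`` it keeps only a quarter.
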
Q-learning
~~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then follows its current policy from that point on. If the agent knew
the true Q-function of an optimal policy, it could act optimally simply
by choosing the action with the highest Q-value in each state.

Q-learning is an algorithm for learning the Q-function from experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

After each turn, the agent computes the *temporal difference* (TD)
error, the gap between its current Q-estimate and what the Bellman
equation says it should be, and adjusts its estimates to reduce the
error. Over many iterations, the Q-estimates converge toward their
true values.

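For a game small enough to keep Q-values in a table, one update step
might look like the sketch below. It is illustrative only: the step
size ``alpha`` (how far each estimate moves toward its target) is an
assumption not discussed above, and the state and action names are
hypothetical.

.. code-block:: python

   from collections import defaultdict

   def q_learning_update(q, state, action, reward, next_state, actions,
                         gamma=0.99, alpha=0.1):
       """Nudge q[(state, action)] toward the Bellman target.

       Returns the TD error that was used for the update.
       """
       best_next = max(q[(next_state, a)] for a in actions)
       target = reward + gamma * best_next          # r + gamma * max Q(s', a')
       td_error = target - q[(state, action)]       # gap to current estimate
       q[(state, action)] += alpha * td_error       # move part of the way
       return td_error

   actions = ["left", "right"]
   q = defaultdict(float)  # every Q-estimate starts at 0.0

   first_error = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
   second_error = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
   print(first_error, second_error, q[("s0", "right")])

Seeing the same transition twice produces a smaller TD error the second
time, because the first update already moved the estimate toward the
target.
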
Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions (say 32×16 cells, each displaying one of a handful of
characters), the number of possible board configurations is
astronomically large. Storing a table of Q-values for every
configuration is not practical.

*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience, a (state, action, reward,
next state) tuple, in a *replay buffer*. Periodically, it samples a
random batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive (which would mean training on correlated,
sequential transitions), the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions, which would otherwise cause the
network's weight updates to interfere with each other.

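A bounded buffer with uniform random sampling can be sketched with the
standard library alone. The class and method names below are
hypothetical and are not taken from ``retro-gamer``; states are plain
turn numbers to keep the example small.

.. code-block:: python

   import random
   from collections import deque

   class ReplayBuffer:
       """Fixed-capacity store of (state, action, reward, next_state)."""

       def __init__(self, capacity):
           # A deque with maxlen drops the oldest item once it is full.
           self.experiences = deque(maxlen=capacity)

       def add(self, state, action, reward, next_state):
           self.experiences.append((state, action, reward, next_state))

       def sample(self, batch_size, rng=random):
           # Uniform sampling without replacement mixes up turn order.
           return rng.sample(list(self.experiences), batch_size)

       def __len__(self):
           return len(self.experiences)

   buffer = ReplayBuffer(capacity=5)
   for turn in range(8):
       buffer.add(state=turn, action="right", reward=0, next_state=turn + 1)

   batch = buffer.sample(batch_size=3, rng=random.Random(0))
   print(len(buffer), sorted(exp[0] for exp in batch))

After eight turns only the five most recent experiences remain, and the
sampled batch is drawn from those in no particular order.
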
``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors (cases where the agent's prediction was most wrong) are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).

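The sketch below shows proportional prioritization in the style of
Schaul et al., written with plain lists for readability. It is not the
``retro-gamer`` implementation: the exponents ``alpha`` and ``beta``,
the small constant ``eps``, and the function name are assumptions, and
real implementations use faster data structures.

.. code-block:: python

   import random

   def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4,
                          rng=random, eps=1e-6):
       """Return (indices, weights) for a batch drawn by |TD error|."""
       priorities = [(abs(err) + eps) ** alpha for err in td_errors]
       total = sum(priorities)
       probs = [p / total for p in priorities]
       indices = rng.choices(range(len(td_errors)), weights=probs,
                             k=batch_size)
       # Importance-sampling weights shrink the update for experiences
       # that are drawn more often than uniform sampling would draw them.
       n = len(td_errors)
       largest = max((n * p) ** -beta for p in probs)
       weights = [((n * probs[i]) ** -beta) / largest for i in indices]
       return indices, weights

   td_errors = [0.1, 0.1, 2.0, 0.1]  # hypothetical: one surprising experience
   indices, weights = sample_prioritized(td_errors, batch_size=1000,
                                         rng=random.Random(0))
   share = indices.count(2) / len(indices)
   print(round(share, 2), min(weights), max(weights))

The surprising experience dominates the batch, and it is also the one
that receives the smallest importance-sampling weight.
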
The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.

Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the Q-values computed by the
Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, its Q-value
estimates keep shifting, making the training target a moving one. This
can cause instability.

DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.

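The timing of those copies can be illustrated with a toy loop in which
the "network" is a single number that changes on every step. Everything
here is a stand-in chosen for illustration; only the parameter name
``target_update_freq`` comes from the text above.

.. code-block:: python

   import copy

   def train_with_target_sync(num_steps, target_update_freq):
       """Toy loop: the weights are a one-element list."""
       main_weights = [0.0]
       target_weights = copy.deepcopy(main_weights)
       history = []
       for step in range(1, num_steps + 1):
           main_weights[0] += 1.0  # stand-in for one gradient update
           if step % target_update_freq == 0:
               # Refresh the frozen copy used for Bellman targets.
               target_weights = copy.deepcopy(main_weights)
           history.append((step, main_weights[0], target_weights[0]))
       return history

   history = train_with_target_sync(num_steps=6, target_update_freq=3)
   for step, main, target in history:
       print(step, main, target)

The main value changes on every step, while the target value moves only
at steps 3 and 6.
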
Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay``, reaching
a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
training log shows how exploration decreases as training progresses.

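A minimal sketch of ε-greedy selection with a decaying ε follows. The
exact schedule ``retro-gamer`` uses is not specified here, so the
multiplicative per-episode decay, the example values 0.99 and 0.05, and
the function names are all assumptions.

.. code-block:: python

   import random

   def choose_action(q_values, epsilon, rng=random):
       """Random action with probability epsilon, else the greedy one."""
       if rng.random() < epsilon:
           return rng.randrange(len(q_values))
       return max(range(len(q_values)), key=lambda a: q_values[a])

   def decay(epsilon, epsilon_decay, epsilon_min):
       # Assumed schedule: multiply once per episode, with a floor.
       return max(epsilon_min, epsilon * epsilon_decay)

   epsilon = 1.0
   schedule = []
   for episode in range(500):
       schedule.append(epsilon)
       epsilon = decay(epsilon, epsilon_decay=0.99, epsilon_min=0.05)

   print(schedule[0], round(schedule[100], 3), schedule[-1])
   print(choose_action([0.1, 0.9, 0.3], epsilon=0.0))

Under these assumed settings ε falls from 1 to roughly 0.37 after 100
episodes and sits at the 0.05 floor by the end. With ``epsilon=0.0``
the agent always picks the action with the highest Q-value.
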
Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.

For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The total number
of values in this array, H × W × C, is the size of the board part of
the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 numbers.

The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector, indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.

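The encoding, including what happens to a character that is missing
from the set, can be sketched in a few lines. The function name
``one_hot_board`` is hypothetical, and the sketch assumes that an empty
cell is a space and that the board arrives as a list of strings.

.. code-block:: python

   def one_hot_board(rows, character_set):
       """Encode equal-length strings as an H x W x C nested list."""
       index = {char: i for i, char in enumerate(character_set)}
       board = []
       for row in rows:
           encoded_row = []
           for char in row:
               cell = [0] * len(character_set)
               if char in index:  # unknown characters stay all-zero
                   cell[index[char]] = 1
               encoded_row.append(cell)
           board.append(encoded_row)
       return board

   character_set = ['@', '*', '>']
   rows = ["@ *",
           " >#"]  # '#' is deliberately missing from the character set
   encoded = one_hot_board(rows, character_set)
   print(encoded[1][1])                   # the '>' cell
   print(encoded[1][2] == encoded[0][1])  # '#' versus an empty cell

The ``'>'`` cell comes out as ``[0, 0, 1]``, and the ``'#'`` cell is
identical to the empty cell, which is exactly the loss of information
described above.
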
In addition to the board, the agent can observe numerical values from
the game's state dictionary via ``observe_state``. These are
appended to the end of the observation vector. The reward key must
not be included in ``observe_state``: it would give the agent direct
access to its own performance signal, which is not a realistic observation
in most game contexts and can cause training pathologies.

Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network (the number and arrangement of its
layers) is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
field in the game description and generates a plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have about 410,000 trainable
parameters (counting weights and biases), almost all of them in the
first layer. Training adjusts all of these parameters simultaneously to
reduce the TD error.

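The parameter count can be reproduced directly from the layer sizes.
This is a small arithmetic sketch with a hypothetical helper name; it
assumes every unit has one bias in addition to its incoming weights.

.. code-block:: python

   def mlp_parameter_count(layer_sizes):
       """Count weights and biases for fully connected layers."""
       total = 0
       for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
           total += n_in * n_out  # one weight per connection
           total += n_out         # one bias per unit
       return total

   observation_size = 32 * 16 * 6  # a 3,072-number board observation
   first_layer = mlp_parameter_count([observation_size, 128])
   whole_network = mlp_parameter_count([observation_size, 128, 128, 5])
   print(first_layer, whole_network)  # 393344 410501

The first layer alone accounts for 393,344 of the 410,501 parameters,
which is why the size of the observation matters so much for an MLP.
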
An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings (a set of meters or status indicators) rather than as a spatial
scene. Set ``spatial = false`` in the game description to use this
architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial, that is, when the relative
positions of characters matter, a *convolutional neural network* (CNN)
is a much better fit. A CNN applies a set of learnable *filters* (small
weight matrices) across the board, computing a dot product of each
filter with every overlapping patch of the input. The result is a set of
*feature maps*: each feature map highlights where in the board a
particular pattern appears.

This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" works the same way whether the apple is at position (10,5)
or (20,12). This *translational invariance* means the network can
generalize across positions without learning a separate rule for each
one. Second, each filter needs only a small number of parameters (set
by the filter size), far fewer than a fully connected layer covering
the same input.

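The position-independence of a filter can be seen in a tiny pure-Python
convolution. This sketch uses hand-set weights rather than learned
ones, treats ``'@'`` as the snake head and ``'*'`` as the apple, and
uses a 1×2 filter over two one-hot planes; all of these choices are
illustrative assumptions.

.. code-block:: python

   def convolve2d(grid, kernel):
       """Slide kernel over grid (no padding); return the dot products."""
       kh, kw = len(kernel), len(kernel[0])
       out = []
       for top in range(len(grid) - kh + 1):
           row = []
           for left in range(len(grid[0]) - kw + 1):
               row.append(sum(kernel[i][j] * grid[top + i][left + j]
                              for i in range(kh) for j in range(kw)))
           out.append(row)
       return out

   def plane(rows, char):
       return [[1 if c == char else 0 for c in row] for row in rows]

   def detect(rows):
       """Respond with 2 wherever '@' sits immediately left of '*'."""
       head = convolve2d(plane(rows, '@'), [[1, 0]])
       apple = convolve2d(plane(rows, '*'), [[0, 1]])
       # A multi-channel filter sums its per-channel responses.
       return [[h + a for h, a in zip(hr, ar)]
               for hr, ar in zip(head, apple)]

   def peak(feature_map):
       best = max(max(row) for row in feature_map)
       return best, [(r, c) for r, row in enumerate(feature_map)
                     for c, v in enumerate(row) if v == best]

   board_a = ["@*..", "....", "...."]
   board_b = ["....", "....", "..@*"]
   print(peak(detect(board_a)))  # (2, [(0, 0)])
   print(peak(detect(board_b)))  # (2, [(2, 2)])

The same two weights produce the same peak response in both boards;
only the location of the peak in the feature map changes.
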
``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` (the default) to use this
architecture.

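The shapes and parameter counts implied by that description can be
worked out with simple arithmetic. The sketch assumes a stride of 1 and
one bias per output channel, neither of which is stated above, and it
uses a hypothetical 16-row by 32-column board with 6 characters.

.. code-block:: python

   def conv_layer_params(in_channels, out_channels, kernel_size=3):
       """Weights plus one bias per output channel (square kernel)."""
       weights = in_channels * out_channels * kernel_size * kernel_size
       return weights + out_channels

   def conv_output_size(size, kernel_size=3, padding=1, stride=1):
       return (size + 2 * padding - kernel_size) // stride + 1

   height, width, n_chars = 16, 32, 6  # hypothetical board
   conv1 = conv_layer_params(n_chars, 32)
   conv2 = conv_layer_params(32, 64)
   out_h = conv_output_size(height)
   out_w = conv_output_size(width)
   flattened = 64 * out_h * out_w
   print(conv1, conv2, (64, out_h, out_w), flattened)

Under these assumptions the two convolutional layers together hold
about 20,000 parameters, and the flattened output passed to the MLP
head has 32,768 values.
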
Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.

- If ``spatial = true``, the CNN can detect local patterns (which characters
  are adjacent to which) without needing to see every possible arrangement.
  This is appropriate for games like Snake, where the snake's direction
  and the apple's relative position are spatially encoded.

- If ``spatial = false``, the MLP treats the board as a flat vector. This
  may be appropriate for games that use the character grid primarily as a
  display rather than a spatial field; for example, a game where characters
  appear in fixed, non-interacting positions as status indicators.

- The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
  network. A character set that includes characters the game never uses
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- The ``observe_state`` fields are appended to the flattened CNN output
  before the MLP head. This allows the agent to use explicit state
  variables (a timer, a lives count) alongside the visual board
  representation.

These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.