Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.

The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit, embedded
in how they play, but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
``[tool.retro-gamer]`` section that accurately describes your game to a
learning algorithm is a form of structured reflection: you have to
articulate, in precise terms, what you know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object (how gears mesh, how the turtle moves) provides traction on
abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful; they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates (trying new things versus sticking with
what has worked) is analogous to the choices a student makes when
learning a new game.

``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why* (what the
agent is learning to do) rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter: which ones indicate
danger, opportunity, or neutral terrain. They know which changes in game
state signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.

This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration; they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.

Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others: explaining,
questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.

Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions (using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model) interact to support conceptual
understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified
  character set. If you forgot to include one of the characters which
  is used in the game, how would it affect the trained agent's
  performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
  specified character set. You realize that only half of the specified
  characters are actually used in the game. If you change the
  character set to include only the characters that actually appear,
  how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
  partial success has no value: for example, a game where the goal is
  to escape a maze. What would be the effect on agent training of
  adding artificial rewards for completing sub-goals such as reaching
  a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.

Participants all receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM, a sign that students are reasoning more
deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.

At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal: a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.
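
The observe, act, reward cycle described above can be sketched with a
toy environment. ``CorridorGame`` and ``run_episode`` are hypothetical
names invented for this illustration; they are not part of
``retro-gamer`` or ``retro-games``.

```python
class CorridorGame:
    """Toy stand-in for a game: walk right along a corridor to reach the exit."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def observe(self):
        # The observation here is simply the player's position.
        return self.position

    def step(self, action):
        """Apply 'left' or 'right' and return (reward, done)."""
        move = 1 if action == "right" else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length
        reward = 1 if done else 0  # reward arrives only at the exit
        return reward, done


def run_episode(env, policy, max_turns=100):
    """Play one episode and return the total reward the policy collected."""
    total_reward = 0
    for _ in range(max_turns):
        state = env.observe()
        action = policy(state)
        reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    # A fixed policy: ignore the state and always move right.
    return "right"


print(run_episode(CorridorGame(), always_right))
```

A learning agent differs from ``always_right`` only in that its policy
changes in response to the rewards it has seen.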

A distinctive feature of reinforcement learning, distinguishing it from
supervised learning (where a model is trained on labeled examples), is
that the agent must discover what good behavior looks like through
experience. There is no teacher providing correct answers. The reward
signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).

The total reward the agent receives from a given state onward, if it
acts according to its current policy, is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number slightly less than 1. The ``gamma``
hyperparameter in ``retro-gamer`` controls this discount. A value
close to 1 means the agent values the distant future almost as much
as the immediate present; a smaller value makes the agent more
myopic.
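
To make the effect of ``gamma`` concrete, here is a minimal sketch
that computes a discounted return for a short, made-up sequence of
rewards. The function name ``discounted_return`` and the reward values
are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward t turns ahead by gamma ** t."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A reward of 10 that arrives three turns from now.
rewards = [0, 0, 0, 10]

far_sighted = discounted_return(rewards, gamma=0.99)  # about 9.70
myopic = discounted_return(rewards, gamma=0.5)        # 1.25

print(far_sighted, myopic)
```

With ``gamma = 0.99`` the delayed reward keeps almost all of its
value; with ``gamma = 0.5`` it is worth only an eighth of its face
value.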

Q-learning
~~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then acts as well as possible from that point on. If the agent knew
this true Q-function, it could act optimally simply by choosing the
action with the highest Q-value in each state.

Q-learning is an algorithm for learning the Q-function by experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

After each turn, the agent computes the *temporal difference* (TD)
error, the gap between its current Q-estimate and what the Bellman
equation says it should be, and nudges its estimate a small step
toward the Bellman target to reduce the error. Over many iterations,
the Q-estimates converge toward their true values.
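
A minimal sketch of a single tabular Q-learning update follows. The
``learning_rate`` parameter (how far each update moves the estimate
toward the Bellman target) and the ``done`` flag are assumptions of
this illustration; they are standard in Q-learning but are not
described elsewhere in this document.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             gamma=0.9, learning_rate=0.5, done=False):
    """Move Q(state, action) toward the Bellman target; return the TD error."""
    # No future reward is available once the episode has ended.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    td_error = target - q_table[(state, action)]
    q_table[(state, action)] += learning_rate * td_error
    return td_error


q_table = defaultdict(float)  # every Q-estimate starts at 0
actions = ["left", "right"]

# Reaching the exit from state 3 yields reward 1 and ends the episode.
first_error = q_update(q_table, 3, "right", 1.0, 4, actions, done=True)

# Stepping from state 2 to state 3 yields no reward, but state 3 now has value.
second_error = q_update(q_table, 2, "right", 0.0, 3, actions)

print(q_table[(3, "right")], q_table[(2, "right")])
```

The second update shows how value propagates backwards: state 2
becomes attractive only because state 3 has already been credited.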

Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions (say 32×16 cells, each displaying one of a handful of
characters) the number of possible board configurations is astronomically
large. Storing a table of Q-values for every configuration is not
practical.

*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience, a tuple of (state, action,
reward, next state), in a *replay buffer*. Periodically, it samples a
random batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive, which would mean training on correlated,
sequential transitions, the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions, which would otherwise cause the
network's weight updates to interfere with each other.

``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors (cases where the agent's prediction was most wrong) are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).
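
The sketch below illustrates the proportional form of prioritized
sampling described by Schaul et al.: sampling probability grows with
the size of the TD error, and each sampled experience receives an
importance sampling weight that shrinks as its probability grows. The
parameter names ``alpha`` and ``beta`` follow that paper; whether
``retro-gamer`` exposes them, and under what names, is not stated
here, so treat them as assumptions.

```python
import random


def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample indices in proportion to |TD error| ** alpha.

    Returns (indices, probabilities, weights), where weights are the
    importance sampling corrections for the sampled indices.
    """
    # eps keeps experiences with zero error from never being sampled.
    priorities = [(abs(error) + eps) ** alpha for error in td_errors]
    total = sum(priorities)
    probabilities = [p / total for p in priorities]

    indices = random.choices(range(len(td_errors)),
                             weights=probabilities, k=batch_size)

    # Frequently sampled experiences get smaller weights to offset the bias.
    n = len(td_errors)
    weights = [(n * probabilities[i]) ** (-beta) for i in indices]
    largest = max(weights)
    weights = [w / largest for w in weights]  # scale so the maximum is 1
    return indices, probabilities, weights


random.seed(0)
indices, probabilities, weights = sample_prioritized([0.1, 2.0, 0.5], batch_size=4)
print(probabilities)
```

The experience with TD error 2.0 has the highest sampling probability
and, whenever it is drawn, the smallest importance weight.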

The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.
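
A fixed-capacity buffer with uniform random sampling can be sketched
in a few lines. The ``ReplayBuffer`` class is a hypothetical
illustration, not the class ``retro-gamer`` uses, and it assumes that
the oldest experiences are the ones discarded when the buffer is full.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, memory_capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.experiences = deque(maxlen=memory_capacity)

    def store(self, state, action, reward, next_state):
        self.experiences.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks up consecutive transitions.
        return random.sample(list(self.experiences), batch_size)

    def __len__(self):
        return len(self.experiences)


buffer = ReplayBuffer(memory_capacity=3)
for turn in range(5):
    buffer.store(turn, "right", 0, turn + 1)

remaining_states = [experience[0] for experience in buffer.experiences]
batch = buffer.sample(2)
print(len(buffer), remaining_states)
```

After five stores into a buffer of capacity three, only the
experiences from turns 2, 3, and 4 remain.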

Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the Q-values computed by the
Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, its Q-value
estimates keep shifting, making the training target a moving one. This
can cause instability.

DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.
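
The synchronization schedule can be sketched without any neural
network at all. In the toy below, a single number stands in for the
network weights and a fixed increment stands in for a gradient-descent
update; only the role of ``target_update_freq`` is meant to be
faithful.

```python
import copy


def train(num_steps, target_update_freq):
    """Update 'main' every step; copy it into 'target' every few steps."""
    main_weights = {"w": 0.0}
    target_weights = copy.deepcopy(main_weights)
    sync_steps = []

    for step in range(1, num_steps + 1):
        # Stand-in for one gradient-descent update of the main network.
        main_weights["w"] += 0.1

        # The target network only changes at these synchronization points.
        if step % target_update_freq == 0:
            target_weights = copy.deepcopy(main_weights)
            sync_steps.append(step)

    return main_weights, target_weights, sync_steps


main_weights, target_weights, sync_steps = train(num_steps=10, target_update_freq=4)
print(sync_steps, main_weights, target_weights)
```

Between synchronizations the target lags behind the main weights,
which is what keeps the training targets stable.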

Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay`` until it
reaches a floor of ``epsilon_min``. Reading the ``epsilon`` column in
the training log shows how exploration decreases as training progresses.
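
A minimal sketch of ε-greedy action selection and decay follows. It
assumes that ``epsilon_decay`` is a multiplicative factor applied
repeatedly; the text above does not say whether the actual schedule is
multiplicative or linear, so treat that detail as an assumption. The
function names are illustrative.

```python
import random


def choose_action(q_values, epsilon):
    """Return a random action index with probability epsilon, else the best one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    # Exploit: index of the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda action: q_values[action])


def decay_epsilon(epsilon, epsilon_decay, epsilon_min):
    """Shrink epsilon by a constant factor, never going below the floor."""
    return max(epsilon_min, epsilon * epsilon_decay)


greedy_choice = choose_action([0.1, 0.9, 0.3], epsilon=0.0)

epsilon = 1.0  # start with pure exploration
for _ in range(1000):
    epsilon = decay_epsilon(epsilon, epsilon_decay=0.99, epsilon_min=0.05)

print(greedy_choice, epsilon)
```

With ε at 0 the agent always picks the action whose estimate is 0.9,
and after enough decay steps ε settles at the floor.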

Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.

For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The total count
of numbers in this array, H × W × C, is the size of the board part of
the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 numbers.

The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector, indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.
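
The encoding, and the consequence of leaving a character out of the
set, can be sketched in plain Python. The function ``one_hot_board``
and the tiny two-row board are hypothetical; the sketch assumes that
an empty cell is a space character.

```python
def one_hot_board(board, character_set):
    """Encode a list of equal-length strings as nested lists of shape (H, W, C)."""
    index = {character: i for i, character in enumerate(character_set)}
    encoded = []
    for row in board:
        encoded_row = []
        for character in row:
            vector = [0] * len(character_set)
            # Characters outside the set, and empty cells, stay all zero.
            if character in index:
                vector[index[character]] = 1
            encoded_row.append(vector)
        encoded.append(encoded_row)
    return encoded


board = ["@ *",
         " > "]

full = one_hot_board(board, ["@", "*", ">"])
print(full[1][1])      # the '>' cell
print(full[0][1])      # an empty cell

# Omitting '>' from the character set makes it look like an empty cell.
partial = one_hot_board(board, ["@", "*"])
print(partial[1][1] == partial[0][1])
```

With the full character set the ``'>'`` cell is encoded as
``[0, 0, 1]``; with the reduced set it becomes ``[0, 0]``, exactly the
same as an empty cell.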

In addition to the board, the agent can observe numerical values from
the game's state dictionary via ``observe_state``. These are
appended to the end of the observation vector. The reward key must
not be included in ``observe_state``: it would give the agent direct
access to its own performance signal, which is not a realistic observation
in most game contexts and can cause training pathologies.

Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network (the number and arrangement of its
layers) is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
field in the game description and generates a plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have roughly 410,000 trainable
parameters, nearly all of them in the first layer (3,072 × 128 weights).
Training adjusts all of these parameters simultaneously to reduce the
TD error.
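
The parameter count can be checked with a few lines of arithmetic. The
sketch assumes every fully connected layer has one weight per
input-output pair plus one bias per output unit; the function name is
illustrative.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases for consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per output unit
    return total


# Observation of size 3,072, two hidden layers of width 128, 5 actions.
parameter_count = mlp_parameter_count([3072, 128, 128, 5])
first_layer = mlp_parameter_count([3072, 128])

print(parameter_count, first_layer)
```

The first layer alone accounts for 393,344 of the 410,501 parameters,
which is why the size of the observation matters so much for an MLP.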

An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings (a set of meters or status indicators) rather than as a
spatial scene. Set ``spatial = false`` in the game description to use
this architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial, meaning that the relative
positions of characters matter, a *convolutional neural network* (CNN)
is a much better fit. A CNN applies a set of learnable *filters* (small
weight matrices) across the board, computing a dot product of each
filter with every overlapping patch of the input. The result is a set
of *feature maps*: each feature map highlights where in the board a
particular pattern appears.

This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" works the same way whether the apple is at position (10,5)
or (20,12). This weight sharing is often loosely called *translational
invariance*; strictly speaking it is *translation equivariance*, since
shifting a pattern on the board shifts the filter's response by the
same amount. Either way, the network can generalize across positions
without learning a separate rule for each one. Second, each filter
needs only a small number of parameters, set by the filter size and the
number of input channels: far fewer than a fully connected layer over
the same input.

``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` (the default) to use this
architecture.
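
The shapes and parameter counts implied by this description can be
worked out by hand. The sketch below assumes a stride of 1 (implied by
the statement that spatial dimensions are preserved), one bias per
output channel, and a board of 16 rows by 32 columns with 6
characters; the function names are illustrative.

```python
def conv_output_size(size, kernel_size, padding, stride=1):
    """Length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_parameter_count(in_channels, out_channels, kernel_size):
    """Weights for every filter plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


height, width, num_characters = 16, 32, 6

# Kernel size 3 with padding 1 leaves the board dimensions unchanged.
out_height = conv_output_size(height, kernel_size=3, padding=1)
out_width = conv_output_size(width, kernel_size=3, padding=1)

conv1_params = conv_parameter_count(num_characters, 32, 3)
conv2_params = conv_parameter_count(32, 64, 3)

# Size of the flattened (64, H, W) output passed on to the MLP head.
flattened_size = 64 * out_height * out_width

print(out_height, out_width, conv1_params, conv2_params, flattened_size)
```

Both convolutional layers together have about 20,000 parameters,
regardless of board size, while the flattened output handed to the MLP
head has 32,768 values for this board.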

Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.

- If ``spatial = true``, the CNN can detect local patterns (which characters
  are adjacent to which) without needing to see every possible arrangement.
  This is appropriate for games like Snake, where the snake's direction
  and the apple's relative position are spatially encoded.

- If ``spatial = false``, the MLP treats the board as a flat vector. This
  may be appropriate for games that use the character grid primarily as a
  display rather than a spatial field: for example, a game where characters
  appear in fixed, non-interacting positions as status indicators.

- The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
  network. A character set that includes characters the game never uses
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- The ``observe_state`` fields are appended to the flattened CNN output
  before the MLP head. This allows the agent to use explicit state
  variables (a timer, a lives count) alongside the visual board
  representation.

These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.
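
The relationships listed above can be summarized in a short sketch
that maps a game description to the quantities it determines. The
function ``observation_summary`` and the dictionary layout are
hypothetical, and the sketch assumes each ``observe_state`` entry
contributes exactly one number; the default of ``spatial = true``
follows the text above.

```python
def observation_summary(description, board_height, board_width):
    """Derive architecture and input sizes from a game description."""
    channels = len(description["character_set"])
    state_values = description.get("observe_state", [])
    spatial = description.get("spatial", True)  # true is the documented default

    return {
        "architecture": "CNN" if spatial else "MLP",
        "board_shape": (board_height, board_width, channels),
        "board_size": board_height * board_width * channels,
        "state_size": len(state_values),
    }


# Hypothetical description of a Snake-like game with six characters.
description = {
    "character_set": ["@", "*", ">", "<", "^", "v"],
    "observe_state": ["lives", "timer"],
}

summary = observation_summary(description, board_height=32, board_width=16)
print(summary)

# Dropping half of the characters halves the board part of the observation.
smaller = dict(description, character_set=["@", "*", ">"])
print(observation_summary(smaller, 32, 16)["board_size"])
```

Changing a single field in the description changes the numbers the
network receives, which is the sense in which the description shapes
what the agent can perceive.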