Initial commit

Chris Proctor
2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions

12
docs/Makefile Normal file

@@ -0,0 +1,12 @@
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

442
docs/background.rst Normal file

@@ -0,0 +1,442 @@
Background
==========
Pedagogical framework
---------------------
Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.
The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit—embedded
in how they play—but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
``[tool.retro-gamer]`` section that accurately describes your game to a learning
algorithm is a form of structured reflection: you have to articulate,
in precise terms, what you know.
Objects to think with
~~~~~~~~~~~~~~~~~~~~~
The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object—how gears mesh, how the turtle moves—provides traction on
abstract relationships that might otherwise remain inaccessible.
A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful—they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates—trying new things versus sticking with
what has worked—is analogous to the choices a student makes when
learning a new game.
``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why*—what the
agent is learning to do—rather than watching an opaque number change.
Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter—which ones indicate
danger, opportunity, or neutral terrain. They know what game state
changes signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.
``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.
This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration—they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.
Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others—explaining,
questioning, and revising their understanding in dialogue.
``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.
Research design
~~~~~~~~~~~~~~~
The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions—using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model—interact to support conceptual
understanding of reinforcement learning.
The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:
- *Imagine you were training an agent to play a game with a specified
character set. If you forgot to include one of the characters which
is used in the game, how would it affect the trained agent's
performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
specified character set. You realize that only half of the specified
characters are actually used in the game. If you change the
character set to include only the characters that actually appear,
how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
partial success has no value—for example, a game where the goal is
to escape a maze. What would be the effect on agent training of
adding artificial rewards for completing sub-goals such as reaching
a milestone halfway to the exit? Explain your reasoning.*
Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.
All participants receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM—a sign that students are reasoning more
deeply about the underlying concepts.
Technical background
--------------------
This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.
Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~
*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.
At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal—a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.
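
The loop described above can be sketched in a few lines. This is a toy
illustration, not ``retro-gamer``'s own code: ``ToyGame``, its ``step``
method, and the two policies are hypothetical stand-ins. The part that
mirrors the text is that the reward on each turn is the change in score.

```python
import random


class ToyGame:
    """Hypothetical one-dimensional game: reaching cell 3 scores a point."""

    def __init__(self):
        self.position = 0
        self.score = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the board has cells 0..3.
        self.position = max(0, min(3, self.position + action))
        if self.position == 3:
            self.score += 1
            self.position = 0  # start over after scoring
        return self.position, self.score


def run_episode(policy, turns=20):
    """Play one episode and return the per-turn rewards (score deltas)."""
    game = ToyGame()
    observation, previous_score = game.position, game.score
    rewards = []
    for _ in range(turns):
        action = policy(observation)             # agent chooses an action
        observation, score = game.step(action)   # environment responds
        rewards.append(score - previous_score)   # reward = change in score
        previous_score = score
    return rewards


def always_right(observation):
    return +1


def random_policy(observation):
    return random.choice([-1, +1])
```

With ``always_right``, three turns are enough to score once, so
``run_episode(always_right, turns=3)`` returns ``[0, 0, 1]``.
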
A distinctive feature of reinforcement learning—distinguishing it from
supervised learning, where a model is trained on labeled examples—is
that the agent must discover what good behavior looks like through
experience. There is no teacher providing correct answers. The reward
signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).
The total reward the agent receives from a given state onward—if it
acts according to its current policy—is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number slightly less than 1. The ``gamma``
hyperparameter in ``retro-gamer`` controls this discount. A value
close to 1 means the agent values the distant future almost as much
as the immediate present; a smaller value makes the agent more
myopic.
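
As a small worked example of discounting, the function below adds up a
list of future rewards, scaling the reward ``t`` turns away by
``gamma**t``. The function name is illustrative and is not part of
``retro-gamer``.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward t turns from now is scaled by gamma**t."""
    return sum(reward * gamma**t for t, reward in enumerate(rewards))
```

For instance, ``discounted_return([1, 1, 1], 0.5)`` is
``1 + 0.5 + 0.25 = 1.75``, and a reward of 10 that arrives two turns
from now is worth about 8.1 when ``gamma`` is 0.9.
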
Q-learning
~~~~~~~~~~~
A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then follows its current policy from that point on. If the agent knew
the true Q-function, it could act optimally simply by choosing the
action with the highest Q-value in each state.
Q-learning is an algorithm for learning the Q-function by experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

After each turn, the agent computes this *temporal difference* (TD)
error—the gap between its current Q-estimate and what the Bellman
equation says it should be—and adjusts its estimates to reduce the
error. Over many iterations, the Q-estimates converge toward their
true values.
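
A minimal sketch of one tabular Q-learning update follows. It assumes a
learning rate ``alpha`` that does not appear in the equation above
(setting ``alpha = 1`` recovers the equation exactly); the function
name and default values are illustrative, not taken from
``retro-gamer``.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, gamma=0.99, alpha=0.1):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next       # right-hand side of the Bellman update
    td_error = target - q[(state, action)]    # gap between target and current estimate
    q[(state, action)] += alpha * td_error    # move part of the way toward the target
    return td_error


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
```

Starting from an all-zero table, a transition with reward 1.0 gives a
TD error of 1.0, and with ``alpha = 0.5`` the stored estimate moves
halfway to the target, to 0.5.
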
Deep Q-networks
~~~~~~~~~~~~~~~
Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions—say 32×16 cells, each displaying one of a handful of
characters—the number of possible board configurations is astronomically
large. Storing a table of Q-values for every configuration is not
practical.
*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.
The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience—(state, action, reward,
next state)—in a *replay buffer*. Periodically, it samples a random
batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.
Experience replay
~~~~~~~~~~~~~~~~~
A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive—which would mean training on correlated,
sequential transitions—the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions, which would otherwise cause the
network's weight updates to interfere with each other.
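
The idea can be sketched with a fixed-size queue. This is an
illustration under simple assumptions (uniform sampling, experiences
stored as tuples), not ``retro-gamer``'s implementation; the class name
is hypothetical, and ``capacity`` plays the role of the
``memory_capacity`` setting.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.experiences = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.experiences.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks up runs of consecutive transitions.
        return random.sample(list(self.experiences), batch_size)

    def __len__(self):
        return len(self.experiences)
```

Adding five experiences to a buffer of capacity three leaves only the
three most recent ones available for sampling.
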
``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors—cases where the agent's prediction was most wrong—are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).
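
The proportional variant described by Schaul et al. can be sketched as
follows. The exponents ``alpha`` and ``beta`` and the small constant
``eps`` follow that paper's notation; the default values here are
illustrative, and how ``retro-gamer`` sets them is not shown in this
document.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probability of each experience, proportional to |TD error|**alpha."""
    priorities = [(abs(error) + eps) ** alpha for error in td_errors]
    total = sum(priorities)
    return [priority / total for priority in priorities]


def importance_weights(probabilities, beta=0.4):
    """Weights that offset the sampling bias, scaled so the largest weight is 1."""
    n = len(probabilities)
    weights = [(n * p) ** -beta for p in probabilities]
    largest = max(weights)
    return [w / largest for w in weights]
```

Experiences with large TD errors get a higher sampling probability and
a correspondingly smaller weight, so each one counts for less when it
is used in a gradient step.
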
The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.
Target networks
~~~~~~~~~~~~~~~
A subtle challenge in DQN training is that the Q-values computed by the
Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, its Q-value
estimates keep shifting, making the training target a moving one. This
can cause instability.
DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.
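
The periodic copy can be sketched without any deep learning library.
Here the network weights are represented as a plain dictionary, which
is a simplification for illustration; the function name is
hypothetical.

```python
import copy


def maybe_sync_target(step, online_weights, target_weights, target_update_freq):
    """Return a fresh copy of the online weights every target_update_freq steps."""
    if step % target_update_freq == 0:
        return copy.deepcopy(online_weights)
    return target_weights
```

Between syncs the target weights stay frozen even while the online
weights keep changing, which is what keeps the training targets stable.
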
Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.
``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay``, reaching
a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
training log shows how exploration decreases as training progresses.
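
A minimal sketch of ε-greedy selection and decay follows. It assumes
that ε is multiplied by ``epsilon_decay`` after each episode, which is
one common schedule; this document does not state the exact rule
``retro-gamer`` uses, and the default values are illustrative.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit


def decay_epsilon(epsilon, epsilon_decay=0.995, epsilon_min=0.05):
    """Shrink epsilon by a constant factor, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)
```

With ``epsilon = 0`` the agent always picks the action with the largest
Q-value; with ``epsilon = 1`` every action is chosen at random.
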
Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.
For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.
The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The total number
of numbers in this array—H × W × C—is the size of the board part of
the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 numbers.
The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector—indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.
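
The encoding can be sketched with nested lists. This sketch assumes the
board is available as a list of equal-length strings and that an empty
cell is a space; both are assumptions made for illustration, and the
function names are not part of ``retro-gamer``.

```python
def one_hot(char, character_set):
    """Encode one cell; characters outside the set become all zeros."""
    return [1 if char == known else 0 for known in character_set]


def encode_board(rows, character_set):
    """Encode a board (list of equal-length strings) as an H x W x C nested list."""
    return [[one_hot(char, character_set) for char in row] for row in rows]


character_set = ["@", "*", ">"]
board = ["@ >", " *#"]  # '#' is not in the character set
encoded = encode_board(board, character_set)
```

Here ``encoded[0][2]`` is ``[0, 0, 1]`` for ``'>'``, while the
unlisted ``'#'`` at ``encoded[1][2]`` comes out as ``[0, 0, 0]``,
exactly like the empty cell at ``encoded[0][1]``.
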
In addition to the board, the agent can observe numerical values from
the game's state dictionary via ``observe_state``. These are
appended to the end of the observation vector. The reward key must
not be included in ``observe_state``: it would give the agent direct
access to its own performance signal, which is not a realistic observation
in most game contexts and can cause training pathologies.
Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architecture of the Q-network—the number and arrangement of its
layers—is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
field in the game description and generates a plain-language rationale.
**Multilayer perceptrons (MLP)**
The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.
An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have roughly 410,000 trainable
parameters (weights plus biases), about 96% of them in the first layer.
Training adjusts all of these parameters simultaneously to reduce the
TD error.
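
The parameter count above can be checked by hand. The sketch below
assumes every layer has one bias per output unit in addition to its
weight matrix; the function name is illustrative.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(
        inputs * outputs + outputs  # weight matrix plus one bias per output unit
        for inputs, outputs in zip(layer_sizes, layer_sizes[1:])
    )


total = mlp_parameter_count([3072, 128, 128, 5])  # 410501
```

The first layer alone accounts for 3,072 × 128 + 128 = 393,344 of
these parameters, which is why the size of the observation matters so
much for an MLP.
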
An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings—a set of meters or status indicators—rather than as a spatial
scene. Set ``spatial = false`` in the game description to use this
architecture.
**Convolutional neural networks (CNN)**
When the game board is genuinely spatial—when the relative positions
of characters matter—a *convolutional neural network* (CNN) is a much
better fit. A CNN applies a set of learnable *filters* (small weight
matrices) across the board, computing a dot product of each filter with
every overlapping patch of the input. The result is a set of *feature
maps*: each feature map highlights where in the board a particular
pattern appears.
This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" works the same way whether the apple is at position (10,5)
or (20,12). This weight sharing (often loosely called *translational invariance*) means the network can
generalize across positions without learning a separate rule for each
one. Second, each filter needs only a small number of parameters (the
filter size)—far fewer than the equivalent fully connected connections.
``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` (the default) to use this
architecture.
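
The shapes and sizes described above can be worked out with a little
arithmetic. The sketch assumes a stride of 1, one bias per output
channel, and a board 16 cells high and 32 wide with a 6-character set;
the function names are illustrative.

```python
def conv_parameter_count(in_channels, out_channels, kernel_size):
    """Weights plus biases for one 2-D convolutional layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def conv_output_size(size, kernel_size=3, padding=1, stride=1):
    """Length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1


height, width, characters = 16, 32, 6

# Two conv layers with kernel 3 and padding 1 leave height and width unchanged.
out_h = conv_output_size(conv_output_size(height))
out_w = conv_output_size(conv_output_size(width))
flattened = 64 * out_h * out_w

conv_params = conv_parameter_count(characters, 32, 3) + conv_parameter_count(32, 64, 3)
```

Under these assumptions the two convolutional layers hold 1,760 +
18,496 = 20,256 parameters, and the flattened output passed to the MLP
head has 64 × 16 × 32 = 32,768 values.
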
Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.
- If ``spatial = true``, the CNN can detect local patterns—which characters
are adjacent to which—without needing to see every possible arrangement.
This is appropriate for games like Snake, where the snake's direction
and the apple's relative position are spatially encoded.
- If ``spatial = false``, the MLP treats the board as a flat vector. This
may be appropriate for games that use the character grid primarily as a
display rather than a spatial field—for example, a game where characters
appear in fixed, non-interacting positions as status indicators.
- The ``character_set`` determines the depth (C) of the board tensor.
More characters mean more numbers per cell and a larger input to the
network. A character set that includes characters the game never uses
wastes capacity; a character set that omits relevant characters forces
the agent to treat different things as the same.
- The ``observe_state`` fields are appended to the flattened CNN output
before the MLP head. This allows the agent to use explicit state
variables—a timer, a lives count—alongside the visual board
representation.
These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.

13
docs/conf.py Normal file

@@ -0,0 +1,13 @@
project = 'retro-gamer'
copyright = '2025, Chris Proctor'
author = 'Chris Proctor'
release = '0.1.0'
extensions = []
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
html_theme_options = {}

19
docs/contributing.rst Normal file

@@ -0,0 +1,19 @@
Contributing
============
``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ project. Chris
Proctor (chrisp@buffalo.edu), the project lead, is interested in
hearing about your experience using the package, whether in a classroom,
as a research tool, or for personal exploration.
Bug reports, feature requests, and discussion of future directions take
place on the project repository's
`issues page <https://github.com/cproctor/retro-gamer/issues>`__. Code
contributions should be submitted as pull requests. Development follows
the `Contributor Covenant <https://www.contributor-covenant.org/>`__.
If you are a teacher or curriculum designer considering using
``retro-gamer`` in a course, or a researcher interested in collaborating
on studies of its educational effectiveness, please contact Chris
directly.

69
docs/index.rst Normal file

@@ -0,0 +1,69 @@
retro-gamer: train agents to play retro games
==============================================
``retro-gamer`` is a Python package for training reinforcement learning
agents to play games implemented with the
`retro-games <https://retro-games.readthedocs.io/en/latest/>`__
framework. It is designed as a learning tool: rather than writing the
learning algorithm yourself, you describe the game to the trainer in a
structured way, adjust the training parameters, and then observe—through
a detailed log—how the trainer uses your description to build and run a
learning model.
The central idea is that the game becomes an *object to think with*
about reinforcement learning. The choices you make—which characters to
tell the trainer about, what counts as a reward, whether to treat the
board as a spatial scene or a readout—have direct, observable
consequences for how learning proceeds. Working out *why* a training run
behaves as it does is the kind of reasoning that leads to lasting
understanding of the underlying concepts.
.. _installation:
Installation
------------
Prerequisites
~~~~~~~~~~~~~
``retro-gamer`` requires Python 3.11 or higher and a game implemented
with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
The retro-games framework must also be installed; see its documentation
for instructions.

.. code-block:: console

   % pip install retro-gamer

To install from source (for development or to use the latest changes):

.. code-block:: console

   % git clone https://github.com/cproctor/retro-gamer
   % cd retro-gamer
   % pip install -e .

Verify the installation by checking the command-line tool:

.. code-block:: console

   % retro-gamer --help
   Usage: retro-gamer [OPTIONS] COMMAND [ARGS]...

     Train and run RL agents for retro games.

   Commands:
     create  Create a new training run directory with config.toml.
     info    Print a summary of a training run.
     play    Watch a trained agent play the game.
     train   Train (or resume training) a DQN agent.

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   introduction
   background
   walkthrough
   reference
   contributing

160
docs/introduction.rst Normal file

@@ -0,0 +1,160 @@
Introduction
============
``retro-gamer`` grew out of a question about how students learn
difficult ideas in computer science. Reinforcement learning—the branch
of machine learning in which an agent learns to act well by interacting
with an environment and receiving rewards—is one of the most powerful
and widely deployed ideas in modern computing. It underlies systems that
play chess and Go at superhuman levels, control industrial robots,
optimize power grids, and personalize recommendation feeds. It is also
genuinely hard to understand, not because the core ideas are especially
abstract, but because the feedback between a student's understanding and
the system's behavior is usually invisible. You adjust a hyperparameter,
run a training loop, and get a number. What happened inside, and why,
remains opaque.
The design hypothesis of ``retro-gamer`` is that this opacity is not
inevitable. If a student already knows a game well—how it works, what
the pieces mean, what counts as doing well—then training an agent on
that game gives them a concrete anchor for reasoning about what the
learning algorithm is doing and why. When the trainer decides to use a
convolutional neural network instead of a simpler model, it explains its
reasoning. When training stalls, the student can ask: did I describe the
game accurately? Is the reward sending the right signal? Would a
different exploration strategy help? These are exactly the questions that
build genuine conceptual understanding.
``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ curriculum, a
project-based high school computer science curriculum emphasizing
personally meaningful creation and deep conceptual engagement. In the
games unit, students design and implement their own games using the
``retro-games`` framework. The extension into reinforcement learning is
a natural next step: you built the game; now let's see if a machine can
learn to play it.
How retro-gamer works
---------------------
Rather than asking you to write a training algorithm yourself,
``retro-gamer`` asks you to describe the game you want to train on.
This description—written in your game project's ``pyproject.toml``—tells
the trainer things the game's code alone doesn't make obvious: which
characters matter, which piece of game state represents success, whether
the board should be understood spatially or as a flat data display.
From this description, the trainer constructs a deep Q-learning model
suited to the game. It writes out a plain-language explanation of every
architectural decision it makes, then begins training. As training
proceeds, it logs each episode's reward, loss, and exploration rate.
Trained model snapshots—checkpoints—are saved periodically, so you can
watch how the agent's skill develops over time. When you're done
training, you can load any checkpoint and watch the agent play.
A typical workflow looks like this. First, describe your game in the
``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``:

.. code-block:: toml

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]

Then create a training run, train, and watch the result:

.. code-block:: console

   % retro-gamer create --game my_game --output runs/snake/
   % retro-gamer train runs/snake/
   % retro-gamer play runs/snake/ --checkpoint ep_0500

The ``create`` command sets up the training run directory; ``train``
runs the learning algorithm; ``play`` loads a checkpoint and lets you
watch the trained agent live in the terminal.
What you will learn
-------------------
Working with ``retro-gamer`` is designed to build understanding of a
cluster of related ideas:
**Reinforcement learning** is the framework in which an agent
interacts with an environment, receiving observations and rewards, and
learns to choose actions that maximize its long-term reward. The
``retro-gamer`` training loop is a concrete instance of this framework:
the agent is the neural network, the environment is the game, the
observation is the encoded board and game state, and the reward is
the change in score from one turn to the next.
**Neural network architecture** shapes what a model can and cannot
learn. When you declare a game ``spatial``, the trainer builds a
convolutional neural network that can detect patterns in the relative
positions of game pieces. When you declare it non-spatial, it builds a
simpler network that ignores position. Seeing the consequence of this
choice in training behavior is a direct experience of why architecture
matters.
**Observation design** determines what information is available to the
agent. If you leave a character out of the ``character_set``, the agent
will not distinguish it from empty space. If you include a game-state
variable in ``observe_state``, the agent can see it directly rather than
having to infer it from the board. The consequences of these choices for
what the agent can learn are reasonably predictable—and making and
checking those predictions is exactly the kind of reasoning the tool is
designed to support.
**Reward engineering** is the craft of specifying what counts as doing
well in a way the agent can actually optimize. Using score as the reward
is natural for many games, but some games have sparse rewards (the agent
rarely earns points), and some have reward signals that are easy to
game. Experimenting with what to use as a reward—and observing how that
choice shapes training—is one of the richest paths into understanding
what reinforcement learning is actually optimizing.
**Hyperparameter tuning** is the practice of adjusting training settings
such as learning rate, exploration probability, and network size to
improve training efficiency and final performance. ``retro-gamer``
exposes these settings explicitly and explains their role in the
training log, so tuning them is connected to conceptual understanding
rather than uninformed search.
The interpretable training log
------------------------------
A key feature of ``retro-gamer`` is its training log. When training
begins, the trainer writes a complete, plain-language account of the
model it built: why it chose the architecture it did, what the
observation vector contains, what actions the agent can take, and how
the exploration and learning schedules are set up. Here is an example
from training a Snake agent:
.. code-block:: text
[INIT] === Network Architecture ===
[INIT] Board: 32×16, character set: 6 chars (one-hot per cell)
[INIT] Observed state keys: 0 | Actions (incl. no-op): 5
[INIT] spatial=True → using CNN architecture
[INIT] Rationale: the board is a 2-D spatial scene; a CNN captures
[INIT] local patterns (walls, items nearby) more efficiently than an MLP.
[INIT] CNN: Conv2d(6→32, k=3, pad=1) → ReLU → Conv2d(32→64, k=3, pad=1) → ReLU
[INIT] CNN output: 64 channels × 16×32 = 32768 features (flattened)
[INIT] MLP head input: 32768 (conv) + 0 (state) = 32768
[INIT] MLP: 32768 → 128 → 128 → 5
[INIT] Hidden layers: 2 | Layer width: 128
[INIT] Output: 5 Q-values
[INIT] Actions: ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN'] + (no-op)
...
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
[EP 0500] total_reward=9.0 steps=1203 epsilon=0.0821 avg_loss=0.000387
The episode log shows total reward (score earned), how many turns the
episode lasted, the current exploration rate (``epsilon``), and the
average prediction error (``avg_loss``). Reading this log—and
connecting changes in these numbers to what you know about the game and
the algorithm—is one of the main activities the tool is designed to
support.
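
The layer sizes reported in the ``[INIT]`` lines of the example can be reproduced with a little arithmetic. The sketch below assumes what the log implies: stride-1 convolutions with ``k=3`` and ``pad=1`` (which preserve the board's height and width), 64 output channels, and one extra output for the no-op. The function name ``cnn_head_sizes`` is hypothetical and is not part of ``retro-gamer``.

.. code-block:: python

    def cnn_head_sizes(board_width, board_height, n_state_keys, n_actions,
                       conv_channels=64, n_layers=2, layer_size=128):
        """Return the MLP head's layer sizes for a size-preserving CNN."""
        # The convolutions keep height and width unchanged, so flattening
        # yields channels * height * width features.
        conv_features = conv_channels * board_height * board_width
        head_input = conv_features + n_state_keys
        # One Q-value per listed action, plus one for the no-op.
        return [head_input] + [layer_size] * n_layers + [n_actions + 1]


    # Snake: 32x16 board, no observed state keys, four arrow-key actions.
    print(cnn_head_sizes(32, 16, n_state_keys=0, n_actions=4))

For the Snake example this gives ``32768 → 128 → 128 → 5``, matching the log. Adding a key to ``observe_state`` would increase only the first number.
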

344
docs/reference.rst Normal file

@@ -0,0 +1,344 @@
Reference
=========
Game description fields
-----------------------
Game descriptions are written in the ``[tool.retro-gamer]`` section of
your game project's ``pyproject.toml``. ``retro-gamer create`` reads
this section and copies the metadata into the training run's
``config.toml``, where it can also be inspected or hand-edited.
A complete example for the Snake game:
.. code-block:: toml
[tool.retro-gamer]
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []
You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.
The fields are described below.
``actions``
~~~~~~~~~~~
**Required.** A list of keystroke names the agent may send to the game
each turn. Use arrow key names for directional games, or single
characters for character-key games.
.. code-block:: toml
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
The agent also has access to a no-op action (doing nothing). The total
number of actions in the Q-network output is ``len(actions) + 1``.
``reward``
~~~~~~~~~~
**Required.** The key in the game's state dictionary to use as the
reward signal. The reward computed for each turn is the *change* in
this value from the previous turn.
.. code-block:: toml
reward = "score"
``character_set``
~~~~~~~~~~~~~~~~~
**Optional.** A list of single characters that may appear on the board.
Each character occupies one "slot" in the one-hot encoding. Characters
not in this list are treated as empty space.
.. code-block:: toml
character_set = ["@", "*", ">", "<", "^", "v"]
If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.
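
As an illustration of the encoding, here is a minimal sketch of how one cell might be turned into a one-hot vector. The helper ``one_hot`` is hypothetical, and the sketch assumes that empty space and unlisted characters are both encoded as all zeros. That assumption is consistent with the description above but is not taken from ``retro-gamer``'s source.

.. code-block:: python

    CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


    def one_hot(cell, character_set):
        """Encode one board cell; characters outside the set become all zeros."""
        return [1.0 if cell == ch else 0.0 for ch in character_set]


    print(one_hot("*", CHARACTER_SET))  # body segment: slot 1 is set
    print(one_hot("#", CHARACTER_SET))  # unlisted: same encoding as empty space

The second call shows why an incomplete ``character_set`` matters: a wall drawn with ``#`` would be invisible to the agent.
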
``spatial``
~~~~~~~~~~~
**Optional.** Default: ``true``. Whether to treat the board as a 2D
spatial scene. When ``true``, the trainer uses a convolutional neural
network (CNN) that can detect patterns in the relative positions of
characters. When ``false``, the trainer uses a multilayer perceptron
(MLP) that sees the board as a flat list of numbers without positional
structure.
.. code-block:: toml
spatial = true
``observe_state``
~~~~~~~~~~~~~~~~~
**Optional.** Default: ``[]``. A list of keys from the game's state
dictionary to append to the observation vector. The values must be
numbers (integers, floats, or booleans). The reward key must not
appear in this list.
.. code-block:: toml
observe_state = ["lives", "level"]
.. _hyperparameters:
Hyperparameters
---------------
Hyperparameters are stored in the ``[hyperparameters]`` section of
``config.toml``. They can be set via ``retro-gamer create`` options or
edited directly.
Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~
``learning_rate`` (default: ``0.001``)
The step size used by the Adam optimizer when updating network
weights. Larger values converge faster but may be unstable; smaller
values are more stable but slower.
``lr_decay`` (default: ``0.995``)
Multiplicative decay applied to the learning rate after each
episode. The learning rate decreases geometrically over training, so
early updates are large and later updates fine-tune the network
without undoing earlier progress.
``gamma`` (default: ``0.99``)
The discount factor for future rewards. A value of 1.0 makes the
agent value all future rewards equally; smaller values make the
agent increasingly myopic.
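
The effect of ``gamma`` can be seen by computing a discounted return by hand. The following is a minimal sketch of the standard definition, in which a reward received ``t`` turns in the future is weighted by ``gamma ** t``. The function name ``discounted_return`` is hypothetical and is not part of ``retro-gamer``.

.. code-block:: python

    def discounted_return(rewards, gamma):
        """Sum rewards, weighting the reward t turns ahead by gamma ** t."""
        total = 0.0
        # Work backwards so each step applies one more factor of gamma.
        for reward in reversed(rewards):
            total = reward + gamma * total
        return total


    # A single apple, eaten ten turns from now.
    rewards = [0.0] * 10 + [1.0]
    for gamma in (1.0, 0.99, 0.5):
        print(gamma, discounted_return(rewards, gamma))

With ``gamma = 1.0`` the distant apple is worth as much as an immediate one. With ``gamma = 0.5`` it is worth ``0.5 ** 10``, less than a thousandth of a point, so the agent has little reason to plan for it.
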
Exploration
~~~~~~~~~~~
``epsilon`` (default: ``1.0``)
The initial exploration rate. At each turn, the agent takes a
random action with probability ``epsilon`` and exploits its current
Q-function with probability ``1 - epsilon``.
``epsilon_decay`` (default: ``0.995``)
Multiplicative decay applied to ``epsilon`` after each episode.
``epsilon_min`` (default: ``0.05``)
The floor below which ``epsilon`` will not fall. A small amount of
continued exploration prevents the agent from becoming permanently
committed to a suboptimal policy.
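
The three exploration settings work together as an epsilon-greedy schedule. The sketch below shows one way this could look. ``choose_action`` and ``decay_epsilon`` are hypothetical helper names, not ``retro-gamer`` functions, and the default values are the ones documented above.

.. code-block:: python

    import math
    import random


    def choose_action(q_values, epsilon, rng):
        """Epsilon-greedy: random index with probability epsilon, else argmax."""
        if rng.random() < epsilon:
            return rng.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda i: q_values[i])


    def decay_epsilon(epsilon, epsilon_decay=0.995, epsilon_min=0.05):
        """Apply one episode of multiplicative decay, never below the floor."""
        return max(epsilon_min, epsilon * epsilon_decay)


    # How many episodes until the default schedule reaches its floor?
    episodes_to_floor = math.ceil(math.log(0.05) / math.log(0.995))
    print(episodes_to_floor)

    epsilon = 1.0
    for _ in range(episodes_to_floor):
        epsilon = decay_epsilon(epsilon)
    print(epsilon)

Under these assumptions, the default schedule reaches its floor before the default 1000 training episodes end, so later episodes still take a random action on about 1 turn in 20.
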
Memory and sampling
~~~~~~~~~~~~~~~~~~~
``batch_size`` (default: ``64``)
The number of experiences sampled from the replay buffer per
training step.
``memory_capacity`` (default: ``10000``)
The maximum number of experiences the replay buffer can hold. When
full, the oldest experiences are discarded.
``prioritize_experiences`` (default: ``false``)
Whether to use prioritized experience replay. When ``true``,
experiences with larger TD errors are sampled more frequently.
This often improves sample efficiency at a modest computational
cost.
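
The replay settings above can be illustrated with a small buffer. This is a sketch under stated assumptions rather than ``retro-gamer``'s implementation: the class ``ReplayBuffer`` is hypothetical, and the priority exponent ``alpha`` and the small constant ``1e-3`` are illustrative choices. A full prioritized replay would also correct for the sampling bias with importance weights, which are omitted here.

.. code-block:: python

    import random
    from collections import deque


    class ReplayBuffer:
        """Fixed-capacity store of experiences with their latest TD errors."""

        def __init__(self, capacity=10000):
            # A deque with maxlen discards its oldest entry once it is full.
            self.memory = deque(maxlen=capacity)

        def push(self, experience, td_error=1.0):
            self.memory.append((experience, abs(td_error)))

        def sample(self, batch_size, rng, prioritize=False, alpha=0.6):
            experiences = [exp for exp, _ in self.memory]
            if not prioritize:
                # Uniform sampling without replacement.
                return rng.sample(experiences, batch_size)
            # Sample with replacement, in proportion to |TD error| ** alpha.
            weights = [(err + 1e-3) ** alpha for _, err in self.memory]
            return rng.choices(experiences, weights=weights, k=batch_size)


    buffer = ReplayBuffer(capacity=3)
    for turn, td_error in enumerate([0.5, 0.5, 0.01, 0.01, 2.0]):
        buffer.push(f"turn-{turn}", td_error)

    print([exp for exp, _ in buffer.memory])  # turn-0 and turn-1 were discarded
    batch = buffer.sample(1000, random.Random(0), prioritize=True)
    print(batch.count("turn-4") / len(batch))  # share drawn from the high-error turn

With prioritization on, the one experience with a large TD error makes up most of the batch; with it off, each stored experience is equally likely.
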
Network architecture
~~~~~~~~~~~~~~~~~~~~
``n_layers`` (default: ``2``)
The number of hidden layers in the MLP head (for spatial games,
this follows the CNN; for non-spatial games, it is the full
network).
``layer_size`` (default: ``128``)
The width (number of units) in each hidden layer.
Training duration
~~~~~~~~~~~~~~~~~
``training_episodes`` (default: ``1000``)
The total number of game episodes to run. Each episode runs until
the game ends or ``max_turns_per_episode`` turns have elapsed.
``max_turns_per_episode`` (default: ``2000``)
A safety cutoff preventing a single episode from running
indefinitely (for example, if the agent finds a way to avoid
dying).
``target_update_freq`` (default: ``100``)
How many training steps between updates of the target network.
More frequent updates make training targets move faster (less
stable); less frequent updates make them more stable but slower
to reflect new learning.
Character discovery
~~~~~~~~~~~~~~~~~~~
``exploration_turns`` (default: ``200``)
When ``character_set`` is not specified, the number of random
turns to run at the start of training to discover which
characters appear on the board.
``unknown_character_strategy`` (default: ``"ignore"``)
What to do when a character appears during training that is not
in the established ``character_set``. ``"ignore"`` treats it as
an empty cell; ``"extend"`` rebuilds the model with an extended
character set.
CLI reference
-------------
``retro-gamer create``
~~~~~~~~~~~~~~~~~~~~~~
Create a new training run directory with ``config.toml``. Game metadata
is read automatically from the ``[tool.retro-gamer]`` section of your
game's ``pyproject.toml``; you do not pass it on the command line.
.. code-block:: console
% retro-gamer create --game MODULE --output DIR [OPTIONS]
**Required options:**
- ``--game MODULE`` — Python module containing ``create_game()``
(e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
is read from the ``pyproject.toml`` found in or above the module's
source directory.
- ``--output DIR`` — Directory to create for this training run.
**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
- ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
- ``--learning-rate F``
- ``--lr-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
- ``--batch-size N``
- ``--memory-capacity N``
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~
Train (or resume training) a DQN agent.
.. code-block:: console
% retro-gamer train RUN_DIR [--resume CHECKPOINT]
``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
create``. If ``--resume`` is given, training resumes from the specified
checkpoint file (relative or absolute path).
``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~
Watch a trained agent play the game in the terminal.
.. code-block:: console
% retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.
``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~~
Print a summary of a training run: metadata, hyperparameters, recent
episode log, and available checkpoints.
.. code-block:: console
% retro-gamer info RUN_DIR
Training run directory structure
---------------------------------
A training run is a self-contained directory with the following
contents:
.. code-block:: text
runs/snake/
├── config.toml           # game description + hyperparameters
├── training.log          # architecture rationale + per-episode log
└── checkpoints/
    ├── ep_0100.pt        # model weights at episode 100
    ├── ep_0200.pt
    ├── ...
    └── final.pt          # model weights at training completion
``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set and resolved hyperparameters) when
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
and ``train`` is the recommended way to adjust hyperparameters.
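
Because ``config.toml`` is ordinary TOML, it can also be inspected from Python. The sketch below uses the standard-library ``tomllib`` module (Python 3.11 or higher) on a hypothetical excerpt. Only the ``[hyperparameters]`` section name is taken from this documentation; the rest of the file's layout is not assumed.

.. code-block:: python

    import tomllib

    # A hypothetical excerpt; a real config.toml also holds the game description.
    CONFIG_TEXT = """
    [hyperparameters]
    learning_rate = 0.001
    epsilon_decay = 0.995
    training_episodes = 1000
    """

    config = tomllib.loads(CONFIG_TEXT)
    hyperparameters = config["hyperparameters"]
    print(hyperparameters["learning_rate"], hyperparameters["training_episodes"])

To read a real run, open the file in binary mode and pass it to ``tomllib.load`` instead.
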
``training.log`` begins with the full architecture description
generated at training startup, followed by one line per episode in the
format::
[EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
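
Since the episode lines follow a fixed format, they are easy to parse for plotting or comparison. The following is a minimal sketch using a regular expression. ``EPISODE_LINE`` and ``parse_episode_line`` are hypothetical names, not part of ``retro-gamer``, and the pattern assumes the fields are separated by whitespace as in the examples shown in this documentation.

.. code-block:: python

    import re

    EPISODE_LINE = re.compile(
        r"\[EP (?P<episode>\d+)\]\s+total_reward=(?P<total_reward>\S+)\s+"
        r"steps=(?P<steps>\d+)\s+epsilon=(?P<epsilon>\S+)\s+"
        r"avg_loss=(?P<avg_loss>\S+)"
    )


    def parse_episode_line(line):
        """Return one episode line's fields as a dict, or None if no match."""
        match = EPISODE_LINE.search(line)
        if match is None:
            return None  # for example, an [INIT] line
        fields = match.groupdict()
        return {
            "episode": int(fields["episode"]),
            "total_reward": float(fields["total_reward"]),
            "steps": int(fields["steps"]),
            "epsilon": float(fields["epsilon"]),
            "avg_loss": float(fields["avg_loss"]),
        }


    line = "[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204"
    record = parse_episode_line(line)
    print(record)
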
Checkpoint files are PyTorch state dictionaries containing model
weights, optimizer state, the current epsilon, and the total number of
training steps completed. They can be loaded with
``retro-gamer play`` or directly with the Python API.
Python API
----------
For advanced use, ``retro-gamer``'s components are importable as a
library.
.. code-block:: python
from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
from retro.examples.snake import create_game
# Read metadata from [tool.retro-gamer] in the game's pyproject.toml
metadata = GameMetadata.from_pyproject("retro.examples.snake")
trainer = DQNTrainer(
    create_game, metadata, "runs/snake/",
    training_episodes=500,
    n_layers=2,
    layer_size=128,
)
trainer.train()
``GameEnvironment`` provides a gym-style interface for stepping through
a game programmatically:
.. code-block:: python
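from retro_gamer import GameEnvironment, GameMetadata
from retro.examples.snake import create_game
metadata = GameMetadata.from_pyproject("retro.examples.snake")
env = GameEnvironment(create_game, metadata)
obs = env.reset()  # returns initial observation vector
obs, reward, done = env.step("KEY_RIGHT")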
The observation is a flat NumPy array of dtype ``float32``. For spatial
games, the first ``C × H × W`` elements are the board (channel-first
one-hot encoding); for non-spatial games, the board is encoded
``H × W × C`` and then flattened. Any ``observe_state`` values are
appended at the end.
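
To make the two layouts concrete, the following sketch computes where a single one-hot entry would land in the flat vector. It assumes row-major flattening with rows indexed by ``y`` and columns by ``x``. The function names are hypothetical, and ``retro-gamer``'s actual indexing may differ.

.. code-block:: python

    def channel_first_index(c, y, x, height, width):
        """Flat index of (channel c, row y, column x) in a C x H x W layout."""
        return (c * height + y) * width + x


    def channel_last_index(c, y, x, width, n_chars):
        """Flat index of the same entry in a flattened H x W x C layout."""
        return (y * width + x) * n_chars + c


    N_CHARS, HEIGHT, WIDTH = 6, 16, 32  # Snake: 6 characters on a 32x16 board

    # Character 1 ("*") at row 2, column 3.
    print(channel_first_index(1, 2, 3, HEIGHT, WIDTH))
    print(channel_last_index(1, 2, 3, WIDTH, N_CHARS))
    # Any observe_state values would start after the board's entries.
    print(N_CHARS * HEIGHT * WIDTH)

In the channel-first layout, all entries for one character are contiguous, which is the arrangement a convolutional layer expects. In the channel-last layout, all entries for one cell are contiguous.
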

299
docs/walkthrough.rst Normal file

@@ -0,0 +1,299 @@
Walkthrough
===========
This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.
Prerequisites
-------------
You will need:
- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
(or the built-in Snake example). See the
`retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).
Preparing your game
-------------------
``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.
Here is the ``create_game`` function for Snake:
.. code-block:: python
def create_game():
    head = SnakeHead()
    apple = Apple()
    game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
    apple.relocate(game)
    return game
If your game module does not already have a ``create_game`` function,
add one following this pattern.
Describing your game
--------------------
Every training run begins with a description of your game. This
description belongs in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``—the same file that defines the project's
name, version, and dependencies. Placing it there keeps the description
with the game itself, where it belongs.
Here is the ``[tool.retro-gamer]`` section for the Snake example:
.. code-block:: toml
[tool.retro-gamer]
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []
Let's go through each field.
``actions``
~~~~~~~~~~~
A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).
.. note::
Only include actions that the game actually responds to. Listing
unreachable keys wastes part of the agent's action space and may slow
training.
``reward``
~~~~~~~~~~
The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, the score increases
each time the apple is eaten, so the agent receives a positive reward
(typically 1) on that turn and 0 on every other turn.
Choosing an appropriate reward is one of the most consequential
decisions in RL. Some considerations:
- A reward that is too sparse—where the agent goes many turns without
receiving any signal—makes learning slow. A snake that dies without
ever eating an apple receives no positive reward at all in the first
episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense—assigned every turn—may not reflect the
true goal of the game.
- An artificial reward, such as giving a point for moving toward the
apple, can accelerate early training but may cause the agent to
optimize the proxy rather than the real objective.
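
The last consideration can be made concrete. The sketch below is hypothetical: ``shaped_reward`` is not part of ``retro-gamer``, and it assumes the game can report the head and apple positions as ``(x, y)`` tuples. It adds a small bonus whenever the snake moves closer to the apple.

.. code-block:: python

    def manhattan(a, b):
        """Grid distance between two (x, y) positions."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])


    def shaped_reward(score_delta, old_head, new_head, apple, bonus=0.1):
        """Score change plus a small bonus for moving closer to the apple."""
        moved_closer = manhattan(new_head, apple) < manhattan(old_head, apple)
        return score_delta + (bonus if moved_closer else 0.0)


    apple = (10, 5)
    path = [(5, 5), (4, 5), (5, 5)]  # step away, then step back
    total = sum(
        shaped_reward(0, old, new, apple) for old, new in zip(path, path[1:])
    )
    print(total)

Here the snake ends where it started and never eats the apple, yet it collects a net bonus. An agent can learn to repeat such a loop, which is what optimizing the proxy rather than the real objective looks like.
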
``character_set``
~~~~~~~~~~~~~~~~~
The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.
For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
If you omit this field, ``retro-gamer`` will run a brief exploration
phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.
``spatial``
~~~~~~~~~~~
Whether to treat the board as a spatial scene (default: ``true``). A
spatial game uses a *convolutional neural network* (CNN) that can
detect patterns in the relative arrangement of characters. A
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
ignores positional relationships. Set to ``false`` for games where
position is irrelevant.
Once you have written this section, create the training run directory:
.. code-block:: console
% retro-gamer create \
--game retro.examples.snake \
--output runs/snake/
Created training run at runs/snake/config.toml
game : retro.examples.snake
board_size : 32×16
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
reward : score
characters : ['@', '*', '>', '<', '^', 'v']
architecture: CNN (spatial)
``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
``runs/snake/config.toml``.
Training the agent
------------------
With the ``config.toml`` in place, start training:
.. code-block:: console
% retro-gamer train runs/snake/
Training for 1000 episodes…
Done. Checkpoints in runs/snake/checkpoints/
Training saves checkpoints every 100 episodes and a ``final.pt``
checkpoint when complete. You can follow progress in the training log:
.. code-block:: console
% tail -f runs/snake/training.log
The log shows one line per episode:
.. code-block:: text
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
- **total_reward**: the total score earned during the episode (how many
apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
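- **avg_loss**: the average temporal-difference error across training
steps in this episode. A decreasing loss generally indicates that the
Q-value estimates are becoming more consistent, although the loss can
rise again when the agent starts reaching unfamiliar situations.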
Resuming training
~~~~~~~~~~~~~~~~~
Training can be resumed from a checkpoint:
.. code-block:: console
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
Watching a trained agent play
------------------------------
To watch a trained agent play the game in your terminal:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint final
You can substitute any checkpoint name:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint ep_0100
Press Enter or Escape to quit.
Comparing agents trained at different checkpoints is a useful activity:
the agent at episode 100 has learned *something*, but typically much
less than the agent at episode 500. Articulating *what* the earlier
agent has and has not learned, and *why*, is productive reasoning about
the training process.
Inspecting a run
----------------
To review the configuration and recent training progress for a run:
.. code-block:: console
% retro-gamer info runs/snake/
Game module : retro.examples.snake
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
Last 5 episodes:
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
Adjusting hyperparameters
--------------------------
The training hyperparameters can be changed by editing ``config.toml``
before training, or by passing them as options to ``retro-gamer
create``. Common adjustments and their effects:
``training_episodes``: How long to train. More episodes give the
agent more time to learn, but also take longer to run.
``epsilon_decay``: How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
Q-estimates before they are fully reliable. A slower decay (larger
``epsilon_decay``, closer to 1) gives the agent more time to explore
but may waste training time on random actions.
``learning_rate``: How large the weight updates are at each
training step. A large learning rate learns fast but may overshoot;
a small learning rate is stable but slow.
``gamma``: The discount factor for future rewards. Closer to 1
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.
``n_layers`` and ``layer_size``: The depth and width of the MLP
head. Larger networks can represent more complex Q-functions but are
slower to train and may overfit.
``prioritize_experiences``: Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.
Questions for investigation
----------------------------
The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.
1. **Character set completeness.** Train two agents: one with the full
character set, one missing a character that frequently appears on the
board. Compare their performance. What did the second agent lose the
ability to perceive, and how did that affect its behavior?
2. **Spatial vs. non-spatial.** Train the same game with ``spatial =
true`` and ``spatial = false``. How does training efficiency differ?
Can you explain the difference in terms of what each architecture
can and cannot learn?
3. **Reward shaping.** If the game currently rewards only the final
objective (e.g., reaching a goal), add intermediate rewards for
sub-goals. How does this change the early training curve? Does it
change the agent's final strategy?
4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
(so the agent commits to exploiting early) and a very slow one (so
exploration continues for a long time). How do the training curves
differ? What is the agent doing in each case when ``epsilon`` is low?
5. **Checkpoint comparison.** Load the agent at episode 100 and at
episode 1000 and watch each play the same game. What has the later
agent learned that the earlier one has not? How would you describe
this difference to someone who does not know about neural networks?