Initial commit
12
docs/Makefile
Normal file
@@ -0,0 +1,12 @@
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = .
BUILDDIR      = _build

help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
442
docs/background.rst
Normal file
@@ -0,0 +1,442 @@
Background
==========

Pedagogical framework
---------------------

Making With Code and the games unit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``retro-gamer`` is developed for use in
`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
computer science curriculum designed around the constructionist
principle that students learn most durably by building things they care
about. In MWC's games unit, students design and implement their own
games using the ``retro-games`` framework: a Python library for
building terminal-based, character-grid games in the style of early
arcade software. Students start from concept, work through design,
implement agents and game logic in Python, and end with a complete,
playable game.

The games unit gives students deep familiarity with one particular
game and its code. They know which characters appear on the board,
what the state dictionary contains, how reward accumulates, and what
strategies tend to work. This knowledge is ordinarily tacit—embedded
in how they play—but it is exactly the kind of knowledge that
``retro-gamer`` asks students to make explicit. The act of writing a
``[tool.retro-gamer]`` section that accurately describes your game to a
learning algorithm is a form of structured reflection: you have to
articulate, in precise terms, what you know.

Objects to think with
~~~~~~~~~~~~~~~~~~~~~

The educational psychologist and mathematician Seymour Papert
introduced the concept of *objects to think with*: concrete artifacts
that serve as anchors for otherwise abstract ideas (Papert 1980). A
gear, for Papert, was an object to think with about mathematics. The
turtle in Logo was an object to think with about procedural thinking.
In each case, the learner's embodied, intuitive knowledge of the
object—how gears mesh, how the turtle moves—provides traction on
abstract relationships that might otherwise remain inaccessible.

A game that a student has built and played is a particularly rich
object to think with. The student knows the game's behavior
intimately: they have watched characters interact, experienced the
score signal as meaningful, and developed intuitions about what makes
a good move. These intuitions are not merely useful—they are
*translatable* into the language of reinforcement learning. The reward
signal the student experiences as a player is the same signal the
trainer uses to evaluate actions. The patterns the student recognizes
as meaningful on the board are precisely the patterns a convolutional
neural network is designed to detect. The exploration-exploitation
tradeoff the trainer navigates—trying new things versus sticking with
what has worked—is analogous to the choices a student makes when
learning a new game.

``retro-gamer`` is designed to make these translations visible. When
the student reads the training log and sees that the trainer chose a
CNN because the game is spatial, they can connect that decision to
their own knowledge of how the board works. When they see the reward
increasing episode by episode, they can reason about *why*—what the
agent is learning to do—rather than watching an opaque number change.

Metadata as structured reflection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A student who has built a game knows things about it that its code does
not make explicit. They know which characters matter—which ones indicate
danger, opportunity, or neutral terrain. They know what game state
changes signal success. They know whether the arrangement of pieces on
the board is meaningful or incidental. This knowledge is usually tacit:
embedded in how they play, not in anything they have written down.

``retro-gamer`` asks students to make this tacit knowledge explicit by
writing a ``[tool.retro-gamer]`` section in their game's
``pyproject.toml``. The choice of location is deliberate: placing game
metadata in the game's own project file frames it as *a property of the
game*, not as a configuration setting for the training tool. The student
is not giving hints to the trainer; they are accurately describing what
they built.

This framing matters for how students reason about the relationship
between description and performance. A student who omits a character
from the character set and then notices degraded training performance is
not observing a failure of their trainer configuration—they are
observing the consequence of having described the game inaccurately.
The fix is not to adjust a hyperparameter; it is to write a more
accurate description. The question "is my description of the game
correct?" is precisely the kind of structured reflection that produces
conceptual understanding, because it requires the student to connect
what they know about the game to the representations the learning
algorithm uses.

Knowledge building and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Making a game does not, by itself, guarantee conceptual understanding
of reinforcement learning. Students may engage deeply with the
implementation details of their game while remaining unable to
articulate the big ideas that ``retro-gamer`` is meant to make
salient. Research in the knowledge-building tradition (Scardamalia and
Bereiter 2006) suggests that conceptual understanding deepens
substantially when students discuss their ideas with others—explaining,
questioning, and revising their understanding in dialogue.

``retro-gamer`` is designed to generate the kind of specific,
grounded questions that productive discussion requires. "What happens
if I leave a character out of the character set?" is not an abstract
question; it is a question about a specific game the student knows
well, and it has a specific, reasoned answer. "Why does training
improve faster with prioritized experience replay?" connects a
hyperparameter setting to a mechanism. These are better starting
points for discussion than the generic questions that arise from
reading about reinforcement learning without a concrete artifact to
refer to.

Research design
~~~~~~~~~~~~~~~

The pedagogical hypothesis underlying ``retro-gamer`` is being
evaluated in a research study conducted in the context of MWC's games
unit. The study investigates how two interventions—using
``retro-gamer`` to train an agent, and discussing reinforcement
learning with a large language model—interact to support conceptual
understanding of reinforcement learning.

The key outcome is measured by a set of scenario-based conceptual
questions. Representative examples include:

- *Imagine you were training an agent to play a game with a specified
  character set. If you forgot to include one of the characters which
  is used in the game, how would it affect the trained agent's
  performance? Explain your reasoning.*
- *Imagine you are training an agent to play a game which has a
  specified character set. You realize that only half of the specified
  characters are actually used in the game. If you change the
  character set to include only the characters that actually appear,
  how would the training process change? Explain your reasoning.*
- *Imagine you are creating a game where the goal is to win, and
  partial success has no value—for example, a game where the goal is
  to escape a maze. What would be the effect on agent training of
  adding artificial rewards for completing sub-goals such as reaching
  a milestone halfway to the exit? Explain your reasoning.*

Each question is evaluated using a rubric that rewards conceptual
understanding, even where specific misconceptions remain.

Participants all receive a traditional classroom lesson on
reinforcement learning before the study begins, ensuring that the same
conceptual vocabulary is available to everyone. They then complete a
pretest of the conceptual questions. Participants are randomly assigned
to one of four conditions in a 2×2 design: the first factor is whether
they use ``retro-gamer`` to train an agent on their game; the second
is whether they discuss reinforcement learning with a large language
model. One week later, participants complete the posttest. We
hypothesize that the combination of ``retro-gamer`` and LLM discussion
will produce the largest gains, mediated by more specific and more
numerous questions to the LLM—a sign that students are reasoning more
deeply about the underlying concepts.

Technical background
--------------------

This section provides a conceptual introduction to the ideas underlying
``retro-gamer``. It is intended to be accessible to students who have
not studied machine learning before, while also connecting each concept
to the specific choices you make when using the tool.

Reinforcement learning
~~~~~~~~~~~~~~~~~~~~~~

*Reinforcement learning* (RL) is a framework for training an *agent*
to make good decisions by interacting with an *environment*.

At every moment, the environment is in some *state*, and the agent
observes something about that state. The agent chooses an *action*,
the environment transitions to a new state in response, and the agent
receives a *reward* signal—a number that indicates how well it is
doing. The agent's goal is to learn a *policy*: a rule for choosing
actions that maximizes the total reward it accumulates over time. In
``retro-gamer``, the game is the environment, the character grid and
state dictionary are what the agent observes, pressing a key is an
action, and the change in score is the reward.

A distinctive feature of reinforcement learning—distinguishing it from
supervised learning, where a model is trained on labeled examples—is
that the agent must discover what good behavior looks like through
experience. There is no teacher providing correct answers. The reward
signal is all the agent has to go on. This makes reinforcement
learning both powerful (it can find solutions no human designer would
think to specify) and tricky (poorly chosen reward signals can produce
strange or unintended behavior).

The total reward the agent receives from a given state onward—if it
acts according to its current policy—is called the *return*. Because
rewards in the far future are harder to predict and plan for, RL
algorithms typically *discount* future rewards: a reward received
``t`` turns from now is worth only ``γ^t`` times its face value, where
``γ`` (gamma) is a number slightly less than 1. The ``gamma``
hyperparameter in ``retro-gamer`` controls this discount. A value
close to 1 means the agent values the distant future almost as much
as the immediate present; a smaller value makes the agent more
myopic.

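The effect of ``gamma`` can be checked with a few lines of Python. This
is a minimal sketch, not code from ``retro-gamer``: the function name
``discounted_return`` and the example reward sequence are assumptions
chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward t turns ahead by gamma ** t."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: 10 points arrive three turns from now.
rewards = [0, 0, 0, 10]
far_sighted = discounted_return(rewards, gamma=0.99)
myopic = discounted_return(rewards, gamma=0.5)
print(far_sighted, myopic)
```

With ``gamma = 0.99`` the delayed 10 points are still worth about 9.7;
with ``gamma = 0.5`` they are worth only 1.25, so a myopic agent has
little incentive to pursue them.
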
Q-learning
~~~~~~~~~~~

A natural way to formalize the agent's goal is to define the *Q-function*
(or *Q-value*): Q(s, a) is the expected total discounted reward the
agent will receive if it is in state ``s``, takes action ``a``, and
then follows its current policy from that point on. If the agent knew
the true Q-function of the best possible policy, it could act optimally
simply by choosing the action with the highest Q-value in each state.

Q-learning is an algorithm for learning the Q-function by experience.
Starting from an arbitrary initial estimate, the agent uses the
*Bellman equation* to update its Q-estimates after each transition.
The key insight is that the Q-value of taking action ``a`` in state
``s`` is related to the immediate reward and the best Q-value
achievable from the next state:

.. math::

   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')

Here ``r`` is the reward received for the transition and ``s'`` is the
state that follows. After each turn, the agent computes the *temporal
difference* (TD) error (the gap between its current Q-estimate and what
the Bellman equation says it should be) and adjusts its estimates to
reduce the error. Over many iterations, the Q-estimates converge toward
their true values.

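The update can be sketched for the tabular case as follows. This is an
illustration under stated assumptions rather than the ``retro-gamer``
implementation: the step size ``alpha``, the state labels ``"s0"`` and
``"s1"``, and the function name ``q_update`` are all hypothetical, and
the sketch ignores terminal states for brevity.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             gamma=0.9, alpha=0.5):
    """Apply one tabular Q-learning update and return the TD error."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next        # Bellman target
    td_error = target - q[(state, action)]     # gap between target and estimate
    q[(state, action)] += alpha * td_error     # move part of the way to the target
    return td_error


actions = ["KEY_LEFT", "KEY_RIGHT"]
q = defaultdict(float)                         # every estimate starts at 0
first_error = q_update(q, "s0", "KEY_RIGHT", 1.0, "s1", actions)
second_error = q_update(q, "s0", "KEY_RIGHT", 1.0, "s1", actions)
print(first_error, second_error, q[("s0", "KEY_RIGHT")])
```

Repeating the same transition halves the TD error each time, because
``alpha = 0.5`` moves the estimate halfway toward the target on every
update.
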
Deep Q-networks
~~~~~~~~~~~~~~~

Classical Q-learning stores the Q-function in a table: one entry for
every possible (state, action) pair. This is feasible only when the
number of possible states is small. For a game board with even modest
dimensions—say 32×16 cells, each displaying one of a handful of
characters—the number of possible board configurations is astronomically
large. Storing a table of Q-values for every configuration is not
practical.

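A quick calculation shows how large "astronomically large" is. The
sketch below assumes, purely for illustration, that each cell shows one
of 6 characters or is empty (7 possibilities); the real count for any
particular game depends on its rules.

```python
cells = 32 * 16              # board size used as the example in the text
symbols_per_cell = 7         # assumption: 6 characters plus an empty cell
configurations = symbols_per_cell ** cells
digits = len(str(configurations))
print(f"{cells} cells -> a count with {digits} decimal digits")
```

Written out in decimal, the count has more than 400 digits, which is
why a table with one row per board configuration is out of reach.
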
*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
problem by approximating the Q-function with a neural network. Instead
of a table, the network takes the current state as input and outputs
Q-value estimates for all possible actions simultaneously. The network
*generalizes*: having learned that moving right is a good idea when
the apple is to the right and nothing is in the way, it applies that
knowledge to board configurations it has never seen before.

The training process in ``retro-gamer`` follows the DQN algorithm. At
each turn, the agent uses its current network to estimate Q-values and
selects an action. It stores the experience—(state, action, reward,
next state)—in a *replay buffer*. Periodically, it samples a random
batch of experiences from the buffer and uses them to compute TD
errors, then adjusts the network weights to reduce those errors. This
process continues for many episodes.

Experience replay
~~~~~~~~~~~~~~~~~

A key ingredient of DQN is *experience replay*. Rather than training
on experiences as they arrive—which would mean training on correlated,
sequential transitions—the agent stores experiences in a buffer and
samples them randomly for training. This has two benefits. First, each
experience is potentially used many times for training, making data
use more efficient. Second, random sampling breaks the correlations
between consecutive transitions, which would otherwise cause the
network's weight updates to interfere with each other.

``retro-gamer`` offers a standard replay buffer and an optional
*prioritized* replay buffer (PER). In PER, experiences with larger TD
errors—cases where the agent's prediction was most wrong—are sampled
more often. The intuition is that surprising transitions are more
informative. Prioritized replay often improves training efficiency but
introduces a bias that must be corrected with *importance sampling
weights* (Schaul et al. 2015).

The ``memory_capacity`` hyperparameter sets how many experiences the
buffer can hold. When the buffer is full, old experiences are
discarded. A larger buffer provides more diverse training data but
uses more memory.

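A standard (uniform) replay buffer can be sketched in a few lines. The
class below is a hypothetical illustration, not the buffer that
``retro-gamer`` ships: the name ``ReplayBuffer`` and its methods are
assumptions, and prioritized replay is deliberately left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item when a new one
        # arrives and the buffer is already full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of
        # consecutive, correlated transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)      # plays the role of memory_capacity
for turn in range(5):
    buffer.push(turn, "KEY_UP", 0.0, turn + 1)
batch = buffer.sample(2)
print(len(buffer), batch)
```

After five pushes into a buffer of capacity 3, only the three most
recent transitions remain, which is the discarding behavior described
above.
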
Target networks
~~~~~~~~~~~~~~~

A subtle challenge in DQN training is that the Q-values computed by the
Bellman equation depend on the network's own estimates of the next
state's Q-values. If the network is updated constantly, its Q-value
estimates keep shifting, making the training target a moving one. This
can cause instability.

DQN addresses this with a *target network*: a copy of the main network
that is updated only every ``target_update_freq`` steps. The Bellman
target is computed using the target network, while the main network is
updated by gradient descent. Because the target network changes slowly,
training targets remain stable long enough for the main network to
make progress.

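The synchronization schedule can be sketched without any neural network
library. In the toy example below a dictionary stands in for the network
weights and a fixed increment stands in for a gradient step; both are
assumptions for illustration, as is the value chosen for
``target_update_freq``.

```python
import copy

target_update_freq = 4             # hypothetical value; real defaults may differ
main_weights = {"w": 0.0}          # stand-in for the main network's parameters
target_weights = copy.deepcopy(main_weights)

sync_steps = []
for step in range(1, 11):
    main_weights["w"] += 0.1       # stand-in for one gradient-descent update
    if step % target_update_freq == 0:
        # Hard update: overwrite the target network with the main network.
        target_weights = copy.deepcopy(main_weights)
        sync_steps.append(step)

print(sync_steps, main_weights, target_weights)
```

The main weights change on every step, while the target weights change
only at steps 4 and 8, so the Bellman targets stay fixed in between.
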
Exploration vs. exploitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A reinforcement learning agent faces a fundamental dilemma: should it
*exploit* what it already knows (taking the action with the highest
estimated Q-value) or *explore* (trying actions it is less certain
about, in case they lead to better outcomes it has not yet discovered)?
Exploiting too much early in training means the agent never discovers
better strategies; exploring too much later means the agent wastes time
on random behavior when it already knows what to do.

``retro-gamer`` uses *ε-greedy exploration*: with probability ε
(epsilon), the agent chooses a random action; with probability 1 − ε,
it exploits its current Q-function. ε starts at 1 (pure exploration)
and decays over training according to ``epsilon_decay``, reaching
a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
training log shows how exploration decreases as training progresses.

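The selection rule and a decay schedule can be sketched as follows. The
text above does not say how ``epsilon_decay`` is applied, so the sketch
assumes a multiplicative decay once per episode; the function names and
the numeric values are likewise hypothetical.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, epsilon_decay, epsilon_min):
    # Assumption: multiply by epsilon_decay, but never go below the floor.
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0                      # start with pure exploration
schedule = []
for episode in range(5):
    schedule.append(epsilon)
    epsilon = decay(epsilon, epsilon_decay=0.5, epsilon_min=0.2)

greedy_choice = choose_action([0.1, 0.9, 0.3], epsilon=0.0)
print(schedule, greedy_choice)
```

With these illustrative settings ε falls from 1.0 to the floor of 0.2
within a few episodes; with ε set to 0 the agent always picks the
action whose Q-value is largest.
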
Representing the game board
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A neural network operates on numbers, not characters. Before the
game board can be fed to the Q-network, it must be converted to a
numerical representation. ``retro-gamer`` uses *one-hot encoding*.

For a character set of ``n`` distinct characters, each cell on the
board is represented by a vector of ``n`` numbers, all zero except for
the one position corresponding to the character in that cell, which is
set to 1. For example, with character set ``['@', '*', '>']``, the
character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
encoded as ``[0, 0, 0]``.

The full board representation is a three-dimensional array of shape
(H, W, C), where H is the board height, W is the board width, and
C is the number of characters in the character set. The total number
of numbers in this array—H × W × C—is the size of the board part of
the observation. For a 32×16 board with 6 characters, this is
32 × 16 × 6 = 3,072 numbers.

The ``character_set`` field in the game description determines which
characters the agent can distinguish. A character not in the set
appears as an all-zero vector—indistinguishable from an empty cell.
If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.

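The encoding described above can be sketched in plain Python. The helper
``one_hot_board`` and the tiny two-row board are hypothetical; they
illustrate the scheme and are not taken from ``retro-gamer``. Note the
``'#'`` character, which is deliberately missing from the character set.

```python
def one_hot_board(rows, character_set):
    """Encode a list of equal-length strings as a nested H x W x C list."""
    index = {ch: i for i, ch in enumerate(character_set)}
    board = []
    for row in rows:
        encoded_row = []
        for ch in row:
            cell = [0] * len(character_set)
            if ch in index:            # characters outside the set stay all-zero
                cell[index[ch]] = 1
            encoded_row.append(cell)
        board.append(encoded_row)
    return board


character_set = ["@", "*", ">"]
rows = ["@ >",                         # hypothetical 2-row, 3-column board
        "*# "]
encoded = one_hot_board(rows, character_set)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape, encoded[0][2], encoded[1][1])
```

The ``'#'`` cell and the empty cells receive the same all-zero vector,
which is exactly why a character omitted from the set is invisible to
the agent.
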
In addition to the board, the agent can observe numerical values from
the game's state dictionary via ``observe_state``. These are
appended to the end of the observation vector. The reward key must
not be included in ``observe_state``: it would give the agent direct
access to its own performance signal, which is not a realistic observation
in most game contexts and can cause training pathologies.

Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architecture of the Q-network—the number and arrangement of its
layers—is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
field in the game description and generates a plain-language rationale.

**Multilayer perceptrons (MLP)**

The simplest neural network architecture for fixed-size input is the
*multilayer perceptron* (MLP). An MLP is a sequence of *fully
connected layers*: every unit in one layer is connected to every unit
in the next. Each connection has a learnable *weight*; a unit computes
a weighted sum of its inputs, passes it through a nonlinear *activation
function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
``max(0, x)``), and sends the result to the next layer. The final
layer has one unit per action, producing Q-value estimates.

An MLP with two hidden layers of width 128, for an observation of size
3,072 and 5 possible actions, would have roughly 410,000 trainable
parameters, almost all of them in the first layer. Training adjusts all
of these parameters simultaneously to reduce the TD error.

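The parameter count quoted above can be reproduced by hand. The sketch
below assumes each fully connected layer has one weight per connection
plus one bias per output unit; the helper name
``mlp_parameter_count`` is hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out      # weights plus one bias per unit
    return total


# Observation of 3,072 numbers, two hidden layers of 128, 5 actions.
count = mlp_parameter_count([3072, 128, 128, 5])
first_layer = 3072 * 128 + 128
print(count, first_layer)
```

The first layer alone accounts for 393,344 of the 410,501 parameters,
which shows how directly the size of the observation (and therefore the
character set) drives the size of the network.
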
An MLP treats its input as a flat list of numbers. It does not know
that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings—a set of meters or status indicators—rather than as a spatial
scene. Set ``spatial = false`` in the game description to use this
architecture.

**Convolutional neural networks (CNN)**

When the game board is genuinely spatial—when the relative positions
of characters matter—a *convolutional neural network* (CNN) is a much
better fit. A CNN applies a set of learnable *filters* (small weight
matrices) across the board, computing a dot product of each filter with
every overlapping patch of the input. The result is a set of *feature
maps*: each feature map highlights where in the board a particular
pattern appears.

This is efficient for two reasons. First, the same filter is applied
at every board position: a filter that detects "apple to the right of
snake head" works the same way whether the apple is at position (10,5)
or (20,12). This weight sharing, often loosely called *translational
invariance* (strictly, translation equivariance), means the network can
generalize across positions without learning a separate rule for each
one. Second, each filter needs only a small number of parameters (the
filter size), far fewer than an equivalent fully connected layer would
need.

``retro-gamer`` uses two convolutional layers (with 32 and 64 output
channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), with channels listed first,
which is then flattened and passed to the MLP. Set ``spatial = true``
(the default) to use this architecture.

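The shape and parameter arithmetic for these two layers can be checked
directly. The sketch assumes stride 1 (implied by the statement that
padding preserves the spatial dimensions) and reads the 32×16 example
board as 32 columns by 16 rows with 6 characters; the helper names are
hypothetical.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Standard output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_parameter_count(in_channels, out_channels, kernel=3):
    """Weights of every filter plus one bias per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels


H, W, C = 16, 32, 6                    # assumed reading of the example board
out_h, out_w = conv_output_size(H), conv_output_size(W)
conv_params = conv_parameter_count(C, 32) + conv_parameter_count(32, 64)
flattened = 64 * out_h * out_w         # input size of the MLP head
print(out_h, out_w, conv_params, flattened)
```

Both convolutional layers together need only 20,256 parameters, yet the
flattened output has 32,768 values, so under these assumptions most of
the network's parameters sit in the first layer of the MLP head.
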
Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The architectural choices ``retro-gamer`` makes are not arbitrary: they
follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.

- If ``spatial = true``, the CNN can detect local patterns—which characters
  are adjacent to which—without needing to see every possible arrangement.
  This is appropriate for games like Snake, where the snake's direction
  and the apple's relative position are spatially encoded.

- If ``spatial = false``, the MLP treats the board as a flat vector. This
  may be appropriate for games that use the character grid primarily as a
  display rather than a spatial field—for example, a game where characters
  appear in fixed, non-interacting positions as status indicators.

- The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
  network. A character set that includes characters the game never uses
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- The ``observe_state`` fields are appended to the flattened CNN output
  before the MLP head. This allows the agent to use explicit state
  variables—a timer, a lives count—alongside the visual board
  representation.

These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.
13
docs/conf.py
Normal file
@@ -0,0 +1,13 @@
project = 'retro-gamer'
copyright = '2025, Chris Proctor'
author = 'Chris Proctor'
release = '0.1.0'

extensions = []

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
html_theme_options = {}
19
docs/contributing.rst
Normal file
@@ -0,0 +1,19 @@
Contributing
============

``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ project. Chris
Proctor (chrisp@buffalo.edu), the project lead, is interested in
hearing about your experience using the package, whether in a classroom,
as a research tool, or for personal exploration.

Bug reports, feature requests, and discussion of future directions take
place on the project repository's
`issues page <https://github.com/cproctor/retro-gamer/issues>`__. Code
contributions should be submitted as pull requests. Development follows
the `Contributor Covenant <https://www.contributor-covenant.org/>`__.

If you are a teacher or curriculum designer considering using
``retro-gamer`` in a course, or a researcher interested in collaborating
on studies of its educational effectiveness, please contact Chris
directly.
69
docs/index.rst
Normal file
@@ -0,0 +1,69 @@
retro-gamer: train agents to play retro games
==============================================

``retro-gamer`` is a Python package for training reinforcement learning
agents to play games implemented with the
`retro-games <https://retro-games.readthedocs.io/en/latest/>`__
framework. It is designed as a learning tool: rather than writing the
learning algorithm yourself, you describe the game to the trainer in a
structured way, adjust the training parameters, and then observe—through
a detailed log—how the trainer uses your description to build and run a
learning model.

The central idea is that the game becomes an *object to think with*
about reinforcement learning. The choices you make—which characters to
tell the trainer about, what counts as a reward, whether to treat the
board as a spatial scene or a readout—have direct, observable
consequences for how learning proceeds. Working out *why* a training run
behaves as it does is the kind of reasoning that leads to lasting
understanding of the underlying concepts.

.. _installation:

Installation
------------

Prerequisites
~~~~~~~~~~~~~

``retro-gamer`` requires Python 3.11 or higher and a game implemented
with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
The retro-games framework must also be installed; see its documentation
for instructions.

Once the prerequisites are in place, install ``retro-gamer`` with pip:

.. code-block:: console

    % pip install retro-gamer

To install from source (for development or to use the latest changes):

.. code-block:: console

    % git clone https://github.com/cproctor/retro-gamer
    % cd retro-gamer
    % pip install -e .

Verify the installation by checking the command-line tool:

.. code-block:: console

    % retro-gamer --help
    Usage: retro-gamer [OPTIONS] COMMAND [ARGS]...

      Train and run RL agents for retro games.

    Commands:
      create  Create a new training run directory with config.toml.
      info    Print a summary of a training run.
      play    Watch a trained agent play the game.
      train   Train (or resume training) a DQN agent.

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   introduction
   background
   walkthrough
   reference
   contributing
160
docs/introduction.rst
Normal file
@@ -0,0 +1,160 @@
Introduction
============

``retro-gamer`` grew out of a question about how students learn
difficult ideas in computer science. Reinforcement learning—the branch
of machine learning in which an agent learns to act well by interacting
with an environment and receiving rewards—is one of the most powerful
and widely deployed ideas in modern computing. It underlies systems that
play chess and Go at superhuman levels, control industrial robots,
optimize power grids, and personalize recommendation feeds. It is also
genuinely hard to understand, not because the core ideas are especially
abstract, but because the feedback between a student's understanding and
the system's behavior is usually invisible. You adjust a hyperparameter,
run a training loop, and get a number. What happened inside, and why,
remains opaque.

The design hypothesis of ``retro-gamer`` is that this opacity is not
inevitable. If a student already knows a game well—how it works, what
the pieces mean, what counts as doing well—then training an agent on
that game gives them a concrete anchor for reasoning about what the
learning algorithm is doing and why. When the trainer decides to use a
convolutional neural network instead of a simpler model, it explains its
reasoning. When training stalls, the student can ask: did I describe the
game accurately? Is the reward sending the right signal? Would a
different exploration strategy help? These are exactly the questions that
build genuine conceptual understanding.

``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ curriculum, a
project-based high school computer science curriculum emphasizing
personally meaningful creation and deep conceptual engagement. In the
games unit, students design and implement their own games using the
``retro-games`` framework. The extension into reinforcement learning is
a natural next step: you built the game; now let's see if a machine can
learn to play it.

How retro-gamer works
---------------------

Rather than asking you to write a training algorithm yourself,
``retro-gamer`` asks you to describe the game you want to train on.
This description—written in your game project's ``pyproject.toml``—tells
the trainer things the game's code alone doesn't make obvious: which
characters matter, which piece of game state represents success, whether
the board should be understood spatially or as a flat data display.

From this description, the trainer constructs a deep Q-learning model
suited to the game. It writes out a plain-language explanation of every
architectural decision it makes, then begins training. As training
proceeds, it logs each episode's reward, loss, and exploration rate.
Trained model snapshots—checkpoints—are saved periodically, so you can
watch how the agent's skill develops over time. When you're done
training, you can load any checkpoint and watch the agent play.

A typical workflow looks like this. First, describe your game in the
``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]

Then create a training run, train, and watch the result:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create --game my_game --output runs/snake/
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0500
|
||||
|
||||
The ``create`` command sets up the training run directory; ``train``
|
||||
runs the learning algorithm; ``play`` loads a checkpoint and lets you
|
||||
watch the trained agent live in the terminal.
|
||||
|
||||
What you will learn
|
||||
-------------------
|
||||
|
||||
Working with ``retro-gamer`` is designed to build understanding of a
|
||||
cluster of related ideas:
|
||||
|
||||
**Reinforcement learning** is the framework in which an agent
|
||||
interacts with an environment, receiving observations and rewards, and
|
||||
learns to choose actions that maximize its long-term reward. The
|
||||
``retro-gamer`` training loop is a concrete instance of this framework:
|
||||
the agent is the neural network, the environment is the game, the
|
||||
observation is the encoded board and game state, and the reward is
|
||||
the change in score from one turn to the next.
|
||||
|
||||
**Neural network architecture** shapes what a model can and cannot
|
||||
learn. When you declare a game ``spatial``, the trainer builds a
|
||||
convolutional neural network that can detect patterns in the relative
|
||||
positions of game pieces. When you declare it non-spatial, it builds a
|
||||
simpler network with no built-in sense of adjacency. Seeing the consequence of this
|
||||
choice in training behavior is a direct experience of why architecture
|
||||
matters.
|
||||
|
||||
**Observation design** determines what information is available to the
|
||||
agent. If you leave a character out of the ``character_set``, the agent
|
||||
will not distinguish it from empty space. If you include a game-state
|
||||
variable in ``observe_state``, the agent can see it directly rather than
|
||||
having to infer it from the board. The consequences of these choices for
|
||||
what the agent can learn are reasonably predictable—and making and
|
||||
checking those predictions is exactly the kind of reasoning the tool is
|
||||
designed to support.
|
||||
|
||||
**Reward engineering** is the craft of specifying what counts as doing
|
||||
well in a way the agent can actually optimize. Using score as the reward
|
||||
is natural for many games, but some games have sparse rewards (the agent
|
||||
rarely earns points), and some have reward signals that are easy to
|
||||
game. Experimenting with what to use as a reward—and observing how that
|
||||
choice shapes training—is one of the richest paths into understanding
|
||||
what reinforcement learning is actually optimizing.
|
||||
|
||||
**Hyperparameter tuning** is the practice of adjusting training settings
|
||||
such as learning rate, exploration probability, and network size to
|
||||
improve training efficiency and final performance. ``retro-gamer``
|
||||
exposes these settings explicitly and explains their role in the
|
||||
training log, so tuning them is connected to conceptual understanding
|
||||
rather than uninformed search.
|
||||
|
||||
The interpretable training log
|
||||
------------------------------
|
||||
|
||||
A key feature of ``retro-gamer`` is its training log. When training
|
||||
begins, the trainer writes a complete, plain-language account of the
|
||||
model it built: why it chose the architecture it did, what the
|
||||
observation vector contains, what actions the agent can take, and how
|
||||
the exploration and learning schedules are set up. Here is an example
|
||||
from training a snake agent:
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
[INIT] === Network Architecture ===
|
||||
[INIT] Board: 32×16, character set: 6 chars (one-hot per cell)
|
||||
[INIT] Observed state keys: 0 | Actions (incl. no-op): 5
|
||||
[INIT] spatial=True → using CNN architecture
|
||||
[INIT] Rationale: the board is a 2-D spatial scene; a CNN captures
|
||||
[INIT] local patterns (walls, items nearby) more efficiently than an MLP.
|
||||
[INIT] CNN: Conv2d(6→32, k=3, pad=1) → ReLU → Conv2d(32→64, k=3, pad=1) → ReLU
|
||||
[INIT] CNN output: 64 channels × 16×32 = 32768 features (flattened)
|
||||
[INIT] MLP head input: 32768 (conv) + 0 (state) = 32768
|
||||
[INIT] MLP: 32768 → 128 → 128 → 5
|
||||
[INIT] Hidden layers: 2 | Layer width: 128
|
||||
[INIT] Output: 5 Q-values
|
||||
[INIT] Actions: ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN'] + (no-op)
|
||||
...
|
||||
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
|
||||
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6058 avg_loss=0.001204
|
||||
[EP 0500] total_reward=9.0 steps=1203 epsilon=0.0816 avg_loss=0.000387
|
||||
|
||||
The episode log shows total reward (score earned), how many turns the
|
||||
episode lasted, the current exploration rate (``epsilon``), and the
|
||||
average prediction error (``avg_loss``). Reading this log—and
|
||||
connecting changes in these numbers to what you know about the game and
|
||||
the algorithm—is one of the main activities the tool is designed to
|
||||
support.
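
The sizes reported in the ``[INIT]`` lines can be checked by hand. The
following is a minimal sketch of that arithmetic, not the trainer's own
code. The variable names are illustrative, and it assumes the two
``k=3, pad=1`` convolutions use stride 1, so they preserve the board's
height and width, as the ``16×32`` output in the log implies.

.. code-block:: python

   # Sizes taken from the example log above.
   board_width, board_height = 32, 16
   conv_out_channels = 64
   n_state_keys = 0      # observe_state is empty for Snake
   layer_size = 128

   # A 3x3 convolution with padding 1 (stride 1 assumed) keeps height and width.
   conv_features = conv_out_channels * board_height * board_width
   mlp_input = conv_features + n_state_keys

   # Weights plus biases of the first fully connected layer.
   first_layer_params = mlp_input * layer_size + layer_size

   print(conv_features)       # 32768
   print(first_layer_params)  # 4194432

Under these assumptions, the first fully connected layer holds far more
parameters than the rest of the network combined, so board size has a
large effect on model size.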
|
||||
344
docs/reference.rst
Normal file
@@ -0,0 +1,344 @@
|
||||
Reference
|
||||
=========
|
||||
|
||||
Game description fields
|
||||
-----------------------
|
||||
|
||||
Game descriptions are written in the ``[tool.retro-gamer]`` section of
|
||||
your game project's ``pyproject.toml``. ``retro-gamer create`` reads
|
||||
this section and copies the metadata into the training run's
|
||||
``config.toml``, where it can also be inspected or hand-edited.
|
||||
|
||||
A complete example for the Snake game:
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
[tool.retro-gamer]
|
||||
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
|
||||
reward = "score"
|
||||
character_set = ["@", "*", ">", "<", "^", "v"]
|
||||
spatial = true
|
||||
observe_state = []
|
||||
|
||||
You do not need to specify the board size: ``retro-gamer`` reads it
|
||||
directly from your game's ``board_size`` attribute.
|
||||
|
||||
The fields are described below.
|
||||
|
||||
``actions``
|
||||
~~~~~~~~~~~
|
||||
|
||||
**Required.** A list of keystroke names the agent may send to the game
|
||||
each turn. Use arrow key names for directional games, or single
|
||||
characters for character-key games.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
|
||||
|
||||
The agent also has access to a no-op action (doing nothing). The total
|
||||
number of actions in the Q-network output is ``len(actions) + 1``.
|
||||
|
||||
``reward``
|
||||
~~~~~~~~~~
|
||||
|
||||
**Required.** The key in the game's state dictionary to use as the
|
||||
reward signal. The reward computed for each turn is the *change* in
|
||||
this value from the previous turn.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
reward = "score"
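
As an illustration of "reward as change", here is a minimal sketch. The
function name ``turn_reward`` and the sample score sequence are
hypothetical and are not part of ``retro-gamer``.

.. code-block:: python

   def turn_reward(previous_state, current_state, reward_key="score"):
       """Reward for one turn: the change in the chosen state value."""
       return current_state[reward_key] - previous_state[reward_key]

   # Score observed at the start and after each of four turns (made-up values).
   scores = [0, 0, 1, 1, 3]
   states = [{"score": s} for s in scores]
   rewards = [turn_reward(a, b) for a, b in zip(states, states[1:])]

   print(rewards)       # [0, 1, 0, 2]
   print(sum(rewards))  # 3

Summed over an episode, the per-turn changes add up to the final value
minus the starting value, which is why an episode's total reward matches
the score gained during it.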
|
||||
|
||||
``character_set``
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Optional.** A list of single characters that may appear on the board.
|
||||
Each character occupies one "slot" in the one-hot encoding. Characters
|
||||
not in this list are treated as empty space.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
character_set = ["@", "*", ">", "<", "^", "v"]
|
||||
|
||||
If omitted, ``retro-gamer`` runs an exploration phase to discover the
|
||||
characters that appear in practice. The length of this phase is
|
||||
controlled by the ``exploration_turns`` hyperparameter.
|
||||
|
||||
``spatial``
|
||||
~~~~~~~~~~~
|
||||
|
||||
**Optional.** Default: ``true``. Whether to treat the board as a 2D
|
||||
spatial scene. When ``true``, the trainer uses a convolutional neural
|
||||
network (CNN) that can detect patterns in the relative positions of
|
||||
characters. When ``false``, the trainer uses a multilayer perceptron
|
||||
(MLP) that sees the board as a flat list of numbers without positional
|
||||
structure.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
spatial = true
|
||||
|
||||
``observe_state``
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Optional.** Default: ``[]``. A list of keys from the game's state
|
||||
dictionary to append to the observation vector. The values must be
|
||||
numbers (integers, floats, or booleans). The reward key must not
|
||||
appear in this list.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
observe_state = ["lives", "level"]
|
||||
|
||||
.. _hyperparameters:
|
||||
|
||||
Hyperparameters
|
||||
---------------
|
||||
|
||||
Hyperparameters are stored in the ``[hyperparameters]`` section of
|
||||
``config.toml``. They can be set via ``retro-gamer create`` options or
|
||||
edited directly.
|
||||
|
||||
Learning and optimization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``learning_rate`` (default: ``0.001``)
|
||||
The step size used by the Adam optimizer when updating network
|
||||
weights. Larger values converge faster but may be unstable; smaller
|
||||
values are more stable but slower.
|
||||
|
||||
``lr_decay`` (default: ``0.995``)
|
||||
Multiplicative decay applied to the learning rate after each
|
||||
episode. The learning rate decreases geometrically over training,
|
||||
helping the network fine-tune later without destabilizing early
|
||||
progress.
|
||||
|
||||
``gamma`` (default: ``0.99``)
|
||||
The discount factor for future rewards. A value of 1.0 makes the
|
||||
agent value all future rewards equally; smaller values make the
|
||||
agent increasingly myopic.
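
To make the effect of ``gamma`` concrete, the sketch below computes how
much a reward is worth to the agent when it arrives some turns in the
future. The function ``discounted_return`` is a hypothetical helper
written for this illustration, not part of the library.

.. code-block:: python

   def discounted_return(rewards, gamma=0.99):
       """Sum of rewards, each weighted by gamma raised to its delay in turns."""
       total = 0.0
       for reward in reversed(rewards):
           total = reward + gamma * total
       return total

   # A single reward of 1 that arrives 50 turns from now.
   delayed = [0.0] * 50 + [1.0]
   print(round(discounted_return(delayed, gamma=0.99), 3))  # 0.605
   print(round(discounted_return(delayed, gamma=0.90), 3))  # 0.005

With ``gamma = 0.99`` an apple 50 turns away still counts for about 60%
of its face value; with ``gamma = 0.9`` it is nearly invisible.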
|
||||
|
||||
Exploration
|
||||
~~~~~~~~~~~
|
||||
|
||||
``epsilon`` (default: ``1.0``)
|
||||
The initial exploration rate. At each turn, the agent takes a
|
||||
random action with probability ``epsilon`` and exploits its current
|
||||
Q-function with probability ``1 - epsilon``.
|
||||
|
||||
``epsilon_decay`` (default: ``0.995``)
|
||||
Multiplicative decay applied to ``epsilon`` after each episode.
|
||||
|
||||
``epsilon_min`` (default: ``0.05``)
|
||||
The floor below which ``epsilon`` will not fall. A small amount of
|
||||
continued exploration prevents the agent from becoming permanently
|
||||
committed to a suboptimal policy.
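
The three exploration settings combine into a simple schedule. The
sketch below is one way to express it, assuming the decay is applied once
per completed episode as described above. ``epsilon_after`` and
``choose_action`` are hypothetical names for this illustration, not
functions exported by ``retro-gamer``.

.. code-block:: python

   import math
   import random

   def epsilon_after(episodes, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
       """Exploration rate after a number of completed episodes, floor applied."""
       return max(epsilon_min, epsilon * epsilon_decay ** episodes)

   def choose_action(q_values, epsilon, rng=random):
       """Epsilon-greedy: a random action index with probability epsilon, else the best."""
       if rng.random() < epsilon:
           return rng.randrange(len(q_values))
       return max(range(len(q_values)), key=lambda i: q_values[i])

   # Episodes needed for the default schedule to reach the floor.
   episodes_to_floor = math.ceil(math.log(0.05 / 1.0) / math.log(0.995))

   print(round(epsilon_after(50), 4))  # 0.7783
   print(epsilon_after(1000))          # 0.05
   print(episodes_to_floor)            # 598

With the defaults, exploration reaches its floor a little before episode
600, so the remaining episodes of a 1000-episode run are played almost
entirely from the learned Q-function.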
|
||||
|
||||
Memory and sampling
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``batch_size`` (default: ``64``)
|
||||
The number of experiences sampled from the replay buffer per
|
||||
training step.
|
||||
|
||||
``memory_capacity`` (default: ``10000``)
|
||||
The maximum number of experiences the replay buffer can hold. When
|
||||
full, the oldest experiences are discarded.
|
||||
|
||||
``prioritize_experiences`` (default: ``false``)
|
||||
Whether to use prioritized experience replay. When ``true``,
|
||||
experiences with larger TD errors are sampled more frequently.
|
||||
This often improves sample efficiency at a modest computational
|
||||
cost.
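
The following sketch illustrates the ideas behind these three settings
using only the standard library. It is a simplified stand-in: the helper
``sample_batch`` is hypothetical, and full prioritized replay schemes
usually add refinements (such as priority exponents and
importance-sampling corrections) that are omitted here. How
``retro-gamer`` implements prioritization internally is not shown.

.. code-block:: python

   import random
   from collections import deque

   def sample_batch(buffer, batch_size, td_errors=None, rng=random):
       """Uniform sample, or one weighted by |TD error| when td_errors is given."""
       experiences = list(buffer)
       if td_errors is None:
           return rng.sample(experiences, batch_size)        # without replacement
       weights = [abs(error) + 1e-6 for error in td_errors]  # keep weights positive
       return rng.choices(experiences, weights=weights, k=batch_size)

   # A capacity of 3: appending to a full deque discards the oldest experience.
   buffer = deque(maxlen=3)
   for turn in range(5):
       buffer.append(f"turn{turn}")
   print(list(buffer))  # ['turn2', 'turn3', 'turn4']

   # The experience with the largest TD error dominates a prioritized batch.
   rng = random.Random(0)
   batch = sample_batch(buffer, 1000, td_errors=[2.0, 0.1, 0.1], rng=rng)
   print(batch.count("turn2") > batch.count("turn3"))  # True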
|
||||
|
||||
Network architecture
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``n_layers`` (default: ``2``)
|
||||
The number of hidden layers in the MLP head (for spatial games,
|
||||
this follows the CNN; for non-spatial games, it is the full
|
||||
network).
|
||||
|
||||
``layer_size`` (default: ``128``)
|
||||
The width (number of units) in each hidden layer.
|
||||
|
||||
Training duration
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
``training_episodes`` (default: ``1000``)
|
||||
The total number of game episodes to run. Each episode runs until
|
||||
the game ends or ``max_turns_per_episode`` turns have elapsed.
|
||||
|
||||
``max_turns_per_episode`` (default: ``2000``)
|
||||
A safety cutoff preventing a single episode from running
|
||||
indefinitely (for example, if the agent finds a way to avoid
|
||||
dying).
|
||||
|
||||
``target_update_freq`` (default: ``100``)
|
||||
How many training steps between updates of the target network.
|
||||
More frequent updates make training targets move faster (less
|
||||
stable); less frequent updates make them more stable but slower
|
||||
to reflect new learning.
|
||||
|
||||
Character discovery
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``exploration_turns`` (default: ``200``)
|
||||
When ``character_set`` is not specified, the number of random
|
||||
turns to run at the start of training to discover which
|
||||
characters appear on the board.
|
||||
|
||||
``unknown_character_strategy`` (default: ``"ignore"``)
|
||||
What to do when a character appears during training that is not
|
||||
in the established ``character_set``. ``"ignore"`` treats it as
|
||||
an empty cell; ``"extend"`` rebuilds the model with an extended
|
||||
character set.
|
||||
|
||||
CLI reference
|
||||
-------------
|
||||
|
||||
``retro-gamer create``
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Create a new training run directory with ``config.toml``. Game metadata
|
||||
is read automatically from the ``[tool.retro-gamer]`` section of your
|
||||
game's ``pyproject.toml``; you do not pass it on the command line.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create --game MODULE --output DIR [OPTIONS]
|
||||
|
||||
**Required options:**
|
||||
|
||||
- ``--game MODULE`` — Python module containing ``create_game()``
|
||||
(e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
|
||||
is read from the ``pyproject.toml`` found in or above the module's
|
||||
source directory.
|
||||
- ``--output DIR`` — Directory to create for this training run.
|
||||
|
||||
**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
|
||||
|
||||
- ``--training-episodes N``
|
||||
- ``--n-layers N``
|
||||
- ``--layer-size N``
|
||||
- ``--learning-rate F``
|
||||
- ``--lr-decay F``
|
||||
- ``--gamma F``
|
||||
- ``--epsilon-decay F``
|
||||
- ``--epsilon-min F``
|
||||
- ``--batch-size N``
|
||||
- ``--memory-capacity N``
|
||||
- ``--target-update-freq N``
|
||||
- ``--max-turns-per-episode N``
|
||||
- ``--exploration-turns N``
|
||||
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
|
||||
|
||||
``retro-gamer train``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Train (or resume training) a DQN agent.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train RUN_DIR [--resume CHECKPOINT]
|
||||
|
||||
``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
|
||||
create``. If ``--resume`` is given, training resumes from the specified
|
||||
checkpoint file (relative or absolute path).
|
||||
|
||||
``retro-gamer play``
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Watch a trained agent play the game in the terminal.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
|
||||
|
||||
``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
|
||||
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
|
||||
``--framerate`` sets the target frames per second (default: 12). Press
|
||||
Enter or Escape to quit.
|
||||
|
||||
``retro-gamer info``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Print a summary of a training run: metadata, hyperparameters, recent
|
||||
episode log, and available checkpoints.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer info RUN_DIR
|
||||
|
||||
Training run directory structure
|
||||
---------------------------------
|
||||
|
||||
A training run is a self-contained directory with the following
|
||||
contents:
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
runs/snake/
|
||||
├── config.toml # game description + hyperparameters
|
||||
├── training.log # architecture rationale + per-episode log
|
||||
└── checkpoints/
|
||||
├── ep_0100.pt # model weights at episode 100
|
||||
├── ep_0200.pt
|
||||
├── ...
|
||||
└── final.pt # model weights at training completion
|
||||
|
||||
``config.toml`` is written by ``retro-gamer create`` and updated (with
|
||||
the discovered character set and resolved hyperparameters) when
|
||||
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
|
||||
and ``train`` is the recommended way to adjust hyperparameters.
|
||||
|
||||
``training.log`` begins with the full architecture description
|
||||
generated at training startup, followed by one line per episode in the
|
||||
format::
|
||||
|
||||
[EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
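
Because the episode lines follow a fixed format, they are easy to read
back for plotting or comparison. The parser below is a sketch written
against the format shown above. The name ``parse_episode_line`` is
hypothetical, and the pattern assumes fields are separated by whitespace
and appear in this order.

.. code-block:: python

   import re

   EPISODE_LINE = re.compile(
       r"\[EP\s+(?P<episode>\d+)\]\s+"
       r"total_reward=(?P<total_reward>\S+)\s+"
       r"steps=(?P<steps>\d+)\s+"
       r"epsilon=(?P<epsilon>\S+)\s+"
       r"avg_loss=(?P<avg_loss>\S+)"
   )

   def parse_episode_line(line):
       """Return the fields of one episode line as a dict, or None for other lines."""
       match = EPISODE_LINE.search(line)
       if match is None:
           return None
       fields = match.groupdict()
       return {
           "episode": int(fields["episode"]),
           "total_reward": float(fields["total_reward"]),
           "steps": int(fields["steps"]),
           "epsilon": float(fields["epsilon"]),
           "avg_loss": float(fields["avg_loss"]),
       }

   sample = "[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217"
   print(parse_episode_line(sample)["steps"])              # 1921
   print(parse_episode_line("[INIT] Output: 5 Q-values"))  # None

Lines from the architecture description return ``None``, so the same
function can be applied to every line of ``training.log``.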
|
||||
|
||||
Checkpoint files are PyTorch state dictionaries containing model
|
||||
weights, optimizer state, the current epsilon, and the total number of
|
||||
training steps completed. They can be loaded with
|
||||
``retro-gamer play`` or directly with the Python API.
|
||||
|
||||
Python API
|
||||
----------
|
||||
|
||||
For advanced use, ``retro-gamer``'s components are importable as a
|
||||
library.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
   from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
   from retro.examples.snake import create_game

   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
   metadata = GameMetadata.from_pyproject("retro.examples.snake")

   trainer = DQNTrainer(
       create_game, metadata, "runs/snake/",
       training_episodes=500,
       n_layers=2,
       layer_size=128,
   )
   trainer.train()
|
||||
|
||||
``GameEnvironment`` provides a gym-style interface for stepping through
|
||||
a game programmatically:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
   from retro_gamer import GameEnvironment

   # create_game and metadata are defined as in the previous example
   env = GameEnvironment(create_game, metadata)
   obs = env.reset()                          # returns initial observation vector
   obs, reward, done = env.step("KEY_RIGHT")  # advance the game by one turn
|
||||
|
||||
The observation is a flat NumPy array of dtype ``float32``. For spatial
|
||||
games, the first ``C × H × W`` elements are the board (channel-first
|
||||
one-hot encoding); for non-spatial games, the board is encoded
|
||||
``H × W × C`` and then flattened. Any ``observe_state`` values are
|
||||
appended at the end.
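
The two board layouts differ only in the order of the flattened entries.
The sketch below computes where a single one-hot entry lands in each
layout. ``board_index`` is a hypothetical helper, and the formulas assume
row-major flattening (NumPy's default order), which the description
above suggests but does not state outright.

.. code-block:: python

   def board_index(channel, row, col, n_channels, height, width, spatial=True):
       """Position of one one-hot entry within the flat observation vector."""
       if spatial:
           # Channel-first (C x H x W): all cells of channel 0, then channel 1, ...
           return (channel * height + row) * width + col
       # Channel-last (H x W x C): all channels of cell (0, 0), then cell (0, 1), ...
       return (row * width + col) * n_channels + channel

   # Snake: 6 characters on a board 32 wide and 16 high.
   C, H, W = 6, 16, 32
   board_length = C * H * W

   print(board_length)                                  # 3072
   print(board_index(1, 0, 0, C, H, W, spatial=True))   # 512
   print(board_index(1, 0, 0, C, H, W, spatial=False))  # 1

In both layouts the board occupies the first ``board_length`` entries,
so the first ``observe_state`` value, if any, sits at index
``board_length``.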
|
||||
299
docs/walkthrough.rst
Normal file
@@ -0,0 +1,299 @@
|
||||
Walkthrough
|
||||
===========
|
||||
|
||||
This section walks through a complete ``retro-gamer`` workflow, from
|
||||
preparing a game to watching a trained agent play. The game used here
|
||||
is the Snake example included with the ``retro-games`` framework, but
|
||||
the same steps apply to any game you build.
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
You will need:
|
||||
|
||||
- Python 3.11 or higher.
|
||||
- The ``retro-games`` framework installed and a game you have written
|
||||
(or the built-in Snake example). See the
|
||||
`retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
|
||||
for help writing games.
|
||||
- ``retro-gamer`` installed (see :ref:`installation`).
|
||||
|
||||
Preparing your game
|
||||
-------------------
|
||||
|
||||
``retro-gamer`` loads your game by importing a Python module and
|
||||
calling a function named ``create_game``. The ``create_game`` function
|
||||
must take no arguments and return a new ``Game`` instance.
|
||||
|
||||
Here is the ``create_game`` function for Snake:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
   def create_game():
       head = SnakeHead()
       apple = Apple()
       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
       apple.relocate(game)
       return game
|
||||
|
||||
If your game module does not already have a ``create_game`` function,
|
||||
add one following this pattern.
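
The loading step described above can be pictured with the standard
library's ``importlib``. This is a sketch of the general technique, not
``retro-gamer``'s actual loader; ``load_game_factory`` is a hypothetical
name.

.. code-block:: python

   import importlib

   def load_game_factory(module_name, factory_name="create_game"):
       """Import a module by dotted name and return its zero-argument factory."""
       module = importlib.import_module(module_name)
       factory = getattr(module, factory_name, None)
       if not callable(factory):
           raise AttributeError(f"{module_name} has no callable {factory_name}()")
       return factory

   # With retro-games installed, usage would look like:
   #     create_game = load_game_factory("retro.examples.snake")
   #     game = create_game()

Because the factory takes no arguments, a trainer can call it at the
start of every episode to get a fresh, independent game.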
|
||||
|
||||
|
||||
Describing your game
|
||||
--------------------
|
||||
|
||||
Every training run begins with a description of your game. This
|
||||
description belongs in the ``[tool.retro-gamer]`` section of your game
|
||||
project's ``pyproject.toml``—the same file that defines the project's
|
||||
name, version, and dependencies. Placing it there keeps the description
|
||||
with the game itself, where it belongs.
|
||||
|
||||
Here is the ``[tool.retro-gamer]`` section for the Snake example:
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
[tool.retro-gamer]
|
||||
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
|
||||
reward = "score"
|
||||
character_set = ["@", "*", ">", "<", "^", "v"]
|
||||
spatial = true
|
||||
observe_state = []
|
||||
|
||||
Let's go through each field.
|
||||
|
||||
``actions``
|
||||
~~~~~~~~~~~
|
||||
|
||||
A list of the keystrokes the agent may send to the game. For Snake,
|
||||
the four arrow keys control the direction of travel. The agent also
|
||||
implicitly has access to a no-op (doing nothing).
|
||||
|
||||
.. note::
|
||||
|
||||
Only include actions that the game actually responds to. Listing
|
||||
ignored keys wastes part of the agent's action space and may slow
|
||||
training.
|
||||
|
||||
``reward``
|
||||
~~~~~~~~~~
|
||||
|
||||
The key in the game's state dictionary to use as the reward signal.
|
||||
``retro-gamer`` computes the reward for each turn as the *change* in
|
||||
this value from one turn to the next. For Snake, score increases by 1
|
||||
(or more) each time the apple is eaten, so the agent receives a positive
reward (at least 1) when it eats an apple and 0 otherwise.
|
||||
|
||||
Choosing an appropriate reward is one of the most consequential
|
||||
decisions in RL. Some considerations:
|
||||
|
||||
- A reward that is too sparse—where the agent goes many turns without
|
||||
receiving any signal—makes learning slow. A snake that dies without
|
||||
ever eating an apple receives no positive reward at all in the first
|
||||
episodes, giving the learning algorithm almost nothing to work with.
|
||||
- A reward that is too dense—assigned every turn—may not reflect the
|
||||
true goal of the game.
|
||||
- An artificial reward, such as giving a point for moving toward the
|
||||
apple, can accelerate early training but may cause the agent to
|
||||
optimize the proxy rather than the real objective.
|
||||
|
||||
``character_set``
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The characters that can appear on the board, as a list of
|
||||
single-character strings. Each cell of the board will be *one-hot
|
||||
encoded* using this list: the agent represents the content of each cell
|
||||
as a vector of zeros with a single 1 at the position corresponding to
|
||||
the character. A cell containing a character not in this list is treated
|
||||
as empty.
|
||||
|
||||
For Snake, the characters are: ``@`` (the apple), ``*`` (body
|
||||
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
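
A minimal sketch of one-hot encoding a single cell follows. The helper
``one_hot_cell`` is hypothetical, and representing an empty or unknown
cell as all zeros is an assumption, though it is consistent with the
example network taking one input channel per listed character.

.. code-block:: python

   def one_hot_cell(character, character_set):
       """One slot per known character; unknown characters and blanks give all zeros."""
       return [1.0 if character == known else 0.0 for known in character_set]

   character_set = ["@", "*", ">", "<", "^", "v"]
   print(one_hot_cell("*", character_set))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
   print(one_hot_cell("#", character_set))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

The second call shows why completeness matters: a wall drawn with ``#``
would look exactly like open space to this agent.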
|
||||
|
||||
If you omit this field, ``retro-gamer`` will run a brief exploration
|
||||
phase before training to discover which characters actually appear.
|
||||
The number of exploration turns is controlled by the
|
||||
``exploration_turns`` hyperparameter.
|
||||
|
||||
``spatial``
|
||||
~~~~~~~~~~~
|
||||
|
||||
Whether to treat the board as a spatial scene (default: ``true``). A
|
||||
spatial game uses a *convolutional neural network* (CNN) that can
|
||||
detect patterns in the relative arrangement of characters. A
|
||||
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
|
||||
has no built-in sense of which cells are adjacent. Set to ``false`` for games where
|
||||
position is irrelevant.
|
||||
|
||||
Once you have written this section, create the training run directory:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create \
|
||||
--game retro.examples.snake \
|
||||
--output runs/snake/
|
||||
|
||||
Created training run at runs/snake/config.toml
|
||||
game : retro.examples.snake
|
||||
board_size : 32×16
|
||||
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
|
||||
reward : score
|
||||
characters : ['@', '*', '>', '<', '^', 'v']
|
||||
architecture: CNN (spatial)
|
||||
|
||||
``retro-gamer create`` reads your game metadata directly from
|
||||
``pyproject.toml`` and writes it—along with all hyperparameters—to
|
||||
``runs/snake/config.toml``.
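
To see what reading the description involves, the sketch below parses a
``[tool.retro-gamer]`` table with the standard library's ``tomllib``
(Python 3.11+). It is an illustration only: ``read_game_description`` is
a hypothetical function, the project name is made up, and the checks
simply mirror the rules stated in the reference (two required fields,
two defaults, and the reward key excluded from ``observe_state``).

.. code-block:: python

   import tomllib  # standard library in Python 3.11+

   PYPROJECT_TEXT = '''
   [project]
   name = "my-snake"

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
   '''

   def read_game_description(pyproject_text):
       """Return the [tool.retro-gamer] table with documented defaults filled in."""
       table = tomllib.loads(pyproject_text).get("tool", {}).get("retro-gamer", {})
       for required in ("actions", "reward"):
           if required not in table:
               raise KeyError(f"[tool.retro-gamer] is missing required field {required!r}")
       description = {"spatial": True, "observe_state": [], **table}
       if description["reward"] in description["observe_state"]:
           raise ValueError("the reward key must not appear in observe_state")
       return description

   description = read_game_description(PYPROJECT_TEXT)
   print(description["reward"])   # score
   print(description["spatial"])  # True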
|
||||
|
||||
Training the agent
|
||||
------------------
|
||||
|
||||
With the ``config.toml`` in place, start training:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
Training for 1000 episodes…
|
||||
Done. Checkpoints in runs/snake/checkpoints/
|
||||
|
||||
Training saves checkpoints every 100 episodes and a ``final.pt``
|
||||
checkpoint when complete. You can follow progress in the training log:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% tail -f runs/snake/training.log
|
||||
|
||||
The log shows one line per episode:
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
|
||||
[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
|
||||
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6058 avg_loss=0.001204
|
||||
|
||||
- **total_reward**: the total score earned during the episode (how many
|
||||
apples the snake ate, for Snake).
|
||||
- **steps**: how many turns the episode lasted.
|
||||
- **epsilon**: the current exploration rate. Early in training this is
|
||||
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
|
||||
- **avg_loss**: the average temporal-difference error across training
|
||||
steps in this episode. A decreasing loss generally indicates that the
|
||||
Q-value estimates are converging.
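
The temporal-difference error behind ``avg_loss`` can be written out for
a single experience. The sketch below uses the standard one-step
Q-learning target. ``td_error`` is a hypothetical helper; in a DQN the
next-state values would typically come from the target network, and the
exact loss function applied to this error is not specified here.

.. code-block:: python

   def td_error(q_current, reward, next_q_values, gamma=0.99, done=False):
       """One-step temporal-difference error for a single experience."""
       # No future value is counted once the episode has ended.
       future = 0.0 if done else gamma * max(next_q_values)
       target = reward + future
       return target - q_current

   # The network predicted 0.5 for the action taken; the snake then ate an apple.
   error = td_error(q_current=0.5, reward=1.0, next_q_values=[0.2, 0.4, 0.1, 0.0, 0.3])
   print(round(error, 3))  # 0.896

A positive error means the outcome was better than the network
predicted, so training nudges that Q-value upward.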
|
||||
|
||||
Resuming training
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
Training can be resumed from a checkpoint:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
|
||||
|
||||
Watching a trained agent play
|
||||
------------------------------
|
||||
|
||||
To watch a trained agent play the game in your terminal:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint final
|
||||
|
||||
You can substitute any checkpoint name:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0100
|
||||
|
||||
Press Enter or Escape to quit.
|
||||
|
||||
Comparing agents trained at different checkpoints is a useful activity:
|
||||
the agent at episode 100 has learned *something*, but typically much
|
||||
less than the agent at episode 500. Articulating *what* the earlier
|
||||
agent has and has not learned, and *why*, is productive reasoning about
|
||||
the training process.
|
||||
|
||||
Inspecting a run
|
||||
----------------
|
||||
|
||||
To review the configuration and recent training progress for a run:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer info runs/snake/
|
||||
Game module : retro.examples.snake
|
||||
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
|
||||
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
|
||||
|
||||
Last 5 episodes:
|
||||
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0500 avg_loss=0.000312
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0500 avg_loss=0.000289
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0500 avg_loss=0.000274
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0500 avg_loss=0.000261
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0500 avg_loss=0.000248
|
||||
|
||||
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
|
||||
|
||||
Adjusting hyperparameters
|
||||
--------------------------
|
||||
|
||||
The training hyperparameters can be changed by editing ``config.toml``
|
||||
before training, or by passing them as options to ``retro-gamer
|
||||
create``. Common adjustments and their effects:
|
||||
|
||||
``training_episodes``: How long to train. More episodes give the
|
||||
agent more time to learn, but also take longer to run.
|
||||
|
||||
``epsilon_decay``: How quickly exploration decreases. A faster
|
||||
decay (smaller ``epsilon_decay``) means the agent commits to its early
|
||||
Q-estimates before they are fully reliable. A slower decay (larger
|
||||
``epsilon_decay``, closer to 1) gives the agent more time to explore
|
||||
but may waste training time on random actions.
|
||||
|
||||
``learning_rate``: How large the weight updates are at each
|
||||
training step. A large learning rate learns fast but may overshoot;
|
||||
a small learning rate is stable but slow.
|
||||
|
||||
``gamma``: The discount factor for future rewards. Closer to 1
|
||||
means the agent values long-term consequences; closer to 0 makes the
|
||||
agent focus on immediate reward.
|
||||
|
||||
``n_layers`` and ``layer_size``: The depth and width of the MLP
|
||||
head. Larger networks can represent more complex Q-functions but are
|
||||
slower to train and may overfit.
|
||||
|
||||
``prioritize_experiences``: Whether to use prioritized experience
|
||||
replay. This often improves sample efficiency but is slightly slower
|
||||
per step.
|
||||
|
||||
Questions for investigation
|
||||
----------------------------
|
||||
|
||||
The following questions are intended to guide productive investigation
|
||||
using ``retro-gamer``. They are chosen because they have specific,
|
||||
reasoned answers that connect what you know about the game to the
|
||||
concepts underlying the training algorithm.
|
||||
|
||||
1. **Character set completeness.** Train two agents: one with the full
|
||||
character set, one missing a character that frequently appears on the
|
||||
board. Compare their performance. What did the second agent lose the
|
||||
ability to perceive, and how did that affect its behavior?
|
||||
|
||||
2. **Spatial vs. non-spatial.** Train the same game with ``spatial =
|
||||
true`` and ``spatial = false``. How does training efficiency differ?
|
||||
Can you explain the difference in terms of what each architecture
|
||||
can and cannot learn?
|
||||
|
||||
3. **Reward shaping.** If the game currently rewards only the final
|
||||
objective (e.g., reaching a goal), add intermediate rewards for
|
||||
sub-goals. How does this change the early training curve? Does it
|
||||
change the agent's final strategy?
|
||||
|
||||
4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
|
||||
(so the agent commits to exploiting early) and a very slow one (so
|
||||
exploration continues for a long time). How do the training curves
|
||||
differ? What is the agent doing in each case when ``epsilon`` is low?
|
||||
|
||||
5. **Checkpoint comparison.** Load the agent at episode 100 and at
|
||||
episode 1000 and watch each play the same game. What has the later
|
||||
agent learned that the earlier one has not? How would you describe
|
||||
this difference to someone who does not know about neural networks?
|
||||