Updates across the board

Chris Proctor
2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions

30
docs/api.rst Normal file

@@ -0,0 +1,30 @@
API Reference
=============
All classes below are importable directly from ``retro_gamer``.
Game description
----------------
.. autoclass:: retro_gamer.GameMetadata
:members: from_pyproject, from_dict, validate
Training
--------
.. autoclass:: retro_gamer.DQNTrainer
:members: train, load_checkpoint
Environment
-----------
.. autoclass:: retro_gamer.GameEnvironment
:members: reset, step
Using a trained model
---------------------
.. autoclass:: retro_gamer.TrainedPolicy
:members: get_action
.. autoclass:: retro_gamer.PolicyInput


@@ -343,12 +343,13 @@ If the character set is not specified, ``retro-gamer`` runs a brief
exploration phase before training to observe which characters actually
appear.
In addition to the board, the agent can observe numerical values from
the game's state dictionary via ``observe_state``. These are
appended to the end of the observation vector. The reward key must
not be included in ``observe_state``: it would give the agent direct
access to its own performance signal, which is not a realistic observation
in most game contexts and can cause training pathologies.
In addition to the board, the agent can observe extra computed values
from ``game.state``. Listing keys in the ``observe_state`` option of
``[preprocessing]`` causes those values to be appended to the
observation vector after the board encoding. This is where feature
engineering decisions live: what derived quantities should the agent
see, and does giving it those values give it an advantage a human
player would not have?
Neural network architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -356,7 +357,8 @@ Neural network architectures
The architecture of the Q-network—the number and arrangement of its
layers—is one of the most consequential choices in DQN training.
``retro-gamer`` selects an architecture based on the ``spatial``
field in the game description and generates a plain-language rationale.
option in ``[preprocessing]`` of ``config.toml`` and generates a
plain-language rationale.
**Multilayer perceptrons (MLP)**
@@ -379,8 +381,7 @@ that these numbers were arranged in a 2D grid, or that spatially
adjacent cells are related. This is appropriate when the game's
observation is better understood as a collection of independent
readings—a set of meters or status indicators—rather than as a spatial
scene. Set ``spatial = false`` in the game description to use this
architecture.
scene. ``spatial = false`` (the default) selects this architecture.
**Convolutional neural networks (CNN)**
@@ -405,8 +406,8 @@ channels respectively, kernel size 3, padding 1) followed by a
flattening step and an MLP head. The padding ensures that the spatial
dimensions are preserved through the convolution, so the output of the
second conv layer has shape (64, H, W), which is then flattened and
passed to the MLP. Set ``spatial = true`` (the default) to use this
architecture.
passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
use this architecture.
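The shape claim above follows from the standard convolution output-size formula. The sketch below checks it in plain Python. It assumes a stride of 1 (the text does not state the stride) and a hypothetical 16-row by 32-column board; ``conv_output_size`` and ``cnn_output_shape`` are illustrative names, not part of ``retro-gamer``.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def cnn_output_shape(height, width, out_channels=64, n_conv_layers=2):
    """Shape after stacked conv layers that all use kernel 3, padding 1."""
    for _ in range(n_conv_layers):
        height = conv_output_size(height)
        width = conv_output_size(width)
    return (out_channels, height, width)


# Hypothetical board: 16 rows (H) by 32 columns (W).
shape = cnn_output_shape(16, 32)
flattened = shape[0] * shape[1] * shape[2]  # length of the vector fed to the MLP head
print(shape, flattened)
```

With kernel 3 and padding 1, each layer maps a dimension of size ``n`` to ``n + 2 - 3 + 1 = n``, which is why H and W survive both layers unchanged.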
Connecting architecture to game metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -416,15 +417,17 @@ follow from the game description you provide. This connection is worth
making explicit, because understanding it is one of the main paths into
understanding why neural network architecture matters.
- If ``spatial = true``, the CNN can detect local patterns—which characters
are adjacent to which—without needing to see every possible arrangement.
This is appropriate for games like Snake, where the snake's direction
and the apple's relative position are spatially encoded.
- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
local patterns—which characters are adjacent to which—without needing
to see every possible arrangement. This is appropriate for games like
Snake, where the snake's direction and the apple's relative position
are spatially encoded.
- If ``spatial = false``, the MLP treats the board as a flat vector. This
may be appropriate for games that use the character grid primarily as a
display rather than a spatial field—for example, a game where characters
appear in fixed, non-interacting positions as status indicators.
- If ``spatial = false`` (the default), the MLP treats the board as a
flat vector. This may be appropriate for games that use the character
grid primarily as a display rather than a spatial field—for example,
a game where characters appear in fixed, non-interacting positions as
status indicators.
- The ``character_set`` determines the depth (C) of the board tensor.
More characters mean more numbers per cell and a larger input to the
@@ -432,11 +435,185 @@ understanding why neural network architecture matters.
wastes capacity; a character set that omits relevant characters forces
the agent to treat different things as the same.
- The ``observe_state`` fields are appended to the flattened CNN output
before the MLP head. This allows the agent to use explicit state
variables—a timer, a lives count—alongside the visual board
representation.
- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
to the flattened board output before the MLP head. This allows the
agent to use computed values—a direction to the goal, a distance, a
timer—alongside the visual board representation.
These relationships are not incidental features of the implementation.
They are the reason the game description matters: every field you fill
in shapes what the agent can perceive and therefore what it can learn.
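To make the relationships above concrete, here is a minimal sketch of how a character grid might become a one-hot tensor of depth C with state values appended. It is an illustration, not the ``retro-gamer`` implementation: ``one_hot_board`` and ``build_observation`` are hypothetical helpers, and the board is flattened directly (the MLP case) rather than passed through a CNN first.

```python
def one_hot_board(board_rows, character_set):
    """Encode a grid of characters as C planes of H x W zeros and ones.

    A character missing from character_set sets no plane, so it looks
    exactly like an empty cell.
    """
    index = {ch: i for i, ch in enumerate(character_set)}
    height, width = len(board_rows), len(board_rows[0])
    planes = [[[0.0] * width for _ in range(height)] for _ in character_set]
    for y, row in enumerate(board_rows):
        for x, ch in enumerate(row):
            if ch in index:
                planes[index[ch]][y][x] = 1.0
    return planes


def build_observation(board_rows, character_set, state, observe_state):
    """Flatten the board planes, then append the observed state values."""
    planes = one_hot_board(board_rows, character_set)
    flat = [v for plane in planes for row in plane for v in row]
    flat.extend(float(state[key]) for key in observe_state)
    return flat


character_set = ["@", "*", ">", "<", "^", "v"]
board = ["    ",
         " @> ",
         "  # "]  # '#' is not in character_set
state = {"score": 0, "apple_dx": 1, "apple_dy": -1}  # hypothetical state keys
obs = build_observation(board, character_set, state, ["apple_dx", "apple_dy"])
print(len(obs))
```

The ``#`` cell contributes nothing to the encoding, which is the "treat different things as the same" failure described in the list above.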
Design rationale
----------------
This section explains the reasoning behind several design decisions in
``retro-gamer`` that go beyond technical necessity. Each choice was
made with a specific pedagogical goal: to create a tool that not only
trains agents, but also helps students build genuine understanding of
how and why the training process works.
Checkpoint compatibility and the "start fresh" workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a student changes the game description or network architecture
mid-training, ``retro-gamer`` refuses to resume and explains exactly
which fields changed and why they are incompatible. This behavior is
deliberate.
The immediate practical reason is correctness: if the character set
changes, the network's input layer changes size, and the saved weights
no longer correspond to any meaningful function. Loading them would
produce garbage behavior. If the reward signal changes, the Q-values
the network has accumulated are estimates of a *different* objective;
resuming would mislead the network, not help it.
But the deeper reason is pedagogical. The incompatibility check is a
moment of forced reflection. When a student sees::
character_set
was : ['@', '*', '>', '<', '^', 'v']
now : ['@', '*', '>', '<', '^', 'v', '#']
why : the set of board characters (changes input layer size)
they are confronted with the concrete consequence of a description
change. The character set is not a label; it determines the shape of
the tensor the network operates on. Changing it invalidates the
network the same way changing the rules of chess would invalidate a
chess engine. The error message is designed to make this connection
legible, not just to block a problematic action.
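A report of this kind can be produced by comparing the saved description with the current one, field by field. The sketch below is one way to do that under stated assumptions: ``describe_incompatibilities`` is a hypothetical helper, and the explanation string for ``reward`` is invented for illustration; only the ``character_set`` explanation comes from the message shown above.

```python
def describe_incompatibilities(saved, current, reasons):
    """Return a was/now/why report for every field that differs."""
    lines = []
    for field, why in reasons.items():
        if saved.get(field) != current.get(field):
            lines.append(field)
            lines.append(f"  was : {saved.get(field)!r}")
            lines.append(f"  now : {current.get(field)!r}")
            lines.append(f"  why : {why}")
    return "\n".join(lines)


reasons = {
    "character_set": "the set of board characters (changes input layer size)",
    "reward": "the state key used as reward (changes the objective)",  # hypothetical wording
}
saved = {"character_set": ["@", "*", ">", "<", "^", "v"], "reward": "score"}
current = {"character_set": ["@", "*", ">", "<", "^", "v", "#"], "reward": "score"}

report = describe_incompatibilities(saved, current, reasons)
print(report)
```

Only fields that actually changed appear in the report, so an unchanged ``reward`` stays silent.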
The ``retro-gamer clean`` command exists to make the recovery path
explicit: you can start fresh, and you should. There is no partial
salvage. This mirrors an important truth about RL training: some
decisions are foundational, and changing them means starting over.
Students who encounter this—who have to decide whether a change is
worth the cost of retraining—are reasoning about the architecture in
a way that purely reading about it does not produce.
The distinction between incompatible changes (game description,
network architecture) and safe changes (hyperparameters like learning
rate and epsilon) is also pedagogically useful. It encodes, in the
tool itself, the distinction between *what the agent is learning* and
*how it is learning*. Students who ask "can I change the learning rate
without retraining?" are asking a question with a precise answer, and
answering it correctly requires understanding why the learning rate is
different in kind from the character set.
Checkpoint-level logging
~~~~~~~~~~~~~~~~~~~~~~~~~
Early versions of ``retro-gamer`` logged one line per episode. This
was accurate but not very useful: a run of 1,000 episodes produces
1,000 log lines, most of which are noise. Individual episodes vary
widely due to randomness in both the game and the agent's exploration,
making it hard to see the underlying trend.
The current format logs one line per checkpoint—once every 100
episodes—using averages over that window. This design serves several
goals:
**Noise reduction.** Single-episode rewards are highly variable,
especially when epsilon is high and the agent is behaving randomly.
Averaging over 100 episodes smooths out this variance and makes
genuine trends visible.
**Interpretive scaffolding.** The log line includes ``epsilon``
alongside ``avg_reward``, so students can directly see the
relationship between exploration rate and performance. Early entries
with low ``avg_reward`` and high ``epsilon`` invite the question:
"is this bad performance, or just exploration?" The answer—that random
behavior is expected when epsilon is near 1—is readable from the log
itself.
**Timing information.** Each log line records both the elapsed time
for that 100-episode interval and the total training time accumulated
across all sessions. This serves two purposes. Practically, it lets
students estimate how long continued training will take. Conceptually,
it makes the cost of training tangible: RL is not instant, and the
log makes the time investment visible.
**Session continuity.** When training resumes from a checkpoint, a
header line marks the break (``=== Resumed from ep_0500.pt ===``).
This lets the full log tell the story of a run across multiple
sessions, preserving the history of when training happened even if the
student stops and restarts many times.
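The noise-reduction idea can be sketched in a few lines: collect per-episode rewards, then report one average per 100-episode window. This is an illustration of the averaging only; ``checkpoint_summaries`` is a hypothetical helper and the reward values are made up.

```python
from statistics import mean


def checkpoint_summaries(episode_rewards, window=100):
    """Yield (last_episode_number, average_reward) for each complete window."""
    for start in range(0, len(episode_rewards) - window + 1, window):
        chunk = episode_rewards[start:start + window]
        yield start + window, mean(chunk)


# Made-up rewards that swing between 0 and 9 from one episode to the next.
rewards = [i % 10 for i in range(250)]
summaries = list(checkpoint_summaries(rewards))
for episode, avg_reward in summaries:
    print(f"ep {episode:04d}  avg_reward={avg_reward:.2f}")
```

The trailing 50 episodes do not form a complete window, so they produce no line until the next checkpoint is reached.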
The stop-watch-adjust-resume workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``retro-gamer`` is designed around a workflow that the log format and
checkpoint system both support: stop training, watch the agent play,
decide what to change, and resume.
This workflow is pedagogically productive because it gives students
a *reason* to look at the log and a *reason* to think about
hyperparameters. Watching the agent at episode 100 play erratically,
then watching the agent at episode 500 navigate toward the apple more
consistently, is not just satisfying—it raises concrete questions.
Why did the agent improve? What changed between those two checkpoints?
What would happen if we gave it more time, or adjusted the reward?
These questions are best answered by consulting the log. The log in
turn connects the behavior the student observed to numbers they can
reason about: a decreasing loss, a declining epsilon, a rising average
reward. The three—visual observation, log interpretation, and
conceptual understanding—form a feedback loop that is much harder to
close if training is treated as a black box that produces only a final
model.
The fact that training can be stopped and resumed freely, with no
penalty and no extra flags, removes friction from this cycle. Students
who feel they can experiment—stop, look, think, resume—are more
likely to do so than students who feel they have to commit to a full
training run before seeing results.
Reward design as game description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
the game's state dictionary, not a function or a formula. This is
another deliberate design choice. The reward signal is defined in the
game code—in how the score changes when certain events occur—not in
the training configuration.
This forces students to engage with the reward where it lives: in the
game logic. If a student wants to change the reward structure, they
must change the game. This connects the RL concept of reward shaping
to the concrete act of writing Python code that updates a score. The
question "what reward should the agent get for moving toward the
apple?" becomes "what code should run when the snake moves?"—and
answering it requires reasoning about what behavior you want to
encourage and how a small, frequent signal compares to a large,
infrequent one.
The distinction between reward-signal design (a pedagogically rich
question with many possible answers) and reward-field specification
(a technical detail) is preserved in the interface. Students configure
the *key* to track; they design the *signal* in the game itself.
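The contrast between a small, frequent signal and a large, infrequent one can be seen by looking at how a tracked score changes from step to step. The sketch assumes the per-step reward is the change in the tracked key between consecutive steps; that is an assumption for illustration, and ``rewards_from_scores`` is a hypothetical helper. The score sequences are invented.

```python
def rewards_from_scores(scores):
    """Per-step reward, taken here as the change in the tracked state key."""
    return [after - before for before, after in zip(scores, scores[1:])]


# Score changes only when the apple is eaten (large, infrequent signal).
sparse_scores = [0, 0, 0, 0, 10]
# Game code also adds 1 for each step toward the apple (small, frequent signal).
shaped_scores = [0, 1, 2, 3, 13]

sparse_rewards = rewards_from_scores(sparse_scores)
shaped_rewards = rewards_from_scores(shaped_scores)
print(sparse_rewards, shaped_rewards)
```

In both cases the training configuration is identical (``reward = "score"``); the difference lives entirely in the game code that updates the score.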
Metadata as game description, not training configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The game description lives in ``[tool.retro-gamer]`` inside the
game's own ``pyproject.toml``, not in a separate training
configuration file. This placement encodes a claim: the character set,
the action space, and the reward signal are *properties of the game*,
not settings for the trainer.
A student who edits the character set is not tweaking the trainer;
they are more accurately describing their game. This framing matters
because it positions the student as the expert on the game—which they
are—and the trainer as a tool that depends on the accuracy of that
description. Errors in the description are not configuration mistakes;
they are inaccurate descriptions of something the student knows.
When a student omits a character from the character set and the agent
fails to notice that character on the board, the diagnostic question
is not "what went wrong with training?" but "is my description of the
game correct?" This is a more productive question, because it connects
the student's domain knowledge (they know what characters appear and
why they matter) to the technical representation (one-hot encoding
requires knowing in advance which characters to encode). The fix is
not to adjust a hyperparameter; it is to describe the game more
accurately.


@@ -1,9 +1,18 @@
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
project = 'retro-gamer'
copyright = '2025, Chris Proctor'
author = 'Chris Proctor'
release = '0.1.0'
extensions = []
extensions = [
'sphinx.ext.autodoc',
]
autodoc_member_order = 'bysource'
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']


@@ -31,17 +31,22 @@ with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
The retro-games framework must also be installed; see its documentation
for instructions.
If you are working through a Making With Code lab, ``retro-gamer`` is
already installed in your project environment — skip ahead to
:ref:`installation`.
**Add to a project** using ``uv`` or ``pip``:
.. code-block:: console
% uv add retro-gamer
% pip install retro-gamer
To install from source (for development or to use the latest changes):
**Install as a global tool** (available everywhere, no project needed):
.. code-block:: console
% git clone https://github.com/cproctor/retro-gamer
% cd retro-gamer
% pip install -e .
% uv tool install retro-gamer
Verify the installation by checking the command-line tool:
@@ -65,5 +70,8 @@ Verify the installation by checking the command-line tool:
introduction
background
walkthrough
troubleshooting
reference
integration
api
contributing

186
docs/integration.rst Normal file

@@ -0,0 +1,186 @@
Integrating a Trained Model
===========================
Once you have trained a model, you can use it in two ways:
- **PolicyInput** — the model replaces the keyboard, driving an existing
player-controlled agent. Use this to watch a trained agent play, or to
run automated evaluations.
- **TrainedPolicy in play_turn** — call ``get_action(game)`` from inside any
agent's ``play_turn`` to embed the model as an autonomous character (for
example, a smart enemy) alongside human-controlled or other agents.
Loading a trained model
-----------------------
Both approaches start by creating a :class:`retro_gamer.TrainedPolicy`:
.. code-block:: python
from retro_gamer import TrainedPolicy
ai = TrainedPolicy("runs/snake/")
This reads ``config.toml``, rebuilds the network, and loads the latest
checkpoint. To load a specific checkpoint instead:
.. code-block:: python
ai = TrainedPolicy("runs/snake/", checkpoint="ep_0500")
PolicyInput: model as player
----------------------------
:class:`retro_gamer.PolicyInput` is an input source — it implements the same
interface as keyboard input, but chooses actions using the trained model. Pass
it to ``game.play()`` and everything else works exactly as usual:
.. code-block:: python
from retro.examples.snake import create_game
from retro_gamer import TrainedPolicy, PolicyInput
ai = TrainedPolicy("runs/snake/")
game = create_game()
game.play(input_source=PolicyInput(ai, game))
On each turn, ``PolicyInput`` observes the current board and game state, runs
the model, and sends the chosen action to the game exactly as if the player
had pressed that key.
TrainedPolicy in play_turn: model as autonomous character
---------------------------------------------------------
To embed a trained model as an autonomous game character, create a
``TrainedPolicy`` at module level and call ``get_action(game)`` from inside
the agent's ``play_turn``. Placing it at module level means the model is
loaded from disk once — not once per episode.
.. code-block:: python

    from retro.game import Game
    from retro.examples.snake import Apple, SnakeHead
    from retro_gamer import TrainedPolicy

    _ai = TrainedPolicy("runs/snake/")

    class AISnake(SnakeHead):
        def handle_keystroke(self, k, game):
            pass  # ignore keyboard

        def play_turn(self, game):
            key = _ai.get_action(game)
            if key == 'KEY_RIGHT':
                self.direction = (1, 0)
            elif key == 'KEY_LEFT':
                self.direction = (-1, 0)
            elif key == 'KEY_UP':
                self.direction = (0, -1)
            elif key == 'KEY_DOWN':
                self.direction = (0, 1)
            super().play_turn(game)

    human_snake = SnakeHead()
    ai_snake = AISnake()
    ai_snake.position = (16, 8)
    apple = Apple()
    game = Game([human_snake, ai_snake, apple], {"score": 0}, board_size=(32, 16))
    apple.relocate(game)
    game.play()
Training an enemy model
~~~~~~~~~~~~~~~~~~~~~~~~
You can use the same training pipeline to produce a model for an enemy agent.
``retro-gamer`` does not care *which* character it is training — it only cares
that it can control one character through the keyboard and read a reward signal
from the game state. To train an enemy:
1. **Create an enemy-perspective game variant.** Write (or add) a
``create_game`` function — in a separate file, or alongside your main one —
where the enemy agent is the keyboard-driven character and the reward key
in the game state reflects the enemy's objective (for example, a bonus for
catching the player). The human player can be absent, replaced by a
random-moving agent, or driven by a ``TrainedPolicy`` once you have a trained
player model.
.. code-block:: python

    from retro.game import Game

    # EnemyAgent and RandomPlayer are agent classes you define in your own game.
    def create_enemy_training_game():
        enemy = EnemyAgent()     # the character the trainer will control
        player = RandomPlayer()  # a stand-in; no human involved
        game = Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
        return game
2. **Train normally against this variant.**
.. code-block:: console
% retro-gamer create --game my_game:create_enemy_training_game \
--output runs/enemy/
% retro-gamer train runs/enemy/
3. **Embed the trained model in your main game** using ``get_action``, exactly
as shown above.
.. note::
Because ``retro-gamer`` injects actions through the game's global input
source, *all* keyboard-listening agents in the training game will receive
the trainer's keystrokes. The cleanest approach is to make the enemy the
only keyboard-driven character in the training variant — any other
characters should advance on their own without reading from the keyboard.
Adversarial training
~~~~~~~~~~~~~~~~~~~~~
Once you have separate training runs for the player and the enemy, you can
train them *against each other* iteratively. The idea is simple: train the
player against the current enemy model, then train the enemy against the
updated player model, and repeat. Each side is forced to improve against an
increasingly capable opponent.
The key technique is to load the opponent's model at module level in each
training game variant, so it is loaded from disk once per run rather than
once per episode:
.. code-block:: python

    # enemy_training_game.py
    from retro.game import Game
    from retro_gamer import TrainedPolicy

    _player = TrainedPolicy("runs/player/")  # loaded once when the module is imported

    # EnemyAgent and AIPlayer are agent classes you define in your own game.
    def create_game():
        enemy = EnemyAgent()
        player = AIPlayer(_player)  # uses _player.get_action in play_turn
        return Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
You then alternate training runs:
.. code-block:: console
% retro-gamer train runs/player/ # train player against current enemy
% retro-gamer train runs/enemy/ # train enemy against updated player
% retro-gamer train runs/player/ # train player again
# ...
How many episodes to run before switching is itself a design decision: too
few and neither model has time to adapt; too many and each side overfits to
its current opponent. Watching how the strategies evolve — and asking *why*
each model behaves as it does at each stage — connects directly to concepts
in multi-agent reinforcement learning and adversarial training.
Differences between the two approaches
---------------------------------------
.. list-table::
:header-rows: 1
:widths: 35 65
* - ``PolicyInput``
- ``TrainedPolicy`` in ``play_turn``
* - Replaces human input for the whole game
- One autonomous agent among many
* - Game code is unchanged
- Agent's ``play_turn`` calls ``get_action``
* - One model drives all player-controlled agents
- Each agent instance has its own model
* - Simpler — just pass to ``game.play()``
- More flexible — mix human and AI characters


@@ -100,12 +100,12 @@ matters.
**Observation design** determines what information is available to the
agent. If you leave a character out of the ``character_set``, the agent
will not distinguish it from empty space. If you include a game-state
variable in ``observe_state``, the agent can see it directly rather than
having to infer it from the board. The consequences of these choices for
what the agent can learn are reasonably predictable, and making and
checking those predictions is exactly the kind of reasoning the tool is
designed to support.
will not distinguish it from empty space. If the game module defines a
``get_state()`` function, the agent also receives those computed values
as part of its observation. The consequences of these choices for what
the agent can learn are reasonably predictable, and making and checking
those predictions is exactly the kind of reasoning the tool is designed
to support.
**Reward engineering** is the craft of specifying what counts as doing
well in a way the agent can actually optimize. Using score as the reward


@@ -17,8 +17,6 @@ A complete example for the Snake game:
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []
You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.
@@ -65,54 +63,156 @@ If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.
``spatial``
~~~~~~~~~~~
Preprocessing options
---------------------
**Optional; default ``true``.** Whether to treat the board as a 2D
spatial scene. When ``true``, the trainer uses a convolutional neural
network (CNN) that can detect patterns in the relative positions of
characters. When ``false``, the trainer uses a multilayer perceptron
(MLP) that sees the board as a flat list of numbers without positional
structure.
Preprocessing options live in the ``[preprocessing]`` section of a run's
``config.toml``. They control how the game's board and state are
transformed into the observation vector that the neural network sees.
``retro-gamer create`` writes sensible defaults; you can edit them by
hand before running ``retro-gamer train``.
.. note::
Changes to any ``[preprocessing]`` option—or to the game description
fields above—make existing checkpoints incompatible. Run
``retro-gamer clean`` before retraining after such changes.
``spatial`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to treat the board as a 2D spatial scene. When ``true``, the
trainer uses a convolutional neural network (CNN); when ``false``, a
multilayer perceptron (MLP) that sees the board as a flat list of
numbers.
``board`` (default: ``true``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to include the board encoding in the observation vector. Set
to ``false`` to train on game state variables only, with no board at
all. This is useful for games with small, enumerable state spaces where
a lookup table (classic Q-learning) is sufficient.
When ``board = false``:
- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
- At least one key must be listed in ``observe_state``.
- ``character_set`` is not required and character discovery is skipped.
.. code-block:: toml
spatial = true
[preprocessing]
board = false
observe_state = ["board_state"]
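The constraints listed for ``board = false`` can be expressed as a small check over the ``[preprocessing]`` table once it has been parsed into a dictionary. This is a sketch of the rules as stated above, not the validation code ``retro-gamer`` actually runs; ``check_preprocessing`` is a hypothetical name.

```python
def check_preprocessing(options):
    """Return a list of problems with a [preprocessing] table (empty if none)."""
    board = options.get("board", True)        # documented default: true
    spatial = options.get("spatial", False)   # documented default: false
    observe_state = options.get("observe_state", [])
    problems = []
    if not board and spatial:
        problems.append("spatial must be false when board is false")
    if not board and not observe_state:
        problems.append("observe_state needs at least one key when board is false")
    return problems


valid = check_preprocessing({"board": False, "observe_state": ["board_state"]})
invalid = check_preprocessing({"board": False, "spatial": True})
print(valid, invalid)
```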
``observe_state``
~~~~~~~~~~~~~~~~~
``observe_state`` (default: ``[]``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Optional; default ``[]``.** A list of keys from the game's state
dictionary to append to the observation vector. The values must be
numbers (integers, floats, or booleans). The reward key must not
appear in this list.
A list of keys from ``game.state`` to include in the observation
vector, appended after the board encoding (or as the entire
observation when ``board = false``). Scalar values contribute one
element each; list or tuple values are flattened.
.. code-block:: toml
observe_state = ["lives", "level"]
observe_state = ["apple_dx", "apple_dy"]
The keys must be present in ``game.state`` at every step, initialized
in ``create_game()`` before the game starts. All values that are lists
or tuples must always have the same length from episode to episode.
.. warning::
``observe_state`` keys must be initialized to their final shape in
``create_game()`` before the game starts. If a key is absent or its
list length changes between episodes, training will crash with an
error explaining which key changed and by how much. This happens
because the neural network's input layer has a fixed size determined
at the start of training; it cannot adapt to a changing observation
shape mid-run.
Always initialize every observed key with a placeholder of the
correct type and length before the first ``game.step()`` call.
``observe_state_sizes`` (auto-discovered)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A table mapping each ``observe_state`` key to its flat size (``1`` for
scalars, ``N`` for sequences of length N). This is written automatically
to ``config.toml`` the first time ``retro-gamer train`` runs, after the
trainer samples ``game.state`` to discover the actual sizes:
.. code-block:: toml
observe_state_sizes = {board_state = 9}
You do not need to set this manually. Once written, it is used to
detect changes in state shape when resuming training—an incompatible
change here requires running ``retro-gamer clean`` and starting fresh.
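The size discovery described above amounts to measuring each observed value once and comparing later measurements against the recorded table. A minimal sketch, assuming one level of flattening as the text describes; ``flat_size``, ``discover_sizes``, and ``changed_keys`` are hypothetical helpers, not ``retro-gamer`` functions.

```python
def flat_size(value):
    """1 for a scalar, N for a list or tuple of length N."""
    if isinstance(value, (list, tuple)):
        return len(value)
    return 1


def discover_sizes(state, observe_state):
    """Map each observed key to its flat size."""
    return {key: flat_size(state[key]) for key in observe_state}


def changed_keys(recorded_sizes, state):
    """Keys whose current flat size differs from the recorded one."""
    return {
        key: (size, flat_size(state[key]))
        for key, size in recorded_sizes.items()
        if flat_size(state[key]) != size
    }


# Hypothetical state, initialized to its final shape before the game starts.
state = {"score": 0, "board_state": [0] * 9, "apple_dx": 0}
sizes = discover_sizes(state, ["board_state", "apple_dx"])
print(sizes)

# Later, the list grows: the observation shape no longer matches.
state["board_state"] = [0] * 10
print(changed_keys(sizes, state))
```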
``egocentric`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When ``true``, the board observation is cropped to a square window
centred on a specific agent rather than the full board. This gives the
agent a local, first-person-like view and makes the observation
invariant to the agent's absolute position on the board.
Requires ``egocentric_player`` and ``egocentric_radius``.
``egocentric_player``
~~~~~~~~~~~~~~~~~~~~~~
The name of the agent to use as the centre of the egocentric crop.
Must match the ``name`` attribute of one of the game's agents.
.. code-block:: toml
egocentric_player = "Snake head"
``egocentric_radius``
~~~~~~~~~~~~~~~~~~~~~~
The half-side-length of the egocentric crop window, in cells. The
resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
values give the agent a wider view; smaller values focus it on the
immediate vicinity.
.. code-block:: toml
egocentric_radius = 8 # 17×17 window
When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
automatically updated to ``[2r+1, 2r+1]`` so the network is sized
correctly.
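The crop itself can be sketched as a window of side ``2r + 1`` taken around the agent's position. How cells beyond the board edge are filled is not stated above; this sketch assumes they are treated as empty space, and ``egocentric_crop`` is a hypothetical helper.

```python
def egocentric_crop(board_rows, center, radius, fill=" "):
    """Return a (2r+1) x (2r+1) window of the board centred on (x, y).

    Cells outside the board are filled with `fill` (an assumption).
    """
    cx, cy = center
    height, width = len(board_rows), len(board_rows[0])
    window = []
    for y in range(cy - radius, cy + radius + 1):
        row = []
        for x in range(cx - radius, cx + radius + 1):
            inside = 0 <= y < height and 0 <= x < width
            row.append(board_rows[y][x] if inside else fill)
        window.append("".join(row))
    return window


board = ["abcde",
         "fghij",
         "klmno"]
print(egocentric_crop(board, center=(2, 1), radius=1))
print(egocentric_crop(board, center=(0, 0), radius=1))
```

Because the window size depends only on the radius, the network input has the same shape wherever the agent stands.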
.. _hyperparameters:
Hyperparameters
---------------
Hyperparameters are stored in the ``[hyperparameters]`` section of
``config.toml``. They can be set via ``retro-gamer create`` options or
edited directly.
Hyperparameters are split across two sections of ``config.toml``:
- ``[model]`` — network architecture (changing these requires starting fresh)
- ``[training]`` — learning algorithm parameters (safe to change at any time)
Both sections can be set via ``retro-gamer create`` options or edited directly.
Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~
``learning_rate`` (default: ``0.001``)
``learning_rate`` (default: ``0.0001``)
The step size used by the Adam optimizer when updating network
weights. Larger values converge faster but may be unstable; smaller
values are more stable but slower.
``lr_decay`` (default: ``0.995``)
``learning_rate_decay`` (default: ``0.9999``)
Multiplicative decay applied to the learning rate after each
episode. The learning rate decreases geometrically over training,
helping the network fine-tune later without destabilizing early
progress.
progress. With the default value, the learning rate decays to about
13 % of its starting value after 20 000 episodes.
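The 13 % figure can be checked directly, since geometric decay after ``n`` episodes multiplies the starting value by ``decay ** n``. The helper names below are illustrative only.

```python
import math


def decayed_fraction(decay, episodes):
    """Fraction of the starting value left after `episodes` decay steps."""
    return decay ** episodes


def episodes_until(fraction, decay):
    """Smallest whole number of episodes after which at most `fraction` remains."""
    return math.ceil(math.log(fraction) / math.log(decay))


remaining = decayed_fraction(0.9999, 20_000)
final_lr = 0.0001 * remaining          # starting from the default learning rate
half_life = episodes_until(0.5, 0.9999)
print(f"{remaining:.1%} of the starting value remains; lr is about {final_lr:.2e}")
print(f"the learning rate halves after roughly {half_life} episodes")
```

The same arithmetic applies to any multiplicative per-episode decay, so it is a quick way to sanity-check a decay value before committing to a long run.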
``gamma`` (default: ``0.99``)
The discount factor for future rewards. A value of 1.0 makes the
@@ -127,7 +227,7 @@ Exploration
random action with probability ``epsilon`` and exploits its current
Q-function with probability ``1 - epsilon``.
``epsilon_decay`` (default: ``0.995``)
``epsilon_decay`` (default: ``0.9997``)
Multiplicative decay applied to ``epsilon`` after each episode.
``epsilon_min`` (default: ``0.05``)
@@ -142,31 +242,33 @@ Memory and sampling
The number of experiences sampled from the replay buffer per
training step.
``memory_capacity`` (default: ``10000``)
``memory_capacity`` (default: ``50000``)
The maximum number of experiences the replay buffer can hold. When
full, the oldest experiences are discarded.
``prioritize_experiences`` (default: ``false``)
``prioritize_experiences`` (default: ``true``)
Whether to use prioritized experience replay. When ``true``,
experiences with larger TD errors are sampled more frequently.
This often improves sample efficiency at a modest computational
cost.
Network architecture
~~~~~~~~~~~~~~~~~~~~
Model architecture
~~~~~~~~~~~~~~~~~~
``n_layers`` (default: ``2``)
The number of hidden layers in the MLP head (for spatial games,
this follows the CNN; for non-spatial games, it is the full
network).
These live in the ``[model]`` section. Changing them requires starting fresh
(run ``retro-gamer clean`` before retraining).
``layer_size`` (default: ``128``)
The width (number of units) in each hidden layer.
``hidden_sizes`` (default: ``[128, 64]``)
A list of integers giving the size of each hidden layer in the MLP
head. The default creates two layers: 128 units then 64. For spatial
games this follows the CNN; for non-spatial games it is the full
network. Larger or deeper networks can represent more complex
Q-functions but train more slowly and may need more episodes.
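To see how ``hidden_sizes`` translates into layers, the sketch below lists the (inputs, outputs) pair of each fully connected layer and counts weights and biases. It assumes the final layer outputs one Q-value per action, as is usual for DQN, and uses a hypothetical input of 100 values; the helper names are illustrative.

```python
def mlp_layer_shapes(input_size, hidden_sizes, n_actions):
    """(in_features, out_features) for each fully connected layer."""
    sizes = [input_size, *hidden_sizes, n_actions]
    return list(zip(sizes, sizes[1:]))


def parameter_count(shapes):
    """Weights plus biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Hypothetical: 100 observation values, default hidden sizes, 4 actions.
default_shapes = mlp_layer_shapes(100, [128, 64], 4)
wider_shapes = mlp_layer_shapes(100, [512, 256], 4)
print(default_shapes, parameter_count(default_shapes))
print(wider_shapes, parameter_count(wider_shapes))
```

Comparing the two counts gives a rough sense of why larger networks train more slowly.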
Training duration
~~~~~~~~~~~~~~~~~
``training_episodes`` (default: ``1000``)
``training_episodes`` (default: ``20000``)
The total number of game episodes to run. Each episode runs until
the game ends or ``max_turns_per_episode`` turns have elapsed.
@@ -175,12 +277,18 @@ Training duration
indefinitely (for example, if the agent finds a way to avoid
dying).
``target_update_freq`` (default: ``100``)
``target_update_freq`` (default: ``500``)
How many training steps between updates of the target network.
More frequent updates make training targets move faster (less
stable); less frequent updates make them more stable but slower
to reflect new learning.
``train_every`` (default: ``4``)
Run one training step every N game steps. Higher values speed up
episode collection at the cost of fewer gradient updates per
experience. The default of 4 is a good balance for most games;
set to 1 to train on every step.
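The interaction between ``train_every`` and ``target_update_freq`` can be sketched with a simple counter. This assumes one training step every ``train_every`` game steps and one target-network update every ``target_update_freq`` training steps, as described above; the real trainer may differ in details such as warm-up, and ``count_updates`` is a hypothetical helper.

```python
def count_updates(game_steps, train_every=4, target_update_freq=500):
    """Count gradient steps and target-network syncs over a run of game steps."""
    training_steps = 0
    target_syncs = 0
    for step in range(1, game_steps + 1):
        if step % train_every == 0:
            training_steps += 1
            # The target network is refreshed on a training-step schedule.
            if training_steps % target_update_freq == 0:
                target_syncs += 1
    return training_steps, target_syncs


print(count_updates(100_000))                 # defaults
print(count_updates(100_000, train_every=1))  # train on every game step
```

With these assumptions, lowering ``train_every`` to 1 quadruples both the number of gradient updates and the number of target-network refreshes for the same amount of gameplay.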
Character discovery
~~~~~~~~~~~~~~~~~~~
@@ -207,23 +315,26 @@ game's ``pyproject.toml``; you do not pass it on the command line.
.. code-block:: console
% retro-gamer create --game MODULE --output DIR [OPTIONS]
% retro-gamer create --game GAME --output DIR [OPTIONS]
**Required options:**
- ``--game MODULE`` — Python module containing ``create_game()``
(e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
is read from the ``pyproject.toml`` found in or above the module's
source directory.
- ``--game GAME`` — Your game, specified as a file path or a Python
module name:
- File path: ``--game my_game.py`` or ``--game my_game/``
- Module name: ``--game retro.examples.snake``
The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
found in or above the game file.
- ``--output DIR`` — Directory to create for this training run.
**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
- ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
- ``--learning-rate F``
- ``--lr-decay F``
- ``--learning-rate-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
@@ -232,20 +343,40 @@ game's ``pyproject.toml``; you do not pass it on the command line.
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--train-every N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~
Train (or resume training) a DQN agent.
Train a DQN agent.
.. code-block:: console
% retro-gamer train RUN_DIR [--resume CHECKPOINT]
% retro-gamer train RUN_DIR
``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
create``. If ``--resume`` is given, training resumes from the specified
checkpoint file (relative or absolute path).
create``. If checkpoints already exist in ``RUN_DIR``, training
automatically resumes from the latest one so prior work is never lost.
If all configured episodes have already been completed, the command
prints a message and exits immediately. To keep training, increase
``training_episodes`` in ``config.toml`` and run again.
**Incompatible changes.** Some config changes make existing checkpoints
unusable. If you change any of the following, ``retro-gamer train`` will
detect the mismatch and refuse to resume, with a clear explanation:
- ``actions``, ``reward``, ``character_set``, ``board_size``
(``[metadata]``) — game description
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
``egocentric``, ``egocentric_player``, ``egocentric_radius``
(``[preprocessing]``) — observation encoding
- ``hidden_sizes`` (``[model]``) — network architecture
Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
safe and take effect immediately on the next training run.
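One plausible way to detect an incompatible change is to fingerprint only the settings listed above and compare the result with the fingerprint stored alongside a checkpoint. The sketch below illustrates that idea; it is an assumption about how such a check could work, not a description of ``retro-gamer``'s actual mechanism, and ``compatibility_fingerprint`` is a hypothetical name.

```python
import hashlib
import json

# Keys listed above as incompatible, grouped by config section.
INCOMPATIBLE_KEYS = {
    "metadata": ["actions", "reward", "character_set", "board_size"],
    "preprocessing": ["spatial", "board", "observe_state", "observe_state_sizes",
                      "egocentric", "egocentric_player", "egocentric_radius"],
    "model": ["hidden_sizes"],
}


def compatibility_fingerprint(config):
    """Hash only the settings that affect checkpoint compatibility."""
    relevant = {
        section: {key: config.get(section, {}).get(key) for key in keys}
        for section, keys in INCOMPATIBLE_KEYS.items()
    }
    encoded = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


old = {"model": {"hidden_sizes": [128, 64]}, "training": {"learning_rate": 0.001}}
safe = {"model": {"hidden_sizes": [128, 64]}, "training": {"learning_rate": 0.0001}}
breaking = {"model": {"hidden_sizes": [512, 256]}, "training": {"learning_rate": 0.001}}

# A learning-rate change leaves the fingerprint alone; a new architecture does not.
print(compatibility_fingerprint(old) == compatibility_fingerprint(safe))
print(compatibility_fingerprint(old) == compatibility_fingerprint(breaking))
```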
``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~
@@ -256,16 +387,32 @@ Watch a trained agent play the game in the terminal.
% retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
By default, the latest available checkpoint is loaded. Use
``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.
``retro-gamer clean``
~~~~~~~~~~~~~~~~~~~~~
Remove all checkpoints and the training log from a run directory.
.. code-block:: console
% retro-gamer clean RUN_DIR
Prompts for confirmation before deleting. Use ``--yes`` / ``-y`` to skip
the prompt. The ``config.toml`` is preserved so you can run
``retro-gamer train`` immediately to start fresh with the same settings.
Use this after making an incompatible change (see ``retro-gamer train``
above) or any time you want to restart training from scratch.
``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~~
Print a summary of a training run: metadata, hyperparameters, recent
episode log, and available checkpoints.
checkpoint log, and available checkpoints.
.. code-block:: console
@@ -285,60 +432,49 @@ contents:
└── checkpoints/
├── ep_0100.pt # model weights at episode 100
├── ep_0200.pt
├── ...
└── final.pt # model weights at training completion
└── ... # one file saved every 100 episodes
``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set and resolved hyperparameters) when
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
and ``train`` is the recommended way to adjust hyperparameters.
``retro-gamer train`` begins. It has five sections: ``[game]``,
``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
Editing ``config.toml`` between ``create`` and ``train`` is the
recommended way to adjust hyperparameters.
``training.log`` begins with the full architecture description
generated at training startup, followed by one line per episode in the
format::
``training.log`` begins with the full network architecture description,
then one line per checkpoint (every 100 episodes) in the format::
[EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
[ep_NNNN] ep=SSSS-NNNN avg_reward=F avg_steps=N epsilon=F avg_loss=F time=Xm Xs total=Xm Xs
Checkpoint files are PyTorch state dictionaries containing model
weights, optimizer state, the current epsilon, and the total number of
training steps completed. They can be loaded with
``retro-gamer play`` or directly with the Python API.
Each field averages over the episodes since the previous checkpoint:
- ``ep=SSSS-NNNN`` — episode range covered by this entry
- ``avg_reward`` — mean total reward per episode (positive = good)
- ``avg_steps`` — mean episode length in game turns
- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
stabilises). Huber loss equals ½·(q−t)² for small errors and |q−t|−½ for large
ones, so its gradient stays bounded even when Q-values are large. Values in the range
0–10 are typical; a slow downward trend over thousands of episodes is the
healthy pattern. A loss that grows without bound indicates a learning rate
that is too high.
- ``time`` — wall-clock time for this checkpoint interval
- ``total`` — cumulative training time across all sessions
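The Huber loss reported as ``avg_loss`` can be sketched for a single prediction as follows. The threshold ``delta = 1.0`` is an assumption that matches the quadratic-then-linear description above; the trainer's actual loss settings are not shown in this document.

```python
def huber_loss(q, t, delta=1.0):
    """Huber loss between a predicted Q-value q and its target t."""
    error = abs(q - t)
    if error <= delta:
        return 0.5 * error ** 2           # quadratic for small errors
    return delta * (error - 0.5 * delta)  # linear for large errors


print(huber_loss(1.5, 1.0))   # small error: quadratic branch
print(huber_loss(11.0, 1.0))  # large error: linear branch
```

A squared-error loss would give 50 for the second case; the linear branch is what keeps a single wildly wrong prediction from dominating the average.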
When training is resumed, a ``=== Resumed from ... ===`` line is appended
so the log records the full history of a run across multiple sessions.
Python API
----------
For advanced use, ``retro-gamer``'s components are importable as a
library.
library. See the :doc:`api` reference for full details.
.. code-block:: python
from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
from retro_gamer import GameMetadata, DQNTrainer
from retro.examples.snake import create_game
# Read metadata from [tool.retro-gamer] in the game's pyproject.toml
metadata = GameMetadata.from_pyproject("retro.examples.snake")
trainer = DQNTrainer(
create_game, metadata, "runs/snake/",
training_episodes=500,
n_layers=2,
layer_size=128,
)
trainer = DQNTrainer(create_game, metadata, "runs/snake/")
trainer.train()
``GameEnvironment`` provides a gym-style interface for stepping through
a game programmatically:
.. code-block:: python
from retro_gamer import GameEnvironment
env = GameEnvironment(create_game, metadata)
obs = env.reset() # returns initial observation vector
obs, reward, done = env.step("KEY_RIGHT")
The observation is a flat NumPy array of dtype ``float32``. For spatial
games, the first ``C × H × W`` elements are the board (channel-first
one-hot encoding); for non-spatial games, the board is encoded
``H × W × C`` and then flattened. Any ``observe_state`` values are
appended at the end.
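The channel-first layout for spatial games can be illustrated with a dependency-free sketch. It uses plain Python lists rather than a ``float32`` NumPy array, and ``encode_board`` is a hypothetical helper written for this example, not the library's encoder.

```python
def encode_board(rows, character_set):
    """One-hot encode a text board channel-first (C x H x W), then flatten."""
    flat = []
    for char in character_set:   # one channel per character
        for row in rows:         # H rows
            for cell in row:     # W columns
                flat.append(1.0 if cell == char else 0.0)
    return flat


rows = ["@ ", " *"]  # a tiny 2 x 2 board
obs = encode_board(rows, ["@", "*"])
print(len(obs))  # C * H * W values
print(obs)
```

The first four values are the ``@`` channel and the last four are the ``*`` channel; any ``observe_state`` values would follow after these.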

287
docs/troubleshooting.rst Normal file
View File

@@ -0,0 +1,287 @@
Troubleshooting
===============
This section describes problems that commonly arise when training an agent
with ``retro-gamer``. Each entry names the issue, describes what you will
see in the training log or when watching the agent play, explains what is
happening in terms of the underlying reinforcement learning, and suggests
how to fix it.
.. contents:: Issues
:local:
:depth: 1
Loss grows rapidly over training
---------------------------------
**Symptoms**
The ``avg_loss`` column in the training log grows steadily from one
checkpoint to the next, often at an accelerating rate::
[ep_0100] avg_loss=22.2
[ep_0200] avg_loss=128.5
[ep_0300] avg_loss=2918.5
[ep_0400] avg_loss=163825.1
Left unchecked, the loss eventually reaches extreme values and the agent's
behavior becomes erratic or degenerates entirely.
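A pattern like the one in the excerpt can be flagged automatically. The sketch below reports divergence when ``avg_loss`` more than doubles at each of the last three checkpoints; both thresholds are arbitrary choices for illustration, and ``looks_divergent`` is a hypothetical helper, not a ``retro-gamer`` feature.

```python
def looks_divergent(losses, factor=2.0, window=3):
    """Return True if avg_loss grew by more than `factor` at each of the
    last `window` checkpoints."""
    if len(losses) < window + 1:
        return False  # not enough history to judge
    recent = losses[-(window + 1):]
    return all(later > factor * earlier
               for earlier, later in zip(recent, recent[1:]))


print(looks_divergent([22.2, 128.5, 2918.5, 163825.1]))  # the excerpt above
print(looks_divergent([7.2, 6.8, 6.1, 5.4]))             # a slowly falling loss
```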
**Why this happens**
This is called *Q-value divergence*. The Q-network is trained to predict
the total future reward of each action. To do that, it computes a *target*
for each prediction — but the target itself is computed using the
Q-network's own current predictions. This creates a feedback loop: if
the predictions are slightly off, the targets drift, which makes the next
predictions slightly more off, which drifts the targets further.
Under normal conditions, the learning rate is small enough and the target
network stable enough that this loop stays controlled. Divergence happens
when the learning rate is too high, causing each update to overshoot.
The problem is amplified by larger networks (more parameters to overshoot)
and by prioritized experience replay, which deliberately samples the
experiences the network is most wrong about — exactly the experiences most
likely to destabilize it.
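The feedback loop is easiest to see in the training target itself. The sketch below computes a standard one-step DQN target, assuming the usual form ``r + gamma * max Q(s', a')``; the function name and the example numbers are illustrative only.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Bootstrapped target: r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward  # no future reward after a terminal step
    return reward + gamma * max(next_q_values)


# If the network's own estimates are inflated, the target inflates with them.
print(td_target(1.0, [2.0, 3.0]))
print(td_target(1.0, [20.0, 30.0]))
print(td_target(-10.0, [20.0, 30.0], done=True))
```

Because the target depends on the network's own estimates for the next state, an overestimate there raises the target, and training toward that target raises the estimates further.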
**How to fix it**
Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
training. If you recently increased the size of the network (via
``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
rate than you used before is likely necessary — larger, more capable
networks need smaller, more careful updates.
Also consider increasing ``target_update_freq``. The target network is a
frozen copy of the Q-network used to compute stable training targets; the
less frequently it is updated, the more stable those targets are. The
default is 500 steps; raising it to 1000 or more slows learning slightly
but reduces the chance of divergence.
Because divergence compounds over many episodes, a run that has begun
diverging cannot simply be resumed with a lower learning rate — the
weights have already drifted far from useful values. Use
``retro-gamer clean`` to remove the existing checkpoints and start fresh.
Agent ignores some actions entirely
-------------------------------------
**Symptoms**
After training, the agent never (or almost never) turns in certain
directions, regardless of the board state. If you compare checkpoints at
different stages of training, the missing directions are absent from the
very beginning and never appear. The agent may survive for a while, but
it only ever moves in a subset of the possible directions.
**Why this happens**
If some actions lead to immediate death every time they are tried early in
training, the Q-network quickly learns to assign them very low values.
This is correct in the specific situation where those actions are always
fatal — but the network then generalizes that association across *all*
board positions, even positions where those actions would be safe.
A common cause is a fixed starting position at the edge or corner of the
board. A snake that always starts in the top-left corner and always begins
moving downward will die immediately whenever it turns up or left in the
first step. After thousands of early episodes where those actions produce
instant death, the network has seen so much evidence that "turn left →
die" and "turn up → die" that it assigns them low Q-values everywhere.
**How to fix it**
Make sure the game's starting conditions give the agent a chance to try
every action safely. For a snake game, this means randomizing both the
starting position (keeping at least one cell away from every edge) and
the starting direction at the beginning of each episode. An agent that
starts in different places and orientations each time will quickly learn
that all four directions can be appropriate depending on context.
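A randomized start along these lines might look like the sketch below. The function name, the direction tuples, and the way the result would be passed to the snake's constructor are all assumptions; adapt them to however your game represents position and heading.

```python
import random

DIRECTIONS = [(1, 0), (0, -1), (-1, 0), (0, 1)]  # right, up, left, down


def random_start(board_size, margin=1, rng=random):
    """Pick a start cell at least `margin` cells from every edge, plus a
    random initial direction."""
    width, height = board_size
    x = rng.randrange(margin, width - margin)
    y = rng.randrange(margin, height - margin)
    return (x, y), rng.choice(DIRECTIONS)


position, direction = random_start((32, 16))
print(position, direction)
```

Calling this inside ``create_game()`` would give each episode a different starting cell and heading, so no action is fatal on the first step every time.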
Agent survives but never moves toward the goal
-----------------------------------------------
**Symptoms**
The ``avg_steps`` column in the training log increases steadily — the
agent is surviving longer — but ``avg_reward`` stays negative or barely
improves. When you watch the agent play, it wanders around the board
without ever approaching the target object. Episodes end because the
agent runs into a wall, not because it reached the goal.
**Why this happens**
The reward signal is *asymmetric*: it penalizes moving away from the goal
but gives no reward for moving toward it. With this signal, the agent
learns to avoid the penalty by surviving, but it has no positive gradient
pointing it in the right direction. The eventual goal-reaching reward
(eating the apple, reaching the exit, etc.) is too rare — especially
early in training when the agent is mostly acting randomly — to provide
meaningful learning signal on its own.
From the Q-network's perspective, all directions look roughly equivalent:
moving toward the goal is 0 reward, moving away is −1. On a large board,
the probability of eating the apple by chance is small enough that the
network may never see the positive terminal reward at all during the
exploration phase.
**How to fix it**
Make the distance-based reward symmetric: give **+1 for moving toward the
goal** and **−1 for moving away**. This way, every single step provides a
meaningful signal in the correct direction, and the agent does not need to
reach the goal by chance in order to start learning. In a snake game,
computing this signal requires only one line of arithmetic — the change
in Manhattan distance between the head and the apple from one step to the
next.
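The distance-based signal can be sketched as follows, assuming positions are ``(x, y)`` tuples. The helper names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shaping_reward(old_head, new_head, apple):
    """+1 if the step reduced the Manhattan distance to the apple, -1 if it
    increased it. A single orthogonal move always changes it by exactly 1."""
    return manhattan(old_head, apple) - manhattan(new_head, apple)


apple = (10, 5)
print(shaping_reward((3, 5), (4, 5), apple))  # moved toward the apple
print(shaping_reward((3, 5), (2, 5), apple))  # moved away from the apple
```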
Note that the shaped ±1 signal is a *proxy* for the real objective. If the
agent learns to follow it too literally, it may take direct paths that run
through its own body. The −10 death penalty and +50 apple reward are still
necessary; the shaping only accelerates early learning.
Exploration ends before learning is complete
---------------------------------------------
**Symptoms**
The ``epsilon`` column in the training log reaches ``epsilon_min`` well
before training is finished. After that point, ``avg_reward`` stops
improving even though many episodes remain. When you watch the agent play,
it commits to the same strategy regardless of what is happening on the
board.
**Why this happens**
Epsilon controls the balance between exploration (random actions) and
exploitation (using the learned policy). Early in training, when the
Q-network has seen little data, exploration is essential: the agent needs
to try different things to accumulate the varied experiences that make
Q-value estimates reliable. Once epsilon reaches its minimum, the agent
stops exploring and commits fully to whatever policy it has learned so far.
If ``training_episodes`` is too small relative to ``epsilon_decay``, the
exploration phase ends while the Q-network is still unreliable. The agent
then exploits a half-learned policy that cannot improve because it never
tries anything new.
You can calculate when epsilon will reach its minimum:
.. code-block:: python
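import math
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999  # example values
# Solve epsilon * epsilon_decay**episodes == epsilon_min for episodes.
episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)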
With ``epsilon = 1.0``, ``epsilon_min = 0.05``, and ``epsilon_decay = 0.999``,
this comes to roughly 3,000 episodes (roughly 10,000 with the default
``epsilon_decay`` of ``0.9997``). The
agent should have substantial training time *after* the exploration phase
ends — so ``training_episodes`` should be at least several times this
number.
**How to fix it**
Increase ``training_episodes`` so that the agent has many episodes of
exploitation after the exploration phase ends. For simple games on small
boards, 10,000 episodes is a reasonable starting point; for complex games
or large boards, 50,000–100,000 may be needed.
This is always safe to change. Because ``training_episodes`` does not
affect the network architecture or the reward signal, you can increase it
in ``config.toml`` and resume training from the latest checkpoint without
starting fresh.
Death penalty dominates all other signals
-------------------------------------------
**Symptoms**
After a period of training, the agent survives for many steps but rarely
or never scores. It tends to circle, hug walls, or otherwise avoid the
goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
persistently negative. The agent behaves as if staying alive is the only
objective.
**Why this happens**
When the penalty for dying is much larger than any other reward in the
game, the Q-network learns that staying alive is overwhelmingly the most
important thing to do. Scoring — which requires taking some risk —
becomes unattractive because a single death outweighs many successful
goal-reaching events.
For example, if the death penalty is −1000 and each successful apple is
+50, then dying once costs the equivalent of twenty apples. The agent
learns that the safest strategy is to avoid risk entirely, even if that
means never eating. From the Q-network's perspective, this is rational:
it is correctly optimizing the reward signal you gave it.
**How to fix it**
Keep all reward magnitudes in the same order of magnitude. If per-step
shaping gives ±1 and the goal reward is +50, a death penalty of −10 is
appropriate: death is clearly bad (ten times worse than a bad step) but
not so catastrophic that it crowds out everything else. As a rule of
thumb, no single penalty should be more than ten to twenty times larger
than the typical per-step reward.
If you want the agent to weigh long-term consequences more heavily,
increase ``gamma`` (the discount factor) rather than enlarging the
death penalty. A higher gamma causes
future rewards — including the eventual death penalty — to count more
heavily in the agent's current decisions, without distorting the relative
scale of the rewards.
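The effect of ``gamma`` can be made concrete with a short calculation. A reward that arrives ``k`` turns in the future is weighted by ``gamma ** k``; the gamma values and the 20-turn horizon below are example choices, not recommendations.

```python
def discounted_weight(gamma, steps_ahead):
    """Weight given to a reward that arrives `steps_ahead` turns from now."""
    return gamma ** steps_ahead


for gamma in (0.9, 0.99):
    weight = discounted_weight(gamma, 20)
    # A -10 death penalty 20 turns away, as valued at the current turn.
    print(f"gamma={gamma}: weight={weight:.3f}, penalty today={-10 * weight:.2f}")
```

With the lower gamma, a death 20 turns away barely registers; with the higher one it retains most of its weight, all without changing the size of the penalty itself.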
Reward signal and human score interfere with each other
---------------------------------------------------------
**Symptoms**
Human players see scores that go negative, or that include penalties and
adjustments that make no sense in the context of a normal game. Conversely,
adjustments made to improve training (removing a per-step shaping penalty,
changing a death penalty) change the game's visible score in ways that
affect the experience for human players.
**Why this happens**
Using the same state variable for both the training reward and the
human-visible score conflates two separate concerns. Training rewards
benefit from shaping — intermediate signals like "moved toward the goal"
and "died" that accelerate learning. Scores for human players should
reflect only the game's actual objectives (apples eaten, enemies defeated,
distance covered) so that they are legible and motivating.
When these are the same variable, every design decision about one
necessarily affects the other.
**How to fix it**
Use two separate keys in the game's state dictionary: one for the
human-facing score (updated only by meaningful in-game events) and one
for the training reward (updated every step with shaping signals and
penalties). In the game code:
.. code-block:: python
# Only updated when the snake eats an apple — clean for human players.
game.state['score'] += 50
# Updated every step — used only by the trainer.
game.state['reward'] += old_dist - new_dist # +1 toward apple, -1 away
game.state['reward'] += 50 # also reward eating
game.state['reward'] -= 10 # death penalty
Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
``pyproject.toml`` so the trainer watches the right key. The score display
remains clean for human players, and you can adjust the training reward
freely without affecting it.
Note that changing the ``reward`` key is an incompatible change: existing
checkpoints trained on the old signal will be rejected when you try to
resume. Run ``retro-gamer clean`` and start fresh after making this change.

View File

@@ -21,9 +21,9 @@ You will need:
Preparing your game
-------------------
``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.
``retro-gamer`` loads your game by calling a function named
``create_game``. The function must take no arguments and return a new
``Game`` instance.
Here is the ``create_game`` function for Snake:
@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
def create_game():
head = SnakeHead()
apple = Apple()
game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
apple.relocate(game)
return game
If your game module does not already have a ``create_game`` function,
add one following this pattern.
If your game file does not already have a ``create_game`` function, add
one following this pattern.
When you run ``retro-gamer create``, you can point to your game file
directly by path or by Python module name:
.. code-block:: console
% retro-gamer create --game my_game.py --output runs/my_game/
% retro-gamer create --game retro.examples.snake --output runs/snake/
Describing your game
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []
Let's go through each field.
@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).
The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, score increases by 1
(or more) each time the apple is eaten, so the agent receives a reward
of 1 when it eats an apple and 0 otherwise.
this value from one turn to the next. For Snake, the score changes when
the snake eats an apple (+50), when it moves away from the apple (−1),
and when it dies (−10). These incremental changes are what the agent
tries to maximize.
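Computing the reward as a change in the reward key can be sketched like this. The score history is invented for illustration, and ``step_rewards`` is a hypothetical helper rather than part of ``retro-gamer``.

```python
def step_rewards(score_history):
    """Per-turn rewards as the change in the reward key between turns."""
    return [after - before
            for before, after in zip(score_history, score_history[1:])]


# Hypothetical values of game.state["score"] over five observations:
# start, move away (-1), move away (-1), eat an apple (+50), die (-10).
history = [100, 99, 98, 148, 138]
print(step_rewards(history))
```

Only the differences matter to the agent, so the starting value of the key has no effect on what it learns.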
Choosing an appropriate reward is one of the most consequential
decisions in RL. Some considerations:
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.
``spatial``
~~~~~~~~~~~
``spatial`` and other preprocessing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to treat the board as a spatial scene (default: ``true``). A
spatial game uses a *convolutional neural network* (CNN) that can
detect patterns in the relative arrangement of characters. A
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
ignores positional relationships. Set to ``false`` for games where
position is irrelevant.
The ``[tool.retro-gamer]`` section describes the game. Preprocessing
options—such as ``spatial`` (whether to use a CNN or MLP, default:
``false``), ``egocentric``, and ``observe_state``—live in the
``[preprocessing]`` section of the generated ``config.toml``. You can
edit them there after running ``retro-gamer create``.
``observe_state``
~~~~~~~~~~~~~~~~~
By default the agent only sees the board. You can also give it access
to computed values from ``game.state`` by listing the relevant keys in
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
For example, Snake exposes the normalized direction to the apple:
.. code-block:: toml
[preprocessing]
observe_state = ["apple_dx", "apple_dy"]
The trainer appends these values to the observation vector after the
board encoding (or uses them as the entire observation when
``board = false``).
These values must be set in ``game.state`` at the start of every
episode—typically inside ``create_game()``—and must keep the same
type and length from episode to episode.
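The requirement can be expressed as a small check that could run at the start of each episode. This is a sketch under assumptions: ``check_observed_state`` is a hypothetical helper, and the exception types are chosen for the example rather than taken from ``retro-gamer``.

```python
def check_observed_state(state, keys, expected_sizes=None):
    """Verify every observed key exists and keeps a stable length.

    Returns the size of each key so the result can be passed back in as
    `expected_sizes` on the next episode. Scalars count as length 1.
    """
    sizes = {}
    for key in keys:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} is missing from game.state")
        value = state[key]
        sizes[key] = len(value) if isinstance(value, (list, tuple)) else 1
    if expected_sizes is not None and sizes != expected_sizes:
        raise ValueError(f"observation shape changed: {expected_sizes} -> {sizes}")
    return sizes


keys = ["apple_dx", "apple_dy"]
first = check_observed_state({"score": 0, "apple_dx": 0.25, "apple_dy": -0.5}, keys)
# A later episode with the same keys and sizes passes silently.
check_observed_state({"score": 0, "apple_dx": 0.0, "apple_dy": 1.0}, keys, first)
print(first)
```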
.. warning::
Always initialize every key listed in ``observe_state`` before the
game starts. If a key is missing or its length changes between
episodes, training stops immediately with a clear error explaining
what changed. The neural network's input size is fixed when training
begins; it cannot adapt to a changing observation shape mid-run.
This is a good place to ask: *can a human player see this information?*
The apple's location is visible on screen; the normalized distance vector
is not. Whether that asymmetry is appropriate is a design choice worth
examining.
Once you have written this section, create the training run directory:
@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
reward : score
characters : ['@', '*', '>', '<', '^', 'v']
architecture: CNN (spatial)
architecture: MLP
``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
.. code-block:: console
% retro-gamer train runs/snake/
Training for 1000 episodes…
Done. Checkpoints in runs/snake/checkpoints/
100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
Done. Checkpoints saved in runs/snake/checkpoints/
Training saves checkpoints every 100 episodes and a ``final.pt``
checkpoint when complete. You can follow progress in the training log:
A progress bar shows how far training has gone, along with the most
recent episode's reward, the current exploration rate (``eps``), and
the average prediction error (``loss``).
Training saves a checkpoint every 100 episodes to
``runs/snake/checkpoints/``. You can stop training at any time with
Ctrl-C and resume it later—the next ``retro-gamer train`` command will
automatically pick up from the latest checkpoint.
Reading the training log
~~~~~~~~~~~~~~~~~~~~~~~~
For a longer view of how training is progressing, inspect the training
log:
.. code-block:: console
% tail -f runs/snake/training.log
% cat runs/snake/training.log
The log shows one line per episode:
The log begins with the full network architecture, followed by one line
per checkpoint (every 100 episodes):
.. code-block:: text
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
[ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
[ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
[ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
[ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
- **total_reward**: the total score earned during the episode (how many
apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
steps in this episode. A decreasing loss generally indicates that the
Q-value estimates are converging.
Here is what each field means:
Resuming training
~~~~~~~~~~~~~~~~~
- **avg_reward**: Average total reward per episode over the past 100 episodes.
Positive values mean the agent is accumulating reward; negative values mean
it is accumulating penalties. An upward trend over time is the main signal
that learning is working.
- **avg_steps**: Average number of turns per episode. If episodes are ending
quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
generally indicate the agent is surviving longer.
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
behavior is expected.
- **avg_loss**: Average Huber loss across training steps. Huber loss is
quadratic for small prediction errors and linear for large ones, which keeps
it stable even when rewards have a wide range (such as a large bonus for
reaching a goal). Values in the range 0–10 are typical for most games.
A slow downward trend is the healthy pattern. A loss that grows without bound
indicates the learning rate is too high.
- **time**: Wall-clock time for this 100-episode interval.
- **total**: Cumulative training time across all sessions.
Training can be resumed from a checkpoint:
When training is resumed after a stop, a header line marks the break::
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
This lets you track exactly when each session took place.
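If you want to plot or analyze the log, each checkpoint line can be parsed with a regular expression. The pattern below is a sketch based only on the example lines shown above; the exact spacing and number formatting of real logs may differ, and ``parse_log_line`` is a hypothetical helper.

```python
import re

LINE = re.compile(
    r"\[ep_(?P<checkpoint>\d+)\]\s+ep=(?P<first>\d+)-(?P<last>\d+)\s+"
    r"avg_reward=\s*(?P<avg_reward>[+-]?\d+(?:\.\d+)?)\s+"
    r"avg_steps=(?P<avg_steps>\d+(?:\.\d+)?)\s+"
    r"epsilon=(?P<epsilon>\d+(?:\.\d+)?)\s+"
    r"avg_loss=(?P<avg_loss>\d+(?:\.\d+)?)"
)


def parse_log_line(line):
    """Return the numeric fields of one checkpoint line, or None for other
    lines such as the architecture header or resume markers."""
    match = LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "checkpoint": int(fields["checkpoint"]),
        "avg_reward": float(fields["avg_reward"]),
        "avg_steps": float(fields["avg_steps"]),
        "epsilon": float(fields["epsilon"]),
        "avg_loss": float(fields["avg_loss"]),
    }


sample = ("[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 "
          "epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s")
print(parse_log_line(sample))
print(parse_log_line("=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ==="))
```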
Stopping training to watch the agent play
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You do not need to wait for training to finish before watching the
agent. Training can be stopped at any time with Ctrl-C, and the latest
checkpoint is always available immediately:
.. code-block:: console
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
% retro-gamer play runs/snake/
Watching a trained agent play
------------------------------
This loads the most recent checkpoint and runs the agent in your
terminal. Press Enter or Escape to quit.
To watch a trained agent play the game in your terminal:
.. note::
.. code-block:: console
The game is rendered directly in your terminal. If the window is
smaller than the board plus borders, ``retro-gamer play`` will raise
a ``TerminalTooSmall`` error — enlarge the terminal window and try
again.
% retro-gamer play runs/snake/ --checkpoint final
You can substitute any checkpoint name:
To watch an earlier stage of training, use ``--checkpoint``:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint ep_0100
Press Enter or Escape to quit.
Comparing what the agent at episode 100 does versus the agent at episode
500 can reveal exactly what the agent has (and has not) learned. For
Snake, you might notice the episode-100 agent moving somewhat randomly,
while the episode-500 agent consistently navigates toward the apple.
Articulating *why* the later agent behaves differently—what the training
process produced—connects observation directly to the concepts underlying
DQN.
Comparing agents trained at different checkpoints is a useful activity:
the agent at episode 100 has learned *something*, but typically much
less than the agent at episode 500. Articulating *what* the earlier
agent has and has not learned, and *why*, is productive reasoning about
the training process.
Resuming training after watching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After watching the agent play, resume training with exactly the same
command you used before:
.. code-block:: console
% retro-gamer train runs/snake/
``retro-gamer`` automatically detects and resumes from the latest
checkpoint. No extra flags are needed. If all configured episodes have
already been completed, it prints a message and exits:
.. code-block:: console
Training already complete (1000 episodes). To keep training,
increase training_episodes in config.toml.
To continue training, open ``runs/snake/config.toml``, increase the
``training_episodes`` value, and run ``retro-gamer train`` again.
Watching a trained agent play
------------------------------
Once training is complete, watch the final agent:
.. code-block:: console
% retro-gamer play runs/snake/
By default the latest checkpoint is loaded. You can also compare the
agent's performance at different stages of training:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint ep_0100
% retro-gamer play runs/snake/ --checkpoint ep_0500
Press Enter or Escape to quit.
Inspecting a run
----------------
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
.. code-block:: console
% retro-gamer info runs/snake/
Game module : retro.examples.snake
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
Game module : retro.examples.snake
Metadata : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
Model : {'hidden_sizes': [128, 64]}
Training : {'learning_rate': 0.0001, 'gamma': 0.99, ...}
Last 5 episodes:
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
Last 5 checkpoints:
[ep_0600] ep=0501-0600 avg_reward=+12.1 ...
[ep_0700] ep=0601-0700 avg_reward=+14.8 ...
[ep_0800] ep=0701-0800 avg_reward=+16.3 ...
[ep_0900] ep=0801-0900 avg_reward=+19.0 ...
[ep_1000] ep=0901-1000 avg_reward=+22.3 ...
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']
Adjusting hyperparameters
--------------------------
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
create``. Common adjustments and their effects:
**``training_episodes``** — How long to train. More episodes give the
agent more time to learn, but also take longer to run.
agent more time to learn, but also take longer to run. This is always
safe to increase.
**``epsilon_decay``** — How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.
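
To make the effect of ``gamma`` concrete, the sketch below computes the
discounted return for a made-up reward sequence. The function name and
the reward values are illustrative assumptions and are not part of
``retro-gamer``.

.. code-block:: python

    def discounted_return(rewards, gamma):
        """Sum of rewards, each weighted by gamma raised to its delay in steps."""
        total = 0.0
        # Work backwards so each step folds in the discounted value of what follows.
        for reward in reversed(rewards):
            total = reward + gamma * total
        return total

    # Hypothetical episode: nothing happens for nine steps, then a reward of 1.
    rewards = [0.0] * 9 + [1.0]

    for gamma in (0.99, 0.5):
        print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.4f}")

With ``gamma`` at 0.99 the delayed reward keeps most of its value (about
0.91); at 0.5 it is worth almost nothing (about 0.002), so the agent has
little incentive to plan nine steps ahead.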
**``n_layers`` and ``layer_size``** — The depth and width of the MLP
head. Larger networks can represent more complex Q-functions but are
slower to train and may overfit.
**``hidden_sizes``** — The shape of the MLP head as a list of layer
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
more complex Q-functions but are slower to train and may overfit.
**``prioritize_experiences``** — Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.
.. _incompatible-changes:
Why some changes require starting fresh
----------------------------------------
Not all changes to ``config.toml`` are equal. Some can be applied
immediately to an existing training run; others make the existing
checkpoints unusable.
**Safe to change at any time** (``[training]`` section) — These affect
*how* the agent learns, not *what* it is learning to do. Existing
checkpoints remain valid:
- ``training_episodes``, ``max_turns_per_episode``
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
- ``target_update_freq``, ``train_every``
**Requires starting fresh** — These changes alter the shape of the
game or the shape of the network. The saved model weights are
incompatible with the new configuration:
- ``actions``, ``reward``, ``character_set``, ``board_size``
(``[metadata]``) — These define what the agent perceives and what it
can do. Changing them changes the size of the network's input or
output layers; the existing weights no longer fit.
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
``egocentric``, ``egocentric_player``, ``egocentric_radius``
(``[preprocessing]``) — These control how the observation is
constructed. Any change here alters the input shape or meaning and
makes existing weights invalid.
- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
layers. Changing it changes the shape of the network; the existing
weights no longer fit.
If you try to resume training after making one of these changes,
``retro-gamer train`` detects the mismatch and stops with a clear
explanation, for example::
Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
The following changes require starting fresh. The existing model was
trained on a different problem and its weights cannot be reused:
character_set
was : ['@', '*', '>', '<', '^', 'v']
now : ['@', '*', '>', '<', '^', 'v', '#']
why : the set of board characters (changes input layer size)
Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
training log, then run 'retro-gamer train RUN_DIR' to start fresh.
To clear out the old checkpoints and begin again:
.. code-block:: console
% retro-gamer clean runs/snake/
Will remove 5 checkpoint(s) and training log from runs/snake/:
checkpoints/ep_0100.pt
checkpoints/ep_0200.pt
...
training.log
Proceed? [y/N]: y
Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
The ``config.toml`` is always preserved so you do not need to run
``retro-gamer create`` again.
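
The distinction between safe and breaking changes can be sketched as a
comparison of two configurations. This is a minimal illustration, not
``retro-gamer``'s actual implementation: the ``BREAKING_KEYS`` table and
the ``find_incompatible_changes`` function are hypothetical names, and
the two dictionaries stand in for the configuration stored with a
checkpoint and the current ``config.toml``.

.. code-block:: python

    # Hypothetical grouping of keys, following the lists above; the real
    # retro-gamer check may be organised differently.
    BREAKING_KEYS = {
        "metadata": ["actions", "reward", "character_set", "board_size"],
        "preprocessing": [
            "spatial", "board", "observe_state", "observe_state_sizes",
            "egocentric", "egocentric_player", "egocentric_radius",
        ],
        "model": ["hidden_sizes"],
    }

    def find_incompatible_changes(saved, current):
        """Return (section, key, was, now) for each breaking key that differs."""
        changes = []
        for section, keys in BREAKING_KEYS.items():
            for key in keys:
                was = saved.get(section, {}).get(key)
                now = current.get(section, {}).get(key)
                if was != now:
                    changes.append((section, key, was, now))
        return changes

    saved = {
        "metadata": {"character_set": ["@", "*", ">", "<", "^", "v"]},
        "model": {"hidden_sizes": [128, 64]},
        "training": {"learning_rate": 0.001},
    }
    current = {
        "metadata": {"character_set": ["@", "*", ">", "<", "^", "v", "#"]},
        "model": {"hidden_sizes": [128, 64]},
        # [training] keys are never inspected, so this change is allowed.
        "training": {"learning_rate": 0.0001},
    }

    for section, key, was, now in find_incompatible_changes(saved, current):
        print(f"{key} ([{section}])\n  was : {was}\n  now : {now}")

Only ``character_set`` is reported here. The changed ``learning_rate``
is ignored because ``[training]`` options do not affect the shape of the
network.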
Reasoning about training from the log
--------------------------------------
The training log is one of the most useful tools for understanding what
is happening during training. Here are some patterns to look for and
what they mean.
**Reward increasing steadily** is the normal, healthy pattern. Each
checkpoint block should show a higher ``avg_reward`` than the last.
The rate of increase typically slows as training progresses.
**Reward flat or negative through early episodes** is normal. Early in
training, ``epsilon`` is high and the agent is mostly acting randomly.
It has not yet discovered effective strategies. Check the ``epsilon``
column of the log: while it is still high, a flat reward most likely
reflects the exploration phase rather than a problem.
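
How long the exploration phase lasts depends on the decay settings. The
sketch below assumes a multiplicative schedule in which ``epsilon`` is
multiplied by ``epsilon_decay`` once per episode and never drops below
``epsilon_min``. That schedule and the values used are assumptions for
illustration; ``retro-gamer``'s actual schedule and defaults may differ.

.. code-block:: python

    def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, episodes):
        """Yield the exploration rate for each episode (assumed multiplicative decay)."""
        for _ in range(episodes):
            yield epsilon
            # Shrink epsilon after each episode, but never below the floor.
            epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Hypothetical values chosen for illustration only.
    schedule = list(epsilon_schedule(1.0, 0.995, 0.01, 1000))

    for episode in (1, 100, 500, 1000):
        print(f"episode {episode:4d}: epsilon={schedule[episode - 1]:.4f}")

Printing the schedule next to the reward column of the log makes it
easier to judge whether a flat stretch coincides with high exploration.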
**Loss decreasing** is also healthy. As the Q-network's estimates
improve, the difference between predicted and target Q-values (the TD
error) should shrink. A loss that stabilizes near zero is usually a
good sign.
**Loss growing without bound** usually indicates the learning rate is
too high. The trainer uses Huber loss, which is robust to large reward
scales, but a learning rate above roughly ``0.001`` can still
destabilise training. Try reducing it by a factor of 10 (e.g. from
``0.001`` to ``0.0001``) and restarting training.
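
To illustrate why Huber loss tolerates large errors, the sketch below
compares it with a plain squared loss on a few TD errors. The threshold
``delta=1.0`` is an assumption (it is a common default), and the
functions are standalone illustrations rather than the trainer's own
code.

.. code-block:: python

    def huber_loss(td_error, delta=1.0):
        """Quadratic for small errors, linear for large ones."""
        abs_error = abs(td_error)
        if abs_error <= delta:
            return 0.5 * td_error ** 2
        return delta * (abs_error - 0.5 * delta)

    def squared_loss(td_error):
        """Half the squared error, for comparison."""
        return 0.5 * td_error ** 2

    for td_error in (0.1, 1.0, 10.0, 100.0):
        print(
            f"td_error={td_error:6.1f}  "
            f"huber={huber_loss(td_error):8.3f}  "
            f"squared={squared_loss(td_error):9.3f}"
        )

Beyond ``delta`` the Huber loss grows linearly, so its gradient stays
bounded. A loss that still diverges therefore points at the step size
rather than at occasional large errors.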
**Short episodes** (low ``avg_steps``) combined with low reward
suggest the agent is dying frequently. Early in training this is
normal. If it persists late in training, the agent may have settled on
a bad policy; consider extending training or increasing
``epsilon_decay`` so the agent explores for longer.
**Reward that improves and then regresses** can indicate that the
agent has discovered a suboptimal but consistent strategy and is stuck.
Increasing ``epsilon_min`` to keep some exploration active, or
adjusting the reward signal to better differentiate good moves from
bad ones, can help.
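
Some of these patterns can be checked mechanically. The sketch below
assumes the log contains checkpoint summary lines in the form shown in
the ``retro-gamer info`` output, such as
``[ep_0600] ep=0501-0600 avg_reward=+12.1 ...``, and flags any checkpoint
whose ``avg_reward`` is lower than the one before it. The regular
expression and function names are hypothetical; adapt them to the actual
log format.

.. code-block:: python

    import re

    # Assumed line format, based on the checkpoint summaries shown by
    # ``retro-gamer info``.
    CHECKPOINT_LINE = re.compile(r"\[(ep_\d+)\].*?avg_reward=([+-]?\d+(?:\.\d+)?)")

    def checkpoint_rewards(lines):
        """Extract (checkpoint_name, avg_reward) pairs from log lines."""
        pairs = []
        for line in lines:
            match = CHECKPOINT_LINE.search(line)
            if match:
                pairs.append((match.group(1), float(match.group(2))))
        return pairs

    def regressions(pairs):
        """Return the checkpoints whose avg_reward fell below the previous one."""
        return [
            name
            for (_, previous), (name, current) in zip(pairs, pairs[1:])
            if current < previous
        ]

    sample = [
        "[ep_0600] ep=0501-0600 avg_reward=+12.1 ...",
        "[ep_0700] ep=0601-0700 avg_reward=+14.8 ...",
        "[ep_0800] ep=0701-0800 avg_reward=+16.3 ...",
        "[ep_0900] ep=0801-0900 avg_reward=+19.0 ...",
        "[ep_1000] ep=0901-1000 avg_reward=+22.3 ...",
    ]

    pairs = checkpoint_rewards(sample)
    print("Regressed checkpoints:", regressions(pairs))

An empty list corresponds to the steadily increasing pattern. A
checkpoint that appears in the list is a good candidate to watch with
``retro-gamer play --checkpoint``.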
Questions for investigation
----------------------------
@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
episode 1000 and watch each play the same game. What has the later
agent learned that the earlier one has not? How would you describe
this difference to someone who does not know about neural networks?