Updates across the board

Chris Proctor
2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions


@@ -21,9 +21,9 @@ You will need:
Preparing your game
-------------------
``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.
``retro-gamer`` loads your game by calling a function named
``create_game``. The function must take no arguments and return a new
``Game`` instance.
Here is the ``create_game`` function for Snake:
@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
def create_game():
head = SnakeHead()
apple = Apple()
game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
apple.relocate(game)
return game
If your game module does not already have a ``create_game`` function,
add one following this pattern.
If your game file does not already have a ``create_game`` function, add
one following this pattern.
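As a rough illustration of the contract described above, the sketch below locates a ``create_game`` function given either a file path or a dotted module name. The helper name ``load_create_game`` is hypothetical and this is not ``retro-gamer``'s actual loader; it only shows what "a module-level function that takes no arguments" means in practice.

```python
import importlib
import importlib.util
from pathlib import Path


def load_create_game(target):
    """Return the create_game callable from a .py path or a dotted module name.

    Hypothetical helper for illustration; retro-gamer's own loader may differ.
    """
    if target.endswith(".py"):
        # Load the module directly from a file path such as "my_game.py".
        path = Path(target)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Import by dotted name such as "retro.examples.snake".
        module = importlib.import_module(target)

    create_game = getattr(module, "create_game", None)
    if not callable(create_game):
        raise AttributeError(f"{target} does not define a create_game() function")
    return create_game
```

Calling the returned function with no arguments should give you a fresh ``Game`` instance each time, which is what lets every training episode start from a clean state.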
When you run ``retro-gamer create``, you can point to your game file
directly by path or by Python module name:
.. code-block:: console
% retro-gamer create --game my_game.py --output runs/my_game/
% retro-gamer create --game retro.examples.snake --output runs/snake/
Describing your game
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []
Let's go through each field.
@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).
The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, score increases by 1
(or more) each time the apple is eaten, so the agent receives a reward
of 1 when it eats an apple and 0 otherwise.
this value from one turn to the next. For Snake, the score changes when
the snake eats an apple (+50), when it moves away from the apple (-1),
and when it dies (-10). The agent tries to maximize the sum of these
per-turn changes.
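The "change in this value" rule can be sketched in a few lines. The function name and the score trace below are invented for illustration; only the idea of differencing the reward key between consecutive turns comes from the text above.

```python
def reward_for_turn(previous_state, current_state, reward_key="score"):
    """Reward is the change in the reward key between two consecutive turns."""
    return current_state[reward_key] - previous_state[reward_key]


# A made-up trace of game.state over four turns: the snake eats one apple
# (+50) on the third turn and the score is otherwise unchanged.
states = [{"score": 100}, {"score": 100}, {"score": 150}, {"score": 150}]
rewards = [reward_for_turn(a, b) for a, b in zip(states, states[1:])]
print(rewards)  # [0, 50, 0]
```

Note that the starting value of the score does not matter to the agent: only the differences between turns are turned into rewards.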
Choosing an appropriate reward is one of the most consequential
decisions in RL. Some considerations:
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.
``spatial``
~~~~~~~~~~~
``spatial`` and other preprocessing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whether to treat the board as a spatial scene (default: ``true``). A
spatial game uses a *convolutional neural network* (CNN) that can
detect patterns in the relative arrangement of characters. A
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
ignores positional relationships. Set to ``false`` for games where
position is irrelevant.
The ``[tool.retro-gamer]`` section describes the game. Preprocessing
options—such as ``spatial`` (whether to use a CNN or MLP, default:
``false``), ``egocentric``, and ``observe_state``—live in the
``[preprocessing]`` section of the generated ``config.toml``. You can
edit them there after running ``retro-gamer create``.
``observe_state``
~~~~~~~~~~~~~~~~~
By default the agent only sees the board. You can also give it access
to computed values from ``game.state`` by listing the relevant keys in
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
For example, Snake exposes the normalized direction to the apple:
.. code-block:: toml
[preprocessing]
observe_state = ["apple_dx", "apple_dy"]
The trainer appends these values to the observation vector after the
board encoding (or uses them as the entire observation when
``board = false``).
These values must be set in ``game.state`` at the start of every
episode—typically inside ``create_game()``—and must keep the same
type and length from episode to episode.
.. warning::
Always initialize every key listed in ``observe_state`` before the
game starts. If a key is missing or its length changes between
episodes, training stops immediately with a clear error explaining
what changed. The neural network's input size is fixed when training
begins; it cannot adapt to a changing observation shape mid-run.
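To make the fixed-shape requirement concrete, here is a minimal sketch of how listed state values could be flattened into a vector and checked against an expected length. The function names are hypothetical and the trainer's real checks may differ in detail.

```python
def build_state_vector(state, keys):
    """Flatten the listed state values into one list of floats."""
    vector = []
    for key in keys:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} is missing from game.state")
        value = state[key]
        if isinstance(value, (list, tuple)):
            # Sequence values contribute one entry per item.
            vector.extend(float(item) for item in value)
        else:
            # Scalar values contribute a single entry.
            vector.append(float(value))
    return vector


def check_length(vector, expected_length):
    """Raise if the observation length differs from the one fixed at the start."""
    if len(vector) != expected_length:
        raise ValueError(
            f"observation length changed: expected {expected_length}, got {len(vector)}"
        )


state = {"score": 100, "apple_dx": 0.5, "apple_dy": -1.0}
vector = build_state_vector(state, ["apple_dx", "apple_dy"])
check_length(vector, expected_length=2)
```

If ``create_game()`` forgot to set ``apple_dx``, the first call would fail before any training step ran, which is the behavior the warning above describes.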
This is a good place to ask: *can a human player see this information?*
The apple's location is visible on screen; the normalized direction vector
is not. Whether that asymmetry is appropriate is a design choice worth
examining.
Once you have written this section, create the training run directory:
@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
reward : score
characters : ['@', '*', '>', '<', '^', 'v']
architecture: CNN (spatial)
architecture: MLP
``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
.. code-block:: console
% retro-gamer train runs/snake/
Training for 1000 episodes…
Done. Checkpoints in runs/snake/checkpoints/
100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
Done. Checkpoints saved in runs/snake/checkpoints/
Training saves checkpoints every 100 episodes and a ``final.pt``
checkpoint when complete. You can follow progress in the training log:
A progress bar shows how far training has gone, along with the most
recent episode's reward, the current exploration rate (``eps``), and
the average prediction error (``loss``).
Training saves a checkpoint every 100 episodes to
``runs/snake/checkpoints/``. You can stop training at any time with
Ctrl-C and resume it later—the next ``retro-gamer train`` command will
automatically pick up from the latest checkpoint.
Reading the training log
~~~~~~~~~~~~~~~~~~~~~~~~
For a longer view of how training is progressing, inspect the training
log:
.. code-block:: console
% tail -f runs/snake/training.log
% cat runs/snake/training.log
The log shows one line per episode:
The log begins with the full network architecture, followed by one line
per checkpoint (every 100 episodes):
.. code-block:: text
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
[ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
[ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
[ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
[ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
- **total_reward**: the total score earned during the episode (how many
apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
steps in this episode. A decreasing loss generally indicates that the
Q-value estimates are converging.
Here is what each field means:
Resuming training
~~~~~~~~~~~~~~~~~
- **avg_reward**: Average total reward per episode over the past 100 episodes.
Positive values mean the agent is accumulating reward; negative values mean
it is accumulating penalties. An upward trend over time is the main signal
that learning is working.
- **avg_steps**: Average number of turns per episode. If episodes are ending
quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
generally indicate the agent is surviving longer.
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
behavior is expected.
- **avg_loss**: Average Huber loss across training steps. Huber loss is
quadratic for small prediction errors and linear for large ones, which keeps
it stable even when rewards have a wide range (such as a large bonus for
reaching a goal). Values in the range 0 to 10 are typical for most games.
A slow downward trend is the healthy pattern. A loss that grows without bound
usually indicates that the learning rate is too high.
- **time**: Wall-clock time for this 100-episode interval.
- **total**: Cumulative training time across all sessions.
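The "quadratic for small errors, linear for large ones" behavior of Huber loss can be seen in a small sketch. The threshold ``delta=1.0`` is an assumption made for illustration; the value the trainer actually uses is not stated here.

```python
def huber_loss(error, delta=1.0):
    """Huber loss for a single prediction error (delta is an assumed threshold)."""
    abs_error = abs(error)
    if abs_error <= delta:
        # Small errors: behaves like half the squared error.
        return 0.5 * error ** 2
    # Large errors: grows linearly, so one big surprise cannot dominate.
    return delta * (abs_error - 0.5 * delta)


print(huber_loss(0.5))   # 0.125
print(huber_loss(10.0))  # 9.5, compared with 50.0 for half the squared error
```

This is why a +50 reward for eating an apple does not by itself produce enormous loss values in the log.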
Training can be resumed from a checkpoint:
When training is resumed after a stop, a header line marks the break::
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
This lets you track exactly when each session took place.
Stopping training to watch the agent play
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You do not need to wait for training to finish before watching the
agent. Training can be stopped at any time with Ctrl-C, and the latest
checkpoint is always available immediately:
.. code-block:: console
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
% retro-gamer play runs/snake/
Watching a trained agent play
------------------------------
This loads the most recent checkpoint and runs the agent in your
terminal. Press Enter or Escape to quit.
To watch a trained agent play the game in your terminal:
.. note::
.. code-block:: console
The game is rendered directly in your terminal. If the window is
smaller than the board plus borders, ``retro-gamer play`` will raise
a ``TerminalTooSmall`` error — enlarge the terminal window and try
again.
% retro-gamer play runs/snake/ --checkpoint final
You can substitute any checkpoint name:
To watch an earlier stage of training, use ``--checkpoint``:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint ep_0100
Press Enter or Escape to quit.
Comparing what the agent at episode 100 does versus the agent at episode
500 can reveal exactly what the agent has (and has not) learned. For
Snake, you might notice the episode-100 agent moving somewhat randomly,
while the episode-500 agent consistently navigates toward the apple.
Articulating *why* the later agent behaves differently—what the training
process produced—connects observation directly to the concepts underlying
DQN.
Comparing agents trained at different checkpoints is a useful activity:
the agent at episode 100 has learned *something*, but typically much
less than the agent at episode 500. Articulating *what* the earlier
agent has and has not learned, and *why*, is productive reasoning about
the training process.
Resuming training after watching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After watching the agent play, resume training with exactly the same
command you used before:
.. code-block:: console
% retro-gamer train runs/snake/
``retro-gamer`` automatically detects and resumes from the latest
checkpoint. No extra flags are needed. If all configured episodes have
already been completed, it prints a message and exits:
.. code-block:: console
Training already complete (1000 episodes). To keep training,
increase training_episodes in config.toml.
To continue training, open ``runs/snake/config.toml``, increase the
``training_episodes`` value, and run ``retro-gamer train`` again.
Watching a trained agent play
------------------------------
Once training is complete, watch the final agent:
.. code-block:: console
% retro-gamer play runs/snake/
By default the latest checkpoint is loaded. You can also compare the
agent's performance at different stages of training:
.. code-block:: console
% retro-gamer play runs/snake/ --checkpoint ep_0100
% retro-gamer play runs/snake/ --checkpoint ep_0500
Press Enter or Escape to quit.
Inspecting a run
----------------
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
.. code-block:: console
% retro-gamer info runs/snake/
Game module : retro.examples.snake
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
Game module : retro.examples.snake
Metadata : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
Model : {'hidden_sizes': [128, 64]}
Training : {'learning_rate': 0.0001, 'gamma': 0.99, ...}
Last 5 episodes:
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
Last 5 checkpoints:
[ep_0600] ep=0501-0600 avg_reward=+12.1 ...
[ep_0700] ep=0601-0700 avg_reward=+14.8 ...
[ep_0800] ep=0701-0800 avg_reward=+16.3 ...
[ep_0900] ep=0801-0900 avg_reward=+19.0 ...
[ep_1000] ep=0901-1000 avg_reward=+22.3 ...
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']
Adjusting hyperparameters
--------------------------
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
create``. Common adjustments and their effects:
**``training_episodes``** — How long to train. More episodes give the
agent more time to learn, but also take longer to run.
agent more time to learn, but also take longer to run. This is always
safe to increase.
**``epsilon_decay``** — How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.
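The effect of ``gamma`` is easiest to see by computing a discounted return by hand. The sketch below uses the standard definition (each reward is multiplied by ``gamma`` once per turn of delay); the reward sequence is invented for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where a reward t turns away is scaled by gamma ** t."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# An apple worth 50 is eaten two turns from now.
future_rewards = [0, 0, 50]
print(discounted_return(future_rewards, gamma=1.0))  # 50.0
print(discounted_return(future_rewards, gamma=0.5))  # 12.5
print(discounted_return(future_rewards, gamma=0.0))  # 0.0
```

With ``gamma`` near 1 the distant apple is worth almost its full value now; with ``gamma`` at 0 the agent cannot see it at all.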
**``n_layers`` and ``layer_size``** — The depth and width of the MLP
head. Larger networks can represent more complex Q-functions but are
slower to train and may overfit.
**``hidden_sizes``** — The shape of the MLP head as a list of layer
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
more complex Q-functions but are slower to train and may overfit.
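To get a feel for how ``hidden_sizes`` affects network size, the sketch below lists the layer shapes of a plain fully connected network and counts its weights and biases. The input size of 10 is an arbitrary placeholder, and the helper names are hypothetical; the real network built by the trainer may include other components.

```python
def layer_shapes(input_size, hidden_sizes, n_actions):
    """Return (inputs, outputs) for each fully connected layer, in order."""
    sizes = [input_size, *hidden_sizes, n_actions]
    return list(zip(sizes, sizes[1:]))


def parameter_count(shapes):
    """Each layer has inputs * outputs weights plus one bias per output."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Four actions, as in Snake; the input size is a placeholder value.
shapes = layer_shapes(input_size=10, hidden_sizes=[128, 64], n_actions=4)
print(shapes)                   # [(10, 128), (128, 64), (64, 4)]
print(parameter_count(shapes))  # 9924
```

Doubling every hidden size roughly quadruples the number of weights between hidden layers, which is one reason larger networks train more slowly.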
**``prioritize_experiences``** — Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.
.. _incompatible-changes:
Why some changes require starting fresh
----------------------------------------
Not all changes to ``config.toml`` are equal. Some can be applied
immediately to an existing training run; others make the existing
checkpoints unusable.
**Safe to change at any time** (``[training]`` section) — These affect
*how* the agent learns, not *what* it is learning to do. Existing
checkpoints remain valid:
- ``training_episodes``, ``max_turns_per_episode``
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
- ``target_update_freq``, ``train_every``
**Requires starting fresh** — These changes alter the shape of the
game or the shape of the network. The saved model weights are
incompatible with the new configuration:
- ``actions``, ``reward``, ``character_set``, ``board_size``
(``[metadata]``) — These define what the agent perceives and what it
can do. Changing them changes the size of the network's input or
output layers; the existing weights no longer fit.
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
``egocentric``, ``egocentric_player``, ``egocentric_radius``
(``[preprocessing]``) — These control how the observation is
constructed. Any change here alters the input shape or meaning and
makes existing weights invalid.
- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
layers. Changing it changes the shape of the network; the existing
weights no longer fit.
If you try to resume training after making one of these changes,
``retro-gamer train`` detects the mismatch and stops with a clear
explanation, for example::
Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
The following changes require starting fresh. The existing model was
trained on a different problem and its weights cannot be reused:
character_set
was : ['@', '*', '>', '<', '^', 'v']
now : ['@', '*', '>', '<', '^', 'v', '#']
why : the set of board characters (changes input layer size)
Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
training log, then run 'retro-gamer train RUN_DIR' to start fresh.
To clear out the old checkpoints and begin again:
.. code-block:: console
% retro-gamer clean runs/snake/
Will remove 5 checkpoint(s) and training log from runs/snake/:
checkpoints/ep_0100.pt
checkpoints/ep_0200.pt
...
training.log
Proceed? [y/N]: y
Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
The ``config.toml`` is always preserved so you do not need to run
``retro-gamer create`` again.
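To see why a change such as adding ``'#'`` to ``character_set`` invalidates existing weights, consider the following sketch. It assumes, purely for illustration, that each board cell is one-hot encoded over the character set; the encoding ``retro-gamer`` actually uses may differ, but any encoding that depends on the character set has the same consequence.

```python
def one_hot_input_size(board_size, character_set, extra_values=0):
    """Input length if every cell is one-hot encoded over the character set.

    Assumed encoding for illustration only, not retro-gamer's documented one.
    """
    width, height = board_size
    return width * height * len(character_set) + extra_values


old_chars = ["@", "*", ">", "<", "^", "v"]
new_chars = old_chars + ["#"]

print(one_hot_input_size((32, 16), old_chars))  # 3072
print(one_hot_input_size((32, 16), new_chars))  # 3584
```

Under this assumption the first layer's weight matrix would need 512 more input columns after the change, so the saved weights cannot simply be loaded into the new network.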
Reasoning about training from the log
--------------------------------------
The training log is one of the most useful tools for understanding what
is happening during training. Here are some patterns to look for and
what they mean.
**Reward increasing steadily** is the normal, healthy pattern. Each
checkpoint block should show a higher ``avg_reward`` than the last.
The rate of increase typically slows as training progresses.
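If you would rather check the trend programmatically than by eye, a short script can pull ``avg_reward`` out of each checkpoint line. This is a sketch that assumes the log lines look like the examples shown earlier; the function names are hypothetical, and the pattern may need adjusting if the log format differs.

```python
import re

LINE_PATTERN = re.compile(r"^\[ep_(\d+)\]\s+(.*)$")
# Allow optional spaces after "=", as in "avg_reward= -4.1".
FIELD_PATTERN = re.compile(r"(\w+)=\s*(\S+)")


def parse_checkpoint_line(line):
    """Return (checkpoint_episode, avg_reward), or None for other lines."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = dict(FIELD_PATTERN.findall(match.group(2)))
    return int(match.group(1)), float(fields["avg_reward"])


def reward_is_improving(lines):
    """True if every checkpoint's avg_reward exceeds the previous one."""
    parsed = [parse_checkpoint_line(line) for line in lines]
    rewards = [row[1] for row in parsed if row is not None]
    return all(later > earlier for earlier, later in zip(rewards, rewards[1:]))


sample = [
    "[ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938",
    "[ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879",
    "[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824",
]
print(reward_is_improving(sample))  # True
```

Lines that are not checkpoint entries, such as the architecture summary or a "Resumed from" header, are skipped rather than treated as errors.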
**Reward flat or negative through early episodes** is normal. Early in
training, ``epsilon`` is high and the agent is mostly acting randomly.
It has not yet discovered effective strategies. Patience—and a look at
the ``epsilon`` column—will confirm whether this is just the exploration
phase.
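To estimate how long the exploration phase will last, you can simulate the decay of ``epsilon`` ahead of time. The sketch below assumes a multiplicative decay applied once per episode and clamped at ``epsilon_min``; that schedule is an assumption for illustration, and the parameter values are chosen only to keep the arithmetic easy to follow.

```python
def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, episodes):
    """Return the exploration rate used in each episode (assumed schedule)."""
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        # Shrink epsilon by a constant factor, but never below the floor.
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return values


print(epsilon_schedule(1.0, 0.5, 0.1, 5))  # [1.0, 0.5, 0.25, 0.125, 0.1]
```

Plugging in the values from your own ``config.toml`` shows roughly how many episodes pass before the agent mostly follows its learned policy.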
**Loss decreasing** is also healthy. As the Q-network's estimates
improve, the difference between predicted and target Q-values (the TD
error) should shrink. A loss that settles at a low, steady value is
usually a good sign.
**Loss growing without bound** usually indicates the learning rate is too
high. The trainer uses Huber loss, which is robust to large reward scales, but
a learning rate above roughly ``0.001`` can still destabilize training.
Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
and restarting training.
**Short episodes (low ``avg_steps``)** combined with low reward
suggests the agent is dying frequently. Early in training this is
normal. If it persists late in training, the agent may have settled on
a bad policy—consider extending training or adjusting
``epsilon_decay`` to explore longer.
**Reward that improves and then regresses** can indicate that the
agent has discovered a suboptimal but consistent strategy and is stuck.
Increasing ``epsilon_min`` to keep some exploration active, or
adjusting the reward signal to better differentiate good moves from
bad ones, can help.
Questions for investigation
----------------------------
@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
episode 1000 and watch each play the same game. What has the later
agent learned that the earlier one has not? How would you describe
this difference to someone who does not know about neural networks?