Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
  (or the built-in Snake example). See the
  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
  for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by calling a function named
``create_game``. The function must take no arguments and return a new
``Game`` instance.

Here is the ``create_game`` function for Snake:

.. code-block:: python

    def create_game():
        head = SnakeHead()
        apple = Apple()
        # Agents and initial state; the score starts at 100.
        game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
        # Position the apple once the game (and its board) exists.
        apple.relocate(game)
        return game

If your game file does not already have a ``create_game`` function, add
one following this pattern.
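
The contract described above (a callable named ``create_game`` that
needs no arguments) can be checked before starting a run. The sketch
below is only an illustration: ``check_create_game`` and the stand-in
``demo_game`` module are hypothetical names, not part of
``retro-gamer``.

```python
import inspect
import types


def check_create_game(module):
    """Return module.create_game if it can be called with no arguments."""
    factory = getattr(module, "create_game", None)
    if not callable(factory):
        raise AttributeError(f"{module.__name__} has no create_game() function")
    # Every parameter must have a default (or be *args/**kwargs) so that
    # create_game() can be called with no arguments.
    for param in inspect.signature(factory).parameters.values():
        variadic = param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        )
        if param.default is inspect.Parameter.empty and not variadic:
            raise TypeError(f"create_game() requires argument {param.name!r}")
    return factory


# Stand-in module for demonstration; a real game file defines create_game itself.
demo = types.ModuleType("demo_game")
demo.create_game = lambda: "a new Game instance would be returned here"
factory = check_create_game(demo)
```

For a real game you could pass the result of
``importlib.import_module("retro.examples.snake")`` instead of the
stand-in module.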

When you run ``retro-gamer create``, you can point to your game file
directly by path or by Python module name:

.. code-block:: console

    % retro-gamer create --game my_game.py --output runs/my_game/
    % retro-gamer create --game retro.examples.snake --output runs/snake/


Describing your game
--------------------

Every training run begins with a description of your game. This
description belongs in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``, the same file that defines the project's
name, version, and dependencies. Placing it there keeps the description
alongside the game it describes.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]

Let's go through each field.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).

.. note::

    Only include actions that the game actually responds to. Listing
    unused keys wastes part of the agent's action space and may slow
    training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, the score changes when
the snake eats an apple (+50), when it moves away from the apple (−1),
and when it dies (−10). These incremental changes are what the agent
tries to maximize.
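
To make the "change in value" rule concrete, here is a minimal sketch
that turns a sequence of scores into per-turn rewards. The function
name ``rewards_from_scores`` and the score sequence are illustrative
assumptions; only the +50, −1, and −10 increments come from the Snake
description above.

```python
def rewards_from_scores(scores):
    """Per-turn reward: the change in the reward key between consecutive turns."""
    return [after - before for before, after in zip(scores, scores[1:])]


# Score after each turn: two moves away from the apple (-1 each),
# an apple eaten (+50), then death (-10).
scores = [100, 99, 98, 148, 138]
rewards = rewards_from_scores(scores)  # [-1, -1, 50, -10]
```

Because each reward is a difference, the rewards of an episode always
sum to the final score minus the starting score.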

Choosing an appropriate reward is one of the most consequential
decisions in reinforcement learning (RL). Some considerations:

- A reward that is too sparse, where the agent goes many turns without
  receiving any signal, makes learning slow. A snake that dies without
  ever eating an apple receives no positive reward at all in the first
  episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense (assigned every turn) may not reflect the
  true goal of the game.
- An artificial reward, such as giving a point for moving toward the
  apple, can accelerate early training but may cause the agent to
  optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.

For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
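
The one-hot encoding described above can be sketched in a few lines.
This is an illustration, not the trainer's actual code: it assumes that
an empty cell is encoded as all zeros and that the board is flattened
row by row, neither of which is stated in this walkthrough. The names
``one_hot`` and ``encode_board`` are hypothetical.

```python
CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


def one_hot(cell, character_set=CHARACTER_SET):
    """Encode one board cell as a list with at most a single 1."""
    vector = [0] * len(character_set)
    if cell in character_set:
        vector[character_set.index(cell)] = 1
    # Characters outside the set (including a blank cell) stay all zeros.
    return vector


def encode_board(rows, character_set=CHARACTER_SET):
    """Flatten a board (a list of equal-length strings) into one vector."""
    return [bit for row in rows for cell in row for bit in one_hot(cell, character_set)]


tiny_board = ["  @ ", "**> "]
encoded = encode_board(tiny_board)
```

Under these assumptions the 2×4 board above yields 2 × 4 × 6 = 48
values. Adding a character to the set would change that length, which
is why the character set is fixed for the lifetime of a run.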

If you omit ``character_set``, ``retro-gamer`` will run a brief
exploration phase before training to discover which characters actually
appear. The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.

``spatial`` and other preprocessing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``[tool.retro-gamer]`` section describes the game. Preprocessing
options, such as ``spatial`` (whether to use a CNN rather than an MLP;
default: ``false``), ``egocentric``, and ``observe_state``, live in the
``[preprocessing]`` section of the generated ``config.toml``. You can
edit them there after running ``retro-gamer create``.

``observe_state``
~~~~~~~~~~~~~~~~~

By default the agent only sees the board. You can also give it access
to computed values from ``game.state`` by listing the relevant keys in
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
For example, Snake exposes the normalized direction to the apple:

.. code-block:: toml

    [preprocessing]
    observe_state = ["apple_dx", "apple_dy"]

The trainer appends these values to the observation vector after the
board encoding (or uses them as the entire observation when
``board = false``).
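
As a rough picture of how the observation is assembled, the sketch
below appends the observed state values to an already-encoded board.
``build_observation`` is a hypothetical helper written for this
walkthrough; its ``board`` argument mirrors the ``board`` option
mentioned above, and the example values are made up.

```python
def build_observation(board_vector, state, observe_state, board=True):
    """Append observed state values to the board encoding."""
    observation = list(board_vector) if board else []
    for key in observe_state:
        value = state[key]  # a missing key raises KeyError
        # Accept either a single number or a sequence of numbers.
        if isinstance(value, (list, tuple)):
            observation.extend(float(v) for v in value)
        else:
            observation.append(float(value))
    return observation


state = {"score": 100, "apple_dx": 0.6, "apple_dy": -0.8}
keys = ["apple_dx", "apple_dy"]
with_board = build_observation([0, 1, 0, 0], state, keys)
state_only = build_observation([0, 1, 0, 0], state, keys, board=False)
```

Note that ``score`` is present in the state but does not reach the
agent, because it is not listed in ``observe_state``.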

The values listed in ``observe_state`` must be set in ``game.state`` at
the start of every episode (typically inside ``create_game()``) and must
keep the same type and length from episode to episode.

.. warning::

    Always initialize every key listed in ``observe_state`` before the
    game starts. If a key is missing or its length changes between
    episodes, training stops immediately with a clear error explaining
    what changed. The neural network's input size is fixed when training
    begins; it cannot adapt to a changing observation shape mid-run.
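
You can run a similar check yourself while developing a game. The
sketch below is an assumption about what such a check might look like,
not the trainer's implementation; ``check_observed_state`` is a
hypothetical name.

```python
def check_observed_state(state, observe_state, expected_sizes=None):
    """Return the size of each observed value, raising if anything is off."""
    sizes = {}
    for key in observe_state:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} is missing from game.state")
        value = state[key]
        # Scalars count as size 1; sequences count their elements.
        sizes[key] = len(value) if isinstance(value, (list, tuple)) else 1
    if expected_sizes is not None and sizes != expected_sizes:
        raise ValueError(f"observation shape changed: {expected_sizes} -> {sizes}")
    return sizes


first_episode = check_observed_state(
    {"apple_dx": 0.6, "apple_dy": -0.8}, ["apple_dx", "apple_dy"]
)
```

Calling it again at the start of each later episode, passing
``first_episode`` as ``expected_sizes``, would catch a changed shape
before it reaches the network.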

Choosing what to list in ``observe_state`` is a good moment to ask:
*can a human player see this information?* The apple's location is
visible on screen; the normalized direction vector is not. Whether that
asymmetry is appropriate is a design choice worth examining.

Once you have written the ``[tool.retro-gamer]`` section, create the
training run directory:

.. code-block:: console

    % retro-gamer create \
        --game retro.examples.snake \
        --output runs/snake/

    Created training run at runs/snake/config.toml
      game        : retro.examples.snake
      board_size  : 32×16
      actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
      reward      : score
      characters  : ['@', '*', '>', '<', '^', 'v']
      architecture: MLP

``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it, along with all hyperparameters, to
``runs/snake/config.toml``.

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

    % retro-gamer train runs/snake/
    100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
    Done. Checkpoints saved in runs/snake/checkpoints/

A progress bar shows how far training has gone, along with the most
recent episode's reward, the current exploration rate (``eps``), and
the average prediction error (``loss``).

Training saves a checkpoint every 100 episodes to
``runs/snake/checkpoints/``. You can stop training at any time with
Ctrl-C and resume it later: the next ``retro-gamer train`` command will
automatically pick up from the latest checkpoint.
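
Checkpoint files are named by episode number (for example
``ep_0500.pt``), so the latest one is easy to find from your own
scripts. The sketch below assumes that naming pattern holds for every
checkpoint; ``latest_checkpoint`` is a hypothetical helper, not a
``retro-gamer`` function.

```python
import re
from pathlib import Path

CHECKPOINT_NAME = re.compile(r"ep_(\d+)\.pt")


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the highest episode number, or None."""
    numbered = []
    for path in Path(checkpoint_dir).glob("ep_*.pt"):
        match = CHECKPOINT_NAME.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Compare by episode number, not by file name or modification time.
    return max(numbered, key=lambda item: item[0])[1] if numbered else None
```

For the run above, ``latest_checkpoint("runs/snake/checkpoints")``
would return the path to the highest-numbered ``ep_*.pt`` file, or
``None`` if training has not yet saved anything.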

Reading the training log
~~~~~~~~~~~~~~~~~~~~~~~~

For a longer view of how training is progressing, inspect the training
log:

.. code-block:: console

    % cat runs/snake/training.log

The log begins with the full network architecture, followed by one line
per checkpoint (every 100 episodes). The excerpt below omits some
checkpoints:

.. code-block:: text

    [ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
    [ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
    [ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
    [ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
    [ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s

Here is what each field means:

- **avg_reward**: Average total reward per episode over the past 100 episodes.
  Positive values mean the agent is accumulating reward; negative values mean
  it is accumulating penalties. An upward trend over time is the main signal
  that learning is working.
- **avg_steps**: Average number of turns per episode. If episodes are ending
  quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
  generally indicate the agent is surviving longer.
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
  and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
  behavior is expected.
- **avg_loss**: Average Huber loss across training steps. Huber loss is
  quadratic for small prediction errors and linear for large ones, which keeps
  it stable even when rewards have a wide range (such as a large bonus for
  reaching a goal). Values in the range 0–10 are typical for most games.
  A slow downward trend is the healthy pattern. A loss that grows without bound
  indicates the learning rate is too high.
- **time**: Wall-clock time for this 100-episode interval.
- **total**: Cumulative training time across all sessions.
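
The quadratic-then-linear shape of the Huber loss mentioned under
**avg_loss** is easiest to see in code. This sketch uses the standard
definition for a single error value and assumes a threshold
(``delta``) of 1.0; the threshold the trainer actually uses is not
stated here.

```python
def huber(error, delta=1.0):
    """Huber loss for one prediction error (prediction minus target)."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error ** 2  # quadratic region
    return delta * (magnitude - 0.5 * delta)  # linear region


small = huber(0.5)   # 0.125
large = huber(10.0)  # 9.5, versus 50.0 for a purely quadratic loss
```

The second value shows why a single large error, such as the one
produced by an unexpected +50 reward, does not dominate the average.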

When training is resumed after a stop, a header line marks the break::

    === Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===

This lets you track exactly when each session took place.
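
Because each checkpoint line in the log uses a regular ``key=value``
layout, it is straightforward to load the log into a script for
plotting or comparison. The sketch below is based only on the sample
lines shown in this section; ``parse_log_line`` is a hypothetical
helper, and the real log may contain details it does not handle.

```python
import re

# A key, an equals sign, optional padding, then the value.
FIELD = re.compile(r"(\w+)=\s*(\S+)")


def parse_log_line(line):
    """Split one checkpoint line into its label and key/value fields."""
    label, _, rest = line.partition("]")
    fields = dict(FIELD.findall(rest))
    for key in ("avg_reward", "avg_steps", "epsilon", "avg_loss"):
        fields[key] = float(fields[key])
    return label.lstrip("["), fields


line = ("[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 "
        "epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s")
label, fields = parse_log_line(line)
```

When reading a whole file, skip lines that do not start with ``[``,
such as the architecture summary and the ``=== Resumed ... ===``
headers.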

Stopping training to watch the agent play
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You do not need to wait for training to finish before watching the
agent. Training can be stopped at any time with Ctrl-C, and the latest
checkpoint is always available immediately:

.. code-block:: console

    % retro-gamer play runs/snake/

This loads the most recent checkpoint and runs the agent in your
terminal. Press Enter or Escape to quit.

.. note::

    The game is rendered directly in your terminal. If the window is
    smaller than the board plus borders, ``retro-gamer play`` will raise
    a ``TerminalTooSmall`` error. Enlarge the terminal window and try
    again.

To watch an earlier stage of training, use ``--checkpoint``:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100

Comparing what the agent at episode 100 does versus the agent at episode
500 can reveal exactly what the agent has (and has not) learned. For
Snake, you might notice the episode-100 agent moving somewhat randomly,
while the episode-500 agent consistently navigates toward the apple.
Articulating *why* the later agent behaves differently (what the training
process produced) connects observation directly to the concepts underlying
DQN.

Resuming training after watching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After watching the agent play, resume training with exactly the same
command you used before:

.. code-block:: console

    % retro-gamer train runs/snake/

``retro-gamer`` automatically detects and resumes from the latest
checkpoint. No extra flags are needed. If all configured episodes have
already been completed, it prints a message and exits:

.. code-block:: console

    Training already complete (1000 episodes). To keep training,
    increase training_episodes in config.toml.

To continue training, open ``runs/snake/config.toml``, increase the
``training_episodes`` value, and run ``retro-gamer train`` again.

Watching a trained agent play
------------------------------

Once training is complete, watch the final agent:

.. code-block:: console

    % retro-gamer play runs/snake/

By default the latest checkpoint is loaded. You can also compare the
agent's performance at different stages of training:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100
    % retro-gamer play runs/snake/ --checkpoint ep_0500

Press Enter or Escape to quit.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

    % retro-gamer info runs/snake/
    Game module   : retro.examples.snake
    Metadata      : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
    Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
    Model         : {'hidden_sizes': [128, 64]}
    Training      : {'learning_rate': 0.0001, 'gamma': 0.99, ...}

    Last 5 checkpoints:
      [ep_0600] ep=0501-0600 avg_reward=+12.1 ...
      [ep_0700] ep=0601-0700 avg_reward=+14.8 ...
      [ep_0800] ep=0701-0800 avg_reward=+16.3 ...
      [ep_0900] ep=0801-0900 avg_reward=+19.0 ...
      [ep_1000] ep=0901-1000 avg_reward=+22.3 ...

    Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']

Adjusting hyperparameters
--------------------------

The training hyperparameters can be changed by editing ``config.toml``
before training, or by passing them as options to
``retro-gamer create``. Common adjustments and their effects:

``training_episodes``: How long to train. More episodes give the
agent more time to learn, but also take longer to run. This is always
safe to increase.

``epsilon_decay``: How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
Q-estimates before they are fully reliable. A slower decay (larger
``epsilon_decay``, closer to 1) gives the agent more time to explore
but may waste training time on random actions.

``learning_rate``: How large the weight updates are at each
training step. A large learning rate learns fast but may overshoot;
a small learning rate is stable but slow.

``gamma``: The discount factor for future rewards. Closer to 1
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.

``hidden_sizes``: The shape of the MLP head as a list of layer
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
more complex Q-functions but are slower to train and may overfit.

``prioritize_experiences``: Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.
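
Two of these hyperparameters, ``epsilon_decay`` and ``gamma``, are
easier to reason about with numbers in hand. The sketch below assumes
that ``epsilon`` is multiplied by ``epsilon_decay`` once per episode
and floored at ``epsilon_min``; that schedule is a common choice but is
not confirmed by this walkthrough. The discounted return uses the
standard definition. Both function names are hypothetical.

```python
def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, episodes):
    """Exploration rate at the start of each episode (assumed schedule)."""
    rates = []
    for _ in range(episodes):
        rates.append(epsilon)
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return rates


def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per turn of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rates = epsilon_schedule(1.0, 0.5, 0.1, 5)      # [1.0, 0.5, 0.25, 0.125, 0.1]
patient = discounted_return([0, 0, 50], 0.99)   # about 49.0
impatient = discounted_return([0, 0, 50], 0.5)  # 12.5
```

The decay of 0.5 is exaggerated so the floor is visible within five
episodes. The two returns show how an apple two turns away is worth
nearly its full +50 to an agent with ``gamma`` close to 1, but only a
quarter of that when ``gamma`` is 0.5.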

.. _incompatible-changes:

Why some changes require starting fresh
----------------------------------------

Not all changes to ``config.toml`` are equal. Some can be applied
immediately to an existing training run; others make the existing
checkpoints unusable.

**Safe to change at any time** (``[training]`` section): these affect
*how* the agent learns, not *what* it is learning to do. Existing
checkpoints remain valid:

- ``training_episodes``, ``max_turns_per_episode``
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
- ``target_update_freq``, ``train_every``

**Requires starting fresh**: these changes alter the shape of the
game or the shape of the network. The saved model weights are
incompatible with the new configuration:

- ``actions``, ``reward``, ``character_set``, ``board_size``
  (``[metadata]``): these define what the agent perceives, what it
  can do, and what it is rewarded for. Changing them changes the size
  of the network's input or output layers or, in the case of
  ``reward``, the problem the weights were trained to solve; the
  existing weights no longer fit.
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
  (``[preprocessing]``): these control how the observation is
  constructed. Any change here alters the input shape or meaning and
  makes existing weights invalid.
- ``hidden_sizes`` (``[model]``): this defines the network's hidden
  layers. Changing it changes the shape of the network; the existing
  weights no longer fit.
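
The split between the two groups above amounts to comparing a fixed
set of keys between the configuration a checkpoint was trained with and
the current one. The sketch below illustrates the idea under the
assumption that both configurations are available as nested
dictionaries; how ``retro-gamer`` actually stores and compares them is
not described here, and ``incompatible_changes`` is a hypothetical
name.

```python
STRUCTURAL_KEYS = {
    "metadata": ["actions", "reward", "character_set", "board_size"],
    "preprocessing": ["spatial", "board", "observe_state", "observe_state_sizes",
                      "egocentric", "egocentric_player", "egocentric_radius"],
    "model": ["hidden_sizes"],
}


def incompatible_changes(saved, current):
    """List (section, key, was, now) for every structural setting that differs."""
    changes = []
    for section, keys in STRUCTURAL_KEYS.items():
        for key in keys:
            was = saved.get(section, {}).get(key)
            now = current.get(section, {}).get(key)
            if was != now:
                changes.append((section, key, was, now))
    return changes


saved = {"metadata": {"character_set": ["@", "*"]}, "training": {"gamma": 0.99}}
current = {"metadata": {"character_set": ["@", "*", "#"]}, "training": {"gamma": 0.9}}
changes = incompatible_changes(saved, current)
```

Here the changed ``gamma`` is ignored because ``[training]`` keys are
safe to change, while the extra ``#`` character is reported.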

If you try to resume training after making one of these changes,
``retro-gamer train`` detects the mismatch and stops with a clear
explanation, for example::

    Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.

    The following changes require starting fresh. The existing model was
    trained on a different problem and its weights cannot be reused:

      character_set
        was : ['@', '*', '>', '<', '^', 'v']
        now : ['@', '*', '>', '<', '^', 'v', '#']
        why : the set of board characters (changes input layer size)

    Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
    training log, then run 'retro-gamer train RUN_DIR' to start fresh.

To clear out the old checkpoints and begin again:

.. code-block:: console

    % retro-gamer clean runs/snake/
    Will remove 5 checkpoint(s) and training log from runs/snake/:
      checkpoints/ep_0100.pt
      checkpoints/ep_0200.pt
      ...
      training.log

    Proceed? [y/N]: y
    Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.

The ``config.toml`` is always preserved so you do not need to run
``retro-gamer create`` again.

Reasoning about training from the log
--------------------------------------

The training log is one of the most useful tools for understanding what
is happening during training. Here are some patterns to look for and
what they mean.

**Reward increasing steadily** is the normal, healthy pattern. Each
checkpoint line should show a higher ``avg_reward`` than the last.
The rate of increase typically slows as training progresses.

**Reward flat or negative through early episodes** is normal. Early in
training, ``epsilon`` is high and the agent is mostly acting randomly.
It has not yet discovered effective strategies. Patience, and a look at
the ``epsilon`` column, will confirm whether this is just the
exploration phase.

**Loss decreasing** is also healthy. As the Q-network's estimates
improve, the difference between predicted and target Q-values (the TD
error) should shrink. A loss that stabilizes near zero is usually a
good sign.

**Loss growing without bound** indicates the learning rate is too high.
The trainer uses Huber loss, which is robust to large reward scales, but
a learning rate above roughly ``0.001`` can still destabilize training.
Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
and restarting training.

**Short episodes** (low ``avg_steps``) combined with low reward
suggest the agent is dying frequently. Early in training this is
normal. If it persists late in training, the agent may have settled on
a bad policy; consider extending training or adjusting
``epsilon_decay`` to explore longer.

**Reward that improves and then regresses** can indicate that the
agent has discovered a suboptimal but consistent strategy and is stuck.
Increasing ``epsilon_min`` to keep some exploration active, or
adjusting the reward signal to better differentiate good moves from
bad ones, can help.

Questions for investigation
----------------------------

The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full
   character set, one missing a character that frequently appears on the
   board. Compare their performance. What did the second agent lose the
   ability to perceive, and how did that affect its behavior?

2. **Spatial vs. non-spatial.** Train the same game with
   ``spatial = true`` and ``spatial = false``. How does training
   efficiency differ? Can you explain the difference in terms of what
   each architecture can and cannot learn?

3. **Reward shaping.** If the game currently rewards only the final
   objective (e.g., reaching a goal), add intermediate rewards for
   sub-goals. How does this change the early training curve? Does it
   change the agent's final strategy?

4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
   (so the agent commits to exploiting early) and a very slow one (so
   exploration continues for a long time). How do the training curves
   differ? What is the agent doing in each case when ``epsilon`` is low?

5. **Checkpoint comparison.** Load the agent at episode 100 and at
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?