
Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
  (or the built-in Snake example). See the
  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
  for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by calling a function named
``create_game``. The function must take no arguments and return a new
``Game`` instance.

Here is the ``create_game`` function for Snake:

.. code-block:: python

   def create_game():
       head = SnakeHead()
       apple = Apple()
       game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
       apple.relocate(game)
       return game

If your game file does not already have a ``create_game`` function, add
one following this pattern.

When you run ``retro-gamer create``, you can point to your game file
directly by path or by Python module name:

.. code-block:: console

   % retro-gamer create --game my_game.py --output runs/my_game/
   % retro-gamer create --game retro.examples.snake --output runs/snake/
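
How ``retro-gamer`` resolves the ``--game`` argument internally is not
covered here. As a rough sketch of the idea only, a loader could treat
anything ending in ``.py`` as a file path and everything else as a module
name. ``load_create_game`` is a hypothetical helper written for this
illustration, not part of ``retro-gamer``:

.. code-block:: python

   import importlib
   import importlib.util
   from pathlib import Path


   def load_create_game(target):
       """Return create_game from a file path or a dotted module name.

       Hypothetical helper: any target ending in ".py" is treated as a path.
       """
       if target.endswith(".py"):
           path = Path(target)
           spec = importlib.util.spec_from_file_location(path.stem, path)
           module = importlib.util.module_from_spec(spec)
           spec.loader.exec_module(module)
       else:
           module = importlib.import_module(target)
       create_game = getattr(module, "create_game", None)
       if not callable(create_game):
           raise AttributeError(f"{target} does not define create_game()")
       return create_game

Either form ends in the same place: a callable that takes no arguments
and returns a fresh game.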


Describing your game
--------------------

Every training run begins with a description of your game. This
description lives in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``, the same file that defines the project's
name, version, and dependencies, so the description stays with the game
it describes.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]

Let's go through each field.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).

.. note::

   Only include actions that the game actually responds to. Listing
   unused keys wastes part of the agent's action space and may slow
   training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, the score changes when
the snake eats an apple (+50), when it moves away from the apple (−1),
and when it dies (−10). These incremental changes are what the agent
tries to maximize.
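
The per-turn reward is simply the difference between consecutive score
values. A minimal sketch, where ``rewards_from_scores`` is a
hypothetical helper and the score history is made up from the Snake
values quoted above:

.. code-block:: python

   def rewards_from_scores(scores):
       """Turn a sequence of per-turn scores into per-turn rewards."""
       return [after - before for before, after in zip(scores, scores[1:])]


   # Illustrative history: two moves away from the apple, an apple eaten,
   # then death.
   scores = [100, 99, 98, 148, 138]
   print(rewards_from_scores(scores))  # [-1, -1, 50, -10]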

Choosing an appropriate reward is one of the most consequential
decisions in RL. Some considerations:

- A reward that is too sparse—where the agent goes many turns without
  receiving any signal—makes learning slow. A snake that dies without
  ever eating an apple receives no positive reward at all in the first
  episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense—assigned every turn—may not reflect the
  true goal of the game.
- An artificial reward, such as giving a point for moving toward the
  apple, can accelerate early training but may cause the agent to
  optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.

For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
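
The encoding described above can be sketched in a few lines. This is an
illustration rather than the ``retro-gamer`` implementation; in
particular, it assumes that "treated as empty" means an all-zero vector:

.. code-block:: python

   CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


   def one_hot(cell, character_set=CHARACTER_SET):
       """Encode one board cell as a list containing at most a single 1."""
       vector = [0] * len(character_set)
       if cell in character_set:
           vector[character_set.index(cell)] = 1
       return vector  # unknown characters stay all zeros (assumption)


   print(one_hot("@"))  # [1, 0, 0, 0, 0, 0]
   print(one_hot("^"))  # [0, 0, 0, 0, 1, 0]
   print(one_hot("#"))  # [0, 0, 0, 0, 0, 0]

Under the further assumption that the encoded cells are simply
flattened for the MLP, Snake's 32×16 board with six characters would
produce 32 × 16 × 6 = 3072 board inputs.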

If you omit this field, ``retro-gamer`` will run a brief exploration
phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.

``spatial`` and other preprocessing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``[tool.retro-gamer]`` section describes the game. Preprocessing
options live in the ``[preprocessing]`` section of the generated
``config.toml`` instead: ``spatial`` (use a CNN when ``true`` or an MLP
when ``false``; default: ``false``), ``egocentric``, and
``observe_state``. You can edit them there after running
``retro-gamer create``.

``observe_state``
~~~~~~~~~~~~~~~~~

By default the agent only sees the board. You can also give it access
to computed values from ``game.state`` by listing the relevant keys in
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
For example, Snake exposes the normalized direction to the apple:

.. code-block:: toml

   [preprocessing]
   observe_state = ["apple_dx", "apple_dy"]

The trainer appends these values to the observation vector after the
board encoding (or uses them as the entire observation when the
``board`` preprocessing option is set to ``false``).
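
Conceptually, the observation is a concatenation. The sketch below
assumes the board has already been encoded and flattened into a list of
numbers; ``build_observation`` is a hypothetical helper, not the
trainer's actual code:

.. code-block:: python

   def build_observation(board_vector, state, observe_state):
       """Append the selected state values to the flattened board encoding."""
       observation = list(board_vector)
       for key in observe_state:
           value = state[key]
           # Accept either a single number or a sequence of numbers.
           if isinstance(value, (int, float)):
               observation.append(float(value))
           else:
               observation.extend(float(v) for v in value)
       return observation


   state = {"score": 100, "apple_dx": 0.5, "apple_dy": -0.25}
   print(build_observation([0, 1, 0], state, ["apple_dx", "apple_dy"]))
   # [0, 1, 0, 0.5, -0.25]

Passing an empty ``board_vector`` corresponds to the case where the
state values make up the entire observation.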

These values must be set in ``game.state`` at the start of every
episode—typically inside ``create_game()``—and must keep the same
type and length from episode to episode.

.. warning::

   Always initialize every key listed in ``observe_state`` before the
   game starts. If a key is missing or its length changes between
   episodes, training stops immediately with a clear error explaining
   what changed. The neural network's input size is fixed when training
   begins; it cannot adapt to a changing observation shape mid-run.
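
A shape check of this kind could be sketched as follows. Both functions
are hypothetical and only illustrate the idea of comparing each episode
against the first; the real trainer's check and error message may
differ:

.. code-block:: python

   def observation_signature(state, observe_state):
       """Map each observed key to (type name, length) for one episode."""
       signature = {}
       for key in observe_state:
           if key not in state:
               raise KeyError(f"observe_state key {key!r} missing from game.state")
           value = state[key]
           length = len(value) if hasattr(value, "__len__") else 1
           signature[key] = (type(value).__name__, length)
       return signature


   def check_same_shape(first, later):
       """Raise if a later episode's signature differs from the first one."""
       changed = {k: (first[k], later[k]) for k in first if first[k] != later[k]}
       if changed:
           raise ValueError(f"observation shape changed between episodes: {changed}")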

This is a good place to ask: *can a human player see this information?*
The apple's location is visible on screen; the normalized direction vector
is not. Whether that asymmetry is appropriate is a design choice worth
examining.

Once you have written this section, create the training run directory:

.. code-block:: console

   % retro-gamer create                    \
       --game retro.examples.snake         \
       --output runs/snake/

   Created training run at runs/snake/config.toml
     game        : retro.examples.snake
     board_size  : 32×16
     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
     reward      : score
     characters  : ['@', '*', '>', '<', '^', 'v']
     architecture: MLP

``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
``runs/snake/config.toml``.
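
To inspect the table that ``retro-gamer create`` reads, you can parse it
yourself with ``tomllib`` from the standard library (Python 3.11 or
higher, which the prerequisites already require). The sketch below
parses an inline copy of the Snake section rather than a file on disk:

.. code-block:: python

   import tomllib  # standard library since Python 3.11

   PYPROJECT = """
   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
   """

   metadata = tomllib.loads(PYPROJECT)["tool"]["retro-gamer"]
   print(metadata["reward"])        # score
   print(len(metadata["actions"]))  # 4

For a real project, open ``pyproject.toml`` in binary mode and pass the
file object to ``tomllib.load`` instead.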

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

   % retro-gamer train runs/snake/
   100%|████████████████████| 1000/1000 [12:34<00:00,  1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
   Done. Checkpoints saved in runs/snake/checkpoints/

A progress bar shows how far training has gone, along with the most
recent episode's reward, the current exploration rate (``eps``), and
the average prediction error (``loss``).

Training saves a checkpoint every 100 episodes to
``runs/snake/checkpoints/``. You can stop training at any time with
Ctrl-C and resume it later—the next ``retro-gamer train`` command will
automatically pick up from the latest checkpoint.
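
Picking the latest checkpoint amounts to finding the highest episode
number among the ``ep_NNNN.pt`` files. A minimal sketch of that lookup,
assuming only the file naming shown in this walkthrough
(``latest_checkpoint`` is a hypothetical helper):

.. code-block:: python

   import re
   from pathlib import Path

   CHECKPOINT_NAME = re.compile(r"ep_(\d+)\.pt$")


   def latest_checkpoint(checkpoint_dir):
       """Return the checkpoint with the highest episode number, or None."""
       found = []
       for path in Path(checkpoint_dir).glob("ep_*.pt"):
           match = CHECKPOINT_NAME.match(path.name)
           if match:
               found.append((int(match.group(1)), path))
       if not found:
           return None
       return max(found, key=lambda item: item[0])[1]

Comparing episode numbers as integers rather than sorting file names
keeps the result correct if a run ever exceeds four digits, where
``ep_10000.pt`` would sort before ``ep_9900.pt`` alphabetically.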

Reading the training log
~~~~~~~~~~~~~~~~~~~~~~~~

For a longer view of how training is progressing, inspect the training
log:

.. code-block:: console

   % cat runs/snake/training.log

The log begins with the full network architecture, followed by one line
per checkpoint (every 100 episodes). An excerpt:

.. code-block:: text

   [ep_0100]  ep=0001-0100  avg_reward=-31.4  avg_steps=47   epsilon=0.938  avg_loss=7.2  time=0m12s  total=0m12s
   [ep_0200]  ep=0101-0200  avg_reward=-18.6  avg_steps=89   epsilon=0.879  avg_loss=6.8  time=0m14s  total=0m26s
   [ep_0300]  ep=0201-0300  avg_reward= -4.1  avg_steps=134  epsilon=0.824  avg_loss=6.1  time=0m15s  total=0m41s
   [ep_0500]  ep=0401-0500  avg_reward= +8.7  avg_steps=210  epsilon=0.724  avg_loss=5.4  time=0m16s  total=1m12s
   [ep_1000]  ep=0901-1000  avg_reward=+22.3  avg_steps=389  epsilon=0.557  avg_loss=4.9  time=0m18s  total=2m30s

Here is what each field means:

- **avg_reward**: Average total reward per episode over the past 100 episodes.
  Positive values mean the agent is accumulating reward; negative values mean
  it is accumulating penalties. An upward trend over time is the main signal
  that learning is working.
- **avg_steps**: Average number of turns per episode. If episodes are ending
  quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
  generally indicate the agent is surviving longer.
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
  and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
  behavior is expected.
- **avg_loss**: Average Huber loss across training steps. Huber loss is
  quadratic for small prediction errors and linear for large ones, which keeps
  it stable even when rewards have a wide range (such as a large bonus for
  reaching a goal). Values in the range 0–10 are typical for most games.
  A slow downward trend is the healthy pattern. A loss that grows without bound
  indicates the learning rate is too high.
- **time**: Wall-clock time for this 100-episode interval.
- **total**: Cumulative training time across all sessions.
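
The Huber loss mentioned under **avg_loss** can be written out for a
single prediction error. The sketch assumes a threshold of
``delta = 1.0``, a common default; the value ``retro-gamer`` actually
uses is not stated here:

.. code-block:: python

   def huber(error, delta=1.0):
       """Huber loss for one prediction error (delta is an assumed default)."""
       magnitude = abs(error)
       if magnitude <= delta:
           return 0.5 * error ** 2               # quadratic near zero
       return delta * (magnitude - 0.5 * delta)  # linear for large errors


   # Compare Huber loss with half the squared error for three error sizes.
   for error in (0.5, 1.0, 10.0):
       print(error, huber(error), 0.5 * error ** 2)

For an error of 10 the squared-error term is 50 while the Huber loss is
only 9.5, which is why a single large reward does not dominate the
update.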

When training is resumed after a stop, a header line marks the break::

   === Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===

This lets you track exactly when each session took place.
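
Because every checkpoint line uses ``key=value`` fields, the log is easy
to load into Python for plotting or comparison. A sketch, assuming the
line format shown above (``parse_log_line`` is a hypothetical helper,
not something ``retro-gamer`` provides):

.. code-block:: python

   import re

   FIELD = re.compile(r"(\w+)=\s*(\S+)")
   NUMERIC = ("avg_reward", "avg_steps", "epsilon", "avg_loss")


   def parse_log_line(line):
       """Parse one checkpoint line into a dict; return None for other lines."""
       line = line.strip()
       if not line.startswith("[ep_"):
           return None  # architecture dump, "=== Resumed ..." headers, blanks
       fields = dict(FIELD.findall(line))
       for key in NUMERIC:
           fields[key] = float(fields[key])
       return fields


   line = ("[ep_0300]  ep=0201-0300  avg_reward= -4.1  avg_steps=134  "
           "epsilon=0.824  avg_loss=6.1  time=0m15s  total=0m41s")
   print(parse_log_line(line)["avg_reward"])  # -4.1

Applying the function to every line of ``training.log`` and dropping the
``None`` results gives one record per checkpoint.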

Stopping training to watch the agent play
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You do not need to wait for training to finish before watching the
agent. Training can be stopped at any time with Ctrl-C, and the latest
checkpoint is always available immediately:

.. code-block:: console

   % retro-gamer play runs/snake/

This loads the most recent checkpoint and runs the agent in your
terminal. Press Enter or Escape to quit.

.. note::

   The game is rendered directly in your terminal. If the window is
   smaller than the board plus borders, ``retro-gamer play`` will raise
   a ``TerminalTooSmall`` error — enlarge the terminal window and try
   again.

To watch an earlier stage of training, use ``--checkpoint``:

.. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint ep_0100

Comparing the agent at episode 100 with the agent at episode 500 can
reveal what the agent has (and has not) learned. For Snake, you might
notice the episode-100 agent moving somewhat randomly, while the
episode-500 agent consistently navigates toward the apple. Articulating
*why* the later agent behaves differently, that is, what the training
process produced, connects observation directly to the concepts
underlying DQN (deep Q-network) training.

Resuming training after watching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After watching the agent play, resume training with exactly the same
command you used before:

.. code-block:: console

   % retro-gamer train runs/snake/

``retro-gamer`` automatically detects and resumes from the latest
checkpoint. No extra flags are needed. If all configured episodes have
already been completed, it prints a message and exits:

.. code-block:: console

   Training already complete (1000 episodes). To keep training,
   increase training_episodes in config.toml.

To continue training, open ``runs/snake/config.toml``, increase the
``training_episodes`` value, and run ``retro-gamer train`` again.

Watching a trained agent play
------------------------------

Once training is complete, watch the final agent:

.. code-block:: console

   % retro-gamer play runs/snake/

By default the latest checkpoint is loaded. You can also compare the
agent's performance at different stages of training:

.. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint ep_0100
   % retro-gamer play runs/snake/ --checkpoint ep_0500

Press Enter or Escape to quit.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

   % retro-gamer info runs/snake/
   Game module    : retro.examples.snake
   Metadata       : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
   Preprocessing  : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
   Model          : {'hidden_sizes': [128, 64]}
   Training       : {'learning_rate': 0.0001, 'gamma': 0.99, ...}

   Last 5 checkpoints:
     [ep_0600]  ep=0501-0600  avg_reward=+12.1 ...
     [ep_0700]  ep=0601-0700  avg_reward=+14.8 ...
     [ep_0800]  ep=0701-0800  avg_reward=+16.3 ...
     [ep_0900]  ep=0801-0900  avg_reward=+19.0 ...
     [ep_1000]  ep=0901-1000  avg_reward=+22.3 ...

   Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']

Adjusting hyperparameters
--------------------------

The training hyperparameters can be changed by editing ``config.toml``
or by passing them as options to ``retro-gamer create``. Some settings
can only be changed before training starts; see
:ref:`incompatible-changes`. Common adjustments and their effects:

``training_episodes``: How long to train. More episodes give the
agent more time to learn, but also take longer to run. This is always
safe to increase.

``epsilon_decay``: How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
Q-estimates before they are fully reliable. A slower decay (larger
``epsilon_decay``, closer to 1) gives the agent more time to explore
but may waste training time on random actions.
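
To get a feel for the numbers, the sketch below assumes a multiplicative
schedule in which ``epsilon`` is multiplied by ``epsilon_decay`` at each
update and never drops below ``epsilon_min``. Whether ``retro-gamer``
updates per episode or per step is not specified here, and the values
used are illustrative rather than the tool's defaults:

.. code-block:: python

   def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, updates):
       """Yield epsilon after each of `updates` multiplicative decay steps."""
       for _ in range(updates):
           epsilon = max(epsilon_min, epsilon * epsilon_decay)
           yield epsilon


   # Illustrative values only: same start and floor, different decay rates.
   fast = list(epsilon_schedule(1.0, 0.99, 0.05, 500))
   slow = list(epsilon_schedule(1.0, 0.999, 0.05, 500))
   print(round(fast[-1], 3), round(slow[-1], 3))

After 500 updates the fast schedule has already hit the floor of 0.05,
while the slow one is still choosing random actions about 61% of the
time.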

``learning_rate``: How large the weight updates are at each
training step. A large learning rate learns fast but may overshoot;
a small learning rate is stable but slow.

``gamma``: The discount factor for future rewards. Closer to 1
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.
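
The effect of ``gamma`` is easiest to see on a concrete reward
sequence. The sketch computes the standard discounted return, in which
a reward received ``t`` turns in the future is weighted by
``gamma ** t``; the reward sequence is made up from the Snake values
quoted earlier:

.. code-block:: python

   def discounted_return(rewards, gamma):
       """Sum of rewards, each weighted by gamma raised to its delay."""
       total = 0.0
       for reward in reversed(rewards):
           total = reward + gamma * total
       return total


   # Five moves that each cost 1 point, followed by an apple worth 50.
   rewards = [-1, -1, -1, -1, -1, 50]
   for gamma in (0.99, 0.5):
       print(gamma, round(discounted_return(rewards, gamma), 3))

With ``gamma = 0.99`` the sequence is worth about 42.65, so the detour
is clearly worthwhile. With ``gamma = 0.5`` it is worth -0.375: the
apple is discounted so heavily that the agent sees only the cost of
getting there.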

``hidden_sizes``: The shape of the MLP head as a list of layer
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
more complex Q-functions but are slower to train and may overfit.

``prioritize_experiences``: Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.

.. _incompatible-changes:

Why some changes require starting fresh
----------------------------------------

Not all changes to ``config.toml`` are equal. Some can be applied
immediately to an existing training run; others make the existing
checkpoints unusable.

**Safe to change at any time** (``[training]`` section) — These affect
*how* the agent learns, not *what* it is learning to do. Existing
checkpoints remain valid:

- ``training_episodes``, ``max_turns_per_episode``
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
- ``target_update_freq``, ``train_every``

**Requires starting fresh** — These changes alter the shape of the
game or the shape of the network. The saved model weights are
incompatible with the new configuration:

- ``actions``, ``reward``, ``character_set``, ``board_size``
  (``[metadata]``): These define what the agent perceives, what it
  can do, and what it is rewarded for. Changing ``actions``,
  ``character_set``, or ``board_size`` changes the size of the
  network's input or output layers, so the existing weights no longer
  fit. Changing ``reward`` leaves the layer sizes alone but changes the
  objective the weights were trained for.
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
  (``[preprocessing]``): These control how the observation is
  constructed. Any change here alters the input shape or meaning and
  makes existing weights invalid.
- ``hidden_sizes`` (``[model]``): This defines the network's hidden
  layers. Changing it changes the shape of the network, so the existing
  weights no longer fit.

If you try to resume training after making one of these changes,
``retro-gamer train`` detects the mismatch and stops with a clear
explanation, for example::

   Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.

   The following changes require starting fresh. The existing model was
   trained on a different problem and its weights cannot be reused:

     character_set
       was : ['@', '*', '>', '<', '^', 'v']
       now : ['@', '*', '>', '<', '^', 'v', '#']
       why : the set of board characters (changes input layer size)

   Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
   training log, then run 'retro-gamer train RUN_DIR' to start fresh.

To clear out the old checkpoints and begin again:

.. code-block:: console

   % retro-gamer clean runs/snake/
   Will remove 5 checkpoint(s) and training log from runs/snake/:
     checkpoints/ep_0100.pt
     checkpoints/ep_0200.pt
     ...
     training.log

   Proceed? [y/N]: y
   Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.

The ``config.toml`` is always preserved so you do not need to run
``retro-gamer create`` again.
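
The compatibility check described in this section boils down to
comparing a fixed set of keys between the configuration a checkpoint
was trained with and the current one. A sketch of that comparison on
plain dictionaries follows; ``incompatible_changes`` is a hypothetical
function, and how ``retro-gamer`` stores the saved configuration is not
covered here:

.. code-block:: python

   # Keys whose change invalidates saved weights, grouped by config
   # section (taken from the lists in this section).
   SHAPE_KEYS = {
       "metadata": ["actions", "reward", "character_set", "board_size"],
       "preprocessing": ["spatial", "board", "observe_state",
                         "observe_state_sizes", "egocentric",
                         "egocentric_player", "egocentric_radius"],
       "model": ["hidden_sizes"],
   }


   def incompatible_changes(saved, current):
       """Return {key: (was, now)} for each shape-affecting key that differs."""
       changes = {}
       for section, keys in SHAPE_KEYS.items():
           for key in keys:
               was = saved.get(section, {}).get(key)
               now = current.get(section, {}).get(key)
               if was != now:
                   changes[key] = (was, now)
       return changes

An empty result means the existing checkpoints can still be used;
anything under ``[training]`` is ignored by design.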

Reasoning about training from the log
--------------------------------------

The training log is one of the most useful tools for understanding what
is happening during training. Here are some patterns to look for and
what they mean.

**Reward increasing steadily** is the normal, healthy pattern. Each
checkpoint line should generally show a higher ``avg_reward`` than the
last, although individual intervals can dip. The rate of increase
typically slows as training progresses.

**Reward flat or negative through early episodes** is normal. Early in
training, ``epsilon`` is high and the agent is mostly acting randomly.
It has not yet discovered effective strategies. Patience—and a look at
the ``epsilon`` column—will confirm whether this is just the exploration
phase.

**Loss decreasing** is also healthy. As the Q-network's estimates
improve, the difference between predicted and target Q-values (the TD
error) should shrink. A loss that stabilizes near zero is usually a
good sign.

**Loss growing without bound** indicates the learning rate is too high.
The trainer uses Huber loss, which is robust to large reward scales, but
a learning rate above roughly ``0.001`` can still destabilize training.
Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
and restarting training.

**Short episodes (low avg_steps)** combined with low reward
suggests the agent is dying frequently. Early in training this is
normal. If it persists late in training, the agent may have settled on
a bad policy—consider extending training or adjusting
``epsilon_decay`` to explore longer.

**Reward that improves and then regresses** can indicate that the
agent has discovered a suboptimal but consistent strategy and is stuck.
Increasing ``epsilon_min`` to keep some exploration active, or
adjusting the reward signal to better differentiate good moves from
bad ones, can help.

Questions for investigation
----------------------------

The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full
   character set, one missing a character that frequently appears on the
   board. Compare their performance. What did the second agent lose the
   ability to perceive, and how did that affect its behavior?

2. **Spatial vs. non-spatial.** Train the same game with ``spatial =
   true`` and ``spatial = false``. How does training efficiency differ?
   Can you explain the difference in terms of what each architecture
   can and cannot learn?

3. **Reward shaping.** If the game currently rewards only the final
   objective (e.g., reaching a goal), add intermediate rewards for
   sub-goals. How does this change the early training curve? Does it
   change the agent's final strategy?

4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
   (so the agent commits to exploiting early) and a very slow one (so
   exploration continues for a long time). How do the training curves
   differ? What is the agent doing in each case when ``epsilon`` is low?

5. **Checkpoint comparison.** Load the agent at episode 100 and at
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?