Updates across the board

2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions
--- a/docs/walkthrough.rst
+++ b/docs/walkthrough.rst
@@ -21,9 +21,9 @@ You will need:
 Preparing your game
 -------------------

-``retro-gamer`` loads your game by importing a Python module and
-calling a function named ``create_game``. The ``create_game`` function
-must take no arguments and return a new ``Game`` instance.
+``retro-gamer`` loads your game by calling a function named
+``create_game``. The function must take no arguments and return a new
+``Game`` instance.

 Here is the ``create_game`` function for Snake:

@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
   def create_game():
       head = SnakeHead()
       apple = Apple()
-       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
+       game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
       apple.relocate(game)
       return game

-If your game module does not already have a ``create_game`` function,
-add one following this pattern.
+If your game file does not already have a ``create_game`` function, add
+one following this pattern.
+
+When you run ``retro-gamer create``, you can point to your game file
+directly by path or by Python module name:
+
+.. code-block:: console
+
+   % retro-gamer create --game my_game.py --output runs/my_game/
+   % retro-gamer create --game retro.examples.snake --output runs/snake/


 Describing your game
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
-   spatial = true
-   observe_state = []

 Let's go through each field.

@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).

 The key in the game's state dictionary to use as the reward signal.
 ``retro-gamer`` computes the reward for each turn as the *change* in
-this value from one turn to the next. For Snake, score increases by 1
-(or more) each time the apple is eaten, so the agent receives a reward
-of 1 when it eats an apple and 0 otherwise.
+this value from one turn to the next. For Snake, the score changes when
+the snake eats an apple (+50), when it moves away from the apple (−1),
+and when it dies (−10). The agent tries to maximize the (discounted)
+sum of these incremental changes over an episode.

 Choosing an appropriate reward is one of the most consequential
 decisions in RL. Some considerations:
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
 The number of exploration turns is controlled by the
 ``exploration_turns`` hyperparameter.

-``spatial``
-~~~~~~~~~~~
+``spatial`` and other preprocessing options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Whether to treat the board as a spatial scene (default: ``true``). A
-spatial game uses a *convolutional neural network* (CNN) that can
-detect patterns in the relative arrangement of characters. A
-non-spatial game uses a simpler *multilayer perceptron* (MLP) that
-ignores positional relationships. Set to ``false`` for games where
-position is irrelevant.
+The ``[tool.retro-gamer]`` section describes the game. Preprocessing
+options—such as ``spatial`` (whether to use a CNN or MLP, default:
+``false``), ``egocentric``, and ``observe_state``—live in the
+``[preprocessing]`` section of the generated ``config.toml``. You can
+edit them there after running ``retro-gamer create``.
+
+``observe_state``
+~~~~~~~~~~~~~~~~~
+
+By default the agent only sees the board. You can also give it access
+to computed values from ``game.state`` by listing the relevant keys in
+the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
+For example, Snake exposes the normalized direction to the apple:
+
+.. code-block:: toml
+
+   [preprocessing]
+   observe_state = ["apple_dx", "apple_dy"]
+
+The trainer appends these values to the observation vector after the
+board encoding (or uses them as the entire observation when
+``board = false``).
+
+These values must be set in ``game.state`` at the start of every
+episode—typically inside ``create_game()``—and must keep the same
+type and length from episode to episode.
+
+.. warning::
+
+   Always initialize every key listed in ``observe_state`` before the
+   game starts. If a key is missing or its length changes between
+   episodes, training stops immediately with a clear error explaining
+   what changed. The neural network's input size is fixed when training
+   begins; it cannot adapt to a changing observation shape mid-run.
+
+This is a good place to ask: *can a human player see this information?*
+The apple's location is visible on screen; the normalized direction vector
+is not. Whether that asymmetry is appropriate is a design choice worth
+examining.

 Once you have written this section, create the training run directory:

@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
     reward      : score
     characters  : ['@', '*', '>', '<', '^', 'v']
-     architecture: CNN (spatial)
+     architecture: MLP

 ``retro-gamer create`` reads your game metadata directly from
 ``pyproject.toml`` and writes it—along with all hyperparameters—to
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
 .. code-block:: console

   % retro-gamer train runs/snake/
-   Training for 1000 episodes…
-   Done. Checkpoints in runs/snake/checkpoints/
+   100%|████████████████████| 1000/1000 [12:34<00:00,  1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
+   Done. Checkpoints saved in runs/snake/checkpoints/

-Training saves checkpoints every 100 episodes and a ``final.pt``
-checkpoint when complete. You can follow progress in the training log:
+A progress bar shows how far training has gone, along with the most
+recent episode's reward, the current exploration rate (``eps``), and
+the average prediction error (``loss``).
+
+Training saves a checkpoint every 100 episodes to
+``runs/snake/checkpoints/``. You can stop training at any time with
+Ctrl-C and resume it later—the next ``retro-gamer train`` command will
+automatically pick up from the latest checkpoint.
+
+Reading the training log
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For a longer view of how training is progressing, inspect the training
+log:

 .. code-block:: console

-   % tail -f runs/snake/training.log
+   % cat runs/snake/training.log

-The log shows one line per episode:
+The log begins with the full network architecture, followed by one line
+per checkpoint (every 100 episodes):

 .. code-block:: text

-   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
-   [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
-   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204
+   [ep_0100]  ep=0001-0100  avg_reward=-31.4  avg_steps=47   epsilon=0.938  avg_loss=7.2  time=0m12s  total=0m12s
+   [ep_0200]  ep=0101-0200  avg_reward=-18.6  avg_steps=89   epsilon=0.879  avg_loss=6.8  time=0m14s  total=0m26s
+   [ep_0300]  ep=0201-0300  avg_reward= -4.1  avg_steps=134  epsilon=0.824  avg_loss=6.1  time=0m15s  total=0m41s
+   [ep_0500]  ep=0401-0500  avg_reward= +8.7  avg_steps=210  epsilon=0.724  avg_loss=5.4  time=0m16s  total=1m12s
+   [ep_1000]  ep=0901-1000  avg_reward=+22.3  avg_steps=389  epsilon=0.557  avg_loss=4.9  time=0m18s  total=2m30s

- **total_reward**: the total score earned during the episode (how many
-  apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
-  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
-  steps in this episode. A decreasing loss generally indicates that the
-  Q-value estimates are converging.
+Here is what each field means:

-Resuming training
-~~~~~~~~~~~~~~~~~
+- **avg_reward**: Average total reward per episode over the past 100 episodes.
+  Positive values mean the agent is accumulating reward; negative values mean
+  it is accumulating penalties. An upward trend over time is the main signal
+  that learning is working.
+- **avg_steps**: Average number of turns per episode. If episodes are ending
+  quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
+  generally indicate the agent is surviving longer.
+- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
+  and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
+  behavior is expected.
+- **avg_loss**: Average Huber loss across training steps. Huber loss is
+  quadratic for small prediction errors and linear for large ones, which keeps
+  it stable even when rewards have a wide range (such as a large bonus for
+  reaching a goal). Values in the range 0–10 are typical for most games.
+  A slow downward trend is the healthy pattern. A loss that grows without bound
+  usually indicates the learning rate is too high.
+- **time**: Wall-clock time for this 100-episode interval.
+- **total**: Cumulative training time across all sessions.

-Training can be resumed from a checkpoint:
+When training is resumed after a stop, a header line marks the break::
+
+   === Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
+
+This lets you track exactly when each session took place.
+
+Stopping training to watch the agent play
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You do not need to wait for training to finish before watching the
+agent. Training can be stopped at any time with Ctrl-C, and the latest
+checkpoint is always available immediately:

 .. code-block:: console

-   % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
+   % retro-gamer play runs/snake/

-Watching a trained agent play
------------------------------
+This loads the most recent checkpoint and runs the agent in your
+terminal. Press Enter or Escape to quit.

-To watch a trained agent play the game in your terminal:
+.. note::

-.. code-block:: console
+   The game is rendered directly in your terminal. If the window is
+   smaller than the board plus borders, ``retro-gamer play`` will raise
+   a ``TerminalTooSmall`` error — enlarge the terminal window and try
+   again.

-   % retro-gamer play runs/snake/ --checkpoint final
-
-You can substitute any checkpoint name:
+To watch an earlier stage of training, use ``--checkpoint``:

 .. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint ep_0100

-Press Enter or Escape to quit.
+Comparing the agent at episode 100 with the agent at episode 500 can
+reveal what the agent has (and has not) learned. For
+Snake, you might notice the episode-100 agent moving somewhat randomly,
+while the episode-500 agent consistently navigates toward the apple.
+Articulating *why* the later agent behaves differently—what the training
+process produced—connects observation directly to the concepts underlying
+DQN.

-Comparing agents trained at different checkpoints is a useful activity:
-the agent at episode 100 has learned *something*, but typically much
-less than the agent at episode 500. Articulating *what* the earlier
-agent has and has not learned, and *why*, is productive reasoning about
-the training process.
+Resuming training after watching
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After watching the agent play, resume training with exactly the same
+command you used before:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/
+
+``retro-gamer`` automatically detects and resumes from the latest
+checkpoint. No extra flags are needed. If all configured episodes have
+already been completed, it prints a message and exits:
+
+.. code-block:: console
+
+   Training already complete (1000 episodes). To keep training,
+   increase training_episodes in config.toml.
+
+To continue training, open ``runs/snake/config.toml``, increase the
+``training_episodes`` value, and run ``retro-gamer train`` again.
+
+Watching a trained agent play
+------------------------------
+
+Once training is complete, watch the final agent:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/
+
+By default the latest checkpoint is loaded. You can also compare the
+agent's performance at different stages of training:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint ep_0100
+   % retro-gamer play runs/snake/ --checkpoint ep_0500
+
+Press Enter or Escape to quit.

 Inspecting a run
 ----------------
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
 .. code-block:: console

   % retro-gamer info runs/snake/
-   Game module : retro.examples.snake
-   Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
-   Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
+   Game module    : retro.examples.snake
+   Metadata       : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
+   Preprocessing  : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
+   Model          : {'hidden_sizes': [128, 64]}
+   Training       : {'learning_rate': 0.0001, 'gamma': 0.99, ...}

-   Last 5 episodes:
-     [EP 0996] total_reward=9.0   steps=1203  epsilon=0.0074  avg_loss=0.000312
-     [EP 0997] total_reward=11.0  steps=1051  epsilon=0.0074  avg_loss=0.000289
-     [EP 0998] total_reward=14.0  steps=987   epsilon=0.0074  avg_loss=0.000274
-     [EP 0999] total_reward=8.0   steps=1142  epsilon=0.0074  avg_loss=0.000261
-     [EP 1000] total_reward=12.0  steps=1089  epsilon=0.0074  avg_loss=0.000248
+   Last 5 checkpoints:
+     [ep_0600]  ep=0501-0600  avg_reward=+12.1 ...
+     [ep_0700]  ep=0601-0700  avg_reward=+14.8 ...
+     [ep_0800]  ep=0701-0800  avg_reward=+16.3 ...
+     [ep_0900]  ep=0801-0900  avg_reward=+19.0 ...
+     [ep_1000]  ep=0901-1000  avg_reward=+22.3 ...

-   Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
+   Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']

 Adjusting hyperparameters
 --------------------------
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
 create``. Common adjustments and their effects:

 **``training_episodes``** — How long to train. More episodes give the
-agent more time to learn, but also take longer to run.
+agent more time to learn, but also take longer to run. This is always
+safe to increase.

 **``epsilon_decay``** — How quickly exploration decreases. A faster
 decay (smaller ``epsilon_decay``) means the agent commits to its early
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
 means the agent values long-term consequences; closer to 0 makes the
 agent focus on immediate reward.

-**``n_layers`` and ``layer_size``** — The depth and width of the MLP
-head. Larger networks can represent more complex Q-functions but are
-slower to train and may overfit.
+**``hidden_sizes``** — The shape of the MLP head as a list of layer
+sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
+more complex Q-functions but are slower to train and may overfit.

 **``prioritize_experiences``** — Whether to use prioritized experience
 replay. This often improves sample efficiency but is slightly slower
 per step.

+.. _incompatible-changes:
+
+Why some changes require starting fresh
+----------------------------------------
+
+Not all changes to ``config.toml`` are equal. Some can be applied
+immediately to an existing training run; others make the existing
+checkpoints unusable.
+
+**Safe to change at any time** (``[training]`` section) — These affect
+*how* the agent learns, not *what* it is learning to do. Existing
+checkpoints remain valid:
+
+- ``training_episodes``, ``max_turns_per_episode``
+- ``learning_rate``, ``learning_rate_decay``, ``gamma``
+- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
+- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
+- ``target_update_freq``, ``train_every``
+
+**Requires starting fresh** — These changes alter the shape of the
+game or the shape of the network. The saved model weights are
+incompatible with the new configuration:
+
+- ``actions``, ``reward``, ``character_set``, ``board_size``
+  (``[metadata]``) — These define what the agent perceives and what it
+  can do. Changing them changes the size of the network's input or
+  output layers; the existing weights no longer fit.
+- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
+  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
+  (``[preprocessing]``) — These control how the observation is
+  constructed. Any change here alters the input shape or meaning and
+  makes existing weights invalid.
+- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
+  layers. Changing it changes the shape of the network; the existing
+  weights no longer fit.
+
+If you try to resume training after making one of these changes,
+``retro-gamer train`` detects the mismatch and stops with a clear
+explanation, for example::
+
+   Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
+
+   The following changes require starting fresh. The existing model was
+   trained on a different problem and its weights cannot be reused:
+
+     character_set
+       was : ['@', '*', '>', '<', '^', 'v']
+       now : ['@', '*', '>', '<', '^', 'v', '#']
+       why : the set of board characters (changes input layer size)
+
+   Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
+   training log, then run 'retro-gamer train RUN_DIR' to start fresh.
+
+To clear out the old checkpoints and begin again:
+
+.. code-block:: console
+
+   % retro-gamer clean runs/snake/
+   Will remove 5 checkpoint(s) and training log from runs/snake/:
+     checkpoints/ep_0100.pt
+     checkpoints/ep_0200.pt
+     ...
+     training.log
+
+   Proceed? [y/N]: y
+   Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
+
+The ``config.toml`` is always preserved so you do not need to run
+``retro-gamer create`` again.
+
+Reasoning about training from the log
+--------------------------------------
+
+The training log is one of the most useful tools for understanding what
+is happening during training. Here are some patterns to look for and
+what they mean.
+
+**Reward increasing steadily** is the normal, healthy pattern. Each
+checkpoint line should generally show a higher ``avg_reward`` than the last.
+The rate of increase typically slows as training progresses.
+
+**Reward flat or negative through early episodes** is normal. Early in
+training, ``epsilon`` is high and the agent is mostly acting randomly.
+It has not yet discovered effective strategies. Patience—and a look at
+the ``epsilon`` column—will confirm whether this is just the exploration
+phase.
+
+**Loss decreasing** is also healthy. As the Q-network's estimates
+improve, the difference between predicted and target Q-values (the TD
+error) should shrink. A loss that stabilizes at a low level is usually a
+good sign.
+
+**Loss growing without bound** usually indicates the learning rate is too high.
+The trainer uses Huber loss, which is robust to large reward scales, but
+a learning rate above roughly ``0.001`` can still destabilize training.
+Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
+and restarting training.
+
+**Short episodes (low ``avg_steps``)** combined with low reward
+suggest the agent is dying frequently. Early in training this is
+normal. If it persists late in training, the agent may have settled on
+a bad policy—consider extending training or adjusting
+``epsilon_decay`` to explore longer.
+
+**Reward that improves and then regresses** can indicate that the
+agent has discovered a suboptimal but consistent strategy and is stuck.
+Increasing ``epsilon_min`` to keep some exploration active, or
+adjusting the reward signal to better differentiate good moves from
+bad ones, can help.
+
 Questions for investigation
 ----------------------------

@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?
+