Updates across the board

2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions
--- a/docs/reference.rst
+++ b/docs/reference.rst
@@ -17,8 +17,6 @@ A complete example for the Snake game:
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
-   spatial = true
-   observe_state = []

 You do not need to specify the board size: ``retro-gamer`` reads it
 directly from your game's ``board_size`` attribute.
@@ -65,54 +63,156 @@ If omitted, ``retro-gamer`` runs an exploration phase to discover the
 characters that appear in practice. The length of this phase is
 controlled by the ``exploration_turns`` hyperparameter.

-``spatial``
-~~~~~~~~~~~
+Preprocessing options
+---------------------

-**Optional; default ``true``.** Whether to treat the board as a 2D
-spatial scene. When ``true``, the trainer uses a convolutional neural
-network (CNN) that can detect patterns in the relative positions of
-characters. When ``false``, the trainer uses a multilayer perceptron
-(MLP) that sees the board as a flat list of numbers without positional
-structure.
+Preprocessing options live in the ``[preprocessing]`` section of a run's
+``config.toml``. They control how the game's board and state are
+transformed into the observation vector that the neural network sees.
+``retro-gamer create`` writes sensible defaults; you can edit them by
+hand before running ``retro-gamer train``.
+
+.. note::
+
+   Changes to any ``[preprocessing]`` option—or to the game description
+   fields above—make existing checkpoints incompatible. Run
+   ``retro-gamer clean`` before retraining after such changes.
+
+``spatial`` (default: ``false``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to treat the board as a 2D spatial scene. When ``true``, the
+trainer uses a convolutional neural network (CNN); when ``false``, a
+multilayer perceptron (MLP) that sees the board as a flat list of
+numbers.
+
+``board`` (default: ``true``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to include the board encoding in the observation vector. Set
+to ``false`` to train on game state variables only, with no board at
+all. This is useful for games with small, enumerable state spaces where
+a lookup table (classic Q-learning) is sufficient.
+
+When ``board = false``:
+
+- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
+- At least one key must be listed in ``observe_state``.
+- ``character_set`` is not required and character discovery is skipped.

 .. code-block:: toml

-   spatial = true
+   [preprocessing]
+   board = false
+   observe_state = ["board_state"]

-``observe_state``
-~~~~~~~~~~~~~~~~~
+``observe_state`` (default: ``[]``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-**Optional; default ``[]``.** A list of keys from the game's state
-dictionary to append to the observation vector. The values must be
-numbers (integers, floats, or booleans). The reward key must not
-appear in this list.
+A list of keys from ``game.state`` to include in the observation
+vector, appended after the board encoding (or as the entire
+observation when ``board = false``). Scalar values contribute one
+element each; list or tuple values are flattened.

 .. code-block:: toml

-   observe_state = ["lives", "level"]
+   observe_state = ["apple_dx", "apple_dy"]
+
+The keys must be present in ``game.state`` at every step, initialized
+in ``create_game()`` before the game starts. All values that are lists
+or tuples must always have the same length from episode to episode.
+
+.. warning::
+
+   ``observe_state`` keys must be initialized to their final shape in
+   ``create_game()`` before the game starts. If a key is absent or its
+   list length changes between episodes, training will crash with an
+   error explaining which key changed and by how much. This happens
+   because the neural network's input layer has a fixed size determined
+   at the start of training; it cannot adapt to a changing observation
+   shape mid-run.
+
+   Always initialize every observed key with a placeholder of the
+   correct type and length before the first ``game.step()`` call.
+
+``observe_state_sizes`` (auto-discovered)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A table mapping each ``observe_state`` key to its flat size (``1`` for
+scalars, ``N`` for sequences of length N). This is written automatically
+to ``config.toml`` the first time ``retro-gamer train`` runs, after the
+trainer samples ``game.state`` to discover the actual sizes:
+
+.. code-block:: toml
+
+   observe_state_sizes = {board_state = 9}
+
+You do not need to set this manually. Once written, it is used to
+detect changes in state shape when resuming training—an incompatible
+change here requires running ``retro-gamer clean`` and starting fresh.
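
The flattening and size rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names ``flat_size`` and ``flatten_state`` are hypothetical and not part of ``retro-gamer``, nested sequences are not handled, and the trainer's real checks may differ.

```python
from numbers import Number


def flat_size(value):
    """Return 1 for a scalar, or len(value) for a list or tuple."""
    if isinstance(value, (list, tuple)):
        return len(value)
    if isinstance(value, Number):  # bool counts as a number here
        return 1
    raise TypeError(f"unsupported state value: {value!r}")


def flatten_state(state, keys, expected_sizes=None):
    """Flatten the observed keys of `state` into one list of floats.

    Hypothetical helper: raises if a key is missing or if its flat size
    differs from the size recorded in `expected_sizes`.
    """
    vector = []
    for key in keys:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} missing from game.state")
        value = state[key]
        size = flat_size(value)
        if expected_sizes is not None and size != expected_sizes[key]:
            raise ValueError(
                f"{key!r} changed size: expected {expected_sizes[key]}, got {size}"
            )
        items = value if isinstance(value, (list, tuple)) else [value]
        vector.extend(float(x) for x in items)
    return vector


# Made-up state dictionary: one 9-element sequence and two scalars.
state = {"board_state": [0] * 9, "apple_dx": 3, "apple_dy": -1}
keys = ["board_state", "apple_dx", "apple_dy"]

sizes = {key: flat_size(state[key]) for key in keys}
vector = flatten_state(state, keys, sizes)
```

Here ``sizes`` plays the role of ``observe_state_sizes``: it is computed once from a sample state and then used to reject any later state whose shape differs.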
+
+``egocentric`` (default: ``false``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``true``, the board observation is cropped to a square window
+centered on a specific agent instead of covering the full board. This gives the
+agent a local, first-person-like view and makes the observation
+invariant to the agent's absolute position on the board.
+
+Requires both ``egocentric_player`` and ``egocentric_radius`` to be set.
+
+``egocentric_player``
+~~~~~~~~~~~~~~~~~~~~~~
+
+The name of the agent to use as the center of the egocentric crop.
+Must match the ``name`` attribute of one of the game's agents.
+
+.. code-block:: toml
+
+   egocentric_player = "Snake head"
+
+``egocentric_radius``
+~~~~~~~~~~~~~~~~~~~~~~
+
+The half-side-length of the egocentric crop window, in cells. The
+resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
+values give the agent a wider view; smaller values focus it on the
+immediate vicinity.
+
+.. code-block:: toml
+
+   egocentric_radius = 8   # 17×17 window
+
+When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
+automatically updated to ``[2r+1, 2r+1]`` so the network is sized
+correctly.
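
As a rough picture of the egocentric crop, the sketch below cuts a ``(2r+1) × (2r+1)`` window out of a character grid. The function name ``egocentric_crop`` is hypothetical, and filling out-of-bounds cells with a blank character is an assumption; the text above does not say how ``retro-gamer`` handles windows that extend past the board edge.

```python
def egocentric_crop(board, center, radius, fill=" "):
    """Return a (2r+1) x (2r+1) window of `board` centered on `center`.

    `board` is a list of equal-length rows and `center` is (row, col).
    Cells outside the board are replaced by `fill` (an assumption).
    """
    rows, cols = len(board), len(board[0])
    center_row, center_col = center
    window = []
    for r in range(center_row - radius, center_row + radius + 1):
        line = []
        for c in range(center_col - radius, center_col + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(board[r][c] if inside else fill)
        window.append(line)
    return window


# Made-up 3x4 board with the agent "@" at row 1, column 1.
board = [list(row) for row in ["....", ".@*.", "...."]]
window = egocentric_crop(board, center=(1, 1), radius=1)
corner = egocentric_crop(board, center=(0, 0), radius=1)
```

Because the window always has the same side length, the observation size depends only on the radius, not on where the agent stands or how large the board is.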

 .. _hyperparameters:

 Hyperparameters
 ---------------

-Hyperparameters are stored in the ``[hyperparameters]`` section of
-``config.toml``. They can be set via ``retro-gamer create`` options or
-edited directly.
+Hyperparameters are split across two sections of ``config.toml``:
+
+- ``[model]`` — network architecture (changing these requires starting fresh)
+- ``[training]`` — learning algorithm parameters (safe to change at any time)
+
+Both sections can be set via ``retro-gamer create`` options or edited directly.

 Learning and optimization
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-``learning_rate`` (default: ``0.001``)
+``learning_rate`` (default: ``0.0001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values converge faster but may be unstable; smaller
    values are more stable but slower.

-``lr_decay`` (default: ``0.995``)
+``learning_rate_decay`` (default: ``0.9999``)
    Multiplicative decay applied to the learning rate after each
    episode. The learning rate decreases geometrically over training,
    helping the network fine-tune later without destabilizing early
-    progress.
+    progress. With the default value, the learning rate decays to about
+    13.5% of its starting value after 20 000 episodes.

 ``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
@@ -127,7 +227,7 @@ Exploration
    random action with probability ``epsilon`` and exploits its current
    Q-function with probability ``1 - epsilon``.

-``epsilon_decay`` (default: ``0.995``)
+``epsilon_decay`` (default: ``0.9997``)
    Multiplicative decay applied to ``epsilon`` after each episode.

 ``epsilon_min`` (default: ``0.05``)
@@ -142,31 +242,33 @@ Memory and sampling
    The number of experiences sampled from the replay buffer per
    training step.

-``memory_capacity`` (default: ``10000``)
+``memory_capacity`` (default: ``50000``)
    The maximum number of experiences the replay buffer can hold. When
    full, the oldest experiences are discarded.

-``prioritize_experiences`` (default: ``false``)
+``prioritize_experiences`` (default: ``true``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger TD errors are sampled more frequently.
    This often improves sample efficiency at a modest computational
    cost.

-Network architecture
-~~~~~~~~~~~~~~~~~~~~
+Model architecture
+~~~~~~~~~~~~~~~~~~

-``n_layers`` (default: ``2``)
-    The number of hidden layers in the MLP head (for spatial games,
-    this follows the CNN; for non-spatial games, it is the full
-    network).
+These live in the ``[model]`` section. Changing them requires starting fresh
+(run ``retro-gamer clean`` before retraining).

-``layer_size`` (default: ``128``)
-    The width (number of units) in each hidden layer.
+``hidden_sizes`` (default: ``[128, 64]``)
+    A list of integers giving the size of each hidden layer in the MLP
+    head. The default creates two layers: 128 units then 64. For spatial
+    games this follows the CNN; for non-spatial games it is the full
+    network. Larger or deeper networks can represent more complex
+    Q-functions but train more slowly and may need more episodes.

 Training duration
 ~~~~~~~~~~~~~~~~~

-``training_episodes`` (default: ``1000``)
+``training_episodes`` (default: ``20000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.
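
The per-episode decay factors interact with ``training_episodes``, so it can help to check the arithmetic before a long run. The sketch below evaluates the geometric schedules for the defaults listed in this reference. The helper names are hypothetical, and the starting ``epsilon`` of 1.0 is an assumption, since its default is not shown here.

```python
import math


def decayed(initial, decay, episodes):
    """Value after applying a multiplicative decay once per episode."""
    return initial * decay ** episodes


def episodes_to_reach(initial, decay, floor):
    """Smallest n such that initial * decay**n <= floor."""
    return math.ceil(math.log(floor / initial) / math.log(decay))


# Fraction of the initial learning rate left after 20 000 episodes
# with learning_rate_decay = 0.9999 (between 13% and 14%).
lr_fraction = decayed(1.0, 0.9999, 20_000)

# Episodes until epsilon reaches epsilon_min = 0.05 with
# epsilon_decay = 0.9997, assuming epsilon starts at 1.0.
eps_episodes = episodes_to_reach(1.0, 0.9997, 0.05)

print(f"learning rate fraction: {lr_fraction:.4f}")
print(f"episodes until epsilon_min: {eps_episodes}")
```

Under these assumptions, exploration bottoms out after roughly half of the default 20 000 episodes, which leaves the second half of training mostly exploiting the learned Q-function.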

@@ -175,12 +277,18 @@ Training duration
    indefinitely (for example, if the agent finds a way to avoid
    dying).

-``target_update_freq`` (default: ``100``)
+``target_update_freq`` (default: ``500``)
    How many training steps between updates of the target network.
    More frequent updates make training targets move faster (less
    stable); less frequent updates make them more stable but slower
    to reflect new learning.

+``train_every`` (default: ``4``)
+    Run one training step every N game steps. Higher values speed up
+    episode collection at the cost of fewer gradient updates per
+    experience. The default of 4 is a good balance for most games;
+    set to 1 to train on every step.
+
 Character discovery
 ~~~~~~~~~~~~~~~~~~~

@@ -207,23 +315,26 @@ game's ``pyproject.toml``; you do not pass it on the command line.

 .. code-block:: console

-   % retro-gamer create --game MODULE --output DIR [OPTIONS]
+   % retro-gamer create --game GAME --output DIR [OPTIONS]

 **Required options:**

- ``--game MODULE`` — Python module containing ``create_game()``
-  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
-  is read from the ``pyproject.toml`` found in or above the module's
-  source directory.
+- ``--game GAME`` — Your game, specified as a file path or a Python
+  module name:
+
+  - File path: ``--game my_game.py`` or ``--game my_game/``
+  - Module name: ``--game retro.examples.snake``
+
+  The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
+  found in the game file's directory or one of its parent directories.
 - ``--output DIR`` — Directory to create for this training run.

 **Hyperparameter options** (all optional; see :ref:`hyperparameters`):

 - ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
+- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
 - ``--learning-rate F``
- ``--lr-decay F``
+- ``--learning-rate-decay F``
 - ``--gamma F``
 - ``--epsilon-decay F``
 - ``--epsilon-min F``
@@ -232,20 +343,40 @@ game's ``pyproject.toml``; you do not pass it on the command line.
 - ``--target-update-freq N``
 - ``--max-turns-per-episode N``
 - ``--exploration-turns N``
+- ``--train-every N``
 - ``--prioritize-experiences`` / ``--no-prioritize-experiences``

 ``retro-gamer train``
 ~~~~~~~~~~~~~~~~~~~~~

-Train (or resume training) a DQN agent.
+Train a DQN agent.

 .. code-block:: console

-   % retro-gamer train RUN_DIR [--resume CHECKPOINT]
+   % retro-gamer train RUN_DIR

 ``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
-create``. If ``--resume`` is given, training resumes from the specified
-checkpoint file (relative or absolute path).
+create``. If checkpoints already exist in ``RUN_DIR``, training
+automatically resumes from the latest one, so earlier progress is kept.
+
+If all configured episodes have already been completed, the command
+prints a message and exits immediately. To keep training, increase
+``training_episodes`` in ``config.toml`` and run again.
+
+**Incompatible changes.** Some config changes make existing checkpoints
+unusable. If you change any of the following, ``retro-gamer train`` will
+detect the mismatch and refuse to resume, with a clear explanation:
+
+- ``actions``, ``reward``, ``character_set``, ``board_size``
+  (``[metadata]``) — game description
+- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
+  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
+  (``[preprocessing]``) — observation encoding
+- ``hidden_sizes`` (``[model]``) — network architecture
+
+Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
+fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
+safe and take effect immediately on the next training run.

 ``retro-gamer play``
 ~~~~~~~~~~~~~~~~~~~~
@@ -256,16 +387,32 @@ Watch a trained agent play the game in the terminal.

   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

-``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
-name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
+By default, the latest available checkpoint is loaded. Use
+``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
 ``--framerate`` sets the target frames per second (default: 12). Press
 Enter or Escape to quit.

+``retro-gamer clean``
+~~~~~~~~~~~~~~~~~~~~~
+
+Remove all checkpoints and the training log from a run directory.
+
+.. code-block:: console
+
+   % retro-gamer clean RUN_DIR
+
+The command prompts for confirmation before deleting anything; pass
+``--yes`` / ``-y`` to skip the prompt. The ``config.toml`` is preserved, so
+you can run ``retro-gamer train`` immediately to start fresh with the same
+settings.
+
+Use this after making an incompatible change (see ``retro-gamer train``
+above) or any time you want to restart training from scratch.
+
 ``retro-gamer info``
 ~~~~~~~~~~~~~~~~~~~~~

 Print a summary of a training run: metadata, hyperparameters, recent
-episode log, and available checkpoints.
+checkpoint log, and available checkpoints.

 .. code-block:: console

@@ -285,60 +432,49 @@ contents:
   └── checkpoints/
       ├── ep_0100.pt    # model weights at episode 100
       ├── ep_0200.pt
-       ├── ...
-       └── final.pt      # model weights at training completion
+       └── ...           # one file saved every 100 episodes

 ``config.toml`` is written by ``retro-gamer create`` and updated (with
 the discovered character set and resolved hyperparameters) when
-``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
-and ``train`` is the recommended way to adjust hyperparameters.
+``retro-gamer train`` begins. It has five sections: ``[game]``,
+``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
+Editing ``config.toml`` between ``create`` and ``train`` is the
+recommended way to adjust hyperparameters.

-``training.log`` begins with the full architecture description
-generated at training startup, followed by one line per episode in the
-format::
+``training.log`` begins with the full network architecture description,
+then one line per checkpoint (every 100 episodes) in the format::

-   [EP NNNN] total_reward=F  steps=N  epsilon=F  avg_loss=F
+   [ep_NNNN]  ep=SSSS-NNNN  avg_reward=F  avg_steps=N  epsilon=F  avg_loss=F  time=Xm Xs  total=Xm Xs

-Checkpoint files are PyTorch state dictionaries containing model
-weights, optimizer state, the current epsilon, and the total number of
-training steps completed. They can be loaded with
-``retro-gamer play`` or directly with the Python API.
+The fields summarize the episodes since the previous checkpoint:
+
+- ``ep=SSSS-NNNN`` — episode range covered by this entry
+- ``avg_reward`` — mean total reward per episode (positive = good)
+- ``avg_steps`` — mean episode length in game turns
+- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
+- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
+  stabilizes). Huber loss equals ½·(q−t)² when |q−t| ≤ 1 and |q−t|−½ otherwise, so
+  it grows only linearly for large errors and is less sensitive to outliers than
+  squared error. Values in the range 0–10 are typical; a slow downward trend over
+  thousands of episodes is the healthy pattern. A loss that keeps growing often
+  indicates a learning rate that is too high.
+- ``time`` — wall-clock time for this checkpoint interval
+- ``total`` — cumulative training time across all sessions
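
The piecewise Huber formula quoted for ``avg_loss`` can be written out directly. This is a minimal sketch for a single prediction ``q`` and target ``t``, assuming a threshold of 1 (the value that matches the formula above). The function name ``huber`` is illustrative, and the trainer presumably relies on its framework's own loss rather than code like this.

```python
def huber(q, t, delta=1.0):
    """Huber loss for one prediction q and target t.

    Quadratic for errors up to `delta`, linear beyond it.
    delta = 1.0 is assumed because it reproduces the documented formula.
    """
    error = abs(q - t)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


small = huber(0.5, 0.0)   # quadratic branch: 0.5 * 0.5**2
large = huber(3.0, 0.0)   # linear branch: 3.0 - 0.5
squared = 0.5 * 3.0 ** 2  # plain squared error for comparison
```

For an error of 3 the Huber value is 2.5, while half the squared error is 4.5, which is why a single badly mispredicted Q-value moves the average less than it would under squared error.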
+
+When training is resumed, a ``=== Resumed from ... ===`` line is appended
+so the log records the full history of a run across multiple sessions.
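
For plotting or quick inspection, the checkpoint lines can be split into fields with a small parser. The sketch below is based only on the line format shown above; ``parse_log_line`` is a hypothetical helper, not part of ``retro-gamer``, and the sample line uses made-up numbers.

```python
import re

# A field is key=value; a value may contain single spaces ("1m 5s"),
# so it runs until the next " key=" or the end of the line.
FIELD = re.compile(r"(\w+)=(.+?)(?=\s+\w+=|$)")


def parse_log_line(line):
    """Return (tag, fields) for a checkpoint line, or None for other lines."""
    match = re.match(r"\[(\w+)\]\s+(.*)", line.strip())
    if match is None:
        return None  # architecture description, resume marker, etc.
    tag, rest = match.groups()
    fields = {key: value.strip() for key, value in FIELD.findall(rest)}
    return tag, fields


# Hypothetical line in the documented format; the values are invented.
sample = (
    "[ep_0200]  ep=0101-0200  avg_reward=1.25  avg_steps=87  "
    "epsilon=0.9418  avg_loss=0.42  time=1m 5s  total=2m 11s"
)
tag, fields = parse_log_line(sample)
avg_loss = float(fields["avg_loss"])
```

Values are kept as strings so that durations such as ``1m 5s`` survive intact; numeric fields can be converted with ``float`` or ``int`` as needed.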

 Python API
 ----------

 For advanced use, ``retro-gamer``'s components are importable as a
-library.
+library. See the :doc:`api` reference for full details.

 .. code-block:: python

-   from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
+   from retro_gamer import GameMetadata, DQNTrainer
   from retro.examples.snake import create_game

-   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
   metadata = GameMetadata.from_pyproject("retro.examples.snake")
-
-   trainer = DQNTrainer(
-       create_game, metadata, "runs/snake/",
-       training_episodes=500,
-       n_layers=2,
-       layer_size=128,
-   )
+   trainer = DQNTrainer(create_game, metadata, "runs/snake/")
   trainer.train()
-
-``GameEnvironment`` provides a gym-style interface for stepping through
-a game programmatically:
-
-.. code-block:: python
-
-   from retro_gamer import GameEnvironment
-
-   env = GameEnvironment(create_game, metadata)
-   obs = env.reset()             # returns initial observation vector
-   obs, reward, done = env.step("KEY_RIGHT")
-
-The observation is a flat NumPy array of dtype ``float32``. For spatial
-games, the first ``C × H × W`` elements are the board (channel-first
-one-hot encoding); for non-spatial games, the board is encoded
-``H × W × C`` and then flattened. Any ``observe_state`` values are
-appended at the end.