Updates across the board

2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -0,0 +1,30 @@
+API Reference
+=============
+
+All classes below are importable directly from ``retro_gamer``.
+
+Game description
+----------------
+
+.. autoclass:: retro_gamer.GameMetadata
+   :members: from_pyproject, from_dict, validate
+
+Training
+--------
+
+.. autoclass:: retro_gamer.DQNTrainer
+   :members: train, load_checkpoint
+
+Environment
+-----------
+
+.. autoclass:: retro_gamer.GameEnvironment
+   :members: reset, step
+
+Using a trained model
+---------------------
+
+.. autoclass:: retro_gamer.TrainedPolicy
+   :members: get_action
+
+.. autoclass:: retro_gamer.PolicyInput
--- a/docs/background.rst
+++ b/docs/background.rst
@@ -343,12 +343,13 @@ If the character set is not specified, ``retro-gamer`` runs a brief
 exploration phase before training to observe which characters actually
 appear.

-In addition to the board, the agent can observe numerical values from
-the game's state dictionary via ``observe_state``. These are
-appended to the end of the observation vector. The reward key must
-not be included in ``observe_state``: it would give the agent direct
-access to its own performance signal, which is not a realistic observation
-in most game contexts and can cause training pathologies.
+In addition to the board, the agent can observe extra computed values
+from ``game.state``. Listing keys in the ``observe_state`` option of
+``[preprocessing]`` causes those values to be appended to the
+observation vector after the board encoding. This is where feature
+engineering decisions live: which derived quantities should the agent
+see, and would seeing them give it an advantage a human player would
+not have?

 Neural network architectures
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -356,7 +357,8 @@ Neural network architectures
 The architecture of the Q-network—the number and arrangement of its
 layers—is one of the most consequential choices in DQN training.
 ``retro-gamer`` selects an architecture based on the ``spatial``
-field in the game description and generates a plain-language rationale.
+option in ``[preprocessing]`` of ``config.toml`` and generates a
+plain-language rationale.

 **Multilayer perceptrons (MLP)**

@@ -379,8 +381,7 @@ that these numbers were arranged in a 2D grid, or that spatially
 adjacent cells are related. This is appropriate when the game's
 observation is better understood as a collection of independent
 readings—a set of meters or status indicators—rather than as a spatial
-scene. Set ``spatial = false`` in the game description to use this
-architecture.
+scene. ``spatial = false`` (the default) selects this architecture.

 **Convolutional neural networks (CNN)**

@@ -405,8 +406,8 @@ channels respectively, kernel size 3, padding 1) followed by a
 flattening step and an MLP head. The padding ensures that the spatial
 dimensions are preserved through the convolution, so the output of the
 second conv layer has shape (64, H, W), which is then flattened and
-passed to the MLP. Set ``spatial = true`` (the default) to use this
-architecture.
+passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
+use this architecture.

 Connecting architecture to game metadata
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -416,15 +417,17 @@ follow from the game description you provide. This connection is worth
 making explicit, because understanding it is one of the main paths into
 understanding why neural network architecture matters.

- If ``spatial = true``, the CNN can detect local patterns—which characters
-  are adjacent to which—without needing to see every possible arrangement.
-  This is appropriate for games like Snake, where the snake's direction
-  and the apple's relative position are spatially encoded.
+- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
+  local patterns—which characters are adjacent to which—without needing
+  to see every possible arrangement. This is appropriate for games like
+  Snake, where the snake's direction and the apple's relative position
+  are spatially encoded.

- If ``spatial = false``, the MLP treats the board as a flat vector. This
-  may be appropriate for games that use the character grid primarily as a
-  display rather than a spatial field—for example, a game where characters
-  appear in fixed, non-interacting positions as status indicators.
+- If ``spatial = false`` (the default), the MLP treats the board as a
+  flat vector. This may be appropriate for games that use the character
+  grid primarily as a display rather than a spatial field—for example,
+  a game where characters appear in fixed, non-interacting positions as
+  status indicators.

 - The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
@@ -432,11 +435,185 @@ understanding why neural network architecture matters.
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

- The ``observe_state`` fields are appended to the flattened CNN output
-  before the MLP head. This allows the agent to use explicit state
-  variables—a timer, a lives count—alongside the visual board
-  representation.
+- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
+  to the flattened board output before the MLP head. This allows the
+  agent to use computed values—a direction to the goal, a distance, a
+  timer—alongside the visual board representation.

 These relationships are not incidental features of the implementation.
 They are the reason the game description matters: every field you fill
 in shapes what the agent can perceive and therefore what it can learn.
+
+Design rationale
+----------------
+
+This section explains the reasoning behind several design decisions in
+``retro-gamer`` that go beyond technical necessity. Each choice was
+made with a specific pedagogical goal: to create a tool that not only
+trains agents, but also helps students build genuine understanding of
+how and why the training process works.
+
+Checkpoint compatibility and the "start fresh" workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a student changes the game description or network architecture
+mid-training, ``retro-gamer`` refuses to resume and explains exactly
+which fields changed and why they are incompatible. This behavior is
+deliberate.
+
+The immediate practical reason is correctness: if the character set
+changes, the network's input layer changes size, and the saved weights
+no longer correspond to any meaningful function. Loading them would
+produce garbage behavior. If the reward signal changes, the Q-values
+the network has accumulated are estimates of a *different* objective;
+resuming would mislead the network, not help it.
+
+But the deeper reason is pedagogical. The incompatibility check is a
+moment of forced reflection. When a student sees::
+
+   character_set
+     was : ['@', '*', '>', '<', '^', 'v']
+     now : ['@', '*', '>', '<', '^', 'v', '#']
+     why : the set of board characters (changes input layer size)
+
+they are confronted with the concrete consequence of a description
+change. The character set is not a label; it determines the shape of
+the tensor the network operates on. Changing it invalidates the
+network the same way changing the rules of chess would invalidate a
+chess engine. The error message is designed to make this connection
+legible, not just to block a problematic action.
+
+The ``retro-gamer clean`` command exists to make the recovery path
+explicit: you can start fresh, and you should. There is no partial
+salvage. This mirrors an important truth about RL training: some
+decisions are foundational, and changing them means starting over.
+Students who encounter this—who have to decide whether a change is
+worth the cost of retraining—are reasoning about the architecture in
+a way that purely reading about it does not produce.
+
+The distinction between incompatible changes (game description,
+network architecture) and safe changes (hyperparameters like learning
+rate and epsilon) is also pedagogically useful. It encodes, in the
+tool itself, the distinction between *what the agent is learning* and
+*how it is learning*. Students who ask "can I change the learning rate
+without retraining?" are asking a question with a precise answer, and
+answering it correctly requires understanding why the learning rate is
+different in kind from the character set.
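The incompatibility report described above can be pictured as a field-by-field comparison between the saved description and the current one. The following is a minimal sketch, not the tool's actual implementation: the helper name ``describe_incompatibilities`` and the ``REASONS`` table are hypothetical, and the wording for ``actions`` and ``reward`` is an assumption. Only the ``was`` / ``now`` / ``why`` layout is taken from the example message.

```python
# Hypothetical sketch: compare a saved game description with the current one.
# The helper and the REASONS table are illustrative, not retro-gamer's real API.

REASONS = {
    "character_set": "the set of board characters (changes input layer size)",
    "actions": "the available actions (changes output layer size)",
    "reward": "the state key used as reward (changes the objective)",
}


def describe_incompatibilities(saved, current):
    """Return a was/now/why report for every field that differs, or '' if none do."""
    lines = []
    for field, why in REASONS.items():
        was, now = saved.get(field), current.get(field)
        if was != now:
            lines.append(field)
            lines.append(f"  was : {was!r}")
            lines.append(f"  now : {now!r}")
            lines.append(f"  why : {why}")
    return "\n".join(lines)


saved = {"character_set": ["@", "*"], "actions": ["KEY_UP"], "reward": "score"}
current = {"character_set": ["@", "*", "#"], "actions": ["KEY_UP"], "reward": "score"}
report = describe_incompatibilities(saved, current)
print(report)
```

Hyperparameters such as the learning rate are deliberately absent from the table in this sketch, which mirrors the distinction between incompatible and safe changes.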
+
+Checkpoint-level logging
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Early versions of ``retro-gamer`` logged one line per episode. This
+was accurate but not very useful: a run of 1,000 episodes produces
+1,000 log lines, most of which are noise. Individual episodes vary
+widely due to randomness in both the game and the agent's exploration,
+making it hard to see the underlying trend.
+
+The current format logs one line per checkpoint—once every 100
+episodes—using averages over that window. This design serves several
+goals:
+
+**Noise reduction.** Single-episode rewards are highly variable,
+especially when epsilon is high and the agent is behaving randomly.
+Averaging over 100 episodes smooths out this variance and makes
+genuine trends visible.
+
+**Interpretive scaffolding.** The log line includes ``epsilon``
+alongside ``avg_reward``, so students can directly see the
+relationship between exploration rate and performance. Early entries
+with low ``avg_reward`` and high ``epsilon`` invite the question:
+"is this bad performance, or just exploration?" The answer—that random
+behavior is expected when epsilon is near 1—is readable from the log
+itself.
+
+**Timing information.** Each log line records both the elapsed time
+for that 100-episode interval and the total training time accumulated
+across all sessions. This serves two purposes. Practically, it lets
+students estimate how long continued training will take. Conceptually,
+it makes the cost of training tangible: RL is not instant, and the
+log makes the time investment visible.
+
+**Session continuity.** When training resumes from a checkpoint, a
+header line marks the break (``=== Resumed from ep_0500.pt ===``).
+This lets the full log tell the story of a run across multiple
+sessions, preserving the history of when training happened even if the
+student stops and restarts many times.
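The noise-reduction argument above can be checked with synthetic data. This is a minimal sketch under stated assumptions: ``window_averages`` is a hypothetical helper, the rewards are randomly generated rather than taken from a real run, and the actual log line format is not reproduced here.

```python
import random


def window_averages(rewards, window=100):
    """Average per-episode rewards over consecutive, complete windows."""
    averages = []
    for start in range(0, len(rewards) - window + 1, window):
        chunk = rewards[start:start + window]
        averages.append(sum(chunk) / window)
    return averages


rng = random.Random(0)
# Synthetic rewards: a slow upward trend buried in large per-episode noise.
rewards = [episode / 100 + rng.uniform(-5, 5) for episode in range(1000)]
averages = window_averages(rewards)
print(len(rewards), "episodes ->", len(averages), "averaged entries")
```

Any single episode here can land several points above or below the trend, while each averaged entry stays close to it, which is why one line per 100 episodes is easier to read than 100 lines.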
+
+The stop-watch-adjust-resume workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is designed around a workflow that the log format and
+checkpoint system both support: stop training, watch the agent play,
+decide what to change, and resume.
+
+This workflow is pedagogically productive because it gives students
+a *reason* to look at the log and a *reason* to think about
+hyperparameters. Watching the agent at episode 100 play erratically,
+then watching the agent at episode 500 navigate toward the apple more
+consistently, is not just satisfying—it raises concrete questions.
+Why did the agent improve? What changed between those two checkpoints?
+What would happen if we gave it more time, or adjusted the reward?
+
+These questions are best answered by consulting the log. The log in
+turn connects the behavior the student observed to numbers they can
+reason about: a decreasing loss, a declining epsilon, a rising average
+reward. The three—visual observation, log interpretation, and
+conceptual understanding—form a feedback loop that is much harder to
+close if training is treated as a black box that produces only a final
+model.
+
+The fact that training can be stopped and resumed freely, with no
+penalty and no extra flags, removes friction from this cycle. Students
+who feel they can experiment—stop, look, think, resume—are more
+likely to do so than students who feel they have to commit to a full
+training run before seeing results.
+
+Reward design as game description
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
+the game's state dictionary, not a function or a formula. This is
+another deliberate design choice. The reward signal is defined in the
+game code—in how the score changes when certain events occur—not in
+the training configuration.
+
+This forces students to engage with the reward where it lives: in the
+game logic. If a student wants to change the reward structure, they
+must change the game. This connects the RL concept of reward shaping
+to the concrete act of writing Python code that updates a score. The
+question "what reward should the agent get for moving toward the
+apple?" becomes "what code should run when the snake moves?"—and
+answering it requires reasoning about what behavior you want to
+encourage and how a small, frequent signal compares to a large,
+infrequent one.
+
+The distinction between reward-signal design (a pedagogically rich
+question with many possible answers) and reward-field specification
+(a technical detail) is preserved in the interface. Students configure
+the *key* to track; they design the *signal* in the game itself.
+
+Metadata as game description, not training configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The game description lives in ``[tool.retro-gamer]`` inside the
+game's own ``pyproject.toml``, not in a separate training
+configuration file. This placement encodes a claim: the character set,
+the action space, and the reward signal are *properties of the game*,
+not settings for the trainer.
+
+A student who edits the character set is not tweaking the trainer;
+they are more accurately describing their game. This framing matters
+because it positions the student as the expert on the game—which they
+are—and the trainer as a tool that depends on the accuracy of that
+description. Errors in the description are not configuration mistakes;
+they are inaccurate descriptions of something the student knows.
+
+When a student omits a character from the character set and the agent
+fails to notice that character on the board, the diagnostic question
+is not "what went wrong with training?" but "is my description of the
+game correct?" This is a more productive question, because it connects
+the student's domain knowledge (they know what characters appear and
+why they matter) to the technical representation (one-hot encoding
+requires knowing in advance which characters to encode). The fix is
+not to adjust a hyperparameter; it is to describe the game more
+accurately.
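The effect of an incomplete character set can be seen in a small one-hot encoding sketch. It assumes one channel per listed character and an all-zero vector for anything else, including empty space; ``one_hot_board`` is a hypothetical helper written in plain Python, not the encoder ``retro-gamer`` uses.

```python
def one_hot_board(rows, character_set):
    """Encode a board (list of equal-length strings) as nested lists of shape (C, H, W)."""
    index = {ch: i for i, ch in enumerate(character_set)}
    height, width = len(rows), len(rows[0])
    tensor = [[[0] * width for _ in range(height)] for _ in character_set]
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            if ch in index:  # characters outside the set leave every channel at 0
                tensor[index[ch]][y][x] = 1
    return tensor


board = ["@ *", " # "]
full = one_hot_board(board, ["@", "*", "#"])
partial = one_hot_board(board, ["@", "*"])  # '#' omitted from the description

empty_cell = [channel[0][1] for channel in partial]  # the space at x=1, y=0
wall_cell = [channel[1][1] for channel in partial]   # the '#' at x=1, y=1
print(len(full), len(partial))
print(empty_cell == wall_cell)
```

With the shorter character set the ``#`` cell and the empty cell encode identically, so no amount of training can teach the agent to tell them apart.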
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,9 +1,18 @@
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath('..'))
+
 project = 'retro-gamer'
 copyright = '2025, Chris Proctor'
 author = 'Chris Proctor'
 release = '0.1.0'

-extensions = []
+extensions = [
+    'sphinx.ext.autodoc',
+]
+
+autodoc_member_order = 'bysource'

 templates_path = ['_templates']
 exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,17 +31,22 @@ with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
 The retro-games framework must also be installed; see its documentation
 for instructions.

+If you are working through a Making With Code lab, ``retro-gamer`` is
+already installed in your project environment — skip ahead to
+:ref:`installation`.
+
+**Add to a project** using either ``uv`` or ``pip`` (run one of the two commands, not both):
+
 .. code-block:: console

+   % uv add retro-gamer
   % pip install retro-gamer

-To install from source (for development or to use the latest changes):
+**Install as a global tool** (available everywhere, no project needed):

 .. code-block:: console

-   % git clone https://github.com/cproctor/retro-gamer
-   % cd retro-gamer
-   % pip install -e .
+   % uv tool install retro-gamer

 Verify the installation by checking the command-line tool:

@@ -65,5 +70,8 @@ Verify the installation by checking the command-line tool:
   introduction
   background
   walkthrough
+   troubleshooting
   reference
+   integration
+   api
   contributing
--- a/docs/integration.rst
+++ b/docs/integration.rst
@@ -0,0 +1,186 @@
+Integrating a Trained Model
+===========================
+
+Once you have trained a model, you can use it in two ways:
+
+- **PolicyInput** — the model replaces the keyboard, driving an existing
+  player-controlled agent. Use this to watch a trained agent play, or to
+  run automated evaluations.
+- **TrainedPolicy in play_turn** — call ``get_action(game)`` from inside any
+  agent's ``play_turn`` to embed the model as an autonomous character (for
+  example, a smart enemy) alongside human-controlled or other agents.
+
+Loading a trained model
+-----------------------
+
+Both approaches start by creating a :class:`retro_gamer.TrainedPolicy`:
+
+.. code-block:: python
+
+   from retro_gamer import TrainedPolicy
+
+   ai = TrainedPolicy("runs/snake/")
+
+This reads ``config.toml``, rebuilds the network, and loads the latest
+checkpoint. To load a specific checkpoint instead:
+
+.. code-block:: python
+
+   ai = TrainedPolicy("runs/snake/", checkpoint="ep_0500")
+
+PolicyInput: model as player
+----------------------------
+
+:class:`retro_gamer.PolicyInput` is an input source — it implements the same
+interface as keyboard input, but chooses actions using the trained model. Pass
+it to ``game.play()`` and everything else works exactly as usual:
+
+.. code-block:: python
+
+   from retro.examples.snake import create_game
+   from retro_gamer import TrainedPolicy, PolicyInput
+
+   ai = TrainedPolicy("runs/snake/")
+   game = create_game()
+   game.play(input_source=PolicyInput(ai, game))
+
+On each turn, ``PolicyInput`` observes the current board and game state, runs
+the model, and sends the chosen action to the game exactly as if the player
+had pressed that key.
+
+TrainedPolicy in play_turn: model as autonomous character
+---------------------------------------------------------
+
+To embed a trained model as an autonomous game character, create a
+``TrainedPolicy`` at module level and call ``get_action(game)`` from inside
+the agent's ``play_turn``. Placing it at module level means the model is
+loaded from disk once — not once per episode.
+
+.. code-block:: python
+
+   from retro.game import Game
+   from retro.examples.snake import Apple, SnakeHead
+   from retro_gamer import TrainedPolicy
+
+   _ai = TrainedPolicy("runs/snake/")
+
+   class AISnake(SnakeHead):
+       def handle_keystroke(self, k, game): pass  # ignore keyboard
+
+       def play_turn(self, game):
+           key = _ai.get_action(game)
+           if key == 'KEY_RIGHT': self.direction = (1, 0)
+           elif key == 'KEY_LEFT': self.direction = (-1, 0)
+           elif key == 'KEY_UP': self.direction = (0, -1)
+           elif key == 'KEY_DOWN': self.direction = (0, 1)
+           super().play_turn(game)
+
+   human_snake = SnakeHead()
+   ai_snake = AISnake()
+   ai_snake.position = (16, 8)
+   apple = Apple()
+
+   game = Game([human_snake, ai_snake, apple], {"score": 0}, board_size=(32, 16))
+   apple.relocate(game)
+   game.play()
+
+Training an enemy model
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use the same training pipeline to produce a model for an enemy agent.
+``retro-gamer`` does not care *which* character it is training — it only cares
+that it can control one character through the keyboard and read a reward signal
+from the game state. To train an enemy:
+
+1. **Create an enemy-perspective game variant.** Write (or add) a
+   ``create_game`` function — in a separate file, or alongside your main one —
+   where the enemy agent is the keyboard-driven character and the reward key
+   in the game state reflects the enemy's objective (for example, a bonus for
+   catching the player). The human player can be absent, replaced by a
+   random-moving agent, or driven by a ``TrainedPolicy`` once you have a trained
+   player model.
+
+   .. code-block:: python
+
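+      from retro.game import Game
+
+      # EnemyAgent and RandomPlayer stand for agent classes you define
+      # in your own game module.
+      def create_enemy_training_game():
+          enemy = EnemyAgent()       # the character the trainer will control
+          player = RandomPlayer()    # a stand-in; no human involved
+          game = Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
+          return game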
+
+2. **Train normally against this variant.**
+
+   .. code-block:: console
+
+      % retro-gamer create --game my_game:create_enemy_training_game \
+                           --output runs/enemy/
+      % retro-gamer train runs/enemy/
+
+3. **Embed the trained model in your main game** using ``get_action``, exactly
+   as shown above.
+
+.. note::
+
+   Because ``retro-gamer`` injects actions through the game's global input
+   source, *all* keyboard-listening agents in the training game will receive
+   the trainer's keystrokes. The cleanest approach is to make the enemy the
+   only keyboard-driven character in the training variant — any other
+   characters should advance on their own without reading from the keyboard.
+
+Adversarial training
+~~~~~~~~~~~~~~~~~~~~~
+
+Once you have separate training runs for the player and the enemy, you can
+train them *against each other* iteratively. The idea is simple: train the
+player against the current enemy model, then train the enemy against the
+updated player model, and repeat. Each side is forced to improve against an
+increasingly capable opponent.
+
+The key technique is to load the opponent's model at module level in each
+training game variant, so it is loaded from disk once per run rather than
+once per episode:
+
+.. code-block:: python
+
+   # enemy_training_game.py
+   from retro.game import Game
+   from retro_gamer import TrainedPolicy
+
+   # EnemyAgent and AIPlayer stand for agent classes defined in your own game.
+   _player = TrainedPolicy("runs/player/")   # loaded once when the module is imported
+
+   def create_game():
+       enemy = EnemyAgent()
+       player = AIPlayer(_player)           # uses _player.get_action in play_turn
+       return Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
+
+You then alternate training runs:
+
+.. code-block:: console
+
+   % retro-gamer train runs/player/   # train player against current enemy
+   % retro-gamer train runs/enemy/    # train enemy against updated player
+   % retro-gamer train runs/player/   # train player again
+   # ...
+
+How many episodes to run before switching is itself a design decision: too
+few and neither model has time to adapt; too many and each side overfits to
+its current opponent. Watching how the strategies evolve — and asking *why*
+each model behaves as it does at each stage — connects directly to concepts
+in multi-agent reinforcement learning and adversarial training.
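The alternation above can be scripted rather than typed by hand. The sketch below only builds the command lines in order; ``alternating_commands`` is a hypothetical helper, and how many episodes each invocation trains depends on each run's configuration, which this sketch does not set.

```python
def alternating_commands(rounds, player_run="runs/player/", enemy_run="runs/enemy/"):
    """Build argv lists for `rounds` player/enemy training pairs, player first."""
    commands = []
    for _ in range(rounds):
        commands.append(["retro-gamer", "train", player_run])
        commands.append(["retro-gamer", "train", enemy_run])
    return commands


schedule = alternating_commands(rounds=3)
for argv in schedule:
    # To execute for real, pass argv to subprocess.run(argv, check=True).
    print(" ".join(argv))
```

Because the opponent model is loaded at module level, each invocation picks up whichever opponent checkpoint is latest when that run starts.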
+
+Differences between the two approaches
+---------------------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - ``PolicyInput``
+     - ``TrainedPolicy`` in ``play_turn``
+   * - Replaces human input for the whole game
+     - One autonomous agent among many
+   * - Game code is unchanged
+     - Agent's ``play_turn`` calls ``get_action``
+   * - One model drives all player-controlled agents
+     - Each agent instance has its own model
+   * - Simpler — just pass to ``game.play()``
+     - More flexible — mix human and AI characters
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -100,12 +100,12 @@ matters.

 **Observation design** determines what information is available to the
 agent. If you leave a character out of the ``character_set``, the agent
-will not distinguish it from empty space. If you include a game-state
-variable in ``observe_state``, the agent can see it directly rather than
-having to infer it from the board. The consequences of these choices for
-what the agent can learn are reasonably predictable—and making and
-checking those predictions is exactly the kind of reasoning the tool is
-designed to support.
+will not distinguish it from empty space. If the game module defines a
+``get_state()`` function, the agent also receives those computed values
+as part of its observation. The consequences of these choices for what
+the agent can learn are reasonably predictable — and making and checking
+those predictions is exactly the kind of reasoning the tool is designed
+to support.

 **Reward engineering** is the craft of specifying what counts as doing
 well in a way the agent can actually optimize. Using score as the reward
--- a/docs/reference.rst
+++ b/docs/reference.rst
@@ -17,8 +17,6 @@ A complete example for the Snake game:
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
-   spatial = true
-   observe_state = []

 You do not need to specify the board size: ``retro-gamer`` reads it
 directly from your game's ``board_size`` attribute.
@@ -65,54 +63,156 @@ If omitted, ``retro-gamer`` runs an exploration phase to discover the
 characters that appear in practice. The length of this phase is
 controlled by the ``exploration_turns`` hyperparameter.

-``spatial``
-~~~~~~~~~~~
+Preprocessing options
+---------------------

-**Optional; default ``true``.** Whether to treat the board as a 2D
-spatial scene. When ``true``, the trainer uses a convolutional neural
-network (CNN) that can detect patterns in the relative positions of
-characters. When ``false``, the trainer uses a multilayer perceptron
-(MLP) that sees the board as a flat list of numbers without positional
-structure.
+Preprocessing options live in the ``[preprocessing]`` section of a run's
+``config.toml``. They control how the game's board and state are
+transformed into the observation vector that the neural network sees.
+``retro-gamer create`` writes sensible defaults; you can edit them by
+hand before running ``retro-gamer train``.
+
+.. note::
+
+   Changes to any ``[preprocessing]`` option—or to the game description
+   fields above—make existing checkpoints incompatible. Run
+   ``retro-gamer clean`` before retraining after such changes.
+
+``spatial`` (default: ``false``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to treat the board as a 2D spatial scene. When ``true``, the
+trainer uses a convolutional neural network (CNN); when ``false``, a
+multilayer perceptron (MLP) that sees the board as a flat list of
+numbers.
+
+``board`` (default: ``true``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to include the board encoding in the observation vector. Set
+to ``false`` to train on game state variables only, with no board at
+all. This is useful for games with small, enumerable state spaces where
+a lookup table (classic Q-learning) is sufficient.
+
+When ``board = false``:
+
+- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
+- At least one key must be listed in ``observe_state``.
+- ``character_set`` is not required and character discovery is skipped.

 .. code-block:: toml

-   spatial = true
+   [preprocessing]
+   board = false
+   observe_state = ["board_state"]

-``observe_state``
-~~~~~~~~~~~~~~~~~
+``observe_state`` (default: ``[]``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-**Optional; default ``[]``.** A list of keys from the game's state
-dictionary to append to the observation vector. The values must be
-numbers (integers, floats, or booleans). The reward key must not
-appear in this list.
+A list of keys from ``game.state`` to include in the observation
+vector, appended after the board encoding (or as the entire
+observation when ``board = false``). Scalar values contribute one
+element each; list or tuple values are flattened.

 .. code-block:: toml

-   observe_state = ["lives", "level"]
+   observe_state = ["apple_dx", "apple_dy"]
+
+The keys must be present in ``game.state`` at every step, initialized
+in ``create_game()`` before the game starts. All values that are lists
+or tuples must always have the same length from episode to episode.
+
+.. warning::
+
+   ``observe_state`` keys must be initialized to their final shape in
+   ``create_game()`` before the game starts. If a key is absent or its
+   list length changes between episodes, training will crash with an
+   error explaining which key changed and by how much. This happens
+   because the neural network's input layer has a fixed size determined
+   at the start of training; it cannot adapt to a changing observation
+   shape mid-run.
+
+   Always initialize every observed key with a placeholder of the
+   correct type and length before the first ``game.step()`` call.
+
+``observe_state_sizes`` (auto-discovered)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A table mapping each ``observe_state`` key to its flat size (``1`` for
+scalars, ``N`` for sequences of length N). This is written automatically
+to ``config.toml`` the first time ``retro-gamer train`` runs, after the
+trainer samples ``game.state`` to discover the actual sizes:
+
+.. code-block:: toml
+
+   observe_state_sizes = {board_state = 9}
+
+You do not need to set this manually. Once written, it is used to
+detect changes in state shape when resuming training—an incompatible
+change here requires running ``retro-gamer clean`` and starting fresh.
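The size table described above can be pictured as follows. This is a minimal sketch assuming that nested lists and tuples are flattened recursively and that every other value counts as one element; ``flat_size``, ``discover_sizes`` and ``changed_sizes`` are hypothetical names, not functions exported by ``retro-gamer``.

```python
def flat_size(value):
    """Number of elements a state value contributes once flattened."""
    if isinstance(value, (list, tuple)):
        return sum(flat_size(item) for item in value)
    return 1


def discover_sizes(state, observe_state):
    """Map each observed key to its flat size in the given state."""
    return {key: flat_size(state[key]) for key in observe_state}


def changed_sizes(recorded, state):
    """Return {key: (recorded, actual)} for keys whose size no longer matches."""
    changes = {}
    for key, expected in recorded.items():
        actual = flat_size(state[key]) if key in state else None  # None: key missing
        if actual != expected:
            changes[key] = (expected, actual)
    return changes


state = {"score": 0, "board_state": [0] * 9, "apple_dx": 0}
sizes = discover_sizes(state, ["board_state", "apple_dx"])
later_state = {"score": 3, "board_state": [0] * 8, "apple_dx": 1}
print(sizes)
print(changed_sizes(sizes, later_state))
```

A non-empty result from the comparison corresponds to the situation in which the network's fixed input size no longer matches the observation.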
+
+``egocentric`` (default: ``false``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``true``, the board observation is cropped to a square window
+centred on a specific agent rather than the full board. This gives the
+agent a local, first-person-like view and makes the observation
+invariant to the agent's absolute position on the board.
+
+Requires ``egocentric_player`` and ``egocentric_radius``.
+
+``egocentric_player``
+~~~~~~~~~~~~~~~~~~~~~~
+
+The name of the agent to use as the centre of the egocentric crop.
+Must match the ``name`` attribute of one of the game's agents.
+
+.. code-block:: toml
+
+   egocentric_player = "Snake head"
+
+``egocentric_radius``
+~~~~~~~~~~~~~~~~~~~~~~
+
+The half-side-length of the egocentric crop window, in cells. The
+resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
+values give the agent a wider view; smaller values focus it on the
+immediate vicinity.
+
+.. code-block:: toml
+
+   egocentric_radius = 8   # 17×17 window
+
+When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
+automatically updated to ``[2r+1, 2r+1]`` so the network is sized
+correctly.
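How the preprocessing options combine into the length of the observation vector can be worked out by hand. The sketch below assumes one one-hot channel per character in ``character_set`` and no extra channel for empty cells; ``observation_length`` is a hypothetical helper, and the board size and state keys are taken from the Snake examples in this document.

```python
def observation_length(board_size, n_characters, state_sizes,
                       board=True, egocentric_radius=None):
    """Flat observation length under the stated one-channel-per-character assumption."""
    width, height = board_size
    if egocentric_radius is not None:
        # An egocentric crop replaces the full board with a (2r+1) x (2r+1) window.
        width = height = 2 * egocentric_radius + 1
    board_cells = width * height * n_characters if board else 0
    return board_cells + sum(state_sizes.values())


snake_state = {"apple_dx": 1, "apple_dy": 1}
full_board = observation_length((32, 16), 6, snake_state)
cropped = observation_length((32, 16), 6, snake_state, egocentric_radius=8)
state_only = observation_length((32, 16), 0, {"board_state": 9}, board=False)
print(full_board, cropped, state_only)
```

Every input to this calculation is something that invalidates existing checkpoints when changed, which is why the note at the top of this section asks for ``retro-gamer clean`` after editing them.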

 .. _hyperparameters:

 Hyperparameters
 ---------------

-Hyperparameters are stored in the ``[hyperparameters]`` section of
-``config.toml``. They can be set via ``retro-gamer create`` options or
-edited directly.
+Hyperparameters are split across two sections of ``config.toml``:
+
+- ``[model]`` — network architecture (changing these requires starting fresh)
+- ``[training]`` — learning algorithm parameters (safe to change at any time)
+
+Both sections can be set via ``retro-gamer create`` options or edited directly.

 Learning and optimization
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-``learning_rate`` (default: ``0.001``)
+``learning_rate`` (default: ``0.0001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values converge faster but may be unstable; smaller
    values are more stable but slower.

-``lr_decay`` (default: ``0.995``)
+``learning_rate_decay`` (default: ``0.9999``)
    Multiplicative decay applied to the learning rate after each
    episode. The learning rate decreases geometrically over training,
    helping the network fine-tune later without destabilizing early
-    progress.
+    progress. With the default value, the learning rate decays to about
+    13.5 % of its starting value after 20 000 episodes.

 ``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
@@ -127,7 +227,7 @@ Exploration
    random action with probability ``epsilon`` and exploits its current
    Q-function with probability ``1 - epsilon``.

-``epsilon_decay`` (default: ``0.995``)
+``epsilon_decay`` (default: ``0.9997``)
    Multiplicative decay applied to ``epsilon`` after each episode.

 ``epsilon_min`` (default: ``0.05``)
@@ -142,31 +242,33 @@ Memory and sampling
    The number of experiences sampled from the replay buffer per
    training step.

-``memory_capacity`` (default: ``10000``)
+``memory_capacity`` (default: ``50000``)
    The maximum number of experiences the replay buffer can hold. When
    full, the oldest experiences are discarded.

-``prioritize_experiences`` (default: ``false``)
+``prioritize_experiences`` (default: ``true``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger TD errors are sampled more frequently.
    This often improves sample efficiency at a modest computational
    cost.

-Network architecture
-~~~~~~~~~~~~~~~~~~~~
+Model architecture
+~~~~~~~~~~~~~~~~~~

-``n_layers`` (default: ``2``)
-    The number of hidden layers in the MLP head (for spatial games,
-    this follows the CNN; for non-spatial games, it is the full
-    network).
+These live in the ``[model]`` section. Changing them requires starting fresh
+(run ``retro-gamer clean`` before retraining).

-``layer_size`` (default: ``128``)
-    The width (number of units) in each hidden layer.
+``hidden_sizes`` (default: ``[128, 64]``)
+    A list of integers giving the size of each hidden layer in the MLP
+    head. The default creates two layers: 128 units then 64. For spatial
+    games this follows the CNN; for non-spatial games it is the full
+    network. Larger or deeper networks can represent more complex
+    Q-functions but train more slowly and may need more episodes.

 Training duration
 ~~~~~~~~~~~~~~~~~

-``training_episodes`` (default: ``1000``)
+``training_episodes`` (default: ``20000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.

@@ -175,12 +277,18 @@ Training duration
    indefinitely (for example, if the agent finds a way to avoid
    dying).

-``target_update_freq`` (default: ``100``)
+``target_update_freq`` (default: ``500``)
    How many training steps between updates of the target network.
    More frequent updates make training targets move faster (less
    stable); less frequent updates make them more stable but slower
    to reflect new learning.

+``train_every`` (default: ``4``)
+    Run one training step every N game steps. Higher values speed up
+    episode collection at the cost of fewer gradient updates per
+    experience. The default of 4 is a good balance for most games;
+    set to 1 to train on every step.
+
 Character discovery
 ~~~~~~~~~~~~~~~~~~~

@@ -207,23 +315,26 @@ game's ``pyproject.toml``; you do not pass it on the command line.

 .. code-block:: console

-   % retro-gamer create --game MODULE --output DIR [OPTIONS]
+   % retro-gamer create --game GAME --output DIR [OPTIONS]

 **Required options:**

- ``--game MODULE`` — Python module containing ``create_game()``
-  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
-  is read from the ``pyproject.toml`` found in or above the module's
-  source directory.
+- ``--game GAME`` — Your game, specified as a file path or a Python
+  module name:
+
+  - File path: ``--game my_game.py`` or ``--game my_game/``
+  - Module name: ``--game retro.examples.snake``
+
+  The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
+  found in or above the game file.
 - ``--output DIR`` — Directory to create for this training run.

 **Hyperparameter options** (all optional; see :ref:`hyperparameters`):

 - ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
+- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
 - ``--learning-rate F``
- ``--lr-decay F``
+- ``--learning-rate-decay F``
 - ``--gamma F``
 - ``--epsilon-decay F``
 - ``--epsilon-min F``
@@ -232,20 +343,40 @@ game's ``pyproject.toml``; you do not pass it on the command line.
 - ``--target-update-freq N``
 - ``--max-turns-per-episode N``
 - ``--exploration-turns N``
+- ``--train-every N``
 - ``--prioritize-experiences`` / ``--no-prioritize-experiences``

 ``retro-gamer train``
 ~~~~~~~~~~~~~~~~~~~~~

-Train (or resume training) a DQN agent.
+Train a DQN agent.

 .. code-block:: console

-   % retro-gamer train RUN_DIR [--resume CHECKPOINT]
+   % retro-gamer train RUN_DIR

 ``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
-create``. If ``--resume`` is given, training resumes from the specified
-checkpoint file (relative or absolute path).
+create``. If checkpoints already exist in ``RUN_DIR``, training
+automatically resumes from the latest one so prior work is never lost.
+
+If all configured episodes have already been completed, the command
+prints a message and exits immediately. To keep training, increase
+``training_episodes`` in ``config.toml`` and run again.
+
+**Incompatible changes.** Some config changes make existing checkpoints
+unusable. If you change any of the following, ``retro-gamer train`` will
+detect the mismatch and refuse to resume, with a clear explanation:
+
+- ``actions``, ``reward``, ``character_set``, ``board_size``
+  (``[metadata]``) — game description
+- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
+  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
+  (``[preprocessing]``) — observation encoding
+- ``hidden_sizes`` (``[model]``) — network architecture
+
+Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
+fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
+safe and take effect immediately on the next training run.

 ``retro-gamer play``
 ~~~~~~~~~~~~~~~~~~~~
@@ -256,16 +387,32 @@ Watch a trained agent play the game in the terminal.

   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

-``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
-name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
+By default, the latest available checkpoint is loaded. Use
+``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
 ``--framerate`` sets the target frames per second (default: 12). Press
 Enter or Escape to quit.

+``retro-gamer clean``
+~~~~~~~~~~~~~~~~~~~~~
+
+Remove all checkpoints and the training log from a run directory.
+
+.. code-block:: console
+
+   % retro-gamer clean RUN_DIR
+
+Prompts for confirmation before deleting. Use ``--yes`` / ``-y`` to skip
+the prompt. The ``config.toml`` is preserved so you can run
+``retro-gamer train`` immediately to start fresh with the same settings.
+
+Use this after making an incompatible change (see ``retro-gamer train``
+above) or any time you want to restart training from scratch.
+
 ``retro-gamer info``
 ~~~~~~~~~~~~~~~~~~~~~

 Print a summary of a training run: metadata, hyperparameters, recent
-episode log, and available checkpoints.
+checkpoint log, and available checkpoints.

 .. code-block:: console

@@ -285,60 +432,49 @@ contents:
   └── checkpoints/
       ├── ep_0100.pt    # model weights at episode 100
       ├── ep_0200.pt
-       ├── ...
-       └── final.pt      # model weights at training completion
+       └── ...           # one file saved every 100 episodes

 ``config.toml`` is written by ``retro-gamer create`` and updated (with
 the discovered character set and resolved hyperparameters) when
-``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
-and ``train`` is the recommended way to adjust hyperparameters.
+``retro-gamer train`` begins. It has five sections: ``[game]``,
+``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
+Editing ``config.toml`` between ``create`` and ``train`` is the
+recommended way to adjust hyperparameters.

-``training.log`` begins with the full architecture description
-generated at training startup, followed by one line per episode in the
-format::
+``training.log`` begins with the full network architecture description,
+then one line per checkpoint (every 100 episodes) in the format::

-   [EP NNNN] total_reward=F  steps=N  epsilon=F  avg_loss=F
+   [ep_NNNN]  ep=SSSS-NNNN  avg_reward=F  avg_steps=N  epsilon=F  avg_loss=F  time=Xm Xs  total=Xm Xs

-Checkpoint files are PyTorch state dictionaries containing model
-weights, optimizer state, the current epsilon, and the total number of
-training steps completed. They can be loaded with
-``retro-gamer play`` or directly with the Python API.
+Each field averages over the episodes since the previous checkpoint:
+
+- ``ep=SSSS-NNNN`` — episode range covered by this entry
+- ``avg_reward`` — mean total reward per episode (positive = good)
+- ``avg_steps`` — mean episode length in game turns
+- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
+- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
+  stabilises). With prediction q and target t, Huber loss equals ½·(q−t)² when
+  |q−t| ≤ 1 and |q−t|−½ otherwise, so it grows only linearly (not quadratically)
+  when errors are large; it is not bounded. Values in the range
+  0–10 are typical; a slow downward trend over thousands of episodes is the
+  healthy pattern. A loss that grows without bound indicates a learning rate
+  that is too high.
+- ``time`` — wall-clock time for this checkpoint interval
+- ``total`` — cumulative training time across all sessions
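+
+The piecewise definition in the ``avg_loss`` entry can be sketched in plain
+Python. This is an illustration only: it assumes a threshold of 1 (the value
+implied by the formula above), and ``huber`` is a hypothetical helper, not
+part of the ``retro_gamer`` API.
+
+.. code-block:: python
+
+   def huber(q, t, delta=1.0):
+       """Huber loss between a predicted Q-value q and its target t."""
+       error = abs(q - t)
+       if error <= delta:
+           # Quadratic region: small errors behave like squared error.
+           return 0.5 * error ** 2
+       # Linear region: large errors grow linearly, not quadratically.
+       return delta * (error - 0.5 * delta)
+
+   print(huber(1.5, 1.0))    # 0.125
+   print(huber(11.0, 1.0))   # 9.5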
+
+When training is resumed, a ``=== Resumed from ... ===`` line is appended
+so the log records the full history of a run across multiple sessions.

 Python API
 ----------

 For advanced use, ``retro-gamer``'s components are importable as a
-library.
+library. See the :doc:`api` reference for full details.

 .. code-block:: python

-   from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
+   from retro_gamer import GameMetadata, DQNTrainer
   from retro.examples.snake import create_game

-   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
   metadata = GameMetadata.from_pyproject("retro.examples.snake")
-
-   trainer = DQNTrainer(
-       create_game, metadata, "runs/snake/",
-       training_episodes=500,
-       n_layers=2,
-       layer_size=128,
-   )
+   trainer = DQNTrainer(create_game, metadata, "runs/snake/")
   trainer.train()
-
-``GameEnvironment`` provides a gym-style interface for stepping through
-a game programmatically:
-
-.. code-block:: python
-
-   from retro_gamer import GameEnvironment
-
-   env = GameEnvironment(create_game, metadata)
-   obs = env.reset()             # returns initial observation vector
-   obs, reward, done = env.step("KEY_RIGHT")
-
-The observation is a flat NumPy array of dtype ``float32``. For spatial
-games, the first ``C × H × W`` elements are the board (channel-first
-one-hot encoding); for non-spatial games, the board is encoded
-``H × W × C`` and then flattened. Any ``observe_state`` values are
-appended at the end.
--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@@ -0,0 +1,287 @@
+Troubleshooting
+===============
+
+This section describes problems that commonly arise when training an agent
+with ``retro-gamer``. Each entry names the issue, describes what you will
+see in the training log or when watching the agent play, explains what is
+happening in terms of the underlying reinforcement learning, and suggests
+how to fix it.
+
+.. contents:: Issues
+   :local:
+   :depth: 1
+
+
+Loss grows rapidly over training
+---------------------------------
+
+**Symptoms**
+
+The ``avg_loss`` column in the training log grows steadily from one
+checkpoint to the next, often at an accelerating rate::
+
+   [ep_0100]  avg_loss=22.2
+   [ep_0200]  avg_loss=128.5
+   [ep_0300]  avg_loss=2918.5
+   [ep_0400]  avg_loss=163825.1
+
+Left unchecked, the loss eventually reaches extreme values and the agent's
+behavior becomes erratic or degenerates entirely.
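+
+A short script can flag this pattern in ``training.log``. The sketch below
+assumes only that each checkpoint line contains an ``avg_loss=F`` field, as
+in the excerpt above. The helper names and the thresholds (``factor``,
+``run``) are hypothetical choices, not part of ``retro-gamer``.
+
+.. code-block:: python
+
+   import re
+
+   LOSS_PATTERN = re.compile(r"avg_loss=([0-9.eE+-]+)")
+
+   def read_losses(lines):
+       """Extract the avg_loss value from each checkpoint line."""
+       losses = []
+       for line in lines:
+           match = LOSS_PATTERN.search(line)
+           if match:
+               losses.append(float(match.group(1)))
+       return losses
+
+   def loss_is_diverging(losses, factor=2.0, run=3):
+       """True if the loss grew by at least `factor` for `run` checkpoints in a row."""
+       streak = 0
+       for previous, current in zip(losses, losses[1:]):
+           streak = streak + 1 if current >= factor * previous else 0
+           if streak >= run:
+               return True
+       return False
+
+   log = [
+       "[ep_0100]  avg_loss=22.2",
+       "[ep_0200]  avg_loss=128.5",
+       "[ep_0300]  avg_loss=2918.5",
+       "[ep_0400]  avg_loss=163825.1",
+   ]
+   print(loss_is_diverging(read_losses(log)))   # True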
+
+**Why this happens**
+
+This is called *Q-value divergence*. The Q-network is trained to predict
+the total future reward of each action. To do that, it computes a *target*
+for each prediction — but the target itself is computed using the
+Q-network's own current predictions. This creates a feedback loop: if
+the predictions are slightly off, the targets drift, which makes the next
+predictions slightly more off, which drifts the targets further.
+
+Under normal conditions, the learning rate is small enough and the target
+network stable enough that this loop stays controlled. Divergence happens
+when the learning rate is too high, causing each update to overshoot.
+The problem is amplified by larger networks (more parameters to overshoot)
+and by prioritized experience replay, which deliberately samples the
+experiences the network is most wrong about — exactly the experiences most
+likely to destabilize it.
+
+**How to fix it**
+
+Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
+(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
+training. If you recently increased the size of the network (via
+``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
+rate than you used before is likely necessary — larger, more capable
+networks need smaller, more careful updates.
+
+Also consider increasing ``target_update_freq``. The target network is a
+frozen copy of the Q-network used to compute stable training targets; the
+less frequently it is updated, the more stable those targets are. The
+default is 500 steps; raising it (for example, to 1000) slows learning slightly
+but reduces the chance of divergence.
+
+Because divergence compounds over many episodes, a run that has begun
+diverging cannot simply be resumed with a lower learning rate — the
+weights have already drifted far from useful values. Use
+``retro-gamer clean`` to remove the existing checkpoints and start fresh.
+
+
+Agent ignores some actions entirely
+-------------------------------------
+
+**Symptoms**
+
+After training, the agent never (or almost never) turns in certain
+directions, regardless of the board state. If you compare checkpoints at
+different stages of training, the missing directions are absent from the
+very beginning and never appear. The agent may survive for a while but
+always move in only a subset of the possible directions.
+
+**Why this happens**
+
+If some actions lead to immediate death every time they are tried early in
+training, the Q-network quickly learns to assign them very low values.
+This is correct in the specific situation where those actions are always
+fatal — but the network then generalizes that association across *all*
+board positions, even positions where those actions would be safe.
+
+A common cause is a fixed starting position at the edge or corner of the
+board. A snake that always starts in the top-left corner and always begins
+moving downward will die immediately whenever it turns up or left in the
+first step. After thousands of early episodes where those actions produce
+instant death, the network has seen so much evidence that "turn left →
+die" and "turn up → die" that it assigns them low Q-values everywhere.
+
+**How to fix it**
+
+Make sure the game's starting conditions give the agent a chance to try
+every action safely. For a snake game, this means randomizing both the
+starting position (keeping at least one cell away from every edge) and
+the starting direction at the beginning of each episode. An agent that
+starts in different places and orientations each time will quickly learn
+that all four directions can be appropriate depending on context.
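+
+A minimal sketch of such a randomized start, written in plain Python without
+any ``retro`` classes. The board size matches the Snake example (32 × 16);
+``random_start``, its ``margin`` parameter, and the ``DIRECTIONS`` table are
+hypothetical names introduced for illustration.
+
+.. code-block:: python
+
+   import random
+
+   # (dx, dy) offsets for the four movement directions.
+   DIRECTIONS = {
+       "KEY_RIGHT": (1, 0),
+       "KEY_UP": (0, -1),
+       "KEY_LEFT": (-1, 0),
+       "KEY_DOWN": (0, 1),
+   }
+
+   def random_start(board_size, margin=1, rng=random):
+       """Pick a start cell at least `margin` cells from every edge, plus a direction."""
+       width, height = board_size
+       x = rng.randint(margin, width - 1 - margin)
+       y = rng.randint(margin, height - 1 - margin)
+       direction = rng.choice(list(DIRECTIONS))
+       return (x, y), direction
+
+   position, direction = random_start((32, 16))
+
+Calling ``random_start`` at the beginning of each episode (for example, from
+``create_game``) gives the agent a different position and heading every time.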
+
+
+Agent survives but never moves toward the goal
+-----------------------------------------------
+
+**Symptoms**
+
+The ``avg_steps`` column in the training log increases steadily — the
+agent is surviving longer — but ``avg_reward`` stays negative or barely
+improves. When you watch the agent play, it wanders around the board
+without ever approaching the target object. Episodes end because the
+agent runs into a wall, not because it reached the goal.
+
+**Why this happens**
+
+The reward signal is *asymmetric*: it penalizes moving away from the goal
+but gives no reward for moving toward it. With this signal, the agent
+learns to avoid the penalty by surviving, but it has no positive gradient
+pointing it in the right direction. The eventual goal-reaching reward
+(eating the apple, reaching the exit, etc.) is too rare — especially
+early in training when the agent is mostly acting randomly — to provide
+meaningful learning signal on its own.
+
+From the Q-network's perspective, all directions look roughly equivalent:
+moving toward the goal is 0 reward, moving away is −1. On a large board,
+the probability of eating the apple by chance is small enough that the
+network may never see the positive terminal reward at all during the
+exploration phase.
+
+**How to fix it**
+
+Make the distance-based reward symmetric: give **+1 for moving toward the
+goal** and **−1 for moving away**. This way, every single step provides a
+meaningful signal in the correct direction, and the agent does not need to
+reach the goal by chance in order to start learning. In a snake game,
+computing this signal requires only one line of arithmetic — the change
+in Manhattan distance between the head and the apple from one step to the
+next.
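+
+The change in Manhattan distance can be sketched as follows. Positions are
+``(x, y)`` tuples and both function names are hypothetical; how you read the
+head and apple positions depends on your own game code.
+
+.. code-block:: python
+
+   def manhattan(a, b):
+       """Manhattan distance between two (x, y) cells."""
+       return abs(a[0] - b[0]) + abs(a[1] - b[1])
+
+   def shaping_reward(old_head, new_head, apple):
+       """+1 if the step moved the head closer to the apple, -1 if farther."""
+       return manhattan(old_head, apple) - manhattan(new_head, apple)
+
+   print(shaping_reward((5, 5), (6, 5), (10, 5)))   # 1
+   print(shaping_reward((5, 5), (4, 5), (10, 5)))   # -1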
+
+Note that the shaped ±1 signal is a *proxy* for the real objective. If the
+agent learns to follow it too literally, it may take direct paths that run
+through its own body. The −10 death penalty and +50 apple reward are still
+necessary; the shaping only accelerates early learning.
+
+
+Exploration ends before learning is complete
+---------------------------------------------
+
+**Symptoms**
+
+The ``epsilon`` column in the training log reaches ``epsilon_min`` well
+before training is finished. After that point, ``avg_reward`` stops
+improving even though many episodes remain. When you watch the agent play,
+it commits to the same strategy regardless of what is happening on the
+board.
+
+**Why this happens**
+
+Epsilon controls the balance between exploration (random actions) and
+exploitation (using the learned policy). Early in training, when the
+Q-network has seen little data, exploration is essential: the agent needs
+to try different things to accumulate the varied experiences that make
+Q-value estimates reliable. Once epsilon reaches its minimum, the agent
+explores only rarely (a fraction ``epsilon_min`` of its actions stay random)
+and otherwise commits to whatever policy it has learned so far.
+
+If ``training_episodes`` is too small relative to ``epsilon_decay``, the
+exploration phase ends while the Q-network is still unreliable. The agent
+then exploits a half-learned policy that improves slowly because it rarely
+tries anything new.
+
+You can calculate when epsilon will reach its minimum:
+
+.. code-block:: python
+
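+   import math
+   # Example values; substitute the ones from your config.toml.
+   epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999
+   episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)
+
+With these values (``epsilon = 1.0``, ``epsilon_min = 0.05``,
+``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. With the
+default ``epsilon_decay`` of ``0.9997`` it comes to roughly 10,000. The
+agent should have substantial training time *after* the exploration phase
+ends, so ``training_episodes`` should be at least several times this
+number.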
+
+**How to fix it**
+
+Increase ``training_episodes`` so that the agent has many episodes of
+exploitation after the exploration phase ends. For simple games on small
+boards, 10,000 episodes is a reasonable starting point; for complex games
+or large boards, 50,000–100,000 may be needed.
+
+This is always safe to change. Because ``training_episodes`` does not
+affect the network architecture or the reward signal, you can increase it
+in ``config.toml`` and resume training from the latest checkpoint without
+starting fresh.
+
+
+Death penalty dominates all other signals
+-------------------------------------------
+
+**Symptoms**
+
+After a period of training, the agent survives for many steps but rarely
+or never scores. It tends to circle, hug walls, or otherwise avoid the
+goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
+persistently negative. The agent behaves as if staying alive is the only
+objective.
+
+**Why this happens**
+
+When the penalty for dying is much larger than any other reward in the
+game, the Q-network learns that staying alive is overwhelmingly the most
+important thing to do. Scoring — which requires taking some risk —
+becomes unattractive because a single death outweighs many successful
+goal-reaching events.
+
+For example, if the death penalty is −1000 and each successful apple is
+50, then dying once costs the equivalent of twenty apples. The agent
+learns that the safest strategy is to avoid risk entirely, even if that
+means never eating. From the Q-network's perspective, this is rational:
+it is correctly optimizing the reward signal you gave it.
+
+**How to fix it**
+
+Keep reward magnitudes within one to two orders of magnitude of each
+other. If per-step shaping gives ±1 and the goal reward is +50, a death
+penalty of −10 is appropriate: death is clearly bad (ten times worse than
+a bad step) but not so catastrophic that it crowds out everything else.
+As a rule of thumb, no single reward should be more than a few tens of
+times larger than the typical per-step reward.
+
+Increasing ``gamma`` (the discount factor) is a better way to make the
+agent care more about long-term consequences. A higher gamma causes
+future rewards — including the eventual death penalty — to count more
+heavily in the agent's current decisions, without distorting the relative
+scale of the rewards.
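+
+To get a feel for the effect of ``gamma``, the sketch below applies the
+standard discounting formula: a reward received ``k`` steps in the future is
+worth ``gamma ** k`` times its face value now. The function name and the
+50-step horizon are illustrative choices only.
+
+.. code-block:: python
+
+   def discounted_value(reward, steps_ahead, gamma):
+       """Present value of a reward received `steps_ahead` steps from now."""
+       return reward * gamma ** steps_ahead
+
+   # A -10 death penalty that lies 50 steps in the future:
+   for gamma in (0.9, 0.99, 0.999):
+       print(gamma, round(discounted_value(-10, 50, gamma), 2))
+
+With ``gamma = 0.9`` the distant penalty is almost invisible (about −0.05);
+with ``0.99`` it is about −6.05, and with ``0.999`` about −9.51.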
+
+
+Reward signal and human score interfere with each other
+---------------------------------------------------------
+
+**Symptoms**
+
+Human players see scores that go negative, or that include penalties and
+adjustments that make no sense in the context of a normal game. Conversely,
+adjustments made to improve training (removing a per-step shaping penalty,
+changing a death penalty) change the game's visible score in ways that
+affect the experience for human players.
+
+**Why this happens**
+
+Using the same state variable for both the training reward and the
+human-visible score conflates two separate concerns. Training rewards
+benefit from shaping — intermediate signals like "moved toward the goal"
+and "died" that accelerate learning. Scores for human players should
+reflect only the game's actual objectives (apples eaten, enemies defeated,
+distance covered) so that they are legible and motivating.
+
+When these are the same variable, every design decision about one
+necessarily affects the other.
+
+**How to fix it**
+
+Use two separate keys in the game's state dictionary: one for the
+human-facing score (updated only by meaningful in-game events) and one
+for the training reward (updated every step with shaping signals and
+penalties). In the game code:
+
+.. code-block:: python
+
+   # Only updated when the snake eats an apple; clean for human players.
+   game.state['score'] += 50
+
+   # Used only by the trainer. Each line belongs at a different point in the game:
+   game.state['reward'] += old_dist - new_dist   # every step: +1 toward apple, -1 away
+   game.state['reward'] += 50                    # when an apple is eaten
+   game.state['reward'] -= 10                    # when the snake dies
+
+Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
+``pyproject.toml`` so the trainer watches the right key. The score display
+remains clean for human players, and you can adjust the training reward
+freely without affecting it.
+
+Note that changing the ``reward`` key is an incompatible change: existing
+checkpoints trained on the old signal will be rejected when you try to
+resume. Run ``retro-gamer clean`` and start fresh after making this change.
--- a/docs/walkthrough.rst
+++ b/docs/walkthrough.rst
@@ -21,9 +21,9 @@ You will need:
 Preparing your game
 -------------------

-``retro-gamer`` loads your game by importing a Python module and
-calling a function named ``create_game``. The ``create_game`` function
-must take no arguments and return a new ``Game`` instance.
+``retro-gamer`` loads your game by calling a function named
+``create_game``. The function must take no arguments and return a new
+``Game`` instance.

 Here is the ``create_game`` function for Snake:

@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
   def create_game():
       head = SnakeHead()
       apple = Apple()
-       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
+       game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
       apple.relocate(game)
       return game

-If your game module does not already have a ``create_game`` function,
-add one following this pattern.
+If your game file does not already have a ``create_game`` function, add
+one following this pattern.
+
+When you run ``retro-gamer create``, you can point to your game file
+directly by path or by Python module name:
+
+.. code-block:: console
+
+   % retro-gamer create --game my_game.py --output runs/my_game/
+   % retro-gamer create --game retro.examples.snake --output runs/snake/


 Describing your game
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
-   spatial = true
-   observe_state = []

 Let's go through each field.

@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).

 The key in the game's state dictionary to use as the reward signal.
 ``retro-gamer`` computes the reward for each turn as the *change* in
-this value from one turn to the next. For Snake, score increases by 1
-(or more) each time the apple is eaten, so the agent receives a reward
-of 1 when it eats an apple and 0 otherwise.
+this value from one turn to the next. For Snake, the score changes when
+the snake eats an apple (+50), when it moves away from the apple (−1),
+and when it dies (−10). These incremental changes are what the agent
+tries to maximize.

 Choosing an appropriate reward is one of the most consequential
 decisions in RL. Some considerations:
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
 The number of exploration turns is controlled by the
 ``exploration_turns`` hyperparameter.

-``spatial``
-~~~~~~~~~~~
+``spatial`` and other preprocessing options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Whether to treat the board as a spatial scene (default: ``true``). A
-spatial game uses a *convolutional neural network* (CNN) that can
-detect patterns in the relative arrangement of characters. A
-non-spatial game uses a simpler *multilayer perceptron* (MLP) that
-ignores positional relationships. Set to ``false`` for games where
-position is irrelevant.
+The ``[tool.retro-gamer]`` section describes the game. Preprocessing
+options—such as ``spatial`` (whether to use a CNN or MLP, default:
+``false``), ``egocentric``, and ``observe_state``—live in the
+``[preprocessing]`` section of the generated ``config.toml``. You can
+edit them there after running ``retro-gamer create``.
+
+``observe_state``
+~~~~~~~~~~~~~~~~~
+
+By default the agent only sees the board. You can also give it access
+to computed values from ``game.state`` by listing the relevant keys in
+the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
+For example, Snake exposes the normalized direction to the apple:
+
+.. code-block:: toml
+
+   [preprocessing]
+   observe_state = ["apple_dx", "apple_dy"]
+
+The trainer appends these values to the observation vector after the
+board encoding (or uses them as the entire observation when
+``board = false``).
+
+These values must be set in ``game.state`` at the start of every
+episode—typically inside ``create_game()``—and must keep the same
+type and length from episode to episode.
+
+.. warning::
+
+   Always initialize every key listed in ``observe_state`` before the
+   game starts. If a key is missing or its length changes between
+   episodes, training stops immediately with a clear error explaining
+   what changed. The neural network's input size is fixed when training
+   begins; it cannot adapt to a changing observation shape mid-run.
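+
+A defensive check along these lines can catch such mistakes in your own
+tests before training starts. It is only a sketch: ``observed_shape`` is a
+hypothetical helper, not part of ``retro-gamer``, and it assumes each
+observed value is either a single number or a fixed-length list of numbers.
+
+.. code-block:: python
+
+   def observed_shape(state, keys):
+       """Map each observed key to its length (1 for a plain number)."""
+       shape = {}
+       for key in keys:
+           if key not in state:
+               raise KeyError(f"observe_state key {key!r} is missing from game.state")
+           value = state[key]
+           shape[key] = len(value) if isinstance(value, (list, tuple)) else 1
+       return shape
+
+   keys = ["apple_dx", "apple_dy"]
+   first = observed_shape({"score": 0, "apple_dx": 0.5, "apple_dy": -1.0}, keys)
+   later = observed_shape({"score": 50, "apple_dx": -0.25, "apple_dy": 0.0}, keys)
+   assert first == later   # the observation shape did not change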
+
+This is a good place to ask: *can a human player see this information?*
+The apple's location is visible on screen; the normalized direction vector
+is not. Whether that asymmetry is appropriate is a design choice worth
+examining.

 Once you have written this section, create the training run directory:

@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
     reward      : score
     characters  : ['@', '*', '>', '<', '^', 'v']
-     architecture: CNN (spatial)
+     architecture: MLP

 ``retro-gamer create`` reads your game metadata directly from
 ``pyproject.toml`` and writes it—along with all hyperparameters—to
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
 .. code-block:: console

   % retro-gamer train runs/snake/
-   Training for 1000 episodes…
-   Done. Checkpoints in runs/snake/checkpoints/
+   100%|████████████████████| 1000/1000 [12:34<00:00,  1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
+   Done. Checkpoints saved in runs/snake/checkpoints/

-Training saves checkpoints every 100 episodes and a ``final.pt``
-checkpoint when complete. You can follow progress in the training log:
+A progress bar shows how far training has gone, along with the most
+recent episode's reward, the current exploration rate (``eps``), and
+the average prediction error (``loss``).
+
+Training saves a checkpoint every 100 episodes to
+``runs/snake/checkpoints/``. You can stop training at any time with
+Ctrl-C and resume it later—the next ``retro-gamer train`` command will
+automatically pick up from the latest checkpoint.
+
+Reading the training log
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For a longer view of how training is progressing, inspect the training
+log:

 .. code-block:: console

-   % tail -f runs/snake/training.log
+   % cat runs/snake/training.log

-The log shows one line per episode:
+The log begins with the full network architecture, followed by one line
+per checkpoint (every 100 episodes):

 .. code-block:: text

-   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
-   [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
-   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204
+   [ep_0100]  ep=0001-0100  avg_reward=-31.4  avg_steps=47   epsilon=0.938  avg_loss=7.2  time=0m12s  total=0m12s
+   [ep_0200]  ep=0101-0200  avg_reward=-18.6  avg_steps=89   epsilon=0.879  avg_loss=6.8  time=0m14s  total=0m26s
+   [ep_0300]  ep=0201-0300  avg_reward= -4.1  avg_steps=134  epsilon=0.824  avg_loss=6.1  time=0m15s  total=0m41s
+   [ep_0500]  ep=0401-0500  avg_reward= +8.7  avg_steps=210  epsilon=0.724  avg_loss=5.4  time=0m16s  total=1m12s
+   [ep_1000]  ep=0901-1000  avg_reward=+22.3  avg_steps=389  epsilon=0.557  avg_loss=4.9  time=0m18s  total=2m30s

- **total_reward**: the total score earned during the episode (how many
-  apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
-  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
-  steps in this episode. A decreasing loss generally indicates that the
-  Q-value estimates are converging.
+Here is what each field means:

-Resuming training
-~~~~~~~~~~~~~~~~~
+- **avg_reward**: Average total reward per episode over the past 100 episodes.
+  Positive values mean the agent is accumulating reward; negative values mean
+  it is accumulating penalties. An upward trend over time is the main signal
+  that learning is working.
+- **avg_steps**: Average number of turns per episode. If episodes are ending
+  quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
+  generally indicate the agent is surviving longer.
+- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
+  and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
+  behavior is expected.
+- **avg_loss**: Average Huber loss across training steps. Huber loss is
+  quadratic for small prediction errors and linear for large ones, which keeps
+  it stable even when rewards have a wide range (such as a large bonus for
+  reaching a goal). Values in the range 0–10 are typical for most games.
+  A slow downward trend is the healthy pattern. A loss that grows without bound
+  usually indicates that the learning rate is too high.
+- **time**: Wall-clock time for this 100-episode interval.
+- **total**: Cumulative training time across all sessions.
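+
+To work with these fields programmatically, a checkpoint line can be
+split into key/value pairs. The following is a minimal sketch that
+assumes the line layout shown in the sample log above;
+``parse_log_line`` is a hypothetical helper, not part of
+``retro-gamer``.

```python
import re

# Assumed layout, inferred from the sample log: "[ep_NNNN]  key=value  ..."
# parse_log_line is a hypothetical helper, not a retro-gamer function.
LINE_RE = re.compile(r"^\[(?P<checkpoint>ep_\d+)\]\s+(?P<fields>.*)$")


def parse_log_line(line):
    """Return a dict of fields for one checkpoint line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # architecture dump, resume header, blank line, ...
    record = {"checkpoint": match.group("checkpoint")}
    # Values may be padded with a space, as in "avg_reward= -4.1".
    for key, value in re.findall(r"(\w+)=\s*(\S+)", match.group("fields")):
        try:
            record[key] = float(value)
        except ValueError:
            record[key] = value  # non-numeric, e.g. ep=0201-0300, time=0m15s
    return record


sample = (
    "[ep_0300]  ep=0201-0300  avg_reward= -4.1  avg_steps=134  "
    "epsilon=0.824  avg_loss=6.1  time=0m15s  total=0m41s"
)
record = parse_log_line(sample)
print(record["avg_reward"], record["avg_steps"], record["time"])
# prints: -4.1 134.0 0m15s
```

+Lines that are not checkpoint lines, such as the architecture summary,
+return ``None`` and can simply be skipped.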

-Training can be resumed from a checkpoint:
+When training is resumed after a stop, a header line marks the break::
+
+   === Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
+
+This lets you track exactly when each session took place.
+
+Stopping training to watch the agent play
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You do not need to wait for training to finish before watching the
+agent. Training can be stopped at any time with Ctrl-C, and the latest
+checkpoint is always available immediately:

 .. code-block:: console

-   % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
+   % retro-gamer play runs/snake/

-Watching a trained agent play
------------------------------
+This loads the most recent checkpoint and runs the agent in your
+terminal. Press Enter or Escape to quit.

-To watch a trained agent play the game in your terminal:
+.. note::

-.. code-block:: console
+   The game is rendered directly in your terminal. If the window is
+   smaller than the board plus borders, ``retro-gamer play`` will raise
+   a ``TerminalTooSmall`` error — enlarge the terminal window and try
+   again.

-   % retro-gamer play runs/snake/ --checkpoint final
-
-You can substitute any checkpoint name:
+To watch an earlier stage of training, use ``--checkpoint``:

 .. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint ep_0100

-Press Enter or Escape to quit.
+Comparing the agent at episode 100 with the agent at episode 500 can
+reveal what the agent has (and has not) learned. For Snake, you might
+notice the episode-100 agent moving somewhat randomly, while the
+episode-500 agent consistently navigates toward the apple. Articulating
+*why* the later agent behaves differently (that is, what the training
+process produced) connects observation directly to the concepts
+underlying DQN.

-Comparing agents trained at different checkpoints is a useful activity:
-the agent at episode 100 has learned *something*, but typically much
-less than the agent at episode 500. Articulating *what* the earlier
-agent has and has not learned, and *why*, is productive reasoning about
-the training process.
+Resuming training after watching
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After watching the agent play, resume training with exactly the same
+command you used before:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/
+
+``retro-gamer`` automatically detects and resumes from the latest
+checkpoint. No extra flags are needed. If all configured episodes have
+already been completed, it prints a message and exits:
+
+.. code-block:: console
+
+   Training already complete (1000 episodes). To keep training,
+   increase training_episodes in config.toml.
+
+To continue training, open ``runs/snake/config.toml``, increase the
+``training_episodes`` value, and run ``retro-gamer train`` again.
+
+Watching a trained agent play
+------------------------------
+
+Once training is complete, watch the final agent:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/
+
+By default the latest checkpoint is loaded. You can also compare the
+agent's performance at different stages of training:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint ep_0100
+   % retro-gamer play runs/snake/ --checkpoint ep_0500
+
+Press Enter or Escape to quit.

 Inspecting a run
 ----------------
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
 .. code-block:: console

   % retro-gamer info runs/snake/
-   Game module : retro.examples.snake
-   Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
-   Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
+   Game module    : retro.examples.snake
+   Metadata       : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
+   Preprocessing  : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
+   Model          : {'hidden_sizes': [128, 64]}
+   Training       : {'learning_rate': 0.0001, 'gamma': 0.99, ...}

-   Last 5 episodes:
-     [EP 0996] total_reward=9.0   steps=1203  epsilon=0.0074  avg_loss=0.000312
-     [EP 0997] total_reward=11.0  steps=1051  epsilon=0.0074  avg_loss=0.000289
-     [EP 0998] total_reward=14.0  steps=987   epsilon=0.0074  avg_loss=0.000274
-     [EP 0999] total_reward=8.0   steps=1142  epsilon=0.0074  avg_loss=0.000261
-     [EP 1000] total_reward=12.0  steps=1089  epsilon=0.0074  avg_loss=0.000248
+   Last 5 checkpoints:
+     [ep_0600]  ep=0501-0600  avg_reward=+12.1 ...
+     [ep_0700]  ep=0601-0700  avg_reward=+14.8 ...
+     [ep_0800]  ep=0701-0800  avg_reward=+16.3 ...
+     [ep_0900]  ep=0801-0900  avg_reward=+19.0 ...
+     [ep_1000]  ep=0901-1000  avg_reward=+22.3 ...

-   Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
+   Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']

 Adjusting hyperparameters
 --------------------------
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
 create``. Common adjustments and their effects:

 **``training_episodes``** — How long to train. More episodes give the
-agent more time to learn, but also take longer to run.
+agent more time to learn, but also take longer to run. This is always
+safe to increase.

 **``epsilon_decay``** — How quickly exploration decreases. A faster
 decay (smaller ``epsilon_decay``) means the agent commits to its early
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
 means the agent values long-term consequences; closer to 0 makes the
 agent focus on immediate reward.

-**``n_layers`` and ``layer_size``** — The depth and width of the MLP
-head. Larger networks can represent more complex Q-functions but are
-slower to train and may overfit.
+**``hidden_sizes``** — The shape of the MLP head as a list of layer
+sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
+more complex Q-functions but are slower to train and may overfit.

 **``prioritize_experiences``** — Whether to use prioritized experience
 replay. This often improves sample efficiency but is slightly slower
 per step.

+.. _incompatible-changes:
+
+Why some changes require starting fresh
+----------------------------------------
+
+Not all changes to ``config.toml`` are equal. Some can be applied
+immediately to an existing training run; others make the existing
+checkpoints unusable.
+
+**Safe to change at any time** (``[training]`` section) — These affect
+*how* the agent learns, not *what* it is learning to do. Existing
+checkpoints remain valid:
+
+- ``training_episodes``, ``max_turns_per_episode``
+- ``learning_rate``, ``learning_rate_decay``, ``gamma``
+- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
+- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
+- ``target_update_freq``, ``train_every``
+
+**Requires starting fresh** — These changes alter the shape of the
+game or the shape of the network. The saved model weights are
+incompatible with the new configuration:
+
+- ``actions``, ``reward``, ``character_set``, ``board_size``
+  (``[metadata]``) — These define what the agent perceives and what it
+  can do. Changing most of them changes the size of the network's input
+  or output layers; changing ``reward`` changes the objective the weights
+  were trained for. Either way, the existing weights no longer apply.
+- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
+  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
+  (``[preprocessing]``) — These control how the observation is
+  constructed. Any change here alters the input shape or meaning and
+  makes existing weights invalid.
+- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
+  layers. Changing it changes the shape of the network; the existing
+  weights no longer fit.
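+
+The distinction between the two groups can be sketched as a comparison
+of two parsed configurations. This is an illustration only: the key
+names are copied from the lists above, but ``incompatible_changes`` and
+the dictionary layout are assumptions, not ``retro-gamer``'s actual
+resume check.

```python
# Key names come from the lists above; everything else is hypothetical.
INCOMPATIBLE_KEYS = {
    "metadata": {"actions", "reward", "character_set", "board_size"},
    "preprocessing": {
        "spatial", "board", "observe_state", "observe_state_sizes",
        "egocentric", "egocentric_player", "egocentric_radius",
    },
    "model": {"hidden_sizes"},
}


def incompatible_changes(old_config, new_config):
    """List (section, key, old, new) for changes that invalidate checkpoints."""
    changes = []
    for section, keys in INCOMPATIBLE_KEYS.items():
        old_section = old_config.get(section, {})
        new_section = new_config.get(section, {})
        for key in sorted(keys):
            old_value = old_section.get(key)
            new_value = new_section.get(key)
            if old_value != new_value:
                changes.append((section, key, old_value, new_value))
    return changes


old = {
    "metadata": {"character_set": ["@", "*"]},
    "model": {"hidden_sizes": [128, 64]},
    "training": {"learning_rate": 0.001},
}
new = {
    "metadata": {"character_set": ["@", "*", "#"]},
    "model": {"hidden_sizes": [128, 64]},
    "training": {"learning_rate": 0.0001},  # [training] keys are never flagged
}
for section, key, was, now in incompatible_changes(old, new):
    print(f"[{section}] {key}: was {was}, now {now}")
# prints: [metadata] character_set: was ['@', '*'], now ['@', '*', '#']
```

+Because the ``[training]`` section is never inspected, the changed
+``learning_rate`` in this example is not reported.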
+
+If you try to resume training after making one of these changes,
+``retro-gamer train`` detects the mismatch and stops with a clear
+explanation, for example::
+
+   Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
+
+   The following changes require starting fresh. The existing model was
+   trained on a different problem and its weights cannot be reused:
+
+     character_set
+       was : ['@', '*', '>', '<', '^', 'v']
+       now : ['@', '*', '>', '<', '^', 'v', '#']
+       why : the set of board characters (changes input layer size)
+
+   Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
+   training log, then run 'retro-gamer train RUN_DIR' to start fresh.
+
+To clear out the old checkpoints and begin again:
+
+.. code-block:: console
+
+   % retro-gamer clean runs/snake/
+   Will remove 5 checkpoint(s) and training log from runs/snake/:
+     checkpoints/ep_0100.pt
+     checkpoints/ep_0200.pt
+     ...
+     training.log
+
+   Proceed? [y/N]: y
+   Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
+
+The ``config.toml`` is always preserved so you do not need to run
+``retro-gamer create`` again.
+
+Reasoning about training from the log
+--------------------------------------
+
+The training log is one of the most useful tools for understanding what
+is happening during training. Here are some patterns to look for and
+what they mean.
+
+**Reward increasing steadily** is the normal, healthy pattern. Each
+checkpoint block should show a higher ``avg_reward`` than the last.
+The rate of increase typically slows as training progresses.
+
+**Reward flat or negative through early episodes** is normal. Early in
+training, ``epsilon`` is high and the agent is mostly acting randomly.
+It has not yet discovered effective strategies. Check the ``epsilon``
+column to confirm whether this is just the exploration phase.
+
+**Loss decreasing** is also healthy. As the Q-network's estimates
+improve, the difference between predicted and target Q-values (the TD
+error) should shrink. A loss that levels off at a low, steady value is
+usually a good sign.
+
+**Loss growing without bound** usually indicates the learning rate is
+too high. The trainer uses Huber loss, which is robust to large reward
+scales, but a learning rate above roughly ``0.001`` can still
+destabilize training. Try reducing it by a factor of 10 (e.g. from
+``0.001`` to ``0.0001``) and resuming training; ``learning_rate`` is
+safe to change without discarding checkpoints.
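+
+To see why Huber loss tolerates large reward scales, compare it with a
+plain squared error. The sketch below is a standalone illustration for
+a single TD error; the threshold ``delta=1.0`` is an assumption, since
+the value the trainer uses is not stated here.

```python
def huber_loss(td_error, delta=1.0):
    """Huber loss for one TD error; delta=1.0 is an assumed threshold."""
    abs_error = abs(td_error)
    if abs_error <= delta:
        return 0.5 * td_error ** 2            # quadratic for small errors
    return delta * (abs_error - 0.5 * delta)  # linear for large errors


for error in (0.5, 1.0, 10.0):
    squared = 0.5 * error ** 2
    print(f"error={error}  huber={huber_loss(error)}  squared={squared}")
# error=0.5  huber=0.125  squared=0.125
# error=1.0  huber=0.5  squared=0.5
# error=10.0  huber=9.5  squared=50.0
```

+For small errors the two losses agree. For an error of 10, the
+squared error is more than five times larger than the Huber loss, so a
+single large reward has far less influence on the update.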
+
+**Short episodes (low ``avg_steps``)** combined with low reward
+suggest the agent is dying frequently. Early in training this is
+normal. If it persists late in training, the agent may have settled on
+a bad policy; consider extending training or adjusting
+``epsilon_decay`` to explore longer.
+
+**Reward that improves and then regresses** can indicate that the
+agent has discovered a suboptimal but consistent strategy and is stuck.
+Increasing ``epsilon_min`` to keep some exploration active, or
+adjusting the reward signal to better differentiate good moves from
+bad ones, can help.
+
 Questions for investigation
 ----------------------------

@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?
+