Updates across the board

2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions
--- a/docs/background.rst
+++ b/docs/background.rst
@@ -343,12 +343,13 @@ If the character set is not specified, ``retro-gamer`` runs a brief
 exploration phase before training to observe which characters actually
 appear.

-In addition to the board, the agent can observe numerical values from
-the game's state dictionary via ``observe_state``. These are
-appended to the end of the observation vector. The reward key must
-not be included in ``observe_state``: it would give the agent direct
-access to its own performance signal, which is not a realistic observation
-in most game contexts and can cause training pathologies.
+In addition to the board, the agent can observe extra computed values
+from ``game.state``. Listing keys in the ``observe_state`` option of
+``[preprocessing]`` causes those values to be appended to the
+observation vector after the board encoding. This is where feature
+engineering decisions live: which derived quantities should the agent
+see, and would those values give it an advantage that a human player
+would not have?

 Neural network architectures
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -356,7 +357,8 @@ Neural network architectures
 The architecture of the Q-network—the number and arrangement of its
 layers—is one of the most consequential choices in DQN training.
 ``retro-gamer`` selects an architecture based on the ``spatial``
-field in the game description and generates a plain-language rationale.
+option in ``[preprocessing]`` of ``config.toml`` and generates a
+plain-language rationale.

 **Multilayer perceptrons (MLP)**

@@ -379,8 +381,7 @@ that these numbers were arranged in a 2D grid, or that spatially
 adjacent cells are related. This is appropriate when the game's
 observation is better understood as a collection of independent
 readings—a set of meters or status indicators—rather than as a spatial
-scene. Set ``spatial = false`` in the game description to use this
-architecture.
+scene. ``spatial = false`` (the default) selects this architecture.

 **Convolutional neural networks (CNN)**

@@ -405,8 +406,8 @@ channels respectively, kernel size 3, padding 1) followed by a
 flattening step and an MLP head. The padding ensures that the spatial
 dimensions are preserved through the convolution, so the output of the
 second conv layer has shape (64, H, W), which is then flattened and
-passed to the MLP. Set ``spatial = true`` (the default) to use this
-architecture.
+passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
+use this architecture.
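
The shape claim above can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the helper names are hypothetical and not part of ``retro-gamer``, and it assumes stride 1 for both conv layers, which the text does not state.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Standard formula for one spatial dimension of a convolution output."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height, width, out_channels=64):
    """Size of the vector handed to the MLP head after two conv layers.

    Assumes kernel size 3, padding 1, stride 1 (stride is an assumption).
    """
    h = conv_output_size(conv_output_size(height))
    w = conv_output_size(conv_output_size(width))
    return out_channels * h * w


# With padding 1 and kernel 3, a 10 x 12 board stays 10 x 12 through both
# layers, so the flattened size is 64 * 10 * 12.
features = flattened_features(10, 12)
```

Without padding, each layer would shrink the board by two cells per dimension, which is why the padding matters for small boards.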

 Connecting architecture to game metadata
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -416,15 +417,17 @@ follow from the game description you provide. This connection is worth
 making explicit, because understanding it is one of the main paths into
 understanding why neural network architecture matters.

-- If ``spatial = true``, the CNN can detect local patterns—which characters
-  are adjacent to which—without needing to see every possible arrangement.
-  This is appropriate for games like Snake, where the snake's direction
-  and the apple's relative position are spatially encoded.
+- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
+  local patterns—which characters are adjacent to which—without needing
+  to see every possible arrangement. This is appropriate for games like
+  Snake, where the snake's direction and the apple's relative position
+  are spatially encoded.

-- If ``spatial = false``, the MLP treats the board as a flat vector. This
-  may be appropriate for games that use the character grid primarily as a
-  display rather than a spatial field—for example, a game where characters
-  appear in fixed, non-interacting positions as status indicators.
+- If ``spatial = false`` (the default), the MLP treats the board as a
+  flat vector. This may be appropriate for games that use the character
+  grid primarily as a display rather than a spatial field—for example,
+  a game where characters appear in fixed, non-interacting positions as
+  status indicators.

 - The ``character_set`` determines the depth (C) of the board tensor.
  More characters mean more numbers per cell and a larger input to the
@@ -432,11 +435,185 @@ understanding why neural network architecture matters.
  wastes capacity; a character set that omits relevant characters forces
  the agent to treat different things as the same.

-- The ``observe_state`` fields are appended to the flattened CNN output
-  before the MLP head. This allows the agent to use explicit state
-  variables—a timer, a lives count—alongside the visual board
-  representation.
+- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
+  to the flattened board output before the MLP head. This allows the
+  agent to use computed values—a direction to the goal, a distance, a
+  timer—alongside the visual board representation.

 These relationships are not incidental features of the implementation.
 They are the reason the game description matters: every field you fill
 in shapes what the agent can perceive and therefore what it can learn.
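
As a rough illustration of how the character set and ``observe_state`` shape the observation, the sketch below one-hot encodes a small board and appends extra state values. The function ``build_observation``, the example keys, the cell-by-cell ordering, and the all-zeros treatment of unlisted characters are assumptions made for illustration; the actual encoder in ``retro-gamer`` may differ.

```python
def build_observation(board, character_set, state, observe_state):
    """One-hot encode each cell, flatten, then append extra state values."""
    index = {ch: i for i, ch in enumerate(character_set)}
    vector = []
    for row in board:
        for cell in row:
            one_hot = [0.0] * len(character_set)
            # Assumption: characters missing from character_set encode as
            # all zeros, so the agent cannot tell them apart from each other.
            if cell in index:
                one_hot[index[cell]] = 1.0
            vector.extend(one_hot)
    # Values named in observe_state go after the board encoding.
    vector.extend(float(state[key]) for key in observe_state)
    return vector


board = ["@.", ".*"]
state = {"distance": 2, "score": 7}
obs = build_observation(board, ["@", "*"], state, ["distance"])
```

Here a 2 x 2 board with two characters yields 8 board values plus one appended state value. Adding a character to the set would change the length of the vector, and with it the size of the network's input layer.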
+
+Design rationale
+----------------
+
+This section explains the reasoning behind several design decisions in
+``retro-gamer`` that go beyond technical necessity. Each choice was
+made with a specific pedagogical goal: to create a tool that not only
+trains agents, but also helps students build genuine understanding of
+how and why the training process works.
+
+Checkpoint compatibility and the "start fresh" workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a student changes the game description or network architecture
+mid-training, ``retro-gamer`` refuses to resume and explains exactly
+which fields changed and why they are incompatible. This behavior is
+deliberate.
+
+The immediate practical reason is correctness: if the character set
+changes, the network's input layer changes size, and the saved weights
+no longer correspond to any meaningful function. Loading them would
+produce garbage behavior. If the reward signal changes, the Q-values
+the network has accumulated are estimates of a *different* objective;
+resuming would mislead the network, not help it.
+
+But the deeper reason is pedagogical. The incompatibility check is a
+moment of forced reflection. When a student sees::
+
+   character_set
+     was : ['@', '*', '>', '<', '^', 'v']
+     now : ['@', '*', '>', '<', '^', 'v', '#']
+     why : the set of board characters (changes input layer size)
+
+they are confronted with the concrete consequence of a description
+change. The character set is not a label; it determines the shape of
+the tensor the network operates on. Changing it invalidates the
+network the same way changing the rules of chess would invalidate a
+chess engine. The error message is designed to make this connection
+legible, not just to block a problematic action.
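
A compatibility check of this kind could be sketched as follows. This is not the tool's actual implementation: the ``REASONS`` table, the ``incompatible_fields`` helper, and the explanation given for ``spatial`` are hypothetical; only the ``character_set`` explanation is taken from the message shown above.

```python
# Hypothetical sketch: tracked fields and reasons are illustrative assumptions.
REASONS = {
    "character_set": "the set of board characters (changes input layer size)",
    "spatial": "the network architecture (MLP vs. CNN)",
}


def incompatible_fields(saved, current):
    """Return (field, was, now, why) for each tracked field that differs."""
    return [
        (field, saved.get(field), current.get(field), why)
        for field, why in REASONS.items()
        if saved.get(field) != current.get(field)
    ]


saved = {"character_set": ["@", "*"], "spatial": True, "learning_rate": 1e-3}
current = {"character_set": ["@", "*", "#"], "spatial": True, "learning_rate": 5e-4}

changes = incompatible_fields(saved, current)
for field, was, now, why in changes:
    print(f"{field}\n  was : {was}\n  now : {now}\n  why : {why}")
```

The learning rate also differs between the two dictionaries, but it is not a tracked field in this sketch, so it does not block a resume.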
+
+The ``retro-gamer clean`` command exists to make the recovery path
+explicit: you can start fresh, and you should. There is no partial
+salvage. This mirrors an important truth about RL training: some
+decisions are foundational, and changing them means starting over.
+Students who encounter this—who have to decide whether a change is
+worth the cost of retraining—are reasoning about the architecture in
+a way that reading about it alone does not produce.
+
+The distinction between incompatible changes (game description,
+network architecture) and safe changes (hyperparameters like learning
+rate and epsilon) is also pedagogically useful. It encodes, in the
+tool itself, the distinction between *what the agent is learning* and
+*how it is learning*. Students who ask "can I change the learning rate
+without retraining?" are asking a question with a precise answer, and
+answering it correctly requires understanding why the learning rate is
+different in kind from the character set.
+
+Checkpoint-level logging
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Early versions of ``retro-gamer`` logged one line per episode. This
+was accurate but not very useful: a run of 1,000 episodes produces
+1,000 log lines, most of which are noise. Individual episodes vary
+widely due to randomness in both the game and the agent's exploration,
+making it hard to see the underlying trend.
+
+The current format logs one line per checkpoint—once every 100
+episodes—using averages over that window. This design serves several
+goals:
+
+**Noise reduction.** Single-episode rewards are highly variable,
+especially when epsilon is high and the agent is behaving randomly.
+Averaging over 100 episodes smooths out this variance and makes
+genuine trends visible.
+
+**Interpretive scaffolding.** The log line includes ``epsilon``
+alongside ``avg_reward``, so students can directly see the
+relationship between exploration rate and performance. Early entries
+with low ``avg_reward`` and high ``epsilon`` invite the question:
+"is this bad performance, or just exploration?" The answer—that random
+behavior is expected when epsilon is near 1—is readable from the log
+itself.
+
+**Timing information.** Each log line records both the elapsed time
+for that 100-episode interval and the total training time accumulated
+across all sessions. This serves two purposes. Practically, it lets
+students estimate how long continued training will take. Conceptually,
+it makes the cost of training tangible: RL is not instant, and the
+log makes the time investment visible.
+
+**Session continuity.** When training resumes from a checkpoint, a
+header line marks the break (``=== Resumed from ep_0500.pt ===``).
+This lets the full log tell the story of a run across multiple
+sessions, preserving the history of when training happened even if the
+student stops and restarts many times.
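
The noise-reduction point can be illustrated with a short sketch that averages per-episode rewards over non-overlapping 100-episode windows. The function name and the synthetic reward data are assumptions for illustration, not output from a real training run.

```python
import random


def checkpoint_averages(episode_rewards, window=100):
    """Average rewards over consecutive, non-overlapping windows."""
    return [
        sum(episode_rewards[i:i + window]) / window
        for i in range(0, len(episode_rewards) - window + 1, window)
    ]


random.seed(0)
# Synthetic rewards: a slow upward trend buried in per-episode noise.
rewards = [0.01 * ep + random.gauss(0, 5) for ep in range(1000)]
averages = checkpoint_averages(rewards)

for i, avg in enumerate(averages, start=1):
    print(f"ep {i * 100:4d}  avg_reward {avg:6.2f}")
```

Ten averaged values are far easier to scan for a trend than 1,000 individual episode rewards.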
+
+The stop-watch-adjust-resume workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is designed around a workflow that the log format and
+checkpoint system both support: stop training, watch the agent play,
+decide what to change, and resume.
+
+This workflow is pedagogically productive because it gives students
+a *reason* to look at the log and a *reason* to think about
+hyperparameters. Watching the agent at episode 100 play erratically,
+then watching the agent at episode 500 navigate toward the apple more
+consistently, is not just satisfying—it raises concrete questions.
+Why did the agent improve? What changed between those two checkpoints?
+What would happen if we gave it more time, or adjusted the reward?
+
+These questions are best answered by consulting the log. The log in
+turn connects the behavior the student observed to numbers they can
+reason about: a decreasing loss, a declining epsilon, a rising average
+reward. The three—visual observation, log interpretation, and
+conceptual understanding—form a feedback loop that is much harder to
+close if training is treated as a black box that produces only a final
+model.
+
+The fact that training can be stopped and resumed freely, with no
+penalty and no extra flags, removes friction from this cycle. Students
+who feel they can experiment—stop, look, think, resume—are more
+likely to do so than students who feel they have to commit to a full
+training run before seeing results.
+
+Reward design as game description
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
+the game's state dictionary, not a function or a formula. This is
+another deliberate design choice. The reward signal is defined in the
+game code—in how the score changes when certain events occur—not in
+the training configuration.
+
+This forces students to engage with the reward where it lives: in the
+game logic. If a student wants to change the reward structure, they
+must change the game. This connects the RL concept of reward shaping
+to the concrete act of writing Python code that updates a score. The
+question "what reward should the agent get for moving toward the
+apple?" becomes "what code should run when the snake moves?"—and
+answering it requires reasoning about what behavior you want to
+encourage and how a small, frequent signal compares to a large,
+infrequent one.
+
+The distinction between reward-signal design (a pedagogically rich
+question with many possible answers) and reward-field specification
+(a technical detail) is preserved in the interface. Students configure
+the *key* to track; they design the *signal* in the game itself.
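
The split between key and signal can be sketched as follows. ``TinyGame``, the ``"score"`` key, and the convention that the per-step reward is the change in the tracked value are all assumptions for illustration; the source only states that ``reward`` names a key in the game's state dictionary.

```python
class TinyGame:
    """Stand-in game: the score lives in the game's own state dictionary."""

    def __init__(self):
        self.state = {"score": 0, "steps": 0}

    def step(self, ate_apple):
        self.state["steps"] += 1
        if ate_apple:
            # Reward design happens here, in the game logic.
            self.state["score"] += 10


REWARD_KEY = "score"  # the key a `reward` field would name (assumed)


def reward_for_step(game, ate_apple, key=REWARD_KEY):
    """Assumed convention: reward is the change in the tracked value."""
    before = game.state[key]
    game.step(ate_apple)
    return game.state[key] - before


game = TinyGame()
rewards = [reward_for_step(game, ate) for ate in (False, True, False)]
```

Changing how much the apple is worth means editing ``TinyGame.step``; the configuration only ever says which key to read.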
+
+Metadata as game description, not training configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The game description lives in ``[tool.retro-gamer]`` inside the
+game's own ``pyproject.toml``, not in a separate training
+configuration file. This placement encodes a claim: the character set,
+the action space, and the reward signal are *properties of the game*,
+not settings for the trainer.
+
+A student who edits the character set is not tweaking the trainer;
+they are more accurately describing their game. This framing matters
+because it positions the student as the expert on the game—which they
+are—and the trainer as a tool that depends on the accuracy of that
+description. Errors in the description are not configuration mistakes;
+they are inaccurate descriptions of something the student knows.
+
+When a student omits a character from the character set and the agent
+fails to notice that character on the board, the diagnostic question
+is not "what went wrong with training?" but "is my description of the
+game correct?" This is a more productive question, because it connects
+the student's domain knowledge (they know what characters appear and
+why they matter) to the technical representation (one-hot encoding
+requires knowing in advance which characters to encode). The fix is
+not to adjust a hyperparameter; it is to describe the game more
+accurately.