Updates across the board
@@ -343,12 +343,13 @@ If the character set is not specified, ``retro-gamer`` runs a brief
 exploration phase before training to observe which characters actually
 appear.

-In addition to the board, the agent can observe numerical values from
-the game's state dictionary via ``observe_state``. These are
-appended to the end of the observation vector. The reward key must
-not be included in ``observe_state``: it would give the agent direct
-access to its own performance signal, which is not a realistic observation
-in most game contexts and can cause training pathologies.
+In addition to the board, the agent can observe extra computed values
+from ``game.state``. Listing keys in the ``observe_state`` option of
+``[preprocessing]`` causes those values to be appended to the
+observation vector after the board encoding. This is where feature
+engineering decisions live: what derived quantities should the agent
+see, and does giving it those values give it an advantage a human
+player would not have?
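
A minimal sketch of how such an observation vector could be assembled, assuming a one-hot board encoding flattened cell by cell into a plain Python list. The helper name ``build_observation`` and the state keys ``dist_to_apple`` and ``timer`` are hypothetical and are not taken from ``retro-gamer`` itself:

```python
def build_observation(board, character_set, state, observe_state):
    """Flatten a one-hot board encoding, then append selected state values.

    Hypothetical helper: the real retro-gamer encoder may order values differently.
    """
    vector = []
    for row in board:
        for cell in row:
            # One slot per known character; unknown characters encode as all zeros.
            vector.extend(1.0 if cell == ch else 0.0 for ch in character_set)
    # Values named in observe_state go after the board encoding, in listed order.
    vector.extend(float(state[key]) for key in observe_state)
    return vector


board = ["@ ", " *"]                 # 2 x 2 board
character_set = ["@", "*"]           # C = 2
state = {"score": 3, "dist_to_apple": 2, "timer": 17}
obs = build_observation(board, character_set, state, ["dist_to_apple", "timer"])
print(len(obs))    # 2 * 2 * 2 board values + 2 state values = 10
print(obs[-2:])    # [2.0, 17.0]
```

Note that ``score`` is present in the state but is not observed, because it is not listed in ``observe_state``.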
|
||||
|
||||
Neural network architectures
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -356,7 +357,8 @@ Neural network architectures
 The architecture of the Q-network—the number and arrangement of its
 layers—is one of the most consequential choices in DQN training.
 ``retro-gamer`` selects an architecture based on the ``spatial``
-field in the game description and generates a plain-language rationale.
+option in ``[preprocessing]`` of ``config.toml`` and generates a
+plain-language rationale.

 **Multilayer perceptrons (MLP)**

@@ -379,8 +381,7 @@ that these numbers were arranged in a 2D grid, or that spatially
 adjacent cells are related. This is appropriate when the game's
 observation is better understood as a collection of independent
 readings—a set of meters or status indicators—rather than as a spatial
-scene. Set ``spatial = false`` in the game description to use this
-architecture.
+scene. ``spatial = false`` (the default) selects this architecture.

 **Convolutional neural networks (CNN)**

@@ -405,8 +406,8 @@ channels respectively, kernel size 3, padding 1) followed by a
 flattening step and an MLP head. The padding ensures that the spatial
 dimensions are preserved through the convolution, so the output of the
 second conv layer has shape (64, H, W), which is then flattened and
-passed to the MLP. Set ``spatial = true`` (the default) to use this
-architecture.
+passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
+use this architecture.
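
The shape claim can be checked with the standard convolution output-size formula. The sketch below assumes stride 1 and no dilation; the text states neither, but preserved dimensions with kernel size 3 and padding 1 imply stride 1. The names ``conv_out`` and ``flattened_size`` are illustrative only:

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_size(height, width, out_channels=64, layers=2):
    """Number of values handed to the MLP after the conv stack is flattened."""
    for _ in range(layers):
        height, width = conv_out(height), conv_out(width)
    return out_channels * height * width


print(conv_out(10))              # 10: kernel 3 with padding 1 preserves the size
print(flattened_size(10, 12))    # 64 * 10 * 12 = 7680
```

Without padding, each layer would shrink every dimension by 2 (``conv_out(10, padding=0)`` gives 8), which is why the padding matters on small boards.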
|
||||
|
||||
Connecting architecture to game metadata
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -416,15 +417,17 @@ follow from the game description you provide. This connection is worth
 making explicit, because understanding it is one of the main paths into
 understanding why neural network architecture matters.

-- If ``spatial = true``, the CNN can detect local patterns—which characters
-  are adjacent to which—without needing to see every possible arrangement.
-  This is appropriate for games like Snake, where the snake's direction
-  and the apple's relative position are spatially encoded.
+- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
+  local patterns—which characters are adjacent to which—without needing
+  to see every possible arrangement. This is appropriate for games like
+  Snake, where the snake's direction and the apple's relative position
+  are spatially encoded.

-- If ``spatial = false``, the MLP treats the board as a flat vector. This
-  may be appropriate for games that use the character grid primarily as a
-  display rather than a spatial field—for example, a game where characters
-  appear in fixed, non-interacting positions as status indicators.
+- If ``spatial = false`` (the default), the MLP treats the board as a
+  flat vector. This may be appropriate for games that use the character
+  grid primarily as a display rather than a spatial field—for example,
+  a game where characters appear in fixed, non-interacting positions as
+  status indicators.

 - The ``character_set`` determines the depth (C) of the board tensor.
   More characters mean more numbers per cell and a larger input to the
@@ -432,11 +435,185 @@ understanding why neural network architecture matters.
   wastes capacity; a character set that omits relevant characters forces
   the agent to treat different things as the same.

-- The ``observe_state`` fields are appended to the flattened CNN output
-  before the MLP head. This allows the agent to use explicit state
-  variables—a timer, a lives count—alongside the visual board
-  representation.
+- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
+  to the flattened board output before the MLP head. This allows the
+  agent to use computed values—a direction to the goal, a distance, a
+  timer—alongside the visual board representation.

 These relationships are not incidental features of the implementation.
 They are the reason the game description matters: every field you fill
 in shapes what the agent can perceive and therefore what it can learn.
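
As a rough illustration of how these fields could translate into network dimensions, the sketch below derives the architecture and the MLP input size from an already-parsed configuration. The dictionary layout, the function name ``describe_network``, and the reuse of the 64-channel figure from the CNN description are assumptions made for illustration; the actual ``retro-gamer`` selection logic is not shown in this document:

```python
def describe_network(preprocessing, character_set, height, width):
    """Return (architecture, MLP input size) for a board of height x width cells."""
    depth = len(character_set)                       # C: one channel per character
    extras = len(preprocessing.get("observe_state", []))
    if preprocessing.get("spatial", False):          # spatial = false is the default
        # CNN: padding keeps H and W, so the flattened conv output is 64 * H * W.
        return "cnn", 64 * height * width + extras
    # MLP: the one-hot board is flattened directly into C * H * W values.
    return "mlp", depth * height * width + extras


chars = ["@", "*", ">", "<", "^", "v"]
print(describe_network({"spatial": True, "observe_state": ["timer"]}, chars, 10, 10))
# ('cnn', 6401)
print(describe_network({}, chars, 10, 10))
# ('mlp', 600)
```

In the MLP case the input size scales directly with the number of characters, which is one concrete sense in which an oversized character set wastes capacity.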
+
+Design rationale
+----------------
+
+This section explains the reasoning behind several design decisions in
+``retro-gamer`` that go beyond technical necessity. Each choice was
+made with a specific pedagogical goal: to create a tool that not only
+trains agents, but also helps students build genuine understanding of
+how and why the training process works.
+
+Checkpoint compatibility and the "start fresh" workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a student changes the game description or network architecture
+mid-training, ``retro-gamer`` refuses to resume and explains exactly
+which fields changed and why they are incompatible. This behavior is
+deliberate.
+
+The immediate practical reason is correctness: if the character set
+changes, the network's input layer changes size, and the saved weights
+no longer correspond to any meaningful function. Loading them would
+produce garbage behavior. If the reward signal changes, the Q-values
+the network has accumulated are estimates of a *different* objective;
+resuming would mislead the network, not help it.
+
+But the deeper reason is pedagogical. The incompatibility check is a
+moment of forced reflection. When a student sees::
+
+    character_set
+    was : ['@', '*', '>', '<', '^', 'v']
+    now : ['@', '*', '>', '<', '^', 'v', '#']
+    why : the set of board characters (changes input layer size)
+
+they are confronted with the concrete consequence of a description
+change. The character set is not a label; it determines the shape of
+the tensor the network operates on. Changing it invalidates the
+network the same way changing the rules of chess would invalidate a
+chess engine. The error message is designed to make this connection
+legible, not just to block a problematic action.
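
A check of this kind can be sketched as a field-by-field comparison between the description stored with a checkpoint and the current one. Everything below is a hypothetical illustration: the function names, the ``WHY`` table, and the ``reward`` explanation are invented, and only the ``character_set`` wording is taken from the message above. Hyperparameters are deliberately left out of the comparison:

```python
# Hypothetical explanations; only the character_set wording comes from the text.
WHY = {
    "character_set": "the set of board characters (changes input layer size)",
    "reward": "the reward key (changes what the Q-values estimate)",
}


def incompatibilities(saved, current):
    """List (field, was, now, why) for each description field that changed."""
    return [
        (field, saved.get(field), current.get(field), why)
        for field, why in WHY.items()
        if saved.get(field) != current.get(field)
    ]


def format_report(changes):
    """Render the changes in a was / now / why layout."""
    lines = []
    for field, was, now, why in changes:
        lines += [field, f"  was : {was}", f"  now : {now}", f"  why : {why}"]
    return "\n".join(lines)


saved = {"character_set": ["@", "*"], "reward": "score", "learning_rate": 1e-3}
current = {"character_set": ["@", "*", "#"], "reward": "score", "learning_rate": 5e-4}
changes = incompatibilities(saved, current)
print(format_report(changes))
```

Here the learning rate also differs between the two dictionaries, but only ``character_set`` is reported, because only description fields are compared.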
+
+The ``retro-gamer clean`` command exists to make the recovery path
+explicit: you can start fresh, and you should. There is no partial
+salvage. This mirrors an important truth about RL training: some
+decisions are foundational, and changing them means starting over.
+Students who encounter this—who have to decide whether a change is
+worth the cost of retraining—are reasoning about the architecture in
+a way that reading about it alone does not produce.
+
+The distinction between incompatible changes (game description,
+network architecture) and safe changes (hyperparameters like learning
+rate and epsilon) is also pedagogically useful. It encodes, in the
+tool itself, the distinction between *what the agent is learning* and
+*how it is learning*. Students who ask "can I change the learning rate
+without retraining?" are asking a question with a precise answer, and
+answering it correctly requires understanding why the learning rate is
+different in kind from the character set.
+
+Checkpoint-level logging
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Early versions of ``retro-gamer`` logged one line per episode. This
+was accurate but not very useful: a run of 1,000 episodes produces
+1,000 log lines, most of which are noise. Individual episodes vary
+widely due to randomness in both the game and the agent's exploration,
+making it hard to see the underlying trend.
+
+The current format logs one line per checkpoint—once every 100
+episodes—using averages over that window. This design serves several
+goals:
+
+**Noise reduction.** Single-episode rewards are highly variable,
+especially when epsilon is high and the agent is behaving randomly.
+Averaging over 100 episodes smooths out this variance and makes
+genuine trends visible.
+
+**Interpretive scaffolding.** The log line includes ``epsilon``
+alongside ``avg_reward``, so students can directly see the
+relationship between exploration rate and performance. Early entries
+with low ``avg_reward`` and high ``epsilon`` invite the question:
+"is this bad performance, or just exploration?" The answer—that random
+behavior is expected when epsilon is near 1—is readable from the log
+itself.
+
+**Timing information.** Each log line records both the elapsed time
+for that 100-episode interval and the total training time accumulated
+across all sessions. This serves two purposes. Practically, it lets
+students estimate how long continued training will take. Conceptually,
+it makes the cost of training tangible: RL is not instant, and the
+log makes the time investment visible.
+
+**Session continuity.** When training resumes from a checkpoint, a
+header line marks the break (``=== Resumed from ep_0500.pt ===``).
+This lets the full log tell the story of a run across multiple
+sessions, preserving the history of when training happened even if the
+student stops and restarts many times.
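
The windowed averaging can be sketched as follows. The data is synthetic, the printed line layout is an assumption rather than the actual ``retro-gamer`` log format, timing columns are omitted, and ``checkpoint_rows`` is a hypothetical helper:

```python
import random


def checkpoint_rows(rewards, epsilons, window=100):
    """Collapse per-episode values into one (episode, avg_reward, epsilon) row per window."""
    rows = []
    for end in range(window, len(rewards) + 1, window):
        chunk = rewards[end - window:end]
        # Report the epsilon in effect at the end of the window.
        rows.append((end, sum(chunk) / window, epsilons[end - 1]))
    return rows


random.seed(0)
episodes = 300
# Synthetic data: reward drifts upward under heavy noise while epsilon decays.
rewards = [0.01 * i + random.gauss(0, 2) for i in range(episodes)]
epsilons = [max(0.05, 1.0 - i / episodes) for i in range(episodes)]

for episode, avg_reward, epsilon in checkpoint_rows(rewards, epsilons):
    print(f"ep {episode:4d}  avg_reward {avg_reward:6.2f}  epsilon {epsilon:.2f}")
```

Three hundred noisy episodes collapse into three lines, each pairing the window's average reward with the exploration rate at that point.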
+
+The stop-watch-adjust-resume workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is designed around a workflow that the log format and
+checkpoint system both support: stop training, watch the agent play,
+decide what to change, and resume.
+
+This workflow is pedagogically productive because it gives students
+a *reason* to look at the log and a *reason* to think about
+hyperparameters. Watching the agent at episode 100 play erratically,
+then watching the agent at episode 500 navigate toward the apple more
+consistently, is not just satisfying—it raises concrete questions.
+Why did the agent improve? What changed between those two checkpoints?
+What would happen if we gave it more time, or adjusted the reward?
+
+These questions are best answered by consulting the log. The log in
+turn connects the behavior the student observed to numbers they can
+reason about: a decreasing loss, a declining epsilon, a rising average
+reward. The three—visual observation, log interpretation, and
+conceptual understanding—form a feedback loop that is much harder to
+close if training is treated as a black box that produces only a final
+model.
+
+The fact that training can be stopped and resumed freely, with no
+penalty and no extra flags, removes friction from this cycle. Students
+who feel they can experiment—stop, look, think, resume—are more
+likely to do so than students who feel they have to commit to a full
+training run before seeing results.
+
+Reward design as game description
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
+the game's state dictionary, not a function or a formula. This is
+another deliberate design choice. The reward signal is defined in the
+game code—in how the score changes when certain events occur—not in
+the training configuration.
+
+This forces students to engage with the reward where it lives: in the
+game logic. If a student wants to change the reward structure, they
+must change the game. This connects the RL concept of reward shaping
+to the concrete act of writing Python code that updates a score. The
+question "what reward should the agent get for moving toward the
+apple?" becomes "what code should run when the snake moves?"—and
+answering it requires reasoning about what behavior you want to
+encourage and how a small, frequent signal compares to a large,
+infrequent one.
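
To make the contrast concrete, here is a toy game update with both kinds of signal. The ``step`` function and the point values are invented for illustration. ``reward_from_key`` encodes an assumption that the trainer treats the change in the configured key as the per-step reward; this document does not say whether ``retro-gamer`` uses the change or the raw value:

```python
def step(state, moved_closer, ate_apple):
    """Toy game update: all reward design happens here, in the game logic."""
    if ate_apple:
        state["score"] += 10      # large, infrequent signal
    elif moved_closer:
        state["score"] += 1       # small, frequent shaping signal
    return state


def reward_from_key(before, after, key="score"):
    """Assumed trainer behaviour: reward is the change in the configured key."""
    return after[key] - before[key]


state = {"score": 0}
rewards = []
for moved_closer, ate_apple in [(True, False), (False, False), (True, True)]:
    before = dict(state)          # snapshot, because step() mutates state
    state = step(state, moved_closer, ate_apple)
    rewards.append(reward_from_key(before, state))
print(rewards)   # [1, 0, 10]
```

Removing the ``elif`` branch leaves only the sparse apple reward, and that change is made in the game code, not in the training configuration.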
+
+The distinction between reward-signal design (a pedagogically rich
+question with many possible answers) and reward-field specification
+(a technical detail) is preserved in the interface. Students configure
+the *key* to track; they design the *signal* in the game itself.
+
+Metadata as game description, not training configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The game description lives in ``[tool.retro-gamer]`` inside the
+game's own ``pyproject.toml``, not in a separate training
+configuration file. This placement encodes a claim: the character set,
+the action space, and the reward signal are *properties of the game*,
+not settings for the trainer.
+
+A student who edits the character set is not tweaking the trainer;
+they are more accurately describing their game. This framing matters
+because it positions the student as the expert on the game—which they
+are—and the trainer as a tool that depends on the accuracy of that
+description. Errors in the description are not configuration mistakes;
+they are inaccurate descriptions of something the student knows.
+
+When a student omits a character from the character set and the agent
+fails to notice that character on the board, the diagnostic question
+is not "what went wrong with training?" but "is my description of the
+game correct?" This is a more productive question, because it connects
+the student's domain knowledge (they know what characters appear and
+why they matter) to the technical representation (one-hot encoding
+requires knowing in advance which characters to encode). The fix is
+not to adjust a hyperparameter; it is to describe the game more
+accurately.
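
One way to put that diagnostic question to a recorded board is to list the characters the description does not cover. The helper ``missing_characters`` is hypothetical, and the sketch assumes that blank cells are not listed in the character set:

```python
def missing_characters(boards, character_set, ignore=(" ",)):
    """Return board characters absent from character_set, in first-seen order."""
    known = set(character_set) | set(ignore)
    missing = []
    for board in boards:
        for row in board:
            for cell in row:
                if cell not in known and cell not in missing:
                    missing.append(cell)
    return missing


character_set = ["@", "*"]
boards = [["@  ", " # ", "  *"]]     # the wall '#' is not described
print(missing_characters(boards, character_set))   # ['#']
```

An empty result means every character seen on the boards has a slot in the encoding.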