Updates across the board
30
docs/api.rst
Normal file
@@ -0,0 +1,30 @@
+API Reference
+=============
+
+All classes below are importable directly from ``retro_gamer``.
+
+Game description
+----------------
+
+.. autoclass:: retro_gamer.GameMetadata
+   :members: from_pyproject, from_dict, validate
+
+Training
+--------
+
+.. autoclass:: retro_gamer.DQNTrainer
+   :members: train, load_checkpoint
+
+Environment
+-----------
+
+.. autoclass:: retro_gamer.GameEnvironment
+   :members: reset, step
+
+Using a trained model
+---------------------
+
+.. autoclass:: retro_gamer.TrainedPolicy
+   :members: get_action
+
+.. autoclass:: retro_gamer.PolicyInput
@@ -343,12 +343,13 @@ If the character set is not specified, ``retro-gamer`` runs a brief
 exploration phase before training to observe which characters actually
 appear.
 
-In addition to the board, the agent can observe numerical values from
-the game's state dictionary via ``observe_state``. These are
-appended to the end of the observation vector. The reward key must
-not be included in ``observe_state``: it would give the agent direct
-access to its own performance signal, which is not a realistic observation
-in most game contexts and can cause training pathologies.
+In addition to the board, the agent can observe extra computed values
+from ``game.state``. Listing keys in the ``observe_state`` option of
+``[preprocessing]`` causes those values to be appended to the
+observation vector after the board encoding. This is where feature
+engineering decisions live: what derived quantities should the agent
+see, and does giving it those values give it an advantage a human
+player would not have?
 
 Neural network architectures
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -356,7 +357,8 @@ Neural network architectures
 The architecture of the Q-network—the number and arrangement of its
 layers—is one of the most consequential choices in DQN training.
 ``retro-gamer`` selects an architecture based on the ``spatial``
-field in the game description and generates a plain-language rationale.
+option in ``[preprocessing]`` of ``config.toml`` and generates a
+plain-language rationale.
 
 **Multilayer perceptrons (MLP)**
 
@@ -379,8 +381,7 @@ that these numbers were arranged in a 2D grid, or that spatially
 adjacent cells are related. This is appropriate when the game's
 observation is better understood as a collection of independent
 readings—a set of meters or status indicators—rather than as a spatial
-scene. Set ``spatial = false`` in the game description to use this
-architecture.
+scene. ``spatial = false`` (the default) selects this architecture.
 
 **Convolutional neural networks (CNN)**
 
@@ -405,8 +406,8 @@ channels respectively, kernel size 3, padding 1) followed by a
 flattening step and an MLP head. The padding ensures that the spatial
 dimensions are preserved through the convolution, so the output of the
 second conv layer has shape (64, H, W), which is then flattened and
-passed to the MLP. Set ``spatial = true`` (the default) to use this
-architecture.
+passed to the MLP. Set ``spatial = true`` in ``[preprocessing]`` to
+use this architecture.
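
The shape claim in this hunk can be checked with the standard convolution output-size formula, ``out = (in + 2 * padding - kernel) // stride + 1``. The sketch below assumes a stride of 1 (the text does not state one) and uses hypothetical helper names; it is an illustration, not code from ``retro-gamer``.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Length of one spatial dimension after a 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_cnn_features(height, width, out_channels=64):
    """Number of values handed to the MLP head after two conv layers."""
    for _ in range(2):  # two conv layers, both kernel 3 / padding 1
        height = conv_output_size(height)
        width = conv_output_size(width)
    return out_channels * height * width


# A board 32 cells wide and 16 cells tall, as in the Snake examples.
print(conv_output_size(16), conv_output_size(32))
print(flattened_cnn_features(16, 32))
```

With kernel 3 and padding 1 each dimension is unchanged, so the flattened size is simply 64 times the number of board cells. Dropping the padding would shrink each dimension by 2 per layer.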
|
||||
|
||||
Connecting architecture to game metadata
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -416,15 +417,17 @@ follow from the game description you provide. This connection is worth
 making explicit, because understanding it is one of the main paths into
 understanding why neural network architecture matters.
 
-- If ``spatial = true``, the CNN can detect local patterns—which characters
-  are adjacent to which—without needing to see every possible arrangement.
-  This is appropriate for games like Snake, where the snake's direction
-  and the apple's relative position are spatially encoded.
+- If ``spatial = true`` (in ``[preprocessing]``), the CNN can detect
+  local patterns—which characters are adjacent to which—without needing
+  to see every possible arrangement. This is appropriate for games like
+  Snake, where the snake's direction and the apple's relative position
+  are spatially encoded.
 
-- If ``spatial = false``, the MLP treats the board as a flat vector. This
-  may be appropriate for games that use the character grid primarily as a
-  display rather than a spatial field—for example, a game where characters
-  appear in fixed, non-interacting positions as status indicators.
+- If ``spatial = false`` (the default), the MLP treats the board as a
+  flat vector. This may be appropriate for games that use the character
+  grid primarily as a display rather than a spatial field—for example,
+  a game where characters appear in fixed, non-interacting positions as
+  status indicators.
 
 - The ``character_set`` determines the depth (C) of the board tensor.
   More characters mean more numbers per cell and a larger input to the
@@ -432,11 +435,185 @@ understanding why neural network architecture matters.
   wastes capacity; a character set that omits relevant characters forces
   the agent to treat different things as the same.
 
-- The ``observe_state`` fields are appended to the flattened CNN output
-  before the MLP head. This allows the agent to use explicit state
-  variables—a timer, a lives count—alongside the visual board
-  representation.
+- Keys listed in ``observe_state`` (in ``[preprocessing]``) are appended
+  to the flattened board output before the MLP head. This allows the
+  agent to use computed values—a direction to the goal, a distance, a
+  timer—alongside the visual board representation.
 
 These relationships are not incidental features of the implementation.
 They are the reason the game description matters: every field you fill
 in shapes what the agent can perceive and therefore what it can learn.
+
+Design rationale
+----------------
+
+This section explains the reasoning behind several design decisions in
+``retro-gamer`` that go beyond technical necessity. Each choice was
+made with a specific pedagogical goal: to create a tool that not only
+trains agents, but also helps students build genuine understanding of
+how and why the training process works.
+
+Checkpoint compatibility and the "start fresh" workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a student changes the game description or network architecture
+mid-training, ``retro-gamer`` refuses to resume and explains exactly
+which fields changed and why they are incompatible. This behavior is
+deliberate.
+
+The immediate practical reason is correctness: if the character set
+changes, the network's input layer changes size, and the saved weights
+no longer correspond to any meaningful function. Loading them would
+produce garbage behavior. If the reward signal changes, the Q-values
+the network has accumulated are estimates of a *different* objective;
+resuming would mislead the network, not help it.
+
+But the deeper reason is pedagogical. The incompatibility check is a
+moment of forced reflection. When a student sees::
+
+    character_set
+      was : ['@', '*', '>', '<', '^', 'v']
+      now : ['@', '*', '>', '<', '^', 'v', '#']
+      why : the set of board characters (changes input layer size)
+
+they are confronted with the concrete consequence of a description
+change. The character set is not a label; it determines the shape of
+the tensor the network operates on. Changing it invalidates the
+network the same way changing the rules of chess would invalidate a
+chess engine. The error message is designed to make this connection
+legible, not just to block a problematic action.
+
+The ``retro-gamer clean`` command exists to make the recovery path
+explicit: you can start fresh, and you should. There is no partial
+salvage. This mirrors an important truth about RL training: some
+decisions are foundational, and changing them means starting over.
+Students who encounter this—who have to decide whether a change is
+worth the cost of retraining—are reasoning about the architecture in
+a way that purely reading about it does not produce.
+
+The distinction between incompatible changes (game description,
+network architecture) and safe changes (hyperparameters like learning
+rate and epsilon) is also pedagogically useful. It encodes, in the
+tool itself, the distinction between *what the agent is learning* and
+*how it is learning*. Students who ask "can I change the learning rate
+without retraining?" are asking a question with a precise answer, and
+answering it correctly requires understanding why the learning rate is
+different in kind from the character set.
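
The incompatible-versus-safe distinction described in this subsection can be sketched as a comparison between the configuration saved with a checkpoint and the current one. Everything below is hypothetical: the function names, the list of blocking fields, and every ``why`` string except the ``character_set`` one (which is quoted from the message shown in the text) are assumptions made for illustration, not the tool's implementation.

```python
# Hypothetical field list: the text names the game description and network
# architecture as incompatible changes; the exact keys here are illustrative.
INCOMPATIBLE_FIELDS = {
    "character_set": "the set of board characters (changes input layer size)",
    "actions": "the available actions (changes output layer size)",
    "reward": "the reward key (saved Q-values estimate a different objective)",
    "spatial": "CNN versus MLP (changes the network architecture)",
}


def incompatible_changes(saved, current):
    """Return (field, was, now, why) for each field that blocks resuming."""
    changes = []
    for field, why in INCOMPATIBLE_FIELDS.items():
        if saved.get(field) != current.get(field):
            changes.append((field, saved.get(field), current.get(field), why))
    return changes


def format_report(changes):
    """Render changes in a was/now/why layout like the message quoted above."""
    lines = []
    for field, was, now, why in changes:
        lines += [field, f"  was : {was}", f"  now : {now}", f"  why : {why}"]
    return "\n".join(lines)


saved = {"character_set": ["@", "*"], "reward": "score", "learning_rate": 0.001}
current = {"character_set": ["@", "*", "#"], "reward": "score", "learning_rate": 0.01}
print(format_report(incompatible_changes(saved, current)))
```

In this sketch the changed ``learning_rate`` is ignored because it is not in the blocking list, which mirrors the "how it is learning" side of the distinction.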
+
+Checkpoint-level logging
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Early versions of ``retro-gamer`` logged one line per episode. This
+was accurate but not very useful: a run of 1,000 episodes produces
+1,000 log lines, most of which are noise. Individual episodes vary
+widely due to randomness in both the game and the agent's exploration,
+making it hard to see the underlying trend.
+
+The current format logs one line per checkpoint—once every 100
+episodes—using averages over that window. This design serves several
+goals:
+
+**Noise reduction.** Single-episode rewards are highly variable,
+especially when epsilon is high and the agent is behaving randomly.
+Averaging over 100 episodes smooths out this variance and makes
+genuine trends visible.
+
+**Interpretive scaffolding.** The log line includes ``epsilon``
+alongside ``avg_reward``, so students can directly see the
+relationship between exploration rate and performance. Early entries
+with low ``avg_reward`` and high ``epsilon`` invite the question:
+"is this bad performance, or just exploration?" The answer—that random
+behavior is expected when epsilon is near 1—is readable from the log
+itself.
+
+**Timing information.** Each log line records both the elapsed time
+for that 100-episode interval and the total training time accumulated
+across all sessions. This serves two purposes. Practically, it lets
+students estimate how long continued training will take. Conceptually,
+it makes the cost of training tangible: RL is not instant, and the
+log makes the time investment visible.
+
+**Session continuity.** When training resumes from a checkpoint, a
+header line marks the break (``=== Resumed from ep_0500.pt ===``).
+This lets the full log tell the story of a run across multiple
+sessions, preserving the history of when training happened even if the
+student stops and restarts many times.
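
The windowed averaging described above can be sketched in a few lines. This is a minimal illustration under assumptions: the function name and the row layout are hypothetical, the data is synthetic, and the real log also carries timing fields that are omitted here.

```python
from statistics import mean


def checkpoint_rows(episode_rewards, epsilons, window=100):
    """Collapse per-episode values into one summary row per full window."""
    rows = []
    for end in range(window, len(episode_rewards) + 1, window):
        rows.append({
            "episode": end,
            "avg_reward": mean(episode_rewards[end - window:end]),
            "epsilon": epsilons[end - 1],  # exploration rate at the checkpoint
        })
    return rows


# Synthetic data (not real training output): reward alternates 0/2 for the
# first 100 episodes, then 2/4, while epsilon decays linearly.
rewards = [(i % 2) * 2 + (2 if i >= 100 else 0) for i in range(200)]
epsilons = [1.0 - i / 200 for i in range(200)]
for row in checkpoint_rows(rewards, epsilons):
    print(row)
```

Each episode swings by 2 in this data, yet the two summary rows differ only by the underlying shift in the mean, which is the noise-reduction effect the text describes.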
+
+The stop-watch-adjust-resume workflow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is designed around a workflow that the log format and
+checkpoint system both support: stop training, watch the agent play,
+decide what to change, and resume.
+
+This workflow is pedagogically productive because it gives students
+a *reason* to look at the log and a *reason* to think about
+hyperparameters. Watching the agent at episode 100 play erratically,
+then watching the agent at episode 500 navigate toward the apple more
+consistently, is not just satisfying—it raises concrete questions.
+Why did the agent improve? What changed between those two checkpoints?
+What would happen if we gave it more time, or adjusted the reward?
+
+These questions are best answered by consulting the log. The log in
+turn connects the behavior the student observed to numbers they can
+reason about: a decreasing loss, a declining epsilon, a rising average
+reward. The three—visual observation, log interpretation, and
+conceptual understanding—form a feedback loop that is much harder to
+close if training is treated as a black box that produces only a final
+model.
+
+The fact that training can be stopped and resumed freely, with no
+penalty and no extra flags, removes friction from this cycle. Students
+who feel they can experiment—stop, look, think, resume—are more
+likely to do so than students who feel they have to commit to a full
+training run before seeing results.
+
+Reward design as game description
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``reward`` field in ``[tool.retro-gamer]`` specifies a key from
+the game's state dictionary, not a function or a formula. This is
+another deliberate design choice. The reward signal is defined in the
+game code—in how the score changes when certain events occur—not in
+the training configuration.
+
+This forces students to engage with the reward where it lives: in the
+game logic. If a student wants to change the reward structure, they
+must change the game. This connects the RL concept of reward shaping
+to the concrete act of writing Python code that updates a score. The
+question "what reward should the agent get for moving toward the
+apple?" becomes "what code should run when the snake moves?"—and
+answering it requires reasoning about what behavior you want to
+encourage and how a small, frequent signal compares to a large,
+infrequent one.
+
+The distinction between reward-signal design (a pedagogically rich
+question with many possible answers) and reward-field specification
+(a technical detail) is preserved in the interface. Students configure
+the *key* to track; they design the *signal* in the game itself.
+
+Metadata as game description, not training configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The game description lives in ``[tool.retro-gamer]`` inside the
+game's own ``pyproject.toml``, not in a separate training
+configuration file. This placement encodes a claim: the character set,
+the action space, and the reward signal are *properties of the game*,
+not settings for the trainer.
+
+A student who edits the character set is not tweaking the trainer;
+they are more accurately describing their game. This framing matters
+because it positions the student as the expert on the game—which they
+are—and the trainer as a tool that depends on the accuracy of that
+description. Errors in the description are not configuration mistakes;
+they are inaccurate descriptions of something the student knows.
+
+When a student omits a character from the character set and the agent
+fails to notice that character on the board, the diagnostic question
+is not "what went wrong with training?" but "is my description of the
+game correct?" This is a more productive question, because it connects
+the student's domain knowledge (they know what characters appear and
+why they matter) to the technical representation (one-hot encoding
+requires knowing in advance which characters to encode). The fix is
+not to adjust a hyperparameter; it is to describe the game more
+accurately.
11
docs/conf.py
@@ -1,9 +1,18 @@
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath('..'))
+
 project = 'retro-gamer'
 copyright = '2025, Chris Proctor'
 author = 'Chris Proctor'
 release = '0.1.0'
 
-extensions = []
+extensions = [
+    'sphinx.ext.autodoc',
+]
 
+autodoc_member_order = 'bysource'
+
 templates_path = ['_templates']
 exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
@@ -31,17 +31,22 @@ with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
 The retro-games framework must also be installed; see its documentation
 for instructions.
 
+If you are working through a Making With Code lab, ``retro-gamer`` is
+already installed in your project environment — skip ahead to
+:ref:`installation`.
+
+**Add to a project** using ``uv`` or ``pip``:
+
 .. code-block:: console
 
+   % uv add retro-gamer
    % pip install retro-gamer
 
-To install from source (for development or to use the latest changes):
+**Install as a global tool** (available everywhere, no project needed):
 
 .. code-block:: console
 
-   % git clone https://github.com/cproctor/retro-gamer
-   % cd retro-gamer
-   % pip install -e .
+   % uv tool install retro-gamer
 
 Verify the installation by checking the command-line tool:
 
@@ -65,5 +70,8 @@ Verify the installation by checking the command-line tool:
    introduction
    background
    walkthrough
+   troubleshooting
    reference
+   integration
+   api
    contributing
186
docs/integration.rst
Normal file
@@ -0,0 +1,186 @@
+Integrating a Trained Model
+===========================
+
+Once you have trained a model, you can use it in two ways:
+
+- **PolicyInput** — the model replaces the keyboard, driving an existing
+  player-controlled agent. Use this to watch a trained agent play, or to
+  run automated evaluations.
+- **TrainedPolicy in play_turn** — call ``get_action(game)`` from inside any
+  agent's ``play_turn`` to embed the model as an autonomous character (for
+  example, a smart enemy) alongside human-controlled or other agents.
+
+Loading a trained model
+-----------------------
+
+Both approaches start by creating a :class:`retro_gamer.TrainedPolicy`:
+
+.. code-block:: python
+
+    from retro_gamer import TrainedPolicy
+
+    ai = TrainedPolicy("runs/snake/")
+
+This reads ``config.toml``, rebuilds the network, and loads the latest
+checkpoint. To load a specific checkpoint instead:
+
+.. code-block:: python
+
+    ai = TrainedPolicy("runs/snake/", checkpoint="ep_0500")
+
+PolicyInput: model as player
+----------------------------
+
+:class:`retro_gamer.PolicyInput` is an input source — it implements the same
+interface as keyboard input, but chooses actions using the trained model. Pass
+it to ``game.play()`` and everything else works exactly as usual:
+
+.. code-block:: python
+
+    from retro.examples.snake import create_game
+    from retro_gamer import TrainedPolicy, PolicyInput
+
+    ai = TrainedPolicy("runs/snake/")
+    game = create_game()
+    game.play(input_source=PolicyInput(ai, game))
+
+On each turn, ``PolicyInput`` observes the current board and game state, runs
+the model, and sends the chosen action to the game exactly as if the player
+had pressed that key.
+
+TrainedPolicy in play_turn: model as autonomous character
+---------------------------------------------------------
+
+To embed a trained model as an autonomous game character, create a
+``TrainedPolicy`` at module level and call ``get_action(game)`` from inside
+the agent's ``play_turn``. Placing it at module level means the model is
+loaded from disk once — not once per episode.
+
+.. code-block:: python
+
+    from retro.game import Game
+    from retro.examples.snake import Apple, SnakeHead
+    from retro_gamer import TrainedPolicy
+
+    _ai = TrainedPolicy("runs/snake/")
+
+    class AISnake(SnakeHead):
+        def handle_keystroke(self, k, game): pass  # ignore keyboard
+
+        def play_turn(self, game):
+            key = _ai.get_action(game)
+            if key == 'KEY_RIGHT': self.direction = (1, 0)
+            elif key == 'KEY_LEFT': self.direction = (-1, 0)
+            elif key == 'KEY_UP': self.direction = (0, -1)
+            elif key == 'KEY_DOWN': self.direction = (0, 1)
+            super().play_turn(game)
+
+    human_snake = SnakeHead()
+    ai_snake = AISnake()
+    ai_snake.position = (16, 8)
+    apple = Apple()
+
+    game = Game([human_snake, ai_snake, apple], {"score": 0}, board_size=(32, 16))
+    apple.relocate(game)
+    game.play()
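
The ``if``/``elif`` chain in ``play_turn`` could also be written as a table lookup, which keeps the key-to-direction mapping in one place. The sketch below is standalone so it runs without ``retro`` installed; the names ``KEY_TO_DIRECTION`` and ``next_direction`` are hypothetical, and the vectors are copied from the example above.

```python
# Direction vectors copied from the if/elif chain in the AISnake example.
KEY_TO_DIRECTION = {
    "KEY_RIGHT": (1, 0),
    "KEY_LEFT": (-1, 0),
    "KEY_UP": (0, -1),
    "KEY_DOWN": (0, 1),
}


def next_direction(key, current):
    """Return the direction for key, keeping current for unrecognized keys."""
    return KEY_TO_DIRECTION.get(key, current)


print(next_direction("KEY_UP", (1, 0)))
print(next_direction("KEY_ENTER", (1, 0)))
```

Inside ``play_turn`` this would read ``self.direction = next_direction(key, self.direction)``, assuming the agent should keep its heading when the model returns a key outside the table.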
+
+Training an enemy model
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use the same training pipeline to produce a model for an enemy agent.
+``retro-gamer`` does not care *which* character it is training — it only cares
+that it can control one character through the keyboard and read a reward signal
+from the game state. To train an enemy:
+
+1. **Create an enemy-perspective game variant.** Write (or add) a
+   ``create_game`` function — in a separate file, or alongside your main one —
+   where the enemy agent is the keyboard-driven character and the reward key
+   in the game state reflects the enemy's objective (for example, a bonus for
+   catching the player). The human player can be absent, replaced by a
+   random-moving agent, or driven by a ``TrainedPolicy`` once you have a trained
+   player model.
+
+   .. code-block:: python
+
+       def create_enemy_training_game():
+           enemy = EnemyAgent()     # the character the trainer will control
+           player = RandomPlayer()  # a stand-in; no human involved
+           game = Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
+           return game
+
+2. **Train normally against this variant.**
+
+   .. code-block:: console
+
+       % retro-gamer create --game my_game:create_enemy_training_game \
+                            --output runs/enemy/
+       % retro-gamer train runs/enemy/
+
+3. **Embed the trained model in your main game** using ``get_action``, exactly
+   as shown above.
+
+.. note::
+
+   Because ``retro-gamer`` injects actions through the game's global input
+   source, *all* keyboard-listening agents in the training game will receive
+   the trainer's keystrokes. The cleanest approach is to make the enemy the
+   only keyboard-driven character in the training variant — any other
+   characters should advance on their own without reading from the keyboard.
+
+Adversarial training
+~~~~~~~~~~~~~~~~~~~~~
+
+Once you have separate training runs for the player and the enemy, you can
+train them *against each other* iteratively. The idea is simple: train the
+player against the current enemy model, then train the enemy against the
+updated player model, and repeat. Each side is forced to improve against an
+increasingly capable opponent.
+
+The key technique is to load the opponent's model at module level in each
+training game variant, so it is loaded from disk once per run rather than
+once per episode:
+
+.. code-block:: python
+
+    # enemy_training_game.py
+    from retro_gamer import TrainedPolicy
+
+    _player = TrainedPolicy("runs/player/")  # loaded once when the module is imported
+
+    def create_game():
+        enemy = EnemyAgent()
+        player = AIPlayer(_player)  # uses _player.get_action in play_turn
+        return Game([enemy, player], {'enemy_reward': 0}, board_size=(32, 16))
+
+You then alternate training runs:
+
+.. code-block:: console
+
+    % retro-gamer train runs/player/   # train player against current enemy
+    % retro-gamer train runs/enemy/    # train enemy against updated player
+    % retro-gamer train runs/player/   # train player again
+    # ...
+
+How many episodes to run before switching is itself a design decision: too
+few and neither model has time to adapt; too many and each side overfits to
+its current opponent. Watching how the strategies evolve — and asking *why*
+each model behaves as it does at each stage — connects directly to concepts
+in multi-agent reinforcement learning and adversarial training.
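
The alternation above can be scripted rather than typed by hand. The sketch below only builds the command lists; the helper name is hypothetical, and it assumes each ``retro-gamer train`` invocation stops on its own after the number of episodes set in that run's configuration, which this document does not confirm.

```python
def alternating_schedule(rounds, first="runs/player/", second="runs/enemy/"):
    """Build the list of `retro-gamer train` commands for `rounds` rounds."""
    commands = []
    for _ in range(rounds):
        commands.append(["retro-gamer", "train", first])
        commands.append(["retro-gamer", "train", second])
    return commands


for command in alternating_schedule(2):
    # To actually run a step: subprocess.run(command, check=True)
    print(" ".join(command))
```

Keeping the schedule as data makes the "how long before switching" decision explicit: changing ``rounds``, or the episode budget of each run, is the experiment.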
+
+Differences between the two approaches
+---------------------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - ``PolicyInput``
+     - ``TrainedPolicy`` in ``play_turn``
+   * - Replaces human input for the whole game
+     - One autonomous agent among many
+   * - Game code is unchanged
+     - Agent's ``play_turn`` calls ``get_action``
+   * - One model drives all player-controlled agents
+     - Each agent instance has its own model
+   * - Simpler — just pass to ``game.play()``
+     - More flexible — mix human and AI characters
@@ -100,12 +100,12 @@ matters.
 
 **Observation design** determines what information is available to the
 agent. If you leave a character out of the ``character_set``, the agent
-will not distinguish it from empty space. If you include a game-state
-variable in ``observe_state``, the agent can see it directly rather than
-having to infer it from the board. The consequences of these choices for
-what the agent can learn are reasonably predictable—and making and
-checking those predictions is exactly the kind of reasoning the tool is
-designed to support.
+will not distinguish it from empty space. If the game module defines a
+``get_state()`` function, the agent also receives those computed values
+as part of its observation. The consequences of these choices for what
+the agent can learn are reasonably predictable — and making and checking
+those predictions is exactly the kind of reasoning the tool is designed
+to support.
 
 **Reward engineering** is the craft of specifying what counts as doing
 well in a way the agent can actually optimize. Using score as the reward
@@ -17,8 +17,6 @@ A complete example for the Snake game:
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
-   spatial = true
-   observe_state = []
 
 You do not need to specify the board size: ``retro-gamer`` reads it
 directly from your game's ``board_size`` attribute.
@@ -65,54 +63,156 @@ If omitted, ``retro-gamer`` runs an exploration phase to discover the
 characters that appear in practice. The length of this phase is
 controlled by the ``exploration_turns`` hyperparameter.
 
-``spatial``
-~~~~~~~~~~~
+Preprocessing options
+---------------------
 
-**Optional; default ``true``.** Whether to treat the board as a 2D
-spatial scene. When ``true``, the trainer uses a convolutional neural
-network (CNN) that can detect patterns in the relative positions of
-characters. When ``false``, the trainer uses a multilayer perceptron
-(MLP) that sees the board as a flat list of numbers without positional
-structure.
+Preprocessing options live in the ``[preprocessing]`` section of a run's
+``config.toml``. They control how the game's board and state are
+transformed into the observation vector that the neural network sees.
+``retro-gamer create`` writes sensible defaults; you can edit them by
+hand before running ``retro-gamer train``.
 
+.. note::
+
+   Changes to any ``[preprocessing]`` option—or to the game description
+   fields above—make existing checkpoints incompatible. Run
+   ``retro-gamer clean`` before retraining after such changes.
+
+``spatial`` (default: ``false``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to treat the board as a 2D spatial scene. When ``true``, the
+trainer uses a convolutional neural network (CNN); when ``false``, a
+multilayer perceptron (MLP) that sees the board as a flat list of
+numbers.
+
+``board`` (default: ``true``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whether to include the board encoding in the observation vector. Set
+to ``false`` to train on game state variables only, with no board at
+all. This is useful for games with small, enumerable state spaces where
+a lookup table (classic Q-learning) is sufficient.
+
+When ``board = false``:
+
+- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
+- At least one key must be listed in ``observe_state``.
+- ``character_set`` is not required and character discovery is skipped.
+
 .. code-block:: toml
 
-   spatial = true
+   [preprocessing]
+   board = false
+   observe_state = ["board_state"]
 
-``observe_state``
-~~~~~~~~~~~~~~~~~
+``observe_state`` (default: ``[]``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-**Optional; default ``[]``.** A list of keys from the game's state
-dictionary to append to the observation vector. The values must be
-numbers (integers, floats, or booleans). The reward key must not
-appear in this list.
+A list of keys from ``game.state`` to include in the observation
+vector, appended after the board encoding (or as the entire
+observation when ``board = false``). Scalar values contribute one
+element each; list or tuple values are flattened.
 
 .. code-block:: toml
 
-   observe_state = ["lives", "level"]
+   observe_state = ["apple_dx", "apple_dy"]
+
+The keys must be present in ``game.state`` at every step, initialized
+in ``create_game()`` before the game starts. All values that are lists
+or tuples must always have the same length from episode to episode.
+
+.. warning::
+
+   ``observe_state`` keys must be initialized to their final shape in
+   ``create_game()`` before the game starts. If a key is absent or its
+   list length changes between episodes, training will crash with an
+   error explaining which key changed and by how much. This happens
+   because the neural network's input layer has a fixed size determined
+   at the start of training; it cannot adapt to a changing observation
+   shape mid-run.
+
+   Always initialize every observed key with a placeholder of the
+   correct type and length before the first ``game.step()`` call.
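
To make the observation layout concrete, here is a minimal pure-Python sketch of a one-hot board encoding followed by the flattened ``observe_state`` values. The function names are hypothetical and the cell-by-cell ordering is an assumption; the tensor layout used by ``retro-gamer`` itself may differ.

```python
def one_hot_board(rows, character_set):
    """One-hot encode a board given as a list of equal-length strings.

    Characters missing from character_set encode as all zeros, which makes
    them indistinguishable from empty space.
    """
    index = {ch: i for i, ch in enumerate(character_set)}
    encoded = []
    for row in rows:
        for ch in row:
            cell = [0.0] * len(character_set)
            if ch in index:
                cell[index[ch]] = 1.0
            encoded.extend(cell)
    return encoded


def flatten_state(state, keys):
    """Flatten observed values: scalars add one element, sequences add N."""
    values = []
    for key in keys:
        value = state[key]  # KeyError if the key was never initialized
        if isinstance(value, (list, tuple)):
            values.extend(float(v) for v in value)
        else:
            values.append(float(value))
    return values


def build_observation(rows, state, character_set, observe_state, board=True):
    """Board encoding (if enabled) followed by the observed state values."""
    obs = one_hot_board(rows, character_set) if board else []
    return obs + flatten_state(state, observe_state)


rows = ["@ ", " *"]  # a 2x2 board
state = {"score": 0, "apple_dx": 1, "apple_dy": 1}
obs = build_observation(rows, state, ["@", "*"], ["apple_dx", "apple_dy"])
print(len(obs), obs)
```

The length of ``obs`` is what fixes the network's input size, which is why a missing key or a list that changes length cannot be tolerated mid-run.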
+
+``observe_state_sizes`` (auto-discovered)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A table mapping each ``observe_state`` key to its flat size (``1`` for
+scalars, ``N`` for sequences of length N). This is written automatically
+to ``config.toml`` the first time ``retro-gamer train`` runs, after the
+trainer samples ``game.state`` to discover the actual sizes:
+
+.. code-block:: toml
+
+   observe_state_sizes = {board_state = 9}
+
+You do not need to set this manually. Once written, it is used to
+detect changes in state shape when resuming training—an incompatible
+change here requires running ``retro-gamer clean`` and starting fresh.
|
||||
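The sizing rule above (``1`` for scalars, ``N`` for sequences) can be
sketched in a few lines. This is only an illustration: ``flat_sizes`` and
``check_sizes`` are hypothetical helpers, not part of ``retro-gamer``, the
plain ``dict`` stands in for ``game.state``, and sequences are assumed to be
flat (not nested).

.. code-block:: python

    def flat_sizes(state, keys):
        """Return {key: flat size} for each observed key (hypothetical helper)."""
        sizes = {}
        for key in keys:
            if key not in state:
                raise KeyError(f"observed key {key!r} is missing from the state")
            value = state[key]
            # Lists and tuples count once per element; anything else counts as 1.
            sizes[key] = len(value) if isinstance(value, (list, tuple)) else 1
        return sizes


    def check_sizes(expected, state):
        """Raise if any observed key changed its flat size since discovery."""
        actual = flat_sizes(state, expected.keys())
        for key, size in expected.items():
            if actual[key] != size:
                raise ValueError(
                    f"{key!r} changed size: expected {size}, got {actual[key]}"
                )


    # A stand-in for game.state as initialized in create_game().
    initial_state = {"lives": 3, "board_state": [0] * 9}
    sizes = flat_sizes(initial_state, ["lives", "board_state"])
    check_sizes(sizes, {"lives": 2, "board_state": [1] * 9})  # same shape: passes

Initializing ``board_state`` with a nine-element placeholder, as here, is what
keeps the discovered size stable from the first step onward.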
|
||||
``egocentric`` (default: ``false``)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When ``true``, the board observation is cropped to a square window
|
||||
centred on a specific agent rather than the full board. This gives the
|
||||
agent a local, first-person-like view and makes the observation
|
||||
invariant to the agent's absolute position on the board.
|
||||
|
||||
Requires ``egocentric_player`` and ``egocentric_radius``.
|
||||
|
||||
``egocentric_player``
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The name of the agent to use as the centre of the egocentric crop.
|
||||
Must match the ``name`` attribute of one of the game's agents.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
egocentric_player = "Snake head"
|
||||
|
||||
``egocentric_radius``
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The half-side-length of the egocentric crop window, in cells. The
|
||||
resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
|
||||
values give the agent a wider view; smaller values focus it on the
|
||||
immediate vicinity.
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
egocentric_radius = 8 # 17×17 window
|
||||
|
||||
When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
|
||||
automatically updated to ``[2r+1, 2r+1]`` so the network is sized
|
||||
correctly.
|
||||
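To make the ``(2r+1) × (2r+1)`` window concrete, here is a minimal sketch of
an egocentric crop. It is not ``retro-gamer``'s implementation:
``egocentric_crop`` is a hypothetical function, the ``(row, col)`` coordinate
order is an assumption, and so is the choice to pad cells outside the board
with a blank character.

.. code-block:: python

    def egocentric_crop(board, center, radius, fill=" "):
        """Return a (2r+1) x (2r+1) window of ``board`` around ``center``.

        ``board`` is a list of equal-length rows; ``center`` is (row, col).
        Cells outside the board are filled with ``fill`` (an assumption).
        """
        rows, cols = len(board), len(board[0])
        r0, c0 = center
        window = []
        for r in range(r0 - radius, r0 + radius + 1):
            row = []
            for c in range(c0 - radius, c0 + radius + 1):
                inside = 0 <= r < rows and 0 <= c < cols
                row.append(board[r][c] if inside else fill)
            window.append(row)
        return window


    board = [list("....."), list(".@*.."), list(".....")]
    window = egocentric_crop(board, center=(1, 1), radius=1)  # 3 x 3 window

Because the window has the same shape wherever the agent stands, the network
input size depends only on ``egocentric_radius``, not on the board size.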
|
||||
.. _hyperparameters:
|
||||
|
||||
Hyperparameters
|
||||
---------------
|
||||
|
||||
Hyperparameters are stored in the ``[hyperparameters]`` section of
|
||||
``config.toml``. They can be set via ``retro-gamer create`` options or
|
||||
edited directly.
|
||||
Hyperparameters are split across two sections of ``config.toml``:
|
||||
|
||||
- ``[model]`` — network architecture (changing these requires starting fresh)
|
||||
- ``[training]`` — learning algorithm parameters (safe to change at any time)
|
||||
|
||||
Both sections can be set via ``retro-gamer create`` options or edited directly.
|
||||
|
||||
Learning and optimization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``learning_rate`` (default: ``0.001``)
|
||||
``learning_rate`` (default: ``0.0001``)
|
||||
The step size used by the Adam optimizer when updating network
|
||||
weights. Larger values converge faster but may be unstable; smaller
|
||||
values are more stable but slower.
|
||||
|
||||
``lr_decay`` (default: ``0.995``)
|
||||
``learning_rate_decay`` (default: ``0.9999``)
|
||||
Multiplicative decay applied to the learning rate after each
|
||||
episode. The learning rate decreases geometrically over training,
|
||||
helping the network fine-tune later without destabilizing early
|
||||
progress.
|
||||
progress. With the default value, the learning rate decays to about
|
||||
13 % of its starting value after 20 000 episodes.
|
||||
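The geometric decay can be checked with a short calculation. This sketch
assumes, as described above, that the rate is multiplied by the decay factor
once per episode; ``decayed_learning_rate`` is a hypothetical helper name.

.. code-block:: python

    def decayed_learning_rate(initial, decay, episodes):
        """Learning rate after ``episodes`` rounds of multiplicative decay."""
        return initial * decay ** episodes


    # Defaults quoted above: learning_rate = 0.0001, learning_rate_decay = 0.9999.
    final_lr = decayed_learning_rate(0.0001, 0.9999, 20_000)
    fraction = final_lr / 0.0001  # fraction of the starting rate that remains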
|
||||
``gamma`` (default: ``0.99``)
|
||||
The discount factor for future rewards. A value of 1.0 makes the
|
||||
@@ -127,7 +227,7 @@ Exploration
|
||||
random action with probability ``epsilon`` and exploits its current
|
||||
Q-function with probability ``1 - epsilon``.
|
||||
|
||||
``epsilon_decay`` (default: ``0.995``)
|
||||
``epsilon_decay`` (default: ``0.9997``)
|
||||
Multiplicative decay applied to ``epsilon`` after each episode.
|
||||
|
||||
``epsilon_min`` (default: ``0.05``)
|
||||
@@ -142,31 +242,33 @@ Memory and sampling
|
||||
The number of experiences sampled from the replay buffer per
|
||||
training step.
|
||||
|
||||
``memory_capacity`` (default: ``10000``)
|
||||
``memory_capacity`` (default: ``50000``)
|
||||
The maximum number of experiences the replay buffer can hold. When
|
||||
full, the oldest experiences are discarded.
|
||||
|
||||
``prioritize_experiences`` (default: ``false``)
|
||||
``prioritize_experiences`` (default: ``true``)
|
||||
Whether to use prioritized experience replay. When ``true``,
|
||||
experiences with larger TD errors are sampled more frequently.
|
||||
This often improves sample efficiency at a modest computational
|
||||
cost.
|
||||
|
||||
Network architecture
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
Model architecture
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
``n_layers`` (default: ``2``)
|
||||
The number of hidden layers in the MLP head (for spatial games,
|
||||
this follows the CNN; for non-spatial games, it is the full
|
||||
network).
|
||||
These live in the ``[model]`` section. Changing them requires starting fresh
|
||||
(run ``retro-gamer clean`` before retraining).
|
||||
|
||||
``layer_size`` (default: ``128``)
|
||||
The width (number of units) in each hidden layer.
|
||||
``hidden_sizes`` (default: ``[128, 64]``)
|
||||
A list of integers giving the size of each hidden layer in the MLP
|
||||
head. The default creates two layers: 128 units then 64. For spatial
|
||||
games this follows the CNN; for non-spatial games it is the full
|
||||
network. Larger or deeper networks can represent more complex
|
||||
Q-functions but train more slowly and may need more episodes.
|
||||
|
||||
Training duration
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
``training_episodes`` (default: ``1000``)
|
||||
``training_episodes`` (default: ``20000``)
|
||||
The total number of game episodes to run. Each episode runs until
|
||||
the game ends or ``max_turns_per_episode`` turns have elapsed.
|
||||
|
||||
@@ -175,12 +277,18 @@ Training duration
|
||||
indefinitely (for example, if the agent finds a way to avoid
|
||||
dying).
|
||||
|
||||
``target_update_freq`` (default: ``100``)
|
||||
``target_update_freq`` (default: ``500``)
|
||||
How many training steps between updates of the target network.
|
||||
More frequent updates make training targets move faster (less
|
||||
stable); less frequent updates make them more stable but slower
|
||||
to reflect new learning.
|
||||
|
||||
``train_every`` (default: ``4``)
|
||||
Run one training step every N game steps. Higher values speed up
|
||||
episode collection at the cost of fewer gradient updates per
|
||||
experience. The default of 4 is a good balance for most games;
|
||||
set to 1 to train on every step.
|
||||
|
||||
Character discovery
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
@@ -207,23 +315,26 @@ game's ``pyproject.toml``; you do not pass it on the command line.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create --game MODULE --output DIR [OPTIONS]
|
||||
% retro-gamer create --game GAME --output DIR [OPTIONS]
|
||||
|
||||
**Required options:**
|
||||
|
||||
- ``--game MODULE`` — Python module containing ``create_game()``
|
||||
(e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
|
||||
is read from the ``pyproject.toml`` found in or above the module's
|
||||
source directory.
|
||||
- ``--game GAME`` — Your game, specified as a file path or a Python
|
||||
module name:
|
||||
|
||||
- File path: ``--game my_game.py`` or ``--game my_game/``
|
||||
- Module name: ``--game retro.examples.snake``
|
||||
|
||||
The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
|
||||
found in or above the game file.
|
||||
- ``--output DIR`` — Directory to create for this training run.
|
||||
|
||||
**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
|
||||
|
||||
- ``--training-episodes N``
|
||||
- ``--n-layers N``
|
||||
- ``--layer-size N``
|
||||
- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
|
||||
- ``--learning-rate F``
|
||||
- ``--lr-decay F``
|
||||
- ``--learning-rate-decay F``
|
||||
- ``--gamma F``
|
||||
- ``--epsilon-decay F``
|
||||
- ``--epsilon-min F``
|
||||
@@ -232,20 +343,40 @@ game's ``pyproject.toml``; you do not pass it on the command line.
|
||||
- ``--target-update-freq N``
|
||||
- ``--max-turns-per-episode N``
|
||||
- ``--exploration-turns N``
|
||||
- ``--train-every N``
|
||||
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
|
||||
|
||||
``retro-gamer train``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Train (or resume training) a DQN agent.
|
||||
Train a DQN agent.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train RUN_DIR [--resume CHECKPOINT]
|
||||
% retro-gamer train RUN_DIR
|
||||
|
||||
``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
|
||||
create``. If ``--resume`` is given, training resumes from the specified
|
||||
checkpoint file (relative or absolute path).
|
||||
create``. If checkpoints already exist in ``RUN_DIR``, training
|
||||
automatically resumes from the latest one so prior work is never lost.
|
||||
|
||||
If all configured episodes have already been completed, the command
|
||||
prints a message and exits immediately. To keep training, increase
|
||||
``training_episodes`` in ``config.toml`` and run again.
|
||||
|
||||
**Incompatible changes.** Some config changes make existing checkpoints
|
||||
unusable. If you change any of the following, ``retro-gamer train`` will
|
||||
detect the mismatch and refuse to resume, with a clear explanation:
|
||||
|
||||
- ``actions``, ``reward``, ``character_set``, ``board_size``
|
||||
(``[metadata]``) — game description
|
||||
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
|
||||
``egocentric``, ``egocentric_player``, ``egocentric_radius``
|
||||
(``[preprocessing]``) — observation encoding
|
||||
- ``hidden_sizes`` (``[model]``) — network architecture
|
||||
|
||||
Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
|
||||
fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
|
||||
safe and take effect immediately on the next training run.
|
||||
|
||||
``retro-gamer play``
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -256,16 +387,32 @@ Watch a trained agent play the game in the terminal.
|
||||
|
||||
% retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
|
||||
|
||||
``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
|
||||
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
|
||||
By default, the latest available checkpoint is loaded. Use
|
||||
``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
|
||||
``--framerate`` sets the target frames per second (default: 12). Press
|
||||
Enter or Escape to quit.
|
||||
|
||||
``retro-gamer clean``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Remove all checkpoints and the training log from a run directory.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer clean RUN_DIR
|
||||
|
||||
Prompts for confirmation before deleting. Use ``--yes`` / ``-y`` to skip
|
||||
the prompt. The ``config.toml`` is preserved so you can run
|
||||
``retro-gamer train`` immediately to start fresh with the same settings.
|
||||
|
||||
Use this after making an incompatible change (see ``retro-gamer train``
|
||||
above) or any time you want to restart training from scratch.
|
||||
|
||||
``retro-gamer info``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Print a summary of a training run: metadata, hyperparameters, recent
|
||||
episode log, and available checkpoints.
|
||||
checkpoint log, and available checkpoints.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
@@ -285,60 +432,49 @@ contents:
|
||||
└── checkpoints/
|
||||
├── ep_0100.pt # model weights at episode 100
|
||||
├── ep_0200.pt
|
||||
├── ...
|
||||
└── final.pt # model weights at training completion
|
||||
└── ... # one file saved every 100 episodes
|
||||
|
||||
``config.toml`` is written by ``retro-gamer create`` and updated (with
|
||||
the discovered character set and resolved hyperparameters) when
|
||||
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
|
||||
and ``train`` is the recommended way to adjust hyperparameters.
|
||||
``retro-gamer train`` begins. It has five sections: ``[game]``,
|
||||
``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
|
||||
Editing ``config.toml`` between ``create`` and ``train`` is the
|
||||
recommended way to adjust hyperparameters.
|
||||
|
||||
``training.log`` begins with the full architecture description
|
||||
generated at training startup, followed by one line per episode in the
|
||||
format::
|
||||
``training.log`` begins with the full network architecture description,
|
||||
then one line per checkpoint (every 100 episodes) in the format::
|
||||
|
||||
[EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
|
||||
[ep_NNNN] ep=SSSS-NNNN avg_reward=F avg_steps=N epsilon=F avg_loss=F time=Xm Xs total=Xm Xs
|
||||
|
||||
Checkpoint files are PyTorch state dictionaries containing model
|
||||
weights, optimizer state, the current epsilon, and the total number of
|
||||
training steps completed. They can be loaded with
|
||||
``retro-gamer play`` or directly with the Python API.
|
||||
Each field averages over the episodes since the previous checkpoint:
|
||||
|
||||
- ``ep=SSSS-NNNN`` — episode range covered by this entry
|
||||
- ``avg_reward`` — mean total reward per episode (positive = good)
|
||||
- ``avg_steps`` — mean episode length in game turns
|
||||
- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
|
||||
- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
|
||||
stabilises). Huber loss equals ½·(q−t)² for small errors (|q−t| ≤ 1) and |q−t|−½ for
larger ones, so it grows only linearly, not quadratically, when Q-values are large. Values in the range
|
||||
0–10 are typical; a slow downward trend over thousands of episodes is the
|
||||
healthy pattern. A loss that grows without bound indicates a learning rate
|
||||
that is too high.
|
||||
- ``time`` — wall-clock time for this checkpoint interval
|
||||
- ``total`` — cumulative training time across all sessions
|
||||
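The Huber loss formula in the ``avg_loss`` entry above can be written out as
a small function. This is a sketch of the standard definition with a
threshold of 1 (which is what the ``|q−t|−½`` form implies); how the trainer
computes and averages the loss internally is not shown here.

.. code-block:: python

    def huber_loss(q, t, delta=1.0):
        """Huber loss between a prediction ``q`` and a target ``t``."""
        error = abs(q - t)
        if error <= delta:
            return 0.5 * error ** 2           # quadratic near zero
        return delta * (error - 0.5 * delta)  # linear for large errors


    small = huber_loss(1.5, 1.0)    # error 0.5, quadratic branch
    large = huber_loss(101.0, 1.0)  # error 100, linear branch

The linear branch is why a single badly mispredicted Q-value contributes
roughly its absolute error to the loss rather than its square.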
|
||||
When training is resumed, a ``=== Resumed from ... ===`` line is appended
|
||||
so the log records the full history of a run across multiple sessions.
|
||||
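If you want to plot or compare runs, the checkpoint lines can be parsed with
a regular expression. The sketch below is based only on the format string
shown above: the exact number formatting in real logs is an assumption,
``parse_checkpoint_lines`` is a hypothetical helper, and the sample lines are
made up for illustration.

.. code-block:: python

    import re

    # Pattern for the checkpoint line format shown above (assumed formatting).
    LINE = re.compile(
        r"\[(?P<name>ep_\d+)\]\s+ep=(?P<start>\d+)-(?P<end>\d+)\s+"
        r"avg_reward=(?P<avg_reward>-?[\d.]+)\s+avg_steps=(?P<avg_steps>[\d.]+)\s+"
        r"epsilon=(?P<epsilon>[\d.]+)\s+avg_loss=(?P<avg_loss>[\d.]+)"
    )


    def parse_checkpoint_lines(lines):
        """Yield one dict per checkpoint line; other lines are skipped."""
        for line in lines:
            match = LINE.search(line)
            if match is None:
                continue  # architecture description, "=== Resumed ..." lines
            row = match.groupdict()
            yield {
                "name": row["name"],
                "episodes": (int(row["start"]), int(row["end"])),
                "avg_reward": float(row["avg_reward"]),
                "avg_steps": float(row["avg_steps"]),
                "epsilon": float(row["epsilon"]),
                "avg_loss": float(row["avg_loss"]),
            }


    # Made-up sample lines, not output from a real run.
    sample = [
        "=== Resumed from ... ===",
        "[ep_0200] ep=0101-0200 avg_reward=-3.5 avg_steps=42 epsilon=0.94 "
        "avg_loss=1.25 time=0m 30s total=1m 02s",
    ]
    rows = list(parse_checkpoint_lines(sample))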
|
||||
Python API
|
||||
----------
|
||||
|
||||
For advanced use, ``retro-gamer``'s components are importable as a
|
||||
library.
|
||||
library. See the :doc:`api` reference for full details.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
|
||||
from retro_gamer import GameMetadata, DQNTrainer
|
||||
from retro.examples.snake import create_game
|
||||
|
||||
# Read metadata from [tool.retro-gamer] in the game's pyproject.toml
|
||||
metadata = GameMetadata.from_pyproject("retro.examples.snake")
|
||||
|
||||
trainer = DQNTrainer(
|
||||
create_game, metadata, "runs/snake/",
|
||||
training_episodes=500,
|
||||
n_layers=2,
|
||||
layer_size=128,
|
||||
)
|
||||
trainer = DQNTrainer(create_game, metadata, "runs/snake/")
|
||||
trainer.train()
|
||||
|
||||
``GameEnvironment`` provides a gym-style interface for stepping through
|
||||
a game programmatically:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from retro_gamer import GameEnvironment
|
||||
|
||||
env = GameEnvironment(create_game, metadata)
|
||||
obs = env.reset() # returns initial observation vector
|
||||
obs, reward, done = env.step("KEY_RIGHT")
|
||||
|
||||
The observation is a flat NumPy array of dtype ``float32``. For spatial
|
||||
games, the first ``C × H × W`` elements are the board (channel-first
|
||||
one-hot encoding); for non-spatial games, the board is encoded
|
||||
``H × W × C`` and then flattened. Any ``observe_state`` values are
|
||||
appended at the end.
|
||||
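The two board layouts differ only in the order in which the one-hot values
are flattened. The following sketch illustrates that difference with plain
lists. ``one_hot_board`` is a hypothetical function rather than the
library's encoder, and treating characters outside the character set as all
zeros is an assumption.

.. code-block:: python

    def one_hot_board(board, character_set, channel_first=True):
        """Flatten a board of characters into a one-hot list of floats.

        channel_first=True gives C x H x W order; False gives H x W x C.
        Characters not in ``character_set`` encode as zeros (an assumption).
        """
        height, width = len(board), len(board[0])
        channels = len(character_set)
        flat = [0.0] * (channels * height * width)
        for y, row in enumerate(board):
            for x, char in enumerate(row):
                if char not in character_set:
                    continue
                c = character_set.index(char)
                if channel_first:
                    flat[(c * height + y) * width + x] = 1.0
                else:
                    flat[(y * width + x) * channels + c] = 1.0
        return flat


    board = ["*.@"]  # one row, three columns
    chw = one_hot_board(board, ["@", "*"])                       # spatial layout
    hwc = one_hot_board(board, ["@", "*"], channel_first=False)  # non-spatial
    observation = chw + [3.0, 1.0]  # observe_state values appended at the end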
|
||||
287
docs/troubleshooting.rst
Normal file
@@ -0,0 +1,287 @@
|
||||
Troubleshooting
|
||||
===============
|
||||
|
||||
This section describes problems that commonly arise when training an agent
|
||||
with ``retro-gamer``. Each entry names the issue, describes what you will
|
||||
see in the training log or when watching the agent play, explains what is
|
||||
happening in terms of the underlying reinforcement learning, and suggests
|
||||
how to fix it.
|
||||
|
||||
.. contents:: Issues
|
||||
:local:
|
||||
:depth: 1
|
||||
|
||||
|
||||
Loss grows rapidly over training
|
||||
---------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
The ``avg_loss`` column in the training log grows steadily from one
|
||||
checkpoint to the next, often at an accelerating rate::
|
||||
|
||||
[ep_0100] avg_loss=22.2
|
||||
[ep_0200] avg_loss=128.5
|
||||
[ep_0300] avg_loss=2918.5
|
||||
[ep_0400] avg_loss=163825.1
|
||||
|
||||
Left unchecked, the loss eventually reaches extreme values and the agent's
|
||||
behavior becomes erratic or degenerates entirely.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
This is called *Q-value divergence*. The Q-network is trained to predict
|
||||
the total future reward of each action. To do that, it computes a *target*
|
||||
for each prediction — but the target itself is computed using the
|
||||
Q-network's own current predictions. This creates a feedback loop: if
|
||||
the predictions are slightly off, the targets drift, which makes the next
|
||||
predictions slightly more off, which drifts the targets further.
|
||||
|
||||
Under normal conditions, the learning rate is small enough and the target
|
||||
network stable enough that this loop stays controlled. Divergence happens
|
||||
when the learning rate is too high, causing each update to overshoot.
|
||||
The problem is amplified by larger networks (more parameters to overshoot)
|
||||
and by prioritized experience replay, which deliberately samples the
|
||||
experiences the network is most wrong about — exactly the experiences most
|
||||
likely to destabilize it.
|
||||
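The feedback loop is easier to see with the target written out. The sketch
below uses the standard one-step DQN target; whether ``retro-gamer`` uses
exactly this form (rather than a variant such as double DQN) is not stated
here, and ``td_target`` is a hypothetical helper.

.. code-block:: python

    def td_target(reward, next_q_values, gamma=0.99, done=False):
        """One-step TD target: r + gamma * max_a Q_target(s', a)."""
        if done:
            return reward  # no future reward after a terminal step
        return reward + gamma * max(next_q_values)


    # If the network over-estimates the next state's values, the target
    # inherits that error, which is the feedback loop described above.
    accurate = td_target(1.0, [2.0, 3.0])   # next-state estimates are sane
    inflated = td_target(1.0, [2.0, 30.0])  # one estimate has drifted upward

Because ``next_q_values`` come from the (target) network itself, an inflated
estimate raises the target, and training then pulls the prediction up toward
it.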
|
||||
**How to fix it**
|
||||
|
||||
Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
|
||||
(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
|
||||
training. If you recently increased the size of the network (via
|
||||
``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
|
||||
rate than you used before is likely necessary — larger, more capable
|
||||
networks need smaller, more careful updates.
|
||||
|
||||
Also consider increasing ``target_update_freq``. The target network is a
|
||||
frozen copy of the Q-network used to compute stable training targets; the
|
||||
less frequently it is updated, the more stable those targets are. The
|
||||
default is 500 steps; raising it to 1000 or 2000 slows learning slightly
|
||||
but reduces the chance of divergence.
|
||||
|
||||
Because divergence compounds over many episodes, a run that has begun
|
||||
diverging cannot simply be resumed with a lower learning rate — the
|
||||
weights have already drifted far from useful values. Use
|
||||
``retro-gamer clean`` to remove the existing checkpoints and start fresh.
|
||||
|
||||
|
||||
Agent ignores some actions entirely
|
||||
-------------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
After training, the agent never (or almost never) turns in certain
|
||||
directions, regardless of the board state. If you compare checkpoints at
|
||||
different stages of training, the missing directions are absent from the
|
||||
very beginning and never appear. The agent may survive for a while but
|
||||
always move in only a subset of the possible directions.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
If some actions lead to immediate death every time they are tried early in
|
||||
training, the Q-network quickly learns to assign them very low values.
|
||||
This is correct in the specific situation where those actions are always
|
||||
fatal — but the network then generalizes that association across *all*
|
||||
board positions, even positions where those actions would be safe.
|
||||
|
||||
A common cause is a fixed starting position at the edge or corner of the
|
||||
board. A snake that always starts in the top-left corner and always begins
|
||||
moving downward will die immediately whenever it turns up or left in the
|
||||
first step. After thousands of early episodes where those actions produce
|
||||
instant death, the network has seen so much evidence that "turn left →
|
||||
die" and "turn up → die" that it assigns them low Q-values everywhere.
|
||||
|
||||
**How to fix it**
|
||||
|
||||
Make sure the game's starting conditions give the agent a chance to try
|
||||
every action safely. For a snake game, this means randomizing both the
|
||||
starting position (keeping at least one cell away from every edge) and
|
||||
the starting direction at the beginning of each episode. An agent that
|
||||
starts in different places and orientations each time will quickly learn
|
||||
that all four directions can be appropriate depending on context.
|
||||
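One way to randomize the start is sketched below. It only picks a cell and a
direction; how you apply them to your agent inside ``create_game()`` depends
on your game code. ``random_start`` and the ``margin`` parameter are
hypothetical names, and the ``(width, height)`` order of ``board_size`` is an
assumption.

.. code-block:: python

    import random

    DIRECTIONS = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]


    def random_start(board_size, margin=1, rng=random):
        """Pick a start cell at least ``margin`` cells from every edge,
        plus a random initial direction."""
        width, height = board_size
        x = rng.randint(margin, width - 1 - margin)   # randint is inclusive
        y = rng.randint(margin, height - 1 - margin)
        return (x, y), rng.choice(DIRECTIONS)


    position, direction = random_start((32, 16))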
|
||||
|
||||
Agent survives but never moves toward the goal
|
||||
-----------------------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
The ``avg_steps`` column in the training log increases steadily — the
|
||||
agent is surviving longer — but ``avg_reward`` stays negative or barely
|
||||
improves. When you watch the agent play, it wanders around the board
|
||||
without ever approaching the target object. Episodes end because the
|
||||
agent runs into a wall, not because it reached the goal.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
The reward signal is *asymmetric*: it penalizes moving away from the goal
|
||||
but gives no reward for moving toward it. With this signal, the agent
|
||||
learns to avoid the penalty by surviving, but it has no positive gradient
|
||||
pointing it in the right direction. The eventual goal-reaching reward
|
||||
(eating the apple, reaching the exit, etc.) is too rare — especially
|
||||
early in training when the agent is mostly acting randomly — to provide
|
||||
meaningful learning signal on its own.
|
||||
|
||||
From the Q-network's perspective, all directions look roughly equivalent:
|
||||
moving toward the goal is 0 reward, moving away is −1. On a large board,
|
||||
the probability of eating the apple by chance is small enough that the
|
||||
network may never see the positive terminal reward at all during the
|
||||
exploration phase.
|
||||
|
||||
**How to fix it**
|
||||
|
||||
Make the distance-based reward symmetric: give **+1 for moving toward the
|
||||
goal** and **−1 for moving away**. This way, every single step provides a
|
||||
meaningful signal in the correct direction, and the agent does not need to
|
||||
reach the goal by chance in order to start learning. In a snake game,
|
||||
computing this signal requires only one line of arithmetic — the change
|
||||
in Manhattan distance between the head and the apple from one step to the
|
||||
next.
|
||||
|
||||
Note that the shaped ±1 signal is a *proxy* for the real objective. If the
|
||||
agent learns to follow it too literally, it may take direct paths that run
|
||||
through its own body. The −10 death penalty and +50 apple reward are still
|
||||
necessary; the shaping only accelerates early learning.
|
||||
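The symmetric shaping signal described above amounts to the change in
Manhattan distance. A minimal sketch, assuming positions are ``(x, y)``
tuples; ``manhattan`` and ``shaping_reward`` are hypothetical helper names:

.. code-block:: python

    def manhattan(a, b):
        """Manhattan distance between two (x, y) cells."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])


    def shaping_reward(old_head, new_head, apple):
        """+1 if the step moved the head closer to the apple, -1 if farther."""
        return manhattan(old_head, apple) - manhattan(new_head, apple)


    toward = shaping_reward((2, 2), (3, 2), apple=(5, 2))  # distance 3 -> 2
    away = shaping_reward((2, 2), (1, 2), apple=(5, 2))    # distance 3 -> 4

For a snake that moves exactly one cell per turn, the result is always
``+1`` or ``-1``.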
|
||||
|
||||
Exploration ends before learning is complete
|
||||
---------------------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
The ``epsilon`` column in the training log reaches ``epsilon_min`` well
|
||||
before training is finished. After that point, ``avg_reward`` stops
|
||||
improving even though many episodes remain. When you watch the agent play,
|
||||
it commits to the same strategy regardless of what is happening on the
|
||||
board.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
Epsilon controls the balance between exploration (random actions) and
|
||||
exploitation (using the learned policy). Early in training, when the
|
||||
Q-network has seen little data, exploration is essential: the agent needs
|
||||
to try different things to accumulate the varied experiences that make
|
||||
Q-value estimates reliable. Once epsilon reaches its minimum, the agent
|
||||
stops exploring and commits fully to whatever policy it has learned so far.
|
||||
|
||||
If ``training_episodes`` is too small relative to ``epsilon_decay``, the
|
||||
exploration phase ends while the Q-network is still unreliable. The agent
|
||||
then exploits a half-learned policy that cannot improve because it never
|
||||
tries anything new.
|
||||
|
||||
You can calculate when epsilon will reach its minimum:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import math
|
||||
episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)
|
||||
|
||||
With ``epsilon = 1.0``, ``epsilon_min = 0.05``, and ``epsilon_decay = 0.999``,
this comes to roughly 3,000 episodes; with the default ``epsilon_decay`` of
``0.9997`` it is closer to 10,000. The
|
||||
agent should have substantial training time *after* the exploration phase
|
||||
ends — so ``training_episodes`` should be at least several times this
|
||||
number.
|
||||
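The same relationship can be inverted to choose ``epsilon_decay`` for a
desired exploration length. This is a sketch under the assumption stated
above that epsilon is multiplied by ``epsilon_decay`` once per episode;
``decay_for`` is a hypothetical helper.

.. code-block:: python

    def decay_for(exploration_episodes, epsilon=1.0, epsilon_min=0.05):
        """Per-episode decay that reaches epsilon_min after the given episodes."""
        return (epsilon_min / epsilon) ** (1.0 / exploration_episodes)


    # Decay factor that ends exploration after about 5,000 episodes.
    decay = decay_for(5_000)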
|
||||
**How to fix it**
|
||||
|
||||
Increase ``training_episodes`` so that the agent has many episodes of
|
||||
exploitation after the exploration phase ends. For simple games on small
|
||||
boards, 10,000 episodes is a reasonable starting point; for complex games
|
||||
or large boards, 50,000–100,000 may be needed.
|
||||
|
||||
This is always safe to change. Because ``training_episodes`` does not
|
||||
affect the network architecture or the reward signal, you can increase it
|
||||
in ``config.toml`` and resume training from the latest checkpoint without
|
||||
starting fresh.
|
||||
|
||||
|
||||
Death penalty dominates all other signals
|
||||
-------------------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
After a period of training, the agent survives for many steps but rarely
|
||||
or never scores. It tends to circle, hug walls, or otherwise avoid the
|
||||
goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
|
||||
persistently negative. The agent behaves as if staying alive is the only
|
||||
objective.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
When the penalty for dying is much larger than any other reward in the
|
||||
game, the Q-network learns that staying alive is overwhelmingly the most
|
||||
important thing to do. Scoring — which requires taking some risk —
|
||||
becomes unattractive because a single death outweighs many successful
|
||||
goal-reaching events.
|
||||
|
||||
For example, if the death penalty is −1000 and each successful apple is
|
||||
+50, then dying once costs the equivalent of twenty apples. The agent
|
||||
learns that the safest strategy is to avoid risk entirely, even if that
|
||||
means never eating. From the Q-network's perspective, this is rational:
|
||||
it is correctly optimizing the reward signal you gave it.
|
||||
|
||||
**How to fix it**
|
||||
|
||||
Keep all reward magnitudes within an order of magnitude or two of each other. If per-step
|
||||
shaping gives ±1 and the goal reward is +50, a death penalty of −10 is
|
||||
appropriate: death is clearly bad (ten times worse than a bad step) but
|
||||
not so catastrophic that it crowds out everything else. As a rule of
|
||||
thumb, no single reward should be more than ten to twenty times larger
|
||||
than the typical per-step reward.
|
||||
|
||||
Increasing ``gamma`` (the discount factor) is a better way to make the
|
||||
agent care more about long-term consequences. A higher gamma causes
|
||||
future rewards — including the eventual death penalty — to count more
|
||||
heavily in the agent's current decisions, without distorting the relative
|
||||
scale of the rewards.
|
||||
|
||||
|
||||
Reward signal and human score interfere with each other
|
||||
---------------------------------------------------------
|
||||
|
||||
**Symptoms**
|
||||
|
||||
Human players see scores that go negative, or that include penalties and
|
||||
adjustments that make no sense in the context of a normal game. Conversely,
|
||||
adjustments made to improve training (removing a per-step shaping penalty,
|
||||
changing a death penalty) change the game's visible score in ways that
|
||||
affect the experience for human players.
|
||||
|
||||
**Why this happens**
|
||||
|
||||
Using the same state variable for both the training reward and the
|
||||
human-visible score conflates two separate concerns. Training rewards
|
||||
benefit from shaping — intermediate signals like "moved toward the goal"
|
||||
and "died" that accelerate learning. Scores for human players should
|
||||
reflect only the game's actual objectives (apples eaten, enemies defeated,
|
||||
distance covered) so that they are legible and motivating.
|
||||
|
||||
When these are the same variable, every design decision about one
|
||||
necessarily affects the other.
|
||||
|
||||
**How to fix it**
|
||||
|
||||
Use two separate keys in the game's state dictionary: one for the
|
||||
human-facing score (updated only by meaningful in-game events) and one
|
||||
for the training reward (updated every step with shaping signals and
|
||||
penalties). In the game code:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Only updated when the snake eats an apple — clean for human players.
|
||||
game.state['score'] += 50
|
||||
|
||||
# Updated every step — used only by the trainer.
|
||||
game.state['reward'] += old_dist - new_dist # +1 toward apple, -1 away
|
||||
game.state['reward'] += 50 # also reward eating
|
||||
game.state['reward'] -= 10 # death penalty
|
||||
|
||||
Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
|
||||
``pyproject.toml`` so the trainer watches the right key. The score display
|
||||
remains clean for human players, and you can adjust the training reward
|
||||
freely without affecting it.
|
||||
|
||||
Note that changing the ``reward`` key is an incompatible change: existing
|
||||
checkpoints trained on the old signal will be rejected when you try to
|
||||
resume. Run ``retro-gamer clean`` and start fresh after making this change.
|
||||
@@ -21,9 +21,9 @@ You will need:
|
||||
Preparing your game
|
||||
-------------------
|
||||
|
||||
``retro-gamer`` loads your game by importing a Python module and
|
||||
calling a function named ``create_game``. The ``create_game`` function
|
||||
must take no arguments and return a new ``Game`` instance.
|
||||
``retro-gamer`` loads your game by calling a function named
|
||||
``create_game``. The function must take no arguments and return a new
|
||||
``Game`` instance.
|
||||
|
||||
Here is the ``create_game`` function for Snake:
|
||||
|
||||
@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
|
||||
def create_game():
|
||||
head = SnakeHead()
|
||||
apple = Apple()
|
||||
game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
|
||||
game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
|
||||
apple.relocate(game)
|
||||
return game
|
||||
|
||||
If your game module does not already have a ``create_game`` function,
|
||||
add one following this pattern.
|
||||
If your game file does not already have a ``create_game`` function, add
|
||||
one following this pattern.
|
||||
|
||||
When you run ``retro-gamer create``, you can point to your game file
|
||||
directly by path or by Python module name:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create --game my_game.py --output runs/my_game/
|
||||
% retro-gamer create --game retro.examples.snake --output runs/snake/
|
||||
|
||||
|
||||
Describing your game
|
||||
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
|
||||
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
|
||||
reward = "score"
|
||||
character_set = ["@", "*", ">", "<", "^", "v"]
|
||||
spatial = true
|
||||
observe_state = []
|
||||
|
||||
Let's go through each field.
|
||||
|
||||
@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).
|
||||
|
||||
The key in the game's state dictionary to use as the reward signal.
|
||||
``retro-gamer`` computes the reward for each turn as the *change* in
|
||||
this value from one turn to the next. For Snake, score increases by 1
|
||||
(or more) each time the apple is eaten, so the agent receives a reward
|
||||
of 1 when it eats an apple and 0 otherwise.
|
||||
this value from one turn to the next. For Snake, the score changes when
|
||||
the snake eats an apple (+50), when it moves away from the apple (−1),
|
||||
and when it dies (−10). These incremental changes are what the agent
|
||||
tries to maximize.
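
As a rough illustration of this rule, the sketch below derives per-turn rewards from a sequence of ``score`` values. The helper ``reward_from_state`` and the sample scores are hypothetical and are not part of ``retro-gamer``; the starting score of 100 mirrors the ``create_game`` example.

```python
def reward_from_state(previous_state, current_state, reward_key="score"):
    """Return the change in the reward value between two consecutive turns."""
    return current_state[reward_key] - previous_state[reward_key]


# Hypothetical score history: two moves away from the apple,
# one apple eaten, then a death.
scores = [100, 99, 98, 148, 138]
states = [{"score": s} for s in scores]

rewards = [reward_from_state(a, b) for a, b in zip(states, states[1:])]
print(rewards)  # [-1, -1, 50, -10]
```

Because only the change matters, the absolute starting value of ``score`` has no effect on what the agent is rewarded for.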
|
||||
|
||||
Choosing an appropriate reward is one of the most consequential
|
||||
decisions in RL. Some considerations:
|
||||
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
|
||||
The number of exploration turns is controlled by the
|
||||
``exploration_turns`` hyperparameter.
|
||||
|
||||
``spatial``
|
||||
~~~~~~~~~~~
|
||||
``spatial`` and other preprocessing options
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Whether to treat the board as a spatial scene (default: ``true``). A
|
||||
spatial game uses a *convolutional neural network* (CNN) that can
|
||||
detect patterns in the relative arrangement of characters. A
|
||||
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
|
||||
ignores positional relationships. Set to ``false`` for games where
|
||||
position is irrelevant.
|
||||
The ``[tool.retro-gamer]`` section describes the game. Preprocessing
|
||||
options—such as ``spatial`` (whether to use a CNN or MLP, default:
|
||||
``false``), ``egocentric``, and ``observe_state``—live in the
|
||||
``[preprocessing]`` section of the generated ``config.toml``. You can
|
||||
edit them there after running ``retro-gamer create``.
|
||||
|
||||
``observe_state``
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default the agent only sees the board. You can also give it access
|
||||
to computed values from ``game.state`` by listing the relevant keys in
|
||||
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
|
||||
For example, Snake exposes the normalized direction to the apple:
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
[preprocessing]
|
||||
observe_state = ["apple_dx", "apple_dy"]
|
||||
|
||||
The trainer appends these values to the observation vector after the
|
||||
board encoding (or uses them as the entire observation when
|
||||
``board = false``).
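
One way to picture the appending step is the sketch below. It is only an illustration under simplifying assumptions: the board encoding is taken to be an already flattened list of numbers, and ``build_observation`` is a hypothetical helper, not the trainer's actual code.

```python
def build_observation(board_features, state, observe_state):
    """Concatenate a flattened board encoding with selected state values."""
    observation = [float(x) for x in board_features]
    for key in observe_state:
        value = state[key]
        if isinstance(value, (list, tuple)):
            # Sequence-valued entries contribute one slot per element.
            observation.extend(float(v) for v in value)
        else:
            observation.append(float(value))
    return observation


# Hypothetical values: a tiny 4-cell board encoding plus two state entries.
board_features = [0, 1, 0, 0]
state = {"score": 100, "apple_dx": 0.5, "apple_dy": -0.25}

observation = build_observation(board_features, state, ["apple_dx", "apple_dy"])
print(observation)  # [0.0, 1.0, 0.0, 0.0, 0.5, -0.25]
```

Passing an empty ``board_features`` list corresponds to the case where the state values make up the entire observation.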
|
||||
|
||||
These values must be set in ``game.state`` at the start of every
|
||||
episode—typically inside ``create_game()``—and must keep the same
|
||||
type and length from episode to episode.
|
||||
|
||||
.. warning::
|
||||
|
||||
Always initialize every key listed in ``observe_state`` before the
|
||||
game starts. If a key is missing or its length changes between
|
||||
episodes, training stops immediately with a clear error explaining
|
||||
what changed. The neural network's input size is fixed when training
|
||||
begins; it cannot adapt to a changing observation shape mid-run.
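
The kind of check described in the warning can be sketched as follows. The function names and error messages here are assumptions made for illustration; the real trainer's messages and internals may differ.

```python
def state_value_sizes(state, observe_state):
    """Map each observed key to the number of observation slots it fills."""
    sizes = {}
    for key in observe_state:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} is missing from game.state")
        value = state[key]
        sizes[key] = len(value) if isinstance(value, (list, tuple)) else 1
    return sizes


def check_consistent(expected, state, observe_state):
    """Raise if the observed values no longer match the recorded sizes."""
    actual = state_value_sizes(state, observe_state)
    if actual != expected:
        raise ValueError(
            f"observation shape changed: expected {expected}, got {actual}"
        )


keys = ["apple_dx", "apple_dy"]
# Sizes recorded from the first episode's initial state (hypothetical values).
expected = state_value_sizes({"apple_dx": 0.5, "apple_dy": -0.25}, keys)

# A later episode with the same keys and sizes passes silently.
check_consistent(expected, {"apple_dx": -1.0, "apple_dy": 0.0}, keys)

# A later episode that forgets to initialize a key is rejected.
try:
    check_consistent(expected, {"apple_dx": -1.0}, keys)
except KeyError as err:
    print("stopped:", err)
```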
|
||||
|
||||
This is a good place to ask: *can a human player see this information?*
|
||||
The apple's location is visible on screen; the normalized distance vector
|
||||
is not. Whether that asymmetry is appropriate is a design choice worth
|
||||
examining.
|
||||
|
||||
Once you have written this section, create the training run directory:
|
||||
|
||||
@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
|
||||
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
|
||||
reward : score
|
||||
characters : ['@', '*', '>', '<', '^', 'v']
|
||||
architecture: CNN (spatial)
|
||||
architecture: MLP
|
||||
|
||||
``retro-gamer create`` reads your game metadata directly from
|
||||
``pyproject.toml`` and writes it—along with all hyperparameters—to
|
||||
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
Training for 1000 episodes…
|
||||
Done. Checkpoints in runs/snake/checkpoints/
|
||||
100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
|
||||
Done. Checkpoints saved in runs/snake/checkpoints/
|
||||
|
||||
Training saves checkpoints every 100 episodes and a ``final.pt``
|
||||
checkpoint when complete. You can follow progress in the training log:
|
||||
A progress bar shows how far training has gone, along with the most
|
||||
recent episode's reward, the current exploration rate (``eps``), and
|
||||
the average prediction error (``loss``).
|
||||
|
||||
Training saves a checkpoint every 100 episodes to
|
||||
``runs/snake/checkpoints/``. You can stop training at any time with
|
||||
Ctrl-C and resume it later—the next ``retro-gamer train`` command will
|
||||
automatically pick up from the latest checkpoint.
|
||||
|
||||
Reading the training log
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For a longer view of how training is progressing, inspect the training
|
||||
log:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% tail -f runs/snake/training.log
|
||||
% cat runs/snake/training.log
|
||||
|
||||
The log shows one line per episode:
|
||||
The log begins with the full network architecture, followed by one line
|
||||
per checkpoint (every 100 episodes):
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
[EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
|
||||
[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
|
||||
[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
|
||||
[ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
|
||||
[ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
|
||||
[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
|
||||
[ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
|
||||
[ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
|
||||
|
||||
- **total_reward**: the total score earned during the episode (how many
|
||||
apples the snake ate, for Snake).
|
||||
- **steps**: how many turns the episode lasted.
|
||||
- **epsilon**: the current exploration rate. Early in training this is
|
||||
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
|
||||
- **avg_loss**: the average temporal-difference error across training
|
||||
steps in this episode. A decreasing loss generally indicates that the
|
||||
Q-value estimates are converging.
|
||||
Here is what each field means:
|
||||
|
||||
Resuming training
|
||||
~~~~~~~~~~~~~~~~~
|
||||
- **avg_reward**: Average total reward per episode over the past 100 episodes.
|
||||
Positive values mean the agent is accumulating reward; negative values mean
|
||||
it is accumulating penalties. An upward trend over time is the main signal
|
||||
that learning is working.
|
||||
- **avg_steps**: Average number of turns per episode. If episodes are ending
|
||||
quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
|
||||
generally indicate the agent is surviving longer.
|
||||
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
|
||||
and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
|
||||
behavior is expected.
|
||||
- **avg_loss**: Average Huber loss across training steps. Huber loss is
|
||||
quadratic for small prediction errors and linear for large ones, which keeps
|
||||
it stable even when rewards have a wide range (such as a large bonus for
|
||||
reaching a goal). Values in the range 0–10 are typical for most games.
|
||||
A slow downward trend is the healthy pattern. A loss that grows without bound
|
||||
indicates the learning rate is too high.
|
||||
- **time**: Wall-clock time for this 100-episode interval.
|
||||
- **total**: Cumulative training time across all sessions.
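
To make the ``avg_loss`` description concrete, here is a minimal sketch of the Huber loss for a single prediction error. The threshold ``delta=1.0`` is an assumption; the value used by the trainer is not stated in this guide.

```python
def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)


print(huber_loss(0.5))   # 0.125 (quadratic region)
print(huber_loss(10.0))  # 9.5   (linear region; squared error would give 50.0)
```

The linear region is what keeps a single large error, such as the one following a +50 apple bonus, from dominating the average.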
|
||||
|
||||
Training can be resumed from a checkpoint:
|
||||
When training is resumed after a stop, a header line marks the break::
|
||||
|
||||
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
|
||||
|
||||
This lets you track exactly when each session took place.
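
If you want to analyze the log programmatically, a sketch like the one below may help. It assumes the line layout shown in the sample log in this section; the regular expression and the ``parse_log`` helper are illustrative and not part of ``retro-gamer``.

```python
import re

# Matches checkpoint lines such as:
# [ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 ...
LINE_RE = re.compile(
    r"\[(?P<checkpoint>ep_\d+)\]\s+ep=\d+-\d+\s+"
    r"avg_reward=\s*(?P<avg_reward>[+-]?\d+(?:\.\d+)?)\s+"
    r"avg_steps=(?P<avg_steps>\d+(?:\.\d+)?)\s+"
    r"epsilon=(?P<epsilon>\d+(?:\.\d+)?)\s+"
    r"avg_loss=(?P<avg_loss>\d+(?:\.\d+)?)"
)


def parse_log(text):
    """Return one dict per checkpoint line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # architecture dump, resume headers, blank lines
        rows.append({
            "checkpoint": match["checkpoint"],
            "avg_reward": float(match["avg_reward"]),
            "avg_steps": float(match["avg_steps"]),
            "epsilon": float(match["epsilon"]),
            "avg_loss": float(match["avg_loss"]),
        })
    return rows


sample = """\
[ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
[ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
"""

rows = parse_log(sample)
print([row["avg_reward"] for row in rows])  # [8.7, 22.3]
```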
|
||||
|
||||
Stopping training to watch the agent play
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You do not need to wait for training to finish before watching the
|
||||
agent. Training can be stopped at any time with Ctrl-C, and the latest
|
||||
checkpoint is always available immediately:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
|
||||
% retro-gamer play runs/snake/
|
||||
|
||||
Watching a trained agent play
|
||||
------------------------------
|
||||
This loads the most recent checkpoint and runs the agent in your
|
||||
terminal. Press Enter or Escape to quit.
|
||||
|
||||
To watch a trained agent play the game in your terminal:
|
||||
.. note::
|
||||
|
||||
.. code-block:: console
|
||||
The game is rendered directly in your terminal. If the window is
|
||||
smaller than the board plus borders, ``retro-gamer play`` will raise
|
||||
a ``TerminalTooSmall`` error — enlarge the terminal window and try
|
||||
again.
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint final
|
||||
|
||||
You can substitute any checkpoint name:
|
||||
To watch an earlier stage of training, use ``--checkpoint``:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0100
|
||||
|
||||
Press Enter or Escape to quit.
|
||||
Comparing what the agent at episode 100 does versus the agent at episode
|
||||
500 can reveal exactly what the agent has (and has not) learned. For
|
||||
Snake, you might notice the episode-100 agent moving somewhat randomly,
|
||||
while the episode-500 agent consistently navigates toward the apple.
|
||||
Articulating *why* the later agent behaves differently—what the training
|
||||
process produced—connects observation directly to the concepts underlying
|
||||
DQN.
|
||||
|
||||
Comparing agents trained at different checkpoints is a useful activity:
|
||||
the agent at episode 100 has learned *something*, but typically much
|
||||
less than the agent at episode 500. Articulating *what* the earlier
|
||||
agent has and has not learned, and *why*, is productive reasoning about
|
||||
the training process.
|
||||
Resuming training after watching
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
After watching the agent play, resume training with exactly the same
|
||||
command you used before:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
|
||||
``retro-gamer`` automatically detects and resumes from the latest
|
||||
checkpoint. No extra flags are needed. If all configured episodes have
|
||||
already been completed, it prints a message and exits:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
Training already complete (1000 episodes). To keep training,
|
||||
increase training_episodes in config.toml.
|
||||
|
||||
To continue training, open ``runs/snake/config.toml``, increase the
|
||||
``training_episodes`` value, and run ``retro-gamer train`` again.
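
The resume behavior can be pictured with the sketch below. It assumes checkpoints are named ``ep_NNNN.pt`` as in the examples in this guide; ``latest_checkpoint`` and ``episodes_remaining`` are hypothetical helpers, not the tool's actual implementation.

```python
import re


def latest_checkpoint(names):
    """Return (episode, filename) for the highest-numbered checkpoint, or None."""
    best = None
    for name in names:
        match = re.fullmatch(r"ep_(\d+)\.pt", name)
        if match is None:
            continue  # ignore files that are not episode checkpoints
        episode = int(match.group(1))
        if best is None or episode > best[0]:
            best = (episode, name)
    return best


def episodes_remaining(latest_episode, training_episodes):
    """Number of episodes still to run; 0 means training is already complete."""
    return max(0, training_episodes - latest_episode)


names = ["ep_0100.pt", "ep_0500.pt", "ep_0200.pt", "notes.txt"]
episode, filename = latest_checkpoint(names)
print(filename)                           # ep_0500.pt
print(episodes_remaining(episode, 1000))  # 500
print(episodes_remaining(1000, 1000))     # 0
```

Raising ``training_episodes`` in ``config.toml`` is what turns the last case back into a positive number of remaining episodes.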
|
||||
|
||||
Watching a trained agent play
|
||||
------------------------------
|
||||
|
||||
Once training is complete, watch the final agent:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/
|
||||
|
||||
By default the latest checkpoint is loaded. You can also compare the
|
||||
agent's performance at different stages of training:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0100
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0500
|
||||
|
||||
Press Enter or Escape to quit.
|
||||
|
||||
Inspecting a run
|
||||
----------------
|
||||
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer info runs/snake/
|
||||
Game module : retro.examples.snake
|
||||
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
|
||||
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
|
||||
Game module : retro.examples.snake
|
||||
Metadata : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
|
||||
Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
|
||||
Model : {'hidden_sizes': [128, 64]}
|
||||
Training : {'learning_rate': 0.0001, 'gamma': 0.99, ...}
|
||||
|
||||
Last 5 episodes:
|
||||
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
|
||||
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
|
||||
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
|
||||
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
|
||||
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
|
||||
Last 5 checkpoints:
|
||||
[ep_0600] ep=0501-0600 avg_reward=+12.1 ...
|
||||
[ep_0700] ep=0601-0700 avg_reward=+14.8 ...
|
||||
[ep_0800] ep=0701-0800 avg_reward=+16.3 ...
|
||||
[ep_0900] ep=0801-0900 avg_reward=+19.0 ...
|
||||
[ep_1000] ep=0901-1000 avg_reward=+22.3 ...
|
||||
|
||||
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
|
||||
Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']
|
||||
|
||||
Adjusting hyperparameters
|
||||
--------------------------
|
||||
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
|
||||
create``. Common adjustments and their effects:
|
||||
|
||||
**``training_episodes``** — How long to train. More episodes give the
|
||||
agent more time to learn, but also take longer to run.
|
||||
agent more time to learn, but also take longer to run. This is always
|
||||
safe to increase.
|
||||
|
||||
**``epsilon_decay``** — How quickly exploration decreases. A faster
|
||||
decay (smaller ``epsilon_decay``) means the agent commits to its early
|
||||
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
|
||||
means the agent values long-term consequences; closer to 0 makes the
|
||||
agent focus on immediate reward.
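
The effect of ``gamma`` can be seen by computing the discounted return of a short reward sequence. This is a generic illustration of discounting with made-up rewards, not code taken from the trainer.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode fragment: three small penalties, then an apple.
rewards = [-1, -1, -1, 50]

print(round(discounted_return(rewards, 0.99), 2))  # 45.54
print(discounted_return(rewards, 0.5))             # 4.5
print(discounted_return(rewards, 0.0))             # -1.0
```

With ``gamma`` near 1 the distant apple dominates the value of the current move; with ``gamma`` at 0 only the immediate penalty counts.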
|
||||
|
||||
**``n_layers`` and ``layer_size``** — The depth and width of the MLP
|
||||
head. Larger networks can represent more complex Q-functions but are
|
||||
slower to train and may overfit.
|
||||
**``hidden_sizes``** — The shape of the MLP head as a list of layer
|
||||
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
|
||||
more complex Q-functions but are slower to train and may overfit.
|
||||
|
||||
**``prioritize_experiences``** — Whether to use prioritized experience
|
||||
replay. This often improves sample efficiency but is slightly slower
|
||||
per step.
|
||||
|
||||
.. _incompatible-changes:
|
||||
|
||||
Why some changes require starting fresh
|
||||
----------------------------------------
|
||||
|
||||
Not all changes to ``config.toml`` are equal. Some can be applied
|
||||
immediately to an existing training run; others make the existing
|
||||
checkpoints unusable.
|
||||
|
||||
**Safe to change at any time** (``[training]`` section) — These affect
|
||||
*how* the agent learns, not *what* it is learning to do. Existing
|
||||
checkpoints remain valid:
|
||||
|
||||
- ``training_episodes``, ``max_turns_per_episode``
|
||||
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
|
||||
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
|
||||
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
|
||||
- ``target_update_freq``, ``train_every``
|
||||
|
||||
**Requires starting fresh** — These changes alter the shape of the
|
||||
game or the shape of the network. The saved model weights are
|
||||
incompatible with the new configuration:
|
||||
|
||||
- ``actions``, ``reward``, ``character_set``, ``board_size``
|
||||
(``[metadata]``) — These define what the agent perceives and what it
|
||||
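can do. Changing ``actions``, ``character_set``, or ``board_size`` changes
the size of the network's input or output layers, and changing ``reward``
changes the quantity the network was trained to predict; either way the
existing weights cannot be reused.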
|
||||
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
|
||||
``egocentric``, ``egocentric_player``, ``egocentric_radius``
|
||||
(``[preprocessing]``) — These control how the observation is
|
||||
constructed. Any change here alters the input shape or meaning and
|
||||
makes existing weights invalid.
|
||||
- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
|
||||
layers. Changing it changes the shape of the network; the existing
|
||||
weights no longer fit.
|
||||
|
||||
If you try to resume training after making one of these changes,
|
||||
``retro-gamer train`` detects the mismatch and stops with a clear
|
||||
explanation, for example::
|
||||
|
||||
Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
|
||||
|
||||
The following changes require starting fresh. The existing model was
|
||||
trained on a different problem and its weights cannot be reused:
|
||||
|
||||
character_set
|
||||
was : ['@', '*', '>', '<', '^', 'v']
|
||||
now : ['@', '*', '>', '<', '^', 'v', '#']
|
||||
why : the set of board characters (changes input layer size)
|
||||
|
||||
Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
|
||||
training log, then run 'retro-gamer train RUN_DIR' to start fresh.
|
||||
|
||||
To clear out the old checkpoints and begin again:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer clean runs/snake/
|
||||
Will remove 5 checkpoint(s) and training log from runs/snake/:
|
||||
checkpoints/ep_0100.pt
|
||||
checkpoints/ep_0200.pt
|
||||
...
|
||||
training.log
|
||||
|
||||
Proceed? [y/N]: y
|
||||
Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
|
||||
|
||||
The ``config.toml`` is always preserved so you do not need to run
|
||||
``retro-gamer create`` again.
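
The distinction drawn in this section can be sketched as a comparison between the configuration stored with a checkpoint and the current one. The key lists follow the groups named in this section; the function and the dictionary layout are assumptions made for illustration.

```python
# Keys whose change invalidates existing weights, grouped by config section.
STRUCTURAL_KEYS = {
    "metadata": ["actions", "reward", "character_set", "board_size"],
    "preprocessing": [
        "spatial", "board", "observe_state", "observe_state_sizes",
        "egocentric", "egocentric_player", "egocentric_radius",
    ],
    "model": ["hidden_sizes"],
}


def incompatible_changes(saved, current):
    """List (section, key, was, now) for every structural key that differs."""
    changes = []
    for section, keys in STRUCTURAL_KEYS.items():
        for key in keys:
            was = saved.get(section, {}).get(key)
            now = current.get(section, {}).get(key)
            if was != now:
                changes.append((section, key, was, now))
    return changes


saved = {
    "metadata": {"character_set": ["@", "*", ">", "<", "^", "v"]},
    "training": {"learning_rate": 0.001},
}
current = {
    "metadata": {"character_set": ["@", "*", ">", "<", "^", "v", "#"]},
    "training": {"learning_rate": 0.0001},  # safe: not a structural key
}

for section, key, was, now in incompatible_changes(saved, current):
    print(f"{key}\n  was : {was}\n  now : {now}")
```

Only ``character_set`` is reported here; the ``learning_rate`` change lives in ``[training]`` and is ignored by the comparison.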
|
||||
|
||||
Reasoning about training from the log
|
||||
--------------------------------------
|
||||
|
||||
The training log is one of the most useful tools for understanding what
|
||||
is happening during training. Here are some patterns to look for and
|
||||
what they mean.
|
||||
|
||||
**Reward increasing steadily** is the normal, healthy pattern. Each
|
||||
checkpoint block should show a higher ``avg_reward`` than the last.
|
||||
The rate of increase typically slows as training progresses.
|
||||
|
||||
**Reward flat or negative through early episodes** is normal. Early in
|
||||
training, ``epsilon`` is high and the agent is mostly acting randomly.
|
||||
It has not yet discovered effective strategies. Patience—and a look at
|
||||
the ``epsilon`` column—will confirm whether this is just the exploration
|
||||
phase.
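
To see why a high ``epsilon`` produces erratic behavior, consider this sketch of epsilon-greedy action selection. The multiplicative decay shown is a common scheme and is an assumption here; this guide only states that ``epsilon`` decays toward ``epsilon_min``.

```python
import random


def choose_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decay_epsilon(epsilon, epsilon_decay, epsilon_min):
    """Shrink epsilon by a constant factor, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3, 0.2]  # hypothetical Q-values for four actions

# With epsilon = 0 the agent always takes the greedy action (index 1 here).
print(choose_action(q_values, 0.0, rng))  # 1

# With epsilon = 1 every choice is random, regardless of the Q-values.
print([choose_action(q_values, 1.0, rng) for _ in range(5)])
```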
|
||||
|
||||
**Loss decreasing** is also healthy. As the Q-network's estimates
|
||||
improve, the difference between predicted and target Q-values (the TD
|
||||
error) should shrink. A loss that levels off at a low value is usually a
|
||||
good sign.
|
||||
|
||||
**Loss growing without bound** indicates the learning rate is too high.
|
||||
The trainer uses Huber loss, which is robust to large reward scales, but
|
||||
a learning rate above roughly ``0.001`` can still destabilize training.
|
||||
Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
|
||||
and restarting training.
|
||||
|
||||
**Short episodes (low ``avg_steps``)** combined with low reward
|
||||
suggest the agent is dying frequently. Early in training this is
|
||||
normal. If it persists late in training, the agent may have settled on
|
||||
a bad policy—consider extending training or adjusting
|
||||
``epsilon_decay`` to explore longer.
|
||||
|
||||
**Reward that improves and then regresses** can indicate that the
|
||||
agent has discovered a suboptimal but consistent strategy and is stuck.
|
||||
Increasing ``epsilon_min`` to keep some exploration active, or
|
||||
adjusting the reward signal to better differentiate good moves from
|
||||
bad ones, can help.
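
Spotting the "improves and then regresses" pattern by eye gets harder as the log grows. The sketch below flags checkpoints whose ``avg_reward`` falls noticeably below the best value seen so far. The helper, the tolerance, and the sample values are all hypothetical.

```python
def regressions(avg_rewards, tolerance=0.0):
    """Return indices where reward is more than `tolerance` below the running best."""
    flagged = []
    best = float("-inf")
    for index, reward in enumerate(avg_rewards):
        if reward < best - tolerance:
            flagged.append(index)
        best = max(best, reward)
    return flagged


# Hypothetical avg_reward values, one per checkpoint.
history = [-31.4, -18.6, -4.1, 8.7, 3.2, 9.5]
print(regressions(history, tolerance=2.0))  # [4]
```

A single flagged checkpoint is often just noise; several in a row are a stronger hint that the agent has settled into a worse strategy.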
|
||||
|
||||
Questions for investigation
|
||||
----------------------------
|
||||
|
||||
@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
|
||||
episode 1000 and watch each play the same game. What has the later
|
||||
agent learned that the earlier one has not? How would you describe
|
||||
this difference to someone who does not know about neural networks?
|
||||
|
||||
|
||||