Updates across the board
@@ -21,9 +21,9 @@ You will need:
|
||||
Preparing your game
|
||||
-------------------
|
||||
|
||||
-``retro-gamer`` loads your game by importing a Python module and
-calling a function named ``create_game``. The ``create_game`` function
-must take no arguments and return a new ``Game`` instance.
+``retro-gamer`` loads your game by calling a function named
+``create_game``. The function must take no arguments and return a new
+``Game`` instance.
|
||||
|
||||
Here is the ``create_game`` function for Snake:
|
||||
|
||||
@@ -32,12 +32,20 @@ Here is the ``create_game`` function for Snake:
|
||||
     def create_game():
         head = SnakeHead()
         apple = Apple()
-        game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
+        game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
         apple.relocate(game)
         return game
|
||||
|
||||
-If your game module does not already have a ``create_game`` function,
-add one following this pattern.
+If your game file does not already have a ``create_game`` function, add
+one following this pattern.
|
||||
|
||||
When you run ``retro-gamer create``, you can point to your game file
|
||||
directly by path or by Python module name:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer create --game my_game.py --output runs/my_game/
|
||||
% retro-gamer create --game retro.examples.snake --output runs/snake/
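The two ways of pointing at a game can be sketched with the standard ``importlib`` machinery. This is only an illustration of the idea, not the actual ``retro-gamer`` loader: the helper name ``load_create_game`` and the rule "anything ending in ``.py`` is a path" are assumptions made for the example.

```python
import importlib
import importlib.util
from pathlib import Path


def load_create_game(target):
    """Return the create_game callable from a file path or a dotted module name.

    Hypothetical helper: the ".py means path" rule is an assumption.
    """
    if target.endswith(".py"):
        # Treat the target as a file path and import it under its stem name.
        path = Path(target)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Otherwise treat the target as an importable module name.
        module = importlib.import_module(target)

    create_game = getattr(module, "create_game", None)
    if not callable(create_game):
        raise AttributeError(f"{target} does not define a create_game() function")
    return create_game
```

Calling ``load_create_game("my_game.py")()`` would then build a fresh game, which matches the requirement that ``create_game`` take no arguments.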
|
||||
|
||||
|
||||
Describing your game
|
||||
@@ -57,8 +65,6 @@ Here is the ``[tool.retro-gamer]`` section for the Snake example:
|
||||
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
|
||||
reward = "score"
|
||||
character_set = ["@", "*", ">", "<", "^", "v"]
|
||||
spatial = true
|
||||
observe_state = []
|
||||
|
||||
Let's go through each field.
|
||||
|
||||
@@ -80,9 +86,10 @@ implicitly has access to a no-op (doing nothing).
|
||||
|
||||
The key in the game's state dictionary to use as the reward signal.
|
||||
``retro-gamer`` computes the reward for each turn as the *change* in
|
||||
this value from one turn to the next. For Snake, score increases by 1
|
||||
(or more) each time the apple is eaten, so the agent receives a reward
|
||||
of 1 when it eats an apple and 0 otherwise.
|
||||
this value from one turn to the next. For Snake, the score changes when
|
||||
the snake eats an apple (+50), when it moves away from the apple (−1),
|
||||
and when it dies (−10). These incremental changes are what the agent
|
||||
tries to maximize.
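As a rough illustration of "reward as the change in a state value", the sketch below takes the difference of the ``score`` key between consecutive turns. The function name ``turn_reward`` and the sample score sequence are made up for the example; only the reward sizes (+50, −1, −10) and the starting score of 100 come from the text above.

```python
def turn_reward(previous_state, current_state, reward_key="score"):
    """Reward for one turn: how much the reward key changed during that turn."""
    return current_state[reward_key] - previous_state[reward_key]


# Hypothetical score trace: two moves away from the apple, an apple eaten, a death.
scores = [100, 99, 98, 148, 138]
rewards = [
    turn_reward({"score": before}, {"score": after})
    for before, after in zip(scores, scores[1:])
]
print(rewards)  # [-1, -1, 50, -10]
```

Note that the absolute score never reaches the agent in this scheme, only the per-turn differences.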
|
||||
|
||||
Choosing an appropriate reward is one of the most consequential
|
||||
decisions in RL. Some considerations:
|
||||
@@ -115,15 +122,48 @@ phase before training to discover which characters actually appear.
|
||||
The number of exploration turns is controlled by the
|
||||
``exploration_turns`` hyperparameter.
|
||||
|
||||
``spatial``
|
||||
~~~~~~~~~~~
|
||||
``spatial`` and other preprocessing options
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Whether to treat the board as a spatial scene (default: ``true``). A
|
||||
spatial game uses a *convolutional neural network* (CNN) that can
|
||||
detect patterns in the relative arrangement of characters. A
|
||||
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
|
||||
ignores positional relationships. Set to ``false`` for games where
|
||||
position is irrelevant.
|
||||
The ``[tool.retro-gamer]`` section describes the game. Preprocessing
|
||||
options—such as ``spatial`` (whether to use a CNN or MLP, default:
|
||||
``false``), ``egocentric``, and ``observe_state``—live in the
|
||||
``[preprocessing]`` section of the generated ``config.toml``. You can
|
||||
edit them there after running ``retro-gamer create``.
|
||||
|
||||
``observe_state``
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default the agent only sees the board. You can also give it access
|
||||
to computed values from ``game.state`` by listing the relevant keys in
|
||||
the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``.
|
||||
For example, Snake exposes the normalized direction to the apple:
|
||||
|
||||
.. code-block:: toml
|
||||
|
||||
[preprocessing]
|
||||
observe_state = ["apple_dx", "apple_dy"]
|
||||
|
||||
The trainer appends these values to the observation vector after the
|
||||
board encoding (or uses them as the entire observation when
|
||||
``board = false``).
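One way to picture "board encoding followed by state values" is the sketch below. It assumes a one-hot encoding per board cell over ``character_set`` (blank cells become all zeros); the real trainer may encode the board differently, and ``encode_observation`` is a hypothetical name.

```python
def encode_observation(board_rows, character_set, state, observe_state):
    """Flatten a board into a one-hot vector, then append observed state values.

    Assumption: each cell is one-hot over character_set; unknown or blank
    cells encode as all zeros.
    """
    vector = []
    for row in board_rows:
        for cell in row:
            vector.extend(1.0 if cell == ch else 0.0 for ch in character_set)
    # Observed state values go after the board encoding.
    vector.extend(float(state[key]) for key in observe_state)
    return vector


observation = encode_observation(
    board_rows=[" @", "* "],
    character_set=["@", "*"],
    state={"score": 100, "apple_dx": 0.5, "apple_dy": -1.0},
    observe_state=["apple_dx", "apple_dy"],
)
print(len(observation))  # 10
```

Under this assumption the vector length is ``cells * len(character_set) + len(observe_state)``, so adding a character or an observed key changes the input size.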
|
||||
|
||||
These values must be set in ``game.state`` at the start of every
|
||||
episode—typically inside ``create_game()``—and must keep the same
|
||||
type and length from episode to episode.
|
||||
|
||||
.. warning::
|
||||
|
||||
Always initialize every key listed in ``observe_state`` before the
|
||||
game starts. If a key is missing or its length changes between
|
||||
episodes, training stops immediately with a clear error explaining
|
||||
what changed. The neural network's input size is fixed when training
|
||||
begins; it cannot adapt to a changing observation shape mid-run.
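If you want to catch this problem in your own game before training, a check along the following lines may help. It is a sketch, not the trainer's actual validation: ``state_signature`` and ``check_same_shape`` are hypothetical helpers, and treating scalars as length 1 is an assumption.

```python
def state_signature(state, observe_state):
    """Map each observed key to (type name, length); scalars count as length 1."""
    signature = {}
    for key in observe_state:
        if key not in state:
            raise KeyError(f"observe_state key {key!r} is missing from game.state")
        value = state[key]
        length = len(value) if isinstance(value, (list, tuple)) else 1
        signature[key] = (type(value).__name__, length)
    return signature


def check_same_shape(first, later):
    """Raise if any observed key changed type or length between two episodes."""
    changed = sorted(key for key in first if first[key] != later.get(key))
    if changed:
        raise ValueError(f"observed state changed shape for: {changed}")


first = state_signature({"apple_dx": 0.5, "apple_dy": -1.0}, ["apple_dx", "apple_dy"])
later = state_signature({"apple_dx": 0.0, "apple_dy": 1.0}, ["apple_dx", "apple_dy"])
check_same_shape(first, later)  # passes: same types and lengths
```

Running such a check on the state returned by two separate ``create_game()`` calls is a cheap way to confirm that every key is initialized consistently.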
|
||||
|
||||
This is a good place to ask: *can a human player see this information?*
|
||||
The apple's location is visible on screen; the normalized distance vector
|
||||
is not. Whether that asymmetry is appropriate is a design choice worth
|
||||
examining.
|
||||
|
||||
Once you have written this section, create the training run directory:
|
||||
|
||||
@@ -139,7 +179,7 @@ Once you have written this section, create the training run directory:
|
||||
actions : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
|
||||
reward : score
|
||||
characters : ['@', '*', '>', '<', '^', 'v']
|
||||
architecture: CNN (spatial)
|
||||
architecture: MLP
|
||||
|
||||
``retro-gamer create`` reads your game metadata directly from
|
||||
``pyproject.toml`` and writes it—along with all hyperparameters—to
|
||||
@@ -153,64 +193,141 @@ With the ``config.toml`` in place, start training:
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
Training for 1000 episodes…
|
||||
Done. Checkpoints in runs/snake/checkpoints/
|
||||
100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
|
||||
Done. Checkpoints saved in runs/snake/checkpoints/
|
||||
|
||||
Training saves checkpoints every 100 episodes and a ``final.pt``
|
||||
checkpoint when complete. You can follow progress in the training log:
|
||||
A progress bar shows how far training has gone, along with the most
|
||||
recent episode's reward, the current exploration rate (``eps``), and
|
||||
the average prediction error (``loss``).
|
||||
|
||||
Training saves a checkpoint every 100 episodes to
|
||||
``runs/snake/checkpoints/``. You can stop training at any time with
|
||||
Ctrl-C and resume it later—the next ``retro-gamer train`` command will
|
||||
automatically pick up from the latest checkpoint.
|
||||
|
||||
Reading the training log
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For a longer view of how training is progressing, inspect the training
|
||||
log:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% tail -f runs/snake/training.log
|
||||
% cat runs/snake/training.log
|
||||
|
||||
The log shows one line per episode:
|
||||
The log begins with the full network architecture, followed by one line
|
||||
per checkpoint (every 100 episodes):
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
-    [EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
-    [EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
-    [EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
+    [ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
+    [ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
+    [ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
+    [ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
+    [ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
|
||||
|
||||
- **total_reward**: the total score earned during the episode (how many
|
||||
apples the snake ate, for Snake).
|
||||
- **steps**: how many turns the episode lasted.
|
||||
- **epsilon**: the current exploration rate. Early in training this is
|
||||
close to 1 (mostly random actions); it decays toward ``epsilon_min``.
|
||||
- **avg_loss**: the average temporal-difference error across training
|
||||
steps in this episode. A decreasing loss generally indicates that the
|
||||
Q-value estimates are converging.
|
||||
Here is what each field means:
|
||||
|
||||
Resuming training
|
||||
~~~~~~~~~~~~~~~~~
|
||||
- **avg_reward**: Average total reward per episode over the past 100 episodes.
|
||||
Positive values mean the agent is accumulating reward; negative values mean
|
||||
it is accumulating penalties. An upward trend over time is the main signal
|
||||
that learning is working.
|
||||
- **avg_steps**: Average number of turns per episode. If episodes are ending
|
||||
quickly (small ``avg_steps``), the agent may be dying often. Longer episodes
|
||||
generally indicate the agent is surviving longer.
|
||||
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random)
|
||||
and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic
|
||||
behavior is expected.
|
||||
- **avg_loss**: Average Huber loss across training steps. Huber loss is
|
||||
quadratic for small prediction errors and linear for large ones, which keeps
|
||||
it stable even when rewards have a wide range (such as a large bonus for
|
||||
reaching a goal). Values in the range 0–10 are typical for most games.
|
||||
A slow downward trend is the healthy pattern. A loss that grows without bound
|
||||
indicates the learning rate is too high.
|
||||
- **time**: Wall-clock time for this 100-episode interval.
|
||||
- **total**: Cumulative training time across all sessions.
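To make the "quadratic for small errors, linear for large ones" behavior of Huber loss concrete, here is the standard formula for a single prediction error. The threshold ``delta=1.0`` is an assumption for illustration; the value the trainer actually uses is not stated here.

```python
def huber_loss(error, delta=1.0):
    """Huber loss for one prediction error (delta=1.0 is an assumed threshold)."""
    abs_error = abs(error)
    if abs_error <= delta:
        # Small errors: behaves like half the squared error.
        return 0.5 * error ** 2
    # Large errors: grows linearly, so one big reward cannot dominate.
    return delta * (abs_error - 0.5 * delta)


print(huber_loss(0.5))   # 0.125
print(huber_loss(50.0))  # 49.5
```

For comparison, half the squared error for an error of 50 would be 1250, which is why a large bonus such as the +50 apple reward is much less disruptive under Huber loss.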
|
||||
|
||||
Training can be resumed from a checkpoint:
|
||||
When training is resumed after a stop, a header line marks the break::
|
||||
|
||||
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
|
||||
|
||||
This lets you track exactly when each session took place.
|
||||
|
||||
Stopping training to watch the agent play
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You do not need to wait for training to finish before watching the
|
||||
agent. Training can be stopped at any time with Ctrl-C, and the latest
|
||||
checkpoint is always available immediately:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
|
||||
% retro-gamer play runs/snake/
|
||||
|
||||
Watching a trained agent play
|
||||
------------------------------
|
||||
This loads the most recent checkpoint and runs the agent in your
|
||||
terminal. Press Enter or Escape to quit.
|
||||
|
||||
To watch a trained agent play the game in your terminal:
|
||||
.. note::
|
||||
|
||||
.. code-block:: console
|
||||
The game is rendered directly in your terminal. If the window is
|
||||
smaller than the board plus borders, ``retro-gamer play`` will raise
|
||||
a ``TerminalTooSmall`` error — enlarge the terminal window and try
|
||||
again.
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint final
|
||||
|
||||
You can substitute any checkpoint name:
|
||||
To watch an earlier stage of training, use ``--checkpoint``:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0100
|
||||
|
||||
Press Enter or Escape to quit.
|
||||
Comparing the agent's behavior at episode 100 with its behavior at episode
500 can reveal what the agent has (and has not) learned. For
|
||||
Snake, you might notice the episode-100 agent moving somewhat randomly,
|
||||
while the episode-500 agent consistently navigates toward the apple.
|
||||
Articulating *why* the later agent behaves differently—what the training
|
||||
process produced—connects observation directly to the concepts underlying
|
||||
DQN.
|
||||
|
||||
Comparing agents trained at different checkpoints is a useful activity:
|
||||
the agent at episode 100 has learned *something*, but typically much
|
||||
less than the agent at episode 500. Articulating *what* the earlier
|
||||
agent has and has not learned, and *why*, is productive reasoning about
|
||||
the training process.
|
||||
Resuming training after watching
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
After watching the agent play, resume training with exactly the same
|
||||
command you used before:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer train runs/snake/
|
||||
|
||||
``retro-gamer`` automatically detects and resumes from the latest
|
||||
checkpoint. No extra flags are needed. If all configured episodes have
|
||||
already been completed, it prints a message and exits:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
Training already complete (1000 episodes). To keep training,
|
||||
increase training_episodes in config.toml.
|
||||
|
||||
To continue training, open ``runs/snake/config.toml``, increase the
|
||||
``training_episodes`` value, and run ``retro-gamer train`` again.
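The resume logic described above can be pictured roughly as follows. This is a sketch under assumptions: it supposes that checkpoints are named ``ep_NNNN.pt`` (as in the examples in this guide) and that the episode number in the filename is what decides where training picks up. ``latest_checkpoint`` and ``episodes_remaining`` are hypothetical helpers, not part of ``retro-gamer``.

```python
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"ep_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return (episode, path) for the highest-numbered ep_NNNN.pt file, or None."""
    best = None
    for path in Path(checkpoint_dir).glob("ep_*.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            episode = int(match.group(1))
            if best is None or episode > best[0]:
                best = (episode, path)
    return best


def episodes_remaining(checkpoint_dir, training_episodes):
    """How many episodes are left, given the configured training_episodes."""
    latest = latest_checkpoint(checkpoint_dir)
    completed = latest[0] if latest else 0
    return max(training_episodes - completed, 0)
```

With ``ep_1000.pt`` present and ``training_episodes = 1000``, ``episodes_remaining`` returns 0, which corresponds to the "already complete" case; raising ``training_episodes`` makes it positive again.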
|
||||
|
||||
Watching a trained agent play
|
||||
------------------------------
|
||||
|
||||
Once training is complete, watch the final agent:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/
|
||||
|
||||
By default the latest checkpoint is loaded. You can also compare the
|
||||
agent's performance at different stages of training:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0100
|
||||
% retro-gamer play runs/snake/ --checkpoint ep_0500
|
||||
|
||||
Press Enter or Escape to quit.
|
||||
|
||||
Inspecting a run
|
||||
----------------
|
||||
@@ -220,18 +337,20 @@ To review the configuration and recent training progress for a run:
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer info runs/snake/
|
||||
Game module : retro.examples.snake
|
||||
Metadata : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
|
||||
Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
|
||||
Game module : retro.examples.snake
|
||||
Metadata : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
|
||||
Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
|
||||
Model : {'hidden_sizes': [128, 64]}
|
||||
Training : {'learning_rate': 0.0001, 'gamma': 0.99, ...}
|
||||
|
||||
Last 5 episodes:
|
||||
[EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
|
||||
[EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
|
||||
[EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
|
||||
[EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
|
||||
[EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
|
||||
Last 5 checkpoints:
|
||||
[ep_0600] ep=0501-0600 avg_reward=+12.1 ...
|
||||
[ep_0700] ep=0601-0700 avg_reward=+14.8 ...
|
||||
[ep_0800] ep=0701-0800 avg_reward=+16.3 ...
|
||||
[ep_0900] ep=0801-0900 avg_reward=+19.0 ...
|
||||
[ep_1000] ep=0901-1000 avg_reward=+22.3 ...
|
||||
|
||||
Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
|
||||
Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']
|
||||
|
||||
Adjusting hyperparameters
|
||||
--------------------------
|
||||
@@ -241,7 +360,8 @@ before training, or by passing them as options to ``retro-gamer
|
||||
create``. Common adjustments and their effects:
|
||||
|
||||
**``training_episodes``** — How long to train. More episodes give the
|
||||
agent more time to learn, but also take longer to run.
|
||||
agent more time to learn, but also take longer to run. This is always
|
||||
safe to increase.
|
||||
|
||||
**``epsilon_decay``** — How quickly exploration decreases. A faster
|
||||
decay (smaller ``epsilon_decay``) means the agent commits to its early
|
||||
@@ -257,14 +377,124 @@ a small learning rate is stable but slow.
|
||||
means the agent values long-term consequences; closer to 0 makes the
|
||||
agent focus on immediate reward.
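The effect of ``gamma`` is easiest to see on a short reward sequence. The sketch below computes the standard discounted return; the reward values are illustrative (two steps away from the apple, then an apple eaten).

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where each later reward is scaled by one more factor of gamma."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [-1, -1, 50]
print(discounted_return(rewards, 0.5))  # 11.0
print(discounted_return(rewards, 0.0))  # -1.0
```

With ``gamma = 0`` only the immediate −1 counts, so the move looks bad; with a larger ``gamma`` the upcoming +50 outweighs the short-term penalties.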
|
||||
|
||||
**``n_layers`` and ``layer_size``** — The depth and width of the MLP
|
||||
head. Larger networks can represent more complex Q-functions but are
|
||||
slower to train and may overfit.
|
||||
**``hidden_sizes``** — The shape of the MLP head as a list of layer
|
||||
sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent
|
||||
more complex Q-functions but are slower to train and may overfit.
|
||||
|
||||
**``prioritize_experiences``** — Whether to use prioritized experience
|
||||
replay. This often improves sample efficiency but is slightly slower
|
||||
per step.
|
||||
|
||||
.. _incompatible-changes:
|
||||
|
||||
Why some changes require starting fresh
|
||||
----------------------------------------
|
||||
|
||||
Not all changes to ``config.toml`` are equal. Some can be applied
|
||||
immediately to an existing training run; others make the existing
|
||||
checkpoints unusable.
|
||||
|
||||
**Safe to change at any time** (``[training]`` section) — These affect
|
||||
*how* the agent learns, not *what* it is learning to do. Existing
|
||||
checkpoints remain valid:
|
||||
|
||||
- ``training_episodes``, ``max_turns_per_episode``
|
||||
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
|
||||
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
|
||||
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
|
||||
- ``target_update_freq``, ``train_every``
|
||||
|
||||
**Requires starting fresh** — These changes alter the shape of the
|
||||
game or the shape of the network. The saved model weights are
|
||||
incompatible with the new configuration:
|
||||
|
||||
- ``actions``, ``reward``, ``character_set``, ``board_size``
|
||||
(``[metadata]``) — These define what the agent perceives and what it
|
||||
can do. Changing them changes the size of the network's input or
|
||||
output layers; the existing weights no longer fit.
|
||||
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
|
||||
``egocentric``, ``egocentric_player``, ``egocentric_radius``
|
||||
(``[preprocessing]``) — These control how the observation is
|
||||
constructed. Any change here alters the input shape or meaning and
|
||||
makes existing weights invalid.
|
||||
- ``hidden_sizes`` (``[model]``) — This defines the network's hidden
|
||||
layers. Changing it changes the shape of the network; the existing
|
||||
weights no longer fit.
|
||||
|
||||
If you try to resume training after making one of these changes,
|
||||
``retro-gamer train`` detects the mismatch and stops with a clear
|
||||
explanation, for example::
|
||||
|
||||
Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.
|
||||
|
||||
The following changes require starting fresh. The existing model was
|
||||
trained on a different problem and its weights cannot be reused:
|
||||
|
||||
character_set
|
||||
was : ['@', '*', '>', '<', '^', 'v']
|
||||
now : ['@', '*', '>', '<', '^', 'v', '#']
|
||||
why : the set of board characters (changes input layer size)
|
||||
|
||||
Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
|
||||
training log, then run 'retro-gamer train RUN_DIR' to start fresh.
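A check like this can be approximated by comparing the shape-defining keys of the saved and current configuration. The sketch below is an assumption about how such a comparison might look, not the actual implementation; the key lists are taken from the bullet points above, and ``incompatible_changes`` is a hypothetical name.

```python
# Keys that change the shape or meaning of the network, grouped by config section.
SHAPE_KEYS = {
    "metadata": ["actions", "reward", "character_set", "board_size"],
    "preprocessing": [
        "spatial", "board", "observe_state", "observe_state_sizes",
        "egocentric", "egocentric_player", "egocentric_radius",
    ],
    "model": ["hidden_sizes"],
}


def incompatible_changes(saved_config, current_config):
    """List (section, key, old, new) for every shape-defining key that differs."""
    changes = []
    for section, keys in SHAPE_KEYS.items():
        old_section = saved_config.get(section, {})
        new_section = current_config.get(section, {})
        for key in keys:
            if old_section.get(key) != new_section.get(key):
                changes.append((section, key, old_section.get(key), new_section.get(key)))
    return changes


saved = {
    "metadata": {"character_set": ["@", "*"]},
    "training": {"learning_rate": 0.001},
}
current = {
    "metadata": {"character_set": ["@", "*", "#"]},
    "training": {"learning_rate": 0.0001},
}
print(incompatible_changes(saved, current))
# [('metadata', 'character_set', ['@', '*'], ['@', '*', '#'])]
```

The ``[training]`` section is deliberately left out of ``SHAPE_KEYS``, so the changed learning rate is not reported.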
|
||||
|
||||
To clear out the old checkpoints and begin again:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
% retro-gamer clean runs/snake/
|
||||
Will remove 5 checkpoint(s) and training log from runs/snake/:
|
||||
checkpoints/ep_0100.pt
|
||||
checkpoints/ep_0200.pt
|
||||
...
|
||||
training.log
|
||||
|
||||
Proceed? [y/N]: y
|
||||
Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.
|
||||
|
||||
The ``config.toml`` is always preserved so you do not need to run
|
||||
``retro-gamer create`` again.
|
||||
|
||||
Reasoning about training from the log
|
||||
--------------------------------------
|
||||
|
||||
The training log is one of the most useful tools for understanding what
|
||||
is happening during training. Here are some patterns to look for and
|
||||
what they mean.
|
||||
|
||||
**Reward increasing steadily** is the normal, healthy pattern. Each
|
||||
checkpoint line should generally show a higher ``avg_reward`` than the last.
|
||||
The rate of increase typically slows as training progresses.
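If you prefer to check the trend programmatically, the checkpoint lines can be parsed with a small script. This sketch assumes the line format shown in the sample log in this guide (``[ep_NNNN] ... avg_reward=...``); the function names are hypothetical and the sample text is copied from that example.

```python
import re

# Matches "[ep_0100] ... avg_reward=-31.4" and tolerates padding such as "= -4.1".
LINE_PATTERN = re.compile(r"^\[ep_(\d+)\].*?avg_reward=\s*([+-]?\d+(?:\.\d+)?)")


def avg_rewards(log_text):
    """Extract (episode, avg_reward) pairs from checkpoint lines, skipping other lines."""
    pairs = []
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def is_improving(pairs):
    """True if every checkpoint has a higher avg_reward than the one before it."""
    values = [value for _, value in pairs]
    return all(later > earlier for earlier, later in zip(values, values[1:]))


sample_log = """\
[ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2
[ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8
=== Resumed from ep_0200.pt | 2026-05-09 14:22:01 ===
[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1
"""
pairs = avg_rewards(sample_log)
print(pairs)                # [(100, -31.4), (200, -18.6), (300, -4.1)]
print(is_improving(pairs))  # True
```

A strict "always higher" test is harsher than necessary in practice; occasional dips between checkpoints are common, so treat a ``False`` result as a prompt to look at the log rather than as a failure.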
|
||||
|
||||
**Reward flat or negative through early episodes** is normal. Early in
|
||||
training, ``epsilon`` is high and the agent is mostly acting randomly.
|
||||
It has not yet discovered effective strategies. Patience—and a look at
|
||||
the ``epsilon`` column—will confirm whether this is just the exploration
|
||||
phase.
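To estimate how long the exploration phase lasts, it can help to tabulate ``epsilon`` over episodes. The sketch below assumes a multiplicative decay applied once per episode with a floor at ``epsilon_min``; that is a common schedule, but whether the trainer decays per episode or per step is an assumption here, and the numbers are illustrative rather than defaults.

```python
def epsilon_schedule(start, decay, minimum, episodes):
    """Exploration rate after each episode under multiplicative decay with a floor."""
    epsilon = start
    values = []
    for _ in range(episodes):
        # Shrink epsilon by the decay factor, but never go below the minimum.
        epsilon = max(minimum, epsilon * decay)
        values.append(epsilon)
    return values


# Illustrative settings, not retro-gamer defaults.
schedule = epsilon_schedule(start=1.0, decay=0.995, minimum=0.01, episodes=1000)
print(schedule[0])   # 0.995
print(schedule[-1])  # 0.01
```

If ``avg_reward`` is still flat after ``epsilon`` has reached its floor, the cause is probably something other than exploration.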
|
||||
|
||||
**Loss decreasing** is also healthy. As the Q-network's estimates
|
||||
improve, the difference between predicted and target Q-values (the TD
|
||||
error) should shrink. A loss that levels off at a low value is usually a
|
||||
good sign.
|
||||
|
||||
**Loss growing without bound** indicates the learning rate is too high.
|
||||
The trainer uses Huber loss, which is robust to large reward scales, but
|
||||
a learning rate above roughly ``0.001`` can still destabilize training.
|
||||
Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``)
|
||||
and restarting training.
|
||||
|
||||
**Short episodes (low ``avg_steps``)** combined with low reward
|
||||
suggest that the agent is dying frequently. Early in training this is
|
||||
normal. If it persists late in training, the agent may have settled on
|
||||
a bad policy—consider extending training or adjusting
|
||||
``epsilon_decay`` to explore longer.
|
||||
|
||||
**Reward that improves and then regresses** can indicate that the
|
||||
agent has discovered a suboptimal but consistent strategy and is stuck.
|
||||
Increasing ``epsilon_min`` to keep some exploration active, or
|
||||
adjusting the reward signal to better differentiate good moves from
|
||||
bad ones, can help.
|
||||
|
||||
Questions for investigation
|
||||
----------------------------
|
||||
|
||||
@@ -297,3 +527,4 @@ concepts underlying the training algorithm.
|
||||
episode 1000 and watch each play the same game. What has the later
|
||||
agent learned that the earlier one has not? How would you describe
|
||||
this difference to someone who does not know about neural networks?
|
||||
|
||||
|
||||