Updates across the board
@@ -17,8 +17,6 @@ A complete example for the Snake game:
actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
reward = "score"
character_set = ["@", "*", ">", "<", "^", "v"]
spatial = true
observe_state = []

You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.
@@ -65,54 +63,156 @@ If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.

``spatial``
~~~~~~~~~~~
Preprocessing options
---------------------

**Optional; default ``true``.** Whether to treat the board as a 2D
spatial scene. When ``true``, the trainer uses a convolutional neural
network (CNN) that can detect patterns in the relative positions of
characters. When ``false``, the trainer uses a multilayer perceptron
(MLP) that sees the board as a flat list of numbers without positional
structure.
Preprocessing options live in the ``[preprocessing]`` section of a run's
``config.toml``. They control how the game's board and state are
transformed into the observation vector that the neural network sees.
``retro-gamer create`` writes sensible defaults; you can edit them by
hand before running ``retro-gamer train``.

.. note::

   Changes to any ``[preprocessing]`` option—or to the game description
   fields above—make existing checkpoints incompatible. Run
   ``retro-gamer clean`` before retraining after such changes.

``spatial`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whether to treat the board as a 2D spatial scene. When ``true``, the
trainer uses a convolutional neural network (CNN); when ``false``, a
multilayer perceptron (MLP) that sees the board as a flat list of
numbers.

``board`` (default: ``true``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whether to include the board encoding in the observation vector. Set
to ``false`` to train on game state variables only, with no board at
all. This is useful for games with small, enumerable state spaces where
a lookup table (classic Q-learning) is sufficient.

When ``board = false``:

- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
- At least one key must be listed in ``observe_state``.
- ``character_set`` is not required and character discovery is skipped.

.. code-block:: toml

   spatial = true
   [preprocessing]
   board = false
   observe_state = ["board_state"]

``observe_state``
~~~~~~~~~~~~~~~~~
``observe_state`` (default: ``[]``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Optional; default ``[]``.** A list of keys from the game's state
dictionary to append to the observation vector. The values must be
numbers (integers, floats, or booleans). The reward key must not
appear in this list.
A list of keys from ``game.state`` to include in the observation
vector, appended after the board encoding (or as the entire
observation when ``board = false``). Scalar values contribute one
element each; list or tuple values are flattened.

.. code-block:: toml

   observe_state = ["lives", "level"]
   observe_state = ["apple_dx", "apple_dy"]

The keys must be present in ``game.state`` at every step, initialized
in ``create_game()`` before the game starts. All values that are lists
or tuples must always have the same length from episode to episode.

.. warning::

   ``observe_state`` keys must be initialized to their final shape in
   ``create_game()`` before the game starts. If a key is absent or its
   list length changes between episodes, training will crash with an
   error explaining which key changed and by how much. This happens
   because the neural network's input layer has a fixed size determined
   at the start of training; it cannot adapt to a changing observation
   shape mid-run.

   Always initialize every observed key with a placeholder of the
   correct type and length before the first ``game.step()`` call.

``observe_state_sizes`` (auto-discovered)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A table mapping each ``observe_state`` key to its flat size (``1`` for
scalars, ``N`` for sequences of length N). This is written automatically
to ``config.toml`` the first time ``retro-gamer train`` runs, after the
trainer samples ``game.state`` to discover the actual sizes:

.. code-block:: toml

   observe_state_sizes = {board_state = 9}

You do not need to set this manually. Once written, it is used to
detect changes in state shape when resuming training—an incompatible
change here requires running ``retro-gamer clean`` and starting fresh.
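
The flattening and size rules described for ``observe_state`` and
``observe_state_sizes`` can be sketched in plain Python. This is an
illustration only, not ``retro-gamer``'s implementation: the helper names
``flat_size``, ``discover_sizes`` and ``flatten_state`` are hypothetical,
and the ``state`` dictionary stands in for ``game.state``.

.. code-block:: python

   def flat_size(value):
       """Return 1 for a scalar, N for a list or tuple of length N."""
       if isinstance(value, (list, tuple)):
           return len(value)
       return 1


   def discover_sizes(state, observe_state):
       """Map each observed key to its flat size (hypothetical helper)."""
       return {key: flat_size(state[key]) for key in observe_state}


   def flatten_state(state, observe_state, expected_sizes):
       """Flatten observed values into one list of floats, checking sizes."""
       flat = []
       for key in observe_state:
           if key not in state:
               raise KeyError(f"observed key {key!r} is missing from the state")
           value = state[key]
           size = flat_size(value)
           if size != expected_sizes[key]:
               raise ValueError(
                   f"{key!r} changed size: expected {expected_sizes[key]}, got {size}"
               )
           items = value if isinstance(value, (list, tuple)) else [value]
           flat.extend(float(item) for item in items)
       return flat


   # Stand-in for game.state: one scalar and one 9-element sequence.
   state = {"lives": 3, "board_state": [0, 1, 2, 0, 0, 1, 2, 0, 0]}
   keys = ["lives", "board_state"]
   sizes = discover_sizes(state, keys)
   vector = flatten_state(state, keys, sizes)

Under these assumptions the scalar contributes one element and the
sequence nine, so the state part of the observation has ten elements. A
later state whose ``board_state`` has a different length raises an error
instead of silently changing the observation shape.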

``egocentric`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``true``, the board observation is cropped to a square window
centred on a specific agent rather than the full board. This gives the
agent a local, first-person-like view and makes the observation
invariant to the agent's absolute position on the board.

Requires ``egocentric_player`` and ``egocentric_radius``.

``egocentric_player``
~~~~~~~~~~~~~~~~~~~~~~

The name of the agent to use as the centre of the egocentric crop.
Must match the ``name`` attribute of one of the game's agents.

.. code-block:: toml

   egocentric_player = "Snake head"

``egocentric_radius``
~~~~~~~~~~~~~~~~~~~~~~

The half-side-length of the egocentric crop window, in cells. The
resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
values give the agent a wider view; smaller values focus it on the
immediate vicinity.

.. code-block:: toml

   egocentric_radius = 8 # 17×17 window

When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
automatically updated to ``[2r+1, 2r+1]`` so the network is sized
correctly.
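
A minimal sketch of an egocentric crop, to make the ``(2r+1) × (2r+1)``
window concrete. The function ``egocentric_crop`` is hypothetical, and
padding out-of-board cells with a blank character is an assumption; the
text above does not say how ``retro-gamer`` handles cells beyond the
board edge.

.. code-block:: python

   def egocentric_crop(board, center, radius, fill=" "):
       """Return a (2r+1) x (2r+1) window of `board` centred on `center`.

       `board` is a list of rows and `center` is a (row, col) pair. Cells
       outside the board are filled with `fill` (an assumption).
       """
       rows, cols = len(board), len(board[0])
       center_row, center_col = center
       window = []
       for r in range(center_row - radius, center_row + radius + 1):
           line = []
           for c in range(center_col - radius, center_col + radius + 1):
               inside = 0 <= r < rows and 0 <= c < cols
               line.append(board[r][c] if inside else fill)
           window.append(line)
       return window


   # A 3x5 board with the agent "@" at row 1, column 1.
   board = [list("....."), list(".@*.."), list(".....")]
   window = egocentric_crop(board, center=(1, 1), radius=2)

With ``radius=2`` the window is 5×5 and the agent always sits in the
middle cell, wherever it is on the full board.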

.. _hyperparameters:

Hyperparameters
---------------

Hyperparameters are stored in the ``[hyperparameters]`` section of
``config.toml``. They can be set via ``retro-gamer create`` options or
edited directly.
Hyperparameters are split across two sections of ``config.toml``:

- ``[model]`` — network architecture (changing these requires starting fresh)
- ``[training]`` — learning algorithm parameters (safe to change at any time)

Both sections can be set via ``retro-gamer create`` options or edited directly.

Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~

``learning_rate`` (default: ``0.001``)
``learning_rate`` (default: ``0.0001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values converge faster but may be unstable; smaller
    values are more stable but slower.

``lr_decay`` (default: ``0.995``)
``learning_rate_decay`` (default: ``0.9999``)
    Multiplicative decay applied to the learning rate after each
    episode. The learning rate decreases geometrically over training,
    helping the network fine-tune later without destabilizing early
    progress.
    progress. With the default value, the learning rate decays to about
    13 % of its starting value after 20 000 episodes.
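
The decay figure can be checked with a few lines of arithmetic. The
variable names below are illustrative; only the three default values come
from this page.

.. code-block:: python

   learning_rate = 0.0001
   learning_rate_decay = 0.9999
   episodes = 20_000

   # Geometric decay: one multiplication per episode.
   remaining_fraction = learning_rate_decay ** episodes
   final_learning_rate = learning_rate * remaining_fraction

   print(f"fraction of starting value left: {remaining_fraction:.3f}")
   print(f"learning rate after {episodes} episodes: {final_learning_rate:.2e}")

The remaining fraction is close to ``exp(-2)``, because
``0.9999 ** 20000`` is approximately ``exp(-20000 × 0.0001)``.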

``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
@@ -127,7 +227,7 @@ Exploration
    random action with probability ``epsilon`` and exploits its current
    Q-function with probability ``1 - epsilon``.

``epsilon_decay`` (default: ``0.995``)
``epsilon_decay`` (default: ``0.9997``)
    Multiplicative decay applied to ``epsilon`` after each episode.
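
As a rough guide to how long exploration lasts, the sketch below
estimates when a decaying ``epsilon`` reaches its floor. It assumes a
starting value of ``1.0`` (the starting value is not stated in this
excerpt) and assumes the floor is applied with a simple ``max``; both are
assumptions, not documented behaviour.

.. code-block:: python

   import math

   epsilon_start = 1.0     # assumption: starting value not given here
   epsilon_decay = 0.9997  # default from this page
   epsilon_min = 0.05      # default from this page

   # Smallest episode count n with epsilon_start * epsilon_decay**n <= epsilon_min.
   episodes_to_floor = math.ceil(
       math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
   )


   def epsilon_at(episode):
       """Exploration rate after `episode` decays, clamped at the floor."""
       return max(epsilon_min, epsilon_start * epsilon_decay ** episode)


   print(episodes_to_floor)

Under these assumptions the floor is reached a little before episode
10 000, about halfway through a 20 000-episode run.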

``epsilon_min`` (default: ``0.05``)
@@ -142,31 +242,33 @@ Memory and sampling
    The number of experiences sampled from the replay buffer per
    training step.

``memory_capacity`` (default: ``10000``)
``memory_capacity`` (default: ``50000``)
    The maximum number of experiences the replay buffer can hold. When
    full, the oldest experiences are discarded.
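
The "oldest experiences are discarded" behaviour matches a fixed-size
queue. The sketch below uses ``collections.deque`` with uniform sampling
to illustrate it; the tuple layout and the tiny capacity are made up for
the example, and this is not necessarily how ``retro-gamer`` stores
experiences.

.. code-block:: python

   import random
   from collections import deque

   memory_capacity = 5  # tiny for illustration; the default above is 50000
   batch_size = 3       # illustrative value

   replay_buffer = deque(maxlen=memory_capacity)
   for step in range(8):
       # Hypothetical experience layout: (state, action, reward, next_state, done).
       replay_buffer.append((step, "KEY_RIGHT", 0.0, step + 1, False))

   # Steps 0-2 were pushed out when steps 5-7 arrived.
   oldest_state = replay_buffer[0][0]

   # Uniform sampling without replacement (no prioritization).
   batch = random.sample(list(replay_buffer), batch_size)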

``prioritize_experiences`` (default: ``false``)
``prioritize_experiences`` (default: ``true``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger TD errors are sampled more frequently.
    This often improves sample efficiency at a modest computational
    cost.

Network architecture
~~~~~~~~~~~~~~~~~~~~
Model architecture
~~~~~~~~~~~~~~~~~~

``n_layers`` (default: ``2``)
    The number of hidden layers in the MLP head (for spatial games,
    this follows the CNN; for non-spatial games, it is the full
    network).
These live in the ``[model]`` section. Changing them requires starting fresh
(run ``retro-gamer clean`` before retraining).

``layer_size`` (default: ``128``)
    The width (number of units) in each hidden layer.
``hidden_sizes`` (default: ``[128, 64]``)
    A list of integers giving the size of each hidden layer in the MLP
    head. The default creates two layers: 128 units then 64. For spatial
    games this follows the CNN; for non-spatial games it is the full
    network. Larger or deeper networks can represent more complex
    Q-functions but train more slowly and may need more episodes.
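
To get a feel for how ``hidden_sizes`` affects model size, the sketch
below counts the parameters of a plain fully connected network. It
assumes every layer is a dense layer with a bias term, and it uses a
hypothetical input size of 100 and four actions; the real input size
depends on the board and ``observe_state``.

.. code-block:: python

   def mlp_parameter_count(input_size, hidden_sizes, n_actions):
       """Count weights and biases of a fully connected network."""
       layer_sizes = [input_size, *hidden_sizes, n_actions]
       return sum(
           fan_in * fan_out + fan_out  # weight matrix plus bias vector
           for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
       )


   default_count = mlp_parameter_count(100, [128, 64], n_actions=4)
   wider_count = mlp_parameter_count(100, [512, 256], n_actions=4)

   print(default_count, wider_count)

In this example, moving from ``[128, 64]`` to ``[512, 256]`` multiplies
the parameter count by more than eight, which is one reason larger
networks train more slowly.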

Training duration
~~~~~~~~~~~~~~~~~

``training_episodes`` (default: ``1000``)
``training_episodes`` (default: ``20000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.

@@ -175,12 +277,18 @@ Training duration
    indefinitely (for example, if the agent finds a way to avoid
    dying).

``target_update_freq`` (default: ``100``)
``target_update_freq`` (default: ``500``)
    How many training steps between updates of the target network.
    More frequent updates make training targets move faster (less
    stable); less frequent updates make them more stable but slower
    to reflect new learning.

``train_every`` (default: ``4``)
    Run one training step every N game steps. Higher values speed up
    episode collection at the cost of fewer gradient updates per
    experience. The default of 4 is a good balance for most games;
    set to 1 to train on every step.
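
The interaction between ``train_every`` and ``target_update_freq`` can be
seen by counting events in a simplified loop. This sketch assumes
counting starts at the first game step and ignores any warm-up period
before the replay buffer is usable; the real trainer may differ.

.. code-block:: python

   train_every = 4
   target_update_freq = 500
   game_steps = 10_000

   training_steps = 0
   target_updates = 0
   for game_step in range(1, game_steps + 1):
       if game_step % train_every == 0:
           training_steps += 1      # one gradient update
           if training_steps % target_update_freq == 0:
               target_updates += 1  # target network refreshed

   print(training_steps, target_updates)

With the defaults, the target network is refreshed once every
``4 × 500 = 2000`` game steps, because ``target_update_freq`` counts
training steps rather than game steps.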

Character discovery
~~~~~~~~~~~~~~~~~~~

@@ -207,23 +315,26 @@ game's ``pyproject.toml``; you do not pass it on the command line.

.. code-block:: console

   % retro-gamer create --game MODULE --output DIR [OPTIONS]
   % retro-gamer create --game GAME --output DIR [OPTIONS]

**Required options:**

- ``--game MODULE`` — Python module containing ``create_game()``
  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
  is read from the ``pyproject.toml`` found in or above the module's
  source directory.
- ``--game GAME`` — Your game, specified as a file path or a Python
  module name:

  - File path: ``--game my_game.py`` or ``--game my_game/``
  - Module name: ``--game retro.examples.snake``

  The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
  found in or above the game file.
- ``--output DIR`` — Directory to create for this training run.

**Hyperparameter options** (all optional; see :ref:`hyperparameters`):

- ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
- ``--learning-rate F``
- ``--lr-decay F``
- ``--learning-rate-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
@@ -232,20 +343,40 @@ game's ``pyproject.toml``; you do not pass it on the command line.
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--train-every N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``

``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~

Train (or resume training) a DQN agent.
Train a DQN agent.

.. code-block:: console

   % retro-gamer train RUN_DIR [--resume CHECKPOINT]
   % retro-gamer train RUN_DIR

``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
create``. If ``--resume`` is given, training resumes from the specified
checkpoint file (relative or absolute path).
create``. If checkpoints already exist in ``RUN_DIR``, training
automatically resumes from the latest one so prior work is never lost.

If all configured episodes have already been completed, the command
prints a message and exits immediately. To keep training, increase
``training_episodes`` in ``config.toml`` and run again.

**Incompatible changes.** Some config changes make existing checkpoints
unusable. If you change any of the following, ``retro-gamer train`` will
detect the mismatch and refuse to resume, with a clear explanation:

- ``actions``, ``reward``, ``character_set``, ``board_size``
  (``[metadata]``) — game description
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
  (``[preprocessing]``) — observation encoding
- ``hidden_sizes`` (``[model]``) — network architecture

Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
safe and take effect immediately on the next training run.

``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~
@@ -256,16 +387,32 @@ Watch a trained agent play the game in the terminal.

   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
By default, the latest available checkpoint is loaded. Use
``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.

``retro-gamer clean``
~~~~~~~~~~~~~~~~~~~~~

Remove all checkpoints and the training log from a run directory.

.. code-block:: console

   % retro-gamer clean RUN_DIR

Prompts for confirmation before deleting. Use ``--yes`` / ``-y`` to skip
the prompt. The ``config.toml`` is preserved so you can run
``retro-gamer train`` immediately to start fresh with the same settings.

Use this after making an incompatible change (see ``retro-gamer train``
above) or any time you want to restart training from scratch.

``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~~

Print a summary of a training run: metadata, hyperparameters, recent
episode log, and available checkpoints.
checkpoint log, and available checkpoints.

.. code-block:: console

@@ -285,60 +432,49 @@ contents:
    └── checkpoints/
        ├── ep_0100.pt # model weights at episode 100
        ├── ep_0200.pt
        ├── ...
        └── final.pt # model weights at training completion
        └── ... # one file saved every 100 episodes

``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set and resolved hyperparameters) when
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
and ``train`` is the recommended way to adjust hyperparameters.
``retro-gamer train`` begins. It has five sections: ``[game]``,
``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
Editing ``config.toml`` between ``create`` and ``train`` is the
recommended way to adjust hyperparameters.

``training.log`` begins with the full architecture description
generated at training startup, followed by one line per episode in the
format::
``training.log`` begins with the full network architecture description,
then one line per checkpoint (every 100 episodes) in the format::

    [EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
    [ep_NNNN] ep=SSSS-NNNN avg_reward=F avg_steps=N epsilon=F avg_loss=F time=Xm Xs total=Xm Xs

Checkpoint files are PyTorch state dictionaries containing model
weights, optimizer state, the current epsilon, and the total number of
training steps completed. They can be loaded with
``retro-gamer play`` or directly with the Python API.
Each entry summarises the episodes since the previous checkpoint:

- ``ep=SSSS-NNNN`` — episode range covered by this entry
- ``avg_reward`` — mean total reward per episode (higher is better)
- ``avg_steps`` — mean episode length in game turns
- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
  stabilises). Huber loss equals ½·(q−t)² for small errors (|q−t| ≤ 1) and |q−t|−½
  for larger ones, so it grows only linearly with the error and a few large
  Q-value errors cannot dominate the average. Values in the range
  0–10 are typical; a slow downward trend over thousands of episodes is the
  healthy pattern. A loss that grows without bound indicates a learning rate
  that is too high.
- ``time`` — wall-clock time for this checkpoint interval
- ``total`` — cumulative training time across all sessions
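
For reference, the Huber loss formula quoted for ``avg_loss`` can be
written as a small function. This is the standard definition with a
threshold of 1, matching the two expressions in the list; the function
name and the ``delta`` parameter are illustrative, not part of
``retro-gamer``.

.. code-block:: python

   def huber_loss(q, t, delta=1.0):
       """Huber loss between a predicted Q-value q and its target t."""
       error = abs(q - t)
       if error <= delta:
           return 0.5 * error ** 2          # quadratic near zero
       return delta * (error - 0.5 * delta)  # linear for large errors


   small = huber_loss(0.5, 0.0)   # quadratic branch
   large = huber_loss(10.0, 0.0)  # linear branch

   print(small, large)

A squared-error loss would give 50 for the second case; the Huber loss
gives 9.5, which is why a handful of large errors has a limited effect on
the logged average.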

When training is resumed, a ``=== Resumed from ... ===`` line is appended
so the log records the full history of a run across multiple sessions.
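
If you want to plot progress, the checkpoint lines can be parsed with a
regular expression. The sketch below is based only on the format template
shown above: the exact number formatting is an assumption, the sample
line is made up, and ``parse_log_line`` is a hypothetical helper, not
part of ``retro-gamer``.

.. code-block:: python

   import re

   LOG_LINE = re.compile(
       r"\[ep_(?P<checkpoint>\d+)\] ep=(?P<first>\d+)-(?P<last>\d+) "
       r"avg_reward=(?P<avg_reward>\S+) avg_steps=(?P<avg_steps>\S+) "
       r"epsilon=(?P<epsilon>\S+) avg_loss=(?P<avg_loss>\S+) "
       r"time=(?P<time>\d+m \d+s) total=(?P<total>\d+m \d+s)"
   )


   def parse_log_line(line):
       """Return a dict of fields, or None for lines in another format."""
       match = LOG_LINE.match(line.strip())
       if match is None:
           return None
       fields = match.groupdict()
       for key in ("avg_reward", "avg_steps", "epsilon", "avg_loss"):
           fields[key] = float(fields[key])
       return fields


   # Made-up sample line that follows the documented template.
   sample = (
       "[ep_0200] ep=0101-0200 avg_reward=1.25 avg_steps=87 "
       "epsilon=0.9418 avg_loss=0.3121 time=0m 42s total=1m 20s"
   )
   parsed = parse_log_line(sample)
   skipped = parse_log_line("=== Resumed from ep_0200 ===")

Lines that do not match, such as the architecture description or a
``=== Resumed from ... ===`` marker, return ``None`` and can be skipped.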

Python API
----------

For advanced use, ``retro-gamer``'s components are importable as a
library.
library. See the :doc:`api` reference for full details.

.. code-block:: python

   from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
   from retro_gamer import GameMetadata, DQNTrainer
   from retro.examples.snake import create_game

   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
   metadata = GameMetadata.from_pyproject("retro.examples.snake")

   trainer = DQNTrainer(
   create_game, metadata, "runs/snake/",
   training_episodes=500,
   n_layers=2,
   layer_size=128,
   )
   trainer = DQNTrainer(create_game, metadata, "runs/snake/")
   trainer.train()

``GameEnvironment`` provides a gym-style interface for stepping through
a game programmatically:

.. code-block:: python

   from retro_gamer import GameEnvironment

   env = GameEnvironment(create_game, metadata)
   obs = env.reset() # returns initial observation vector
   obs, reward, done = env.step("KEY_RIGHT")

The observation is a flat NumPy array of dtype ``float32``. For spatial
games, the first ``C × H × W`` elements are the board (channel-first
one-hot encoding); for non-spatial games, the board is encoded
``H × W × C`` and then flattened. Any ``observe_state`` values are
appended at the end.
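
The two board layouts differ only in the order of the flattened
elements. The NumPy sketch below illustrates this on a tiny 2×2 board.
It is independent of ``retro-gamer``, and it assumes that blank cells
encode as all zeros, which this page does not specify.

.. code-block:: python

   import numpy as np

   character_set = ["@", "*"]
   board = ["@*", "  "]  # H = 2 rows, W = 2 columns

   # One-hot encode as H x W x C: one channel per character.
   height, width = len(board), len(board[0])
   channels = len(character_set)
   one_hot = np.zeros((height, width, channels), dtype=np.float32)
   for row, line in enumerate(board):
       for col, char in enumerate(line):
           if char in character_set:
               one_hot[row, col, character_set.index(char)] = 1.0

   non_spatial = one_hot.reshape(-1)                 # H x W x C order
   spatial = one_hot.transpose(2, 0, 1).reshape(-1)  # C x H x W order

   # Append one illustrative observe_state scalar at the end.
   state_values = np.array([3.0], dtype=np.float32)
   observation = np.concatenate([spatial, state_values])

Here the ``*`` in the top-right cell lands at index 3 in the
``H × W × C`` layout but at index 5 in the channel-first layout, so the
two encodings are not interchangeable.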