Reference
=========

Game description fields
-----------------------

Game descriptions are written in the ``[tool.retro-gamer]`` section of
your game project's ``pyproject.toml``. ``retro-gamer create`` reads
this section and copies the metadata into the training run's
``config.toml``, where it can also be inspected or hand-edited.

A complete example for the Snake game:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
    spatial = true
    observe_state = []

You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.

The fields are described below.


``actions``
~~~~~~~~~~~

**Required.** A list of keystroke names the agent may send to the game
each turn. Use arrow key names for directional games, or single
characters for character-key games.

.. code-block:: toml

    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]

The agent also has access to a no-op action (doing nothing). The total
number of actions in the Q-network output is ``len(actions) + 1``.

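
As a rough illustration of the ``len(actions) + 1`` count, the sketch below builds an action table from the configured list plus a no-op. The ``build_action_table`` helper, the ``NO_OP`` placeholder, and the choice to put the no-op at index 0 are assumptions made for this example; they are not taken from ``retro-gamer`` itself.

```python
# Hypothetical sketch: where the no-op sits in the real action table is not
# documented, so index 0 is an arbitrary choice made for this example.
NO_OP = None  # placeholder meaning "send no keystroke this turn"


def build_action_table(actions):
    """Return the configured keystrokes preceded by a single no-op entry."""
    return [NO_OP] + list(actions)


actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
table = build_action_table(actions)
n_outputs = len(table)  # one Q-value per entry: len(actions) + 1
```

With the Snake configuration this gives five Q-network outputs: four arrow keys and the no-op.
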

``reward``
~~~~~~~~~~

**Required.** The key in the game's state dictionary to use as the
reward signal. The reward computed for each turn is the *change* in
this value from the previous turn.

.. code-block:: toml

    reward = "score"

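
To make the "change in value" rule concrete, here is a minimal sketch that turns a sequence of state dictionaries into per-turn rewards. The ``turn_rewards`` function and the sample states are invented for illustration and are not part of ``retro-gamer``.

```python
def turn_rewards(states, reward_key="score"):
    """Return one reward per turn: the change in state[reward_key]."""
    values = [state[reward_key] for state in states]
    # Pair each value with its successor and take the difference.
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Made-up state snapshots: the score stays flat, then rises by 1, then by 2.
states = [{"score": 0}, {"score": 0}, {"score": 1}, {"score": 3}]
rewards = turn_rewards(states)
print(rewards)  # [0, 1, 2]
```

A turn on which the score does not move yields a reward of zero, and a falling score would yield a negative reward.
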

``character_set``
~~~~~~~~~~~~~~~~~

**Optional.** A list of single characters that may appear on the board.
Each character occupies one "slot" in the one-hot encoding. Characters
not in this list are treated as empty space.

.. code-block:: toml

    character_set = ["@", "*", ">", "<", "^", "v"]

If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.

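
The one-hot "slot" idea can be sketched in plain Python as follows. This is an illustration only: the ``one_hot_board`` helper is hypothetical, and it assumes that empty space and unlisted characters are encoded as all zeros, which the text above implies but does not spell out.

```python
def one_hot_board(rows, character_set):
    """Encode a board as channels[c][y][x], 1.0 where character c appears."""
    index = {ch: i for i, ch in enumerate(character_set)}
    height, width = len(rows), len(rows[0])
    channels = [[[0.0] * width for _ in range(height)] for _ in character_set]
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            if ch in index:  # unlisted characters are left as all zeros
                channels[index[ch]][y][x] = 1.0
    return channels


# A made-up 2x3 board; "#" is not in the character set, so it reads as empty.
board = ["@ *", " #>"]
encoded = one_hot_board(board, ["@", "*", ">"])
```

Each listed character gets its own channel, so a character set of six entries on an ``H × W`` board produces ``6 × H × W`` board values.
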

``spatial``
~~~~~~~~~~~

**Optional;** default ``true``. Whether to treat the board as a 2D
spatial scene. When ``true``, the trainer uses a convolutional neural
network (CNN) that can detect patterns in the relative positions of
characters. When ``false``, the trainer uses a multilayer perceptron
(MLP) that sees the board as a flat list of numbers without positional
structure.

.. code-block:: toml

    spatial = true


``observe_state``
~~~~~~~~~~~~~~~~~

**Optional;** default ``[]``. A list of keys from the game's state
dictionary to append to the observation vector. The values must be
numbers (integers, floats, or booleans). The reward key must not
appear in this list.

.. code-block:: toml

    observe_state = ["lives", "level"]

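
A minimal sketch of how such state values might be turned into extra observation features is shown below. The ``state_features`` function is hypothetical; the only behaviour taken from the description above is that values are numeric and that the reward key is rejected.

```python
def state_features(state, observe_state, reward_key):
    """Return the observed state values as floats, in the configured order."""
    if reward_key in observe_state:
        raise ValueError("the reward key must not appear in observe_state")
    # float() also converts booleans: True -> 1.0, False -> 0.0.
    return [float(state[key]) for key in observe_state]


# Made-up game state for illustration.
state = {"score": 10, "lives": 3, "level": 2, "alive": True}
features = state_features(state, ["lives", "level", "alive"], reward_key="score")
print(features)  # [3.0, 2.0, 1.0]
```
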

.. _hyperparameters:

Hyperparameters
---------------

Hyperparameters are stored in the ``[hyperparameters]`` section of
``config.toml``. They can be set via ``retro-gamer create`` options or
edited directly.


Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~

``learning_rate`` (default: ``0.001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values converge faster but may be unstable; smaller
    values are more stable but slower.

``lr_decay`` (default: ``0.995``)
    Multiplicative decay applied to the learning rate after each
    episode. The learning rate decreases geometrically over training,
    helping the network fine-tune later without destabilizing early
    progress.

``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
    agent value all future rewards equally; smaller values make the
    agent increasingly myopic.

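
The effect of ``lr_decay`` and ``gamma`` can be worked through numerically. The two functions below are illustrative helpers written for this page, not ``retro-gamer`` API; they simply apply the geometric decay and discounting described above.

```python
def learning_rate_at(episode, learning_rate=0.001, lr_decay=0.995):
    """Learning rate after `episode` completed episodes of geometric decay."""
    return learning_rate * lr_decay ** episode


def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma once per turn of delay."""
    total = 0.0
    # Work backwards so each step folds in the already-discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


lr_after_100 = learning_rate_at(100)          # roughly 0.0006 with the defaults
myopic = discounted_return([1, 1, 1], 0.5)    # 1 + 0.5 + 0.25 = 1.75
far_sighted = discounted_return([1, 1, 1], 1.0)  # 3.0: all rewards count equally
```
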

Exploration
~~~~~~~~~~~

``epsilon`` (default: ``1.0``)
    The initial exploration rate. At each turn, the agent takes a
    random action with probability ``epsilon`` and exploits its current
    Q-function with probability ``1 - epsilon``.

``epsilon_decay`` (default: ``0.995``)
    Multiplicative decay applied to ``epsilon`` after each episode.

``epsilon_min`` (default: ``0.05``)
    The floor below which ``epsilon`` will not fall. A small amount of
    continued exploration prevents the agent from becoming permanently
    committed to a suboptimal policy.

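
The three exploration settings combine into a decaying epsilon-greedy policy. The following is a minimal sketch of that idea under the defaults listed above; ``epsilon_at`` and ``choose_action`` are names invented for this example, and the trainer's actual implementation may differ.

```python
import random


def epsilon_at(episode, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    """Exploration rate after `episode` episodes, clamped at the floor."""
    return max(epsilon_min, epsilon * epsilon_decay ** episode)


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


greedy = choose_action([0.1, 0.9, 0.3], epsilon=0.0)  # always index 1
late_epsilon = epsilon_at(10_000)                     # clamped to 0.05
```

With the default values, ``epsilon`` reaches its floor of ``0.05`` after a little under 600 episodes, since ``0.995 ** 598`` is just below ``0.05``.
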

Memory and sampling
~~~~~~~~~~~~~~~~~~~

``batch_size`` (default: ``64``)
    The number of experiences sampled from the replay buffer per
    training step.

``memory_capacity`` (default: ``10000``)
    The maximum number of experiences the replay buffer can hold. When
    full, the oldest experiences are discarded.

``prioritize_experiences`` (default: ``false``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger TD errors are sampled more frequently.
    This often improves sample efficiency at a modest computational
    cost.

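
For readers unfamiliar with replay buffers, the sketch below shows the uniform (non-prioritized) case using only the standard library. The ``ReplayBuffer`` class is a hypothetical stand-in written for this page; it is not the class ``retro-gamer`` uses internally.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest experience when full."""

    def __init__(self, memory_capacity=10000):
        # A deque with maxlen discards items from the left once it is full.
        self.memory = deque(maxlen=memory_capacity)

    def push(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size=64):
        """Draw batch_size distinct experiences uniformly at random."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(memory_capacity=3)
for turn in range(5):
    buffer.push(turn)          # experiences 0 and 1 are evicted
batch = buffer.sample(batch_size=2)
```

Prioritized replay would replace the uniform ``random.sample`` call with sampling weighted by TD error.
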

Network architecture
~~~~~~~~~~~~~~~~~~~~

``n_layers`` (default: ``2``)
    The number of hidden layers in the MLP head (for spatial games,
    this follows the CNN; for non-spatial games, it is the full
    network).

``layer_size`` (default: ``128``)
    The width (number of units) in each hidden layer.


Training duration
~~~~~~~~~~~~~~~~~

``training_episodes`` (default: ``1000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.

``max_turns_per_episode`` (default: ``2000``)
    A safety cutoff preventing a single episode from running
    indefinitely (for example, if the agent finds a way to avoid
    dying).

``target_update_freq`` (default: ``100``)
    How many training steps between updates of the target network.
    More frequent updates make training targets move faster (less
    stable); less frequent updates make them more stable but slower
    to reflect new learning.


Character discovery
~~~~~~~~~~~~~~~~~~~

``exploration_turns`` (default: ``200``)
    When ``character_set`` is not specified, the number of random
    turns to run at the start of training to discover which
    characters appear on the board.

``unknown_character_strategy`` (default: ``"ignore"``)
    What to do when a character appears during training that is not
    in the established ``character_set``. ``"ignore"`` treats it as
    an empty cell; ``"extend"`` rebuilds the model with an extended
    character set.


CLI reference
-------------

``retro-gamer create``
~~~~~~~~~~~~~~~~~~~~~~

Create a new training run directory with ``config.toml``. Game metadata
is read automatically from the ``[tool.retro-gamer]`` section of your
game's ``pyproject.toml``; you do not pass it on the command line.

.. code-block:: console

    % retro-gamer create --game MODULE --output DIR [OPTIONS]

**Required options:**

- ``--game MODULE``: Python module containing ``create_game()``
  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
  is read from the ``pyproject.toml`` found in or above the module's
  source directory.
- ``--output DIR``: Directory to create for this training run.

**Hyperparameter options** (all optional; see :ref:`hyperparameters`):

- ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
- ``--learning-rate F``
- ``--lr-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
- ``--batch-size N``
- ``--memory-capacity N``
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``


``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~

Train (or resume training) a DQN agent.

.. code-block:: console

    % retro-gamer train RUN_DIR [--resume CHECKPOINT]

``RUN_DIR`` must contain a ``config.toml`` generated by
``retro-gamer create``. If ``--resume`` is given, training resumes from
the specified checkpoint file (relative or absolute path).


``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~

Watch a trained agent play the game in the terminal.

.. code-block:: console

    % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.


``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~

Print a summary of a training run: metadata, hyperparameters, recent
episode log, and available checkpoints.

.. code-block:: console

    % retro-gamer info RUN_DIR


Training run directory structure
---------------------------------

A training run is a self-contained directory with the following
contents:

.. code-block:: text

    runs/snake/
    ├── config.toml          # game description + hyperparameters
    ├── training.log         # architecture rationale + per-episode log
    └── checkpoints/
        ├── ep_0100.pt       # model weights at episode 100
        ├── ep_0200.pt
        ├── ...
        └── final.pt         # model weights at training completion

``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set and resolved hyperparameters) when
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
and ``train`` is the recommended way to adjust hyperparameters.

``training.log`` begins with the full architecture description
generated at training startup, followed by one line per episode in the
format::

    [EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F

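
Because the per-episode lines follow a fixed pattern, they are easy to parse for plotting or analysis. The sketch below is one way to do it with the standard library; ``parse_episode_line`` is a hypothetical helper, the sample line is made up, and the exact number formatting in a real log may differ.

```python
import re

# Matches the documented layout:
# [EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
LOG_LINE = re.compile(
    r"\[EP (?P<episode>\d+)\] total_reward=(?P<total_reward>\S+) "
    r"steps=(?P<steps>\d+) epsilon=(?P<epsilon>\S+) avg_loss=(?P<avg_loss>\S+)"
)


def parse_episode_line(line):
    """Return the fields of one episode line as a dict, or None if no match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None  # e.g. a line from the architecture description
    return {
        "episode": int(match["episode"]),
        "total_reward": float(match["total_reward"]),
        "steps": int(match["steps"]),
        "epsilon": float(match["epsilon"]),
        "avg_loss": float(match["avg_loss"]),
    }


# Invented sample line, for illustration only.
sample = "[EP 0042] total_reward=3.0 steps=117 epsilon=0.8102 avg_loss=0.0312"
record = parse_episode_line(sample)
```

Lines that do not match, such as the architecture description at the top of the file, return ``None`` and can be skipped.
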

Checkpoint files are PyTorch state dictionaries containing model
weights, optimizer state, the current epsilon, and the total number of
training steps completed. They can be loaded with
``retro-gamer play`` or directly with the Python API.


Python API
----------

For advanced use, ``retro-gamer``'s components are importable as a
library.

.. code-block:: python

    from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
    from retro.examples.snake import create_game

    # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
    metadata = GameMetadata.from_pyproject("retro.examples.snake")

    trainer = DQNTrainer(
        create_game, metadata, "runs/snake/",
        training_episodes=500,
        n_layers=2,
        layer_size=128,
    )
    trainer.train()


``GameEnvironment`` provides a gym-style interface for stepping through
a game programmatically:

.. code-block:: python

    from retro.examples.snake import create_game
    from retro_gamer import GameEnvironment, GameMetadata

    # Same metadata lookup as in the trainer example
    metadata = GameMetadata.from_pyproject("retro.examples.snake")

    env = GameEnvironment(create_game, metadata)
    obs = env.reset()  # returns initial observation vector
    obs, reward, done = env.step("KEY_RIGHT")


The observation is a flat NumPy array of dtype ``float32``. For spatial
games, the first ``C × H × W`` elements are the board (channel-first
one-hot encoding); for non-spatial games, the board is encoded
``H × W × C`` and then flattened. Any ``observe_state`` values are
appended at the end.

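
The difference between the two board layouts comes down to index arithmetic. The helpers below are a sketch that assumes ordinary row-major flattening, which the description above suggests but does not state outright; the function names are invented for this example.

```python
def channel_first_index(c, y, x, C, H, W):
    """Flat position of (channel c, row y, column x) in a C x H x W board."""
    return (c * H + y) * W + x


def channel_last_index(c, y, x, C, H, W):
    """Flat position of the same cell in an H x W x C board."""
    return (y * W + x) * C + c


def observation_length(C, H, W, n_state=0):
    """Board values plus any appended observe_state values."""
    return C * H * W + n_state


C, H, W = 6, 2, 3  # six characters on a made-up 2 x 3 board
first = channel_first_index(1, 0, 0, C, H, W)  # 6: a whole H x W plane later
last = channel_last_index(1, 0, 0, C, H, W)    # 1: the very next element
length = observation_length(C, H, W, n_state=2)  # 36 board values + 2 = 38
```

In the channel-first layout each character's plane is contiguous, which suits a CNN; in the channel-last layout each cell's one-hot vector is contiguous.
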