Reference
=========

Game description fields
-----------------------

Game descriptions are written in the ``[tool.retro-gamer]`` section of
your game project's ``pyproject.toml``. ``retro-gamer create`` reads
this section and copies the metadata into the training run's
``config.toml``, where it can also be inspected or hand-edited.

A complete example for the Snake game:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
    spatial = true
    observe_state = []

You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.

The fields are described below.

``actions``
~~~~~~~~~~~

**Required.** A list of keystroke names the agent may send to the game
each turn. Use arrow-key names (such as ``KEY_RIGHT``) for games played
with the arrow keys, or single characters for games played with letter
or symbol keys.

.. code-block:: toml

    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]

In addition to these actions, the agent can choose a no-op (sending no
keystroke), so the Q-network has ``len(actions) + 1`` outputs.
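
A minimal sketch of how the configured actions and the no-op might map to Q-network output indices. The helpers ``build_action_space`` and ``greedy_keystroke`` are hypothetical, and representing the no-op as ``None`` in the last position is an assumption; this reference does not document ``retro-gamer``'s actual ordering.

```python
def build_action_space(actions):
    """Return the keystroke for each Q-network output index (hypothetical helper).

    Assumption: the no-op is represented by None and placed in the last slot.
    """
    # Index i < len(actions) sends actions[i]; the final index sends nothing.
    return list(actions) + [None]


def greedy_keystroke(q_values, action_space):
    """Keystroke for the highest-valued output, or None for the no-op."""
    if len(q_values) != len(action_space):
        raise ValueError("expected one Q-value per action, including the no-op")
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return action_space[best]


space = build_action_space(["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"])
print(len(space))  # 5, i.e. len(actions) + 1
print(greedy_keystroke([0.1, 0.4, 0.2, 0.0, 0.9], space))  # None (the no-op wins)
```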

``reward``
~~~~~~~~~~

**Required.** The key in the game's state dictionary to use as the
reward signal. The reward computed for each turn is the *change* in
this value from the previous turn.
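
The per-turn reward rule can be sketched as follows, assuming the game's state is a plain dictionary; ``turn_reward`` is a hypothetical helper, not part of ``retro-gamer``'s API.

```python
def turn_reward(prev_state, state, reward_key="score"):
    """Reward for one turn: the change in state[reward_key] since the previous turn."""
    return state[reward_key] - prev_state[reward_key]


print(turn_reward({"score": 10}, {"score": 13}))  # 3
print(turn_reward({"score": 13}, {"score": 13}))  # 0
```

Because only the change is rewarded, a turn on which the score stays the same yields a reward of zero.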

.. code-block:: toml

    reward = "score"

``character_set``
~~~~~~~~~~~~~~~~~

**Optional.** A list of single characters that may appear on the board.
Each character occupies one "slot" in the one-hot encoding. Characters
not in this list are treated as empty space.

.. code-block:: toml

    character_set = ["@", "*", ">", "<", "^", "v"]
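
The one-hot encoding can be sketched with NumPy. For illustration only, this assumes that the board arrives as a list of equal-length strings and that the encoding is channel-first with one channel per entry in ``character_set``; ``one_hot_board`` is a hypothetical helper.

```python
import numpy as np


def one_hot_board(rows, character_set):
    """Encode a board (list of equal-length strings) as a (C, H, W) float32 array.

    Each character in character_set gets its own channel; any other character
    (including spaces) leaves every channel at 0, i.e. it is treated as empty.
    """
    index = {ch: i for i, ch in enumerate(character_set)}
    height, width = len(rows), len(rows[0])
    encoded = np.zeros((len(character_set), height, width), dtype=np.float32)
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            if ch in index:
                encoded[index[ch], y, x] = 1.0
    return encoded


board = ["@* ", " > ", "  ?"]
enc = one_hot_board(board, ["@", "*", ">", "<", "^", "v"])
print(enc.shape)  # (6, 3, 3)
print(enc[:, 2, 2].sum())  # 0.0: "?" is not in the character set
```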

If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.

``spatial``
~~~~~~~~~~~

**Optional.** Default: ``true``. Whether to treat the board as a 2D
spatial scene. When ``true``, the trainer uses a convolutional neural
network (CNN) that can detect patterns in the relative positions of
characters. When ``false``, the trainer uses a multilayer perceptron
(MLP) that sees the board as a flat list of numbers without positional
structure.

.. code-block:: toml

    spatial = true

``observe_state``
~~~~~~~~~~~~~~~~~

**Optional.** Default: ``[]``. A list of keys from the game's state
dictionary whose values are appended to the observation vector. The
values must be numbers (integers, floats, or booleans). The reward key
must not appear in this list.
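
The two rules for ``observe_state`` can be checked with a small validator like the one below. ``validate_observe_state`` is hypothetical, and converting every value to ``float`` is an assumption based on the observation being a ``float32`` array.

```python
import numbers


def validate_observe_state(observe_state, reward_key, state):
    """Check the observe_state rules against one state dict; return the values as floats."""
    if reward_key in observe_state:
        raise ValueError(f"reward key {reward_key!r} must not be in observe_state")
    values = []
    for key in observe_state:
        value = state[key]
        # bool is a subclass of int, so booleans pass this check as well.
        if not isinstance(value, numbers.Real):
            raise TypeError(f"state[{key!r}] must be numeric, got {type(value).__name__}")
        values.append(float(value))
    return values


print(validate_observe_state(["lives", "level"], "score",
                             {"score": 40, "lives": 3, "level": 2}))  # [3.0, 2.0]
```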

.. code-block:: toml

    observe_state = ["lives", "level"]

.. _hyperparameters:

Hyperparameters
---------------

Hyperparameters are stored in the ``[hyperparameters]`` section of
``config.toml``. They can be set with ``retro-gamer create`` options or
by editing ``config.toml`` directly.

Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~

``learning_rate`` (default: ``0.001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values can speed up learning but may make it
    unstable; smaller values are more stable but learn more slowly.

``lr_decay`` (default: ``0.995``)
    Multiplicative factor applied to the learning rate after each
    episode, so the learning rate decreases geometrically over training.
    The smaller late-training updates let the network fine-tune without
    undoing earlier progress.
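
Because the decay is applied after each episode, the learning rate after ``n`` episodes is ``learning_rate * lr_decay ** n``. The sketch below assumes no lower bound on the learning rate (none is documented); ``learning_rate_at`` is a hypothetical helper.

```python
def learning_rate_at(episode, learning_rate=0.001, lr_decay=0.995):
    """Learning rate after `episode` completed episodes under geometric decay."""
    return learning_rate * lr_decay ** episode


for ep in (0, 100, 500, 1000):
    print(ep, f"{learning_rate_at(ep):.2e}")
```

With the defaults, the learning rate falls to about 6.65e-06 by episode 1000 (the default ``training_episodes``), roughly 1/150 of its starting value.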

``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
    agent value all future rewards equally; smaller values make the
    agent increasingly myopic, favoring rewards that arrive sooner.
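
The discount factor defines the discounted return, which is the quantity the agent learns to maximize: ``G = r_0 + gamma * r_1 + gamma**2 * r_2 + ...``. The sketch below computes it with the standard backward recursion. It illustrates the definition and is not code from ``retro-gamer``.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * rewards[k] over a sequence of future rewards."""
    total = 0.0
    # Iterate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A single reward of 1 that arrives 100 turns from now:
delayed = [0.0] * 100 + [1.0]
print(round(discounted_return(delayed, gamma=0.99), 3))  # 0.366
print(discounted_return(delayed, gamma=1.0))             # 1.0
```

At the default ``gamma = 0.99``, a reward 100 turns away is worth only about a third of the same reward received immediately.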

Exploration
~~~~~~~~~~~

``epsilon`` (default: ``1.0``)
    The initial exploration rate. On each turn, the agent takes a
    random action with probability ``epsilon`` and otherwise (with
    probability ``1 - epsilon``) takes the action its current
    Q-function rates highest.

``epsilon_decay`` (default: ``0.995``)
    Multiplicative factor applied to ``epsilon`` after each episode.

``epsilon_min`` (default: ``0.05``)
    The floor below which ``epsilon`` will not fall. A small amount of
    continued exploration prevents the agent from becoming permanently
    committed to a suboptimal policy.
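
The exploration schedule and the epsilon-greedy choice can be sketched together. As described above, decay is applied once per episode and clamped at ``epsilon_min``; the function names are hypothetical.

```python
import random


def epsilon_at(episode, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    """Exploration rate after `episode` completed episodes, clamped at epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay ** episode)


def choose_action(q_values, eps, rng=random):
    """Epsilon-greedy: a random index with probability eps, else the argmax of q_values."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# First episode count at which the floor takes over, using the defaults.
first_floor = next(n for n in range(10_000) if epsilon_at(n) == 0.05)
print(first_floor)  # 598
```

With the defaults, ``epsilon`` reaches the 0.05 floor at episode 598, so a default 1000-episode run spends its remaining episodes at the floor.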

Memory and sampling
~~~~~~~~~~~~~~~~~~~

``batch_size`` (default: ``64``)
    The number of experiences sampled from the replay buffer per
    training step.

``memory_capacity`` (default: ``10000``)
    The maximum number of experiences the replay buffer can hold. When
    it is full, the oldest experiences are discarded.
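
A fixed-capacity replay buffer with uniform sampling can be sketched with ``collections.deque``, whose ``maxlen`` argument discards the oldest entry when the deque overflows. The class name and the five-field experience tuple are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience store; the oldest entries are evicted first."""

    def __init__(self, memory_capacity=10_000):
        self.memory = deque(maxlen=memory_capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement (prioritize_experiences = false).
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


buf = ReplayBuffer(memory_capacity=3)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)
print(len(buf), [exp[0] for exp in buf.memory])  # 3 [2, 3, 4]
```

``random.sample`` raises ``ValueError`` if the buffer holds fewer than ``batch_size`` experiences, so a trainer built this way has to wait until the buffer has filled that far before taking a training step.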

``prioritize_experiences`` (default: ``false``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger temporal-difference (TD) errors are sampled
    more frequently. This often improves sample efficiency at a modest
    computational cost.
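
This reference does not say which prioritization scheme ``retro-gamer`` uses. The sketch below shows one common formulation, proportional prioritization, in which each experience's sampling probability is proportional to ``(|TD error| + eps) ** alpha``. Here ``alpha`` and ``eps`` are hypothetical parameters, and the importance-sampling weights that usually accompany prioritized replay are omitted.

```python
import random


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|td_error| + eps) ** alpha."""
    priorities = [(abs(td) + eps) ** alpha for td in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_prioritized(experiences, td_errors, batch_size, rng=random):
    """Draw batch_size experiences (with replacement) according to their priorities."""
    probs = priority_probabilities(td_errors)
    indices = rng.choices(range(len(experiences)), weights=probs, k=batch_size)
    return [experiences[i] for i in indices], indices


print(priority_probabilities([0.0, 1.0, 4.0], alpha=1.0, eps=0.0))  # [0.0, 0.2, 0.8]
```

Setting ``alpha = 0`` gives every experience the same probability, which recovers uniform sampling.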

Network architecture
~~~~~~~~~~~~~~~~~~~~

``n_layers`` (default: ``2``)
    The number of hidden layers in the MLP head. For spatial games, the
    MLP head follows the CNN; for non-spatial games, it is the whole
    network.

``layer_size`` (default: ``128``)
    The width (number of units) of each hidden layer.
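
The effect of ``n_layers`` and ``layer_size`` on the fully connected part of the network can be sketched as a list of layer shapes. This assumes that the final layer maps the last hidden layer to the ``len(actions) + 1`` Q-values. For a non-spatial game, ``input_size`` would be the length of the observation vector. For a spatial game, it would be the flattened CNN output size, which this reference does not specify. ``mlp_layer_shapes`` is hypothetical.

```python
def mlp_layer_shapes(input_size, n_layers=2, layer_size=128, n_outputs=5):
    """(in_features, out_features) for each fully connected layer of the MLP."""
    sizes = [input_size] + [layer_size] * n_layers + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


# Non-spatial example: a 6-character, 10x10 board with no observe_state values,
# and 4 actions plus the no-op.
print(mlp_layer_shapes(6 * 10 * 10, n_layers=2, layer_size=128, n_outputs=4 + 1))
# [(600, 128), (128, 128), (128, 5)]
```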

Training duration
~~~~~~~~~~~~~~~~~

``training_episodes`` (default: ``1000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.

``max_turns_per_episode`` (default: ``2000``)
    A safety cutoff preventing a single episode from running
    indefinitely (for example, if the agent finds a way to avoid
    dying).
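
An episode loop with the ``max_turns_per_episode`` cutoff can be sketched against the ``reset``/``step`` interface shown in the Python API section. ``run_episode`` and the ``NeverEndingGame`` stand-in environment are hypothetical.

```python
def run_episode(env, choose_action, max_turns_per_episode=2000):
    """Play one episode; stop when the game ends or the turn limit is reached."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for steps in range(1, max_turns_per_episode + 1):
        obs, reward, done = env.step(choose_action(obs))
        total_reward += reward
        if done:
            break
    return total_reward, steps


class NeverEndingGame:
    """Hypothetical stand-in environment that never signals the end of the game."""

    def reset(self):
        return [0.0]

    def step(self, action):
        return [0.0], 1.0, False


print(run_episode(NeverEndingGame(), lambda obs: "KEY_RIGHT", max_turns_per_episode=50))
# (50.0, 50)
```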

``target_update_freq`` (default: ``100``)
    The number of training steps between updates of the target network,
    the copy of the Q-network used to compute training targets. More
    frequent updates make the targets shift more often, which can make
    training less stable; less frequent updates keep the targets stable
    but make them slower to reflect new learning.
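
In standard DQN, the target network enters the update through the training target ``r + gamma * max_a' Q_target(s', a')`` (just ``r`` at the end of an episode). The target network is refreshed by copying the online network's weights every ``target_update_freq`` steps. The sketch below shows both pieces with plain Python objects standing in for network weights. In PyTorch, the copy would typically be ``target_net.load_state_dict(online_net.state_dict())``. The names are hypothetical, and whether ``retro-gamer`` uses exactly this hard-update scheme is an assumption.

```python
import copy


def td_target(reward, next_q_values, done, gamma=0.99):
    """DQN training target: r + gamma * max Q_target(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


class TargetSync:
    """Copy the online weights into the target network every target_update_freq steps."""

    def __init__(self, online_weights, target_update_freq=100):
        self.target_update_freq = target_update_freq
        self.target_weights = copy.deepcopy(online_weights)
        self.steps = 0

    def after_training_step(self, online_weights):
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_weights = copy.deepcopy(online_weights)


print(round(td_target(1.0, [0.5, 2.0], done=False), 6))  # 2.98

online = {"w": [0.0]}  # plain dict standing in for a network's state dict
sync = TargetSync(online, target_update_freq=2)
online["w"][0] = 1.0
sync.after_training_step(online)  # step 1: target keeps the old weights
print(sync.target_weights["w"])   # [0.0]
sync.after_training_step(online)  # step 2: weights are copied
print(sync.target_weights["w"])   # [1.0]
```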

Character discovery
~~~~~~~~~~~~~~~~~~~

``exploration_turns`` (default: ``200``)
    The number of random turns to run at the start of training to
    discover which characters appear on the board. Used only when
    ``character_set`` is not specified.
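
Character discovery can be sketched as collecting every character seen during the exploration turns. In the sketch, ``boards`` stands in for the boards produced by random play. ``discover_characters`` is hypothetical, and treating the space character as empty and sorting the result are assumptions.

```python
def discover_characters(boards, exploration_turns=200):
    """Characters seen on the first `exploration_turns` boards, excluding spaces."""
    seen = set()
    for turn, board in enumerate(boards):
        if turn >= exploration_turns:
            break
        for row in board:
            seen.update(row)
    seen.discard(" ")  # assumption: a space marks an empty cell
    return sorted(seen)


# Hypothetical boards from three random turns; only the first two are examined.
boards = [["@* ", "   "], [" @>", "*  "], ["v  ", "   "]]
print(discover_characters(boards, exploration_turns=2))  # ['*', '>', '@']
```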

``unknown_character_strategy`` (default: ``"ignore"``)
    What to do when a character that is not in the established
    ``character_set`` appears during training. ``"ignore"`` treats it as
    an empty cell; ``"extend"`` rebuilds the model with an extended
    character set.
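
The two strategies can be sketched as a function that returns the character set to use and whether the model would need rebuilding. ``handle_unknown_characters`` is hypothetical. Appending new characters (and therefore new one-hot channels) at the end of the set is an assumption. The model rebuild itself is not shown.

```python
def handle_unknown_characters(board, character_set, strategy="ignore"):
    """Apply unknown_character_strategy to one board (a list of strings).

    Returns (character_set_to_use, needs_rebuild).
    """
    if strategy not in ("ignore", "extend"):
        raise ValueError(f"unsupported strategy: {strategy!r}")
    unknown = []
    for row in board:
        for ch in row:
            if ch != " " and ch not in character_set and ch not in unknown:
                unknown.append(ch)
    if strategy == "extend" and unknown:
        # New characters get new one-hot channels, so the model must be rebuilt.
        return list(character_set) + unknown, True
    # "ignore" (or nothing unknown): unknown characters encode as empty cells.
    return list(character_set), False


print(handle_unknown_characters(["@$"], ["@", "*"], "ignore"))  # (['@', '*'], False)
print(handle_unknown_characters(["@$"], ["@", "*"], "extend"))  # (['@', '*', '$'], True)
```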

CLI reference
-------------

``retro-gamer create``
~~~~~~~~~~~~~~~~~~~~~~

Create a new training run directory with ``config.toml``. Game metadata
is read automatically from the ``[tool.retro-gamer]`` section of your
game's ``pyproject.toml``; you do not pass it on the command line.

.. code-block:: console

    % retro-gamer create --game MODULE --output DIR [OPTIONS]

**Required options:**

- ``--game MODULE``: Python module containing ``create_game()``
  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
  is read from the ``pyproject.toml`` found in or above the module's
  source directory.
- ``--output DIR``: directory to create for this training run.

**Hyperparameter options** (all optional; see :ref:`hyperparameters`):

- ``--training-episodes N``
- ``--n-layers N``
- ``--layer-size N``
- ``--learning-rate F``
- ``--lr-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
- ``--batch-size N``
- ``--memory-capacity N``
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``

``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~

Train (or resume training) a DQN agent.

.. code-block:: console

    % retro-gamer train RUN_DIR [--resume CHECKPOINT]

``RUN_DIR`` must contain a ``config.toml`` generated by
``retro-gamer create``. If ``--resume`` is given, training resumes from
the specified checkpoint file (relative or absolute path).

``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~

Watch a trained agent play the game in the terminal.

.. code-block:: console

    % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.

``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~

Print a summary of a training run: metadata, hyperparameters, recent
episode log, and available checkpoints.

.. code-block:: console

    % retro-gamer info RUN_DIR

Training run directory structure
--------------------------------

A training run is a self-contained directory with the following
contents:

.. code-block:: text

    runs/snake/
    ├── config.toml        # game description + hyperparameters
    ├── training.log       # architecture rationale + per-episode log
    └── checkpoints/
        ├── ep_0100.pt     # model weights at episode 100
        ├── ep_0200.pt
        ├── ...
        └── final.pt       # model weights at training completion

``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set and resolved hyperparameters) when
``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
and ``train`` is the recommended way to adjust hyperparameters.

``training.log`` begins with the full architecture description
generated at training startup, followed by one line per episode in the
format::

    [EP NNNN] total_reward=F steps=N epsilon=F avg_loss=F
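
Per-episode lines in this format can be parsed with a regular expression, for example to plot training progress. The sketch assumes that the fields are separated by single spaces exactly as shown. ``parse_episode_line`` is hypothetical, and the sample line is made up for illustration, not real output.

```python
import re

LOG_LINE = re.compile(
    r"\[EP (?P<episode>\d+)\] total_reward=(?P<total_reward>\S+) "
    r"steps=(?P<steps>\d+) epsilon=(?P<epsilon>\S+) avg_loss=(?P<avg_loss>\S+)"
)


def parse_episode_line(line):
    """Parse one per-episode line of training.log; return None for any other line."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # e.g. a line of the architecture description
    fields = match.groupdict()
    return {
        "episode": int(fields["episode"]),
        "total_reward": float(fields["total_reward"]),
        "steps": int(fields["steps"]),
        "epsilon": float(fields["epsilon"]),
        "avg_loss": float(fields["avg_loss"]),
    }


# Made-up sample line in the documented format:
print(parse_episode_line("[EP 0042] total_reward=7.0 steps=183 epsilon=0.811 avg_loss=0.0213"))
```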

Checkpoint files are PyTorch state dictionaries containing model
weights, optimizer state, the current epsilon, and the total number of
training steps completed. They can be loaded with
``retro-gamer play`` or directly with the Python API.

Python API
----------

For advanced use, ``retro-gamer``'s components are importable as a
library.

.. code-block:: python

    from retro_gamer import GameMetadata, DQNTrainer
    from retro.examples.snake import create_game

    # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
    metadata = GameMetadata.from_pyproject("retro.examples.snake")

    trainer = DQNTrainer(
        create_game, metadata, "runs/snake/",
        training_episodes=500,
        n_layers=2,
        layer_size=128,
    )
    trainer.train()

``GameEnvironment`` provides a gym-style interface for stepping through
a game programmatically:

.. code-block:: python

    from retro_gamer import GameEnvironment, GameMetadata
    from retro.examples.snake import create_game

    metadata = GameMetadata.from_pyproject("retro.examples.snake")
    env = GameEnvironment(create_game, metadata)
    obs = env.reset()                          # initial observation vector
    obs, reward, done = env.step("KEY_RIGHT")  # send one keystroke

The observation is a flat NumPy array of dtype ``float32``. Here ``C``
is the number of characters in ``character_set``, and ``H`` and ``W``
are the board's height and width. For spatial games, the first
``C × H × W`` elements are the board (channel-first one-hot encoding);
for non-spatial games, the board is encoded ``H × W × C`` and then
flattened. Any ``observe_state`` values are appended at the end.
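
The two board layouts differ only in the order of the flattened elements. The sketch below assumes that the one-hot board is already available as a ``(C, H, W)`` array and shows both orderings with the ``observe_state`` values appended. ``build_observation`` is a hypothetical helper, not part of ``retro_gamer``.

```python
import numpy as np


def build_observation(board_chw, state_values, spatial=True):
    """Flatten a (C, H, W) one-hot board and append observe_state values.

    spatial=True keeps the channel-first order (C x H x W); spatial=False
    reorders to H x W x C before flattening.
    """
    board = board_chw if spatial else np.transpose(board_chw, (1, 2, 0))
    extras = np.asarray(state_values, dtype=np.float32)
    return np.concatenate([board.reshape(-1), extras]).astype(np.float32)


# 2 characters on a 1x2 board; both cells hold character 0.
chw = np.array([[[1, 1]], [[0, 0]]], dtype=np.float32)
print(build_observation(chw, [3], spatial=True))   # [1. 1. 0. 0. 3.]
print(build_observation(chw, [3], spatial=False))  # [1. 0. 1. 0. 3.]
```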