Reference
=========

Game description fields
-----------------------

Game descriptions are written in the ``[tool.retro-gamer]`` section of
your game project's ``pyproject.toml``. ``retro-gamer create`` reads
this section and copies the metadata into the training run's
``config.toml``, where it can also be inspected or hand-edited.

A complete example for the Snake game:

.. code-block:: toml

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]

You do not need to specify the board size: ``retro-gamer`` reads it
directly from your game's ``board_size`` attribute.

The fields are described below.

``actions``
~~~~~~~~~~~

**Required.** A list of keystroke names the agent may send to the game
each turn. Use arrow key names for directional games, or single
characters for character-key games.

.. code-block:: toml

   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]

The agent also has access to a no-op action (doing nothing). The total
number of actions in the Q-network output is ``len(actions) + 1``.

``reward``
~~~~~~~~~~

**Required.** The key in the game's state dictionary to use as the
reward signal. The reward computed for each turn is the *change* in
this value from the previous turn.

.. code-block:: toml

   reward = "score"
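
A minimal sketch of the delta rule described above. The function name
``reward_from_state`` and the plain-dict state are illustrative
assumptions, not part of ``retro-gamer``'s API:

.. code-block:: python

   def reward_from_state(previous_state, current_state, key="score"):
       """Return the per-turn reward: the change in state[key] since last turn."""
       return current_state[key] - previous_state[key]


   # Eating an apple raises the score from 3 to 4, so the reward is 1.
   apple_reward = reward_from_state({"score": 3}, {"score": 4})
   # A turn with no score change yields a reward of 0.
   idle_reward = reward_from_state({"score": 4}, {"score": 4})

Under this rule a falling value produces a negative reward, so a game
that subtracts from ``score`` as a penalty would penalise the agent.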

``character_set``
~~~~~~~~~~~~~~~~~

**Optional.** A list of single characters that may appear on the board.
Each character occupies one "slot" in the one-hot encoding. Characters
not in this list are treated as empty space.

.. code-block:: toml

   character_set = ["@", "*", ">", "<", "^", "v"]

If omitted, ``retro-gamer`` runs an exploration phase to discover the
characters that appear in practice. The length of this phase is
controlled by the ``exploration_turns`` hyperparameter.
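
The one-hot scheme can be sketched as follows. This is an illustration
only: ``one_hot_board`` is a hypothetical helper, the board is assumed
to be a list of equal-length strings, and characters outside
``character_set`` are assumed to encode as an all-zero vector (the real
encoder may instead reserve a slot for empty space).

.. code-block:: python

   def one_hot_board(rows, character_set):
       """Encode a board (list of equal-length strings) as nested one-hot lists.

       Assumption: characters outside character_set, including blanks, map
       to an all-zero vector.
       """
       index = {ch: i for i, ch in enumerate(character_set)}
       encoded = []
       for row in rows:
           encoded_row = []
           for ch in row:
               cell = [0] * len(character_set)
               if ch in index:
                   cell[index[ch]] = 1  # set the slot owned by this character
               encoded_row.append(cell)
           encoded.append(encoded_row)
       return encoded


   character_set = ["@", "*", ">", "<", "^", "v"]
   board = ["  * ", " >  "]
   encoded = one_hot_board(board, character_set)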

Preprocessing options
---------------------

Preprocessing options live in the ``[preprocessing]`` section of a run's
``config.toml``. They control how the game's board and state are
transformed into the observation vector that the neural network sees.
``retro-gamer create`` writes sensible defaults; you can edit them by
hand before running ``retro-gamer train``.

.. note::

   Changes to any ``[preprocessing]`` option—or to the game description
   fields above—make existing checkpoints incompatible. Run
   ``retro-gamer clean`` before retraining after such changes.

``spatial`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whether to treat the board as a 2D spatial scene. When ``true``, the
trainer uses a convolutional neural network (CNN); when ``false``, a
multilayer perceptron (MLP) that sees the board as a flat list of
numbers.

``board`` (default: ``true``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whether to include the board encoding in the observation vector. Set
to ``false`` to train on game state variables only, with no board at
all. This is useful for games with small, enumerable state spaces, the
kind for which a lookup table (classic Q-learning) would suffice.

When ``board = false``:

- ``spatial`` must also be ``false`` (no board means no 2D scene for a CNN).
- At least one key must be listed in ``observe_state``.
- ``character_set`` is not required and character discovery is skipped.

.. code-block:: toml

   [preprocessing]
   board = false
   observe_state = ["board_state"]

``observe_state`` (default: ``[]``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A list of keys from ``game.state`` to include in the observation
vector, appended after the board encoding (or as the entire
observation when ``board = false``). Scalar values contribute one
element each; list or tuple values are flattened.

.. code-block:: toml

   observe_state = ["apple_dx", "apple_dy"]

The keys must be present in ``game.state`` at every step, so initialize
them in ``create_game()`` before the game starts. Every list or tuple
value must keep the same length from episode to episode.

.. warning::

   ``observe_state`` keys must be initialized to their final shape in
   ``create_game()`` before the game starts. If a key is absent or its
   list length changes between episodes, training will crash with an
   error explaining which key changed and by how much. This happens
   because the neural network's input layer has a fixed size determined
   at the start of training; it cannot adapt to a changing observation
   shape mid-run.

   Always initialize every observed key with a placeholder of the
   correct type and length before the first ``game.step()`` call.
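
The flattening rule for ``observe_state`` can be sketched as follows.
``flatten_state`` is a hypothetical helper written for this
illustration, and it assumes sequences are flat (not nested); the
trainer's own implementation may differ.

.. code-block:: python

   def flatten_state(state, keys):
       """Flatten the selected state values into one list of numbers."""
       flat = []
       for key in keys:
           value = state[key]  # raises KeyError if an observed key is missing
           if isinstance(value, (list, tuple)):
               flat.extend(value)  # sequences contribute one element per item
           else:
               flat.append(value)  # scalars contribute a single element
       return flat


   # Made-up state for illustration; "walls" is not a real Snake key.
   state = {"score": 0, "apple_dx": 3, "apple_dy": -2, "walls": (1, 0, 0, 1)}
   observation = flatten_state(state, ["apple_dx", "apple_dy", "walls"])

Here the observation has six elements: one for each scalar and four for
the tuple. If ``walls`` had a different length in a later episode, the
observation length would change, which is the situation the warning
above describes.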

``observe_state_sizes`` (auto-discovered)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A table mapping each ``observe_state`` key to its flat size (``1`` for
scalars, ``N`` for sequences of length N). This is written automatically
to ``config.toml`` the first time ``retro-gamer train`` runs, after the
trainer samples ``game.state`` to discover the actual sizes:

.. code-block:: toml

   observe_state_sizes = {board_state = 9}

You do not need to set this manually. Once written, it is used to
detect changes in state shape when resuming training—an incompatible
change here requires running ``retro-gamer clean`` and starting fresh.
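
A rough sketch of how such a size table could be built and checked.
Both function names are hypothetical, and the comparison logic is an
assumption about the behaviour described above rather than the
trainer's actual code.

.. code-block:: python

   def discover_sizes(state, keys):
       """Map each observed key to its flat size: 1 for scalars, len() for sequences."""
       return {
           key: len(state[key]) if isinstance(state[key], (list, tuple)) else 1
           for key in keys
       }


   def changed_sizes(recorded, state):
       """Return {key: (recorded_size, current_size)} for keys whose size differs."""
       current = discover_sizes(state, recorded)
       return {
           key: (recorded[key], current[key])
           for key in recorded
           if recorded[key] != current[key]
       }


   recorded = discover_sizes({"board_state": [0] * 9}, ["board_state"])
   mismatch = changed_sizes(recorded, {"board_state": [0] * 16})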

``egocentric`` (default: ``false``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``true``, the board observation is cropped to a square window
centred on a specific agent instead of covering the full board. This
gives the agent a local, first-person-like view and makes the
observation invariant to the agent's absolute position on the board.

Requires ``egocentric_player`` and ``egocentric_radius``.

``egocentric_player``
~~~~~~~~~~~~~~~~~~~~~~

The name of the agent to use as the centre of the egocentric crop.
Must match the ``name`` attribute of one of the game's agents.

.. code-block:: toml

   egocentric_player = "Snake head"

``egocentric_radius``
~~~~~~~~~~~~~~~~~~~~~~

The half-side-length of the egocentric crop window, in cells. The
resulting observation covers a ``(2r+1) × (2r+1)`` region. Larger
values give the agent a wider view; smaller values focus it on the
immediate vicinity.

.. code-block:: toml

   egocentric_radius = 8   # 17×17 window

When ``egocentric_radius`` is set, ``board_size`` in ``[metadata]`` is
automatically updated to ``[2r+1, 2r+1]`` so the network is sized
correctly.
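
The crop geometry can be sketched as follows. ``egocentric_crop`` is a
hypothetical helper; the ``(row, col)`` coordinate order and the
padding of cells beyond the board edge with a fill character are
assumptions made for this illustration.

.. code-block:: python

   def egocentric_crop(rows, center, radius, fill=" "):
       """Return a (2r+1) x (2r+1) window of the board centred on `center`.

       rows is a list of equal-length strings; center is (row, col).
       Assumption: cells beyond the board edge are padded with `fill`.
       """
       height, width = len(rows), len(rows[0])
       center_row, center_col = center
       window = []
       for r in range(center_row - radius, center_row + radius + 1):
           line = ""
           for c in range(center_col - radius, center_col + radius + 1):
               inside = 0 <= r < height and 0 <= c < width
               line += rows[r][c] if inside else fill
           window.append(line)
       return window


   board = ["....", ".>*.", "...."]
   window = egocentric_crop(board, center=(1, 1), radius=1)
   corner = egocentric_crop(board, center=(0, 0), radius=1)

The window always has ``2 * radius + 1`` rows and columns regardless of
where the agent stands, which is why the network can be sized from the
radius alone.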

.. _hyperparameters:

Hyperparameters
---------------

Hyperparameters are split across two sections of ``config.toml``:

- ``[model]`` — network architecture (changing these requires starting fresh)
- ``[training]`` — learning algorithm parameters (safe to change at any time)

Both sections can be set via ``retro-gamer create`` options or edited directly.

Learning and optimization
~~~~~~~~~~~~~~~~~~~~~~~~~

``learning_rate`` (default: ``0.0001``)
    The step size used by the Adam optimizer when updating network
    weights. Larger values converge faster but may be unstable; smaller
    values are more stable but slower.

``learning_rate_decay`` (default: ``0.9999``)
    Multiplicative decay applied to the learning rate after each
    episode. The learning rate decreases geometrically over training,
    helping the network fine-tune later without destabilizing early
    progress. With the default value, the learning rate decays to about
    13.5 % of its starting value after 20 000 episodes.

``gamma`` (default: ``0.99``)
    The discount factor for future rewards. A value of 1.0 makes the
    agent value all future rewards equally; smaller values make the
    agent increasingly myopic.
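
The geometric schedule for ``learning_rate_decay`` can be checked with
a few lines of arithmetic. This is a sketch of the formula only,
assuming the decay is applied exactly once per episode;
``decayed_learning_rate`` is an illustrative name, not a
``retro-gamer`` function.

.. code-block:: python

   def decayed_learning_rate(initial, decay, episodes):
       """Learning rate after `episodes` multiplicative decay steps."""
       return initial * decay ** episodes


   initial = 0.0001
   final = decayed_learning_rate(initial, 0.9999, 20_000)
   fraction = final / initial  # roughly 0.135 of the starting value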

Exploration
~~~~~~~~~~~

``epsilon`` (default: ``1.0``)
    The initial exploration rate. At each turn, the agent takes a
    random action with probability ``epsilon`` and exploits its current
    Q-function with probability ``1 - epsilon``.

``epsilon_decay`` (default: ``0.9997``)
    Multiplicative decay applied to ``epsilon`` after each episode.

``epsilon_min`` (default: ``0.05``)
    The floor below which ``epsilon`` will not fall. A small amount of
    continued exploration prevents the agent from becoming permanently
    committed to a suboptimal policy.
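
The three exploration settings combine as in the sketch below. The
function names are illustrative, and the schedule assumes one decay
step per episode with the floor applied as a simple clamp.

.. code-block:: python

   import math
   import random


   def epsilon_at(episode, epsilon=1.0, decay=0.9997, epsilon_min=0.05):
       """Exploration rate after `episode` decay steps, clamped at the floor."""
       return max(epsilon_min, epsilon * decay ** episode)


   def choose_action(q_values, epsilon, rng=random):
       """Epsilon-greedy: random action with probability epsilon, else argmax."""
       if rng.random() < epsilon:
           return rng.randrange(len(q_values))
       return max(range(len(q_values)), key=q_values.__getitem__)


   # Episodes needed for the default schedule to reach the floor.
   episodes_to_floor = math.ceil(math.log(0.05 / 1.0) / math.log(0.9997))

With the defaults, ``epsilon`` reaches its floor after just under
10 000 episodes, about halfway through a default 20 000-episode run.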

Memory and sampling
~~~~~~~~~~~~~~~~~~~

``batch_size`` (default: ``64``)
    The number of experiences sampled from the replay buffer per
    training step.

``memory_capacity`` (default: ``50000``)
    The maximum number of experiences the replay buffer can hold. When
    full, the oldest experiences are discarded.

``prioritize_experiences`` (default: ``true``)
    Whether to use prioritized experience replay. When ``true``,
    experiences with larger TD errors are sampled more frequently.
    This often improves sample efficiency at a modest computational
    cost.
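
A simplified sketch of a fixed-capacity buffer with priority-weighted
sampling. ``ReplayBuffer`` here is a hypothetical class for
illustration only: it samples in direct proportion to the absolute TD
error, whereas a full prioritized-replay implementation typically also
uses a priority exponent and importance-sampling weights.

.. code-block:: python

   import random
   from collections import deque


   class ReplayBuffer:
       """Fixed-capacity buffer; the oldest experience is dropped when full."""

       def __init__(self, capacity=50_000):
           self.experiences = deque(maxlen=capacity)
           self.priorities = deque(maxlen=capacity)

       def add(self, experience, td_error=1.0):
           self.experiences.append(experience)
           # Small constant keeps zero-error experiences sampleable.
           self.priorities.append(abs(td_error) + 1e-6)

       def sample(self, batch_size=64, prioritize=True):
           weights = list(self.priorities) if prioritize else None
           # random.choices samples with replacement, weighted when given weights.
           return random.choices(list(self.experiences), weights=weights, k=batch_size)


   buffer = ReplayBuffer(capacity=3)
   for step in range(5):
       buffer.add(("state", step), td_error=step)

With a capacity of 3 and five additions, only the three most recent
experiences remain in the buffer.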

Model architecture
~~~~~~~~~~~~~~~~~~

These live in the ``[model]`` section. Changing them requires starting fresh
(run ``retro-gamer clean`` before retraining).

``hidden_sizes`` (default: ``[128, 64]``)
    A list of integers giving the size of each hidden layer in the MLP
    head. The default creates two layers: 128 units then 64. For spatial
    games this follows the CNN; for non-spatial games it is the full
    network. Larger or deeper networks can represent more complex
    Q-functions but train more slowly and may need more episodes.
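
To see how ``hidden_sizes`` translates into layers, the sketch below
lists the shape of each fully connected layer for the non-spatial
case. ``mlp_layer_shapes`` is a hypothetical helper and the input size
of 600 is made up; the output size of ``len(actions) + 1`` follows the
``actions`` field description.

.. code-block:: python

   def mlp_layer_shapes(input_size, hidden_sizes, n_actions):
       """Return (in_features, out_features) for each fully connected layer.

       The output layer has n_actions + 1 units: one per configured action
       plus the no-op action.
       """
       sizes = [input_size, *hidden_sizes, n_actions + 1]
       return list(zip(sizes[:-1], sizes[1:]))


   # Hypothetical input: a flat observation of 600 numbers and 4 actions.
   shapes = mlp_layer_shapes(600, [128, 64], n_actions=4)
   parameters = sum(i * o + o for i, o in shapes)  # weights plus biases

Because every entry in ``hidden_sizes`` changes a layer shape, weights
saved under one setting cannot be loaded under another.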

Training duration
~~~~~~~~~~~~~~~~~

``training_episodes`` (default: ``20000``)
    The total number of game episodes to run. Each episode runs until
    the game ends or ``max_turns_per_episode`` turns have elapsed.

``max_turns_per_episode`` (default: ``2000``)
    A safety cutoff preventing a single episode from running
    indefinitely (for example, if the agent finds a way to avoid
    dying).

``target_update_freq`` (default: ``500``)
    How many training steps between updates of the target network.
    More frequent updates make training targets move faster (less
    stable); less frequent updates make them more stable but slower
    to reflect new learning.

``train_every`` (default: ``4``)
    Run one training step every N game steps. Higher values speed up
    episode collection at the cost of fewer gradient updates per
    experience. The default of 4 is a good balance for most games;
    set to 1 to train on every step.
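
The interaction between ``train_every`` and ``target_update_freq`` can
be sketched by counting events over a run of game steps. This assumes
both counters run continuously across episodes and that training
starts from the first eligible step; ``count_updates`` is an
illustrative name.

.. code-block:: python

   def count_updates(game_steps, train_every=4, target_update_freq=500):
       """Count training steps and target-network syncs over `game_steps` turns."""
       training_steps = 0
       target_syncs = 0
       for step in range(1, game_steps + 1):
           if step % train_every == 0:
               training_steps += 1  # one gradient update on a sampled batch
               if training_steps % target_update_freq == 0:
                   target_syncs += 1  # copy online weights into the target network
       return training_steps, target_syncs


   training_steps, target_syncs = count_updates(10_000)

Under these assumptions, 10 000 game steps with the defaults yield
2 500 training steps and 5 target-network updates, so the target
network changes once every 2 000 game steps.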

Character discovery
~~~~~~~~~~~~~~~~~~~

``exploration_turns`` (default: ``200``)
    When ``character_set`` is not specified, the number of random
    turns to run at the start of training to discover which
    characters appear on the board.

``unknown_character_strategy`` (default: ``"ignore"``)
    What to do when a character appears during training that is not
    in the established ``character_set``. ``"ignore"`` treats it as
    an empty cell; ``"extend"`` rebuilds the model with an extended
    character set.

CLI reference
-------------

``retro-gamer create``
~~~~~~~~~~~~~~~~~~~~~~

Create a new training run directory with ``config.toml``. Game metadata
is read automatically from the ``[tool.retro-gamer]`` section of your
game's ``pyproject.toml``; you do not pass it on the command line.

.. code-block:: console

   % retro-gamer create --game GAME --output DIR [OPTIONS]

**Required options:**

- ``--game GAME`` — Your game, specified as a file path or a Python
  module name:

  - File path: ``--game my_game.py`` or ``--game my_game/``
  - Module name: ``--game retro.examples.snake``

  The ``[tool.retro-gamer]`` section is read from the ``pyproject.toml``
  found in or above the game file.
- ``--output DIR`` — Directory to create for this training run.

**Hyperparameter options** (all optional; see :ref:`hyperparameters`):

- ``--training-episodes N``
- ``--hidden-sizes SIZES`` — comma-separated, e.g. ``512,256``
- ``--learning-rate F``
- ``--learning-rate-decay F``
- ``--gamma F``
- ``--epsilon-decay F``
- ``--epsilon-min F``
- ``--batch-size N``
- ``--memory-capacity N``
- ``--target-update-freq N``
- ``--max-turns-per-episode N``
- ``--exploration-turns N``
- ``--train-every N``
- ``--prioritize-experiences`` / ``--no-prioritize-experiences``

``retro-gamer train``
~~~~~~~~~~~~~~~~~~~~~

Train a DQN agent.

.. code-block:: console

   % retro-gamer train RUN_DIR

``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
create``. If checkpoints already exist in ``RUN_DIR``, training
automatically resumes from the latest one so prior work is never lost.

If all configured episodes have already been completed, the command
prints a message and exits immediately. To keep training, increase
``training_episodes`` in ``config.toml`` and run again.

**Incompatible changes.** Some config changes make existing checkpoints
unusable. If you change any of the following, ``retro-gamer train`` will
detect the mismatch and refuse to resume, with a clear explanation:

- ``actions``, ``reward``, ``character_set``, ``board_size``
  (``[metadata]``) — game description
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``,
  ``egocentric``, ``egocentric_player``, ``egocentric_radius``
  (``[preprocessing]``) — observation encoding
- ``hidden_sizes`` (``[model]``) — network architecture

Run ``retro-gamer clean RUN_DIR`` to remove the old checkpoints and start
fresh. Other hyperparameter changes (learning rate, epsilon, etc.) are
safe and take effect immediately on the next training run.
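
The check can be pictured as a comparison of the listed keys between
the configuration a checkpoint was trained with and the current one.
The sketch below is an assumption about how such a comparison might
look, using plain dictionaries; ``incompatible_changes`` is a
hypothetical function and not the trainer's actual code.

.. code-block:: python

   # Keys listed above, grouped by config.toml section.
   INCOMPATIBLE_KEYS = {
       "metadata": ["actions", "reward", "character_set", "board_size"],
       "preprocessing": [
           "spatial", "board", "observe_state", "observe_state_sizes",
           "egocentric", "egocentric_player", "egocentric_radius",
       ],
       "model": ["hidden_sizes"],
   }


   def incompatible_changes(saved, current):
       """Return 'section.key' names whose values differ between two configs."""
       changed = []
       for section, keys in INCOMPATIBLE_KEYS.items():
           for key in keys:
               old = saved.get(section, {}).get(key)
               new = current.get(section, {}).get(key)
               if old != new:
                   changed.append(f"{section}.{key}")
       return changed


   saved = {"model": {"hidden_sizes": [128, 64]}, "training": {"gamma": 0.99}}
   current = {"model": {"hidden_sizes": [512, 256]}, "training": {"gamma": 0.95}}
   changed = incompatible_changes(saved, current)

In this example only ``model.hidden_sizes`` is reported; the change to
``gamma`` is ignored because ``[training]`` keys are safe to edit.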

``retro-gamer play``
~~~~~~~~~~~~~~~~~~~~

Watch a trained agent play the game in the terminal.

.. code-block:: console

   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]

By default, the latest available checkpoint is loaded. Use
``--checkpoint`` to load a specific one by name (e.g. ``ep_0100``).
``--framerate`` sets the target frames per second (default: 12). Press
Enter or Escape to quit.
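
Choosing the latest checkpoint amounts to picking the highest episode
number among the ``ep_NNNN.pt`` files. A small sketch, assuming the
episode number is always encoded in the file name;
``latest_checkpoint`` is a hypothetical helper:

.. code-block:: python

   import re


   def latest_checkpoint(filenames):
       """Return the ep_NNNN name with the highest episode number, or None."""
       best_name, best_episode = None, -1
       for filename in filenames:
           match = re.fullmatch(r"(ep_(\d+))\.pt", filename)
           if match and int(match.group(2)) > best_episode:
               best_name, best_episode = match.group(1), int(match.group(2))
       return best_name


   name = latest_checkpoint(["ep_0100.pt", "ep_1000.pt", "ep_0200.pt", "notes.txt"])

Comparing the parsed integers rather than the raw strings keeps the
ordering correct even if the number of digits varies.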

``retro-gamer clean``
~~~~~~~~~~~~~~~~~~~~~

Remove all checkpoints and the training log from a run directory.

.. code-block:: console

   % retro-gamer clean RUN_DIR

Prompts for confirmation before deleting. Use ``--yes`` / ``-y`` to skip
the prompt. The ``config.toml`` is preserved so you can run
``retro-gamer train`` immediately to start fresh with the same settings.

Use this after making an incompatible change (see ``retro-gamer train``
above) or any time you want to restart training from scratch.

``retro-gamer info``
~~~~~~~~~~~~~~~~~~~~~

Print a summary of a training run: metadata, hyperparameters, recent
checkpoint log, and available checkpoints.

.. code-block:: console

   % retro-gamer info RUN_DIR

Training run directory structure
---------------------------------

A training run is a self-contained directory with the following
contents:

.. code-block:: text

   runs/snake/
   ├── config.toml       # game description + hyperparameters
   ├── training.log      # architecture rationale + per-episode log
   └── checkpoints/
       ├── ep_0100.pt    # model weights at episode 100
       ├── ep_0200.pt
       └── ...           # one file saved every 100 episodes

``config.toml`` is written by ``retro-gamer create`` and updated (with
the discovered character set, ``observe_state_sizes``, and resolved
hyperparameters) when ``retro-gamer train`` begins. It has five
sections: ``[game]``,
``[metadata]``, ``[preprocessing]``, ``[model]``, and ``[training]``.
Editing ``config.toml`` between ``create`` and ``train`` is the
recommended way to adjust hyperparameters.

``training.log`` begins with the full network architecture description,
then one line per checkpoint (every 100 episodes) in the format::

   [ep_NNNN]  ep=SSSS-NNNN  avg_reward=F  avg_steps=N  epsilon=F  avg_loss=F  time=Xm Xs  total=Xm Xs

The ``avg_*`` fields average over the episodes since the previous
checkpoint; the remaining fields are point-in-time or cumulative values:

- ``ep=SSSS-NNNN`` — episode range covered by this entry
- ``avg_reward`` — mean total reward per episode (positive = good)
- ``avg_steps`` — mean episode length in game turns
- ``epsilon`` — current exploration rate (approaches ``epsilon_min`` over time)
- ``avg_loss`` — mean Huber loss across training steps (should decrease as learning
  stabilises). Huber loss equals ½·(q−t)² for small errors and |q−t|−½ for large
  ones, so it grows only linearly (and its gradient stays bounded) when errors are
  large. Values in the range 0–10 are typical; a slow downward trend over thousands
  of episodes is the healthy pattern. A loss that keeps growing usually indicates a
  learning rate that is too high.
- ``time`` — wall-clock time for this checkpoint interval
- ``total`` — cumulative training time across all sessions
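
The Huber loss formula quoted for ``avg_loss`` corresponds to a
threshold of 1. A direct transcription for a single prediction ``q``
and target ``t`` (a sketch of the formula, not the trainer's code):

.. code-block:: python

   def huber_loss(q, t, delta=1.0):
       """Huber loss for prediction q and target t (delta=1 matches the text)."""
       error = abs(q - t)
       if error <= delta:
           return 0.5 * error ** 2           # quadratic near zero
       return delta * (error - 0.5 * delta)  # linear for large errors


   small = huber_loss(1.5, 1.0)   # error 0.5, quadratic branch
   large = huber_loss(10.0, 0.0)  # error 10, linear branch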

When training is resumed, a ``=== Resumed from ... ===`` line is appended
so the log records the full history of a run across multiple sessions.
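
For plotting or post-processing, checkpoint lines can be parsed with a
regular expression. The sketch below assumes durations are written
exactly as ``Xm Xs`` with integer values, and the sample line is made
up for illustration; ``parse_log_line`` is a hypothetical helper.

.. code-block:: python

   import re

   LOG_LINE = re.compile(
       r"\[(?P<checkpoint>ep_\d+)\]\s+ep=(?P<first>\d+)-(?P<last>\d+)\s+"
       r"avg_reward=(?P<avg_reward>\S+)\s+avg_steps=(?P<avg_steps>\S+)\s+"
       r"epsilon=(?P<epsilon>\S+)\s+avg_loss=(?P<avg_loss>\S+)\s+"
       r"time=(?P<time>\d+m \d+s)\s+total=(?P<total>\d+m \d+s)"
   )


   def parse_log_line(line):
       """Parse one checkpoint line into a dict, or return None if it does not match."""
       match = LOG_LINE.match(line)
       if match is None:
           return None
       fields = match.groupdict()
       for name in ("avg_reward", "avg_steps", "epsilon", "avg_loss"):
           fields[name] = float(fields[name])
       return fields


   # Made-up line for illustration only; it is not real training output.
   sample = (
       "[ep_0200]  ep=0101-0200  avg_reward=3.25  avg_steps=148  "
       "epsilon=0.942  avg_loss=0.81  time=1m 12s  total=2m 30s"
   )
   entry = parse_log_line(sample)

Lines that do not follow the checkpoint format, such as the
architecture description or a resume marker, return ``None`` and can
be skipped.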

Python API
----------

For advanced use, ``retro-gamer``'s components are importable as a
library. See the :doc:`api` reference for full details.

.. code-block:: python

   from retro_gamer import GameMetadata, DQNTrainer
   from retro.examples.snake import create_game

   metadata = GameMetadata.from_pyproject("retro.examples.snake")
   trainer = DQNTrainer(create_game, metadata, "runs/snake/")
   trainer.train()