Initial commit

2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions
--- a/docs/reference.rst
+++ b/docs/reference.rst
@@ -0,0 +1,344 @@
+Reference
+=========
+
+Game description fields
+-----------------------
+
+Game descriptions are written in the ``[tool.retro-gamer]`` section of
+your game project's ``pyproject.toml``. ``retro-gamer create`` reads
+this section and copies the metadata into the training run's
+``config.toml``, where it can also be inspected or hand-edited.
+
+A complete example for the Snake game:
+
+.. code-block:: toml
+
+   [tool.retro-gamer]
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+   reward = "score"
+   character_set = ["@", "*", ">", "<", "^", "v"]
+   spatial = true
+   observe_state = []
+
+You do not need to specify the board size: ``retro-gamer`` reads it
+directly from your game's ``board_size`` attribute.
+
+The fields are described below.
+
+``actions``
+~~~~~~~~~~~
+
+**Required.** A list of keystroke names the agent may send to the game
+each turn. Use arrow key names for directional games, or single
+characters for character-key games.
+
+.. code-block:: toml
+
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+
+The agent can also take a no-op action (doing nothing), so the
+Q-network has ``len(actions) + 1`` outputs.
+
+``reward``
+~~~~~~~~~~
+
+**Required.** The key in the game's state dictionary to use as the
+reward signal. The reward computed for each turn is the *change* in
+this value from the previous turn.
+
+.. code-block:: toml
+
+   reward = "score"
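+
+As a rough illustration of the delta rule above, the following sketch
+derives per-turn rewards from successive values of the reward key. The
+helper ``rewards_from_states`` and the sample state dictionaries are
+hypothetical and not part of ``retro-gamer``.
+
+.. code-block:: python
+
+   def rewards_from_states(states, reward_key="score"):
+       """Return the per-turn change in ``reward_key`` across states."""
+       values = [state[reward_key] for state in states]
+       # Each reward is one turn's value minus the previous turn's value.
+       return [cur - prev for prev, cur in zip(values, values[1:])]
+
+
+   # Hypothetical states: the initial state followed by three turns.
+   states = [{"score": 0}, {"score": 0}, {"score": 10}, {"score": 10}]
+   print(rewards_from_states(states))  # [0, 10, 0]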
+
+``character_set``
+~~~~~~~~~~~~~~~~~
+
+**Optional.** A list of single characters that may appear on the board.
+Each character occupies one "slot" in the one-hot encoding. Characters
+not in this list are treated as empty space.
+
+.. code-block:: toml
+
+   character_set = ["@", "*", ">", "<", "^", "v"]
+
+If omitted, ``retro-gamer`` runs an exploration phase to discover the
+characters that appear in practice. The length of this phase is
+controlled by the ``exploration_turns`` hyperparameter.
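+
+The slot-per-character encoding can be pictured with the minimal
+sketch below. It assumes that an empty or unlisted character maps to
+an all-zero vector; whether ``retro-gamer`` instead reserves a
+separate slot for empty space is not stated here. ``one_hot_cell`` is
+a hypothetical helper, not part of the library.
+
+.. code-block:: python
+
+   CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]
+
+
+   def one_hot_cell(char, character_set=CHARACTER_SET):
+       """Encode one board cell with one slot per known character."""
+       # Assumption: characters outside the set leave every slot at 0.
+       return [1.0 if char == known else 0.0 for known in character_set]
+
+
+   print(one_hot_cell("*"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
+   print(one_hot_cell("#"))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]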
+
+``spatial``
+~~~~~~~~~~~
+
+**Optional**; default ``true``. Whether to treat the board as a 2D
+spatial scene. When ``true``, the trainer uses a convolutional neural
+network (CNN) that can detect patterns in the relative positions of
+characters. When ``false``, the trainer uses a multilayer perceptron
+(MLP) that sees the board as a flat list of numbers without positional
+structure.
+
+.. code-block:: toml
+
+   spatial = true
+
+``observe_state``
+~~~~~~~~~~~~~~~~~
+
+**Optional**; default ``[]``. A list of keys from the game's state
+dictionary to append to the observation vector. The values must be
+numbers (integers, floats, or booleans). The reward key must not
+appear in this list.
+
+.. code-block:: toml
+
+   observe_state = ["lives", "level"]
+
+.. _hyperparameters:
+
+Hyperparameters
+---------------
+
+Hyperparameters are stored in the ``[hyperparameters]`` section of
+``config.toml``. They can be set via ``retro-gamer create`` options or
+edited directly.
+
+Learning and optimization
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``learning_rate`` (default: ``0.001``)
+    The step size used by the Adam optimizer when updating network
+    weights. Larger values converge faster but may be unstable; smaller
+    values are more stable but slower.
+
+``lr_decay`` (default: ``0.995``)
+    Multiplicative decay applied to the learning rate after each
+    episode. The learning rate therefore shrinks geometrically over
+    training, so later updates fine-tune the network without undoing
+    earlier progress.
+
+``gamma`` (default: ``0.99``)
+    The discount factor for future rewards. A value of 1.0 makes the
+    agent value all future rewards equally; smaller values make the
+    agent increasingly myopic.
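+
+To make the arithmetic behind ``lr_decay`` and ``gamma`` concrete,
+here is a small sketch using the defaults listed above. Both function
+names are hypothetical, and the sketch assumes the decay is applied
+exactly once per episode, as described.
+
+.. code-block:: python
+
+   def learning_rate_after(episodes, learning_rate=0.001, lr_decay=0.995):
+       """Learning rate after ``episodes`` multiplicative decay steps."""
+       return learning_rate * lr_decay ** episodes
+
+
+   def discounted_return(rewards, gamma=0.99):
+       """Sum of rewards, each weighted by gamma to the power of its delay."""
+       total = 0.0
+       # Work backwards so each step folds in the discounted future value.
+       for reward in reversed(rewards):
+           total = reward + gamma * total
+       return total
+
+
+   print(learning_rate_after(100))       # roughly 0.0006 with the defaults
+   print(discounted_return([0, 0, 10]))  # 10 * 0.99**2, about 9.8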
+
+Exploration
+~~~~~~~~~~~
+
+``epsilon`` (default: ``1.0``)
+    The initial exploration rate. At each turn, the agent takes a
+    random action with probability ``epsilon`` and exploits its current
+    Q-function with probability ``1 - epsilon``.
+
+``epsilon_decay`` (default: ``0.995``)
+    Multiplicative decay applied to ``epsilon`` after each episode.
+
+``epsilon_min`` (default: ``0.05``)
+    The floor below which ``epsilon`` will not fall. A small amount of
+    continued exploration prevents the agent from becoming permanently
+    committed to a suboptimal policy.
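+
+The three exploration settings combine into a simple schedule. The
+sketch below is an illustration under the assumption that the floor is
+applied as a plain ``max``; ``epsilon_after`` and ``choose_action``
+are hypothetical names rather than ``retro-gamer`` functions.
+
+.. code-block:: python
+
+   import random
+
+
+   def epsilon_after(episodes, epsilon=1.0, epsilon_decay=0.995,
+                     epsilon_min=0.05):
+       """Exploration rate after ``episodes`` decays, clamped at the floor."""
+       return max(epsilon_min, epsilon * epsilon_decay ** episodes)
+
+
+   def choose_action(q_values, epsilon, rng=random):
+       """Random action index with probability epsilon, else the greedy one."""
+       if rng.random() < epsilon:
+           return rng.randrange(len(q_values))
+       return max(range(len(q_values)), key=q_values.__getitem__)
+
+
+   print(epsilon_after(100))                   # about 0.61
+   print(epsilon_after(1000))                  # 0.05 (the floor)
+   print(choose_action([0.1, 0.7, 0.2], 0.0))  # 1 (greedy when epsilon is 0)
+
+With these defaults, the schedule reaches the floor after roughly 600
+episodes.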
+
+Memory and sampling
+~~~~~~~~~~~~~~~~~~~
+
+``batch_size`` (default: ``64``)
+    The number of experiences sampled from the replay buffer per
+    training step.
+
+``memory_capacity`` (default: ``10000``)
+    The maximum number of experiences the replay buffer can hold. When
+    full, the oldest experiences are discarded.
+
+``prioritize_experiences`` (default: ``false``)
+    Whether to use prioritized experience replay. When ``true``,
+    experiences with larger TD errors are sampled more frequently.
+    This often improves sample efficiency at a modest computational
+    cost.
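+
+The interaction between ``memory_capacity`` and ``batch_size`` can be
+sketched with a bounded queue. This is a minimal illustration of
+uniform replay only, not the library's implementation; the
+``ReplayBuffer`` class is hypothetical, and prioritized replay would
+replace the uniform draw with one weighted by TD error.
+
+.. code-block:: python
+
+   import random
+   from collections import deque
+
+
+   class ReplayBuffer:
+       """Fixed-capacity buffer that drops the oldest experience when full."""
+
+       def __init__(self, memory_capacity=10000):
+           # A deque with maxlen discards items from the left when full.
+           self.experiences = deque(maxlen=memory_capacity)
+
+       def add(self, experience):
+           self.experiences.append(experience)
+
+       def sample(self, batch_size=64):
+           # Uniform sampling without replacement, capped at buffer size.
+           count = min(batch_size, len(self.experiences))
+           return random.sample(list(self.experiences), count)
+
+
+   buffer = ReplayBuffer(memory_capacity=3)
+   for turn in range(5):
+       buffer.add(turn)
+   print(list(buffer.experiences))  # [2, 3, 4]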
+
+Network architecture
+~~~~~~~~~~~~~~~~~~~~
+
+``n_layers`` (default: ``2``)
+    The number of hidden layers in the MLP head (for spatial games,
+    this follows the CNN; for non-spatial games, it is the full
+    network).
+
+``layer_size`` (default: ``128``)
+    The width (number of units) in each hidden layer.
+
+Training duration
+~~~~~~~~~~~~~~~~~
+
+``training_episodes`` (default: ``1000``)
+    The total number of game episodes to run. Each episode runs until
+    the game ends or ``max_turns_per_episode`` turns have elapsed.
+
+``max_turns_per_episode`` (default: ``2000``)
+    The maximum number of turns in a single episode. This safety
+    cutoff prevents an episode from running indefinitely (for example,
+    if the agent finds a way to avoid dying).
+
+``target_update_freq`` (default: ``100``)
+    How many training steps between updates of the target network.
+    More frequent updates make training targets move faster (less
+    stable); less frequent updates make them more stable but slower
+    to reflect new learning.
+
+Character discovery
+~~~~~~~~~~~~~~~~~~~
+
+``exploration_turns`` (default: ``200``)
+    When ``character_set`` is not specified, the number of random
+    turns to run at the start of training to discover which
+    characters appear on the board.
+
+``unknown_character_strategy`` (default: ``"ignore"``)
+    What to do when a character appears during training that is not
+    in the established ``character_set``. ``"ignore"`` treats it as
+    an empty cell; ``"extend"`` rebuilds the model with an extended
+    character set.
+
+CLI reference
+-------------
+
+``retro-gamer create``
+~~~~~~~~~~~~~~~~~~~~~~
+
+Create a new training run directory with ``config.toml``. Game metadata
+is read automatically from the ``[tool.retro-gamer]`` section of your
+game's ``pyproject.toml``; you do not pass it on the command line.
+
+.. code-block:: console
+
+   % retro-gamer create --game MODULE --output DIR [OPTIONS]
+
+**Required options:**
+
+- ``--game MODULE`` — Python module containing ``create_game()``
+  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
+  is read from the ``pyproject.toml`` found in or above the module's
+  source directory.
+- ``--output DIR`` — Directory to create for this training run.
+
+**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
+
+- ``--training-episodes N``
+- ``--n-layers N``
+- ``--layer-size N``
+- ``--learning-rate F``
+- ``--lr-decay F``
+- ``--gamma F``
+- ``--epsilon-decay F``
+- ``--epsilon-min F``
+- ``--batch-size N``
+- ``--memory-capacity N``
+- ``--target-update-freq N``
+- ``--max-turns-per-episode N``
+- ``--exploration-turns N``
+- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
+
+``retro-gamer train``
+~~~~~~~~~~~~~~~~~~~~~
+
+Train (or resume training) a DQN agent.
+
+.. code-block:: console
+
+   % retro-gamer train RUN_DIR [--resume CHECKPOINT]
+
+``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
+create``. If ``--resume`` is given, training resumes from the specified
+checkpoint file (relative or absolute path).
+
+``retro-gamer play``
+~~~~~~~~~~~~~~~~~~~~
+
+Watch a trained agent play the game in the terminal.
+
+.. code-block:: console
+
+   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
+
+``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
+name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
+``--framerate`` sets the target frames per second (default: 12). Press
+Enter or Escape to quit.
+
+``retro-gamer info``
+~~~~~~~~~~~~~~~~~~~~
+
+Print a summary of a training run: metadata, hyperparameters, recent
+episode log, and available checkpoints.
+
+.. code-block:: console
+
+   % retro-gamer info RUN_DIR
+
+Training run directory structure
+--------------------------------
+
+A training run is a self-contained directory with the following
+contents:
+
+.. code-block:: text
+
+   runs/snake/
+   ├── config.toml       # game description + hyperparameters
+   ├── training.log      # architecture rationale + per-episode log
+   └── checkpoints/
+       ├── ep_0100.pt    # model weights at episode 100
+       ├── ep_0200.pt
+       ├── ...
+       └── final.pt      # model weights at training completion
+
+``config.toml`` is written by ``retro-gamer create`` and updated (with
+the discovered character set and resolved hyperparameters) when
+``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
+and ``train`` is the recommended way to adjust hyperparameters.
+
+``training.log`` begins with the full architecture description
+generated at training startup, followed by one line per episode in the
+format::
+
+   [EP NNNN] total_reward=F  steps=N  epsilon=F  avg_loss=F
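+
+For post-processing, a per-episode line in this format could be parsed
+along the following lines. The sketch assumes each ``F`` is a decimal
+or scientific-notation float and each ``N`` an integer;
+``parse_episode_line`` and the sample line are hypothetical.
+
+.. code-block:: python
+
+   import re
+
+   # Assumption: fields are separated by whitespace and appear in this order.
+   EPISODE_LINE = re.compile(
+       r"\[EP (?P<episode>\d+)\]\s+total_reward=(?P<total_reward>\S+)"
+       r"\s+steps=(?P<steps>\d+)\s+epsilon=(?P<epsilon>\S+)"
+       r"\s+avg_loss=(?P<avg_loss>\S+)"
+   )
+
+
+   def parse_episode_line(line):
+       """Return the episode fields as a dict, or None for other lines."""
+       match = EPISODE_LINE.search(line)
+       if match is None:
+           return None
+       fields = match.groupdict()
+       return {
+           "episode": int(fields["episode"]),
+           "total_reward": float(fields["total_reward"]),
+           "steps": int(fields["steps"]),
+           "epsilon": float(fields["epsilon"]),
+           "avg_loss": float(fields["avg_loss"]),
+       }
+
+
+   # Made-up line for illustration; not taken from a real run.
+   sample = "[EP 0042] total_reward=30.0  steps=187  epsilon=0.81  avg_loss=0.0123"
+   print(parse_episode_line(sample))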
+
+Checkpoint files are dictionaries saved in PyTorch format. Each one
+contains the model weights, the optimizer state, the current epsilon,
+and the total number of training steps completed. They can be loaded
+with ``retro-gamer play``, passed to ``retro-gamer train --resume``,
+or read directly with the Python API.
+
+Python API
+----------
+
+For advanced use, ``retro-gamer``'s components are importable as a
+library.
+
+.. code-block:: python
+
+   from retro_gamer import GameMetadata, DQNTrainer
+   from retro.examples.snake import create_game
+
+   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
+   metadata = GameMetadata.from_pyproject("retro.examples.snake")
+
+   trainer = DQNTrainer(
+       create_game, metadata, "runs/snake/",
+       training_episodes=500,
+       n_layers=2,
+       layer_size=128,
+   )
+   trainer.train()
+
+``GameEnvironment`` provides a gym-style interface for stepping through
+a game programmatically:
+
+.. code-block:: python
+
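+   from retro_gamer import GameEnvironment, GameMetadata
+   from retro.examples.snake import create_game
+
+   metadata = GameMetadata.from_pyproject("retro.examples.snake")
+
+   env = GameEnvironment(create_game, metadata)
+   obs = env.reset()             # returns initial observation vector
+   obs, reward, done = env.step("KEY_RIGHT")  # advance the game one turn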
+
+The observation is a flat NumPy array of dtype ``float32``. Here ``C``
+is the number of character slots and ``H`` and ``W`` are the board
+height and width. For spatial games, the first ``C × H × W`` elements
+are the board in channel-first one-hot encoding; for non-spatial games,
+the board is encoded as ``H × W × C`` and then flattened. Any
+``observe_state`` values are appended at the end.
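+
+The difference between the two layouts comes down to index arithmetic.
+The sketch below assumes row-major flattening and that the channel
+count equals the number of entries in ``character_set``; both helper
+functions and the board dimensions are hypothetical.
+
+.. code-block:: python
+
+   def spatial_index(channel, row, col, height, width):
+       """Offset of a slot in a flattened C x H x W (channel-first) board."""
+       return channel * height * width + row * width + col
+
+
+   def non_spatial_index(channel, row, col, width, n_channels):
+       """Offset of a slot in a flattened H x W x C (channel-last) board."""
+       return (row * width + col) * n_channels + channel
+
+
+   # Hypothetical board: 6 characters, 4 rows, 5 columns.
+   C, H, W = 6, 4, 5
+   print(spatial_index(1, 2, 3, H, W))      # 1*20 + 2*5 + 3 = 33
+   print(non_spatial_index(1, 2, 3, W, C))  # (2*5 + 3)*6 + 1 = 79
+   print(C * H * W)                         # 120 board elements in total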