Initial commit

2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions
--- a/docs/walkthrough.rst
+++ b/docs/walkthrough.rst
@@ -0,0 +1,299 @@
+Walkthrough
+===========
+
+This section walks through a complete ``retro-gamer`` workflow, from
+preparing a game to watching a trained agent play. The game used here
+is the Snake example included with the ``retro-games`` framework, but
+the same steps apply to any game you build.
+
+Prerequisites
+-------------
+
+You will need:
+
+- Python 3.11 or higher.
+- The ``retro-games`` framework installed and a game you have written
+  (or the built-in Snake example). See the
+  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
+  for help writing games.
+- ``retro-gamer`` installed (see :ref:`installation`).
+
+Preparing your game
+-------------------
+
+``retro-gamer`` loads your game by importing a Python module and
+calling a function named ``create_game``. The ``create_game`` function
+must take no arguments and return a new ``Game`` instance.
+
+Here is the ``create_game`` function for Snake:
+
+.. code-block:: python
+
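+   # SnakeHead and Apple are classes defined elsewhere in the Snake
+   # module; Game comes from the retro-games framework.
+   def create_game():
+       head = SnakeHead()
+       apple = Apple()
+       # Agents, initial state, board size, and framerate.
+       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
+       # Relocate the apple now that the game exists.
+       apple.relocate(game)
+       return game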
+
+If your game module does not already have a ``create_game`` function,
+add one following this pattern.
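+
+To illustrate what this contract implies, the sketch below imports a
+module by its dotted name and calls its ``create_game`` function. The
+helper name ``load_game`` is hypothetical and is not part of
+``retro-gamer``; the actual loader may work differently.
+

```python
import importlib


def load_game(module_name):
    """Import `module_name` and return a fresh game from its create_game().

    Hypothetical helper for illustration; not retro-gamer's actual loader.
    """
    module = importlib.import_module(module_name)
    factory = getattr(module, "create_game", None)
    if not callable(factory):
        raise AttributeError(
            f"{module_name} must define a create_game() function"
        )
    # create_game takes no arguments and returns a new game on each call.
    return factory()
```
+
+Under these assumptions, ``load_game("retro.examples.snake")`` would
+return a new Snake game each time it is called.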
+
+
+Describing your game
+--------------------
+
+Every training run begins with a description of your game. This
+description goes in the ``[tool.retro-gamer]`` section of your game
+project's ``pyproject.toml``, the same file that defines the project's
+name, version, and dependencies, so that it stays with the game
+itself.
+
+Here is the ``[tool.retro-gamer]`` section for the Snake example:
+
+.. code-block:: toml
+
+   [tool.retro-gamer]
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+   reward = "score"
+   character_set = ["@", "*", ">", "<", "^", "v"]
+   spatial = true
+   observe_state = []
+
+Let's go through each field.
+
+``actions``
+~~~~~~~~~~~
+
+A list of the keystrokes the agent may send to the game. For Snake,
+the four arrow keys control the direction of travel. The agent also
+implicitly has access to a no-op (doing nothing).
+
+.. note::
+
+   Only include actions that the game actually responds to. Listing
+   unreachable keys wastes part of the agent's action space and may slow
+   training.
+
+``reward``
+~~~~~~~~~~
+
+The key in the game's state dictionary to use as the reward signal.
+``retro-gamer`` computes the reward for each turn as the *change* in
+this value from one turn to the next. For Snake, the score increases
+by 1 (or more) each time the apple is eaten, so the agent receives a
+reward of at least 1 on the turn it eats an apple and 0 otherwise.
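+
+The change-in-value rule can be sketched in a few lines. The function
+name ``turn_reward`` and the sample scores are made up for
+illustration; this is not ``retro-gamer``'s internal code.
+

```python
def turn_reward(previous_state, current_state, reward_key="score"):
    """Reward for one turn: the change in state[reward_key]."""
    return current_state[reward_key] - previous_state[reward_key]


# Scores observed over five consecutive turns of a hypothetical episode.
scores = [0, 0, 1, 1, 2]
states = [{"score": s} for s in scores]
rewards = [turn_reward(a, b) for a, b in zip(states, states[1:])]
print(rewards)  # [0, 1, 0, 1]
```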
+
+Choosing an appropriate reward is one of the most consequential
+decisions in RL. Some considerations:
+
+- A reward that is too sparse, where the agent goes many turns without
+  receiving any signal, makes learning slow. In early episodes a snake
+  often dies without ever eating an apple, so it receives no positive
+  reward at all and the learning algorithm has almost nothing to work
+  with.
+- A reward that is too dense, assigned on every turn, may not reflect
+  the true goal of the game.
+- An artificial reward, such as giving a point for moving toward the
+  apple, can accelerate early training but may cause the agent to
+  optimize the proxy rather than the real objective.
+
+``character_set``
+~~~~~~~~~~~~~~~~~
+
+The characters that can appear on the board, as a list of
+single-character strings. Each cell of the board will be *one-hot
+encoded* using this list: the agent represents the content of each cell
+as a vector of zeros with a single 1 at the position corresponding to
+the character. A cell containing a character not in this list is treated
+as empty.
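+
+A minimal sketch of one-hot encoding with the Snake character set
+follows. It assumes that an empty cell, and therefore any character
+outside the list, is encoded as an all-zero vector. The text above
+does not say how ``retro-gamer`` encodes empty cells, so treat that
+detail as an assumption.
+

```python
CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


def one_hot(char, character_set=CHARACTER_SET):
    """Return a one-hot vector for `char`.

    Assumption: empty cells, and characters outside the set, become an
    all-zero vector. The library may encode them differently.
    """
    vector = [0] * len(character_set)
    if char in character_set:
        vector[character_set.index(char)] = 1
    return vector


print(one_hot("@"))  # [1, 0, 0, 0, 0, 0]
print(one_hot(">"))  # [0, 0, 1, 0, 0, 0]
print(one_hot("#"))  # [0, 0, 0, 0, 0, 0]
```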
+
+For Snake, the characters are: ``@`` (the apple), ``*`` (body
+segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
+
+If you omit this field, ``retro-gamer`` will run a brief exploration
+phase before training to discover which characters actually appear.
+The number of exploration turns is controlled by the
+``exploration_turns`` hyperparameter.
+
+``spatial``
+~~~~~~~~~~~
+
+Whether to treat the board as a spatial scene (default: ``true``). For
+a spatial game the agent uses a *convolutional neural network* (CNN),
+which can detect patterns in the relative arrangement of characters.
+For a non-spatial game it uses a simpler *multilayer perceptron* (MLP),
+which ignores positional relationships. Set this to ``false`` for games
+where position is irrelevant.
+
+Once you have written this section, create the training run directory:
+
+.. code-block:: console
+
+   % retro-gamer create                    \
+       --game retro.examples.snake         \
+       --output runs/snake/
+
+   Created training run at runs/snake/config.toml
+     game        : retro.examples.snake
+     board_size  : 32×16
+     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
+     reward      : score
+     characters  : ['@', '*', '>', '<', '^', 'v']
+     architecture: CNN (spatial)
+
+``retro-gamer create`` reads your game metadata directly from
+``pyproject.toml`` and writes it—along with all hyperparameters—to
+``runs/snake/config.toml``.
+
+Training the agent
+------------------
+
+With the ``config.toml`` in place, start training:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/
+   Training for 1000 episodes…
+   Done. Checkpoints in runs/snake/checkpoints/
+
+Training saves a checkpoint every 100 episodes, plus a ``final.pt``
+checkpoint when training completes. You can follow progress in the
+training log:
+
+.. code-block:: console
+
+   % tail -f runs/snake/training.log
+
+The log shows one line per episode:
+
+.. code-block:: text
+
+   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
+   [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
+   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204
+
+- **total_reward**: the total score earned during the episode (how many
+  apples the snake ate, for Snake).
+- **steps**: how many turns the episode lasted.
+- **epsilon**: the current exploration rate. Early in training this is
+  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
+- **avg_loss**: the average temporal-difference error across training
+  steps in this episode. A decreasing loss generally indicates that the
+  Q-value estimates are converging.
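+
+Because each log line has a regular shape, it is easy to parse for
+plotting or analysis. The sketch below is based only on the sample
+lines shown above; the helper name ``parse_log_line`` is hypothetical,
+and the pattern assumes plain decimal numbers (no scientific
+notation).
+

```python
import re

# Pattern derived from the sample log lines; an assumption, not a spec.
LOG_LINE = re.compile(
    r"\[EP (?P<episode>\d+)\]\s+"
    r"total_reward=(?P<total_reward>[-\d.]+)\s+"
    r"steps=(?P<steps>\d+)\s+"
    r"epsilon=(?P<epsilon>[\d.]+)\s+"
    r"avg_loss=(?P<avg_loss>[\d.]+)"
)


def parse_log_line(line):
    """Parse one training-log line into a dict, or return None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "episode": int(fields["episode"]),
        "total_reward": float(fields["total_reward"]),
        "steps": int(fields["steps"]),
        "epsilon": float(fields["epsilon"]),
        "avg_loss": float(fields["avg_loss"]),
    }


sample = "[EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217"
record = parse_log_line(sample)
print(record["episode"], record["total_reward"], record["steps"])  # 50 1.0 1921
```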
+
+Resuming training
+~~~~~~~~~~~~~~~~~
+
+Training can be resumed from a checkpoint:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
+
+Watching a trained agent play
+------------------------------
+
+To watch a trained agent play the game in your terminal:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint final
+
+You can substitute any checkpoint name:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint ep_0100
+
+Press Enter or Escape to quit.
+
+Comparing checkpoints saved at different points in training is a useful
+exercise: the agent at episode 100 has learned *something*, but
+typically much less than the agent at episode 500. Articulating *what*
+the earlier agent has and has not learned, and *why*, is a productive
+way to reason about the training process.
+
+Inspecting a run
+----------------
+
+To review the configuration and recent training progress for a run:
+
+.. code-block:: console
+
+   % retro-gamer info runs/snake/
+   Game module : retro.examples.snake
+   Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
+   Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
+
+   Last 5 episodes:
+     [EP 0996] total_reward=9.0   steps=1203  epsilon=0.0074  avg_loss=0.000312
+     [EP 0997] total_reward=11.0  steps=1051  epsilon=0.0074  avg_loss=0.000289
+     [EP 0998] total_reward=14.0  steps=987   epsilon=0.0074  avg_loss=0.000274
+     [EP 0999] total_reward=8.0   steps=1142  epsilon=0.0074  avg_loss=0.000261
+     [EP 1000] total_reward=12.0  steps=1089  epsilon=0.0074  avg_loss=0.000248
+
+   Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
+
+Adjusting hyperparameters
+--------------------------
+
+The training hyperparameters can be changed by editing ``config.toml``
+before training, or by passing them as options to ``retro-gamer
+create``. Common adjustments and their effects:
+
+``training_episodes``: How long to train. More episodes give the
+agent more time to learn, but also take longer to run.
+
+``epsilon_decay``: How quickly exploration decreases. A faster
+decay (smaller ``epsilon_decay``) means the agent commits to its early
+Q-estimates before they are fully reliable. A slower decay (larger
+``epsilon_decay``, closer to 1) gives the agent more time to explore
+but may waste training time on random actions.
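+
+To get a feel for the trade-off, the sketch below compares two decay
+rates. It assumes that epsilon is multiplied by ``epsilon_decay`` once
+per episode and never falls below ``epsilon_min``. That is a common
+schedule, but it is an assumption about ``retro-gamer`` here, and the
+numeric values are illustrative rather than the tool's defaults.
+

```python
def epsilon_schedule(episodes, epsilon_start=1.0, epsilon_decay=0.995,
                     epsilon_min=0.01):
    """Yield epsilon after each episode under multiplicative decay.

    Assumption: epsilon is multiplied by epsilon_decay once per episode
    and never drops below epsilon_min. The default values here are
    illustrative, not retro-gamer's defaults.
    """
    epsilon = epsilon_start
    for _ in range(episodes):
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        yield epsilon


fast = list(epsilon_schedule(200, epsilon_decay=0.95))
slow = list(epsilon_schedule(200, epsilon_decay=0.995))
print(f"after 100 episodes: fast={fast[99]:.4f} slow={slow[99]:.4f}")
```
+
+With these illustrative values, the fast schedule has already reached
+its floor after 100 episodes, while the slow one still takes a random
+action more than half the time.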
+
+``learning_rate``: How large the weight updates are at each
+training step. A large learning rate learns fast but may overshoot;
+a small learning rate is stable but slow.
+
+``gamma``: The discount factor for future rewards. Closer to 1
+means the agent values long-term consequences; closer to 0 makes the
+agent focus on immediate reward.
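+
+The effect of the discount factor is easiest to see on a concrete
+reward sequence. The sketch below computes the standard discounted
+return, in which a reward received ``t`` turns in the future is
+weighted by ``gamma ** t``. The function name and the reward sequence
+are made up for illustration.
+

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per turn of delay."""
    total = 0.0
    # Work backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# An apple eaten three turns from now: reward 1 after three turns of 0.
rewards = [0, 0, 0, 1]
print(discounted_return(rewards, gamma=0.5))  # 0.125
print(discounted_return(rewards, gamma=1.0))  # 1.0
```
+
+With ``gamma = 0.5`` the distant apple is worth only an eighth of an
+immediate one, so the agent has little incentive to plan for it.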
+
+``n_layers`` and ``layer_size``: The depth and width of the MLP
+head. Larger networks can represent more complex Q-functions but are
+slower to train and may overfit.
+
+``prioritize_experiences``: Whether to use prioritized experience
+replay. This often improves sample efficiency but is slightly slower
+per step.
+
+Questions for investigation
+----------------------------
+
+The following questions are intended to guide productive investigation
+using ``retro-gamer``. They are chosen because they have specific,
+reasoned answers that connect what you know about the game to the
+concepts underlying the training algorithm.
+
+1. **Character set completeness.** Train two agents: one with the full
+   character set, one missing a character that frequently appears on the
+   board. Compare their performance. What did the second agent lose the
+   ability to perceive, and how did that affect its behavior?
+
+2. **Spatial vs. non-spatial.** Train agents on the same game with
+   ``spatial = true`` and with ``spatial = false``. How does training
+   efficiency differ? Can you explain the difference in terms of what
+   each architecture can and cannot learn?
+
+3. **Reward shaping.** If the game currently rewards only the final
+   objective (e.g., reaching a goal), add intermediate rewards for
+   sub-goals. How does this change the early training curve? Does it
+   change the agent's final strategy?
+
+4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
+   (so the agent commits to exploiting early) and a very slow one (so
+   exploration continues for a long time). How do the training curves
+   differ? What is the agent doing in each case when ``epsilon`` is low?
+
+5. **Checkpoint comparison.** Load the agent at episode 100 and at
+   episode 1000 and watch each play the same game. What has the later
+   agent learned that the earlier one has not? How would you describe
+   this difference to someone who does not know about neural networks?