Initial commit

Chris Proctor
2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions

docs/walkthrough.rst
@@ -0,0 +1,299 @@
Walkthrough
===========
This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.
Prerequisites
-------------
You will need:
- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
  (or the built-in Snake example). See the
  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
  for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).
Preparing your game
-------------------
``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.
Here is the ``create_game`` function for Snake:
.. code-block:: python

    def create_game():
        head = SnakeHead()
        apple = Apple()
        game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
        apple.relocate(game)
        return game
If your game module does not already have a ``create_game`` function,
add one following this pattern.
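The loading step described above can be pictured with a minimal sketch. This is not ``retro-gamer``'s actual loader: the helper ``load_game`` and the stand-in module ``demo_game`` are hypothetical names used only for illustration, and the stand-in returns a plain dictionary instead of a real ``Game``.

```python
import importlib
import sys
import types


def load_game(module_name):
    """Import ``module_name`` and return whatever its ``create_game()`` returns."""
    module = importlib.import_module(module_name)
    factory = getattr(module, "create_game", None)
    if not callable(factory):
        raise AttributeError(f"{module_name} has no callable create_game()")
    # create_game takes no arguments and must build a fresh game each call.
    return factory()


# Hypothetical stand-in module so the sketch runs without retro-games installed.
demo = types.ModuleType("demo_game")
demo.create_game = lambda: {"score": 0}
sys.modules["demo_game"] = demo

game = load_game("demo_game")
print(game)
```

Because the function takes no arguments, a loader of this kind can call it again whenever it needs a fresh game, for example at the start of each episode.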
Describing your game
--------------------
Every training run begins with a description of your game. This
description belongs in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``, the same file that defines the project's
name, version, and dependencies. Placing it there keeps the description
alongside the game it describes.
Here is the ``[tool.retro-gamer]`` section for the Snake example:
.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
    spatial = true
    observe_state = []
Let's go through each field.
``actions``
~~~~~~~~~~~
A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).
.. note::

    Only include actions that the game actually responds to. Listing
    unreachable keys wastes part of the agent's action space and may slow
    training.
``reward``
~~~~~~~~~~
The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, score increases by 1
(or more) each time the apple is eaten, so the agent receives a reward
equal to that increase on the turn it eats an apple and 0 on every other turn.
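To make the "change in value" rule concrete, here is a small sketch that turns a sequence of per-turn scores into per-turn rewards. The function name ``rewards_from_scores`` and the sample scores are made up for illustration; they are not part of ``retro-gamer``.

```python
def rewards_from_scores(scores):
    """Return per-turn rewards as the difference between consecutive scores."""
    return [after - before for before, after in zip(scores, scores[1:])]


# Hypothetical score readings over six consecutive turns.
scores = [0, 0, 1, 1, 1, 2]
print(rewards_from_scores(scores))  # [0, 1, 0, 0, 1]
```

Most turns produce a reward of 0, which is why reward sparsity matters for games like Snake.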
Choosing an appropriate reward is one of the most consequential
decisions in RL. Some considerations:
- A reward that is too sparse (one where the agent goes many turns
  without receiving any signal) makes learning slow. A snake that dies
  without ever eating an apple receives no positive reward at all in its
  first episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense (one assigned every turn) may not reflect
  the true goal of the game.
- An artificial reward, such as giving a point for moving toward the
  apple, can accelerate early training but may cause the agent to
  optimize the proxy rather than the real objective.
``character_set``
~~~~~~~~~~~~~~~~~
The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.
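One-hot encoding can be sketched in a few lines of plain Python. This is an illustration, not ``retro-gamer``'s implementation: the helper names are hypothetical, and the sketch assumes that "treated as empty" means an all-zero vector.

```python
CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


def one_hot(cell, character_set=CHARACTER_SET):
    """Encode one board cell as a list with a single 1 (or all 0s if unknown)."""
    return [1 if cell == ch else 0 for ch in character_set]


def encode_board(rows, character_set=CHARACTER_SET):
    """Encode a board given as a list of equal-length strings."""
    return [[one_hot(cell, character_set) for cell in row] for row in rows]


print(one_hot("*"))  # [0, 1, 0, 0, 0, 0]
print(one_hot(" "))  # [0, 0, 0, 0, 0, 0]  (assumed encoding of an empty cell)
```

Under this assumption, a character missing from the list is indistinguishable from an empty cell, so the agent cannot perceive it at all.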
For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
If you omit this field, ``retro-gamer`` will run a brief exploration
phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.
``spatial``
~~~~~~~~~~~
Whether to treat the board as a spatial scene (default: ``true``). For a
spatial game, the agent uses a *convolutional neural network* (CNN) that
can detect patterns in the relative arrangement of characters. For a
non-spatial game, it uses a simpler *multilayer perceptron* (MLP) that
ignores positional relationships. Set to ``false`` for games where
position is irrelevant.
Once you have written this section, create the training run directory:
.. code-block:: console

    % retro-gamer create \
        --game retro.examples.snake \
        --output runs/snake/
    Created training run at runs/snake/config.toml
    game        : retro.examples.snake
    board_size  : 32×16
    actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
    reward      : score
    characters  : ['@', '*', '>', '<', '^', 'v']
    architecture: CNN (spatial)
``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
``runs/snake/config.toml``.
Training the agent
------------------
With the ``config.toml`` in place, start training:
.. code-block:: console

    % retro-gamer train runs/snake/
    Training for 1000 episodes…
    Done. Checkpoints in runs/snake/checkpoints/
Training saves checkpoints every 100 episodes and a ``final.pt``
checkpoint when complete. You can follow progress in the training log:
.. code-block:: console

    % tail -f runs/snake/training.log
The log shows one line per episode:
.. code-block:: text

    [EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
    [EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
    [EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
- **total_reward**: the total score earned during the episode (how many
  apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
  steps in this episode. A decreasing loss generally indicates that the
  Q-value estimates are converging.
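Because each log line has a regular shape, it is easy to pull the fields into a script for plotting or comparison. The sketch below assumes the line format is exactly as in the sample above; ``parse_line`` is a hypothetical helper, not part of ``retro-gamer``.

```python
import re

# Pattern inferred from the sample log lines; adjust it if your log differs.
LOG_LINE = re.compile(
    r"\[EP (?P<episode>\d+)\] "
    r"total_reward=(?P<total_reward>-?[\d.]+) "
    r"steps=(?P<steps>\d+) "
    r"epsilon=(?P<epsilon>[\d.]+) "
    r"avg_loss=(?P<avg_loss>[\d.eE+-]+)"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "episode": int(match["episode"]),
        "total_reward": float(match["total_reward"]),
        "steps": int(match["steps"]),
        "epsilon": float(match["epsilon"]),
        "avg_loss": float(match["avg_loss"]),
    }


sample = "[EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204"
print(parse_line(sample))
```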
Resuming training
~~~~~~~~~~~~~~~~~
Training can be resumed from a checkpoint:
.. code-block:: console

    % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
Watching a trained agent play
------------------------------
To watch a trained agent play the game in your terminal:
.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint final
You can substitute any checkpoint name:
.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100
Press Enter or Escape to quit.
Comparing agents from different checkpoints is a useful exercise:
the agent at episode 100 has learned *something*, but typically much
less than the agent at episode 500. Articulating *what* the earlier
agent has and has not learned, and *why*, is a productive way to reason
about the training process.
Inspecting a run
----------------
To review the configuration and recent training progress for a run:
.. code-block:: console

    % retro-gamer info runs/snake/
    Game module : retro.examples.snake
    Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
    Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
    Last 5 episodes:
    [EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
    [EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
    [EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
    [EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
    [EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
    Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
Adjusting hyperparameters
--------------------------
The training hyperparameters can be changed by editing ``config.toml``
before training, or by passing them as options to ``retro-gamer
create``. Common adjustments and their effects:
``training_episodes``: How long to train. More episodes give the
agent more time to learn, but also take longer to run.
``epsilon_decay``: How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
Q-estimates before they are fully reliable. A slower decay (larger
``epsilon_decay``, closer to 1) gives the agent more time to explore
but may waste training time on random actions.
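To get a feel for how much the decay rate matters, the sketch below compares two schedules. It rests on assumptions that may not match ``retro-gamer`` exactly: epsilon is multiplied by ``epsilon_decay`` once per episode, starts at 1.0, and never falls below ``epsilon_min``. The name ``epsilon_start`` and the values 0.95, 0.995, and 0.01 are illustrative only.

```python
def epsilon_schedule(episodes, epsilon_decay, epsilon_start=1.0, epsilon_min=0.01):
    """Return epsilon after each episode under multiplicative decay (assumed)."""
    epsilon = epsilon_start
    values = []
    for _ in range(episodes):
        # Shrink epsilon by a constant factor, but never below the floor.
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        values.append(epsilon)
    return values


fast = epsilon_schedule(200, epsilon_decay=0.95)
slow = epsilon_schedule(200, epsilon_decay=0.995)
for episode in (50, 100, 200):
    print(episode, round(fast[episode - 1], 4), round(slow[episode - 1], 4))
```

Under these assumptions, the fast schedule reaches its floor well before episode 200, while the slow one is still taking a random action roughly a third of the time.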
``learning_rate``: How large the weight updates are at each
training step. A large learning rate learns fast but may overshoot;
a small learning rate is stable but slow.
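The trade-off can be seen on a toy problem that has nothing to do with games: minimizing ``f(w) = w**2`` by plain gradient descent. This is a deliberately simplified illustration, not the optimizer ``retro-gamer`` uses, and ``descend`` is a hypothetical helper.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w


for learning_rate in (0.01, 0.1, 1.1):
    print(learning_rate, descend(learning_rate))
```

With the smallest rate, ``w`` is still far from the minimum at 0 after 20 steps; with the middle rate it is close; with the largest rate each step overshoots and ``w`` grows instead of shrinking.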
``gamma``: The discount factor for future rewards. Closer to 1
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.
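The effect of ``gamma`` is easiest to see by computing a discounted return, the sum of future rewards with each one weighted by ``gamma`` raised to the number of turns until it arrives. The reward sequence below is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma*r1 + gamma**2*r2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode fragment: an apple is eaten three turns from now.
rewards = [0, 0, 0, 1]
print(discounted_return(rewards, gamma=0.99))
print(discounted_return(rewards, gamma=0.5))  # 0.125
```

With ``gamma`` near 1, the apple three turns away is worth almost as much as an immediate one; with ``gamma = 0.5`` it is worth only an eighth as much.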
``n_layers`` and ``layer_size``: The depth and width of the MLP
head. Larger networks can represent more complex Q-functions but are
slower to train and may overfit.
``prioritize_experiences``: Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.
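As a rough picture of what prioritization means, the sketch below uses the common proportional scheme, in which experiences with larger temporal-difference error are sampled more often. Whether ``retro-gamer`` uses this exact variant is an assumption; the function names and the ``alpha`` and ``eps`` values are illustrative.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Turn absolute TD errors into sampling probabilities (proportional scheme)."""
    # eps keeps zero-error experiences sampleable; alpha=0 would be uniform.
    priorities = [(abs(error) + eps) ** alpha for error in td_errors]
    total = sum(priorities)
    return [priority / total for priority in priorities]


def sample_indices(td_errors, k, rng=random):
    """Draw k buffer indices, favouring experiences with large TD error."""
    weights = sampling_probabilities(td_errors)
    return rng.choices(range(len(td_errors)), weights=weights, k=k)


td_errors = [0.01, 0.5, 0.02, 2.0]  # hypothetical errors for four experiences
print(sampling_probabilities(td_errors))
print(sample_indices(td_errors, k=8))
```

The extra bookkeeping for priorities is the source of the per-step slowdown mentioned above.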
Questions for investigation
----------------------------
The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.
1. **Character set completeness.** Train two agents: one with the full
   character set, one missing a character that frequently appears on the
   board. Compare their performance. What did the second agent lose the
   ability to perceive, and how did that affect its behavior?
2. **Spatial vs. non-spatial.** Train the same game with
   ``spatial = true`` and ``spatial = false``. How does training
   efficiency differ? Can you explain the difference in terms of what
   each architecture can and cannot learn?
3. **Reward shaping.** If the game currently rewards only the final
   objective (e.g., reaching a goal), add intermediate rewards for
   sub-goals. How does this change the early training curve? Does it
   change the agent's final strategy?
4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
   (so the agent commits to exploiting early) and a very slow one (so
   exploration continues for a long time). How do the training curves
   differ? What is the agent doing in each case when ``epsilon`` is low?
5. **Checkpoint comparison.** Load the agent at episode 100 and at
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?