
Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
  (or the built-in Snake example). See the
  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
  for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.

Here is the ``create_game`` function for Snake:

.. code-block:: python

   def create_game():
       head = SnakeHead()
       apple = Apple()
       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
       apple.relocate(game)
       return game

If your game module does not already have a ``create_game`` function,
add one following this pattern.
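
The loading step described above can be sketched with the standard
library's ``importlib``. This is only an illustration of the contract,
not ``retro-gamer``'s actual loader; the ``load_game`` helper name is
hypothetical.

.. code-block:: python

   import importlib


   def load_game(module_name):
       """Import ``module_name`` and return a new game from ``create_game``.

       Hypothetical helper for illustration; retro-gamer's own loader
       may differ.
       """
       module = importlib.import_module(module_name)
       factory = getattr(module, "create_game", None)
       if not callable(factory):
           raise AttributeError(
               f"{module_name} must define a create_game() function"
           )
       # create_game takes no arguments and returns a new Game instance.
       return factory()

With ``retro-games`` installed, ``load_game("retro.examples.snake")``
would return a fresh Snake game each time it is called.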


Describing your game
--------------------

Every training run begins with a description of your game. This
description lives in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``, the same file that defines the project's
name, version, and dependencies, so the description stays with the game
itself.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
   spatial = true
   observe_state = []

Let's go through the main fields.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).

.. note::

   Only include actions that the game actually responds to. Listing
   unreachable keys wastes part of the agent's action space and may slow
   training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, the score increases by
at least 1 each time the apple is eaten, so the agent receives a reward
of at least 1 on the turn it eats an apple and 0 on every other turn.
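
The "change from one turn to the next" rule can be sketched in a few
lines. The ``rewards_from_scores`` function and the score sequence are
made up for illustration and are not part of ``retro-gamer``.

.. code-block:: python

   def rewards_from_scores(scores):
       """Return per-turn rewards as the change in score between turns."""
       return [after - before for before, after in zip(scores, scores[1:])]


   # Score observed before the first turn and after each of five turns.
   scores = [0, 0, 1, 1, 1, 2]
   print(rewards_from_scores(scores))  # [0, 1, 0, 0, 1]

Only the two turns on which the score changed produce a nonzero reward.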

Choosing an appropriate reward is one of the most consequential
decisions in reinforcement learning (RL). Some considerations:

- A reward that is too sparse—where the agent goes many turns without
  receiving any signal—makes learning slow. A snake that dies without
  ever eating an apple receives no positive reward at all in the first
  episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense—assigned every turn—may not reflect the
  true goal of the game.
- An artificial reward, such as giving a point for moving toward the
  apple, can accelerate early training but may cause the agent to
  optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.
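
A minimal sketch of this encoding, using plain Python lists. It assumes
that "empty" is represented as an all-zero vector; ``retro-gamer`` may
represent empty cells differently, and the function names here are
illustrative.

.. code-block:: python

   CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


   def one_hot(cell, character_set=CHARACTER_SET):
       """Encode one board cell as a one-hot vector over character_set.

       Characters outside the set (including blanks) encode as all
       zeros, which is how this sketch represents "empty".
       """
       return [1 if cell == ch else 0 for ch in character_set]


   def encode_board(rows, character_set=CHARACTER_SET):
       """Encode a board given as a list of equal-length strings."""
       return [[one_hot(cell, character_set) for cell in row] for row in rows]


   print(one_hot("@"))  # [1, 0, 0, 0, 0, 0]
   print(one_hot(">"))  # [0, 0, 1, 0, 0, 0]
   print(one_hot("#"))  # [0, 0, 0, 0, 0, 0]

The last line shows why completeness matters: a character missing from
the list is indistinguishable from an empty cell.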

For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).

If you omit this field, ``retro-gamer`` will run a brief exploration
phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.

``spatial``
~~~~~~~~~~~

Whether to treat the board as a spatial scene (default: ``true``). A
spatial game uses a *convolutional neural network* (CNN), which can
detect patterns in the relative arrangement of characters wherever they
occur on the board. A non-spatial game uses a simpler *multilayer
perceptron* (MLP), which has no built-in notion of which cells are
adjacent. Set ``spatial`` to ``false`` for games where position is
irrelevant.

Once you have written this section, create the training run directory:

.. code-block:: console

   % retro-gamer create                    \
       --game retro.examples.snake         \
       --output runs/snake/

   Created training run at runs/snake/config.toml
     game        : retro.examples.snake
     board_size  : 32×16
     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
     reward      : score
     characters  : ['@', '*', '>', '<', '^', 'v']
     architecture: CNN (spatial)

``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it—along with all hyperparameters—to
``runs/snake/config.toml``.
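
To check what a tool would see in your ``[tool.retro-gamer]`` table, you
can read it yourself with ``tomllib`` from the standard library
(Python 3.11 or higher). This sketch parses the Snake section from a
string; it illustrates the file layout and is not how ``retro-gamer``
is necessarily implemented.

.. code-block:: python

   import tomllib  # standard library since Python 3.11

   PYPROJECT = """
   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]
   spatial = true
   observe_state = []
   """

   metadata = tomllib.loads(PYPROJECT)["tool"]["retro-gamer"]
   print(metadata["reward"])        # score
   print(metadata["spatial"])       # True
   print(len(metadata["actions"]))  # 4

For a real project, open ``pyproject.toml`` in binary mode and pass the
file object to ``tomllib.load`` instead.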

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

   % retro-gamer train runs/snake/
   Training for 1000 episodes…
   Done. Checkpoints in runs/snake/checkpoints/

Training saves a checkpoint every 100 episodes and a ``final.pt``
checkpoint when it completes. You can follow progress in the training log:

.. code-block:: console

   % tail -f runs/snake/training.log

The log shows one line per episode:

.. code-block:: text

   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
   [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204

- **total_reward**: the total score earned during the episode (how many
  apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
  steps in this episode. A decreasing loss generally indicates that the
  Q-value estimates are converging.
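
The ``epsilon`` column follows a decay schedule. The sketch below
assumes a multiplicative decay applied once per episode with a floor at
``epsilon_min``. The default values used here are illustrative guesses:
a decay of 0.995 reproduces the first sample log line above, but
``retro-gamer``'s actual schedule and defaults are not stated in this
walkthrough.

.. code-block:: python

   def epsilon_schedule(episodes, start=1.0, decay=0.995, minimum=0.01):
       """Yield the exploration rate after each episode.

       Assumes multiplicative decay with a floor; all three default
       values are illustrative, not retro-gamer's documented defaults.
       """
       epsilon = start
       for _ in range(episodes):
           epsilon = max(minimum, epsilon * decay)
           yield epsilon


   values = list(epsilon_schedule(1000))
   print(f"{values[0]:.4f}")   # 0.9950
   print(f"{values[49]:.4f}")  # 0.7783

Under these assumptions the agent still takes a random action on about
three turns in four after 50 episodes.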

Resuming training
~~~~~~~~~~~~~~~~~

Training can be resumed from a checkpoint:

.. code-block:: console

   % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt

Watching a trained agent play
------------------------------

To watch a trained agent play the game in your terminal:

.. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint final

You can substitute any checkpoint name:

.. code-block:: console

   % retro-gamer play runs/snake/ --checkpoint ep_0100

Press Enter or Escape to quit.

Comparing checkpoints from different stages of training is instructive:
the agent at episode 100 has learned *something*, but typically much
less than the agent at episode 500. Try to articulate *what* the earlier
agent has and has not learned, and *why*.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

   % retro-gamer info runs/snake/
   Game module : retro.examples.snake
   Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
   Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}

   Last 5 episodes:
     [EP 0996] total_reward=9.0   steps=1203  epsilon=0.0074  avg_loss=0.000312
     [EP 0997] total_reward=11.0  steps=1051  epsilon=0.0074  avg_loss=0.000289
     [EP 0998] total_reward=14.0  steps=987   epsilon=0.0074  avg_loss=0.000274
     [EP 0999] total_reward=8.0   steps=1142  epsilon=0.0074  avg_loss=0.000261
     [EP 1000] total_reward=12.0  steps=1089  epsilon=0.0074  avg_loss=0.000248

   Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']

Adjusting hyperparameters
--------------------------

The training hyperparameters can be changed by editing ``config.toml``
before training, or by passing them as options to ``retro-gamer
create``. Common adjustments and their effects:

``training_episodes``
   How long to train. More episodes give the agent more time to learn,
   but also take longer to run.

``epsilon_decay``
   How quickly exploration decreases. A faster decay (smaller
   ``epsilon_decay``) means the agent commits to its early Q-estimates
   before they are fully reliable. A slower decay (larger
   ``epsilon_decay``, closer to 1) gives the agent more time to explore
   but may waste training time on random actions.

``learning_rate``
   How large the weight updates are at each training step. A large
   learning rate learns fast but may overshoot; a small learning rate is
   stable but slow.

``gamma``
   The discount factor for future rewards. Closer to 1 means the agent
   values long-term consequences; closer to 0 makes the agent focus on
   immediate reward.

``n_layers`` and ``layer_size``
   The depth and width of the MLP head. Larger networks can represent
   more complex Q-functions but are slower to train and may overfit.

``prioritize_experiences``
   Whether to use prioritized experience replay. This often improves
   sample efficiency but is slightly slower per step.
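
The effect of ``gamma`` described above can be made concrete by
computing a discounted return, the quantity that Q-values estimate. The
reward sequence is invented for illustration.

.. code-block:: python

   def discounted_return(rewards, gamma):
       """Sum of rewards, each weighted by gamma ** (turns until received)."""
       total = 0.0
       # Work backwards so each step is total = reward + gamma * total.
       for reward in reversed(rewards):
           total = reward + gamma * total
       return total


   # An apple eaten 20 turns from now, with no reward before it.
   rewards = [0] * 20 + [1]
   print(round(discounted_return(rewards, gamma=0.99), 3))  # 0.818
   print(round(discounted_return(rewards, gamma=0.90), 3))  # 0.122

With ``gamma`` at 0.99 the distant apple is still worth most of its
face value; at 0.90 it is worth about an eighth, so the agent has much
less incentive to plan toward it.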

Questions for investigation
----------------------------

The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full
   character set, one missing a character that frequently appears on the
   board. Compare their performance. What did the second agent lose the
   ability to perceive, and how did that affect its behavior?

2. **Spatial vs. non-spatial.** Train the same game with ``spatial =
   true`` and ``spatial = false``. How does training efficiency differ?
   Can you explain the difference in terms of what each architecture
   can and cannot learn?

3. **Reward shaping.** If the game currently rewards only the final
   objective (e.g., reaching a goal), add intermediate rewards for
   sub-goals. How does this change the early training curve? Does it
   change the agent's final strategy?

4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
   (so the agent commits to exploiting early) and a very slow one (so
   exploration continues for a long time). How do the training curves
   differ? What is the agent doing in each case when ``epsilon`` is low?

5. **Checkpoint comparison.** Load the agent at episode 100 and at
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?
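
When working through the questions above, it can help to turn
``training.log`` into numbers you can plot or average. The sketch below
parses lines in the format shown in the sample log earlier in this
walkthrough. The regular expression is inferred from those sample lines,
so adjust it if your log differs; ``parse_log`` is a hypothetical helper,
not part of ``retro-gamer``.

.. code-block:: python

   import re

   # Pattern inferred from the sample log lines; an assumption, not a spec.
   LOG_LINE = re.compile(
       r"\[EP (?P<episode>\d+)\]\s+total_reward=(?P<total_reward>[-\d.]+)"
       r"\s+steps=(?P<steps>\d+)\s+epsilon=(?P<epsilon>[\d.]+)"
       r"\s+avg_loss=(?P<avg_loss>[\d.eE+-]+)"
   )


   def parse_log(lines):
       """Return one dict per episode line; non-matching lines are skipped."""
       episodes = []
       for line in lines:
           match = LOG_LINE.search(line)
           if match:
               episodes.append({
                   "episode": int(match["episode"]),
                   "total_reward": float(match["total_reward"]),
                   "steps": int(match["steps"]),
                   "epsilon": float(match["epsilon"]),
                   "avg_loss": float(match["avg_loss"]),
               })
       return episodes


   sample = [
       "[EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540",
       "[EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217",
       "[EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204",
   ]
   for row in parse_log(sample):
       print(row["episode"], row["total_reward"], row["epsilon"])

To compare two runs, parse each run's log this way and plot
``total_reward`` against ``episode`` for both.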