Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from
preparing a game to watching a trained agent play. The game used here
is the Snake example included with the ``retro-games`` framework, but
the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written
  (or the built-in Snake example). See the
  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
  for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by importing a Python module and
calling a function named ``create_game``. The ``create_game`` function
must take no arguments and return a new ``Game`` instance.

Here is the ``create_game`` function for Snake:

.. code-block:: python

    # Game, SnakeHead, and Apple are defined or imported elsewhere in the game module.
    def create_game():
        head = SnakeHead()
        apple = Apple()
        game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
        apple.relocate(game)
        return game

If your game module does not already have a ``create_game`` function,
add one following this pattern.

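The loading step described above can be sketched with the standard
library. This is only an illustration of the import-and-call mechanism,
not ``retro-gamer``'s actual loader: the helper name ``load_game`` and the
stand-in module ``demo_game`` are hypothetical, and the stand-in returns a
plain dictionary instead of a real ``Game``.

```python
import importlib
import sys
import types


def load_game(module_name):
    """Import ``module_name`` and return whatever its ``create_game()`` returns."""
    module = importlib.import_module(module_name)
    factory = getattr(module, "create_game", None)
    if not callable(factory):
        raise AttributeError(
            f"{module_name} does not define a create_game() function"
        )
    # create_game takes no arguments and returns a fresh game each time.
    return factory()


# Stand-in module (hypothetical) so the sketch runs without retro-games installed.
demo = types.ModuleType("demo_game")
demo.create_game = lambda: {"score": 0}
sys.modules["demo_game"] = demo

game = load_game("demo_game")
```

Because the function takes no arguments, a loader of this kind can call it
again whenever it needs a fresh game.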
Describing your game
--------------------

Every training run begins with a description of your game. This
description belongs in the ``[tool.retro-gamer]`` section of your game
project's ``pyproject.toml``, the same file that defines the project's
name, version, and dependencies. Placing it there keeps the description
alongside the game it describes.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
    spatial = true
    observe_state = []

Let's go through each field.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake,
the four arrow keys control the direction of travel. The agent also
implicitly has access to a no-op (doing nothing).

.. note::

   Only include actions that the game actually responds to. Listing
   unreachable keys wastes part of the agent's action space and may slow
   training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal.
``retro-gamer`` computes the reward for each turn as the *change* in
this value from one turn to the next. For Snake, the score increases by
1 (or more) each time the apple is eaten, so the agent receives a reward
of at least 1 on a turn when it eats an apple and 0 otherwise.

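As a minimal sketch of this rule (the helper ``turn_reward`` and the list of
state snapshots are illustrative assumptions, not part of ``retro-gamer``),
the per-turn reward is the difference between consecutive values of the
chosen state key:

```python
def turn_reward(previous_state, current_state, reward_key="score"):
    """Reward for one turn: the change in the chosen state value."""
    return current_state[reward_key] - previous_state[reward_key]


# Hypothetical state snapshots from five consecutive turns of Snake.
states = [{"score": 0}, {"score": 0}, {"score": 1}, {"score": 1}, {"score": 2}]

# Pair each state with its successor to get one reward per transition.
rewards = [turn_reward(before, after) for before, after in zip(states, states[1:])]
```

Here the reward is non-zero only on the two transitions where the score
changes, which is what makes the Snake reward sparse.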
Choosing an appropriate reward is one of the most consequential
decisions in reinforcement learning (RL). Some considerations:

- A reward that is too sparse, where the agent goes many turns without
  receiving any signal, makes learning slow. A snake that dies without
  ever eating an apple receives no positive reward at all in the first
  episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense (assigned every turn) may not reflect the
  true goal of the game.
- An artificial reward, such as giving a point for moving toward the
  apple, can accelerate early training but may cause the agent to
  optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of
single-character strings. Each cell of the board will be *one-hot
encoded* using this list: the agent represents the content of each cell
as a vector of zeros with a single 1 at the position corresponding to
the character. A cell containing a character not in this list is treated
as empty.

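The encoding can be illustrated in a few lines of plain Python. This is a
sketch under two assumptions that the text above does not spell out: an
empty cell is represented by an all-zero vector, and the board is available
as a list of equal-length strings. The function names are hypothetical.

```python
def one_hot_cell(char, character_set):
    """Return a one-hot vector for ``char``; all zeros if it is not listed."""
    vector = [0] * len(character_set)
    if char in character_set:
        vector[character_set.index(char)] = 1
    return vector


def encode_board(rows, character_set):
    """Encode every cell of a board given as a list of strings."""
    return [[one_hot_cell(char, character_set) for char in row] for row in rows]


CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]

# A tiny 4x2 board: an apple on the first row, a snake on the second.
board = ["  @ ", "**> "]
encoded = encode_board(board, CHARACTER_SET)
```

Under these assumptions, a character missing from the list is
indistinguishable from an empty cell, so the agent cannot perceive it.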
For Snake, the characters are: ``@`` (the apple), ``*`` (body
segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).

If you omit this field, ``retro-gamer`` will run a brief exploration
phase before training to discover which characters actually appear.
The number of exploration turns is controlled by the
``exploration_turns`` hyperparameter.

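The bookkeeping behind such a discovery phase might look like the sketch
below. It is not ``retro-gamer``'s implementation: the function name, the
use of a space for empty cells, and the representation of each observed
board as a list of strings are all assumptions made for illustration.

```python
def discover_characters(boards, empty=" "):
    """Collect the distinct non-empty characters seen across observed boards."""
    seen = []
    for rows in boards:
        for row in rows:
            for char in row:
                if char != empty and char not in seen:
                    seen.append(char)  # keep first-seen order
    return seen


# Hypothetical snapshots taken on three exploration turns.
observed = [
    ["  @ ", "*>  "],
    ["  @ ", " *> "],
    ["  @ ", "  *^"],
]
characters = discover_characters(observed)
```

A character that never shows up during exploration (here ``<`` and ``v``)
would be missed, which is one reason to list the character set explicitly.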
``spatial``
~~~~~~~~~~~

Whether to treat the board as a spatial scene (default: ``true``). A
spatial game uses a *convolutional neural network* (CNN) that can
detect patterns in the relative arrangement of characters. A
non-spatial game uses a simpler *multilayer perceptron* (MLP) that
ignores positional relationships. Set to ``false`` for games where
position is irrelevant.

Once you have written this section, create the training run directory:

.. code-block:: console

    % retro-gamer create \
        --game retro.examples.snake \
        --output runs/snake/

    Created training run at runs/snake/config.toml
    game        : retro.examples.snake
    board_size  : 32×16
    actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
    reward      : score
    characters  : ['@', '*', '>', '<', '^', 'v']
    architecture: CNN (spatial)

``retro-gamer create`` reads your game metadata directly from
``pyproject.toml`` and writes it, along with all hyperparameters, to
``runs/snake/config.toml``.

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

    % retro-gamer train runs/snake/
    Training for 1000 episodes…
    Done. Checkpoints in runs/snake/checkpoints/

Training saves checkpoints every 100 episodes and a ``final.pt``
checkpoint when complete. You can follow progress in the training log:

.. code-block:: console

    % tail -f runs/snake/training.log

The log shows one line per episode:

.. code-block:: text

    [EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
    [EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
    [EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204

- **total_reward**: the total score earned during the episode (how many
  apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is
  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training
  steps in this episode. A decreasing loss generally indicates that the
  Q-value estimates are converging.

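If you want to plot or compare these values, the log lines can be parsed
with a regular expression. The sketch below assumes the line format shown
in the sample above; the helper ``parse_log_line`` is hypothetical and not
part of ``retro-gamer``.

```python
import re

# Matches lines such as:
# [EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
LOG_LINE = re.compile(
    r"\[EP\s+(?P<episode>\d+)\]\s+"
    r"total_reward=(?P<total_reward>\S+)\s+"
    r"steps=(?P<steps>\d+)\s+"
    r"epsilon=(?P<epsilon>\S+)\s+"
    r"avg_loss=(?P<avg_loss>\S+)"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "episode": int(match["episode"]),
        "total_reward": float(match["total_reward"]),
        "steps": int(match["steps"]),
        "epsilon": float(match["epsilon"]),
        "avg_loss": float(match["avg_loss"]),
    }


sample = "[EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217"
record = parse_log_line(sample)
```

Applying ``parse_log_line`` to every line of ``training.log`` and discarding
the ``None`` results gives one record per episode.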
Resuming training
~~~~~~~~~~~~~~~~~

Training can be resumed from a checkpoint:

.. code-block:: console

    % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt

Watching a trained agent play
------------------------------

To watch a trained agent play the game in your terminal:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint final

You can substitute any checkpoint name:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100

Press Enter or Escape to quit.

Comparing agents from different checkpoints is instructive: the agent
at episode 100 has learned *something*, but typically much less than
the agent at episode 500. Try to articulate *what* the earlier agent
has and has not learned, and *why*. Doing so is a concrete way to
reason about the training process.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

    % retro-gamer info runs/snake/
    Game module : retro.examples.snake
    Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
    Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}

    Last 5 episodes:
    [EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
    [EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
    [EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
    [EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
    [EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248

    Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']

Adjusting hyperparameters
--------------------------

The training hyperparameters can be changed by editing ``config.toml``
before training, or by passing them as options to
``retro-gamer create``. Common adjustments and their effects:

``training_episodes``: How long to train. More episodes give the
agent more time to learn, but also take longer to run.

``epsilon_decay``: How quickly exploration decreases. A faster
decay (smaller ``epsilon_decay``) means the agent commits to its early
Q-estimates before they are fully reliable. A slower decay (larger
``epsilon_decay``, closer to 1) gives the agent more time to explore
but may waste training time on random actions.

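To get a feel for the effect of this setting, the sketch below assumes a
common schedule in which epsilon is multiplied by ``epsilon_decay`` once per
episode and never drops below ``epsilon_min``. Whether ``retro-gamer``
applies the decay exactly this way is an assumption, and the starting value
of 1.0, the default decay of 0.995, and the floor of 0.01 are illustrative
choices rather than documented defaults.

```python
def epsilon_schedule(episodes, epsilon_start=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Return the exploration rate after each episode under multiplicative decay."""
    epsilon = epsilon_start
    values = []
    for _ in range(episodes):
        # Shrink epsilon by a constant factor, but never below the floor.
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        values.append(epsilon)
    return values


fast = epsilon_schedule(1000, epsilon_decay=0.99)   # reaches the floor early
slow = epsilon_schedule(1000, epsilon_decay=0.999)  # still exploring at the end
```

Under these assumptions the fast schedule sits at the floor for the second
half of training, while the slow schedule is still taking a random action
more than a third of the time after 1000 episodes.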
``learning_rate``: How large the weight updates are at each
training step. A large learning rate learns fast but may overshoot;
a small learning rate is stable but slow.

``gamma``: The discount factor for future rewards. Closer to 1
means the agent values long-term consequences; closer to 0 makes the
agent focus on immediate reward.

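The effect of the discount factor is easy to see in a worked example. The
sketch below computes the standard discounted return, the sum of each
reward multiplied by ``gamma`` raised to the number of turns until it
arrives. The function name and the reward sequence are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per turn of delay."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# An apple eaten three turns from now: rewards of 0, 0, 0, then 1.
rewards = [0, 0, 0, 1]

patient = discounted_return(rewards, gamma=0.99)  # about 0.970
impatient = discounted_return(rewards, gamma=0.5)  # 0.125
```

With ``gamma`` near 1 the distant apple is worth almost as much as an
immediate one; with a small ``gamma`` it is worth very little, so the agent
has less incentive to plan toward it.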
``n_layers`` and ``layer_size``: The depth and width of the MLP
head. Larger networks can represent more complex Q-functions but are
slower to train and may overfit.

``prioritize_experiences``: Whether to use prioritized experience
replay. This often improves sample efficiency but is slightly slower
per step.

Questions for investigation
----------------------------

The following questions are intended to guide productive investigation
using ``retro-gamer``. They are chosen because they have specific,
reasoned answers that connect what you know about the game to the
concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full
   character set, one missing a character that frequently appears on the
   board. Compare their performance. What did the second agent lose the
   ability to perceive, and how did that affect its behavior?

2. **Spatial vs. non-spatial.** Train the same game with
   ``spatial = true`` and ``spatial = false``. How does training
   efficiency differ? Can you explain the difference in terms of what
   each architecture can and cannot learn?

3. **Reward shaping.** If the game currently rewards only the final
   objective (e.g., reaching a goal), add intermediate rewards for
   sub-goals. How does this change the early training curve? Does it
   change the agent's final strategy?

4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
   (so the agent commits to exploiting early) and a very slow one (so
   exploration continues for a long time). How do the training curves
   differ? What is the agent doing in each case when ``epsilon`` is low?

5. **Checkpoint comparison.** Load the agent at episode 100 and at
   episode 1000 and watch each play the same game. What has the later
   agent learned that the earlier one has not? How would you describe
   this difference to someone who does not know about neural networks?