Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from preparing a game to watching a trained agent play. The game used here is the Snake example included with the ``retro-games`` framework, but the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written (or the built-in Snake example). See the retro-games documentation for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by calling a function named ``create_game``. The function must take no arguments and return a new ``Game`` instance. Here is the ``create_game`` function for Snake:

.. code-block:: python

    # SnakeHead, Apple, and Game are defined or imported elsewhere in the game module.
    def create_game():
        head = SnakeHead()
        apple = Apple()
        # Agents, initial state, and board settings for a fresh game.
        game = Game([head, apple], {'score': 100}, board_size=(32, 16), framerate=12)
        # Place the apple once the game (and its board) exists.
        apple.relocate(game)
        return game

If your game file does not already have a ``create_game`` function, add one following this pattern. When you run ``retro-gamer create``, you can point to your game file directly by path or by Python module name:

.. code-block:: console

    % retro-gamer create --game my_game.py --output runs/my_game/
    % retro-gamer create --game retro.examples.snake --output runs/snake/

Describing your game
--------------------

Every training run begins with a description of your game. This description belongs in the ``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``, the same file that defines the project's name, version, and dependencies. Placing it there keeps the description with the game itself.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]

Let's go through each field.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake, the four arrow keys control the direction of travel. The agent also implicitly has access to a no-op (doing nothing).

.. note::

    Only include actions that the game actually responds to. Listing unreachable keys wastes part of the agent's action space and may slow training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal. ``retro-gamer`` computes the reward for each turn as the *change* in this value from one turn to the next. For Snake, the score changes when the snake eats an apple (+50), when it moves away from the apple (−1), and when it dies (−10). These incremental changes are what the agent tries to maximize.

Choosing an appropriate reward is one of the most consequential decisions in reinforcement learning (RL). Some considerations:

- A reward that is too sparse, where the agent goes many turns without receiving any signal, makes learning slow. A snake that dies without ever eating an apple receives no positive reward at all in the first episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense (assigned every turn) may not reflect the true goal of the game.
- An artificial reward, such as giving a point for moving toward the apple, can accelerate early training but may cause the agent to optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of single-character strings. Each cell of the board is *one-hot encoded* using this list: the agent represents the content of each cell as a vector of zeros with a single 1 at the position corresponding to the character. A cell containing a character not in this list is treated as empty.

For Snake, the characters are: ``@`` (the apple), ``*`` (body segments), and ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
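
The encoding described above can be sketched in a few lines of plain Python. This is an illustration, not ``retro-gamer``'s actual preprocessing code: the helper names ``one_hot_cell`` and ``encode_board`` are hypothetical, and the sketch assumes that an "empty" cell is represented by an all-zero vector and that the board is flattened row by row.

```python
# Illustrative sketch only; these helpers are hypothetical, not retro-gamer API.
CHARACTER_SET = ["@", "*", ">", "<", "^", "v"]


def one_hot_cell(char, character_set=CHARACTER_SET):
    """Encode one board cell as a list with a single 1 at the character's index.

    Assumption: characters not in the list (including blanks) become all zeros.
    """
    return [1 if char == known else 0 for known in character_set]


def encode_board(rows, character_set=CHARACTER_SET):
    """Flatten a board (a list of equal-length strings) into one long vector.

    Assumption: cells are visited row by row, left to right.
    """
    vector = []
    for row in rows:
        for char in row:
            vector.extend(one_hot_cell(char, character_set))
    return vector


# A tiny 4x2 board: an apple, a right-facing head, and one body segment.
board = [
    " @  ",
    " >* ",
]
encoded = encode_board(board)

print(one_hot_cell("@"))  # [1, 0, 0, 0, 0, 0]
print(one_hot_cell("#"))  # [0, 0, 0, 0, 0, 0] (not in the character set)
print(len(encoded))       # 48 = 2 rows * 4 columns * 6 characters
```

Under these assumptions the observation length grows with the number of listed characters, which is one way to see why changing ``character_set`` changes the size of the network's input layer.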

If you omit this field, ``retro-gamer`` will run a brief exploration phase before training to discover which characters actually appear. The number of exploration turns is controlled by the ``exploration_turns`` hyperparameter.

``spatial`` and other preprocessing options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``[tool.retro-gamer]`` section describes the game. Preprocessing options such as ``spatial`` (use a CNN when ``true``, an MLP when ``false``; default: ``false``), ``egocentric``, and ``observe_state`` live in the ``[preprocessing]`` section of the generated ``config.toml``. You can edit them there after running ``retro-gamer create``.

``observe_state``
~~~~~~~~~~~~~~~~~

By default the agent only sees the board. You can also give it access to computed values from ``game.state`` by listing the relevant keys in the ``observe_state`` option in ``[preprocessing]`` of ``config.toml``. For example, Snake exposes the normalized direction to the apple:

.. code-block:: toml

    [preprocessing]
    observe_state = ["apple_dx", "apple_dy"]

The trainer appends these values to the observation vector after the board encoding (or uses them as the entire observation when ``board = false``). These values must be set in ``game.state`` at the start of every episode (typically inside ``create_game()``) and must keep the same type and length from episode to episode.

.. warning::

    Always initialize every key listed in ``observe_state`` before the game starts. If a key is missing or its length changes between episodes, training stops immediately with a clear error explaining what changed. The neural network's input size is fixed when training begins; it cannot adapt to a changing observation shape mid-run.

This is a good place to ask: *can a human player see this information?* The apple's location is visible on screen; the normalized direction vector is not. Whether that asymmetry is appropriate is a design choice worth examining.

Once you have written the ``[tool.retro-gamer]`` section, create the training run directory:

.. code-block:: console

    % retro-gamer create \
        --game retro.examples.snake \
        --output runs/snake/
    Created training run at runs/snake/config.toml
      game        : retro.examples.snake
      board_size  : 32×16
      actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
      reward      : score
      characters  : ['@', '*', '>', '<', '^', 'v']
      architecture: MLP

``retro-gamer create`` reads your game metadata directly from ``pyproject.toml`` and writes it, along with all hyperparameters, to ``runs/snake/config.toml``.

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

    % retro-gamer train runs/snake/
    100%|████████████████████| 1000/1000 [12:34<00:00, 1.32ep/s, reward=9.0, eps=0.007, loss=0.0003]
    Done. Checkpoints saved in runs/snake/checkpoints/

A progress bar shows how far training has gone, along with the most recent episode's reward, the current exploration rate (``eps``), and the average prediction error (``loss``).

Training saves a checkpoint every 100 episodes to ``runs/snake/checkpoints/``. You can stop training at any time with Ctrl-C and resume it later; the next ``retro-gamer train`` command will automatically pick up from the latest checkpoint.

Reading the training log
~~~~~~~~~~~~~~~~~~~~~~~~

For a longer view of how training is progressing, inspect the training log:

.. code-block:: console

    % cat runs/snake/training.log

The log begins with the full network architecture, followed by one line per checkpoint (every 100 episodes):

.. code-block:: text

    [ep_0100] ep=0001-0100 avg_reward=-31.4 avg_steps=47 epsilon=0.938 avg_loss=7.2 time=0m12s total=0m12s
    [ep_0200] ep=0101-0200 avg_reward=-18.6 avg_steps=89 epsilon=0.879 avg_loss=6.8 time=0m14s total=0m26s
    [ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
    [ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
    [ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s

Here is what each field means:

- **avg_reward**: Average total reward per episode over the past 100 episodes. Positive values mean the agent is accumulating reward; negative values mean it is accumulating penalties. An upward trend over time is the main signal that learning is working.
- **avg_steps**: Average number of turns per episode. If episodes are ending quickly (small ``avg_steps``), the agent may be dying often. Longer episodes generally indicate the agent is surviving longer.
- **epsilon**: The current exploration rate. Starts near 1.0 (mostly random) and decays toward ``epsilon_min``. When ``epsilon`` is still high, erratic behavior is expected.
- **avg_loss**: Average Huber loss across training steps. Huber loss is quadratic for small prediction errors and linear for large ones, which keeps it stable even when rewards have a wide range (such as a large bonus for reaching a goal). Values in the range 0–10 are typical for most games. A slow downward trend is the healthy pattern. A loss that grows without bound indicates the learning rate is too high.
- **time**: Wall-clock time for this 100-episode interval.
- **total**: Cumulative training time across all sessions.

When training is resumed after a stop, a header line marks the break::

    === Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===

This lets you track exactly when each session took place.
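
Because each checkpoint line follows the same ``key=value`` layout, the log can also be read programmatically. The sketch below is a hypothetical helper, not something ``retro-gamer`` provides: the regular expression is inferred from the sample lines shown above (including the padding in ``avg_reward= -4.1``), and it simply skips anything that does not match, such as the architecture dump and the resume headers.

```python
import re

# Pattern inferred from the sample log lines; an assumption, not a documented format.
LINE_RE = re.compile(
    r"\[ep_(?P<checkpoint>\d+)\]\s+ep=\d+-\d+\s+"
    r"avg_reward=\s*(?P<avg_reward>[+-]?\d+(?:\.\d+)?)\s+"
    r"avg_steps=\s*(?P<avg_steps>\d+)\s+"
    r"epsilon=\s*(?P<epsilon>\d+(?:\.\d+)?)\s+"
    r"avg_loss=\s*(?P<avg_loss>\d+(?:\.\d+)?)"
)


def parse_log(text):
    """Return one dict per checkpoint line, ignoring all other lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # architecture dump, resume header, blank line, ...
        rows.append({
            "checkpoint": int(match["checkpoint"]),
            "avg_reward": float(match["avg_reward"]),
            "avg_steps": int(match["avg_steps"]),
            "epsilon": float(match["epsilon"]),
            "avg_loss": float(match["avg_loss"]),
        })
    return rows


sample_log = """\
[ep_0300] ep=0201-0300 avg_reward= -4.1 avg_steps=134 epsilon=0.824 avg_loss=6.1 time=0m15s total=0m41s
[ep_0500] ep=0401-0500 avg_reward= +8.7 avg_steps=210 epsilon=0.724 avg_loss=5.4 time=0m16s total=1m12s
=== Resumed from ep_0500.pt | 2026-05-09 14:22:01 ===
[ep_1000] ep=0901-1000 avg_reward=+22.3 avg_steps=389 epsilon=0.557 avg_loss=4.9 time=0m18s total=2m30s
"""

rows = parse_log(sample_log)
rewards = [row["avg_reward"] for row in rows]
# True when every parsed checkpoint has a higher avg_reward than the one before it.
improving = all(later > earlier for earlier, later in zip(rewards, rewards[1:]))
print(rewards, improving)
```

In practice you would pass the contents of ``runs/snake/training.log`` to ``parse_log`` instead of the inline sample.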

Stopping training to watch the agent play
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You do not need to wait for training to finish before watching the agent. Training can be stopped at any time with Ctrl-C, and the latest checkpoint is always available immediately:

.. code-block:: console

    % retro-gamer play runs/snake/

This loads the most recent checkpoint and runs the agent in your terminal. Press Enter or Escape to quit.

.. note::

    The game is rendered directly in your terminal. If the window is smaller than the board plus borders, ``retro-gamer play`` will raise a ``TerminalTooSmall`` error; enlarge the terminal window and try again.

To watch an earlier stage of training, use ``--checkpoint``:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100

Comparing what the agent at episode 100 does versus the agent at episode 500 can reveal exactly what the agent has (and has not) learned. For Snake, you might notice the episode-100 agent moving somewhat randomly, while the episode-500 agent consistently navigates toward the apple. Articulating *why* the later agent behaves differently (what the training process produced) connects observation directly to the concepts underlying DQN.

Resuming training after watching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After watching the agent play, resume training with exactly the same command you used before:

.. code-block:: console

    % retro-gamer train runs/snake/

``retro-gamer`` automatically detects and resumes from the latest checkpoint. No extra flags are needed. If all configured episodes have already been completed, it prints a message and exits:

.. code-block:: console

    Training already complete (1000 episodes).
    To keep training, increase training_episodes in config.toml.

To continue training, open ``runs/snake/config.toml``, increase the ``training_episodes`` value, and run ``retro-gamer train`` again.
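
To make "latest checkpoint" concrete, here is one way such a lookup could work, given the ``checkpoints/ep_NNNN.pt`` naming seen in this walkthrough. This is a guess at the idea, not ``retro-gamer``'s implementation: the function ``latest_checkpoint`` is hypothetical, and the demonstration runs against a temporary directory with empty placeholder files.

```python
import re
import tempfile
from pathlib import Path

# Matches names like ep_0100.pt and captures the episode number.
CHECKPOINT_RE = re.compile(r"^ep_(\d+)\.pt$")


def latest_checkpoint(run_dir):
    """Return the Path of the highest-numbered ep_NNNN.pt file, or None.

    Hypothetical helper: assumes checkpoints live in RUN_DIR/checkpoints/.
    """
    checkpoint_dir = Path(run_dir) / "checkpoints"
    if not checkpoint_dir.is_dir():
        return None
    best_path, best_episode = None, -1
    for path in checkpoint_dir.iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and int(match.group(1)) > best_episode:
            best_path, best_episode = path, int(match.group(1))
    return best_path


# Demonstration with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = latest_checkpoint(tmp)  # no checkpoints/ directory yet
    checkpoint_dir = Path(tmp) / "checkpoints"
    checkpoint_dir.mkdir()
    for name in ("ep_0100.pt", "ep_0200.pt", "ep_1000.pt", "notes.txt"):
        (checkpoint_dir / name).touch()
    found_name = latest_checkpoint(tmp).name

print(missing, found_name)  # None ep_1000.pt
```

Comparing the parsed episode numbers, rather than sorting file names as strings, keeps the lookup correct even if the zero padding ever differs between files.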

Watching a trained agent play
-----------------------------

Once training is complete, watch the final agent:

.. code-block:: console

    % retro-gamer play runs/snake/

By default the latest checkpoint is loaded. You can also compare the agent's performance at different stages of training:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100
    % retro-gamer play runs/snake/ --checkpoint ep_0500

Press Enter or Escape to quit.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

    % retro-gamer info runs/snake/
    Game module   : retro.examples.snake
    Metadata      : {'actions': ['KEY_RIGHT', ...], 'reward': 'score', 'board_size': [32, 16], ...}
    Preprocessing : {'spatial': False, 'board': True, 'observe_state': ['apple_dx', 'apple_dy'], ...}
    Model         : {'hidden_sizes': [128, 64]}
    Training      : {'learning_rate': 0.0001, 'gamma': 0.99, ...}

    Last 5 checkpoints:
      [ep_0600] ep=0501-0600 avg_reward=+12.1 ...
      [ep_0700] ep=0601-0700 avg_reward=+14.8 ...
      [ep_0800] ep=0701-0800 avg_reward=+16.3 ...
      [ep_0900] ep=0801-0900 avg_reward=+19.0 ...
      [ep_1000] ep=0901-1000 avg_reward=+22.3 ...

    Checkpoints (10): ['ep_0100.pt', 'ep_0200.pt', ..., 'ep_1000.pt']

Adjusting hyperparameters
-------------------------

The training hyperparameters can be changed by editing ``config.toml`` before training, or by passing them as options to ``retro-gamer create``. Common adjustments and their effects:

``training_episodes``: How long to train. More episodes give the agent more time to learn, but also take longer to run. This is always safe to increase.

``epsilon_decay``: How quickly exploration decreases. A faster decay (smaller ``epsilon_decay``) means the agent commits to its early Q-estimates before they are fully reliable. A slower decay (larger ``epsilon_decay``, closer to 1) gives the agent more time to explore but may waste training time on random actions.

``learning_rate``: How large the weight updates are at each training step. A large learning rate learns fast but may overshoot; a small learning rate is stable but slow.

``gamma``: The discount factor for future rewards. Closer to 1 means the agent values long-term consequences; closer to 0 makes the agent focus on immediate reward.

``hidden_sizes``: The shape of the MLP head as a list of layer sizes, e.g. ``[128, 64]``. Larger or deeper networks can represent more complex Q-functions but are slower to train and may overfit.

``prioritize_experiences``: Whether to use prioritized experience replay. This often improves sample efficiency but is slightly slower per step.

.. _incompatible-changes:

Why some changes require starting fresh
---------------------------------------

Not all changes to ``config.toml`` are equal. Some can be applied immediately to an existing training run; others make the existing checkpoints unusable.

**Safe to change at any time** (``[training]`` section): these affect *how* the agent learns, not *what* it is learning to do. Existing checkpoints remain valid:

- ``training_episodes``, ``max_turns_per_episode``
- ``learning_rate``, ``learning_rate_decay``, ``gamma``
- ``epsilon``, ``epsilon_decay``, ``epsilon_min``
- ``batch_size``, ``memory_capacity``, ``prioritize_experiences``
- ``target_update_freq``, ``train_every``

**Requires starting fresh**: these changes alter the shape of the game or the shape of the network. The saved model weights are incompatible with the new configuration:

- ``actions``, ``reward``, ``character_set``, ``board_size`` (``[metadata]``): these define what the agent perceives and what it can do. Changing them changes the size of the network's input or output layers; the existing weights no longer fit.
- ``spatial``, ``board``, ``observe_state``, ``observe_state_sizes``, ``egocentric``, ``egocentric_player``, ``egocentric_radius`` (``[preprocessing]``): these control how the observation is constructed. Any change here alters the input shape or meaning and makes existing weights invalid.
- ``hidden_sizes`` (``[model]``): this defines the network's hidden layers. Changing it changes the shape of the network; the existing weights no longer fit.

If you try to resume training after making one of these changes, ``retro-gamer train`` detects the mismatch and stops with a clear explanation, for example::

    Cannot resume from ep_0500.pt: incompatible changes detected in config.toml.

    The following changes require starting fresh. The existing model was trained
    on a different problem and its weights cannot be reused:

      character_set
        was : ['@', '*', '>', '<', '^', 'v']
        now : ['@', '*', '>', '<', '^', 'v', '#']
        why : the set of board characters (changes input layer size)

    Run 'retro-gamer clean RUN_DIR' to remove existing checkpoints and the
    training log, then run 'retro-gamer train RUN_DIR' to start fresh.

To clear out the old checkpoints and begin again:

.. code-block:: console

    % retro-gamer clean runs/snake/
    Will remove 5 checkpoint(s) and training log from runs/snake/:
      checkpoints/ep_0100.pt
      checkpoints/ep_0200.pt
      ...
      training.log
    Proceed? [y/N]: y
    Cleaned. Run 'retro-gamer train runs/snake/' to start fresh.

The ``config.toml`` is always preserved so you do not need to run ``retro-gamer create`` again.

Reasoning about training from the log
-------------------------------------

The training log is one of the most useful tools for understanding what is happening during training. Here are some patterns to look for and what they mean.

**Reward increasing steadily** is the normal, healthy pattern. Each checkpoint line should show a higher ``avg_reward`` than the last. The rate of increase typically slows as training progresses.

**Reward flat or negative through early episodes** is normal. Early in training, ``epsilon`` is high and the agent is mostly acting randomly. It has not yet discovered effective strategies. Patience, and a look at the ``epsilon`` column, will confirm whether this is just the exploration phase.

**Loss decreasing** is also healthy. As the Q-network's estimates improve, the difference between predicted and target Q-values (the TD error) should shrink. A loss that stabilizes near zero is usually a good sign.

**Loss growing without bound** indicates the learning rate is too high. The trainer uses Huber loss, which is robust to large reward scales, but a learning rate above roughly ``0.001`` can still destabilize training. Try reducing it by a factor of 10 (e.g. from ``0.001`` to ``0.0001``) and restarting training.

**Short episodes** (low ``avg_steps``) combined with low reward suggest the agent is dying frequently. Early in training this is normal. If it persists late in training, the agent may have settled on a bad policy; consider extending training or adjusting ``epsilon_decay`` to explore longer.

**Reward that improves and then regresses** can indicate that the agent has discovered a suboptimal but consistent strategy and is stuck. Increasing ``epsilon_min`` to keep some exploration active, or adjusting the reward signal to better differentiate good moves from bad ones, can help.

Questions for investigation
---------------------------

The following questions are intended to guide productive investigation using ``retro-gamer``. They are chosen because they have specific, reasoned answers that connect what you know about the game to the concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full character set, one missing a character that frequently appears on the board. Compare their performance. What did the second agent lose the ability to perceive, and how did that affect its behavior?
2. **Spatial vs. non-spatial.** Train the same game with ``spatial = true`` and ``spatial = false``. How does training efficiency differ? Can you explain the difference in terms of what each architecture can and cannot learn?
3. **Reward shaping.** If the game currently rewards only the final objective (e.g., reaching a goal), add intermediate rewards for sub-goals. How does this change the early training curve? Does it change the agent's final strategy?
4. **Exploration schedule.** Train with a very fast ``epsilon_decay`` (so the agent commits to exploiting early) and a very slow one (so exploration continues for a long time). How do the training curves differ? What is the agent doing in each case when ``epsilon`` is low?
5. **Checkpoint comparison.** Load the agent at episode 100 and at episode 1000 and watch each play the same game. What has the later agent learned that the earlier one has not? How would you describe this difference to someone who does not know about neural networks?
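
Before running the exploration-schedule experiment (question 4), it can help to tabulate how ``epsilon`` would evolve under two decay rates. The sketch below rests on assumptions this walkthrough does not confirm: that ``epsilon`` is multiplied by ``epsilon_decay`` once per episode and never drops below ``epsilon_min``. The function name and the default values are hypothetical, chosen only for illustration.

```python
def epsilon_schedule(episodes, epsilon=1.0, epsilon_decay=0.999, epsilon_min=0.05):
    """Return the exploration rate in effect at the start of each episode.

    Assumption: multiplicative decay applied once per episode, with a floor
    at epsilon_min. The defaults are illustrative, not retro-gamer's defaults.
    """
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return values


fast = epsilon_schedule(1000, epsilon_decay=0.99)   # commits to exploiting early
slow = epsilon_schedule(1000, epsilon_decay=0.999)  # keeps exploring much longer

for episode in (0, 100, 500, 999):
    print(f"episode {episode:4d}: fast={fast[episode]:.3f}  slow={slow[episode]:.3f}")
```

Under these assumptions the fast schedule reaches its floor within the first few hundred episodes while the slow one is still exploring at episode 500, which gives you a concrete prediction to check against the ``epsilon`` column of the training log.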