Walkthrough
===========

This section walks through a complete ``retro-gamer`` workflow, from preparing a game to watching a trained agent play. The game used here is the Snake example included with the ``retro-games`` framework, but the same steps apply to any game you build.

Prerequisites
-------------

You will need:

- Python 3.11 or higher.
- The ``retro-games`` framework installed and a game you have written (or the built-in Snake example). See the retro-games documentation for help writing games.
- ``retro-gamer`` installed (see :ref:`installation`).

Preparing your game
-------------------

``retro-gamer`` loads your game by importing a Python module and calling a function named ``create_game``. The ``create_game`` function must take no arguments and return a new ``Game`` instance. Here is the ``create_game`` function for Snake:

.. code-block:: python

    def create_game():
        head = SnakeHead()
        apple = Apple()
        game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
        apple.relocate(game)
        return game

If your game module does not already have a ``create_game`` function, add one following this pattern.

Describing your game
--------------------

Every training run begins with a description of your game. This description belongs in the ``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``, the same file that defines the project's name, version, and dependencies. Placing it there keeps the description with the game itself.

Here is the ``[tool.retro-gamer]`` section for the Snake example:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]
    spatial = true
    observe_state = []

Let's go through each field.

``actions``
~~~~~~~~~~~

A list of the keystrokes the agent may send to the game. For Snake, the four arrow keys control the direction of travel. The agent also implicitly has access to a no-op (doing nothing).

.. note::

   Only include actions that the game actually responds to. Listing keys the game ignores wastes part of the agent's action space and may slow training.

``reward``
~~~~~~~~~~

The key in the game's state dictionary to use as the reward signal. ``retro-gamer`` computes the reward for each turn as the *change* in this value from one turn to the next. For Snake, the score increases by 1 (or more) each time the apple is eaten, so the agent receives a positive reward on the turn it eats an apple and 0 on every other turn.

Choosing an appropriate reward is one of the most consequential decisions in RL. Some considerations:

- A reward that is too sparse, where the agent goes many turns without receiving any signal, makes learning slow. A snake that dies without ever eating an apple receives no positive reward at all in the first episodes, giving the learning algorithm almost nothing to work with.
- A reward that is too dense, assigned every turn, may not reflect the true goal of the game.
- An artificial reward, such as giving a point for moving toward the apple, can accelerate early training but may cause the agent to optimize the proxy rather than the real objective.

``character_set``
~~~~~~~~~~~~~~~~~

The characters that can appear on the board, as a list of single-character strings. Each cell of the board will be *one-hot encoded* using this list: the agent represents the content of each cell as a vector of zeros with a single 1 at the position corresponding to the character. A cell containing a character not in this list is treated as empty.

For Snake, the characters are ``@`` (the apple), ``*`` (body segments), and ``>``, ``<``, ``^``, ``v`` (the snake head in each direction).

If you omit this field, ``retro-gamer`` will run a brief exploration phase before training to discover which characters actually appear. The number of exploration turns is controlled by the ``exploration_turns`` hyperparameter.

``spatial``
~~~~~~~~~~~

Whether to treat the board as a spatial scene (default: ``true``). A spatial game uses a *convolutional neural network* (CNN) that can detect patterns in the relative arrangement of characters. A non-spatial game uses a simpler *multilayer perceptron* (MLP) that ignores positional relationships. Set to ``false`` for games where position is irrelevant.

Once you have written this section, create the training run directory:

.. code-block:: console

    % retro-gamer create \
        --game retro.examples.snake \
        --output runs/snake/
    Created training run at runs/snake/config.toml
    game        : retro.examples.snake
    board_size  : 32×16
    actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
    reward      : score
    characters  : ['@', '*', '>', '<', '^', 'v']
    architecture: CNN (spatial)

``retro-gamer create`` reads your game metadata directly from ``pyproject.toml`` and writes it, along with all hyperparameters, to ``runs/snake/config.toml``.

Training the agent
------------------

With the ``config.toml`` in place, start training:

.. code-block:: console

    % retro-gamer train runs/snake/
    Training for 1000 episodes…
    Done. Checkpoints in runs/snake/checkpoints/

Training saves checkpoints every 100 episodes and a ``final.pt`` checkpoint when complete. You can follow progress in the training log:

.. code-block:: console

    % tail -f runs/snake/training.log

The log shows one line per episode:

.. code-block:: text

    [EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
    [EP 0050] total_reward=1.0 steps=1921 epsilon=0.7783 avg_loss=0.003217
    [EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204

- **total_reward**: the total score earned during the episode (how many apples the snake ate, for Snake).
- **steps**: how many turns the episode lasted.
- **epsilon**: the current exploration rate. Early in training this is close to 1 (mostly random actions); it decays toward ``epsilon_min``.
- **avg_loss**: the average temporal-difference error across training steps in this episode. A decreasing loss generally indicates that the Q-value estimates are converging.

Resuming training
~~~~~~~~~~~~~~~~~

Training can be resumed from a checkpoint:

.. code-block:: console

    % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt

Watching a trained agent play
------------------------------

To watch a trained agent play the game in your terminal:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint final

You can substitute any checkpoint name:

.. code-block:: console

    % retro-gamer play runs/snake/ --checkpoint ep_0100

Press Enter or Escape to quit.

Comparing checkpoints from the same run is a useful exercise: the agent at episode 100 has learned *something*, but typically much less than the agent at episode 500. Articulating *what* the earlier agent has and has not learned, and *why*, is productive reasoning about the training process.

Inspecting a run
----------------

To review the configuration and recent training progress for a run:

.. code-block:: console

    % retro-gamer info runs/snake/
    Game module : retro.examples.snake
    Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
    Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
    Last 5 episodes:
    [EP 0996] total_reward=9.0 steps=1203 epsilon=0.0074 avg_loss=0.000312
    [EP 0997] total_reward=11.0 steps=1051 epsilon=0.0074 avg_loss=0.000289
    [EP 0998] total_reward=14.0 steps=987 epsilon=0.0074 avg_loss=0.000274
    [EP 0999] total_reward=8.0 steps=1142 epsilon=0.0074 avg_loss=0.000261
    [EP 1000] total_reward=12.0 steps=1089 epsilon=0.0074 avg_loss=0.000248
    Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']

Adjusting hyperparameters
--------------------------

The training hyperparameters can be changed by editing ``config.toml`` before training, or by passing them as options to ``retro-gamer create``. Common adjustments and their effects:

- ``training_episodes``: How long to train. More episodes give the agent more time to learn, but also take longer to run.
- ``epsilon_decay``: How quickly exploration decreases. A faster decay (smaller ``epsilon_decay``) means the agent commits to its early Q-estimates before they are fully reliable. A slower decay (larger ``epsilon_decay``, closer to 1) gives the agent more time to explore but may waste training time on random actions.
- ``learning_rate``: How large the weight updates are at each training step. A large learning rate learns fast but may overshoot; a small learning rate is stable but slow.
- ``gamma``: The discount factor for future rewards. Closer to 1 means the agent values long-term consequences; closer to 0 makes the agent focus on immediate reward.
- ``n_layers`` and ``layer_size``: The depth and width of the MLP head. Larger networks can represent more complex Q-functions but are slower to train and may overfit.
- ``prioritize_experiences``: Whether to use prioritized experience replay. This often improves sample efficiency but is slightly slower per step.

Questions for investigation
----------------------------

The following questions are intended to guide productive investigation using ``retro-gamer``. They are chosen because they have specific, reasoned answers that connect what you know about the game to the concepts underlying the training algorithm.

1. **Character set completeness.** Train two agents: one with the full character set, one missing a character that frequently appears on the board. Compare their performance. What did the second agent lose the ability to perceive, and how did that affect its behavior?
2. **Spatial vs. non-spatial.** Train the same game with ``spatial = true`` and ``spatial = false``. How does training efficiency differ? Can you explain the difference in terms of what each architecture can and cannot learn?
3. **Reward shaping.** If the game currently rewards only the final objective (e.g., reaching a goal), add intermediate rewards for sub-goals. How does this change the early training curve? Does it change the agent's final strategy?
4. **Exploration schedule.** Train with a very fast ``epsilon_decay`` (so the agent commits to exploiting early) and a very slow one (so exploration continues for a long time). How do the training curves differ? What is the agent doing in each case when ``epsilon`` is low?
5. **Checkpoint comparison.** Load the agent at episode 100 and at episode 1000 and watch each play the same game. What has the later agent learned that the earlier one has not? How would you describe this difference to someone who does not know about neural networks?
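
When working on the character set question, it can help to see concretely what a missing character does to the encoding. The following is a minimal sketch, not ``retro-gamer``'s actual implementation: the function names ``one_hot_cell`` and ``encode_board`` are hypothetical, and the sketch assumes that an "empty" cell is encoded as an all-zero vector.

.. code-block:: python

    def one_hot_cell(char, character_set):
        """Encode one board cell as a one-hot vector over character_set.

        Assumption: a character that is not in character_set yields an
        all-zero vector, i.e. it looks the same as an empty cell.
        """
        return [1 if char == known else 0 for known in character_set]


    def encode_board(rows, character_set):
        """Encode a board given as a list of equal-length strings."""
        return [[one_hot_cell(char, character_set) for char in row] for row in rows]


    full_set = ["@", "*", ">", "<", "^", "v"]
    reduced_set = ["@", ">", "<", "^", "v"]  # body segments ("*") left out

    # One row of a board: two body segments, a head, an empty cell, an apple.
    rows = ["**> @"]
    full = encode_board(rows, full_set)
    reduced = encode_board(rows, reduced_set)

    # With the full set, a body segment differs from an empty cell.
    print(full[0][0])     # [0, 1, 0, 0, 0, 0]
    print(full[0][3])     # [0, 0, 0, 0, 0, 0]

    # With the reduced set, both cells encode identically.
    print(reduced[0][0])  # [0, 0, 0, 0, 0]
    print(reduced[0][3])  # [0, 0, 0, 0, 0]

Under these assumptions, an agent trained with the reduced set receives the same input whether or not a body segment occupies a cell, which is a useful starting point for explaining any difference in behavior you observe.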
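
For the exploration schedule question, it may be useful to estimate how long exploration lasts before you start a run. The sketch below assumes that ``epsilon`` is multiplied by ``epsilon_decay`` once per episode and clamped at ``epsilon_min``. That update rule, the starting value of 1.0, and the floor of 0.01 are assumptions made for illustration; check ``config.toml`` and the training log for the values your run actually uses. The helper names are hypothetical.

.. code-block:: python

    def epsilon_schedule(episodes, epsilon_decay, epsilon_start=1.0, epsilon_min=0.01):
        """Return the exploration rate after each episode.

        Assumption: multiplicative decay applied once per episode, with a
        floor at epsilon_min. The defaults are illustrative only.
        """
        values = []
        epsilon = epsilon_start
        for _ in range(episodes):
            epsilon = max(epsilon_min, epsilon * epsilon_decay)
            values.append(epsilon)
        return values


    def first_episode_at_floor(values, epsilon_min=0.01):
        """Return the 1-based episode at which epsilon reaches the floor, or None."""
        for episode, value in enumerate(values, start=1):
            if value <= epsilon_min:
                return episode
        return None


    fast = epsilon_schedule(1000, epsilon_decay=0.95)
    slow = epsilon_schedule(1000, epsilon_decay=0.999)

    print("fast decay reaches the floor at episode", first_episode_at_floor(fast))
    print("slow decay reaches the floor at episode", first_episode_at_floor(slow))
    print("slow decay, epsilon after 1000 episodes:", round(slow[-1], 4))

Under these assumptions, the fast schedule stops exploring within the first tenth of a 1000-episode run, while the slow schedule never reaches the floor and is still taking random actions more than a third of the time at the end.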