Introduction
============

``retro-gamer`` grew out of a question about how students learn
difficult ideas in computer science. Reinforcement learning, the branch
of machine learning in which an agent learns to act well by interacting
with an environment and receiving rewards, is one of the most powerful
and widely deployed ideas in modern computing. It underlies systems that
play chess and Go at superhuman levels, control industrial robots,
optimize power grids, and personalize recommendation feeds. It is also
genuinely hard to understand, not because the core ideas are especially
abstract, but because the feedback between a student's understanding and
the system's behavior is usually invisible. You adjust a hyperparameter,
run a training loop, and get a number. What happened inside, and why,
remains opaque.

The design hypothesis of ``retro-gamer`` is that this opacity is not
inevitable. If a student already knows a game well (how it works, what
the pieces mean, what counts as doing well), then training an agent on
that game gives them a concrete anchor for reasoning about what the
learning algorithm is doing and why. When the trainer decides to use a
convolutional neural network instead of a simpler model, it explains its
reasoning. When training stalls, the student can ask: did I describe the
game accurately? Is the reward rewarding the right behavior? Would a
different exploration strategy help? These are exactly the questions that
build genuine conceptual understanding.

``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ curriculum, a
project-based high school computer science curriculum emphasizing
personally meaningful creation and deep conceptual engagement. In the
games unit, students design and implement their own games using the
``retro-games`` framework. The extension into reinforcement learning is
a natural next step: you built the game; now let's see if a machine can
learn to play it.

How retro-gamer works
---------------------

Rather than asking you to write a training algorithm yourself,
``retro-gamer`` asks you to describe the game you want to train on.
This description, written in your game project's ``pyproject.toml``, tells
the trainer things the game's code alone doesn't make obvious: which
characters matter, which piece of game state represents success, whether
the board should be understood spatially or as a flat data display.

From this description, the trainer constructs a deep Q-learning model
suited to the game. It writes out a plain-language explanation of every
architectural decision it makes, then begins training. As training
proceeds, it logs each episode's reward, loss, and exploration rate.
Snapshots of the trained model (checkpoints) are saved periodically, so
you can watch how the agent's skill develops over time. When you're done
training, you can load any checkpoint and watch the agent play.

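The exploration rate and loss that appear in the log correspond to two
standard pieces of deep Q-learning. The sketch below is a generic
illustration, not ``retro-gamer``'s own code: it shows epsilon-greedy
action selection and the one-step target that a Q-network's predictions
are compared against. The function names and the discount factor
``gamma=0.99`` are assumptions chosen for the example.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest predicted Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over next-state Q-values."""
    if done:
        return reward  # nothing further to earn once the episode has ended
    return reward + gamma * max(next_q_values)


def squared_error(predicted_q, target):
    """Per-sample loss; averaging a loss like this over many samples is one common choice."""
    return (predicted_q - target) ** 2
```

With ``epsilon`` near 1 the agent acts almost entirely at random; as it
falls toward 0 the agent increasingly trusts its own Q-value estimates.
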
A typical workflow looks like this. First, describe your game in the
``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``:

.. code-block:: toml

    [tool.retro-gamer]
    actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
    reward = "score"
    character_set = ["@", "*", ">", "<", "^", "v"]

Then create a training run, train, and watch the result:

.. code-block:: console

    % retro-gamer create --game my_game --output runs/snake/

    % retro-gamer train runs/snake/

    % retro-gamer play runs/snake/ --checkpoint ep_0500

The ``create`` command sets up the training run directory; ``train``
runs the learning algorithm; ``play`` loads a checkpoint and lets you
watch the trained agent live in the terminal.

What you will learn
-------------------

Working with ``retro-gamer`` is designed to build understanding of a
cluster of related ideas:

**Reinforcement learning** is the framework in which an agent
interacts with an environment, receiving observations and rewards, and
learns to choose actions that maximize its long-term reward. The
``retro-gamer`` training loop is a concrete instance of this framework:
the agent is the neural network, the environment is the game, the
observation is the encoded board and game state, and the reward is
the change in score from one turn to the next.

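The agent-environment loop, with reward defined as the change in score,
can be sketched in a few lines. Everything here is illustrative:
``ToyGame`` and ``run_episode`` are hypothetical stand-ins, not part of
``retro-gamer`` or ``retro-games``, and the policy is a plain function
rather than a neural network.

```python
class ToyGame:
    """Hypothetical stand-in for a game: the score rises when action 0 is taken."""

    def __init__(self, max_turns=10):
        self.score = 0
        self.turn = 0
        self.max_turns = max_turns

    def observe(self):
        # A real observation would encode the board; this one is just two numbers.
        return (self.turn, self.score)

    def step(self, action):
        if action == 0:
            self.score += 1
        self.turn += 1
        return self.turn >= self.max_turns  # True once the episode is over


def run_episode(game, policy):
    """Play one episode; each turn's reward is the change in score."""
    total_reward = 0
    done = False
    while not done:
        observation = game.observe()
        action = policy(observation)
        score_before = game.score
        done = game.step(action)
        reward = game.score - score_before
        total_reward += reward
    return total_reward
```

Because each reward is a score difference, the rewards over an episode
sum to the final score minus the starting score.
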
**Neural network architecture** shapes what a model can and cannot
learn. When you declare a game ``spatial``, the trainer builds a
convolutional neural network that can detect patterns in the relative
positions of game pieces. When you declare it non-spatial, it builds a
simpler network that ignores position. Seeing the consequence of this
choice in training behavior is a direct experience of why architecture
matters.

**Observation design** determines what information is available to the
agent. If you leave a character out of the ``character_set``, the agent
will not distinguish it from empty space. If the game module defines a
``get_state()`` function, the agent also receives those computed values
as part of its observation. The consequences of these choices for what
the agent can learn are reasonably predictable, and making and checking
those predictions is exactly the kind of reasoning the tool is designed
to support.

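A small sketch shows why an omitted character becomes invisible. It
assumes a one-hot encoding in which each cell gets one slot per entry in
``character_set`` and anything else is encoded as all zeros; the function
``one_hot_board`` is hypothetical, and the trainer's real encoding may
differ in detail.

```python
def one_hot_board(rows, character_set):
    """Encode each cell as a list with one slot per character in character_set."""
    index = {ch: i for i, ch in enumerate(character_set)}
    encoded = []
    for row in rows:
        encoded_row = []
        for ch in row:
            cell = [0] * len(character_set)
            if ch in index:
                cell[index[ch]] = 1
            # Characters outside the set stay all-zero, exactly like empty space.
            encoded_row.append(cell)
        encoded.append(encoded_row)
    return encoded


board = ["@ *", " # "]
encoded = one_hot_board(board, ["@", "*"])
print(encoded[1][1] == encoded[0][1])  # "#" and " " encode identically
```

Here ``#`` is missing from the character set, so its cell is
indistinguishable from the blank cell next to it.
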
**Reward engineering** is the craft of specifying what counts as doing
well in a way the agent can actually optimize. Using score as the reward
is natural for many games, but some games have sparse rewards (the agent
rarely earns points), and some have reward signals that are easy to
game. Experimenting with what to use as a reward, and observing how that
choice shapes training, is one of the richest paths into understanding
what reinforcement learning is actually optimizing.

**Hyperparameter tuning** is the practice of adjusting training settings
such as learning rate, exploration probability, and network size to
improve training efficiency and final performance. ``retro-gamer``
exposes these settings explicitly and explains their role in the
training log, so tuning them is connected to conceptual understanding
rather than uninformed search.

The interpretable training log
------------------------------

A key feature of ``retro-gamer`` is its training log. When training
begins, the trainer writes a complete, plain-language account of the
model it built: why it chose the architecture it did, what the
observation vector contains, what actions the agent can take, and how
the exploration and learning schedules are set up. Here is an example
from training a snake agent:

.. code-block:: text

    [INIT] === Network Architecture ===
    [INIT] Board: 32×16, character set: 6 chars (one-hot per cell)
    [INIT] Observed state keys: 0 | Actions (incl. no-op): 5
    [INIT] spatial=True → using CNN architecture
    [INIT] Rationale: the board is a 2-D spatial scene; a CNN captures
    [INIT] local patterns (walls, items nearby) more efficiently than an MLP.
    [INIT] CNN: Conv2d(6→32, k=3, pad=1) → ReLU → Conv2d(32→64, k=3, pad=1) → ReLU
    [INIT] CNN output: 64 channels × 16×32 = 32768 features (flattened)
    [INIT] MLP head input: 32768 (conv) + 0 (state) = 32768
    [INIT] MLP: 32768 → 128 → 128 → 5
    [INIT] Hidden layers: 2 | Layer width: 128
    [INIT] Output: 5 Q-values
    [INIT] Actions: ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN'] + (no-op)
    ...
    [EP 0001] total_reward=0.0 steps=2000 epsilon=0.9950 avg_loss=0.023540
    [EP 0100] total_reward=3.0 steps=1847 epsilon=0.6065 avg_loss=0.001204
    [EP 0500] total_reward=9.0 steps=1203 epsilon=0.0821 avg_loss=0.000387

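The sizes reported in the ``[INIT]`` lines can be checked by hand. The
sketch below uses the standard output-size formula for a convolution and
counts parameters for the layers named in the log. It assumes stride 1
and a bias term on every layer (common defaults); neither is stated in
the log, so treat the parameter total as an estimate.

```python
def conv2d_out(size, kernel, padding, stride=1):
    """Length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel (bias is an assumption)."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(n_in, n_out):
    """Weights plus one bias per output unit (bias is an assumption)."""
    return n_in * n_out + n_out


height, width = 16, 32
for _ in range(2):  # two conv layers, each with k=3, pad=1
    height = conv2d_out(height, kernel=3, padding=1)
    width = conv2d_out(width, kernel=3, padding=1)

features = 64 * height * width  # flattened CNN output
layer_sizes = [features, 128, 128, 5]
total_params = (
    conv2d_params(6, 32, 3)
    + conv2d_params(32, 64, 3)
    + sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
)
print(features, total_params)
```

A 3×3 kernel with padding 1 leaves the board size unchanged, which is
why the flattened output is 64 × 16 × 32 = 32768. Under the stated
assumptions the network has roughly 4.2 million parameters, nearly all
of them in the first fully connected layer (32768 → 128).
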
The episode log shows total reward (score earned), how many turns the
episode lasted, the current exploration rate (``epsilon``), and the
average prediction error (``avg_loss``). Reading this log, and
connecting changes in these numbers to what you know about the game and
the algorithm, is one of the main activities the tool is designed to
support.

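As an example of that kind of reading, the three ``epsilon`` values in
the sample log are consistent with a simple exponential decay,
``epsilon = exp(-episode / 200)``. This is an inference from the printed
numbers only, not a documented schedule: the actual rule, its constants,
and any minimum value may differ. The function name ``epsilon_at`` is
hypothetical.

```python
import math


def epsilon_at(episode, time_constant=200.0):
    """Exponential decay guess: epsilon = exp(-episode / time_constant)."""
    return math.exp(-episode / time_constant)


# Values copied from the sample log.
logged = {1: 0.9950, 100: 0.6065, 500: 0.0821}

for episode, value in logged.items():
    print(f"episode {episode:4d}: logged {value:.4f}, formula {epsilon_at(episode):.4f}")
```

If the guess is right, the agent takes a random action about 61% of the
time at episode 100 and about 8% of the time at episode 500, which fits
the rising ``total_reward`` over the same span.
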