
Introduction
============

``retro-gamer`` grew out of a question about how students learn
difficult ideas in computer science. Reinforcement learning—the branch
of machine learning in which an agent learns to act well by interacting
with an environment and receiving rewards—is one of the most powerful
and widely deployed ideas in modern computing. It underlies systems that
play chess and Go at superhuman levels, control industrial robots,
optimize power grids, and personalize recommendation feeds. It is also
genuinely hard to understand, not because the core ideas are especially
abstract, but because the feedback between a student's understanding and
the system's behavior is usually invisible. You adjust a hyperparameter,
run a training loop, and get a number. What happened inside, and why,
remains opaque.

The design hypothesis of ``retro-gamer`` is that this opacity is not
inevitable. If a student already knows a game well—how it works, what
the pieces mean, what counts as doing well—then training an agent on
that game gives them a concrete anchor for reasoning about what the
learning algorithm is doing and why. When the trainer decides to use a
convolutional neural network instead of a simpler model, it explains its
reasoning. When training stalls, the student can ask: did I describe the
game accurately? Is the reward encouraging the right behavior? Would a
different exploration strategy help? These are exactly the questions that
build genuine conceptual understanding.

``retro-gamer`` is developed as part of the
`Making With Code <https://makingwithcode.org>`__ curriculum, a
project-based high school computer science curriculum emphasizing
personally meaningful creation and deep conceptual engagement. In the
games unit, students design and implement their own games using the
``retro-games`` framework. The extension into reinforcement learning is
a natural next step: you built the game; now let's see if a machine can
learn to play it.

How retro-gamer works
---------------------

Rather than asking you to write a training algorithm yourself,
``retro-gamer`` asks you to describe the game you want to train on.
This description—written in your game project's ``pyproject.toml``—tells
the trainer things the game's code alone doesn't make obvious: which
characters matter, which piece of game state represents success, whether
the board should be understood spatially or as a flat data display.

From this description, the trainer constructs a deep Q-learning model
suited to the game. It writes out a plain-language explanation of every
architectural decision it makes, then begins training. As training
proceeds, it logs each episode's reward, loss, and exploration rate.
Trained model snapshots—checkpoints—are saved periodically, so you can
watch how the agent's skill develops over time. When you're done
training, you can load any checkpoint and watch the agent play.

A typical workflow looks like this. First, describe your game in the
``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``:

.. code-block:: toml

   [tool.retro-gamer]
   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
   reward = "score"
   character_set = ["@", "*", ">", "<", "^", "v"]

Then create a training run, train, and watch the result:

.. code-block:: console

   % retro-gamer create --game my_game --output runs/snake/

   % retro-gamer train runs/snake/

   % retro-gamer play runs/snake/ --checkpoint ep_0500

The ``create`` command sets up the training run directory; ``train``
runs the learning algorithm; ``play`` loads a checkpoint and lets you
watch the trained agent live in the terminal.

What you will learn
-------------------

Working with ``retro-gamer`` is designed to build understanding of a
cluster of related ideas:

**Reinforcement learning** is the framework in which an agent
interacts with an environment, receiving observations and rewards, and
learns to choose actions that maximize its long-term reward. The
``retro-gamer`` training loop is a concrete instance of this framework:
the agent is the neural network, the environment is the game, the
observation is the encoded board and game state, and the reward is
the change in score from one turn to the next.
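
The loop described above can be sketched in a few lines of Python. This
is an illustration only, not ``retro-gamer``'s implementation:
``ToyGame``, ``run_episode``, and the random policy are hypothetical
stand-ins. The one detail taken from the description above is that the
reward is the change in score between turns.

.. code-block:: python

   import random


   class ToyGame:
       """Hypothetical stand-in for a game: walk right along a line to score."""

       def __init__(self, length=5):
           self.length = length
           self.position = 0
           self.score = 0

       def step(self, action):
           """Apply one action and return True when the episode is over."""
           if action == "KEY_RIGHT":
               self.position += 1
               self.score += 1  # moving right earns a point
           elif action == "KEY_LEFT" and self.position > 0:
               self.position -= 1
           return self.position >= self.length


   def run_episode(game, choose_action, max_steps=100):
       """Run one episode and return the list of per-turn rewards."""
       rewards = []
       for _ in range(max_steps):
           observation = game.position          # what the agent sees
           action = choose_action(observation)  # the agent acts
           score_before = game.score
           done = game.step(action)             # the environment responds
           rewards.append(game.score - score_before)  # reward = change in score
           if done:
               break
       return rewards


   # A policy that ignores the observation and picks a random action.
   rng = random.Random(0)
   rewards = run_episode(ToyGame(), lambda obs: rng.choice(["KEY_LEFT", "KEY_RIGHT"]))
   print(sum(rewards), "points in", len(rewards), "turns")

Replacing the random policy with one that learns from the rewards it
receives is what turns this loop into reinforcement learning.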

**Neural network architecture** shapes what a model can and cannot
learn. When you declare a game ``spatial``, the trainer builds a
convolutional neural network that can detect patterns in the relative
positions of game pieces. When you declare it non-spatial, it builds a
simpler fully connected network that treats the board as a flat list of
values, with no built-in notion of which cells are neighbors. Seeing
the consequence of this choice in training behavior is a direct
experience of why architecture matters.
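
One way to see why a convolutional network suits spatial boards is that
a single small filter responds to a pattern wherever it appears. The
sketch below is a hand-written illustration in plain Python, not the
trainer's code: ``correlate`` and ``pair_detector`` are hypothetical
names, and a real network learns its filter weights rather than having
them written by hand.

.. code-block:: python

   def correlate(grid, kernel):
       """Slide a small kernel over a 2-D grid (no padding); return all responses."""
       k_rows, k_cols = len(kernel), len(kernel[0])
       responses = []
       for r in range(len(grid) - k_rows + 1):
           row = []
           for c in range(len(grid[0]) - k_cols + 1):
               row.append(sum(
                   grid[r + i][c + j] * kernel[i][j]
                   for i in range(k_rows)
                   for j in range(k_cols)
               ))
           responses.append(row)
       return responses


   # Responds most strongly to two pieces sitting side by side.
   pair_detector = [[1, 1]]

   top_left = [
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
   ]
   bottom_right = [
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 1],
   ]

   for grid in (top_left, bottom_right):
       responses = correlate(grid, pair_detector)
       print(max(max(row) for row in responses))

Both boards produce the same peak response, even though the pair of
pieces sits in a different corner. A fully connected network has no
such shared filter, so it has to learn the pattern separately for each
location.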

**Observation design** determines what information is available to the
agent. If you leave a character out of the ``character_set``, the agent
will not distinguish it from empty space. If the game module defines a
``get_state()`` function, the agent also receives those computed values
as part of its observation. The consequences of these choices for what
the agent can learn are reasonably predictable, and making and checking
those predictions is exactly the kind of reasoning the tool is designed
to support.
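
To make the effect of ``character_set`` concrete, here is a minimal
sketch of one-hot board encoding. The function ``encode_board`` is
hypothetical, and encoding every character outside the set as all zeros
(the same as an empty cell) is an assumption chosen to match the
behavior described above. The trainer's actual encoding may differ in
detail.

.. code-block:: python

   def encode_board(rows, character_set):
       """One-hot encode a board given as a list of equal-length strings.

       Returns one 2-D grid of 0s and 1s per character in character_set.
       Any character outside the set (including a space) is 0 in every
       grid, so it looks identical to empty space.
       """
       return [
           [[1 if cell == char else 0 for cell in row] for row in rows]
           for char in character_set
       ]


   board = ["@ *",
            " # "]

   full = encode_board(board, ["@", "*", "#"])
   partial = encode_board(board, ["@", "*"])  # "#" left out of the set

   print(len(full), "channels with '#',", len(partial), "without")
   # Value of the cell holding "#" in each channel of `partial`:
   print([channel[1][1] for channel in partial])

In ``partial``, the cell that holds ``#`` is zero in every channel,
exactly like the blank cells around it, so nothing in the observation
tells the agent that anything is there.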

**Reward engineering** is the craft of specifying what counts as doing
well in a way the agent can actually optimize. Using score as the reward
is natural for many games, but some games have sparse rewards (the agent
rarely earns points), and some have reward signals that are easy to
exploit without actually playing well. Experimenting with what to use
as a reward, and observing how that choice shapes training, is one of
the richest paths into understanding what reinforcement learning is
actually optimizing.

**Hyperparameter tuning** is the practice of adjusting training settings
such as learning rate, exploration probability, and network size to
improve training efficiency and final performance. ``retro-gamer``
exposes these settings explicitly and explains their role in the
training log, so tuning them is connected to conceptual understanding
rather than uninformed search.
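
To see where settings like these enter the algorithm, consider a
minimal sketch of the Q-learning update. It is deliberately simplified:
it stores Q-values in a table rather than a neural network, so it
illustrates the idea behind deep Q-learning rather than what
``retro-gamer`` runs. The names ``choose_action`` and ``q_update``, the
``discount`` setting, and the default values are all illustrative
assumptions.

.. code-block:: python

   import random


   def choose_action(q_values, epsilon, rng):
       """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
       if rng.random() < epsilon:
           return rng.randrange(len(q_values))  # explore: random action
       return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


   def q_update(q_table, state, action, reward, next_state, done,
                learning_rate=0.1, discount=0.9):
       """Move Q(state, action) toward reward + discount * best next value.

       Returns the squared prediction error for this single update.
       """
       best_next = 0.0 if done else max(q_table[next_state])
       target = reward + discount * best_next
       error = target - q_table[state][action]
       q_table[state][action] += learning_rate * error
       return error ** 2


   # One update on a two-state, two-action toy problem.
   q_table = {"start": [0.0, 0.0], "goal": [0.0, 0.0]}
   loss = q_update(q_table, "start", action=1, reward=1.0,
                   next_state="goal", done=True)
   print(q_table["start"], loss)

A larger ``learning_rate`` moves each estimate further toward its target
on every update, and a larger ``epsilon`` makes the agent try random
actions more often.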

The interpretable training log
------------------------------

A key feature of ``retro-gamer`` is its training log. When training
begins, the trainer writes a complete, plain-language account of the
model it built: why it chose the architecture it did, what the
observation vector contains, what actions the agent can take, and how
the exploration and learning schedules are set up. Here is an example
from training a snake agent:

.. code-block:: text

   [INIT] === Network Architecture ===
   [INIT] Board: 32×16, character set: 6 chars (one-hot per cell)
   [INIT] Observed state keys: 0  |  Actions (incl. no-op): 5
   [INIT] spatial=True → using CNN architecture
   [INIT] Rationale: the board is a 2-D spatial scene; a CNN captures
   [INIT]   local patterns (walls, items nearby) more efficiently than an MLP.
   [INIT] CNN: Conv2d(6→32, k=3, pad=1) → ReLU → Conv2d(32→64, k=3, pad=1) → ReLU
   [INIT] CNN output: 64 channels × 16×32 = 32768 features (flattened)
   [INIT] MLP head input: 32768 (conv) + 0 (state) = 32768
   [INIT] MLP: 32768 → 128 → 128 → 5
   [INIT] Hidden layers: 2  |  Layer width: 128
   [INIT] Output: 5 Q-values
   [INIT] Actions: ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN'] + (no-op)
   ...
   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6065  avg_loss=0.001204
   [EP 0500] total_reward=9.0  steps=1203  epsilon=0.0821  avg_loss=0.000387

The episode log shows total reward (score earned), how many turns the
episode lasted, the current exploration rate (``epsilon``), and the
average prediction error (``avg_loss``). Reading this log—and
connecting changes in these numbers to what you know about the game and
the algorithm—is one of the main activities the tool is designed to
support.
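
The numbers in this example can be checked by hand, which is a useful
habit when reading a log. The sketch below recomputes the flattened
feature count from the ``[INIT]`` lines and reproduces the three logged
``epsilon`` values. The decay formula ``exp(-episode / 200)`` is an
assumption inferred from those three values; the schedule
``retro-gamer`` actually uses is not stated here and may be defined
differently. The function names are likewise hypothetical.

.. code-block:: python

   import math


   def conv_feature_count(width, height, channels):
       """Features left after flattening a conv output of the given shape.

       With kernel size 3 and padding 1 (as in the log), and assuming
       stride 1, each convolution keeps the board's width and height,
       so only the channel count changes.
       """
       return channels * height * width


   def epsilon_at(episode, time_constant=200):
       """Assumed schedule: exponential decay fitted to the logged values."""
       return math.exp(-episode / time_constant)


   print(conv_feature_count(width=32, height=16, channels=64))
   for episode in (1, 100, 500):
       print(f"episode {episode}: epsilon={epsilon_at(episode):.4f}")

If the printed values match the log, the assumed formula is consistent
with it. Under that assumption, the agent in this run is still choosing
random actions more than half the time at episode 100 and about 8% of
the time by episode 500.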