Initial commit

2026-05-08 14:07:17 -04:00
commit 5ca97dc5d0
36 changed files with 4147 additions and 0 deletions
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -0,0 +1,12 @@
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/background.rst
+++ b/docs/background.rst
@@ -0,0 +1,442 @@
+Background
+==========
+
+Pedagogical framework
+---------------------
+
+Making With Code and the games unit
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``retro-gamer`` is developed for use in
+`Making With Code <https://makingwithcode.org>`__ (MWC), a high school
+computer science curriculum designed around the constructionist
+principle that students learn most durably by building things they care
+about. In MWC's games unit, students design and implement their own
+games using the ``retro-games`` framework: a Python library for
+building terminal-based, character-grid games in the style of early
+arcade software. Students start from concept, work through design,
+implement agents and game logic in Python, and end with a complete,
+playable game.
+
+The games unit gives students deep familiarity with one particular
+game and its code. They know which characters appear on the board,
+what the state dictionary contains, how reward accumulates, and what
+strategies tend to work. This knowledge is ordinarily tacit—embedded
+in how they play—but it is exactly the kind of knowledge that
+``retro-gamer`` asks students to make explicit. Writing a
+``[tool.retro-gamer]`` section that accurately describes your game
+to a learning algorithm is a form of structured reflection: you have
+to articulate, in precise terms, what you know.
+
+Objects to think with
+~~~~~~~~~~~~~~~~~~~~~
+
+The educational psychologist and mathematician Seymour Papert
+introduced the concept of *objects to think with*: concrete artifacts
+that serve as anchors for otherwise abstract ideas (Papert 1980). A
+gear, for Papert, was an object to think with about mathematics. The
+turtle in Logo was an object to think with about procedural thinking.
+In each case, the learner's embodied, intuitive knowledge of the
+object—how gears mesh, how the turtle moves—provides traction on
+abstract relationships that might otherwise remain inaccessible.
+
+A game that a student has built and played is a particularly rich
+object to think with. The student knows the game's behavior
+intimately: they have watched characters interact, experienced the
+score signal as meaningful, and developed intuitions about what makes
+a good move. These intuitions are not merely useful—they are
+*translatable* into the language of reinforcement learning. The reward
+signal the student experiences as a player is the same signal the
+trainer uses to evaluate actions. The patterns the student recognizes
+as meaningful on the board are precisely the patterns a convolutional
+neural network is designed to detect. The exploration-exploitation
+tradeoff the trainer navigates—trying new things versus sticking with
+what has worked—is analogous to the choices a student makes when
+learning a new game.
+
+``retro-gamer`` is designed to make these translations visible. When
+the student reads the training log and sees that the trainer chose a
+CNN because the game is spatial, they can connect that decision to
+their own knowledge of how the board works. When they see the reward
+increasing episode by episode, they can reason about *why*—what the
+agent is learning to do—rather than watching an opaque number change.
+
+Metadata as structured reflection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A student who has built a game knows things about it that its code does
+not make explicit. They know which characters matter—which ones indicate
+danger, opportunity, or neutral terrain. They know what game state
+changes signal success. They know whether the arrangement of pieces on
+the board is meaningful or incidental. This knowledge is usually tacit:
+embedded in how they play, not in anything they have written down.
+
+``retro-gamer`` asks students to make this tacit knowledge explicit by
+writing a ``[tool.retro-gamer]`` section in their game's
+``pyproject.toml``. The choice of location is deliberate: placing game
+metadata in the game's own project file frames it as *a property of the
+game*, not as a configuration setting for the training tool. The student
+is not giving hints to the trainer; they are accurately describing what
+they built.
+
+This framing matters for how students reason about the relationship
+between description and performance. A student who omits a character
+from the character set and then notices degraded training performance is
+not observing a failure of their trainer configuration—they are
+observing the consequence of having described the game inaccurately.
+The fix is not to adjust a hyperparameter; it is to write a more
+accurate description. The question "is my description of the game
+correct?" is precisely the kind of structured reflection that produces
+conceptual understanding, because it requires the student to connect
+what they know about the game to the representations the learning
+algorithm uses.
+
+Knowledge building and discussion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Making a game does not, by itself, guarantee conceptual understanding
+of reinforcement learning. Students may engage deeply with the
+implementation details of their game while remaining unable to
+articulate the big ideas that ``retro-gamer`` is meant to make
+salient. Research in the knowledge-building tradition (Scardamalia and
+Bereiter 2006) suggests that conceptual understanding deepens
+substantially when students discuss their ideas with others—explaining,
+questioning, and revising their understanding in dialogue.
+
+``retro-gamer`` is designed to generate the kind of specific,
+grounded questions that productive discussion requires. "What happens
+if I leave a character out of the character set?" is not an abstract
+question; it is a question about a specific game the student knows
+well, and it has a specific, reasoned answer. "Why does training
+improve faster with prioritized experience replay?" connects a
+hyperparameter setting to a mechanism. These are better starting
+points for discussion than the generic questions that arise from
+reading about reinforcement learning without a concrete artifact to
+refer to.
+
+Research design
+~~~~~~~~~~~~~~~
+
+The pedagogical hypothesis underlying ``retro-gamer`` is being
+evaluated in a research study conducted in the context of MWC's games
+unit. The study investigates how two interventions—using
+``retro-gamer`` to train an agent, and discussing reinforcement
+learning with a large language model—interact to support conceptual
+understanding of reinforcement learning.
+
+The key outcome is measured by a set of scenario-based conceptual
+questions. Representative examples include:
+
+- *Imagine you were training an agent to play a game with a specified
+  character set. If you forgot to include one of the characters which
+  is used in the game, how would it affect the trained agent's
+  performance? Explain your reasoning.*
+- *Imagine you are training an agent to play a game which has a
+  specified character set. You realize that only half of the specified
+  characters are actually used in the game. If you change the
+  character set to include only the characters that actually appear,
+  how would the training process change? Explain your reasoning.*
+- *Imagine you are creating a game where the goal is to win, and
+  partial success has no value—for example, a game where the goal is
+  to escape a maze. What would be the effect on agent training of
+  adding artificial rewards for completing sub-goals such as reaching
+  a milestone halfway to the exit? Explain your reasoning.*
+
+Each question is evaluated using a rubric that rewards conceptual
+understanding, even where specific misconceptions remain.
+
+Participants all receive a traditional classroom lesson on
+reinforcement learning before the study begins, ensuring that the same
+conceptual vocabulary is available to everyone. They then complete a
+pretest of the conceptual questions. Participants are randomly assigned
+to one of four conditions in a 2×2 design: the first factor is whether
+they use ``retro-gamer`` to train an agent on their game; the second
+is whether they discuss reinforcement learning with a large language
+model. One week later, participants complete the posttest. We
+hypothesize that the combination of ``retro-gamer`` and LLM discussion
+will produce the largest gains, mediated by more specific and more
+numerous questions to the LLM—a sign that students are reasoning more
+deeply about the underlying concepts.
+
+Technical background
+--------------------
+
+This section provides a conceptual introduction to the ideas underlying
+``retro-gamer``. It is intended to be accessible to students who have
+not studied machine learning before, while also connecting each concept
+to the specific choices you make when using the tool.
+
+Reinforcement learning
+~~~~~~~~~~~~~~~~~~~~~~
+
+*Reinforcement learning* (RL) is a framework for training an *agent*
+to make good decisions by interacting with an *environment*.
+
+At every moment, the environment is in some *state*, and the agent
+observes something about that state. The agent chooses an *action*,
+the environment transitions to a new state in response, and the agent
+receives a *reward* signal—a number that indicates how well it is
+doing. The agent's goal is to learn a *policy*: a rule for choosing
+actions that maximizes the total reward it accumulates over time. In
+``retro-gamer``, the game is the environment, the character grid and
+state dictionary are what the agent observes, pressing a key is an
+action, and the change in score is the reward.
+
+A distinctive feature of reinforcement learning—distinguishing it from
+supervised learning, where a model is trained on labeled examples—is
+that the agent must discover what good behavior looks like through
+experience. There is no teacher providing correct answers. The reward
+signal is all the agent has to go on. This makes reinforcement
+learning both powerful (it can find solutions no human designer would
+think to specify) and tricky (poorly chosen reward signals can produce
+strange or unintended behavior).
+
+The total reward the agent receives from a given state onward—if it
+acts according to its current policy—is called the *return*. Because
+rewards in the far future are harder to predict and plan for, RL
+algorithms typically *discount* future rewards: a reward received
+``t`` turns from now is worth only ``γ^t`` times its face value, where
+``γ`` (gamma) is a number slightly less than 1. The ``gamma``
+hyperparameter in ``retro-gamer`` controls this discount. A value
+close to 1 means the agent values the distant future almost as much
+as the immediate present; a smaller value makes the agent more
+myopic.
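+
+As a worked illustration (writing ``r_k`` for the reward received
+``k`` turns from now, a notation introduced here, and using γ = 0.9
+only for the arithmetic, not as the tool's default), the discounted
+return is:
+
+.. math::
+
+   G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{k \ge 0} \gamma^k r_k
+
+With γ = 0.9, a reward of 10 received three turns from now contributes
+0.9³ × 10 = 7.29 to the return, while the same reward received
+immediately contributes the full 10.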
+
+Q-learning
+~~~~~~~~~~~
+
+A natural way to formalize the agent's goal is to define the *Q-function*
+(or *Q-value*): Q(s, a) is the expected total discounted reward the
+agent will receive if it is in state ``s``, takes action ``a``, and
+then follows its current policy from that point on. If the agent knew
+the Q-function of the best possible policy, it could act optimally
+simply by choosing the action with the highest Q-value in each state.
+
+Q-learning is an algorithm for learning the Q-function by experience.
+Starting from an arbitrary initial estimate, the agent uses the
+*Bellman equation* to update its Q-estimates after each transition.
+The key insight is that the Q-value of taking action ``a`` in state
+``s`` is related to the immediate reward and the best Q-value
+achievable from the next state:
+
+.. math::
+
+   Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')
+
+After each turn, the agent computes the *temporal difference* (TD)
+error (the gap between its current Q-estimate and what the Bellman
+equation says it should be) and adjusts its estimates to reduce that
+error. Over many iterations, the Q-estimates tend to converge toward
+their true values.
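+
+Written out, one common form of this adjustment is the following. The
+symbols δ (the TD error) and α (a step size that plays the role of a
+learning rate) are introduced here for illustration:
+
+.. math::
+
+   \delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)
+
+.. math::
+
+   Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta
+
+For example, if Q(s, a) = 2.0, r = 1, γ = 0.9, and the best
+next-state estimate is 3.0, then the Bellman target is
+1 + 0.9 × 3.0 = 3.7 and the TD error is 3.7 − 2.0 = 1.7. With
+α = 0.1 the estimate moves to 2.0 + 0.1 × 1.7 = 2.17. Setting α = 1
+recovers the direct assignment shown in the equation above.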
+
+Deep Q-networks
+~~~~~~~~~~~~~~~
+
+Classical Q-learning stores the Q-function in a table: one entry for
+every possible (state, action) pair. This is feasible only when the
+number of possible states is small. For a game board with even modest
+dimensions—say 32×16 cells, each displaying one of a handful of
+characters—the number of possible board configurations is astronomically
+large. Storing a table of Q-values for every configuration is not
+practical.
+
+*Deep Q-Networks* (DQN), introduced by Mnih et al. (2015), solve this
+problem by approximating the Q-function with a neural network. Instead
+of a table, the network takes the current state as input and outputs
+Q-value estimates for all possible actions simultaneously. The network
+*generalizes*: having learned that moving right is a good idea when
+the apple is to the right and nothing is in the way, it applies that
+knowledge to board configurations it has never seen before.
+
+The training process in ``retro-gamer`` follows the DQN algorithm. At
+each turn, the agent uses its current network to estimate Q-values and
+selects an action. It stores the experience—(state, action, reward,
+next state)—in a *replay buffer*. Periodically, it samples a random
+batch of experiences from the buffer and uses them to compute TD
+errors, then adjusts the network weights to reduce those errors. This
+process continues for many episodes.
+
+Experience replay
+~~~~~~~~~~~~~~~~~
+
+A key ingredient of DQN is *experience replay*. Rather than training
+on experiences as they arrive—which would mean training on correlated,
+sequential transitions—the agent stores experiences in a buffer and
+samples them randomly for training. This has two benefits. First, each
+experience is potentially used many times for training, making data
+use more efficient. Second, random sampling breaks the correlations
+between consecutive transitions, which would otherwise cause the
+network's weight updates to interfere with each other.
+
+``retro-gamer`` offers a standard replay buffer and an optional
+*prioritized* replay buffer (PER). In PER, experiences with larger TD
+errors—cases where the agent's prediction was most wrong—are sampled
+more often. The intuition is that surprising transitions are more
+informative. Prioritized replay often improves training efficiency but
+introduces a bias that must be corrected with *importance sampling
+weights* (Schaul et al. 2015).
+
+The ``memory_capacity`` hyperparameter sets how many experiences the
+buffer can hold. When the buffer is full, old experiences are
+discarded. A larger buffer provides more diverse training data but
+uses more memory.
+
+Target networks
+~~~~~~~~~~~~~~~
+
+A subtle challenge in DQN training is that the Q-values computed by the
+Bellman equation depend on the network's own estimates of the next
+state's Q-values. If the network is updated constantly, its Q-value
+estimates keep shifting, making the training target a moving one. This
+can cause instability.
+
+DQN addresses this with a *target network*: a copy of the main network
+that is updated only every ``target_update_freq`` steps. The Bellman
+target is computed using the target network, while the main network is
+updated by gradient descent. Because the target network changes slowly,
+training targets remain stable long enough for the main network to
+make progress.
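+
+In symbols, writing ``Q_target`` for the target network (a label
+introduced here for illustration), the training target for a sampled
+transition is:
+
+.. math::
+
+   y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')
+
+The main network's weights are then adjusted so that its own estimate
+Q(s, a) moves closer to ``y``. Between copies, ``y`` depends only on
+the frozen target network, so it does not shift with every update.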
+
+Exploration vs. exploitation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A reinforcement learning agent faces a fundamental dilemma: should it
+*exploit* what it already knows (taking the action with the highest
+estimated Q-value) or *explore* (trying actions it is less certain
+about, in case they lead to better outcomes it has not yet discovered)?
+Exploiting too much early in training means the agent never discovers
+better strategies; exploring too much later means the agent wastes time
+on random behavior when it already knows what to do.
+
+``retro-gamer`` uses *ε-greedy exploration*: with probability ε
+(epsilon), the agent chooses a random action; with probability 1 − ε,
+it exploits its current Q-function. ε starts at 1 (pure exploration)
+and decays over training according to ``epsilon_decay``, reaching
+a floor of ``epsilon_min``. Reading the ``epsilon`` column in the
+training log shows how exploration decreases as training progresses.
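+
+To see how such a schedule behaves, assume for illustration that ε is
+multiplied by ``epsilon_decay`` once per episode. This is one common
+choice; the tool's actual schedule and default values may differ.
+Writing ``d`` for ``epsilon_decay``, after ``n`` episodes:
+
+.. math::
+
+   \varepsilon_n = \max(\varepsilon_{\min},\; d^{\,n})
+
+With the hypothetical values ``d = 0.995`` and ``epsilon_min = 0.05``,
+ε is about 0.61 after 100 episodes and reaches the floor after roughly
+600 episodes.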
+
+Representing the game board
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A neural network operates on numbers, not characters. Before the
+game board can be fed to the Q-network, it must be converted to a
+numerical representation. ``retro-gamer`` uses *one-hot encoding*.
+
+For a character set of ``n`` distinct characters, each cell on the
+board is represented by a vector of ``n`` numbers, all zero except for
+the one position corresponding to the character in that cell, which is
+set to 1. For example, with character set ``['@', '*', '>']``, the
+character ``'>'`` is encoded as ``[0, 0, 1]``. An empty cell is
+encoded as ``[0, 0, 0]``.
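+
+To make the encoding concrete, consider a hypothetical one-row board
+three cells wide, using the same character set, whose cells hold
+``'@'``, an empty cell, and ``'*'``:
+
+.. code-block:: text
+
+   cell contents:   '@'        (empty)    '*'
+   one-hot vector:  [1, 0, 0]  [0, 0, 0]  [0, 1, 0]
+
+Assuming the cells are read left to right, the board part of the
+observation for this row is the nine numbers ``1 0 0 0 0 0 0 1 0``.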
+
+The full board representation is a three-dimensional array of shape
+(H, W, C), where H is the board height, W is the board width, and
+C is the number of characters in the character set. The total number
+of numbers in this array—H × W × C—is the size of the board part of
+the observation. For a board 32 cells wide and 16 cells tall with
+6 characters, this is 16 × 32 × 6 = 3,072 numbers.
+
+The ``character_set`` field in the game description determines which
+characters the agent can distinguish. A character not in the set
+appears as an all-zero vector—indistinguishable from an empty cell.
+If the character set is not specified, ``retro-gamer`` runs a brief
+exploration phase before training to observe which characters actually
+appear.
+
+In addition to the board, the agent can observe numerical values from
+the game's state dictionary via ``observe_state``. These are
+appended to the end of the observation vector. The reward key must
+not be included in ``observe_state``: it would give the agent direct
+access to its own performance signal, which is not a realistic observation
+in most game contexts and can cause training pathologies.
+
+Neural network architectures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The architecture of the Q-network—the number and arrangement of its
+layers—is one of the most consequential choices in DQN training.
+``retro-gamer`` selects an architecture based on the ``spatial``
+field in the game description and generates a plain-language rationale.
+
+**Multilayer perceptrons (MLP)**
+
+The simplest neural network architecture for fixed-size input is the
+*multilayer perceptron* (MLP). An MLP is a sequence of *fully
+connected layers*: every unit in one layer is connected to every unit
+in the next. Each connection has a learnable *weight*; a unit computes
+a weighted sum of its inputs, passes it through a nonlinear *activation
+function* (``retro-gamer`` uses the rectified linear unit, or ReLU:
+``max(0, x)``), and sends the result to the next layer. The final
+layer has one unit per action, producing Q-value estimates.
+
+An MLP with two hidden layers of width 128, for an observation of size
+3,072 and 5 possible actions, would have approximately 400,000 trainable
+parameters. Training adjusts all of these parameters simultaneously to
+reduce the TD error.
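+
+The estimate can be checked by counting weights and biases layer by
+layer, assuming every layer has one bias per output unit:
+
+.. math::
+
+   (3072 \times 128 + 128) + (128 \times 128 + 128) + (128 \times 5 + 5)
+   = 393{,}344 + 16{,}512 + 645 = 410{,}501
+
+Almost all of the parameters sit in the first layer, because it
+connects every input number to every hidden unit.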
+
+An MLP treats its input as a flat list of numbers. It does not know
+that these numbers were arranged in a 2D grid, or that spatially
+adjacent cells are related. This is appropriate when the game's
+observation is better understood as a collection of independent
+readings—a set of meters or status indicators—rather than as a spatial
+scene. Set ``spatial = false`` in the game description to use this
+architecture.
+
+**Convolutional neural networks (CNN)**
+
+When the game board is genuinely spatial—when the relative positions
+of characters matter—a *convolutional neural network* (CNN) is a much
+better fit. A CNN applies a set of learnable *filters* (small weight
+matrices) across the board, computing a dot product of each filter with
+every overlapping patch of the input. The result is a set of *feature
+maps*: each feature map highlights where in the board a particular
+pattern appears.
+
+This is efficient for two reasons. First, the same filter is applied
+at every board position: a filter that detects "apple to the right of
+snake head" works the same way whether the apple is at position (10,5)
+or (20,12). This weight sharing, called *translation equivariance*, means
+the network can generalize across positions without a separate rule for each
+one. Second, each filter needs only a small number of parameters (the
+filter size)—far fewer than the equivalent fully connected connections.
+
+``retro-gamer`` uses two convolutional layers (with 32 and 64 output
+channels respectively, kernel size 3, padding 1) followed by a
+flattening step and an MLP head. The padding ensures that the spatial
+dimensions are preserved through the convolution, so the output of the
+second conv layer has shape (64, H, W), which is then flattened and
+passed to the MLP. Set ``spatial = true`` (the default) to use this
+architecture.
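+
+As a rough check on these shapes, assume stride 1, one bias per output
+channel, and a 6-character board 32 cells wide and 16 cells tall. Each
+convolution then leaves the board size unchanged:
+
+.. math::
+
+   H_{\text{out}} = H + 2 \times 1 - 3 + 1 = H
+
+and the two convolutional layers together have few parameters:
+
+.. math::
+
+   (3 \times 3 \times 6 \times 32 + 32) + (3 \times 3 \times 32 \times 64 + 64)
+   = 1{,}760 + 18{,}496 = 20{,}256
+
+The flattened output has 64 × 16 × 32 = 32,768 features, so under
+these assumptions most of the network's parameters are in the first
+layer of the MLP head, whatever its width, not in the filters.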
+
+Connecting architecture to game metadata
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The architectural choices ``retro-gamer`` makes are not arbitrary: they
+follow from the game description you provide. This connection is worth
+making explicit, because understanding it is one of the main paths into
+understanding why neural network architecture matters.
+
+- If ``spatial = true``, the CNN can detect local patterns—which characters
+  are adjacent to which—without needing to see every possible arrangement.
+  This is appropriate for games like Snake, where the snake's direction
+  and the apple's relative position are spatially encoded.
+
+- If ``spatial = false``, the MLP treats the board as a flat vector. This
+  may be appropriate for games that use the character grid primarily as a
+  display rather than a spatial field—for example, a game where characters
+  appear in fixed, non-interacting positions as status indicators.
+
+- The ``character_set`` determines the depth (C) of the board tensor.
+  More characters mean more numbers per cell and a larger input to the
+  network. A character set that includes characters the game never uses
+  wastes capacity; a character set that omits relevant characters forces
+  the agent to treat different things as the same.
+
+- The ``observe_state`` fields are appended to the flattened CNN output
+  before the MLP head. This allows the agent to use explicit state
+  variables—a timer, a lives count—alongside the visual board
+  representation.
+
+These relationships are not incidental features of the implementation.
+They are the reason the game description matters: every field you fill
+in shapes what the agent can perceive and therefore what it can learn.
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -0,0 +1,13 @@
+project = 'retro-gamer'
+copyright = '2025, Chris Proctor'
+author = 'Chris Proctor'
+release = '0.1.0'
+
+extensions = []
+
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+html_theme = 'sphinx_rtd_theme'
+html_static_path = ['_static']
+html_theme_options = {}
--- a/docs/contributing.rst
+++ b/docs/contributing.rst
@@ -0,0 +1,19 @@
+Contributing
+============
+
+``retro-gamer`` is developed as part of the
+`Making With Code <https://makingwithcode.org>`__ project. Chris
+Proctor (chrisp@buffalo.edu), the project lead, is interested in
+hearing about your experience using the package, whether in a classroom,
+as a research tool, or for personal exploration.
+
+Bug reports, feature requests, and discussion of future directions take
+place on the project repository's
+`issues page <https://github.com/cproctor/retro-gamer/issues>`__. Code
+contributions should be submitted as pull requests. Development follows
+the `Contributor Covenant <https://www.contributor-covenant.org/>`__.
+
+If you are a teacher or curriculum designer considering using
+``retro-gamer`` in a course, or a researcher interested in collaborating
+on studies of its educational effectiveness, please contact Chris
+directly.
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -0,0 +1,69 @@
+retro-gamer: train agents to play retro games
+==============================================
+
+``retro-gamer`` is a Python package for training reinforcement learning
+agents to play games implemented with the
+`retro-games <https://retro-games.readthedocs.io/en/latest/>`__
+framework. It is designed as a learning tool: rather than writing the
+learning algorithm yourself, you describe the game to the trainer in a
+structured way, adjust the training parameters, and then observe—through
+a detailed log—how the trainer uses your description to build and run a
+learning model.
+
+The central idea is that the game becomes an *object to think with*
+about reinforcement learning. The choices you make—which characters to
+tell the trainer about, what counts as a reward, whether to treat the
+board as a spatial scene or a readout—have direct, observable
+consequences for how learning proceeds. Working out *why* a training run
+behaves as it does is the kind of reasoning that leads to lasting
+understanding of the underlying concepts.
+
+.. _installation:
+
+Installation
+------------
+
+Prerequisites
+~~~~~~~~~~~~~
+
+``retro-gamer`` requires Python 3.11 or higher and a game implemented
+with `retro-games <https://retro-games.readthedocs.io/en/latest/>`__.
+The retro-games framework must also be installed; see its documentation
+for instructions.
+
+.. code-block:: console
+
+   % pip install retro-gamer
+
+To install from source (for development or to use the latest changes):
+
+.. code-block:: console
+
+   % git clone https://github.com/cproctor/retro-gamer
+   % cd retro-gamer
+   % pip install -e .
+
+Verify the installation by checking the command-line tool:
+
+.. code-block:: console
+
+   % retro-gamer --help
+   Usage: retro-gamer [OPTIONS] COMMAND [ARGS]...
+
+     Train and run RL agents for retro games.
+
+   Commands:
+     create  Create a new training run directory with config.toml.
+     info    Print a summary of a training run.
+     play    Watch a trained agent play the game.
+     train   Train (or resume training) a DQN agent.
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Contents:
+
+   introduction
+   background
+   walkthrough
+   reference
+   contributing
--- a/docs/introduction.rst
+++ b/docs/introduction.rst
@@ -0,0 +1,160 @@
+Introduction
+============
+
+``retro-gamer`` grew out of a question about how students learn
+difficult ideas in computer science. Reinforcement learning—the branch
+of machine learning in which an agent learns to act well by interacting
+with an environment and receiving rewards—is one of the most powerful
+and widely deployed ideas in modern computing. It underlies systems that
+play chess and Go at superhuman levels, control industrial robots,
+optimize power grids, and personalize recommendation feeds. It is also
+genuinely hard to understand, not because the core ideas are especially
+abstract, but because the feedback between a student's understanding and
+the system's behavior is usually invisible. You adjust a hyperparameter,
+run a training loop, and get a number. What happened inside, and why,
+remains opaque.
+
+The design hypothesis of ``retro-gamer`` is that this opacity is not
+inevitable. If a student already knows a game well—how it works, what
+the pieces mean, what counts as doing well—then training an agent on
+that game gives them a concrete anchor for reasoning about what the
+learning algorithm is doing and why. When the trainer decides to use a
+convolutional neural network instead of a simpler model, it explains its
+reasoning. When training stalls, the student can ask: did I describe the
+game accurately? Is the reward signal rewarding the right behavior? Would a
+different exploration strategy help? These are exactly the questions that
+build genuine conceptual understanding.
+
+``retro-gamer`` is developed as part of the
+`Making With Code <https://makingwithcode.org>`__ curriculum, a
+project-based high school computer science curriculum emphasizing
+personally meaningful creation and deep conceptual engagement. In the
+games unit, students design and implement their own games using the
+``retro-games`` framework. The extension into reinforcement learning is
+a natural next step: you built the game; now let's see if a machine can
+learn to play it.
+
+How retro-gamer works
+---------------------
+
+Rather than asking you to write a training algorithm yourself,
+``retro-gamer`` asks you to describe the game you want to train on.
+This description—written in your game project's ``pyproject.toml``—tells
+the trainer things the game's code alone doesn't make obvious: which
+characters matter, which piece of game state represents success, whether
+the board should be understood spatially or as a flat data display.
+
+From this description, the trainer constructs a deep Q-learning model
+suited to the game. It writes out a plain-language explanation of every
+architectural decision it makes, then begins training. As training
+proceeds, it logs each episode's reward, loss, and exploration rate.
+Trained model snapshots—checkpoints—are saved periodically, so you can
+watch how the agent's skill develops over time. When you're done
+training, you can load any checkpoint and watch the agent play.
+
+A typical workflow looks like this. First, describe your game in the
+``[tool.retro-gamer]`` section of your game project's ``pyproject.toml``:
+
+.. code-block:: toml
+
+   [tool.retro-gamer]
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+   reward = "score"
+   character_set = ["@", "*", ">", "<", "^", "v"]
+
+Then create a training run, train, and watch the result:
+
+.. code-block:: console
+
+   % retro-gamer create --game my_game --output runs/snake/
+
+   % retro-gamer train runs/snake/
+
+   % retro-gamer play runs/snake/ --checkpoint ep_0500
+
+The ``create`` command sets up the training run directory; ``train``
+runs the learning algorithm; ``play`` loads a checkpoint and lets you
+watch the trained agent live in the terminal.
+
+What you will learn
+-------------------
+
+Working with ``retro-gamer`` is designed to build understanding of a
+cluster of related ideas:
+
+**Reinforcement learning** is the framework in which an agent
+interacts with an environment, receiving observations and rewards, and
+learns to choose actions that maximize its long-term reward. The
+``retro-gamer`` training loop is a concrete instance of this framework:
+the agent is the neural network, the environment is the game, the
+observation is the encoded board and game state, and the reward is
+the change in score from one turn to the next.
+
+**Neural network architecture** shapes what a model can and cannot
+learn. When you declare a game ``spatial``, the trainer builds a
+convolutional neural network that can detect patterns in the relative
+positions of game pieces. When you declare it non-spatial, it builds a
+simpler network that treats the board as a flat list of numbers, with
+no built-in notion of which cells are neighbors. Seeing the consequence
+of this choice in training behavior shows directly why architecture matters.
+
+**Observation design** determines what information is available to the
+agent. If you leave a character out of the ``character_set``, the agent
+will not distinguish it from empty space. If you include a game-state
+variable in ``observe_state``, the agent can see it directly rather than
+having to infer it from the board. The consequences of these choices for
+what the agent can learn are reasonably predictable—and making and
+checking those predictions is exactly the kind of reasoning the tool is
+designed to support.
+
+**Reward engineering** is the craft of specifying what counts as doing
+well in a way the agent can actually optimize. Using score as the reward
+is natural for many games, but some games have sparse rewards (the agent
+rarely earns points), and some have reward signals that are easy to
+game. Experimenting with what to use as a reward—and observing how that
+choice shapes training—is one of the richest paths into understanding
+what reinforcement learning is actually optimizing.
+
+**Hyperparameter tuning** is the practice of adjusting training settings
+such as learning rate, exploration probability, and network size to
+improve training efficiency and final performance. ``retro-gamer``
+exposes these settings explicitly and explains their role in the
+training log, so tuning them is connected to conceptual understanding
+rather than uninformed search.
+
+The interpretable training log
+------------------------------
+
+A key feature of ``retro-gamer`` is its training log. When training
+begins, the trainer writes a complete, plain-language account of the
+model it built: why it chose the architecture it did, what the
+observation vector contains, what actions the agent can take, and how
+the exploration and learning schedules are set up. Here is an example
+from training a snake agent:
+
+.. code-block:: text
+
+   [INIT] === Network Architecture ===
+   [INIT] Board: 32×16, character set: 6 chars (one-hot per cell)
+   [INIT] Observed state keys: 0  |  Actions (incl. no-op): 5
+   [INIT] spatial=True → using CNN architecture
+   [INIT] Rationale: the board is a 2-D spatial scene; a CNN captures
+   [INIT]   local patterns (walls, items nearby) more efficiently than an MLP.
+   [INIT] CNN: Conv2d(6→32, k=3, pad=1) → ReLU → Conv2d(32→64, k=3, pad=1) → ReLU
+   [INIT] CNN output: 64 channels × 16×32 = 32768 features (flattened)
+   [INIT] MLP head input: 32768 (conv) + 0 (state) = 32768
+   [INIT] MLP: 32768 → 128 → 128 → 5
+   [INIT] Hidden layers: 2  |  Layer width: 128
+   [INIT] Output: 5 Q-values
+   [INIT] Actions: ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN'] + (no-op)
+   ...
+   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
+   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6058  avg_loss=0.001204
+   [EP 0500] total_reward=9.0  steps=1203  epsilon=0.0816  avg_loss=0.000387
+
+The episode log shows total reward (score earned), how many turns the
+episode lasted, the current exploration rate (``epsilon``), and the
+average prediction error (``avg_loss``). Reading this log—and
+connecting changes in these numbers to what you know about the game and
+the algorithm—is one of the main activities the tool is designed to
+support.
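+
+The layer sizes in the ``[INIT]`` lines can be checked by hand. The
+sketch below reproduces the arithmetic; ``mlp_layer_sizes`` is a
+hypothetical helper written for this illustration, not part of
+``retro-gamer``, and the 64 convolution channels are taken from the
+example log above rather than from any documented setting.

```python
def mlp_layer_sizes(board_w, board_h, conv_channels, n_state_keys,
                    n_layers, layer_size, n_actions):
    """Return the MLP head's layer sizes, input first and output last."""
    # Padded 3x3 convolutions keep the board's height and width, so the
    # flattened CNN output has conv_channels * H * W features.
    conv_features = conv_channels * board_h * board_w
    # Observed state values are appended to the flattened CNN output.
    head_input = conv_features + n_state_keys
    # One Q-value per listed action, plus the implicit no-op.
    n_outputs = n_actions + 1
    return [head_input] + [layer_size] * n_layers + [n_outputs]


# Snake example: 32x16 board, no observed state keys, four arrow keys.
sizes = mlp_layer_sizes(32, 16, 64, 0, n_layers=2, layer_size=128, n_actions=4)
print(" -> ".join(str(s) for s in sizes))  # 32768 -> 128 -> 128 -> 5
```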
--- a/docs/reference.rst
+++ b/docs/reference.rst
@@ -0,0 +1,344 @@
+Reference
+=========
+
+Game description fields
+-----------------------
+
+Game descriptions are written in the ``[tool.retro-gamer]`` section of
+your game project's ``pyproject.toml``. ``retro-gamer create`` reads
+this section and copies the metadata into the training run's
+``config.toml``, where it can also be inspected or hand-edited.
+
+A complete example for the Snake game:
+
+.. code-block:: toml
+
+   [tool.retro-gamer]
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+   reward = "score"
+   character_set = ["@", "*", ">", "<", "^", "v"]
+   spatial = true
+   observe_state = []
+
+You do not need to specify the board size: ``retro-gamer`` reads it
+directly from your game's ``board_size`` attribute.
+
+The fields are described below.
+
+``actions``
+~~~~~~~~~~~
+
+**Required.** A list of keystroke names the agent may send to the game
+each turn. Use arrow key names for directional games, or single
+characters for character-key games.
+
+.. code-block:: toml
+
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+
+The agent also has access to a no-op action (doing nothing). The total
+number of actions in the Q-network output is ``len(actions) + 1``.
+
+``reward``
+~~~~~~~~~~
+
+**Required.** The key in the game's state dictionary to use as the
+reward signal. The reward computed for each turn is the *change* in
+this value from the previous turn.
+
+.. code-block:: toml
+
+   reward = "score"
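+
+As an illustration of the "change in value" rule, the sketch below
+computes per-turn rewards from a sequence of state dictionaries.
+``turn_reward`` is a hypothetical name used only here; the trainer's
+own implementation may differ in detail.

```python
def turn_reward(previous_state, current_state, reward_key="score"):
    """Reward for one turn: how much the reward key changed."""
    return float(current_state[reward_key] - previous_state[reward_key])


# State dictionaries observed on five consecutive turns.
states = [{"score": 0}, {"score": 0}, {"score": 1}, {"score": 1}, {"score": 3}]
rewards = [turn_reward(a, b) for a, b in zip(states, states[1:])]
print(rewards)  # [0.0, 1.0, 0.0, 2.0]
```
+
+Summing the per-turn rewards recovers the total change in the key over
+the episode.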
+
+``character_set``
+~~~~~~~~~~~~~~~~~
+
+**Optional.** A list of single characters that may appear on the board.
+Each character occupies one "slot" in the one-hot encoding. Characters
+not in this list are treated as empty space.
+
+.. code-block:: toml
+
+   character_set = ["@", "*", ">", "<", "^", "v"]
+
+If omitted, ``retro-gamer`` runs an exploration phase to discover the
+characters that appear in practice. The length of this phase is
+controlled by the ``exploration_turns`` hyperparameter.
+
+``spatial``
+~~~~~~~~~~~
+
+**Optional.** Default: ``true``. Whether to treat the board as a 2D
+spatial scene. When ``true``, the trainer uses a convolutional neural
+network (CNN) that can detect patterns in the relative positions of
+characters. When ``false``, the trainer uses a multilayer perceptron
+(MLP) that sees the board as a flat list of numbers without positional
+structure.
+
+.. code-block:: toml
+
+   spatial = true
+
+``observe_state``
+~~~~~~~~~~~~~~~~~
+
+**Optional.** Default: ``[]``. A list of keys from the game's state
+dictionary to append to the observation vector. The values must be
+numbers (integers, floats, or booleans). The reward key must not
+appear in this list.
+
+.. code-block:: toml
+
+   observe_state = ["lives", "level"]
+
+.. _hyperparameters:
+
+Hyperparameters
+---------------
+
+Hyperparameters are stored in the ``[hyperparameters]`` section of
+``config.toml``. They can be set via ``retro-gamer create`` options or
+edited directly.
+
+Learning and optimization
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``learning_rate`` (default: ``0.001``)
+    The step size used by the Adam optimizer when updating network
+    weights. Larger values converge faster but may be unstable; smaller
+    values are more stable but slower.
+
+``lr_decay`` (default: ``0.995``)
+    Multiplicative decay applied to the learning rate after each
+    episode. The learning rate decreases geometrically over training,
+    helping the network fine-tune later without destabilizing early
+    progress.
+
+``gamma`` (default: ``0.99``)
+    The discount factor for future rewards. A value of 1.0 makes the
+    agent value all future rewards equally; smaller values make the
+    agent increasingly myopic.
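+
+To get a feel for ``gamma``, it can help to compute a discounted return
+by hand. The sketch below uses the standard textbook definition (each
+reward is weighted by ``gamma`` raised to the number of turns until it
+arrives); ``discounted_return`` is a hypothetical helper, not a
+``retro-gamer`` function.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma ** (turns until it arrives)."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A single reward of 1 that arrives two turns from now.
print(discounted_return([0, 0, 1], gamma=0.5))  # 0.25
print(discounted_return([0, 0, 1], gamma=1.0))  # 1.0
```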
+
+Exploration
+~~~~~~~~~~~
+
+``epsilon`` (default: ``1.0``)
+    The initial exploration rate. At each turn, the agent takes a
+    random action with probability ``epsilon`` and exploits its current
+    Q-function with probability ``1 - epsilon``.
+
+``epsilon_decay`` (default: ``0.995``)
+    Multiplicative decay applied to ``epsilon`` after each episode.
+
+``epsilon_min`` (default: ``0.05``)
+    The floor below which ``epsilon`` will not fall. A small amount of
+    continued exploration prevents the agent from becoming permanently
+    committed to a suboptimal policy.
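+
+The three exploration settings combine into a simple schedule. The
+sketch below assumes the decay is applied once per episode and clipped
+at ``epsilon_min``, as described above; ``epsilon_after`` and
+``choose_action`` are hypothetical names for illustration only.

```python
import random


def epsilon_after(episodes, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.05):
    """Exploration rate after a number of completed episodes."""
    return max(epsilon_min, epsilon * epsilon_decay ** episodes)


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: return the index of the action to take."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda i: q_values[i])


for n in (1, 100, 1000):
    print(n, round(epsilon_after(n), 4))
```
+
+Under the defaults the floor is reached after roughly 600 episodes, so
+the last few hundred episodes of a 1000-episode run all use
+``epsilon_min``.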
+
+Memory and sampling
+~~~~~~~~~~~~~~~~~~~
+
+``batch_size`` (default: ``64``)
+    The number of experiences sampled from the replay buffer per
+    training step.
+
+``memory_capacity`` (default: ``10000``)
+    The maximum number of experiences the replay buffer can hold. When
+    full, the oldest experiences are discarded.
+
+``prioritize_experiences`` (default: ``false``)
+    Whether to use prioritized experience replay. When ``true``,
+    experiences with larger TD errors are sampled more frequently.
+    This often improves sample efficiency at a modest computational
+    cost.
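+
+A minimal sketch of a replay buffer with uniform sampling shows how
+``memory_capacity`` and ``batch_size`` interact. The ``ReplayBuffer``
+class here is written for this illustration and is not the class
+``retro-gamer`` uses; prioritized sampling is deliberately left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of experiences with uniform random sampling."""

    def __init__(self, capacity=10000):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def add(self, experience):
        self._items.append(experience)

    def sample(self, batch_size=64):
        # Uniform sampling without replacement from the stored experiences.
        return random.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=10)
for turn in range(15):
    buffer.add(turn)
print(len(buffer))  # 10
```
+
+After fifteen additions to a buffer of capacity ten, only the ten most
+recent experiences remain available for sampling.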
+
+Network architecture
+~~~~~~~~~~~~~~~~~~~~
+
+``n_layers`` (default: ``2``)
+    The number of hidden layers in the MLP head (for spatial games,
+    this follows the CNN; for non-spatial games, it is the full
+    network).
+
+``layer_size`` (default: ``128``)
+    The width (number of units) in each hidden layer.
+
+Training duration
+~~~~~~~~~~~~~~~~~
+
+``training_episodes`` (default: ``1000``)
+    The total number of game episodes to run. Each episode runs until
+    the game ends or ``max_turns_per_episode`` turns have elapsed.
+
+``max_turns_per_episode`` (default: ``2000``)
+    A safety cutoff preventing a single episode from running
+    indefinitely (for example, if the agent finds a way to avoid
+    dying).
+
+``target_update_freq`` (default: ``100``)
+    How many training steps between updates of the target network.
+    More frequent updates make training targets move faster (less
+    stable); less frequent updates make them more stable but slower
+    to reflect new learning.
+
+Character discovery
+~~~~~~~~~~~~~~~~~~~
+
+``exploration_turns`` (default: ``200``)
+    When ``character_set`` is not specified, the number of random
+    turns to run at the start of training to discover which
+    characters appear on the board.
+
+``unknown_character_strategy`` (default: ``"ignore"``)
+    What to do when a character appears during training that is not
+    in the established ``character_set``. ``"ignore"`` treats it as
+    an empty cell; ``"extend"`` rebuilds the model with an extended
+    character set.
+
+CLI reference
+-------------
+
+``retro-gamer create``
+~~~~~~~~~~~~~~~~~~~~~~
+
+Create a new training run directory with ``config.toml``. Game metadata
+is read automatically from the ``[tool.retro-gamer]`` section of your
+game's ``pyproject.toml``; you do not pass it on the command line.
+
+.. code-block:: console
+
+   % retro-gamer create --game MODULE --output DIR [OPTIONS]
+
+**Required options:**
+
+- ``--game MODULE`` — Python module containing ``create_game()``
+  (e.g. ``retro.examples.snake``). The ``[tool.retro-gamer]`` section
+  is read from the ``pyproject.toml`` found in or above the module's
+  source directory.
+- ``--output DIR`` — Directory to create for this training run.
+
+**Hyperparameter options** (all optional; see :ref:`hyperparameters`):
+
+- ``--training-episodes N``
+- ``--n-layers N``
+- ``--layer-size N``
+- ``--learning-rate F``
+- ``--lr-decay F``
+- ``--gamma F``
+- ``--epsilon-decay F``
+- ``--epsilon-min F``
+- ``--batch-size N``
+- ``--memory-capacity N``
+- ``--target-update-freq N``
+- ``--max-turns-per-episode N``
+- ``--exploration-turns N``
+- ``--prioritize-experiences`` / ``--no-prioritize-experiences``
+
+``retro-gamer train``
+~~~~~~~~~~~~~~~~~~~~~
+
+Train (or resume training) a DQN agent.
+
+.. code-block:: console
+
+   % retro-gamer train RUN_DIR [--resume CHECKPOINT]
+
+``RUN_DIR`` must contain a ``config.toml`` generated by ``retro-gamer
+create``. If ``--resume`` is given, training resumes from the specified
+checkpoint file (relative or absolute path).
+
+``retro-gamer play``
+~~~~~~~~~~~~~~~~~~~~
+
+Watch a trained agent play the game in the terminal.
+
+.. code-block:: console
+
+   % retro-gamer play RUN_DIR [--checkpoint NAME] [--framerate N]
+
+``--checkpoint`` defaults to ``final``. You can specify a checkpoint by
+name (e.g. ``ep_0100``) or by path relative to ``RUN_DIR/checkpoints/``.
+``--framerate`` sets the target frames per second (default: 12). Press
+Enter or Escape to quit.
+
+``retro-gamer info``
+~~~~~~~~~~~~~~~~~~~~~
+
+Print a summary of a training run: metadata, hyperparameters, recent
+episode log, and available checkpoints.
+
+.. code-block:: console
+
+   % retro-gamer info RUN_DIR
+
+Training run directory structure
+---------------------------------
+
+A training run is a self-contained directory with the following
+contents:
+
+.. code-block:: text
+
+   runs/snake/
+   ├── config.toml       # game description + hyperparameters
+   ├── training.log      # architecture rationale + per-episode log
+   └── checkpoints/
+       ├── ep_0100.pt    # model weights at episode 100
+       ├── ep_0200.pt
+       ├── ...
+       └── final.pt      # model weights at training completion
+
+``config.toml`` is written by ``retro-gamer create`` and updated (with
+the discovered character set and resolved hyperparameters) when
+``retro-gamer train`` begins. Editing ``config.toml`` between ``create``
+and ``train`` is the recommended way to adjust hyperparameters.
+
+``training.log`` begins with the full architecture description
+generated at training startup, followed by one line per episode in the
+format::
+
+   [EP NNNN] total_reward=F  steps=N  epsilon=F  avg_loss=F
+
+Checkpoint files are PyTorch state dictionaries containing model
+weights, optimizer state, the current epsilon, and the total number of
+training steps completed. They can be loaded with
+``retro-gamer play`` or directly with the Python API.
+
+Python API
+----------
+
+For advanced use, ``retro-gamer``'s components are importable as a
+library.
+
+.. code-block:: python
+
+   from retro_gamer import GameMetadata, GameEnvironment, DQNTrainer
+   from retro.examples.snake import create_game
+
+   # Read metadata from [tool.retro-gamer] in the game's pyproject.toml
+   metadata = GameMetadata.from_pyproject("retro.examples.snake")
+
+   trainer = DQNTrainer(
+       create_game, metadata, "runs/snake/",
+       training_episodes=500,
+       n_layers=2,
+       layer_size=128,
+   )
+   trainer.train()
+
+``GameEnvironment`` provides a gym-style interface for stepping through
+a game programmatically:
+
+.. code-block:: python
+
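+   from retro_gamer import GameEnvironment, GameMetadata
+   from retro.examples.snake import create_game
+
+   # Same game factory and metadata as in the trainer example above
+   metadata = GameMetadata.from_pyproject("retro.examples.snake")
+
+   env = GameEnvironment(create_game, metadata)
+   obs = env.reset()             # returns initial observation vector
+   obs, reward, done = env.step("KEY_RIGHT")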
+
+The observation is a flat NumPy array of dtype ``float32``. For spatial
+games, the first ``C × H × W`` elements are the board (channel-first
+one-hot encoding); for non-spatial games, the board is encoded
+``H × W × C`` and then flattened. Any ``observe_state`` values are
+appended at the end.
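+
+The two board layouts differ only in the order in which the same
+one-hot values are written out. The sketch below uses plain Python
+lists rather than NumPy so that it runs without dependencies;
+``encode_board`` is a hypothetical helper, and it assumes that a
+character outside the set is encoded as an all-zero cell.

```python
def encode_board(rows, character_set, spatial=True):
    """One-hot encode a board given as a list of equal-length strings."""
    height, width = len(rows), len(rows[0])
    index = {ch: i for i, ch in enumerate(character_set)}
    n_chars = len(character_set)
    flat = [0.0] * (n_chars * height * width)
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            c = index.get(ch)
            if c is None:
                continue  # assumption: unknown characters stay all-zero
            if spatial:
                pos = (c * height + y) * width + x   # C x H x W order
            else:
                pos = (y * width + x) * n_chars + c  # H x W x C order
            flat[pos] = 1.0
    return flat


board = [" @>"]  # one row, three cells: empty, apple, head
print(encode_board(board, ["@", ">"], spatial=True))   # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(encode_board(board, ["@", ">"], spatial=False))  # [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```
+
+In the channel-first layout all cells for ``@`` come before all cells
+for ``>``; in the flattened ``H × W × C`` layout each cell's one-hot
+vector is written out in turn.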
--- a/docs/walkthrough.rst
+++ b/docs/walkthrough.rst
@@ -0,0 +1,299 @@
+Walkthrough
+===========
+
+This section walks through a complete ``retro-gamer`` workflow, from
+preparing a game to watching a trained agent play. The game used here
+is the Snake example included with the ``retro-games`` framework, but
+the same steps apply to any game you build.
+
+Prerequisites
+-------------
+
+You will need:
+
+- Python 3.11 or higher.
+- The ``retro-games`` framework installed and a game you have written
+  (or the built-in Snake example). See the
+  `retro-games documentation <https://retro-games.readthedocs.io/en/latest/>`__
+  for help writing games.
+- ``retro-gamer`` installed (see :ref:`installation`).
+
+Preparing your game
+-------------------
+
+``retro-gamer`` loads your game by importing a Python module and
+calling a function named ``create_game``. The ``create_game`` function
+must take no arguments and return a new ``Game`` instance.
+
+Here is the ``create_game`` function for Snake:
+
+.. code-block:: python
+
+   def create_game():
+       head = SnakeHead()
+       apple = Apple()
+       game = Game([head, apple], {'score': 0}, board_size=(32, 16), framerate=12)
+       apple.relocate(game)
+       return game
+
+If your game module does not already have a ``create_game`` function,
+add one following this pattern.
+
+
+Describing your game
+--------------------
+
+Every training run begins with a description of your game. This
+description belongs in the ``[tool.retro-gamer]`` section of your game
+project's ``pyproject.toml``—the same file that defines the project's
+name, version, and dependencies. Placing it there keeps the description
+with the game itself, where it belongs.
+
+Here is the ``[tool.retro-gamer]`` section for the Snake example:
+
+.. code-block:: toml
+
+   [tool.retro-gamer]
+   actions = ["KEY_RIGHT", "KEY_UP", "KEY_LEFT", "KEY_DOWN"]
+   reward = "score"
+   character_set = ["@", "*", ">", "<", "^", "v"]
+   spatial = true
+   observe_state = []
+
+Let's go through each field.
+
+``actions``
+~~~~~~~~~~~
+
+A list of the keystrokes the agent may send to the game. For Snake,
+the four arrow keys control the direction of travel. The agent also
+implicitly has access to a no-op (doing nothing).
+
+.. note::
+
+   Only include actions that the game actually responds to. Listing
+   keys the game ignores wastes part of the agent's action space and
+   may slow training.
+
+``reward``
+~~~~~~~~~~
+
+The key in the game's state dictionary to use as the reward signal.
+``retro-gamer`` computes the reward for each turn as the *change* in
+this value from one turn to the next. For Snake, score increases by 1
+(or more) each time the apple is eaten, so the agent receives a positive
+reward on the turn it eats an apple and 0 on every other turn.
+
+Choosing an appropriate reward is one of the most consequential
+decisions in RL. Some considerations:
+
+- A reward that is too sparse—where the agent goes many turns without
+  receiving any signal—makes learning slow. A snake that dies without
+  ever eating an apple receives no positive reward at all in the first
+  episodes, giving the learning algorithm almost nothing to work with.
+- A reward that is too dense (assigned every turn) may not reflect the
+  true goal of the game: a point for every turn survived, for example,
+  rewards stalling rather than scoring.
+- An artificial reward, such as giving a point for moving toward the
+  apple, can accelerate early training but may cause the agent to
+  optimize the proxy rather than the real objective.
+
+``character_set``
+~~~~~~~~~~~~~~~~~
+
+The characters that can appear on the board, as a list of
+single-character strings. Each cell of the board will be *one-hot
+encoded* using this list: the agent represents the content of each cell
+as a vector of zeros with a single 1 at the position corresponding to
+the character. A cell containing a character not in this list is treated
+as empty.
+
+For Snake, the characters are: ``@`` (the apple), ``*`` (body
+segments), ``>`` ``<`` ``^`` ``v`` (the snake head in each direction).
+
+If you omit this field, ``retro-gamer`` will run a brief exploration
+phase before training to discover which characters actually appear.
+The number of exploration turns is controlled by the
+``exploration_turns`` hyperparameter.
+
+``spatial``
+~~~~~~~~~~~
+
+Whether to treat the board as a spatial scene (default: ``true``). For
+a spatial game, the trainer builds a *convolutional neural network*
+(CNN) that can detect patterns in the relative arrangement of
+characters. For a non-spatial game, it builds a simpler *multilayer
+perceptron* (MLP) that sees the board as a flat list of values, with no
+built-in spatial structure. Set ``spatial`` to ``false`` for games where
+position is irrelevant.
+
+Once you have written this section, create the training run directory:
+
+.. code-block:: console
+
+   % retro-gamer create                    \
+       --game retro.examples.snake         \
+       --output runs/snake/
+
+   Created training run at runs/snake/config.toml
+     game        : retro.examples.snake
+     board_size  : 32×16
+     actions     : ['KEY_RIGHT', 'KEY_UP', 'KEY_LEFT', 'KEY_DOWN']
+     reward      : score
+     characters  : ['@', '*', '>', '<', '^', 'v']
+     architecture: CNN (spatial)
+
+``retro-gamer create`` reads your game metadata directly from
+``pyproject.toml`` and writes it—along with all hyperparameters—to
+``runs/snake/config.toml``.
+
+Training the agent
+------------------
+
+With the ``config.toml`` in place, start training:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/
+   Training for 1000 episodes…
+   Done. Checkpoints in runs/snake/checkpoints/
+
+Training saves checkpoints every 100 episodes and a ``final.pt``
+checkpoint when complete. You can follow progress in the training log:
+
+.. code-block:: console
+
+   % tail -f runs/snake/training.log
+
+The log shows one line per episode:
+
+.. code-block:: text
+
+   [EP 0001] total_reward=0.0  steps=2000  epsilon=0.9950  avg_loss=0.023540
+   [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
+   [EP 0100] total_reward=3.0  steps=1847  epsilon=0.6058  avg_loss=0.001204
+
+- **total_reward**: the total score earned during the episode (how many
+  apples the snake ate, for Snake).
+- **steps**: how many turns the episode lasted.
+- **epsilon**: the current exploration rate. Early in training this is
+  close to 1 (mostly random actions); it decays toward ``epsilon_min``.
+- **avg_loss**: the average temporal-difference error across training
+  steps in this episode. A decreasing loss generally indicates that the
+  Q-value estimates are converging.
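+
+If you want to plot or compare these numbers, the episode lines are
+regular enough to parse with a short script. The sketch below is one
+way to do it, based only on the line format shown above;
+``parse_episode_line`` is a hypothetical helper, not part of
+``retro-gamer``.

```python
import re

# Matches lines such as:
# [EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217
EPISODE_LINE = re.compile(
    r"\[EP (?P<episode>\d+)\]\s+total_reward=(?P<total_reward>\S+)\s+"
    r"steps=(?P<steps>\d+)\s+epsilon=(?P<epsilon>\S+)\s+avg_loss=(?P<avg_loss>\S+)"
)


def parse_episode_line(line):
    """Return the episode fields as a dict, or None for non-episode lines."""
    match = EPISODE_LINE.search(line)
    if match is None:
        return None
    return {
        "episode": int(match["episode"]),
        "total_reward": float(match["total_reward"]),
        "steps": int(match["steps"]),
        "epsilon": float(match["epsilon"]),
        "avg_loss": float(match["avg_loss"]),
    }


sample = "[EP 0050] total_reward=1.0  steps=1921  epsilon=0.7783  avg_loss=0.003217"
print(parse_episode_line(sample)["total_reward"])  # 1.0
```
+
+Lines that do not match, such as the ``[INIT]`` architecture
+description at the top of the log, return ``None`` and can be skipped.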
+
+Resuming training
+~~~~~~~~~~~~~~~~~
+
+Training can be resumed from a checkpoint:
+
+.. code-block:: console
+
+   % retro-gamer train runs/snake/ --resume checkpoints/ep_0500.pt
+
+Watching a trained agent play
+------------------------------
+
+To watch a trained agent play the game in your terminal:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint final
+
+You can substitute any checkpoint name:
+
+.. code-block:: console
+
+   % retro-gamer play runs/snake/ --checkpoint ep_0100
+
+Press Enter or Escape to quit.
+
+Comparing agents trained at different checkpoints is a useful activity:
+the agent at episode 100 has learned *something*, but typically much
+less than the agent at episode 500. Articulating *what* the earlier
+agent has and has not learned, and *why*, is productive reasoning about
+the training process.
+
+Inspecting a run
+----------------
+
+To review the configuration and recent training progress for a run:
+
+.. code-block:: console
+
+   % retro-gamer info runs/snake/
+   Game module : retro.examples.snake
+   Metadata    : {'board_size': [32, 16], 'actions': [...], 'reward': 'score', ...}
+   Hyperparams : {'learning_rate': 0.001, 'gamma': 0.99, ...}
+
+   Last 5 episodes:
+     [EP 0996] total_reward=9.0   steps=1203  epsilon=0.0500  avg_loss=0.000312
+     [EP 0997] total_reward=11.0  steps=1051  epsilon=0.0500  avg_loss=0.000289
+     [EP 0998] total_reward=14.0  steps=987   epsilon=0.0500  avg_loss=0.000274
+     [EP 0999] total_reward=8.0   steps=1142  epsilon=0.0500  avg_loss=0.000261
+     [EP 1000] total_reward=12.0  steps=1089  epsilon=0.0500  avg_loss=0.000248
+
+   Checkpoints (11): ['ep_0100.pt', ..., 'final.pt']
+
+Adjusting hyperparameters
+--------------------------
+
+The training hyperparameters can be changed by editing ``config.toml``
+before training, or by passing them as options to ``retro-gamer
+create``. Common adjustments and their effects:
+
+``training_episodes``
+    How long to train. More episodes give the agent more time to
+    learn, but also take longer to run.
+
+``epsilon_decay``
+    How quickly exploration decreases. A faster decay (smaller
+    ``epsilon_decay``) means the agent commits to its early
+    Q-estimates before they are fully reliable. A slower decay (larger
+    ``epsilon_decay``, closer to 1) gives the agent more time to
+    explore but may waste training time on random actions.
+
+``learning_rate``
+    How large the weight updates are at each training step. A large
+    learning rate learns fast but may overshoot; a small learning rate
+    is stable but slow.
+
+``gamma``
+    The discount factor for future rewards. Closer to 1 means the
+    agent values long-term consequences; closer to 0 makes the agent
+    focus on immediate reward.
+
+``n_layers`` and ``layer_size``
+    The depth and width of the MLP head. Larger networks can represent
+    more complex Q-functions but are slower to train and may overfit.
+
+``prioritize_experiences``
+    Whether to use prioritized experience replay. This often improves
+    sample efficiency but is slightly slower per step.
+
+Questions for investigation
+----------------------------
+
+The following questions are intended to guide productive investigation
+using ``retro-gamer``. They are chosen because they have specific,
+reasoned answers that connect what you know about the game to the
+concepts underlying the training algorithm.
+
+1. **Character set completeness.** Train two agents: one with the full
+   character set, one missing a character that frequently appears on the
+   board. Compare their performance. What did the second agent lose the
+   ability to perceive, and how did that affect its behavior?
+
+2. **Spatial vs. non-spatial.** Train the same game with ``spatial =
+   true`` and ``spatial = false``. How does training efficiency differ?
+   Can you explain the difference in terms of what each architecture
+   can and cannot learn?
+
+3. **Reward shaping.** If the game currently rewards only the final
+   objective (e.g., reaching a goal), add intermediate rewards for
+   sub-goals. How does this change the early training curve? Does it
+   change the agent's final strategy?
+
+4. **Exploration schedule.** Train with a very fast ``epsilon_decay``
+   (so the agent commits to exploiting early) and a very slow one (so
+   exploration continues for a long time). How do the training curves
+   differ? What is the agent doing in each case when ``epsilon`` is low?
+
+5. **Checkpoint comparison.** Load the agent at episode 100 and at
+   episode 1000 and watch each play the same game. What has the later
+   agent learned that the earlier one has not? How would you describe
+   this difference to someone who does not know about neural networks?