Updates across the board

Chris Proctor
2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions

docs/troubleshooting.rst Normal file

@@ -0,0 +1,287 @@
Troubleshooting
===============
This section describes problems that commonly arise when training an agent
with ``retro-gamer``. Each entry names the issue, describes what you will
see in the training log or when watching the agent play, explains what is
happening in terms of the underlying reinforcement learning, and suggests
how to fix it.

.. contents:: Issues
   :local:
   :depth: 1


Loss grows rapidly over training
---------------------------------

**Symptoms**

The ``avg_loss`` column in the training log grows steadily from one
checkpoint to the next, often at an accelerating rate::

    [ep_0100] avg_loss=22.2
    [ep_0200] avg_loss=128.5
    [ep_0300] avg_loss=2918.5
    [ep_0400] avg_loss=163825.1

Left unchecked, the loss eventually reaches extreme values and the agent's
behavior becomes erratic or degenerates entirely.

**Why this happens**

This is called *Q-value divergence*. The Q-network is trained to predict
the total future reward of each action. To do that, it computes a *target*
for each prediction — but the target itself is computed using the
Q-network's own current predictions. This creates a feedback loop: if
the predictions are slightly off, the targets drift, which makes the next
predictions slightly more off, which drifts the targets further.

Under normal conditions, the learning rate is small enough and the target
network stable enough that this loop stays controlled. Divergence usually
happens when the learning rate is too high, so that each update overshoots.
The problem is amplified by larger networks (more parameters to overshoot)
and by prioritized experience replay, which deliberately samples the
experiences the network is most wrong about. Those are exactly the
experiences most likely to destabilize it.
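
The feedback loop is easiest to see in how a training target is formed. The sketch below assumes the standard one-step Q-learning target; the function name ``td_target`` and its parameters are illustrative and are not taken from ``retro-gamer``, whose internals may differ.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max over next-state Q-values.

    Hypothetical helper for illustration only.
    """
    if done:
        # Terminal step: there is no next state to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


# Same transition, two different sets of next-state predictions.
accurate = td_target(1.0, [2.0, 3.0], gamma=0.9)   # predictions are reasonable
inflated = td_target(1.0, [2.0, 13.0], gamma=0.9)  # one prediction is 10 too high

print(accurate, inflated)
```

An overestimate in the next state's predictions passes straight into the target, scaled by ``gamma``. The network is then trained toward that inflated target, which is how a small error can compound.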

**How to fix it**

Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
training. If you recently increased the size of the network (via
``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
rate than you used before is likely necessary — larger, more capable
networks need smaller, more careful updates.
Also consider increasing ``target_update_freq``. The target network is a
frozen copy of the Q-network used to compute stable training targets; the
less frequently it is updated, the more stable those targets are. The
default is 200 steps; raising it to 500 or 1000 slows learning slightly
but reduces the chance of divergence.
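
To make the role of ``target_update_freq`` concrete, here is a minimal sketch of periodic target-network syncing. The ``TargetSync`` class is hypothetical, and a plain dictionary stands in for real network weights; it only illustrates the idea described above, not how ``retro-gamer`` implements it.

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the online weights, refreshed every `freq` steps."""

    def __init__(self, online_weights, target_update_freq=200):
        self.online = online_weights
        self.target = copy.deepcopy(online_weights)
        self.freq = target_update_freq
        self.steps = 0

    def step(self):
        """Count one training step; return True if the target was refreshed."""
        self.steps += 1
        if self.steps % self.freq == 0:
            self.target = copy.deepcopy(self.online)
            return True
        return False


def count_syncs(freq, total_steps=1000):
    sync = TargetSync({"w": 0.0}, target_update_freq=freq)
    return sum(sync.step() for _ in range(total_steps))


print(count_syncs(200), count_syncs(500))
```

Over 1,000 training steps, a frequency of 200 refreshes the targets five times, while 500 refreshes them only twice. Between refreshes the targets stay fixed, which is what damps the feedback loop.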
Because divergence compounds over many episodes, a run that has begun
diverging cannot simply be resumed with a lower learning rate — the
weights have already drifted far from useful values. Use
``retro-gamer clean`` to remove the existing checkpoints and start fresh.

Agent ignores some actions entirely
-------------------------------------

**Symptoms**

After training, the agent never (or almost never) turns in certain
directions, regardless of the board state. If you compare checkpoints at
different stages of training, the missing directions are absent from the
very beginning and never appear. The agent may survive for a while but
always move in only a subset of the possible directions.

**Why this happens**

If some actions lead to immediate death every time they are tried early in
training, the Q-network quickly learns to assign them very low values.
This is correct in the specific situation where those actions are always
fatal — but the network then generalizes that association across *all*
board positions, even positions where those actions would be safe.
A common cause is a fixed starting position at the edge or corner of the
board. A snake that always starts in the top-left corner and always begins
moving downward will die immediately whenever it turns up or left in the
first step. After thousands of early episodes where those actions produce
instant death, the network has seen so much evidence that "turn left →
die" and "turn up → die" that it assigns them low Q-values everywhere.

**How to fix it**

Make sure the game's starting conditions give the agent a chance to try
every action safely. For a snake game, this means randomizing both the
starting position (keeping at least one cell away from every edge) and
the starting direction at the beginning of each episode. An agent that
starts in different places and orientations each time will quickly learn
that all four directions can be appropriate depending on context.
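
A randomized start might look like the following sketch. It assumes a rectangular grid with zero-based coordinates and at least three cells per side; ``random_start`` and ``DIRECTIONS`` are hypothetical names, not part of ``retro-gamer``.

```python
import random

# (dx, dy) unit steps: up, down, left, right.
DIRECTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]


def random_start(width, height, rng=random):
    """Pick a start cell at least one cell from every edge, plus a random heading."""
    x = rng.randint(1, width - 2)   # randint is inclusive on both ends
    y = rng.randint(1, height - 2)
    direction = rng.choice(DIRECTIONS)
    return (x, y), direction


rng = random.Random(0)  # seeded only so the example is repeatable
position, direction = random_start(10, 10, rng)
print(position, direction)
```

Calling this at the start of every episode gives each action a chance to be tried from a position where it is not immediately fatal.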

Agent survives but never moves toward the goal
-----------------------------------------------

**Symptoms**

The ``avg_steps`` column in the training log increases steadily — the
agent is surviving longer — but ``avg_reward`` stays negative or barely
improves. When you watch the agent play, it wanders around the board
without ever approaching the target object. Episodes end because the
agent runs into a wall, not because it reached the goal.

**Why this happens**

The reward signal is *asymmetric*: it penalizes moving away from the goal
but gives no reward for moving toward it. With this signal, the agent
learns to avoid the penalty by surviving, but it has no positive gradient
pointing it in the right direction. The eventual goal-reaching reward
(eating the apple, reaching the exit, etc.) is too rare — especially
early in training when the agent is mostly acting randomly — to provide
meaningful learning signal on its own.

From the Q-network's perspective, all directions look roughly equivalent:
moving toward the goal is 0 reward, moving away is -1. On a large board,
the probability of eating the apple by chance is small enough that the
network may never see the positive terminal reward at all during the
exploration phase.

**How to fix it**

Make the distance-based reward symmetric: give **+1 for moving toward the
goal** and **-1 for moving away**. This way, every single step provides a
meaningful signal in the correct direction, and the agent does not need to
reach the goal by chance in order to start learning. In a snake game,
computing this signal requires only one line of arithmetic: the change
in Manhattan distance between the head and the apple from one step to the
next.
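
As a minimal sketch of that arithmetic, assuming positions are ``(x, y)`` tuples on a grid and the agent moves one cell per step (the helper names are illustrative):

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shaping_reward(old_head, new_head, apple):
    """+1 if the step reduced the distance to the apple, -1 if it increased it."""
    return manhattan(old_head, apple) - manhattan(new_head, apple)


apple = (5, 2)
print(shaping_reward((2, 2), (3, 2), apple))  # moved toward the apple
print(shaping_reward((2, 2), (1, 2), apple))  # moved away from the apple
```

Because a single orthogonal step always changes the Manhattan distance by exactly one, the result is always +1 or -1 as long as the apple has not moved between the two measurements.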

Note that the shaped ±1 signal is a *proxy* for the real objective. If the
agent learns to follow it too literally, it may take direct paths that run
through its own body. The -10 death penalty and +50 apple reward are still
necessary; the shaping only accelerates early learning.

Exploration ends before learning is complete
---------------------------------------------

**Symptoms**

The ``epsilon`` column in the training log reaches ``epsilon_min`` well
before training is finished. After that point, ``avg_reward`` stops
improving even though many episodes remain. When you watch the agent play,
it commits to the same strategy regardless of what is happening on the
board.

**Why this happens**

Epsilon controls the balance between exploration (random actions) and
exploitation (using the learned policy). Early in training, when the
Q-network has seen little data, exploration is essential: the agent needs
to try different things to accumulate the varied experiences that make
Q-value estimates reliable. Once epsilon reaches its minimum, the agent
stops exploring and commits fully to whatever policy it has learned so far.
If ``training_episodes`` is too small relative to ``epsilon_decay``, the
exploration phase ends while the Q-network is still unreliable. The agent
then exploits a half-learned policy that cannot improve because it never
tries anything new.
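
You can calculate when epsilon will reach its minimum:

.. code-block:: python

    import math

    # Defaults; substitute the values from your own configuration.
    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999
    episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)
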
With the defaults (``epsilon = 1.0``, ``epsilon_min = 0.05``,
``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. The
agent should have substantial training time *after* the exploration phase
ends — so ``training_episodes`` should be at least several times this
number.
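
The closed-form estimate can be cross-checked by simulating the decay directly. This sketch assumes epsilon is multiplied by ``epsilon_decay`` once per episode, which is what the formula above implies; the function names are hypothetical.

```python
import math


def episodes_until_min(epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
    """Closed form: smallest n with epsilon * epsilon_decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))


def simulate_decay(epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
    """Apply the decay one episode at a time and count the episodes needed."""
    episodes = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        episodes += 1
    return episodes


print(episodes_until_min(), simulate_decay())
```

With the defaults, both approaches give 2,995 episodes, consistent with the rough figure of 3,000. Comparing that number against ``training_episodes`` before a long run is a quick way to see how much training remains after exploration ends.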

**How to fix it**

Increase ``training_episodes`` so that the agent has many episodes of
exploitation after the exploration phase ends. For simple games on small
boards, 10,000 episodes is a reasonable starting point; for complex games
or large boards, 50,000 to 100,000 may be needed.

This is always safe to change. Because ``training_episodes`` does not
affect the network architecture or the reward signal, you can increase it
in ``config.toml`` and resume training from the latest checkpoint without
starting fresh.

Death penalty dominates all other signals
-------------------------------------------

**Symptoms**

After a period of training, the agent survives for many steps but rarely
or never scores. It tends to circle, hug walls, or otherwise avoid the
goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
persistently negative. The agent behaves as if staying alive is the only
objective.

**Why this happens**

When the penalty for dying is much larger than any other reward in the
game, the Q-network learns that staying alive is overwhelmingly the most
important thing to do. Scoring — which requires taking some risk —
becomes unattractive because a single death outweighs many successful
goal-reaching events.

For example, if the death penalty is -1000 and each successful apple is
+50, then dying once costs the equivalent of twenty apples. The agent
learns that the safest strategy is to avoid risk entirely, even if that
means never eating. From the Q-network's perspective, this is rational:
it is correctly optimizing the reward signal you gave it.

**How to fix it**

Keep reward magnitudes within a comparable range. If per-step
shaping gives ±1 and the goal reward is +50, a death penalty of -10 is
appropriate: death is clearly bad (ten times worse than a bad step) but
not so catastrophic that it crowds out everything else. As a rule of
thumb, no penalty should be more than ten to twenty times larger
than the typical per-step reward.

Increasing ``gamma`` (the discount factor) is a better way to make the
agent care more about long-term consequences. A higher gamma causes
future rewards — including the eventual death penalty — to count more
heavily in the agent's current decisions, without distorting the relative
scale of the rewards.
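
A short worked example shows how much ``gamma`` changes the weight of a distant penalty. It assumes standard exponential discounting, where a reward ``n`` steps in the future is scaled by ``gamma ** n``; the helper name is illustrative.

```python
def discounted(reward, steps_ahead, gamma):
    """Present value of a reward received `steps_ahead` steps in the future."""
    return reward * gamma ** steps_ahead


# How a -10 death penalty 20 steps away looks to the agent right now.
low_gamma = discounted(-10, 20, gamma=0.9)
high_gamma = discounted(-10, 20, gamma=0.99)

print(round(low_gamma, 2), round(high_gamma, 2))
```

With ``gamma = 0.9`` the penalty shrinks to about -1.2, barely larger than one bad step. With ``gamma = 0.99`` it is still about -8.2. The same -10 penalty carries far more weight in current decisions without being rescaled.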

Reward signal and human score interfere with each other
---------------------------------------------------------

**Symptoms**

Human players see scores that go negative, or that include penalties and
adjustments that make no sense in the context of a normal game. Conversely,
adjustments made to improve training (removing a per-step shaping penalty,
changing a death penalty) change the game's visible score in ways that
affect the experience for human players.

**Why this happens**

Using the same state variable for both the training reward and the
human-visible score conflates two separate concerns. Training rewards
benefit from shaping — intermediate signals like "moved toward the goal"
and "died" that accelerate learning. Scores for human players should
reflect only the game's actual objectives (apples eaten, enemies defeated,
distance covered) so that they are legible and motivating.
When these are the same variable, every design decision about one
necessarily affects the other.

**How to fix it**

Use two separate keys in the game's state dictionary: one for the
human-facing score (updated only by meaningful in-game events) and one
for the training reward (updated every step with shaping signals and
penalties). In the game code:

.. code-block:: python

    # Only updated when the snake eats an apple, so it stays clean for human players.
    game.state['score'] += 50

    # Updated by the game logic each step; used only by the trainer.
    game.state['reward'] += old_dist - new_dist  # +1 toward apple, -1 away
    game.state['reward'] += 50                   # also reward eating (when an apple is eaten)
    game.state['reward'] -= 10                   # death penalty (when the agent dies)

Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
``pyproject.toml`` so the trainer watches the right key. The score display
remains clean for human players, and you can adjust the training reward
freely without affecting it.
Note that changing the ``reward`` key is an incompatible change: existing
checkpoints trained on the old signal will be rejected when you try to
resume. Run ``retro-gamer clean`` and start fresh after making this change.