Updates across the board

2026-06-22 16:41:31 -04:00
parent 5ca97dc5d0
commit 73624d1a0c
33 changed files with 3104 additions and 643 deletions
--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@@ -0,0 +1,287 @@
+Troubleshooting
+===============
+
+This section describes problems that commonly arise when training an agent
+with ``retro-gamer``. Each entry names the issue, describes what you will
+see in the training log or when watching the agent play, explains what is
+happening in terms of the underlying reinforcement learning, and suggests
+how to fix it.
+
+.. contents:: Issues
+   :local:
+   :depth: 1
+
+
+Loss grows rapidly over training
+---------------------------------
+
+**Symptoms**
+
+The ``avg_loss`` column in the training log grows steadily from one
+checkpoint to the next, often at an accelerating rate::
+
+   [ep_0100]  avg_loss=22.2
+   [ep_0200]  avg_loss=128.5
+   [ep_0300]  avg_loss=2918.5
+   [ep_0400]  avg_loss=163825.1
+
+Left unchecked, the loss eventually reaches extreme values and the agent's
+behavior becomes erratic or degenerates entirely.
+
+**Why this happens**
+
+This is called *Q-value divergence*. The Q-network is trained to predict
+the total future reward of each action. To do that, it computes a *target*
+for each prediction — but the target itself is computed using the
+Q-network's own current predictions. This creates a feedback loop: if
+the predictions are slightly off, the targets drift, which makes the next
+predictions slightly more off, which drifts the targets further.
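+
+The loop is easiest to see in the standard one-step Q-learning target.
+The sketch below is illustrative only: ``td_target`` is a hypothetical
+helper rather than part of ``retro-gamer``, and the ``gamma`` default is
+an arbitrary choice.
+
+.. code-block:: python
+
+   def td_target(reward, next_q_values, gamma=0.99, done=False):
+       """Bootstrapped target for one transition (one-step Q-learning)."""
+       if done:
+           # No future reward follows a terminal step.
+           return reward
+       # The target leans on the network's own estimate of the next state.
+       return reward + gamma * max(next_q_values)
+
+
+   # Inflated next-state estimates produce an equally inflated target.
+   accurate = td_target(1.0, [2.0, 3.0])     # 1 + 0.99 * 3
+   inflated = td_target(1.0, [20.0, 30.0])   # 1 + 0.99 * 30
+
+Any overestimate in ``next_q_values`` flows straight into the target, so
+the network is then trained toward its own error.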
+
+Under normal conditions, the learning rate is small enough and the target
+network stable enough that this loop stays controlled. Divergence usually
+sets in when the learning rate is too high, so that each update overshoots
+and the error grows instead of shrinking. Two things amplify the problem:
+larger networks, which have more parameters to overshoot, and prioritized
+experience replay, which deliberately samples the experiences the network
+is most wrong about, exactly the experiences most likely to destabilize
+it.
+
+**How to fix it**
+
+Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
+(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
+training. If you recently increased the size of the network (via
+``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
+rate than you used before is likely necessary — larger, more capable
+networks need smaller, more careful updates.
+
+Also consider increasing ``target_update_freq``. The target network is a
+frozen copy of the Q-network used to compute stable training targets; the
+less frequently it is updated, the more stable those targets are. The
+default is 200 steps; raising it to 500 or 1000 slows learning slightly
+but reduces the chance of divergence.
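+
+A periodic sync of this kind might look like the following sketch.
+``maybe_sync_target`` and the dictionary of weights are stand-ins invented
+for illustration, not ``retro-gamer`` internals; only the default of 200
+steps comes from the text above.
+
+.. code-block:: python
+
+   import copy
+
+   def maybe_sync_target(step, q_weights, target_weights,
+                         target_update_freq=200):
+       """Return the weights the target network should hold after `step`."""
+       if step % target_update_freq == 0:
+           # Refresh the frozen copy; deep-copy so later updates do not leak in.
+           return copy.deepcopy(q_weights)
+       return target_weights
+
+   q_weights = {"w": [0.0]}
+   target_weights = copy.deepcopy(q_weights)
+   syncs = 0
+   for step in range(1, 1001):
+       q_weights["w"][0] += 0.01        # stand-in for a gradient update
+       new_target = maybe_sync_target(step, q_weights, target_weights)
+       if new_target is not target_weights:
+           syncs += 1
+       target_weights = new_target
+
+Between syncs the targets are computed from weights that do not move,
+which is what keeps them stable; a larger ``target_update_freq`` simply
+makes those quiet stretches longer.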
+
+Because divergence compounds over many episodes, a run that has begun
+diverging cannot simply be resumed with a lower learning rate — the
+weights have already drifted far from useful values. Use
+``retro-gamer clean`` to remove the existing checkpoints and start fresh.
+
+
+Agent ignores some actions entirely
+-------------------------------------
+
+**Symptoms**
+
+After training, the agent never (or almost never) turns in certain
+directions, regardless of the board state. If you compare checkpoints at
+different stages of training, the missing directions are absent from the
+very beginning and never appear. The agent may survive for a while but
+always move in only a subset of the possible directions.
+
+**Why this happens**
+
+If some actions lead to immediate death every time they are tried early in
+training, the Q-network quickly learns to assign them very low values.
+This is correct in the specific situation where those actions are always
+fatal — but the network then generalizes that association across *all*
+board positions, even positions where those actions would be safe.
+
+A common cause is a fixed starting position at the edge or corner of the
+board. A snake that always starts in the top-left corner and always begins
+moving downward will die immediately whenever it turns up or left in the
+first step. After thousands of early episodes where those actions produce
+instant death, the network has seen so much evidence that "turn left →
+die" and "turn up → die" that it assigns them low Q-values everywhere.
+
+**How to fix it**
+
+Make sure the game's starting conditions give the agent a chance to try
+every action safely. For a snake game, this means randomizing both the
+starting position (keeping at least one cell away from every edge) and
+the starting direction at the beginning of each episode. An agent that
+starts in different places and orientations each time will quickly learn
+that all four directions can be appropriate depending on context.
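+
+One possible way to randomize the start is sketched below. The function
+name, the ``DIRECTIONS`` table, and the ``(x, y)`` coordinate convention
+are assumptions made for this example; adapt them to however your game
+represents the board.
+
+.. code-block:: python
+
+   import random
+
+   DIRECTIONS = {"up": (0, -1), "down": (0, 1),
+                 "left": (-1, 0), "right": (1, 0)}
+
+   def random_start(width, height, rng=random):
+       """Pick a start cell off the border plus a random heading.
+
+       Assumes the board is at least 3 cells wide and 3 cells tall.
+       """
+       x = rng.randint(1, width - 2)    # randint includes both endpoints
+       y = rng.randint(1, height - 2)
+       direction = rng.choice(sorted(DIRECTIONS))
+       return (x, y), direction
+
+Passing a seeded ``random.Random`` instance as ``rng`` keeps episodes
+reproducible while still varying the start from one episode to the next.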
+
+
+Agent survives but never moves toward the goal
+-----------------------------------------------
+
+**Symptoms**
+
+The ``avg_steps`` column in the training log increases steadily — the
+agent is surviving longer — but ``avg_reward`` stays negative or barely
+improves. When you watch the agent play, it wanders around the board
+without ever approaching the target object. Episodes end because the
+agent runs into a wall, not because it reached the goal.
+
+**Why this happens**
+
+The reward signal is *asymmetric*: it penalizes moving away from the goal
+but gives no reward for moving toward it. With this signal, the agent
+learns to avoid the penalty by surviving, but it has no positive gradient
+pointing it in the right direction. The eventual goal-reaching reward
+(eating the apple, reaching the exit, etc.) is too rare — especially
+early in training when the agent is mostly acting randomly — to provide
+meaningful learning signal on its own.
+
+From the Q-network's perspective, the best outcome available on any step
+is 0: moving toward the goal earns nothing, and moving away costs 1.
+Nothing marks the goal itself as worth reaching unless the agent stumbles
+onto it. On a large board, the probability of eating the apple by chance
+is small enough that the network may never see the positive goal reward
+at all during the exploration phase.
+
+**How to fix it**
+
+Make the distance-based reward symmetric: give **+1 for moving toward the
+goal** and **−1 for moving away**. This way, every single step provides a
+meaningful signal in the correct direction, and the agent does not need to
+reach the goal by chance in order to start learning. In a snake game,
+computing this signal requires only one line of arithmetic — the change
+in Manhattan distance between the head and the apple from one step to the
+next.
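+
+A minimal sketch of that arithmetic, assuming positions are ``(x, y)``
+tuples and the snake moves exactly one cell per step (``manhattan`` and
+``shaping_reward`` are illustrative names):
+
+.. code-block:: python
+
+   def manhattan(a, b):
+       """Grid distance between two (x, y) cells."""
+       return abs(a[0] - b[0]) + abs(a[1] - b[1])
+
+   def shaping_reward(old_head, new_head, apple):
+       """+1 if the step moved the head closer to the apple, -1 if farther."""
+       return manhattan(old_head, apple) - manhattan(new_head, apple)
+
+   toward = shaping_reward((2, 2), (3, 2), apple=(5, 2))   # +1
+   away = shaping_reward((2, 2), (1, 2), apple=(5, 2))     # -1
+
+Because a single orthogonal move always changes the Manhattan distance by
+exactly one, the result is never zero. Compute it against the apple's
+position from before the move, so that a freshly respawned apple does not
+distort the step on which the previous one was eaten.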
+
+Note that the shaped ±1 signal is a *proxy* for the real objective. If the
+agent learns to follow it too literally, it may take direct paths that run
+through its own body. The −10 death penalty and +50 apple reward are still
+necessary; the shaping only accelerates early learning.
+
+
+Exploration ends before learning is complete
+---------------------------------------------
+
+**Symptoms**
+
+The ``epsilon`` column in the training log reaches ``epsilon_min`` well
+before training is finished. After that point, ``avg_reward`` stops
+improving even though many episodes remain. When you watch the agent play,
+it commits to the same strategy regardless of what is happening on the
+board.
+
+**Why this happens**
+
+Epsilon controls the balance between exploration (random actions) and
+exploitation (using the learned policy). Early in training, when the
+Q-network has seen little data, exploration is essential: the agent needs
+to try different things to accumulate the varied experiences that make
+Q-value estimates reliable. Once epsilon reaches its minimum, the agent
+stops exploring and commits fully to whatever policy it has learned so far.
+
+If ``training_episodes`` is too small relative to ``epsilon_decay``, the
+exploration phase ends while the Q-network is still unreliable. The agent
+then exploits a half-learned policy that cannot improve because it never
+tries anything new.
+
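+Assuming epsilon is multiplied by ``epsilon_decay`` once per episode, you
+can calculate when it will reach its minimum:
+
+.. code-block:: python
+
+   import math
+   epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999  # the defaults
+   episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)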
+
+With the defaults (``epsilon = 1.0``, ``epsilon_min = 0.05``,
+``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. The
+agent should have substantial training time *after* the exploration phase
+ends — so ``training_episodes`` should be at least several times this
+number.
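+
+To check a planned run against this guideline, a small helper along the
+following lines can report what share of training is spent exploring. Both
+function names are hypothetical, and the calculation again assumes one
+multiplicative decay per episode.
+
+.. code-block:: python
+
+   import math
+
+   def episodes_until_min(epsilon=1.0, epsilon_min=0.05,
+                          epsilon_decay=0.999):
+       """Episodes before epsilon reaches its floor."""
+       ratio = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)
+       return math.ceil(ratio)
+
+   def exploration_fraction(training_episodes, **schedule):
+       """Share of the run spent before epsilon reaches epsilon_min."""
+       return min(1.0, episodes_until_min(**schedule) / training_episodes)
+
+   short_run = exploration_fraction(3_000)    # nearly the whole run
+   long_run = exploration_fraction(10_000)    # under a third of the run
+
+A fraction close to 1.0 means the run ends almost as soon as exploration
+does, which leaves the agent little time to refine what it has learned.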
+
+**How to fix it**
+
+Increase ``training_episodes`` so that the agent has many episodes of
+exploitation after the exploration phase ends. For simple games on small
+boards, 10,000 episodes is a reasonable starting point; for complex games
+or large boards, 50,000–100,000 may be needed.
+
+This is always safe to change. Because ``training_episodes`` does not
+affect the network architecture or the reward signal, you can increase it
+in ``config.toml`` and resume training from the latest checkpoint without
+starting fresh.
+
+
+Death penalty dominates all other signals
+-------------------------------------------
+
+**Symptoms**
+
+After a period of training, the agent survives for many steps but rarely
+or never scores. It tends to circle, hug walls, or otherwise avoid the
+goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
+persistently negative. The agent behaves as if staying alive is the only
+objective.
+
+**Why this happens**
+
+When the penalty for dying is much larger than any other reward in the
+game, the Q-network learns that staying alive is overwhelmingly the most
+important thing to do. Scoring — which requires taking some risk —
+becomes unattractive because a single death outweighs many successful
+goal-reaching events.
+
+For example, if the death penalty is −1000 and each apple is worth +50,
+then a single death cancels out twenty apples. The agent learns that the
+safest strategy is to avoid risk entirely, even if that means never
+eating. From the Q-network's perspective, this is rational: it is
+correctly optimizing the reward signal you gave it.
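+
+A rough way to quantify this is to ask how risky an attempt at the apple
+may be before its expected reward turns negative. The sketch below is a
+deliberately simplified one-decision model that ignores discounting and
+per-step shaping; ``break_even_death_risk`` is a hypothetical helper and
+takes the death penalty as a positive magnitude.
+
+.. code-block:: python
+
+   def break_even_death_risk(apple_reward, death_penalty):
+       """Highest death probability at which an attempt still pays off.
+
+       Solves (1 - p) * apple_reward - p * death_penalty = 0 for p.
+       """
+       return apple_reward / (apple_reward + death_penalty)
+
+   harsh = break_even_death_risk(50, 1000)   # about 0.048
+   mild = break_even_death_risk(50, 10)      # about 0.83
+
+Under this model, a −1000 penalty makes any attempt with more than about a
+5% chance of dying a losing bet, whereas a −10 penalty tolerates
+considerable risk.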
+
+**How to fix it**
+
+Keep the reward magnitudes within a comparable range. If per-step
+shaping gives ±1 and the goal reward is +50, a death penalty of −10 is
+appropriate: death is clearly bad (ten times worse than a bad step) but
+not so catastrophic that it crowds out everything else. As a rule of
+thumb, keep the death penalty to no more than ten to twenty times the
+typical per-step reward, and smaller than the reward for reaching the
+goal.
+
+Increasing ``gamma`` (the discount factor) is a better way to make the
+agent care more about long-term consequences. A higher gamma causes
+future rewards — including the eventual death penalty — to count more
+heavily in the agent's current decisions, without distorting the relative
+scale of the rewards.
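+
+The effect of ``gamma`` can be seen by computing how much a reward that
+lies some steps in the future counts today. The helper name and the two
+example values of ``gamma`` below are chosen purely for illustration.
+
+.. code-block:: python
+
+   def discounted_weight(gamma, steps_ahead):
+       """Weight a reward `steps_ahead` steps in the future carries now."""
+       return gamma ** steps_ahead
+
+   for gamma in (0.9, 0.99):
+       print(gamma, round(discounted_weight(gamma, 20), 3))
+
+With ``gamma = 0.9``, a consequence twenty steps away keeps roughly 12% of
+its weight; with ``gamma = 0.99`` it keeps roughly 82%. The rewards
+themselves are unchanged in both cases.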
+
+
+Reward signal and human score interfere with each other
+---------------------------------------------------------
+
+**Symptoms**
+
+Human players see scores that go negative, or that include penalties and
+adjustments that make no sense in the context of a normal game. Conversely,
+adjustments made to improve training (removing a per-step shaping penalty,
+changing a death penalty) change the game's visible score in ways that
+affect the experience for human players.
+
+**Why this happens**
+
+Using the same state variable for both the training reward and the
+human-visible score conflates two separate concerns. Training rewards
+benefit from shaping — intermediate signals like "moved toward the goal"
+and "died" that accelerate learning. Scores for human players should
+reflect only the game's actual objectives (apples eaten, enemies defeated,
+distance covered) so that they are legible and motivating.
+
+When these are the same variable, every design decision about one
+necessarily affects the other.
+
+**How to fix it**
+
+Use two separate keys in the game's state dictionary: one for the
+human-facing score (updated only by meaningful in-game events) and one
+for the training reward (updated every step with shaping signals and
+penalties). In the game code:
+
+.. code-block:: python
+
+   # Every step: shaping for the trainer only (+1 toward apple, -1 away).
+   game.state['reward'] += old_dist - new_dist
+
+   # ate_apple and died are placeholder flags set by your game logic.
+   if ate_apple:
+       game.state['score'] += 50    # the only change human players see
+       game.state['reward'] += 50   # also reward eating
+   if died:
+       game.state['reward'] -= 10   # death penalty, invisible to players
+
+Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
+``pyproject.toml`` so the trainer watches the right key. The score display
+remains clean for human players, and you can adjust the training reward
+freely without affecting it.
+
+Note that changing the ``reward`` key is an incompatible change: existing
+checkpoints trained on the old signal will be rejected when you try to
+resume. Run ``retro-gamer clean`` and start fresh after making this change.