.. Commit "Updates across the board": new file docs/troubleshooting.rst (287 lines).

Troubleshooting
===============

This section describes problems that commonly arise when training an agent
with ``retro-gamer``. Each entry names the issue, describes what you will
see in the training log or when watching the agent play, explains what is
happening in terms of the underlying reinforcement learning, and suggests
how to fix it.

.. contents:: Issues
   :local:
   :depth: 1


Loss grows rapidly over training
---------------------------------

**Symptoms**

The ``avg_loss`` column in the training log grows steadily from one
checkpoint to the next, often at an accelerating rate::

    [ep_0100] avg_loss=22.2
    [ep_0200] avg_loss=128.5
    [ep_0300] avg_loss=2918.5
    [ep_0400] avg_loss=163825.1

Left unchecked, the loss eventually reaches extreme values and the agent's
behavior becomes erratic or degenerates entirely.

**Why this happens**

This is called *Q-value divergence*. The Q-network is trained to predict
the total future reward of each action. To do that, it computes a *target*
for each prediction, but the target itself is computed using the
Q-network's own current predictions. This creates a feedback loop: if
the predictions are slightly off, the targets drift, which makes the next
predictions slightly more off, which drifts the targets further.

Under normal conditions, the learning rate is small enough and the target
network stable enough that this loop stays controlled. Divergence happens
when the learning rate is too high, causing each update to overshoot.
The problem is amplified by larger networks (more parameters to overshoot)
and by prioritized experience replay, which deliberately samples the
experiences the network is most wrong about, which are exactly the
experiences most likely to destabilize it.

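The feedback loop can be seen in a deliberately tiny sketch. It is not
``retro-gamer``'s training code: it assumes a single state that loops back
to itself, a linear Q-estimate ``w * x`` with an unscaled feature
``x = 150``, and a target built from that same estimate. The name
``run_td`` is illustrative only.

.. code-block:: python

    def run_td(learning_rate, steps=10, gamma=0.9, reward=1.0, x=150.0):
        """Semi-gradient TD(0) on one self-looping state; returns loss per step."""
        w = 0.0
        losses = []
        for _ in range(steps):
            q = w * x                    # current prediction
            target = reward + gamma * q  # target built from the prediction itself
            td_error = target - q
            losses.append(td_error ** 2)
            # Semi-gradient update: overshoots when the step is too large.
            w += learning_rate * td_error * x
        return losses


    high = run_td(learning_rate=0.001)   # loss grows every step
    low = run_td(learning_rate=0.0001)   # loss shrinks every step
    print([round(v, 2) for v in high[:4]])
    print([round(v, 4) for v in low[:4]])

In this toy, each update multiplies the error by
``1 - learning_rate * (1 - gamma) * x**2``, so the tenfold smaller learning
rate turns growth into decay. Real networks are far less tidy, but the
overshoot mechanism is the same.
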
**How to fix it**

Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
training. If you recently increased the size of the network (via
``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
rate than you used before is likely necessary: larger, more capable
networks need smaller, more careful updates.

Also consider increasing ``target_update_freq``. The target network is a
frozen copy of the Q-network used to compute stable training targets; the
less frequently it is updated, the more stable those targets are. The
default is 200 steps; raising it to 500 or 1000 slows learning slightly
but reduces the chance of divergence.

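As a rough illustration of what ``target_update_freq`` controls, the sketch
below refreshes a frozen copy of the weights every N steps. It assumes a
periodic full copy, as described above; the class ``TargetSync`` and the
dictionary of weights are hypothetical and are not part of ``retro-gamer``.

.. code-block:: python

    import copy


    class TargetSync:
        """Keeps a frozen copy of the online weights, refreshed every N steps."""

        def __init__(self, online_weights, target_update_freq=200):
            self.target_update_freq = target_update_freq
            self.target_weights = copy.deepcopy(online_weights)
            self.steps = 0

        def step(self, online_weights):
            """Call once per training step; returns True when the copy is refreshed."""
            self.steps += 1
            if self.steps % self.target_update_freq == 0:
                self.target_weights = copy.deepcopy(online_weights)
                return True
            return False


    online = {"w": [0.0]}
    sync = TargetSync(online, target_update_freq=200)
    refreshes = 0
    for _ in range(1000):
        online["w"][0] += 0.01          # stand-in for a gradient update
        refreshes += sync.step(online)  # True counts as 1
    print(refreshes)

Between refreshes the targets are computed from weights that do not move,
which is why a larger interval makes the targets steadier at the cost of
using slightly staler estimates.
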
Because divergence compounds over many episodes, a run that has begun
diverging cannot simply be resumed with a lower learning rate, because the
weights have already drifted far from useful values. Use
``retro-gamer clean`` to remove the existing checkpoints and start fresh.


Agent ignores some actions entirely
-------------------------------------

**Symptoms**

After training, the agent never (or almost never) turns in certain
directions, regardless of the board state. If you compare checkpoints at
different stages of training, the missing directions are absent from the
very beginning and never appear. The agent may survive for a while, but it
only ever moves in a subset of the possible directions.

**Why this happens**

If some actions lead to immediate death every time they are tried early in
training, the Q-network quickly learns to assign them very low values.
This is correct in the specific situation where those actions are always
fatal, but the network then generalizes that association across *all*
board positions, even positions where those actions would be safe.

A common cause is a fixed starting position at the edge or corner of the
board. A snake that always starts in the top-left corner and always begins
moving downward will die immediately whenever it turns up or left in the
first step. After thousands of early episodes where those actions produce
instant death, the network has seen so much evidence that "turn left →
die" and "turn up → die" that it assigns them low Q-values everywhere.

**How to fix it**

Make sure the game's starting conditions give the agent a chance to try
every action safely. For a snake game, this means randomizing both the
starting position (keeping at least one cell away from every edge) and
the starting direction at the beginning of each episode. An agent that
starts in different places and orientations each time will quickly learn
that all four directions can be appropriate depending on context.


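A randomized start along these lines might look like the following sketch.
The function ``random_start``, the ``DIRECTIONS`` list, and the ``(x, y)``
cell convention are assumptions for illustration; adapt them to however
your game represents the board.

.. code-block:: python

    import random

    DIRECTIONS = ["up", "down", "left", "right"]


    def random_start(width, height, rng=random):
        """Pick a start cell at least one cell from every edge, plus a direction."""
        if width < 3 or height < 3:
            raise ValueError("board must be at least 3x3 to keep a margin")
        x = rng.randint(1, width - 2)   # randint is inclusive on both ends
        y = rng.randint(1, height - 2)
        direction = rng.choice(DIRECTIONS)
        return (x, y), direction


    rng = random.Random(0)  # seeded only so the example is repeatable
    starts = [random_start(10, 10, rng) for _ in range(1000)]

Calling this at the start of every episode means that no single action is
fatal on the first step in every episode, so the early evidence the network
sees is no longer one-sided.
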
Agent survives but never moves toward the goal
-----------------------------------------------

**Symptoms**

The ``avg_steps`` column in the training log increases steadily (the
agent is surviving longer) but ``avg_reward`` stays negative or barely
improves. When you watch the agent play, it wanders around the board
without ever approaching the target object. Episodes end because the
agent runs into a wall, not because it reached the goal.

**Why this happens**

The reward signal is *asymmetric*: it penalizes moving away from the goal
but gives no reward for moving toward it. With this signal, the agent
learns to avoid penalties and stay alive, but it has no positive signal
pointing it in the right direction. The eventual goal-reaching reward
(eating the apple, reaching the exit, etc.) is too rare, especially
early in training when the agent is mostly acting randomly, to provide
a meaningful learning signal on its own.

From the Q-network's perspective, no direction looks actively good: the
best per-step outcome (moving toward the goal) is 0 reward, and moving
away is −1. On a large board, the probability of eating the apple by
chance is small enough that the network may never see the positive
terminal reward at all during the exploration phase.

**How to fix it**

Make the distance-based reward symmetric: give **+1 for moving toward the
goal** and **−1 for moving away**. This way, every single step provides a
meaningful signal in the correct direction, and the agent does not need to
reach the goal by chance in order to start learning. In a snake game,
computing this signal requires only one line of arithmetic: the change
in Manhattan distance between the head and the apple from one step to the
next.

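The arithmetic can be sketched as follows. The helper names ``manhattan``
and ``shaping_reward`` and the ``(x, y)`` tuples are illustrative
assumptions, not part of ``retro-gamer``.

.. code-block:: python

    def manhattan(a, b):
        """Manhattan distance between two (x, y) cells."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])


    def shaping_reward(head_before, head_after, apple):
        """Positive if the step moved the head closer to the apple, negative if farther."""
        return manhattan(head_before, apple) - manhattan(head_after, apple)


    apple = (5, 5)
    toward = shaping_reward((2, 5), (3, 5), apple)  # one cell closer
    away = shaping_reward((2, 5), (1, 5), apple)    # one cell farther

When the head moves exactly one cell per step in one of four directions,
the Manhattan distance always changes by exactly one, so the result is
always +1 or −1.
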
Note that the shaped ±1 signal is a *proxy* for the real objective. If the
agent learns to follow it too literally, it may take direct paths that run
through its own body. The −10 death penalty and +50 apple reward are still
necessary; the shaping only accelerates early learning.


Exploration ends before learning is complete
---------------------------------------------

**Symptoms**

The ``epsilon`` column in the training log reaches ``epsilon_min`` well
before training is finished. After that point, ``avg_reward`` stops
improving even though many episodes remain. When you watch the agent play,
it commits to the same strategy regardless of what is happening on the
board.

**Why this happens**

Epsilon controls the balance between exploration (random actions) and
exploitation (using the learned policy). Early in training, when the
Q-network has seen little data, exploration is essential: the agent needs
to try different things to accumulate the varied experiences that make
Q-value estimates reliable. Once epsilon reaches its minimum, the agent
explores only rarely (5% of actions with the default ``epsilon_min`` of
``0.05``) and otherwise commits to whatever policy it has learned so far.

If ``training_episodes`` is too small relative to ``epsilon_decay``, the
exploration phase ends while the Q-network is still unreliable. The agent
then exploits a half-learned policy that improves slowly, if at all,
because it rarely tries anything new.

You can calculate when epsilon will reach its minimum. The formula assumes
that epsilon is multiplied by ``epsilon_decay`` once per episode:

.. code-block:: python

    import math

    # Defaults quoted below; substitute the values from your own config.
    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999
    episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)

With the defaults (``epsilon = 1.0``, ``epsilon_min = 0.05``,
``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. The
agent should have substantial training time *after* the exploration phase
ends, so ``training_episodes`` should be at least several times this
number.

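To compare several settings at once, the same calculation can be wrapped in
a small helper. This is a sketch under the same assumption as the formula
(one multiplicative decay per episode); ``episodes_until_min`` is a
hypothetical name, and the decay values other than ``0.999`` are chosen
only for illustration.

.. code-block:: python

    import math


    def episodes_until_min(epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
        """Episodes until epsilon * decay**n first drops to epsilon_min."""
        if not 0.0 < epsilon_decay < 1.0:
            raise ValueError("epsilon_decay must be between 0 and 1")
        if epsilon <= epsilon_min:
            return 0
        return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))


    for decay in (0.99, 0.999, 0.9995):
        print(decay, episodes_until_min(epsilon_decay=decay))

Comparing the printed episode counts against ``training_episodes`` shows
how much of the run is spent after exploration has bottomed out. Under this
assumption, a decay closer to 1 lengthens the exploration phase.
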
**How to fix it**

Increase ``training_episodes`` so that the agent has many episodes of
exploitation after the exploration phase ends. For simple games on small
boards, 10,000 episodes is a reasonable starting point; for complex games
or large boards, 50,000–100,000 may be needed.

This is always safe to change. Because ``training_episodes`` does not
affect the network architecture or the reward signal, you can increase it
in ``config.toml`` and resume training from the latest checkpoint without
starting fresh.


Death penalty dominates all other signals
-------------------------------------------

**Symptoms**

After a period of training, the agent survives for many steps but rarely
or never scores. It tends to circle, hug walls, or otherwise avoid the
goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
persistently negative. The agent behaves as if staying alive is the only
objective.

**Why this happens**

When the penalty for dying is much larger than any other reward in the
game, the Q-network learns that staying alive is overwhelmingly the most
important thing to do. Scoring, which requires taking some risk,
becomes unattractive because a single death outweighs many successful
goal-reaching events.

For example, if the death penalty is −1000 and each successful apple is
+50, then dying once costs the equivalent of twenty apples. The agent
learns that the safest strategy is to avoid risk entirely, even if that
means never eating. From the Q-network's perspective, this is rational:
it is correctly optimizing the reward signal you gave it.

**How to fix it**

Keep reward magnitudes within a couple of orders of magnitude of each
other. If per-step shaping gives ±1 and the goal reward is +50, a death
penalty of −10 is appropriate: death is clearly bad (ten times worse than
a bad step) but not so catastrophic that it crowds out everything else.
As a rule of thumb, keep the death penalty to no more than ten to twenty
times the typical per-step reward, and no larger than the goal reward.

If the large death penalty was meant to make the agent care more about
long-term consequences, increasing ``gamma`` (the discount factor) is a
better way to do that. A higher gamma causes future rewards, including
the eventual death penalty, to count more heavily in the agent's current
decisions, without distorting the relative scale of the rewards.


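The effect of ``gamma`` can be made concrete with the standard definition
of discounting, in which a reward ``k`` steps in the future is weighted by
``gamma ** k``. The sketch below is illustrative only: the values ``0.9``
and ``0.99`` and the 20-step horizon are assumptions, not ``retro-gamer``
defaults.

.. code-block:: python

    def discounted_weight(steps_ahead, gamma):
        """Weight that a reward `steps_ahead` steps in the future carries today."""
        return gamma ** steps_ahead


    death_penalty = -10
    for gamma in (0.9, 0.99):
        felt_now = death_penalty * discounted_weight(20, gamma)
        print(f"gamma={gamma}: a death 20 steps away counts as {felt_now:.2f} now")

With the lower gamma, a death 20 steps away is discounted to a small
fraction of its face value; with the higher gamma, most of the penalty is
still felt, even though the penalty itself stays at −10.
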
Reward signal and human score interfere with each other
---------------------------------------------------------

**Symptoms**

Human players see scores that go negative, or that include penalties and
adjustments that make no sense in the context of a normal game. Conversely,
adjustments made to improve training (removing a per-step shaping penalty,
changing a death penalty) change the game's visible score in ways that
affect the experience for human players.

**Why this happens**

Using the same state variable for both the training reward and the
human-visible score conflates two separate concerns. Training rewards
benefit from shaping: intermediate signals like "moved toward the goal"
and "died" that accelerate learning. Scores for human players should
reflect only the game's actual objectives (apples eaten, enemies defeated,
distance covered) so that they are legible and motivating.

When these are the same variable, every design decision about one
necessarily affects the other.

**How to fix it**

Use two separate keys in the game's state dictionary: one for the
human-facing score (updated only by meaningful in-game events) and one
for the training reward (updated every step with shaping signals and
penalties). In the game code:

.. code-block:: python

    # Fragment of the game's per-step update. old_dist, new_dist, ate_apple,
    # and died are assumed to be computed earlier in the step.

    # Updated every step; used only by the trainer.
    game.state['reward'] += old_dist - new_dist  # +1 toward apple, -1 away

    if ate_apple:
        # Only updated when the snake eats an apple; clean for human players.
        game.state['score'] += 50
        game.state['reward'] += 50  # also reward eating

    if died:
        game.state['reward'] -= 10  # death penalty

Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
``pyproject.toml`` so the trainer watches the right key. The score display
remains clean for human players, and you can adjust the training reward
freely without affecting it.

Note that changing the ``reward`` key is an incompatible change: existing
checkpoints trained on the old signal will be rejected when you try to
resume. Run ``retro-gamer clean`` and start fresh after making this change.