Troubleshooting
===============

This section describes problems that commonly arise when training an agent
with ``retro-gamer``. Each entry names the issue, describes what you will
see in the training log or when watching the agent play, explains what is
happening in terms of the underlying reinforcement learning, and suggests
how to fix it.

.. contents:: Issues
   :local:
   :depth: 1


Loss grows rapidly over training
---------------------------------

**Symptoms**

The ``avg_loss`` column in the training log grows steadily from one
checkpoint to the next, often at an accelerating rate::

    [ep_0100] avg_loss=22.2
    [ep_0200] avg_loss=128.5
    [ep_0300] avg_loss=2918.5
    [ep_0400] avg_loss=163825.1

Left unchecked, the loss eventually reaches extreme values and the agent's
behavior becomes erratic or degenerates entirely.

**Why this happens**

This is called *Q-value divergence*. The Q-network is trained to predict
the total future reward of each action. To do that, it computes a *target*
for each prediction, but the target itself is computed using the
Q-network's own current predictions. This creates a feedback loop: if
the predictions are slightly off, the targets drift, which makes the next
predictions slightly more off, which drifts the targets further.

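The feedback loop is easiest to see in the target computation itself. The
following is a minimal sketch of a one-step Q-learning target, not the code
``retro-gamer`` actually runs; the function name ``td_target`` and its
parameters are illustrative assumptions.

.. code-block:: python

   def td_target(reward, next_q_values, gamma=0.99, done=False):
       """One-step Q-learning target: reward plus discounted best next value."""
       if done:
           # Terminal transition: there is no future value to bootstrap from.
           return reward
       # next_q_values are the network's own predictions for the next state,
       # so any error in them flows straight into the target.
       return reward + gamma * max(next_q_values)


   # If the predictions for the next state are inflated, so is the target.
   accurate = td_target(1.0, [2.0, 3.0], gamma=0.9)
   inflated = td_target(1.0, [20.0, 30.0], gamma=0.9)

Training then pulls the prediction toward the inflated target, which in turn
raises the next round of targets.
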
Under normal conditions, the learning rate is small enough and the target
network stable enough that this loop stays controlled. Divergence typically
happens when the learning rate is too high, so that each update overshoots.
The problem is amplified by larger networks (more parameters to overshoot)
and by prioritized experience replay, which deliberately samples the
experiences the network is most wrong about, which are exactly the
experiences most likely to destabilize it.

**How to fix it**

Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction
(for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize
training. If you recently increased the size of the network (via
``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning
rate than you used before is likely necessary: larger, more capable
networks need smaller, more careful updates.

Also consider increasing ``target_update_freq``, the number of steps
between target-network updates. The target network is a frozen copy of the
Q-network used to compute stable training targets; the less frequently it
is updated, the more stable those targets are. The default is 200 steps;
raising it to 500 or 1000 slows learning slightly but reduces the chance
of divergence.

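As an illustration of what this setting controls, here is a minimal sketch
of a periodic hard update, assuming ``target_update_freq`` counts training
steps between copies. The function ``maybe_sync_target`` and the plain-dict
weights are hypothetical stand-ins, not part of ``retro-gamer``.

.. code-block:: python

   import copy


   def maybe_sync_target(step, online_weights, target_weights,
                         target_update_freq=200):
       """Return the weights the target network should use after this step."""
       if step > 0 and step % target_update_freq == 0:
           # Hard update: the target becomes a frozen copy of the online network.
           return copy.deepcopy(online_weights)
       # Between updates the target stays fixed, so training targets hold still.
       return target_weights


   online = {"w": [0.5, -0.2]}
   target = {"w": [0.0, 0.0]}
   target = maybe_sync_target(199, online, target)  # unchanged
   target = maybe_sync_target(200, online, target)  # copied from online

A larger ``target_update_freq`` means the copy happens less often, so the
targets change less often.
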
Because divergence compounds over many episodes, a run that has begun
diverging cannot simply be resumed with a lower learning rate: the
weights have already drifted far from useful values. Use
``retro-gamer clean`` to remove the existing checkpoints and start fresh.


Agent ignores some actions entirely
-------------------------------------

**Symptoms**

After training, the agent never (or almost never) turns in certain
directions, regardless of the board state. If you compare checkpoints at
different stages of training, the missing directions are absent from the
very beginning and never appear. The agent may survive for a while, but it
only ever moves in a subset of the possible directions.

**Why this happens**

If some actions lead to immediate death every time they are tried early in
training, the Q-network quickly learns to assign them very low values.
This is correct in the specific situation where those actions are always
fatal, but the network then generalizes that association across *all*
board positions, even positions where those actions would be safe.

A common cause is a fixed starting position at the edge or corner of the
board. A snake that always starts in the top-left corner and always begins
moving downward will die immediately whenever it turns up or left in the
first step. After thousands of early episodes where those actions produce
instant death, the network has seen so much evidence that "turn left →
die" and "turn up → die" that it assigns them low Q-values everywhere.

**How to fix it**

Make sure the game's starting conditions give the agent a chance to try
every action safely. For a snake game, this means randomizing both the
starting position (keeping at least one cell away from every edge) and
the starting direction at the beginning of each episode. An agent that
starts in different places and orientations each time will quickly learn
that all four directions can be appropriate depending on context.

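A minimal sketch of such a randomized start, assuming a rectangular grid
addressed by ``(x, y)`` with the origin in the top-left corner. The names
``random_start`` and ``DIRECTIONS`` are illustrative and do not come from
``retro-gamer``.

.. code-block:: python

   import random

   # Unit steps for the four directions as (dx, dy); y grows downward.
   DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


   def random_start(width, height, rng=random):
       """Pick a start cell at least one cell from every edge, plus a direction."""
       if width < 3 or height < 3:
           raise ValueError("board must be at least 3x3 to leave a one-cell margin")
       x = rng.randint(1, width - 2)   # randint is inclusive on both ends
       y = rng.randint(1, height - 2)
       direction = rng.choice(sorted(DIRECTIONS))
       return (x, y), direction


   position, direction = random_start(20, 20, random.Random(0))

Calling this at the start of every episode means no single direction is
fatal on the first step every time.
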

Agent survives but never moves toward the goal
-----------------------------------------------

**Symptoms**

The ``avg_steps`` column in the training log increases steadily (the
agent is surviving longer) but ``avg_reward`` stays negative or barely
improves. When you watch the agent play, it wanders around the board
without ever approaching the target object. Episodes end because the
agent runs into a wall, not because it reached the goal.

**Why this happens**

The reward signal is *asymmetric*: it penalizes moving away from the goal
but gives no reward for moving toward it. With this signal, the agent
learns to stay alive and to limit the penalty, but it has no positive
signal pulling it toward the goal. The eventual goal-reaching reward
(eating the apple, reaching the exit, etc.) is too rare, especially
early in training when the agent is mostly acting randomly, to provide
a meaningful learning signal on its own.

From the Q-network's perspective, no direction ever looks good, only less
bad: moving toward the goal earns 0 and moving away earns −1. On a large
board, the probability of eating the apple by chance is small enough that
the network may never see the positive goal reward at all during the
exploration phase.

**How to fix it**

Make the distance-based reward symmetric: give **+1 for moving toward the
goal** and **−1 for moving away**. This way, every single step provides a
meaningful signal in the correct direction, and the agent does not need to
reach the goal by chance in order to start learning. In a snake game,
computing this signal requires only one line of arithmetic: the change
in Manhattan distance between the head and the apple from one step to the
next.

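The arithmetic can be sketched as follows, assuming positions are ``(x, y)``
grid cells and the snake moves one cell per step. The helper names
``manhattan`` and ``shaping_reward`` are illustrative, not part of
``retro-gamer``.

.. code-block:: python

   def manhattan(a, b):
       """Manhattan (taxicab) distance between two (x, y) cells."""
       return abs(a[0] - b[0]) + abs(a[1] - b[1])


   def shaping_reward(old_head, new_head, apple):
       """Change in distance to the apple: +1 for a step closer, -1 for a step away."""
       return manhattan(old_head, apple) - manhattan(new_head, apple)


   apple = (5, 5)
   toward = shaping_reward((2, 5), (3, 5), apple)  # one cell closer
   away = shaping_reward((2, 5), (1, 5), apple)    # one cell farther

For a one-cell orthogonal move the result is always +1 or −1, which matches
the symmetric signal described above.
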
Note that the shaped ±1 signal is a *proxy* for the real objective. If the
agent learns to follow it too literally, it may take direct paths that run
through its own body. The −10 death penalty and +50 apple reward are still
necessary; the shaping only accelerates early learning.


Exploration ends before learning is complete
---------------------------------------------

**Symptoms**

The ``epsilon`` column in the training log reaches ``epsilon_min`` well
before training is finished. After that point, ``avg_reward`` stops
improving even though many episodes remain. When you watch the agent play,
it commits to the same strategy regardless of what is happening on the
board.

**Why this happens**

Epsilon controls the balance between exploration (random actions) and
exploitation (using the learned policy). Early in training, when the
Q-network has seen little data, exploration is essential: the agent needs
to try different things to accumulate the varied experiences that make
Q-value estimates reliable. Once epsilon reaches its minimum, the agent
explores only occasionally (5% of the time with the default
``epsilon_min``) and otherwise commits to whatever policy it has learned
so far.

If ``training_episodes`` is too small relative to ``epsilon_decay``, the
exploration phase ends while the Q-network is still unreliable. The agent
then exploits a half-learned policy that struggles to improve because it
rarely tries anything new.

Assuming epsilon is multiplied by ``epsilon_decay`` once per episode, you
can calculate when it will reach its minimum:

.. code-block:: python

   import math

   epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999  # defaults
   # Number of episodes until epsilon decays to epsilon_min.
   episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)

With the defaults (``epsilon = 1.0``, ``epsilon_min = 0.05``,
``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. The
agent should have substantial training time *after* the exploration phase
ends, so ``training_episodes`` should be at least several times this
number.

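To judge whether a given run leaves enough room after exploration, the same
formula can be turned into a quick check. This is a sketch under the
assumption that epsilon is multiplied by ``epsilon_decay`` once per episode;
the helper names ``exploration_episodes`` and ``exploration_share`` are
hypothetical.

.. code-block:: python

   import math


   def exploration_episodes(epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
       """Episodes until epsilon first drops to epsilon_min (rounded up)."""
       return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))


   def exploration_share(training_episodes, **schedule):
       """Fraction of the run spent before epsilon reaches its minimum."""
       return min(1.0, exploration_episodes(**schedule) / training_episodes)


   short_run = exploration_share(4_000)   # most of the run is exploration
   long_run = exploration_share(15_000)   # exploration is a small part

A share close to 1.0 means the run ends almost as soon as exploration does,
leaving little time to refine the learned policy.
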
**How to fix it**

Increase ``training_episodes`` so that the agent has many episodes of
exploitation after the exploration phase ends. For simple games on small
boards, 10,000 episodes is a reasonable starting point; for complex games
or large boards, 50,000–100,000 may be needed.

This is always safe to change. Because ``training_episodes`` does not
affect the network architecture or the reward signal, you can increase it
in ``config.toml`` and resume training from the latest checkpoint without
starting fresh.


Death penalty dominates all other signals
-------------------------------------------

**Symptoms**

After a period of training, the agent survives for many steps but rarely
or never scores. It tends to circle, hug walls, or otherwise avoid the
goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains
persistently negative. The agent behaves as if staying alive is the only
objective.

**Why this happens**

When the penalty for dying is much larger than any other reward in the
game, the Q-network learns that staying alive is overwhelmingly the most
important thing to do. Scoring, which requires taking some risk,
becomes unattractive because a single death outweighs many successful
goal-reaching events.

For example, if the death penalty is −1000 and each successful apple is
+50, then dying once costs the equivalent of twenty apples. The agent
learns that the safest strategy is to avoid risk entirely, even if that
means never eating. From the Q-network's perspective, this is rational:
it is correctly optimizing the reward signal you gave it.

**How to fix it**

Keep reward magnitudes within one or two orders of magnitude of each
other. If per-step shaping gives ±1 and the goal reward is +50, a death
penalty of −10 is appropriate: death is clearly bad (ten times worse than
a bad step) but not so catastrophic that it crowds out everything else.
As a rule of thumb, no single penalty should be more than ten to twenty
times larger than the typical per-step reward.

If the death penalty was raised to make the agent more cautious about the
future, increasing ``gamma`` (the discount factor) is a better way to make
the agent care more about long-term consequences. A higher gamma causes
future rewards, including the eventual death penalty, to count more
heavily in the agent's current decisions, without distorting the relative
scale of the rewards.

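The effect of ``gamma`` can be seen with a short calculation. This sketch
assumes standard exponential discounting; the function name ``discounted``
and the two gamma values are chosen only for illustration.

.. code-block:: python

   def discounted(reward, steps_ahead, gamma):
       """Present value of a reward received `steps_ahead` steps in the future."""
       return reward * gamma ** steps_ahead


   death_penalty = -10.0
   # The same penalty, 20 steps away, seen through two discount factors.
   short_sighted = discounted(death_penalty, 20, gamma=0.90)  # about -1.2
   far_sighted = discounted(death_penalty, 20, gamma=0.99)    # about -8.2

With the higher gamma the distant penalty keeps most of its weight, even
though the penalty itself is still −10.
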

Reward signal and human score interfere with each other
---------------------------------------------------------

**Symptoms**

Human players see scores that go negative, or that include penalties and
adjustments that make no sense in the context of a normal game. Conversely,
adjustments made to improve training (removing a per-step shaping penalty,
changing a death penalty) change the game's visible score in ways that
affect the experience for human players.

**Why this happens**

Using the same state variable for both the training reward and the
human-visible score conflates two separate concerns. Training rewards
benefit from shaping: intermediate signals like "moved toward the goal"
and "died" that accelerate learning. Scores for human players should
reflect only the game's actual objectives (apples eaten, enemies defeated,
distance covered) so that they are legible and motivating.

When these are the same variable, every design decision about one
necessarily affects the other.

**How to fix it**

Use two separate keys in the game's state dictionary: one for the
human-facing score (updated only by meaningful in-game events) and one
for the training reward (updated every step with shaping signals and
penalties). In the game code:

.. code-block:: python

   # ate_apple and died are placeholder flags set by your own game logic.
   # Human-facing score: changes only when the snake eats an apple.
   if ate_apple:
       game.state['score'] += 50

   # Training reward: updated every step, used only by the trainer.
   game.state['reward'] += old_dist - new_dist  # +1 toward apple, -1 away
   if ate_apple:
       game.state['reward'] += 50  # also reward eating
   if died:
       game.state['reward'] -= 10  # death penalty

Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of
``pyproject.toml`` so the trainer watches the right key. The score display
remains clean for human players, and you can adjust the training reward
freely without affecting it.

Note that changing the ``reward`` key is an incompatible change: existing
checkpoints trained on the old signal will be rejected when you try to
resume. Run ``retro-gamer clean`` and start fresh after making this change.