Troubleshooting
===============

This section describes problems that commonly arise when training an agent with ``retro-gamer``. Each entry names the issue, describes what you will see in the training log or when watching the agent play, explains what is happening in terms of the underlying reinforcement learning, and suggests how to fix it.

.. contents:: Issues
   :local:
   :depth: 1

Loss grows rapidly over training
---------------------------------

**Symptoms**

The ``avg_loss`` column in the training log grows steadily from one checkpoint to the next, often at an accelerating rate::

    [ep_0100] avg_loss=22.2
    [ep_0200] avg_loss=128.5
    [ep_0300] avg_loss=2918.5
    [ep_0400] avg_loss=163825.1

Left unchecked, the loss eventually reaches extreme values and the agent's behavior becomes erratic or degenerates entirely.

**Why this happens**

This is called *Q-value divergence*. The Q-network is trained to predict the total future reward of each action. To do that, it computes a *target* for each prediction, but the target itself is computed using the Q-network's own current predictions. This creates a feedback loop: if the predictions are slightly off, the targets drift, which makes the next predictions slightly more off, which drifts the targets further.

Under normal conditions, the learning rate is small enough and the target network stable enough that this loop stays controlled. Divergence happens when the learning rate is too high, causing each update to overshoot. The problem is amplified by larger networks (more parameters to overshoot) and by prioritized experience replay, which deliberately samples the experiences the network is most wrong about, and those are exactly the experiences most likely to destabilize it.

**How to fix it**

Reduce ``learning_rate`` in ``config.toml``. A factor-of-ten reduction (for example, from ``0.001`` to ``0.0001``) is usually enough to stabilize training.
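
To catch divergence before it wastes a long run, the log lines shown under Symptoms can be checked automatically. The following is a minimal sketch, not part of ``retro-gamer``: the function names, the growth ``factor``, and the ``streak`` length are illustrative assumptions, and the pattern assumes ``avg_loss`` is printed in plain decimal notation on the same line as the ``[ep_NNNN]`` tag.

.. code-block:: python

    import re

    # Assumed log shape: "[ep_0100] ... avg_loss=22.2" on a single line.
    LOG_LINE = re.compile(r"\[ep_(\d+)\].*?avg_loss=([0-9.]+)")

    def parse_losses(log_text):
        """Return (episode, avg_loss) pairs found in the log text."""
        return [(int(ep), float(loss)) for ep, loss in LOG_LINE.findall(log_text)]

    def looks_divergent(losses, factor=2.0, streak=3):
        """True if avg_loss grew by at least `factor` for `streak` checkpoints in a row."""
        run = 0
        for (_, prev), (_, cur) in zip(losses, losses[1:]):
            run = run + 1 if cur >= factor * prev else 0
            if run >= streak:
                return True
        return False

    sample = """
    [ep_0100] avg_loss=22.2
    [ep_0200] avg_loss=128.5
    [ep_0300] avg_loss=2918.5
    [ep_0400] avg_loss=163825.1
    """
    losses = parse_losses(sample)
    print(looks_divergent(losses))

A loss that fluctuates or falls between checkpoints resets the streak, so ordinary training noise should not trigger the check; tune ``factor`` and ``streak`` to your own runs.
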
If you recently increased the size of the network (via ``hidden_sizes``) or enabled ``prioritize_experiences``, a lower learning rate than you used before is likely necessary: larger, more capable networks need smaller, more careful updates.

Also consider increasing ``target_update_freq``. The target network is a frozen copy of the Q-network used to compute stable training targets; the less frequently it is updated, the more stable those targets are. The default is 200 steps; raising it to 500 or 1000 slows learning slightly but reduces the chance of divergence.

Because divergence compounds over many episodes, a run that has begun diverging cannot simply be resumed with a lower learning rate: the weights have already drifted far from useful values. Use ``retro-gamer clean`` to remove the existing checkpoints and start fresh.

Agent ignores some actions entirely
-------------------------------------

**Symptoms**

After training, the agent never (or almost never) turns in certain directions, regardless of the board state. If you compare checkpoints at different stages of training, the missing directions are absent from the very beginning and never appear. The agent may survive for a while but always move in only a subset of the possible directions.

**Why this happens**

If some actions lead to immediate death every time they are tried early in training, the Q-network quickly learns to assign them very low values. This is correct in the specific situation where those actions are always fatal, but the network then generalizes that association across *all* board positions, even positions where those actions would be safe.

A common cause is a fixed starting position at the edge or corner of the board. A snake that always starts in the top-left corner and always begins moving downward will die immediately whenever it turns up or left in the first step.
After thousands of early episodes where those actions produce instant death, the network has seen so much evidence that "turn left → die" and "turn up → die" that it assigns them low Q-values everywhere.

**How to fix it**

Make sure the game's starting conditions give the agent a chance to try every action safely. For a snake game, this means randomizing both the starting position (keeping at least one cell away from every edge) and the starting direction at the beginning of each episode. An agent that starts in different places and orientations each time will quickly learn that all four directions can be appropriate depending on context.

Agent survives but never moves toward the goal
-----------------------------------------------

**Symptoms**

The ``avg_steps`` column in the training log increases steadily (the agent is surviving longer), but ``avg_reward`` stays negative or barely improves. When you watch the agent play, it wanders around the board without ever approaching the target object. Episodes end because the agent runs into a wall, not because it reached the goal.

**Why this happens**

The reward signal is *asymmetric*: it penalizes moving away from the goal but gives no reward for moving toward it. With this signal, the agent learns to avoid penalties and stay alive, but nothing positive points it in the right direction. The eventual goal-reaching reward (eating the apple, reaching the exit, etc.) is too rare, especially early in training when the agent is mostly acting randomly, to provide a meaningful learning signal on its own.

From the Q-network's perspective, progress is never rewarded: moving toward the goal earns 0 and moving away earns −1, so the best per-step outcome is merely the absence of a penalty. On a large board, the probability of eating the apple by chance is small enough that the network may never see the positive terminal reward at all during the exploration phase.

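
The effect of the one-sided signal described above can be seen in a small sketch. It assumes positions are ``(x, y)`` grid cells and that "toward" and "away" are judged by Manhattan distance; the function names and example paths are hypothetical and not part of ``retro-gamer``.

.. code-block:: python

    def manhattan(a, b):
        """Manhattan distance between two (x, y) grid cells."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def one_sided_reward(old_head, new_head, apple):
        """-1 for a step that moves away from the apple, 0 otherwise."""
        return -1 if manhattan(new_head, apple) > manhattan(old_head, apple) else 0

    def total_reward(path, apple):
        """Sum the per-step reward over consecutive cells of a path."""
        return sum(one_sided_reward(a, b, apple) for a, b in zip(path, path[1:]))

    apple = (4, 0)
    direct = [(0, 0), (1, 0), (2, 0), (3, 0)]  # every step approaches the apple
    wander = [(0, 0), (0, 1), (1, 1), (1, 2)]  # two of three steps move away

    print(total_reward(direct, apple))  # 0
    print(total_reward(wander, apple))  # -2

Even a perfectly direct approach totals 0 under this signal, so the per-step rewards alone can never push ``avg_reward`` above zero; only the rare goal reward can.
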

**How to fix it**

Make the distance-based reward symmetric: give **+1 for moving toward the goal** and **−1 for moving away**. This way, every single step provides a meaningful signal in the correct direction, and the agent does not need to reach the goal by chance in order to start learning.

In a snake game, computing this signal requires only one line of arithmetic: the change in Manhattan distance between the head and the apple from one step to the next.

Note that the shaped ±1 signal is a *proxy* for the real objective. If the agent learns to follow it too literally, it may take direct paths that run through its own body. The −10 death penalty and +50 apple reward are still necessary; the shaping only accelerates early learning.

Exploration ends before learning is complete
---------------------------------------------

**Symptoms**

The ``epsilon`` column in the training log reaches ``epsilon_min`` well before training is finished. After that point, ``avg_reward`` stops improving even though many episodes remain. When you watch the agent play, it commits to the same strategy regardless of what is happening on the board.

**Why this happens**

Epsilon controls the balance between exploration (random actions) and exploitation (using the learned policy). Early in training, when the Q-network has seen little data, exploration is essential: the agent needs to try different things to accumulate the varied experiences that make Q-value estimates reliable. Once epsilon reaches its minimum, the agent takes a random action only a small fraction of the time (``epsilon_min``) and otherwise commits to whatever policy it has learned so far.

If ``training_episodes`` is too small relative to ``epsilon_decay``, the exploration phase ends while the Q-network is still unreliable. The agent then exploits a half-learned policy that struggles to improve because it rarely tries anything new.

You can calculate when epsilon will reach its minimum:

.. code-block:: python

    import math

    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.999  # defaults
    episodes = math.log(epsilon_min / epsilon) / math.log(epsilon_decay)  # about 2,994

With the defaults (``epsilon = 1.0``, ``epsilon_min = 0.05``, ``epsilon_decay = 0.999``), this comes to roughly 3,000 episodes. The agent should have substantial training time *after* the exploration phase ends, so ``training_episodes`` should be at least several times this number.

**How to fix it**

Increase ``training_episodes`` so that the agent has many episodes of exploitation after the exploration phase ends. For simple games on small boards, 10,000 episodes is a reasonable starting point; for complex games or large boards, 50,000–100,000 may be needed.

This is always safe to change. Because ``training_episodes`` does not affect the network architecture or the reward signal, you can increase it in ``config.toml`` and resume training from the latest checkpoint without starting fresh.

Death penalty dominates all other signals
-------------------------------------------

**Symptoms**

After a period of training, the agent survives for many steps but rarely or never scores. It tends to circle, hug walls, or otherwise avoid the goal object entirely. ``avg_steps`` is high but ``avg_reward`` remains persistently negative. The agent behaves as if staying alive is the only objective.

**Why this happens**

When the penalty for dying is much larger than any other reward in the game, the Q-network learns that staying alive is overwhelmingly the most important thing to do. Scoring, which requires taking some risk, becomes unattractive because a single death outweighs many successful goal-reaching events.

For example, if the death penalty is −1000 and each successful apple is +50, then dying once costs the equivalent of twenty apples. The agent learns that the safest strategy is to avoid risk entirely, even if that means never eating. From the Q-network's perspective, this is rational: it is correctly optimizing the reward signal you gave it.

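
The arithmetic behind this trade-off can be made concrete with a rough sketch. It treats one attempt at the apple as a single gamble with death probability ``p``, assumes that not attempting earns nothing, and ignores discounting and per-step shaping; the function name is hypothetical and the model is a deliberate simplification, not how the trainer reasons internally.

.. code-block:: python

    def break_even_death_risk(goal_reward, death_penalty):
        """Largest death probability p at which attempting the goal still has
        non-negative expected reward: (1 - p) * goal_reward - p * death_penalty >= 0.
        """
        return goal_reward / (goal_reward + death_penalty)

    for penalty in (1000, 10):
        p = break_even_death_risk(goal_reward=50, death_penalty=penalty)
        print(f"death penalty -{penalty}: attempt pays off if death risk is below {p:.1%}")

Under these assumptions, a −1000 penalty makes an attempt worthwhile only when the risk of dying is below about 4.8%, while a −10 penalty tolerates a risk of up to about 83.3%.
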

**How to fix it**

Keep reward magnitudes within a comparable range. If per-step shaping gives ±1 and the goal reward is +50, a death penalty of −10 is appropriate: death is clearly bad (ten times worse than a bad step) but not so catastrophic that it crowds out everything else. As a rule of thumb, no single penalty should be more than ten to twenty times larger than the typical per-step reward.

If you want the agent to care more about long-term consequences, increase ``gamma`` (the discount factor) instead of inflating the death penalty. A higher gamma causes future rewards, including the eventual death penalty, to count more heavily in the agent's current decisions, without distorting the relative scale of the rewards.

Reward signal and human score interfere with each other
---------------------------------------------------------

**Symptoms**

Human players see scores that go negative, or that include penalties and adjustments that make no sense in the context of a normal game. Conversely, adjustments made to improve training (removing a per-step shaping penalty, changing a death penalty) change the game's visible score in ways that affect the experience for human players.

**Why this happens**

Using the same state variable for both the training reward and the human-visible score conflates two separate concerns. Training rewards benefit from shaping: intermediate signals like "moved toward the goal" and "died" that accelerate learning. Scores for human players should reflect only the game's actual objectives (apples eaten, enemies defeated, distance covered) so that they are legible and motivating. When these are the same variable, every design decision about one necessarily affects the other.

**How to fix it**

Use two separate keys in the game's state dictionary: one for the human-facing score (updated only by meaningful in-game events) and one for the training reward (updated every step with shaping signals and penalties). In the game code, each statement below belongs in the part of the game loop that handles the event named in its comment:

.. code-block:: python

    # Human-facing score: updated only when the snake eats an apple.
    game.state['score'] += 50

    # Training reward: read only by the trainer.
    game.state['reward'] += old_dist - new_dist  # every step: +1 toward apple, -1 away
    game.state['reward'] += 50  # when the snake eats an apple
    game.state['reward'] -= 10  # when the snake dies

Then set ``reward = "reward"`` in the ``[tool.retro-gamer]`` section of ``pyproject.toml`` so the trainer watches the right key. The score display remains clean for human players, and you can adjust the training reward freely without affecting it.

Note that changing the ``reward`` key is an incompatible change: existing checkpoints trained on the old signal will be rejected when you try to resume. Run ``retro-gamer clean`` and start fresh after making this change.
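
The two-key pattern can also be sketched as a self-contained function. This is an illustration under stated assumptions rather than the ``retro-gamer`` API: ``update_state``, ``manhattan``, and the ``ate_apple`` and ``died`` flags are hypothetical names, positions are assumed to be ``(x, y)`` grid cells, and a plain dictionary stands in for ``game.state``.

.. code-block:: python

    def manhattan(a, b):
        """Manhattan distance between two (x, y) grid cells."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def update_state(state, old_head, new_head, apple, ate_apple, died):
        """Update the human-facing score and the training reward for one step."""
        # Shaping: +1 when the head moved closer to the apple, -1 when it moved away.
        state["reward"] += manhattan(old_head, apple) - manhattan(new_head, apple)
        if ate_apple:
            state["score"] += 50   # visible to human players
            state["reward"] += 50  # trainer-only goal reward
        if died:
            state["reward"] -= 10  # trainer-only death penalty
        return state

    state = {"score": 0, "reward": 0}
    update_state(state, (2, 2), (3, 2), apple=(5, 2), ate_apple=False, died=False)
    update_state(state, (3, 2), (3, 3), apple=(5, 2), ate_apple=False, died=False)
    update_state(state, (3, 3), (3, 4), apple=(5, 2), ate_apple=False, died=True)
    print(state)  # {'score': 0, 'reward': -11}

The score stays at 0 because no apple was eaten, while the training reward records one step toward the apple, two steps away, and the death penalty.
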