lab_reinforcement_learning/snake_training.md

# Snake Training: Conceptual Questions

Answer each question in the space provided. Use evidence from the training log
and your observations of the agent at different checkpoints to support your answers.

---

## 1. Feature selection

In the first training attempt, the agent received the full 32×16 game board as
its input (6 values per cell × 32 × 16 = 3,072 numbers). The agent could see every
character on the board, yet even after 45,000 episodes it had not learned to
reliably find the apple.

When we added `apple_dx` and `apple_dy` — two numbers that encode the direction
from the snake's head to the apple — performance improved dramatically within
hundreds of episodes.
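
As a concrete picture of what these two features might contain, here is a minimal sketch. It assumes `apple_dx` and `apple_dy` are the sign of the head-to-apple offset on each axis; the lab's actual encoding may differ (for example, a normalized distance). The function name `apple_direction` and the (x, y) coordinate convention are illustrative and not taken from the lab code.

```python
def sign(v: int) -> int:
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def apple_direction(head: tuple[int, int], apple: tuple[int, int]) -> tuple[int, int]:
    """Hypothetical encoding: direction from the head to the apple on each axis.

    Positions are (x, y) with x growing rightward and y growing downward
    (an assumption, not confirmed by the lab code).
    """
    apple_dx = sign(apple[0] - head[0])
    apple_dy = sign(apple[1] - head[1])
    return apple_dx, apple_dy


# Head at column 5, row 10; apple at column 20, row 3.
print(apple_direction((5, 10), (20, 3)))  # (1, -1): apple is to the right and above
```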

**Why didn't the board encoding help the agent find the apple? What did the two
new features provide that the board encoding could not?**

*Your answer:*

---

## 2. Dimensionality reduction

In the full-board experiment, the agent processed 3,072 input values. When we
switched to the egocentric view (a 17×17 window centered on the snake's head),
the board input shrank to 17 × 17 × 6 = 1,734 values.
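
The cropping step can be sketched as follows. This is an illustration only: it works on a single-layer board of characters rather than the 6-value encoding, assumes the board is 32 columns by 16 rows, and pads cells outside the board with a hypothetical wall character `#`. The function name `egocentric_window` is not from the lab code.

```python
def egocentric_window(board, head, size=17, pad="#"):
    """Return a size x size window of `board` centered on `head`.

    `board` is a list of equal-length strings (rows); `head` is (row, col).
    Cells outside the board are filled with `pad` (a hypothetical wall
    character; the lab's actual padding may differ).
    """
    half = size // 2
    rows, cols = len(board), len(board[0])
    head_r, head_c = head
    window = []
    for r in range(head_r - half, head_r + half + 1):
        line = []
        for c in range(head_c - half, head_c + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(board[r][c] if inside else pad)
        window.append("".join(line))
    return window


# A 16-row by 32-column board (assumed orientation) with the head marked "H".
grid = [["."] * 32 for _ in range(16)]
grid[8][10] = "H"
board = ["".join(row) for row in grid]

view = egocentric_window(board, head=(8, 10))
print(len(view), len(view[0]))  # 17 17
print(view[8][8])               # H: the head always sits at the center
print(view[16])                 # a row of padding, since board row 16 does not exist
```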

**How many input values did the egocentric view save compared to the full board?
What is one thing the agent gained from this change, and one thing it lost?**

*Your answer:*

---

## 3. Exploration vs. exploitation

With `epsilon_decay = 0.995`, epsilon falls from 1.0 to 0.05 by episode ~600
(0.995^600 ≈ 0.05, with the decay applied once per episode). With
`epsilon_decay = 0.9997` (used in the final run), epsilon is still 0.55 at
episode 2,000.
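
To help with the sketch, the schedule can be tabulated with a few lines of Python. This assumes epsilon starts at 1.0, is multiplied by `epsilon_decay` once per episode, and is clipped at a floor of 0.05; the function name `epsilon_at` and the floor value are assumptions rather than confirmed lab settings.

```python
def epsilon_at(episode: int, decay: float, start: float = 1.0, floor: float = 0.05) -> float:
    """Epsilon after `episode` episodes of multiplicative decay, clipped at `floor`.

    Assumes epsilon is multiplied by `decay` once per episode; the 0.05 floor
    is also an assumption, not a confirmed lab setting.
    """
    return max(floor, start * decay ** episode)


for decay in (0.995, 0.9997):
    milestones = (0, 500, 1000, 2000, 5000, 10000)
    points = [(ep, round(epsilon_at(ep, decay), 3)) for ep in milestones]
    print(f"epsilon_decay={decay}: {points}")
```

Plotting the printed points by hand is enough to see the shape of each curve.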

**Sketch a rough curve of epsilon over time for each setting. Why does slower
decay produce a better-trained agent, even though it means the agent takes more
random actions overall?**

*Your answer:*

---

## 4. Runaway loss

In one intermediate experiment, the loss grew from around 35 to hundreds of
thousands within roughly a thousand episodes:

```
[ep_0300]  avg_loss=48.7   avg_reward=+8.1
[ep_0500]  avg_loss=347    avg_reward=+12.4
[ep_0700]  avg_loss=4,102  avg_reward=+6.5
[ep_1100]  avg_loss=686,000  avg_reward=-3.1
```

This happened because the learning algorithm was using MSE (mean squared error)
loss, which is *quadratic* — an error of size 2 produces a loss of 4, an error
of size 10 produces a loss of 100.
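
The difference in growth rate can be checked numerically. The sketch below compares squared error with the standard Huber loss for a single prediction error; the threshold `delta = 1.0` is a common default and an assumption here, not a value taken from the lab configuration.

```python
def squared_error(error: float) -> float:
    """Squared-error loss for a single prediction error (grows quadratically)."""
    return error ** 2


def huber(error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * abs_err ** 2
    return delta * (abs_err - 0.5 * delta)


for error in (0.5, 2.0, 10.0, 100.0):
    print(f"error={error:>6}: squared={squared_error(error):>8}  huber={huber(error):>6}")
```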

**Describe the feedback loop that caused the loss to spiral upward. Why does
Huber loss (which is linear for large errors) break this cycle?**

*Your answer:*

---

## 5. Interpreting the training curve

Look at the snake training log. The reward climbs, then dips, then climbs again:

```
[ep_1100]  avg_reward=+34.5  avg_steps=57
[ep_1800]  avg_reward=+4.4   avg_steps=98
[ep_3800]  avg_reward=+51.2  avg_steps=33
[ep_9000]  avg_reward=+246.0 avg_steps=85
```

Notice that by episode 3,800, avg_steps had dropped sharply (from 98 to 33)
while avg_reward jumped from +4.4 to +51.2. By episode 9,000, steps had risen
again to 85 while reward kept climbing to +246.0.
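
If you want to compare the checkpoints numerically, one option is to parse the log lines and look at a derived quantity such as reward per step. The sketch below is illustrative: the regular expression matches only the line format shown above, and reward per step is not something the trainer itself reports.

```python
import re

LOG = """\
[ep_1100]  avg_reward=+34.5  avg_steps=57
[ep_1800]  avg_reward=+4.4   avg_steps=98
[ep_3800]  avg_reward=+51.2  avg_steps=33
[ep_9000]  avg_reward=+246.0 avg_steps=85
"""

# Matches lines in the format shown in the log excerpt above.
LINE = re.compile(r"\[ep_(\d+)\]\s+avg_reward=([+-]?[\d.]+)\s+avg_steps=(\d+)")


def parse_log(text):
    """Return a list of (episode, avg_reward, avg_steps) tuples from log text."""
    rows = []
    for match in LINE.finditer(text):
        episode, reward, steps = match.groups()
        rows.append((int(episode), float(reward), int(steps)))
    return rows


for episode, reward, steps in parse_log(LOG):
    # Reward per step is a derived quantity, not something the trainer logs.
    print(f"ep {episode:>5}: reward/step = {reward / steps:.2f}")
```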

**What do you think the agent was doing at each of these stages? Use the
avg_steps and avg_reward numbers to support your interpretation.**

*Your answer:*

---

## 6. Policy observation

Run these commands to watch the agent at three checkpoints:

```
retro-gamer play runs/snake --checkpoint ep_1100
retro-gamer play runs/snake --checkpoint ep_5400
retro-gamer play runs/snake --checkpoint ep_17100
```

**Describe the agent's behavior at each checkpoint. What has the agent learned
by episode 5,400 that it hadn't yet learned at episode 1,100? What does the
episode 17,100 agent do that the earlier agents do not?**

*ep_1100:*

*ep_5400:*

*ep_17100:*

---

## 7. CNN vs. MLP

In the first attempt (full board, no explicit features), we used a CNN
(`spatial = true`). In the final run (egocentric board + explicit features), we
used an MLP (`spatial = false`).

**Why might an MLP be a reasonable choice when using the egocentric view, even
though the input is still a 2D board? What does the CNN offer that the MLP does
not, and why is that less important with an egocentric observation?**

*Your answer:*

---

## 8. Hyperparameter comparison

Suppose you ran two otherwise identical training experiments:
- Run A: `learning_rate = 0.001`
- Run B: `learning_rate = 0.0001`

**Based on what you learned from the runaway loss in Question 4, predict what
would happen in each run. What does this tell you about the trade-off when
choosing a learning rate?**

*Your answer:*