# Snake Training: Conceptual Questions

Answer each question in the space provided. Use evidence from the training log
and your observations of the agent at different checkpoints to support your answers.

---

## 1. Feature selection

In the first training attempt, the agent received the full 32×16 game board as
its input (6 × 32 × 16 = 3,072 numbers). The agent could see every character on
the board, yet even after 45,000 episodes it never learned to find the apple reliably.

When we added `apple_dx` and `apple_dy`, two numbers that encode the direction
from the snake's head to the apple, performance improved dramatically within
hundreds of episodes.
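One way such features could be computed is sketched below. This is a minimal illustration, not the trainer's actual code: the function name `apple_direction`, the `(x, y)` coordinate convention, and the scaling to the range [-1, 1] are all assumptions.

```python
def apple_direction(head, apple, width=32, height=16):
    """Return (apple_dx, apple_dy): the head-to-apple offset scaled to [-1, 1].

    Assumption: positions are (x, y) pairs with x growing rightward and
    y growing downward. The real encoding may differ (for example, it
    could keep only the sign of each offset).
    """
    head_x, head_y = head
    apple_x, apple_y = apple
    apple_dx = (apple_x - head_x) / (width - 1)
    apple_dy = (apple_y - head_y) / (height - 1)
    return apple_dx, apple_dy


# Head at (10, 8), apple at (25, 3): the apple is to the right and above.
dx, dy = apple_direction(head=(10, 8), apple=(25, 3))
print(dx > 0, dy < 0)  # True True
```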

**Why didn't the board encoding help the agent find the apple? What did the two
new features provide that the board encoding could not?**

*Your answer:*

---

## 2. Dimensionality reduction

In the full-board experiment, the agent processed 3,072 input values. When we
switched to the egocentric view (a 17×17 window centered on the snake's head),
the board input shrank to 17 × 17 × 6 = 1,734 values.
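A minimal sketch of the cropping step, assuming the board is stored as a list of rows of integer cell codes and that cells outside the board are filled with a padding value. The helper name `egocentric_window` and the padding convention are hypothetical. The real observation also carries 6 values per cell, which this single-layer sketch leaves out.

```python
def egocentric_window(board, head, size=17, pad_value=0):
    """Crop a size x size window of `board` centered on `head`.

    `board` is a list of rows, indexed as board[y][x]. Cells that fall
    outside the board are filled with `pad_value`. Both conventions are
    assumptions made for this sketch.
    """
    half = size // 2
    head_x, head_y = head
    height, width = len(board), len(board[0])
    window = []
    for y in range(head_y - half, head_y + half + 1):
        row = []
        for x in range(head_x - half, head_x + half + 1):
            inside = 0 <= x < width and 0 <= y < height
            row.append(board[y][x] if inside else pad_value)
        window.append(row)
    return window


# A 32x16 board of empty cells (0) with the head (code 1) at x=3, y=2.
board = [[0] * 32 for _ in range(16)]
board[2][3] = 1
view = egocentric_window(board, head=(3, 2))
print(len(view), len(view[0]))  # 17 17
print(view[8][8])               # 1 (the head always sits at the center)
```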

**How many input values did the egocentric view save compared to the full board?
What is one thing the agent gained from this change, and one thing it lost?**

*Your answer:*

---

## 3. Exploration vs. exploitation

With `epsilon_decay = 0.995`, epsilon falls from 1.0 to 0.05 by episode ~600.
With `epsilon_decay = 0.9997` (used in the final run), epsilon is still 0.55 at
episode 2,000.
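These figures can be checked with a few lines of Python. The sketch assumes epsilon starts at 1.0 and is multiplied by `epsilon_decay` once per episode. The helper names and the optional `floor` parameter are illustrative, and the trainer may schedule decay differently.

```python
import math


def epsilon_at(episode, decay, start=1.0, floor=0.0):
    """Epsilon after `episode` multiplicative decay steps, never below `floor`."""
    return max(floor, start * decay ** episode)


def episodes_to_reach(target, decay, start=1.0):
    """Smallest whole number of episodes after which epsilon <= target."""
    return math.ceil(math.log(target / start) / math.log(decay))


print(round(epsilon_at(2000, 0.9997), 2))  # 0.55
print(episodes_to_reach(0.05, 0.995))      # 598
```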

**Sketch a rough curve of epsilon over time for each setting. Why does slower
decay produce a better-trained agent, even though it means the agent takes more
random actions overall?**

*Your answer:*

---

## 4. Runaway loss

In one intermediate experiment, the loss grew from around 35 to hundreds of
thousands within several hundred episodes:

```
[ep_0300] avg_loss=48.7 avg_reward=+8.1
[ep_0500] avg_loss=347 avg_reward=+12.4
[ep_0700] avg_loss=4,102 avg_reward=+6.5
[ep_1100] avg_loss=686,000 avg_reward=-3.1
```

This happened because the learning algorithm was using MSE (mean squared error)
loss, which is *quadratic*: an error of size 2 produces a loss of 4, and an error
of size 10 produces a loss of 100.
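The difference in growth is easy to see numerically. The sketch below compares the per-sample squared error with a textbook Huber loss. The threshold `delta = 1.0` is an assumption, not necessarily the value used in these experiments.

```python
def squared_error(error):
    """Per-sample squared error, the quantity that MSE averages."""
    return error ** 2


def huber(error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * abs_error ** 2
    return delta * (abs_error - 0.5 * delta)


for error in (2, 10, 100):
    print(error, squared_error(error), huber(error))
# 2 4 1.5
# 10 100 9.5
# 100 10000 99.5
```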

**Describe the feedback loop that caused the loss to spiral upward. Why does
Huber loss (which is linear for large errors) break this cycle?**

*Your answer:*

---

## 5. Interpreting the training curve

Look at the snake training log. The reward climbs, then dips, then climbs again:

```
[ep_1100] avg_reward=+34.5 avg_steps=57
[ep_1800] avg_reward=+4.4 avg_steps=98
[ep_3800] avg_reward=+51.2 avg_steps=33
[ep_9000] avg_reward=+246.0 avg_steps=85
```
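If you want to work with these numbers programmatically, lines in this format can be parsed with a small helper. This is a sketch based only on the excerpts shown in this worksheet. `parse_log_line` is a hypothetical name and is not part of `retro-gamer`.

```python
import re

# Matches the "[ep_NNNN] key=value key=value" shape of the excerpts above.
LOG_LINE = re.compile(r"\[ep_(\d+)\]\s+(.*)")


def parse_log_line(line):
    """Parse one log line into (episode, {metric_name: value})."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    episode = int(match.group(1))
    metrics = {}
    for field in match.group(2).split():
        name, value = field.split("=", 1)
        # Drop thousands separators such as "686,000" before converting.
        metrics[name] = float(value.replace(",", ""))
    return episode, metrics


print(parse_log_line("[ep_3800] avg_reward=+51.2 avg_steps=33"))
# (3800, {'avg_reward': 51.2, 'avg_steps': 33.0})
```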

Notice that around episode 3,800, `avg_steps` dropped sharply (from 98 to 33)
at the same time as reward jumped. By episode 9,000, steps had risen again while
reward kept climbing.

**What do you think the agent was doing at each of these stages? Use the
`avg_steps` and `avg_reward` numbers to support your interpretation.**

*Your answer:*

---

## 6. Policy observation

Run these commands to watch the agent at three checkpoints:

```
retro-gamer play runs/snake --checkpoint ep_1100
retro-gamer play runs/snake --checkpoint ep_5400
retro-gamer play runs/snake --checkpoint ep_17100
```

**Describe the agent's behavior at each checkpoint. What has the agent learned
by episode 5,400 that it hadn't yet learned at episode 1,100? What does the
episode 17,100 agent do that the earlier agents do not?**

*ep_1100:*

*ep_5400:*

*ep_17100:*

---

## 7. CNN vs. MLP

In the first attempt (full board, no explicit features), we used a CNN
(`spatial = true`). In the final run (egocentric board + explicit features), we
used an MLP (`spatial = false`).

**Why might an MLP be a reasonable choice when using the egocentric view, even
though the input is still a 2D board? What does the CNN offer that the MLP does
not, and why is that less important with an egocentric observation?**

*Your answer:*

---

## 8. Hyperparameter comparison

Suppose you ran two otherwise identical training experiments:
- Run A: `learning_rate = 0.001`
- Run B: `learning_rate = 0.0001`

**Based on what you learned from the runaway loss in Question 4, predict what
would happen in each run. What does this tell you about the trade-off when
choosing a learning rate?**

*Your answer:*