# Snake Training: Conceptual Questions

Answer each question in the space provided. Use evidence from the training log and your observations of the agent at different checkpoints to support your answers.

---

## 1. Feature selection

In the first training attempt, the agent received the full 32×16 game board as its input (6 × 32 × 16 = 3,072 numbers). The agent could see every character on the board, yet it never learned to reliably find the apple after 45,000 episodes. When we added `apple_dx` and `apple_dy`, two numbers that encode the direction from the snake's head to the apple, performance improved dramatically within hundreds of episodes.

**Why didn't the board encoding help the agent find the apple? What did the two new features provide that the board encoding could not?**

*Your answer:*

---

## 2. Dimensionality reduction

In the full-board experiment, the agent processed 3,072 input values. When we switched to the egocentric view (a 17×17 window centered on the snake's head), the board input shrank to 6 × 17 × 17 = 1,734 values.

**How many input values did the egocentric view save compared to the full board? What is one thing the agent gained from this change, and one thing it lost?**

*Your answer:*

---

## 3. Exploration vs. exploitation

With `epsilon_decay = 0.995`, epsilon falls from 1.0 to 0.05 by about episode 600. With `epsilon_decay = 0.9997` (used in the final run), epsilon is still 0.55 at episode 2,000.

**Sketch a rough curve of epsilon over time for each setting. Why does slower decay produce a better-trained agent, even though it means the agent takes more random actions overall?**

*Your answer:*

---

## 4. Runaway loss

In one intermediate experiment, the loss grew from around 35 to hundreds of thousands within a few hundred episodes:

```
[ep_0300] avg_loss=48.7 avg_reward=+8.1
[ep_0500] avg_loss=347 avg_reward=+12.4
[ep_0700] avg_loss=4,102 avg_reward=+6.5
[ep_1100] avg_loss=686,000 avg_reward=-3.1
```

This happened because the learning algorithm was using MSE (mean squared error) loss, which is *quadratic*: an error of size 2 produces a loss of 4, and an error of size 10 produces a loss of 100.

**Describe the feedback loop that caused the loss to spiral upward. Why does Huber loss (which is linear for large errors) break this cycle?**

*Your answer:*

---

## 5. Interpreting the training curve

Look at the snake training log. The reward climbs, then dips, then climbs again:

```
[ep_1100] avg_reward=+34.5 avg_steps=57
[ep_1800] avg_reward=+4.4 avg_steps=98
[ep_3800] avg_reward=+51.2 avg_steps=33
[ep_9000] avg_reward=+246.0 avg_steps=85
```

Notice that around episode 3,800, avg_steps dropped sharply (from ~98 to 33) at the same time reward jumped. Then, by episode 9,000, steps rose again while reward kept climbing.

**What do you think the agent was doing at each of these stages? Use the avg_steps and avg_reward numbers to support your interpretation.**

*Your answer:*

---

## 6. Policy observation

Run these commands to watch the agent at three checkpoints:

```
retro-gamer play runs/snake --checkpoint ep_1100
retro-gamer play runs/snake --checkpoint ep_5400
retro-gamer play runs/snake --checkpoint ep_17100
```

**Describe the agent's behavior at each checkpoint. What has the agent learned by episode 5,400 that it hadn't yet learned at episode 1,100? What does the episode 17,100 agent do that the earlier agents do not?**

*ep_1100:*

*ep_5400:*

*ep_17100:*

---

## 7. CNN vs. MLP

In the first attempt (full board, no explicit features), we used a CNN (`spatial = true`). In the final run (egocentric board + explicit features), we used an MLP (`spatial = false`).

**Why might an MLP be a reasonable choice when using the egocentric view, even though the input is still a 2D board? What does the CNN offer that the MLP does not, and why is that less important with an egocentric observation?**

*Your answer:*

---

## 8. Hyperparameter comparison

Suppose you ran two otherwise identical training experiments:

- Run A: `learning_rate = 0.001`
- Run B: `learning_rate = 0.0001`

**Based on what you learned from the runaway loss in Question 4, predict what would happen in each run. What does this tell you about the trade-off when choosing a learning rate?**

*Your answer:*
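
## Appendix: checking the numbers in Questions 3 and 4

If you would like to check your sketches numerically, the script below is one possible starting point. It is a rough sketch under stated assumptions, not the trainer's actual code: it assumes epsilon starts at 1.0 and is multiplied by `epsilon_decay` once per episode, and it uses the textbook Huber loss with a threshold of `delta = 1.0`. The function names (`epsilon_at`, `episodes_to_reach`, `mse_loss`, `huber_loss`) are made up for this illustration.

```python
import math


def epsilon_at(episode, decay, start=1.0, floor=0.0):
    """Epsilon after `episode` decay steps (assumed: one multiply per episode)."""
    return max(floor, start * decay ** episode)


def episodes_to_reach(target, decay, start=1.0):
    """Smallest whole number of decay steps that brings epsilon to `target` or below."""
    return math.ceil(math.log(target / start) / math.log(decay))


def mse_loss(error):
    """Squared error for a single sample, as described in Question 4."""
    return error ** 2


def huber_loss(error, delta=1.0):
    """Textbook Huber loss: quadratic up to `delta`, linear beyond it.

    Note the conventional factor of 0.5 in the quadratic region; the
    delta value is an assumption, not taken from the training config.
    """
    size = abs(error)
    if size <= delta:
        return 0.5 * size ** 2
    return delta * (size - 0.5 * delta)


if __name__ == "__main__":
    # Epsilon at a few episodes for the two decay settings in Question 3.
    for decay in (0.995, 0.9997):
        points = [(ep, round(epsilon_at(ep, decay), 3)) for ep in (0, 500, 1000, 2000)]
        print(f"decay={decay}: {points}")
        print(f"  episodes to reach 0.05: {episodes_to_reach(0.05, decay)}")

    # How each loss grows as the error gets larger (Question 4).
    for error in (0.5, 2, 10, 100):
        print(f"error={error}: mse={mse_loss(error)}, huber={huber_loss(error)}")
```

Running it prints a handful of points you can plot by hand. If the real trainer decays epsilon per step rather than per episode, or clamps it at a minimum value, adjust `epsilon_at` (for example through its `floor` argument) before comparing against the log.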