lab_reinforcement_learning/snake_training.md
Chris Proctor 42bc2e7a50 Initial commit
2026-06-22 16:14:58 -04:00

Snake Training: Conceptual Questions

Answer each question in the space provided. Use evidence from the training log and your observations of the agent at different checkpoints to support your answers.


1. Feature selection

In the first training attempt, the agent received the full 32×16 game board as its input (6 × 32 × 16 = 3,072 numbers). The agent could see every character on the board, yet even after 45,000 episodes it had not learned to find the apple reliably.

When we added apple_dx and apple_dy — two numbers that encode the direction from the snake's head to the apple — performance improved dramatically within hundreds of episodes.
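
To make the two features concrete, here is a minimal sketch of how they might be computed. The function name, the (x, y) coordinate convention, and the scaling by board size are assumptions for illustration; the lab's actual encoding may differ.

```python
def apple_offset(head, apple, width=32, height=16):
    """Return (apple_dx, apple_dy): the apple's offset from the snake's head.

    Assumptions (not from the lab code): positions are (x, y) tuples,
    x grows to the right, y grows downward, and each offset is divided
    by the board size so the values stay between -1 and 1.
    """
    head_x, head_y = head
    apple_x, apple_y = apple
    return (apple_x - head_x) / width, (apple_y - head_y) / height


# Apple is 16 columns to the right of the head and 4 rows above it.
print(apple_offset(head=(10, 8), apple=(26, 4)))  # (0.5, -0.25)
```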

Why didn't the board encoding help the agent find the apple? What did the two new features provide that the board encoding could not?

Your answer:


2. Dimensionality reduction

In the full-board experiment, the agent processed 3,072 input values. When we switched to the egocentric view (a 17×17 window centered on the snake's head), the board input shrank to 17 × 17 × 6 = 1,734 values.
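
The following sketch shows one way an egocentric window could be cropped from the board. The helper name, the (row, col) convention, and the choice to fill off-board cells with a wall character are assumptions, not the lab's implementation.

```python
def egocentric_window(board, head, size=17, pad="#"):
    """Crop a size x size window centered on the head.

    Assumptions: board is a list of rows of single characters, head is
    (row, col), and cells that fall off the board are filled with pad.
    """
    half = size // 2
    head_row, head_col = head
    window = []
    for row in range(head_row - half, head_row + half + 1):
        cells = []
        for col in range(head_col - half, head_col + half + 1):
            inside = 0 <= row < len(board) and 0 <= col < len(board[0])
            cells.append(board[row][col] if inside else pad)
        window.append(cells)
    return window


# Hypothetical empty board, 32 wide and 16 tall, with the head marked "S".
board = [["." for _ in range(32)] for _ in range(16)]
board[3][5] = "S"

window = egocentric_window(board, head=(3, 5))
print(len(window), len(window[0]))  # 17 17
print(window[8][8])                 # S (the head is always at the center)
```

Wherever the snake moves, the head stays at the same position in the window; only the surroundings change.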

How many input values did the egocentric view save compared to the full board? What is one thing the agent gained from this change, and one thing it lost?

Your answer:


3. Exploration vs. exploitation

With epsilon_decay = 0.995, epsilon falls from 1.0 to 0.05 by episode ~600. With epsilon_decay = 0.9997 (used in the final run), epsilon is still 0.55 at episode 2,000.
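
If you want points to plot, the schedule can be tabulated with a short script. This is a sketch that assumes epsilon starts at 1.0 and is multiplied by epsilon_decay once per episode; the function names are made up for this example.

```python
import math


def epsilon_at(episode, decay, start=1.0):
    """Epsilon after a given number of episodes of multiplicative decay."""
    return start * decay ** episode


def episodes_to_reach(target, decay, start=1.0):
    """First episode at which epsilon has fallen to target or below."""
    return math.ceil(math.log(target / start) / math.log(decay))


for decay in (0.995, 0.9997):
    points = [(ep, round(epsilon_at(ep, decay), 3)) for ep in (0, 500, 1000, 2000)]
    print(f"decay={decay}: {points}")
    print(f"  reaches 0.05 at episode {episodes_to_reach(0.05, decay)}")
```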

Sketch a rough curve of epsilon over time for each setting. Why does slower decay produce a better-trained agent, even though it means the agent takes more random actions overall?

Your answer:


4. Runaway loss

In one intermediate experiment, the loss grew from around 35 to hundreds of thousands in roughly a thousand episodes:

[ep_0300]  avg_loss=48.7   avg_reward=+8.1
[ep_0500]  avg_loss=347    avg_reward=+12.4
[ep_0700]  avg_loss=4,102  avg_reward=+6.5
[ep_1100]  avg_loss=686,000  avg_reward=-3.1

This happened because the learning algorithm was using MSE (mean squared error) loss, which is quadratic: an error of size 2 produces a loss of 4, while an error of size 10 produces a loss of 100.
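
The quadratic growth described above can be compared with a loss that grows only linearly for large errors. The sketch below uses the standard Huber definition with an assumed threshold of delta = 1.0; the lab's training code may use a different threshold or a library implementation.

```python
def squared_error(error):
    """Squared error for a single prediction (quadratic in the error)."""
    return error ** 2


def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta.

    delta = 1.0 is an assumed threshold, not a value taken from the lab.
    """
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)


for error in (2, 10, 100):
    print(error, squared_error(error), huber(error))
# 2 4 1.5
# 10 100 9.5
# 100 10000 99.5
```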

Describe the feedback loop that caused the loss to spiral upward. Why does Huber loss (which is linear for large errors) break this cycle?

Your answer:


5. Interpreting the training curve

Look at the snake training log. The reward climbs, then dips, then climbs again:

[ep_1100]  avg_reward=+34.5  avg_steps=57
[ep_1800]  avg_reward=+4.4   avg_steps=98
[ep_3800]  avg_reward=+51.2  avg_steps=33
[ep_9000]  avg_reward=+246.0 avg_steps=85

Notice that by episode 3,800, avg_steps had dropped sharply (from 98 to 33) while reward jumped. By episode 9,000, steps had risen again while reward kept climbing.
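
To compare the stages side by side, you could parse the log lines and add a derived column. The sketch below assumes the log format shown above; reward per step is just one possible summary, not a metric the training tool reports.

```python
import re

LOG = """\
[ep_1100]  avg_reward=+34.5  avg_steps=57
[ep_1800]  avg_reward=+4.4   avg_steps=98
[ep_3800]  avg_reward=+51.2  avg_steps=33
[ep_9000]  avg_reward=+246.0 avg_steps=85
"""

# Matches lines of the form: [ep_1100]  avg_reward=+34.5  avg_steps=57
PATTERN = re.compile(r"\[ep_(\d+)\]\s+avg_reward=([+-]?[\d.]+)\s+avg_steps=(\d+)")


def parse(log):
    """Return a list of (episode, avg_reward, avg_steps) tuples."""
    rows = []
    for match in PATTERN.finditer(log):
        episode, reward, steps = match.groups()
        rows.append((int(episode), float(reward), int(steps)))
    return rows


for episode, reward, steps in parse(LOG):
    print(episode, reward, steps, round(reward / steps, 2))
```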

What do you think the agent was doing at each of these stages? Use the avg_steps and avg_reward numbers to support your interpretation.

Your answer:


6. Policy observation

Run these commands to watch the agent at three checkpoints:

retro-gamer play runs/snake --checkpoint ep_1100
retro-gamer play runs/snake --checkpoint ep_5400
retro-gamer play runs/snake --checkpoint ep_17100

Describe the agent's behavior at each checkpoint. What has the agent learned by episode 5,400 that it hadn't yet learned at episode 1,100? What does the episode 17,100 agent do that the earlier agents do not?

ep_1100:

ep_5400:

ep_17100:


7. CNN vs. MLP

In the first attempt (full board, no explicit features), we used a CNN (spatial = true). In the final run (egocentric board + explicit features), we used an MLP (spatial = false).
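
One way to get a feel for the two architectures is to count the weights in a single first layer of each. This is a rough sketch: the layer sizes (128 hidden units, 32 filters, 3×3 kernels) are assumed for illustration and are not taken from the lab's configuration.

```python
def dense_params(inputs, units):
    """Weights plus biases in one fully connected layer."""
    return inputs * units + units


def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases in one 2D convolution layer (square kernel)."""
    return kernel * kernel * in_channels * out_channels + out_channels


# First MLP layer over the flattened 17 x 17 x 6 egocentric window.
print(dense_params(17 * 17 * 6, 128))  # 222080

# One 3x3 convolution layer over the same 6 input channels.
print(conv_params(3, 6, 32))           # 1760
```

The dense layer learns a separate weight for every cell of the window, while the convolution reuses the same small kernel at every position.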

Why might an MLP be a reasonable choice when using the egocentric view, even though the input is still a 2D board? What does the CNN offer that the MLP does not, and why is that less important with an egocentric observation?

Your answer:


8. Hyperparameter comparison

Suppose you ran two otherwise identical training experiments:

  • Run A: learning_rate = 0.001
  • Run B: learning_rate = 0.0001

Based on what you learned from the runaway loss in Question 4, predict what would happen in each run. What does this tell you about the trade-off when choosing a learning rate?

Your answer: