lab_classification_features/questions.md
2026-06-07 08:55:58 -04:00

Spam Classifier: Questions


Checkpoint 1: Exploring the Dataset

1. How many messages are in the dataset? How many are ham, and how many are spam?

Your answer:

2. Read at least ten spam messages (`df[df.label == "spam"]`). List three patterns you notice.

Your answer:

3. Read at least ten ham messages (`df[df.label == "ham"]`). How do they differ from the spam messages?

Your answer:


Checkpoint 2: Manual Classifier

4. What rules did you write? List each rule and the pattern it targets.

Your answer:

5. Record your best results:

| Metric | Value |
| --- | --- |
| Spam precision | |
| Spam recall | |
| Spam F1 | |
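
For reference, the three metrics in this table follow from counts of true positives, false positives, and false negatives, with spam treated as the positive class. The sketch below is a minimal illustration in plain Python. The function name `spam_metrics` and the toy label lists are hypothetical and not part of the lab code; the lab's own evaluation helper may report these values differently.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    # Guard against division by zero when a count is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham"]
precision, recall, f1 = spam_metrics(y_true, y_pred)
print(precision, recall, f1)
```

Precision drops when ham is flagged as spam, and recall drops when spam is missed, so the two can move in opposite directions as rules are added or removed.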

6. Does your classifier make more false positives (ham flagged as spam) or false negatives (spam missed)?

Your answer:

7. Describe one rule you tried that did not help and explain why.

Your answer:


Checkpoint 3: Designing Features by Hand

8. List all the features you implemented and the reasoning behind each:

| Feature name | What it measures | Reasoning |
| --- | --- | --- |
| | | |

9. Record your best results:

| Metric | Value |
| --- | --- |
| Spam precision | |
| Spam recall | |
| Spam F1 | |

10. Which features received the largest positive weights (most predictive of spam)? The largest negative weights (predictive of ham)? Does this match your expectations?

Your answer:

11. Did any feature you thought would help receive a near-zero weight? Why might the model have decided it was unimportant?

Your answer:


Checkpoint 4: Bag of Words

12. Which transformers did you include in your cleaning pipeline, and in what order? Explain how each one changes the vocabulary.
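
As a rough illustration of how cleaning steps change the vocabulary, the sketch below applies two cleaning functions to a tiny made-up corpus and counts the distinct tokens after each stage. The function names (`lowercase`, `strip_punctuation`, `vocabulary`) and the messages are hypothetical; the transformers provided in the lab may have different names and behavior.

```python
import string

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Remove every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def vocabulary(messages, steps=()):
    """Apply each cleaning step in order, then collect distinct whitespace-separated tokens."""
    vocab = set()
    for message in messages:
        for step in steps:
            message = step(message)
        vocab.update(message.split())
    return vocab

# Made-up messages for illustration only.
messages = ["Free prize!", "free PRIZE inside", "Claim your prize, free."]

raw = vocabulary(messages)
lowered = vocabulary(messages, steps=[lowercase])
cleaned = vocabulary(messages, steps=[lowercase, strip_punctuation])
print(len(raw), len(lowered), len(cleaned))
```

Comparing the three vocabulary sizes shows which tokens each step merges, which is one way to reason about what each transformer in your own pipeline contributes.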

Your answer:

13. Record your best results:

| Metric | Value |
| --- | --- |
| Spam precision | |
| Spam recall | |
| Spam F1 | |

14. How did the bag-of-words classifier's performance compare with that of your best classifier built on hand-designed features? What do you think accounts for the difference?

Your answer:

15. Look at the words with the strongest weights (in either direction). Do any surprise you? What do they suggest about how the model is making its decisions?

Your answer:


Final Questions

16. Pick a different classification problem (for example: positive vs. negative movie reviews, news articles vs. opinion pieces, or medical vs. general-audience text). Propose five features you would extract to classify it, and explain your reasoning.

Problem I chose:

| Feature name | What it measures | Why it might help |
| --- | --- | --- |
| | | |

17. Could adding more features ever hurt the performance of a classifier? Explain when and why this might happen.

Your answer:

18. In this lab you split the data into 70% training and 30% testing. What would happen if you used 99% for training and 1% for testing? What about 1% for training and 99% for testing?
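
To make the proportions concrete, the sketch below shuffles a list of examples and cuts it at a given training fraction. The helper `split_dataset`, the seed, and the dataset size of 1,000 are assumptions for illustration; the lab's own splitting code may work differently.

```python
import random

def split_dataset(examples, train_fraction, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(1000))  # stand-in for 1,000 labeled messages (hypothetical size)
sizes = {}
for fraction in (0.70, 0.99, 0.01):
    train, test = split_dataset(examples, fraction)
    sizes[fraction] = (len(train), len(test))
    print(f"{fraction:.0%} train: {len(train)} training, {len(test)} test examples")
```

With these sizes in hand, consider how much one misclassified test message would change the measured score, and how many spam examples the model would see during training.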

Your answer: