lab_estimation/lab_estimation.ipynb
Chris Proctor b81e182942 Initial commit
2026-06-22 16:11:05 -04:00

{
"cells": [
{
"cell_type": "markdown",
"id": "cell-001",
"metadata": {},
"source": [
"# Estimating Income from Survey Data\n",
"\n",
"This notebook is the second half of the Estimation lab. In the first half, you fit a\n",
"line to data by hand using an interactive toy, building intuition for **parameters**,\n",
"**loss**, and **training**.\n",
"\n",
"Here, you'll apply the same ideas with code, on two datasets:\n",
"\n",
"- **Pok\u00e9mon stats**: the dataset from the Pok\u00e9mon lab, used here for **demos**. If\n",
"  you haven't seen this dataset before, it has one row per Pok\u00e9mon with battle stats like\n",
"  `attack` and `total` (the sum of all six stats).\n",
"- **BRFSS**: the same health survey dataset from the Pok\u00e9mon lab, used for **your\n",
"  turn** sections, where you repeat each technique and dig further.\n",
"\n",
"Each demo introduces one idea using Pok\u00e9mon; each \"your turn\" asks you to apply it to\n",
"BRFSS. Record your results and write your answers in `questions.md`, not here\u2014the\n",
"checkpoints on the lab page tell you exactly what to answer and when."
]
},
{
"cell_type": "code",
"id": "cell-002",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from plotting import plot_regression\n",
"\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (6, 4)"
]
},
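{
"cell_type": "markdown",
"id": "cell-003",
"metadata": {},
"source": [
"We'll score every model in this notebook with root mean squared error (RMSE).\n",
"`scikit-learn` has a built-in function for this (`sklearn.metrics.root_mean_squared_error`,\n",
"in version 1.4 and later), but it's worth seeing how it works. The cell below defines the\n",
"same function described on the lab page:"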
]
},
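{
"cell_type": "code",
"id": "cell-004",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def root_mean_squared_error(y_true, y_pred):\n",
"    \"\"\"Return the RMSE between observed values and predictions.\"\"\"\n",
"    errors = y_true - y_pred                    # signed miss for each row\n",
"    squared_errors = errors ** 2                # squaring makes every miss positive\n",
"    mean_squared_error = squared_errors.mean()  # average over all rows\n",
"    return mean_squared_error ** 0.5            # square root restores the original units"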
]
},
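{
"cell_type": "markdown",
"id": "cell-004a",
"metadata": {},
"source": [
"To see what each step of that function does, here is a small worked example. The numbers\n",
"are made up for illustration (they are not from either dataset), and the `toy_` names are\n",
"hypothetical; the arithmetic follows the same steps as `root_mean_squared_error`."
]
},
{
"cell_type": "code",
"id": "cell-004b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up values, chosen so the arithmetic is easy to check by hand\n",
"toy_true = np.array([3.0, 5.0, 7.0, 9.0])\n",
"toy_pred = np.array([2.0, 5.0, 9.0, 9.0])\n",
"\n",
"toy_errors = toy_true - toy_pred    # [ 1,  0, -2,  0]\n",
"toy_squared = toy_errors ** 2       # [ 1,  0,  4,  0]\n",
"toy_mse = toy_squared.mean()        # 5 / 4 = 1.25\n",
"toy_rmse = toy_mse ** 0.5           # square root of 1.25, about 1.118\n",
"print(f\"RMSE: {toy_rmse:.3f}\")"
]
},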
{
"cell_type": "code",
"id": "cell-005",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pokemon = pd.read_csv(\"pokemon.csv\")\n",
"people = pd.read_csv(\"brfss_2020.csv\")\n",
"print(f\"pokemon: {pokemon.shape}, people: {people.shape}\")\n",
"pokemon.head(3)"
]
},
{
"cell_type": "markdown",
"id": "cell-006",
"metadata": {},
"source": [
"## Column reference\n",
"\n",
"**pokemon** (one row per Pok\u00e9mon):\n",
"\n",
"| Column | Description |\n",
"|--------|-------------|\n",
"| `type`, `subtype` | Elemental type(s), e.g. `Grass`, `Poison` |\n",
"| `total` | Sum of all six stats below |\n",
"| `hp`, `attack`, `defense`, `special_attack`, `special_defense`, `speed` | Battle stats |\n",
"| `generation` | Which numbered generation of games introduced this Pok\u00e9mon (1-6) |\n",
"| `legendary` | Whether this Pok\u00e9mon is Legendary (True/False) |\n",
"\n",
"**people** (one row per BRFSS survey respondent):\n",
"\n",
"| Column | Description |\n",
"|--------|-------------|\n",
"| `age` | Age band (18, 25, 35, 45, 55, or 65 meaning 65+) |\n",
"| `sex` | `male` or `female` |\n",
"| `income` | Annual income band, 1 (under $10k) to 8 (over $75k) |\n",
"| `education` | Highest education level, 1 (did not graduate high school) to 4 (college graduate) |\n",
"| `sexual_orientation` | `heterosexual`, `homosexual`, `bisexual`, or `other` |\n",
"| `health` | Self-reported general health, 1 (poor) to 5 (excellent) |\n",
"| `no_doctor` | Couldn't afford to see a doctor in the last year (True/False) |\n",
"| `exercise` | Did any exercise in the last 30 days (True/False) |\n",
"| `sleep` | Average hours of sleep per night |"
]
},
{
"cell_type": "markdown",
"id": "cell-007",
"metadata": {},
"source": [
"## 1. Fitting a line to one predictor\n",
"\n",
"We'll reuse the same `fit`/`predict` pattern from the lab intro, but now `scikit-learn`\n",
"finds the parameters instead of you nudging them by hand. The demo below splits the\n",
"data, fits a line, and reports its slope, intercept, and **RMSE** (root mean squared\n",
"error) on both a training set and a held-out test set.\n",
"\n",
"RMSE starts from the same \"total squared error\" you minimized in the toy, but it is\n",
"averaged over all points instead of summed, then square-rooted so it's back in the original\n",
"units of the response (`y`). A model with RMSE of 8 is, roughly, \"off by about 8 on a typical prediction.\"\n",
"\n",
"`plot_regression(X, y, model)`, imported from `plotting.py`, draws the scatterplot and\n",
"the model's line for you. It picks reasonable defaults for marker size, opacity, and\n",
"jitter based on how much data you're plotting\u2014but if a particular plot still comes out\n",
"hard to read, you can override any of them: `plot_regression(X, y, model, size=10,\n",
"opacity=0.1, jitter=0.3)`. All three are purely cosmetic: they change how the plot\n",
"looks, never the data the model was trained on. You won't need to write any plotting\n",
"code yourself in this notebook."
]
},
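{
"cell_type": "markdown",
"id": "cell-007a",
"metadata": {},
"source": [
"Written as a formula, the description of RMSE above looks like this. The symbols are\n",
"notation assumed here rather than taken from the lab page: $n$ is the number of points,\n",
"$y_i$ is the observed response for point $i$, and $\\hat{y}_i$ is the model's prediction.\n",
"\n",
"$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}$$\n",
"\n",
"The sum inside is the total squared error from the toy; dividing by $n$ averages it, and\n",
"the square root returns the result to the units of the response."
]
},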
{
"cell_type": "code",
"id": "cell-008",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: predict a Pokemon's total stats from its attack stat\n",
"X = pokemon[[\"attack\"]].astype(float)  # double brackets keep X a one-column DataFrame\n",
"y = pokemon[\"total\"]\n",
"\n",
"# Hold out 20% of the rows for testing; random_state makes the split repeatable\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Fit the line on the training rows only\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Score the fitted line on the rows it saw and on the held-out rows\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
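{
"cell_type": "markdown",
"id": "cell-008a",
"metadata": {},
"source": [
"The split step in the demo can also be looked at on its own. This is a minimal sketch\n",
"with ten made-up rows (the `demo_`, `first`, and `second` names are hypothetical):\n",
"`test_size=0.2` holds out 20% of the rows, and a fixed `random_state` reproduces the same\n",
"split on every call."
]
},
{
"cell_type": "code",
"id": "cell-008b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"demo_rows = np.arange(10).reshape(-1, 1)   # ten made-up rows, one column\n",
"demo_labels = np.arange(10)\n",
"\n",
"# Two separate calls with the same random_state\n",
"first = train_test_split(demo_rows, demo_labels, test_size=0.2, random_state=42)\n",
"second = train_test_split(demo_rows, demo_labels, test_size=0.2, random_state=42)\n",
"\n",
"print(len(first[0]), len(first[1]))          # 8 training rows, 2 test rows\n",
"print(np.array_equal(first[1], second[1]))   # True: the same rows are held out both times"
]
},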
{
"cell_type": "markdown",
"id": "cell-009",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"Repeat the same steps \u2014 split, fit, predict, score, and plot \u2014 on `people`, to predict\n",
"`income` from:\n",
"\n",
"1. `education`\n",
"2. `health`\n",
"3. A predictor of your choice\n",
"\n",
"Record your results and answer the questions about them in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-010",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: education -> income\n"
]
},
{
"cell_type": "code",
"id": "cell-011",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: health -> income\n"
]
},
{
"cell_type": "code",
"id": "cell-012",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: a predictor of your choice -> income\n"
]
},
{
"cell_type": "markdown",
"id": "cell-013",
"metadata": {},
"source": [
"## 2. Binary predictors: a coefficient is a group difference\n",
"\n",
"`legendary` only takes two values, `True` and `False`. There's no \"in between,\" so\n",
"the slope for a binary predictor isn't a rate of change; it's the model's predicted\n",
"*difference between the two groups*. With the column cast to 0/1, the intercept is the\n",
"prediction for the `False` group, and intercept + slope is the prediction for the `True` group."
]
},
{
"cell_type": "code",
"id": "cell-014",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: legendary status -> total stats\n",
"pokemon[\"legendary_int\"] = pokemon[\"legendary\"].astype(int)\n",
"X = pokemon[[\"legendary_int\"]]\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
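{
"cell_type": "markdown",
"id": "cell-014a",
"metadata": {},
"source": [
"One way to check the group-difference reading is on a tiny made-up example (the names\n",
"`grp`, `resp`, and `check_model` are hypothetical, and the numbers are not from either\n",
"dataset). For a least-squares line with a single 0/1 predictor, the intercept should match\n",
"the mean response of the 0 group, and the slope should match the difference between the\n",
"two group means."
]
},
{
"cell_type": "code",
"id": "cell-014b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Made-up data: a 0/1 predictor and a response for six imaginary cases\n",
"grp = np.array([0, 0, 0, 1, 1, 1])\n",
"resp = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 24.0])\n",
"\n",
"check_model = LinearRegression().fit(grp.reshape(-1, 1), resp)\n",
"\n",
"mean_0 = resp[grp == 0].mean()   # 12.0\n",
"mean_1 = resp[grp == 1].mean()   # 22.0\n",
"print(f\"intercept: {check_model.intercept_:.3f}   mean of group 0: {mean_0:.3f}\")\n",
"print(f\"slope: {check_model.coef_[0]:.3f}   difference in means: {mean_1 - mean_0:.3f}\")"
]
},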
{
"cell_type": "markdown",
"id": "cell-015",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"`people` has two binary columns: `exercise` and `no_doctor`. Pick one, cast it to\n",
"`int` as the demo does with `legendary`, and repeat the same steps to predict `income`\n",
"from it. Record your results and answer the questions about them in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-016",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: exercise or no_doctor -> income\n"
]
},
{
"cell_type": "markdown",
"id": "cell-017",
"metadata": {},
"source": [
"## 3. Ordinal predictors: an assumption worth naming\n",
"\n",
"`generation` records which generation of games introduced each Pok\u00e9mon, numbered in\n",
"release order. It's an **ordinal** label, not a true measurement. Treating it as a number\n",
"assumes each step (1\u21922, 2\u21923, \u2026) represents the same-sized \"distance,\" which may not\n",
"be true at all."
]
},
{
"cell_type": "code",
"id": "cell-018",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: generation -> total stats\n",
"X = pokemon[[\"generation\"]].astype(float)\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
{
"cell_type": "markdown",
"id": "cell-019",
"metadata": {},
"source": [
"Notice that the test RMSE here is barely better than you'd get by guessing the mean\n",
"`total` every time: generation tells you almost nothing about total stats.\n",
"\n",
"`education` and `health` in the BRFSS data are ordinal in the same way `generation`\n",
"is: the numbers encode an order, not an evenly spaced scale. `age`, on the other hand,\n",
"uses the actual lower bound of each band in years, which is a much more defensible\n",
"number to treat as continuous. Answer the question about this in `questions.md`."
]
},
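{
"cell_type": "markdown",
"id": "cell-019a",
"metadata": {},
"source": [
"To put a number on \"guessing the mean every time,\" a baseline can be computed directly.\n",
"The helper below is a hypothetical sketch, not part of the lab's code: it predicts the\n",
"training mean for every test row and scores that guess with RMSE. The call uses made-up\n",
"numbers; in this notebook you could pass the demo's `y_train` and `y_test` instead and\n",
"compare the result with the model's test RMSE."
]
},
{
"cell_type": "code",
"id": "cell-019b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def mean_baseline_rmse(y_train, y_test):\n",
"    \"\"\"RMSE on y_test when every prediction is the mean of y_train.\"\"\"\n",
"    guess = np.mean(y_train)\n",
"    errors = np.asarray(y_test, dtype=float) - guess\n",
"    return float(np.sqrt(np.mean(errors ** 2)))\n",
"\n",
"# Made-up numbers, only to show the call: the guess is 500, each miss is 50\n",
"baseline = mean_baseline_rmse([400, 500, 600], [450, 550])\n",
"print(f\"baseline RMSE: {baseline:.3f}\")"
]
},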
{
"cell_type": "markdown",
"id": "cell-020",
"metadata": {},
"source": [
"## 4. Multiple regression and overfitting\n",
"\n",
"You can give a model more than one predictor at once."
]
},
{
"cell_type": "code",
"id": "cell-021",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: predict total stats from three of its component stats\n",
"predictors = [\"attack\", \"defense\", \"speed\"]\n",
"X = pokemon[predictors].astype(float)\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"multi_model = LinearRegression()\n",
"multi_model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, multi_model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, multi_model.predict(X_test))\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"print(dict(zip(predictors, multi_model.coef_)))"
]
},
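{
"cell_type": "markdown",
"id": "cell-021a",
"metadata": {},
"source": [
"Adding predictors lowers training RMSE even when the new predictors carry no information.\n",
"The sketch below uses synthetic data only (everything is made up, and names like `useful`\n",
"and `junk` are hypothetical): one column really drives the response, and up to twenty more\n",
"are random noise. Training RMSE should fall as noise columns are added, while test RMSE\n",
"typically does not improve and often gets worse."
]
},
{
"cell_type": "code",
"id": "cell-021b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n_rows = 60\n",
"useful = rng.normal(size=(n_rows, 1))    # one predictor that really matters\n",
"junk = rng.normal(size=(n_rows, 20))     # twenty columns of pure noise\n",
"fake_y = 3 * useful[:, 0] + rng.normal(size=n_rows)\n",
"\n",
"sketch_results = []\n",
"for n_junk in [0, 5, 20]:\n",
"    fake_X = np.hstack([useful, junk[:, :n_junk]])\n",
"    # Same random_state and row count, so every pass uses the same split\n",
"    Xa, Xb, ya, yb = train_test_split(fake_X, fake_y, test_size=0.5, random_state=0)\n",
"    sketch = LinearRegression().fit(Xa, ya)\n",
"    rmse_a = np.sqrt(np.mean((ya - sketch.predict(Xa)) ** 2))   # training RMSE\n",
"    rmse_b = np.sqrt(np.mean((yb - sketch.predict(Xb)) ** 2))   # test RMSE\n",
"    sketch_results.append((n_junk, rmse_a, rmse_b))\n",
"    print(f\"{n_junk:2d} noise columns  train RMSE: {rmse_a:.3f}  test RMSE: {rmse_b:.3f}\")"
]
},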
{
"cell_type": "markdown",
"id": "cell-022",
"metadata": {},
"source": [
"On a *training* set, a least-squares model's RMSE can only go down (or stay the same) as\n",
"you add predictors: the model could always set a new coefficient to zero and reproduce\n",
"the old fit, so extra predictors never hurt on the data it was fit to. That doesn't mean\n",
"the model is actually better: it might be **overfitting**, fitting noise specific to the\n",
"training data rather than a pattern that generalizes. Comparing training RMSE to test\n",
"RMSE is how you check.\n",
"\n",
"### Your turn\n",
"\n",
"So far you've predicted `income` from things like `education` and `health`. For this\n",
"multiple regression, predict `health` instead, from `education`, `income`,\n",
"`exercise`, `age`, and `no_doctor`. `health` moves from predictor to target, and\n",
"`income` takes its place as a predictor. (Cast `exercise` and `no_doctor` to `int`\n",
"first, as in section 2.) Predicting `health` from things you can\n",
"observe is arguably more realistic than predicting `income`: income is often public\n",
"record, but health is self-reported, so a model that estimates it from observable\n",
"behaviors has a more obvious real-world use. Record your results and answer the\n",
"questions about them in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-023",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: multiple regression on health\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}