{
"cells": [
{
"cell_type": "markdown",
"id": "cell-001",
"metadata": {},
"source": [
"# Estimating Income from Survey Data\n",
"\n",
"This notebook is the second half of the Estimation lab. In the first half, you fit a\n",
"line to data by hand using an interactive toy, building intuition for **parameters**,\n",
"**loss**, and **training**.\n",
"\n",
"Here, you'll apply the same ideas with code, on two datasets:\n",
"\n",
"- **Pok\u00e9mon stats**: the dataset from the Pok\u00e9mon lab, used here for **demos**. If\n",
"  you haven't seen this dataset before, it has one row per Pok\u00e9mon with battle stats like\n",
"  `attack` and `total` (the sum of all six stats).\n",
"- **BRFSS**: the same health survey dataset from the Pok\u00e9mon lab, used for **your\n",
"  turn** sections, where you repeat each technique and dig further.\n",
"\n",
"Each demo introduces one idea using Pok\u00e9mon; each \"your turn\" asks you to apply it to\n",
"BRFSS. Record your results and write your answers in `questions.md`, not here. The\n",
"checkpoints on the lab page tell you exactly what to answer and when."
]
},
{
"cell_type": "code",
"id": "cell-002",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from plotting import plot_regression\n",
"\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (6, 4)"
]
},
{
"cell_type": "markdown",
"id": "cell-003",
"metadata": {},
"source": [
"`scikit-learn` has a built-in function for computing root mean squared error (RMSE),\n",
"but it's worth seeing how it works. The cell below defines the same function described\n",
"on the lab page:"
]
},
{
"cell_type": "code",
"id": "cell-004",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def root_mean_squared_error(y_true, y_pred):\n",
"    \"\"\"Return the square root of the mean squared prediction error.\"\"\"\n",
"    errors = y_true - y_pred\n",
"    squared_errors = errors ** 2\n",
"    mean_squared_error = squared_errors.mean()\n",
"    return mean_squared_error ** 0.5"
]
},
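{
"cell_type": "markdown",
"id": "cell-004a",
"metadata": {},
"source": [
"As a quick check of the function above, here is a minimal sketch on made-up numbers\n",
"(they are not from either dataset). It computes RMSE step by step and compares the\n",
"result with the square root of `mean_squared_error` from `sklearn.metrics`. The `toy_`\n",
"names are placeholders for illustration only; the two printed values should agree."
]
},
{
"cell_type": "code",
"id": "cell-004b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Made-up values for illustration only (not from either dataset)\n",
"toy_true = np.array([10.0, 20.0, 30.0, 40.0])\n",
"toy_pred = np.array([12.0, 18.0, 33.0, 36.0])\n",
"\n",
"# Step by step: errors -> squared errors -> mean -> square root\n",
"toy_rmse = np.sqrt(np.mean((toy_true - toy_pred) ** 2))\n",
"\n",
"# Same quantity via scikit-learn: square root of its mean squared error\n",
"sklearn_rmse = np.sqrt(mean_squared_error(toy_true, toy_pred))\n",
"\n",
"print(f\"by hand: {toy_rmse:.3f}, via scikit-learn: {sklearn_rmse:.3f}\")"
]
},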
{
"cell_type": "code",
"id": "cell-005",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pokemon = pd.read_csv(\"pokemon.csv\")\n",
"people = pd.read_csv(\"brfss_2020.csv\")\n",
"print(f\"pokemon: {pokemon.shape}, people: {people.shape}\")\n",
"pokemon.head(3)"
]
},
{
"cell_type": "markdown",
"id": "cell-006",
"metadata": {},
"source": [
"## Column reference\n",
"\n",
"**pokemon** (one row per Pok\u00e9mon):\n",
"\n",
"| Column | Description |\n",
"|--------|-------------|\n",
"| `type`, `subtype` | Elemental type(s), e.g. `Grass`, `Poison` |\n",
"| `total` | Sum of all six stats below |\n",
"| `hp`, `attack`, `defense`, `special_attack`, `special_defense`, `speed` | Battle stats |\n",
"| `generation` | Which numbered generation of games introduced this Pok\u00e9mon (1-6) |\n",
"| `legendary` | Whether this Pok\u00e9mon is Legendary (True/False) |\n",
"\n",
"**people** (one row per BRFSS survey respondent):\n",
"\n",
"| Column | Description |\n",
"|--------|-------------|\n",
"| `age` | Age band (18, 25, 35, 45, 55, or 65 meaning 65+) |\n",
"| `sex` | `male` or `female` |\n",
"| `income` | Annual income band, 1 (under $10k) to 8 (over $75k) |\n",
"| `education` | Highest education level, 1 (did not graduate high school) to 4 (college graduate) |\n",
"| `sexual_orientation` | `heterosexual`, `homosexual`, `bisexual`, or `other` |\n",
"| `health` | Self-reported general health, 1 (poor) to 5 (excellent) |\n",
"| `no_doctor` | Couldn't afford to see a doctor in the last year (True/False) |\n",
"| `exercise` | Did any exercise in the last 30 days (True/False) |\n",
"| `sleep` | Average hours of sleep per night |"
]
},
{
"cell_type": "markdown",
"id": "cell-007",
"metadata": {},
"source": [
"## 1. Fitting a line to one predictor\n",
"\n",
"We'll reuse the same `fit`/`predict` pattern from the lab intro, but now `scikit-learn`\n",
"finds the parameters instead of you nudging them by hand. The demo below splits the\n",
"data, fits a line, and reports its slope, intercept, and **RMSE** (root mean squared\n",
"error) on both a training set and a held-out test set.\n",
"\n",
"RMSE is the same \"total squared error\" you minimized in the toy, but averaged over all\n",
"points instead of summed, then square-rooted so it's back in the original units of\n",
"the response (the value being predicted). A model with an RMSE of 8 is, roughly, \"off by\n",
"about 8 on a typical prediction.\"\n",
"\n",
"`plot_regression(X, y, model)`, imported from `plotting.py`, draws the scatterplot and\n",
"the model's line for you. It picks reasonable defaults for marker size, opacity, and\n",
"jitter based on how much data you're plotting, but if a particular plot still comes out\n",
"hard to read, you can override any of them: `plot_regression(X, y, model, size=10,\n",
"opacity=0.1, jitter=0.3)`. All three are purely cosmetic: they change how the plot\n",
"looks, never the data the model was trained on. You won't need to write any plotting\n",
"code yourself in this notebook."
]
},
{
"cell_type": "code",
"id": "cell-008",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: predict a Pokemon's total stats from its attack stat\n",
"X = pokemon[[\"attack\"]].astype(float)\n",
"y = pokemon[\"total\"]\n",
"\n",
"# Hold out 20% of the rows as a test set; random_state makes the split repeatable\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Fit on the training rows only\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Score on both sets to see how well the line generalizes\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
{
"cell_type": "markdown",
"id": "cell-009",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"Repeat the same steps (split, fit, predict, score, and plot) on `people` to predict\n",
"`income` from:\n",
"\n",
"1. `education`\n",
"2. `health`\n",
"3. A predictor of your choice\n",
"\n",
"Record your results and answer the questions about them in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-010",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: education -> income\n"
]
},
{
"cell_type": "code",
"id": "cell-011",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: health -> income\n"
]
},
{
"cell_type": "code",
"id": "cell-012",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: a predictor of your choice -> income\n"
]
},
{
"cell_type": "markdown",
"id": "cell-013",
"metadata": {},
"source": [
"## 2. Binary predictors: a coefficient is a group difference\n",
"\n",
"`legendary` only takes two values, `True` and `False`. There's no \"in between,\" so\n",
"the slope for a binary predictor isn't a rate of change; it's the model's predicted\n",
"*difference between the two groups*."
]
},
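{
"cell_type": "markdown",
"id": "cell-013a",
"metadata": {},
"source": [
"To see why, here is a small sketch on made-up numbers (not the Pok\u00e9mon data). With a\n",
"single 0/1 predictor, a least-squares line passes through the mean of each group, so\n",
"the intercept should match the mean of the 0 group and the slope should match the\n",
"difference between the two group means. The `toy_` names are placeholders for\n",
"illustration only."
]
},
{
"cell_type": "code",
"id": "cell-013b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Made-up data: a 0/1 group flag and a numeric response\n",
"toy_flag = np.array([0, 0, 0, 1, 1])\n",
"toy_response = np.array([300.0, 400.0, 500.0, 600.0, 700.0])\n",
"\n",
"# scikit-learn expects a 2-D predictor array, so reshape to one column\n",
"toy_model = LinearRegression().fit(toy_flag.reshape(-1, 1), toy_response)\n",
"\n",
"# Group means computed directly, for comparison with the fitted parameters\n",
"mean_0 = toy_response[toy_flag == 0].mean()\n",
"mean_1 = toy_response[toy_flag == 1].mean()\n",
"\n",
"print(f\"slope: {toy_model.coef_[0]:.1f}, difference in means: {mean_1 - mean_0:.1f}\")\n",
"print(f\"intercept: {toy_model.intercept_:.1f}, mean of the 0 group: {mean_0:.1f}\")"
]
},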
{
"cell_type": "code",
"id": "cell-014",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: legendary status -> total stats\n",
"pokemon[\"legendary_int\"] = pokemon[\"legendary\"].astype(int)\n",
"X = pokemon[[\"legendary_int\"]]\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
{
"cell_type": "markdown",
"id": "cell-015",
"metadata": {},
"source": [
"### Your turn\n",
"\n",
"`people` has two True/False columns: `exercise` and `no_doctor`. Pick one, cast it to\n",
"`int`, and repeat the same steps to predict `income` from it. Record your results and\n",
"answer the questions about it in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-016",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: exercise or no_doctor -> income\n"
]
},
{
"cell_type": "markdown",
"id": "cell-017",
"metadata": {},
"source": [
"## 3. Ordinal predictors: an assumption worth naming\n",
"\n",
"`generation` ranks Pok\u00e9mon games in the order they were released. It's an **ordinal**\n",
"label, not a true measurement. Treating it as a number assumes each step (1\u21922, 2\u21923, \u2026)\n",
"represents the same-sized \"distance,\" which may not be true at all."
]
},
{
"cell_type": "code",
"id": "cell-018",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: generation -> total stats\n",
"X = pokemon[[\"generation\"]].astype(float)\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, model.predict(X_test))\n",
"print(f\"slope: {model.coef_[0]:.3f}\")\n",
"print(f\"intercept: {model.intercept_:.3f}\")\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"\n",
"plot_regression(X, y, model)"
]
},
{
"cell_type": "markdown",
"id": "cell-019",
"metadata": {},
"source": [
"Notice that the test RMSE here is barely better than just guessing the mean every time:\n",
"generation tells you almost nothing about total stats.\n",
"\n",
"`education` and `health` in the BRFSS data are ordinal in the same way `generation`\n",
"is: the numbers encode an order, not an evenly spaced scale. `age`, on the other hand,\n",
"uses the actual lower bound of each band in years, which is a much more defensible\n",
"number to treat as continuous. Answer the question about this in `questions.md`."
]
},
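{
"cell_type": "markdown",
"id": "cell-019a",
"metadata": {},
"source": [
"One way to make \"guessing the mean every time\" concrete is to score that guess directly.\n",
"The sketch below defines `mean_baseline_rmse`, a hypothetical helper that is not part of\n",
"the lab code, and runs it on made-up numbers. To compare against the demo above, you\n",
"could call it with that demo's `y_train` and `y_test` and set the result next to\n",
"`test_rmse`."
]
},
{
"cell_type": "code",
"id": "cell-019b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def mean_baseline_rmse(y_train, y_test):\n",
"    \"\"\"RMSE of always predicting the training mean (hypothetical helper).\"\"\"\n",
"    guess = np.mean(y_train)\n",
"    return float(np.sqrt(np.mean((np.asarray(y_test) - guess) ** 2)))\n",
"\n",
"# Made-up numbers, only to show the call\n",
"print(mean_baseline_rmse([400, 500, 600], [450, 550]))"
]
},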
{
"cell_type": "markdown",
"id": "cell-020",
"metadata": {},
"source": [
"## 4. Multiple regression and overfitting\n",
"\n",
"You can give a model more than one predictor at once."
]
},
{
"cell_type": "code",
"id": "cell-021",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demo: predict total stats from three of its component stats\n",
"predictors = [\"attack\", \"defense\", \"speed\"]\n",
"X = pokemon[predictors].astype(float)\n",
"y = pokemon[\"total\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"multi_model = LinearRegression()\n",
"multi_model.fit(X_train, y_train)\n",
"\n",
"train_rmse = root_mean_squared_error(y_train, multi_model.predict(X_train))\n",
"test_rmse = root_mean_squared_error(y_test, multi_model.predict(X_test))\n",
"print(f\"train RMSE: {train_rmse:.3f}\")\n",
"print(f\"test RMSE: {test_rmse:.3f}\")\n",
"print(dict(zip(predictors, multi_model.coef_)))"
]
},
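{
"cell_type": "markdown",
"id": "cell-021a",
"metadata": {},
"source": [
"Extra predictors do not have to be useful to lower training error. The sketch below uses\n",
"made-up data (not the Pok\u00e9mon data): one predictor that really drives the response,\n",
"plus up to ten columns of pure noise. Because the same rows are used for training each\n",
"time, a least-squares fit cannot get a higher training RMSE when columns are added; what\n",
"happens to test RMSE is worth watching. All names here are placeholders for illustration."
]
},
{
"cell_type": "code",
"id": "cell-021b",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n_rows = 60\n",
"signal = rng.normal(size=n_rows)              # the one predictor that matters\n",
"noise_cols = rng.normal(size=(n_rows, 10))    # unrelated to the response\n",
"toy_y = 3 * signal + rng.normal(size=n_rows)\n",
"\n",
"def toy_rmse(actual, predicted):\n",
"    return float(np.sqrt(np.mean((actual - predicted) ** 2)))\n",
"\n",
"for n_extra in (0, 5, 10):\n",
"    # Real predictor plus the first n_extra noise columns\n",
"    toy_X = np.column_stack([signal, noise_cols[:, :n_extra]])\n",
"    # Same random_state, so every model trains on the same rows\n",
"    Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.5, random_state=0)\n",
"    fitted = LinearRegression().fit(Xtr, ytr)\n",
"    print(f\"{n_extra:2d} noise columns: \"\n",
"          f\"train RMSE {toy_rmse(ytr, fitted.predict(Xtr)):.3f}, \"\n",
"          f\"test RMSE {toy_rmse(yte, fitted.predict(Xte)):.3f}\")"
]
},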
{
"cell_type": "markdown",
"id": "cell-022",
"metadata": {},
"source": [
"On a *training* set, RMSE can only go down (or stay the same) as you add predictors,\n",
"because the model has more information to work with. That doesn't mean the model is\n",
"actually better: it might be **overfitting**, fitting noise specific to the training data\n",
"rather than a pattern that generalizes. Comparing training RMSE to test RMSE is how you\n",
"check.\n",
"\n",
"### Your turn\n",
"\n",
"So far you've predicted `income` from things like `education` and `health`. For this\n",
"multiple regression, predict `health` instead, from `education`, `income`,\n",
"`exercise`, `age`, and `no_doctor`. In other words, `health` moves from predictor to\n",
"target, and `income` takes its place as a predictor. Predicting `health` from things you\n",
"can observe is arguably more realistic than predicting `income`: income is often public\n",
"record, but health is self-reported, so a model that estimates it from observable\n",
"behaviors has a more obvious real-world use. Record your results and answer the\n",
"questions about it in `questions.md`."
]
},
{
"cell_type": "code",
"id": "cell-023",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your turn: multiple regression on health\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}