lab_pokemon/brfss_2020.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "58f71d07-849f-4cc0-9168-7303e795e244",
   "metadata": {},
   "source": [
    "# BRFSS 2020\n",
    "\n",
    "This lab uses a simplified subset of the BRFSS 2020 dataset, `brfss_2020.csv`. \n",
    "This notebook explains the variables included as well as the process used to produce this file. \n",
    "Read more about BRFSS at https://www.cdc.gov/brfss/annual_data/annual_2020.html\n",
    "\n",
    "**Note:** The simplified data set should not be used for serious statistical arguments. In the interest of making the data easier to understand, we only work with a skewed subset of the data. Specifically, this data set only includes people who answered all of the questions. \n",
    "\n",
    "## Codebook\n",
    "\n",
    "The following variables are included in the simplified dataset. When we talk about \"people\" in this lab, we're referring to the people who responded to the survey, not the whole US population. If you want more details on how questions were asked or how peoples' responses were recorded, please consult the [official codebook](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf).  \n",
    "\n",
    "### age\n",
    "\n",
    "Ages are grouped into age bands of 18-24, 25-34, 35-44, 45-54, 55-64, and 65+. \n",
    "\n",
    "| number | age range | \n",
    "| ------ | --------- |\n",
    "| 18     | 18-24     |\n",
    "| 25     | 25-34     |\n",
    "| 35     | 35-44     |\n",
    "| 45     | 45-54     |\n",
    "| 55     | 55-64     |\n",
    "| 65     | 65+       |\n",
    "\n",
    "### sex\n",
    "\n",
    "Sex only had options for `male` and `female`. In some cases, peoples' current sex is not the same as the sex they were assigned at birth. \n",
    "\n",
    "### income\n",
    "\n",
    "Income is grouped in the following bands. \n",
    "\n",
    "| number | annual income, in $1000   | \n",
    "| ------ | ------------------------- |\n",
    "| 1      | Less than 10              |\n",
    "| 2      | 10-15                     |\n",
    "| 3      | 15-20                     |\n",
    "| 4      | 20-25                     |\n",
    "| 5      | 25-35                     |\n",
    "| 6      | 35-50                     |\n",
    "| 7      | 50-75                     |\n",
    "| 8      | More than 75              |\n",
    "\n",
    "### education\n",
    "\n",
    "Education indicates the highest level of education completed, with codes as follows. \n",
    "\n",
    "| number | education level                   | \n",
    "| ------ | --------------------------------- |\n",
    "| 1      | Did not graduate from high school |\n",
    "| 2      | Graduated from high school        |\n",
    "| 3      | Attended some college             |\n",
    "| 4      | Graduated from college            |\n",
    "\n",
    "### sexual orientation\n",
    "\n",
    "Sexual orientation is reported as `heterosexual`, `homosexual`, `bisexual`, and `other`, with `other` including people who said something else, said they didn't understand the question, or chose not to answer.\n",
    "\n",
    "### height\n",
    "\n",
    "Height is reported in meters.\n",
    "\n",
    "### weight\n",
    "\n",
    "Weight is reported in kilograms.\n",
    "\n",
    "### health\n",
    "\n",
    "Health is peoples' estimate of their general health. \n",
    "\n",
    "| number | health status | \n",
    "| ------ | ------------- |\n",
    "| 1      | Poor          |\n",
    "| 2      | Fair          |\n",
    "| 3      | Good          |\n",
    "| 4      | Very good     |\n",
    "| 5      | Excellent     |\n",
    "\n",
    "### no_doctor\n",
    "\n",
    "No doctor is a boolean variable indicating whether there was a time in the last year when the person needed to see a doctor, but could not afford to do so.\n",
    "\n",
    "### exercise\n",
    "\n",
    "Exercise indicates whether a person has done any physical activity or exercise in the last 30 days, outside of work. \n",
    "\n",
    "### sleep\n",
    "\n",
    "Sleep reports the average hours of sleep a person gets per night.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7a7e909-8b44-43cd-af2a-87266c836668",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Preparing the simplified dataset\n",
    "\n",
    "The following code converts the full BRFSS 2020 dataset into the simplified version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e7b198b3-5781-4815-b862-101501a25782",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First, download and unzip https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip\n",
    "# You should now have a file called LLCP2020.XPT\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "def prepare_simplified_dataset():\n",
    "    df = pd.read_sas(\"LLCP2020.XPT\")\n",
    "    df = df[odf.DISPCODE == 1100]\n",
    "    df[\"sex\"] = df[\"SEXVAR\"].map({1: \"male\", 2: \"female\"})\n",
    "    df = df[df.GENHLTH <= 5]\n",
    "    df[\"health\"] = df.GENHLTH.map({1:5, 2:4, 3:3, 4:2, 5:1})\n",
    "    df = df[df.MEDCOST <= 2]\n",
    "    df[\"no_doctor\"] = df.MEDCOST.map({1: True, 2: False})\n",
    "    df = df[df.EXERANY2 <= 2]\n",
    "    df[\"exercise\"] = df.EXERANY2.map({1: True, 2: False})\n",
    "    df = df[df.SLEPTIM1 < 25]\n",
    "    df[\"sleep\"] = df.SLEPTIM1.astype(int)\n",
    "    df = df[df.INCOME2 < 9]\n",
    "    df[\"income\"] = df.INCOME2.astype(int)\n",
    "    df = df[~df.WTKG3.isna()]\n",
    "    df[\"weight\"] = df.WTKG3 / 100\n",
    "    df = df[~df.HTM4.isna()]\n",
    "    df[\"height\"] = df.HTM4 / 100\n",
    "    df = df[(df.SOFEMALE.isin([1, 2, 3, 4, 7, 9])) | (df.SOMALE.isin([1, 2, 3, 4, 7, 9]))]\n",
    "    df[\"sexual_orientation\"] = df.SOFEMALE\n",
    "    df[\"sexual_orientation\"].fillna(df.SOMALE, inplace=True)\n",
    "    df[\"sexual_orientation\"] = df[\"sexual_orientation\"].map({1: \"homosexual\", 2: \"heterosexual\", 3: \"bisexual\", 4: \"other\", 7: \"other\", 9: \"other\"})\n",
    "    df = df[df._EDUCAG.isin([1, 2, 3, 4])]\n",
    "    df[\"education\"] = df._EDUCAG.map({1: \"none_completed\", 2: \"high_school\", 3: \"some_college\", 4: \"college\"})\n",
    "    df[\"age\"] = df._AGE_G.map({1: 18, 2: 25, 3: 35, 4: 45, 5: 55, 6: 65})\n",
    "    df = df[[\"age\", \"sex\", \"income\", \"education\", \"sexual_orientation\", \"height\", \"weight\", \"health\", \"no_doctor\", \"exercise\", \"sleep\"]]\n",
    "    df.to_csv(\"brfss_2020.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e51cb566-e5d6-49ea-9789-26c5b57fef61",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}