# BRFSS 2020

This lab uses `brfss_2020.csv`, a simplified subset of the 2020 data from BRFSS (the Behavioral Risk Factor Surveillance System), an annual health survey run by the CDC.
This notebook describes the variables included in the file and the process used to produce it.
Read more about BRFSS at https://www.cdc.gov/brfss/annual_data/annual_2020.html

**Note:** The simplified dataset should not be used for serious statistical arguments. To make the data easier to understand, we only work with a skewed subset of the data. Specifically, this dataset only includes people who answered all of the questions described below. People who skipped a question may differ from those who answered, so summaries of this file need not match the full survey.
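
To see why keeping only complete responses can skew a result, here is a minimal sketch using made-up respondents (not BRFSS data). The `complete_cases` helper and all of the values are hypothetical, chosen so that the people who skipped a question also happen to sleep less.

```python
from statistics import mean

# Made-up respondents; None marks a question that was not answered.
respondents = [
    {"sleep": 8, "income": 6},
    {"sleep": 7, "income": 8},
    {"sleep": 5, "income": None},
    {"sleep": 4, "income": None},
]

def complete_cases(rows):
    """Keep only the rows in which every question has an answer."""
    return [row for row in rows if all(value is not None for value in row.values())]

# Average sleep over everyone versus over complete responses only.
all_mean = mean(row["sleep"] for row in respondents)
complete_mean = mean(row["sleep"] for row in complete_cases(respondents))

print(all_mean, complete_mean)  # 6 7.5
```

In this toy example, dropping the incomplete responses raises the average from 6 to 7.5 hours, even though nobody's answer changed.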

## Codebook

The following variables are included in the simplified dataset. When we talk about "people" in this lab, we're referring to the people who responded to the survey, not the whole US population. If you want more details on how questions were asked or how people's responses were recorded, please consult the [official codebook](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf).

### age

Ages are grouped into bands of 18-24, 25-34, 35-44, 45-54, 55-64, and 65+. Each band is coded by the youngest age it contains.

| number | age range | 
| ------ | --------- |
| 18     | 18-24     |
| 25     | 25-34     |
| 35     | 35-44     |
| 45     | 45-54     |
| 55     | 55-64     |
| 65     | 65+       |
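
In the table above, each number is the youngest age in its band, so the band for an exact age can be found with a sorted search. The following is only a sketch; `age_code` and `AGE_LABELS` are hypothetical helper names, not part of the dataset.

```python
from bisect import bisect_right

# Codes and labels copied from the table above.
AGE_CODES = [18, 25, 35, 45, 55, 65]
AGE_LABELS = {18: "18-24", 25: "25-34", 35: "35-44", 45: "45-54", 55: "55-64", 65: "65+"}

def age_code(years):
    """Return the age band code (the band's youngest age) for an age in years."""
    if years < AGE_CODES[0]:
        raise ValueError("the bands in this dataset start at age 18")
    # bisect_right counts how many band starts are <= years;
    # the last of those is the band this age belongs to.
    return AGE_CODES[bisect_right(AGE_CODES, years) - 1]

print(age_code(30), AGE_LABELS[age_code(30)])  # 25 25-34
```

Going the other way, `AGE_LABELS` turns a code from the `age` column back into a readable label.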

### sex

The survey only offered `male` and `female` as options for sex. In some cases, a person's current sex is not the same as the sex they were assigned at birth.

### income

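Annual household income is grouped into the following bands. Each band includes its lower bound but not its upper bound, so an income of exactly $15,000 falls in band 3.

| number | annual household income, in $1000 |
| ------ | --------------------------------- |
| 1      | Less than 10                      |
| 2      | 10-15                             |
| 3      | 15-20                             |
| 4      | 20-25                             |
| 5      | 25-35                             |
| 6      | 35-50                             |
| 7      | 50-75                             |
| 8      | 75 or more                        |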

### education

Education indicates the highest level of education completed. The preparation code at the end of this notebook stores it as one of the following text values.

| value            | education level                   |
| ---------------- | --------------------------------- |
| `none_completed` | Did not graduate from high school |
| `high_school`    | Graduated from high school        |
| `some_college`   | Attended some college             |
| `college`        | Graduated from college            |

### sexual_orientation

Sexual orientation is reported as `heterosexual`, `homosexual`, `bisexual`, and `other`, with `other` including people who said something else, said they didn't understand the question, or chose not to answer.

### height

Height is reported in meters.

### weight

Weight is reported in kilograms.
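
Because height is in meters and weight is in kilograms, the two columns can be combined directly in formulas that expect metric units. As one illustration, body mass index is weight divided by height squared. BMI is not a column in the simplified dataset, and the `bmi` helper below is a hypothetical name used only for this sketch.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters, squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# A person who weighs 70 kg and is 1.75 m tall.
print(round(bmi(70, 1.75), 1))  # 22.9
```

If height were still in centimeters, the same formula would give a value 10,000 times too small, which is why the units matter.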

### health

Health is people's own estimate of their general health. Higher numbers mean better health.

| number | health status | 
| ------ | ------------- |
| 1      | Poor          |
| 2      | Fair          |
| 3      | Good          |
| 4      | Very good     |
| 5      | Excellent     |
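
The raw BRFSS variable behind this column, `GENHLTH`, runs in the opposite direction (1 is excellent, 5 is poor). The preparation code in this notebook reverses it with the mapping `{1:5, 2:4, 3:3, 4:2, 5:1}`. The sketch below shows that this mapping is the same as subtracting the raw code from 6; `health_from_genhlth` and `HEALTH_LABELS` are hypothetical names for illustration.

```python
# Labels copied from the table above.
HEALTH_LABELS = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very good", 5: "Excellent"}

def health_from_genhlth(genhlth):
    """Convert a raw GENHLTH code (1 = excellent ... 5 = poor) to the `health` scale."""
    code = int(genhlth)
    if code not in range(1, 6):
        raise ValueError("GENHLTH codes above 5 are not usable answers")
    return 6 - code

print(HEALTH_LABELS[health_from_genhlth(1)])  # Excellent
```

With the reversed scale, a larger `health` value always means better self-reported health, which makes averages and comparisons easier to read.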

### no_doctor

`no_doctor` is a boolean variable indicating whether there was a time in the last year when the person needed to see a doctor but could not afford to do so.

### exercise

Exercise is a boolean variable indicating whether a person has done any physical activity or exercise in the last 30 days, outside of work.

### sleep

Sleep reports the average number of hours a person sleeps in a 24-hour period, as a whole number.


---

## Preparing the simplified dataset

The following code converts the full BRFSS 2020 dataset into the simplified version.

```python
# First, download and unzip https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip
# You should now have a file called LLCP2020.XPT

import pandas as pd

def prepare_simplified_dataset():
    """Read the full BRFSS 2020 file and write the simplified brfss_2020.csv."""
    df = pd.read_sas("LLCP2020.XPT")

    # Codes that count as a response to the sexual orientation question.
    so_codes = [1, 2, 3, 4, 7, 9]

    # Keep completed interviews (DISPCODE 1100) with a usable value for every variable.
    # Comparisons against missing values are False, so those rows are dropped too.
    keep = (
        (df["DISPCODE"] == 1100)
        & (df["GENHLTH"] <= 5)
        & (df["MEDCOST"] <= 2)
        & (df["EXERANY2"] <= 2)
        & (df["SLEPTIM1"] < 25)
        & (df["INCOME2"] < 9)
        & df["WTKG3"].notna()
        & df["HTM4"].notna()
        & (df["SOFEMALE"].isin(so_codes) | df["SOMALE"].isin(so_codes))
        & df["_EDUCAG"].isin([1, 2, 3, 4])
    )
    # .copy() gives an independent frame, so adding columns below raises no warnings.
    df = df[keep].copy()

    df["sex"] = df["SEXVAR"].map({1: "male", 2: "female"})
    # GENHLTH runs from 1 (excellent) to 5 (poor); reverse it so that higher is healthier.
    df["health"] = df["GENHLTH"].map({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
    df["no_doctor"] = df["MEDCOST"].map({1: True, 2: False})
    df["exercise"] = df["EXERANY2"].map({1: True, 2: False})
    df["sleep"] = df["SLEPTIM1"].astype(int)
    df["income"] = df["INCOME2"].astype(int)
    df["weight"] = df["WTKG3"] / 100  # WTKG3 stores kilograms with two implied decimals
    df["height"] = df["HTM4"] / 100   # HTM4 stores centimeters
    # Use SOFEMALE where it was answered and fall back to SOMALE otherwise.
    df["sexual_orientation"] = df["SOFEMALE"].fillna(df["SOMALE"]).map(
        {1: "homosexual", 2: "heterosexual", 3: "bisexual", 4: "other", 7: "other", 9: "other"}
    )
    df["education"] = df["_EDUCAG"].map(
        {1: "none_completed", 2: "high_school", 3: "some_college", 4: "college"}
    )
    df["age"] = df["_AGE_G"].map({1: 18, 2: 25, 3: 35, 4: 45, 5: 55, 6: 65})

    columns = ["age", "sex", "income", "education", "sexual_orientation", "height",
               "weight", "health", "no_doctor", "exercise", "sleep"]
    df[columns].to_csv("brfss_2020.csv", index=False)
```
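
Most steps in the code above follow the same pattern: keep the rows whose raw code is a usable answer, then recode that answer into a friendlier value. The sketch below shows the pattern on a tiny made-up table, not on BRFSS data. The `keep_and_recode` helper is a hypothetical name, and the codes 7 and 9 stand in for "don't know" and "refused" responses, which is our reading of how the official codebook labels them.

```python
import pandas as pd

# Made-up raw responses: 1 = yes, 2 = no, 7 and 9 = no usable answer, None = missing.
raw = pd.DataFrame({"MEDCOST": [1.0, 2.0, 7.0, 9.0, None, 2.0]})

def keep_and_recode(df, column, new_name, mapping):
    """Keep rows whose `column` value is a key of `mapping`, then add the recoded column."""
    kept = df[df[column].isin(list(mapping))].copy()  # copy so the assignment is safe
    kept[new_name] = kept[column].map(mapping)
    return kept

clean = keep_and_recode(raw, "MEDCOST", "no_doctor", {1: True, 2: False})
print(clean["no_doctor"].tolist())  # [True, False, False]
```

Three of the six made-up rows survive, which mirrors how each filter in the real preparation code shrinks the dataset a little further.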