Compare commits

6 Commits

Author SHA1 Message Date
Chris Proctor 1e649dfc21 Package mode in pyproject 2025-01-13 23:29:46 -05:00
Chris Proctor 1fb67edff9 Poetry update 2025-01-13 21:51:41 -05:00
Chris Proctor 8368652725 Upgrade pyproject for poetry 2 spec 2025-01-13 17:49:56 -05:00
Chris Proctor ad567ef534 Python version to 3.10 2023-08-17 13:25:33 -04:00
Chris Proctor 7cf6016a4c Update gitignore 2023-07-31 10:56:09 -04:00
Chris Proctor 987703b6b4 Update j notebook to lab 2023-07-31 10:55:38 -04:00
11 changed files with 3206 additions and 12432 deletions

@@ -1,5 +1,16 @@
-# Title. No more than 50 characters ----> |
-# Leave a single blank line between the title and the body (excluding comments)
-# Body. Write a description of what you've changed.
+# -----------------------------------------------------------------
+# Write your entire commit message above this line.
+#
+# The first line should be a quick description of what you changed.
+# Then leave a blank line.
+# Then, taking as many lines as you want, answer the questions below
+# (you only need to answer these once).
+#
+# In the Unit 2 project, you will define some research questions and
+# select your own data set. What are some possible questions you
+# would be interested in exploring? Can you think of any data sets that
+# might be interesting to explore? These could be public data sets,
+# or they could be private data sets related to your work or your own
+# life (you don't need to share your data set in the unit project).

2
.gitignore vendored
@@ -1 +1 @@
-poetry.lock
+.ipynb_checkpoints/*

@@ -1,183 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "58f71d07-849f-4cc0-9168-7303e795e244",
"metadata": {},
"source": [
"# BRFSS 2020\n",
"\n",
"This lab uses a simplified subset of the BRFSS 2020 dataset, `brfss_2020.csv`. \n",
"This notebook explains the variables included as well as the process used to produce this file. \n",
"Read more about BRFSS at https://www.cdc.gov/brfss/annual_data/annual_2020.html\n",
"\n",
"**Note:** The simplified data set should not be used for serious statistical arguments. In the interest of making the data easier to understand, we only work with a skewed subset of the data. Specifically, this data set only includes people who answered all of the questions. \n",
"\n",
"## Codebook\n",
"\n",
"The following variables are included in the simplified dataset. When we talk about \"people\" in this lab, we're referring to the people who responded to the survey, not the whole US population. If you want more details on how questions were asked or how people's responses were recorded, please consult the [official codebook](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf). \n",
"\n",
"### age\n",
"\n",
"Ages are grouped into age bands of 18-24, 25-34, 35-44, 45-54, 55-64, and 65+. \n",
"\n",
"| number | age range | \n",
"| ------ | --------- |\n",
"| 18 | 18-24 |\n",
"| 25 | 25-34 |\n",
"| 35 | 35-44 |\n",
"| 45 | 45-54 |\n",
"| 55 | 55-64 |\n",
"| 65 | 65+ |\n",
"\n",
"### sex\n",
"\n",
"Sex only had options for `male` and `female`. In some cases, people's current sex is not the same as the sex they were assigned at birth. \n",
"\n",
"### income\n",
"\n",
"Income is grouped in the following bands. \n",
"\n",
"| number | annual income, in $1000 | \n",
"| ------ | ------------------------- |\n",
"| 1 | Less than 10 |\n",
"| 2 | 10-15 |\n",
"| 3 | 15-20 |\n",
"| 4 | 20-25 |\n",
"| 5 | 25-35 |\n",
"| 6 | 35-50 |\n",
"| 7 | 50-75 |\n",
"| 8 | More than 75 |\n",
"\n",
"### education\n",
"\n",
"Education indicates the highest level of education completed, with codes as follows. \n",
"\n",
"| number | education level | \n",
"| ------ | --------------------------------- |\n",
"| 1 | Did not graduate from high school |\n",
"| 2 | Graduated from high school |\n",
"| 3 | Attended some college |\n",
"| 4 | Graduated from college |\n",
"\n",
"### sexual orientation\n",
"\n",
"Sexual orientation is reported as `heterosexual`, `homosexual`, `bisexual`, or `other`. The `other` category includes people who said something else, said they didn't understand the question, or chose not to answer.\n",
"\n",
"### height\n",
"\n",
"Height is reported in meters.\n",
"\n",
"### weight\n",
"\n",
"Weight is reported in kilograms.\n",
"\n",
"### health\n",
"\n",
"Health is people's estimate of their own general health. \n",
"\n",
"| number | health status | \n",
"| ------ | ------------- |\n",
"| 1 | Poor |\n",
"| 2 | Fair |\n",
"| 3 | Good |\n",
"| 4 | Very good |\n",
"| 5 | Excellent |\n",
"\n",
"### no_doctor\n",
"\n",
"No doctor is a boolean variable indicating whether there was a time in the last year when the person needed to see a doctor, but could not afford to do so.\n",
"\n",
"### exercise\n",
"\n",
"Exercise indicates whether a person has done any physical activity or exercise in the last 30 days, outside of work. \n",
"\n",
"### sleep\n",
"\n",
"Sleep reports the average hours of sleep a person gets per night.\n"
]
},
{
"cell_type": "markdown",
"id": "f7a7e909-8b44-43cd-af2a-87266c836668",
"metadata": {},
"source": [
"---\n",
"\n",
"## Preparing the simplified dataset\n",
"\n",
"The following code converts the full BRFSS 2020 dataset into the simplified version."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e7b198b3-5781-4815-b862-101501a25782",
"metadata": {},
"outputs": [],
"source": [
"# First, download and unzip https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip\n",
"# You should now have a file called LLCP2020.XPT\n",
"\n",
"import pandas as pd\n",
"\n",
"def prepare_simplified_dataset():\n",
" df = pd.read_sas(\"LLCP2020.XPT\")\n",
" df = df[df.DISPCODE == 1100]\n",
" df[\"sex\"] = df[\"SEXVAR\"].map({1: \"male\", 2: \"female\"})\n",
" df = df[df.GENHLTH <= 5]\n",
" df[\"health\"] = df.GENHLTH.map({1:5, 2:4, 3:3, 4:2, 5:1})\n",
" df = df[df.MEDCOST <= 2]\n",
" df[\"no_doctor\"] = df.MEDCOST.map({1: True, 2: False})\n",
" df = df[df.EXERANY2 <= 2]\n",
" df[\"exercise\"] = df.EXERANY2.map({1: True, 2: False})\n",
" df = df[df.SLEPTIM1 < 25]\n",
" df[\"sleep\"] = df.SLEPTIM1.astype(int)\n",
" df = df[df.INCOME2 < 9]\n",
" df[\"income\"] = df.INCOME2.astype(int)\n",
" df = df[~df.WTKG3.isna()]\n",
" df[\"weight\"] = df.WTKG3 / 100\n",
" df = df[~df.HTM4.isna()]\n",
" df[\"height\"] = df.HTM4 / 100\n",
" df = df[(df.SOFEMALE.isin([1, 2, 3, 4, 7, 9])) | (df.SOMALE.isin([1, 2, 3, 4, 7, 9]))]\n",
" df[\"sexual_orientation\"] = df.SOFEMALE.fillna(df.SOMALE)\n",
" df[\"sexual_orientation\"] = df[\"sexual_orientation\"].map({1: \"homosexual\", 2: \"heterosexual\", 3: \"bisexual\", 4: \"other\", 7: \"other\", 9: \"other\"})\n",
" df = df[df._EDUCAG.isin([1, 2, 3, 4])]\n",
" df[\"education\"] = df._EDUCAG.map({1: \"none_completed\", 2: \"high_school\", 3: \"some_college\", 4: \"college\"})\n",
" df[\"age\"] = df._AGE_G.map({1: 18, 2: 25, 3: 35, 4: 45, 5: 55, 6: 65})\n",
" df = df[[\"age\", \"sex\", \"income\", \"education\", \"sexual_orientation\", \"height\", \"weight\", \"health\", \"no_doctor\", \"exercise\", \"sleep\"]]\n",
" df.to_csv(\"brfss_2020.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e51cb566-e5d6-49ea-9789-26c5b57fef61",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
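
As an illustration of how the codebook in the removed notebook could be applied, the sketch below decodes the numeric `age`, `income`, and `health` codes into their labels. It assumes the CSV has the columns described in the codebook. The helper name `decode_codes` and the inline sample rows are made up for this example and are not part of the lab.

```python
import io

import pandas as pd

# Code-to-label tables copied from the notebook's codebook.
AGE_BANDS = {18: "18-24", 25: "25-34", 35: "35-44", 45: "45-54", 55: "55-64", 65: "65+"}
INCOME_BANDS = {
    1: "Less than 10", 2: "10-15", 3: "15-20", 4: "20-25",
    5: "25-35", 6: "35-50", 7: "50-75", 8: "More than 75",
}
HEALTH_LEVELS = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very good", 5: "Excellent"}


def decode_codes(df):
    """Return a copy of df with human-readable *_label columns (hypothetical helper)."""
    out = df.copy()
    out["age_label"] = out["age"].map(AGE_BANDS)
    out["income_label"] = out["income"].map(INCOME_BANDS)
    out["health_label"] = out["health"].map(HEALTH_LEVELS)
    return out


# Made-up rows standing in for brfss_2020.csv; these are not real survey responses.
SAMPLE_CSV = """age,income,health,sleep
18,1,5,8
65,8,2,6
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
decoded = decode_codes(sample)
print(decoded[["age_label", "income_label", "health_label"]])
```

With the real file, `pd.read_csv("brfss_2020.csv")` would take the place of the inline sample. Keeping the numeric codes alongside the labels preserves their ordering for sorting and plotting.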

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@@ -175,7 +175,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.3"
+"version": "3.9.10"
 }
 },
 "nbformat": 4,

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

3071
poetry.lock generated Normal file

File diff suppressed because it is too large Load Diff

@@ -1,20 +1,25 @@
-[tool.poetry]
-name = "lab_pokemon"
+[project]
+name = "lab-pokemon"
 version = "0.1.0"
 description = ""
-authors = ["Chris <chris@chrisproctor.net>"]
-license = "MIT"
+authors = [
+{name = "Chris Proctor",email = "chris@chrisproctor.net"}
+]
+license = {text = "MIT"}
+readme = "README.md"
+requires-python = ">=3.10,<4.0"
+dependencies = [
+"jupyter (>=1.1.1,<2.0.0)",
+"pandas (>=2.2.3,<3.0.0)",
+"matplotlib (>=3.10.0,<4.0.0)",
+"seaborn (>=0.13.2,<0.14.0)",
+"requests (>=2.32.3,<3.0.0)"
+]
-[tool.poetry.dependencies]
-python = "^3.11"
-jupyterlab = "^3.3.3"
-pandas = "^1.4.2"
-matplotlib = "^3.5.1"
-seaborn = "^0.11.2"
-requests = "^2.27.1"
-[tool.poetry.dev-dependencies]
 [build-system]
-requires = ["poetry-core>=1.0.0"]
+requires = ["poetry-core>=2.0.0,<3.0.0"]
 build-backend = "poetry.core.masonry.api"
+[tool.poetry]
+package-mode = false
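
The `dependencies` list in the `[project]` table uses requirement strings of the form `name (constraints)`. As a rough sketch of how such a string breaks down into a package name and its version constraints, the snippet below uses only the standard library. `split_requirement` is a hypothetical helper. It handles only the simple form seen in this file, not the full dependency-specifier grammar (extras, environment markers, URLs).

```python
import re

# Matches "name (constraint, constraint, ...)" as written in this pyproject.toml.
_REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*\(([^)]*)\)\s*$")


def split_requirement(requirement):
    """Split 'name (c1,c2)' into (name, [c1, c2]); hypothetical helper."""
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError(f"unsupported requirement string: {requirement!r}")
    name, constraints = match.groups()
    return name, [c.strip() for c in constraints.split(",") if c.strip()]


# Requirement strings copied from the new dependencies list.
DEPENDENCIES = [
    "jupyter (>=1.1.1,<2.0.0)",
    "pandas (>=2.2.3,<3.0.0)",
    "matplotlib (>=3.10.0,<4.0.0)",
    "seaborn (>=0.13.2,<0.14.0)",
    "requests (>=2.32.3,<3.0.0)",
]

for dependency in DEPENDENCIES:
    name, constraints = split_requirement(dependency)
    print(name, constraints)
```

For anything beyond a quick inspection, a dedicated parser such as the `packaging` library is the safer choice. The regex here is only meant to make the string format visible.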

@@ -1,2 +0,0 @@
-pandas
-seaborn