completed proposal.md related to birthrate data

I'm going to be using frequency data for births on given dates over the course of a 15-year period (2000-2014). This will involve a few plots and probably some data tables to represent patterns in the data and dive into the birthday paradox.
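
As a rough sketch of the kind of data table described above, the snippet below averages daily births per (year, month). It uses only the standard library and the five rows shown in the `df.head()` output of this commit, so it runs without the full CSV. The function name `mean_births_by_month` is hypothetical and not part of the project; in the notebook, a pandas `groupby` on the same columns would likely be the more natural route.

```python
import csv
import io
from collections import defaultdict

# First five rows of US_births_2000-2014_SSA.csv, as shown by df.head() in this commit.
SAMPLE_CSV = """year,month,date_of_month,day_of_week,births
2000,1,1,6,9083
2000,1,2,7,8006
2000,1,3,1,11363
2000,1,4,2,13032
2000,1,5,3,12558
"""


def mean_births_by_month(rows):
    """Return {(year, month): mean daily births} for an iterable of row dicts.

    Hypothetical helper for illustration; assumes the column names used in the CSV.
    """
    totals = defaultdict(int)  # sum of births per (year, month)
    days = defaultdict(int)    # number of daily rows per (year, month)
    for row in rows:
        key = (int(row["year"]), int(row["month"]))
        totals[key] += int(row["births"])
        days[key] += 1
    return {key: totals[key] / days[key] for key in totals}


monthly = mean_births_by_month(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for (year, month), mean in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {mean:.1f} births/day")
```

Run against the full file, a table like this (one value per month per year) could be plotted to compare seasonal patterns across years.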
2023-08-16 12:22:46 -04:00 · 2023-08-16 12:22:46 -04:00 · 865134b046
parent 4234089432
commit 865134b046
8 changed files with 6046 additions and 11 deletions
--- a/.ipynb_checkpoints/Untitled-checkpoint.ipynb
+++ b/.ipynb_checkpoints/Untitled-checkpoint.ipynb
@@ -0,0 +1,6 @@
+{
+ "cells": [],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/.ipynb_checkpoints/argument-checkpoint.ipynb
+++ b/.ipynb_checkpoints/argument-checkpoint.ipynb
@@ -0,0 +1,360 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "worldwide-blood",
+   "metadata": {},
+   "source": [
+    "# Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "understanding-numbers",
+   "metadata": {},
+   "source": [
+    "*✏️ Write 2-3 sentences describing your research.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "greater-circular",
+   "metadata": {},
+   "source": [
+    "## Overarching Question: [✏️ PUT YOUR QUESTION HERE ✏️]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "appreciated-testimony",
+   "metadata": {},
+   "source": [
+    "*✏️ Write 2-3 sentences explaining why this question.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "permanent-pollution",
+   "metadata": {},
+   "source": [
+    "# Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "technical-evans",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Include any import statements you will need\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "overhead-sigma",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "### 💻 FILL IN YOUR DATASET FILE NAME BELOW 💻 ###\n",
+    "\n",
+    "file_name = \"YOUR_DATASET_FILE_NAME.csv\"\n",
+    "dataset_path = \"data/\" + file_name\n",
+    "\n",
+    "df = pd.read_csv(dataset_path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "heated-blade",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "continental-franklin",
+   "metadata": {},
+   "source": [
+    "**Data Overview**\n",
+    "\n",
+    "*✏️ Write 2-3 sentences describing this dataset. Be sure to include where the data comes from and what it contains.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "infinite-instrument",
+   "metadata": {},
+   "source": [
+    "# Methods and Results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "basic-canadian",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Import any helper files you need here"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "recognized-positive",
+   "metadata": {},
+   "source": [
+    "## First Research Question: [✏️ PUT YOUR QUESTION HERE ✏️]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "graduate-palmer",
+   "metadata": {},
+   "source": [
+    "### Methods"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "endless-variation",
+   "metadata": {},
+   "source": [
+    "*Explain how you will approach this research question below. Consider the following:* \n",
+    "  - *Which aspects of the dataset will you use?* \n",
+    "  - *How will you reorganize/store the data?* \n",
+    "  - *What data science tools/functions will you use and why?* \n",
+    "  \n",
+    "✏️ *Write your answer below:*\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "portuguese-japan",
+   "metadata": {},
+   "source": [
+    "### Results "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "negative-highlight",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#######################################################################\n",
+    "### 💻 YOUR WORK GOES HERE TO ANSWER THE FIRST RESEARCH QUESTION 💻 \n",
+    "### \n",
+    "### Your data analysis may include a statistic and/or a data visualization\n",
+    "#######################################################################"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "victorian-burning",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 💻 YOU CAN ADD NEW CELLS WITH THE \"+\" BUTTON "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "collectible-puppy",
+   "metadata": {},
+   "source": [
+    "## Second Research Question: [✏️ PUT YOUR QUESTION HERE ✏️]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "demographic-future",
+   "metadata": {},
+   "source": [
+    "### Methods"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "incorporate-roller",
+   "metadata": {},
+   "source": [
+    "*Explain how you will approach this research question below. Consider the following:* \n",
+    "  - *Which aspects of the dataset will you use?* \n",
+    "  - *How will you reorganize/store the data?* \n",
+    "  - *What data science tools/functions will you use and why?* \n",
+    "\n",
+    "✏️ *Write your answer below:*\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "juvenile-creation",
+   "metadata": {},
+   "source": [
+    "### Results "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "pursuant-surrey",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#######################################################################\n",
+    "### 💻 YOUR WORK GOES HERE TO ANSWER THE SECOND RESEARCH QUESTION 💻 \n",
+    "###\n",
+    "### Your data analysis may include a statistic and/or a data visualization\n",
+    "#######################################################################"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "located-night",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 💻 YOU CAN ADD NEW CELLS WITH THE \"+\" BUTTON "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "infectious-symbol",
+   "metadata": {},
+   "source": [
+    "# Discussion"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "furnished-camping",
+   "metadata": {
+    "code_folding": []
+   },
+   "source": [
+    "## Considerations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bearing-stadium",
+   "metadata": {},
+   "source": [
+    "*It's important to recognize the limitations of our research.\n",
+    "Consider the following:*\n",
+    "\n",
+    "- *Do the results give an accurate depiction of your research question? Why or why not?*\n",
+    "- *What were limitations of your dataset?*\n",
+    "- *Are there any known biases in the data?*\n",
+    "\n",
+    "✏️ *Write your answer below:*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "beneficial-invasion",
+   "metadata": {},
+   "source": [
+    "## Summary"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "about-raise",
+   "metadata": {},
+   "source": [
+    "*Summarize what you discovered through the research. Consider the following:*\n",
+    "\n",
+    "- *What did you learn about your media consumption/digital habits?*\n",
+    "- *Did the results make sense?*\n",
+    "- *What was most surprising?*\n",
+    "- *How will this project impact you going forward?*\n",
+    "\n",
+    "✏️ *Write your answer below:*"
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_json": true,
+   "text_representation": {
+    "extension": ".Rmd",
+    "format_name": "rmarkdown",
+    "format_version": "1.2",
+    "jupytext_version": "1.9.1"
+   }
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": false,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/.ipynb_checkpoints/proposal-checkpoint.md
+++ b/.ipynb_checkpoints/proposal-checkpoint.md
@@ -0,0 +1,64 @@
+# Project proposal
+
+This planning document will also form the introduction of your 
+argument.
+
+## Overarching Question
+
+### What central question are you interested in exploring? Why are you interested in exploring this question?
+
+*This should be the big picture question that you ask; use at least 5
+sentences to describe why you are interested in it.*
+
+I originally planned to look at weather data, but after reviewing the available datasets I'm switching to
+birth data instead. I'm interested in trends in birth dates and whether there is a seasonality
+to when children are born. I'm also interested in whether this seasonality (if present) shifts over the course
+of several years (I have data from 2000-2014). Part of why I am interested in this trend is a statement I've heard
+about the birthday paradox: with 23 people in a room, there is roughly a 50/50 chance that two of them share a birthday. With
+75 people in a room, the chance is over 99.9%. I'm curious whether this is valid and, if it is, whether it is linked to
+a seasonality in which some range of dates (maybe a large one?) is substantially more common, which would lead to this statistical
+occurrence.
+
+### What specific research questions will you investigate?
+
+*List 2-4 specific research questions. Each should be answerable 
+using your data set.*
+
+1. Is there a pattern of seasonality in data showing the number of births each day over a given year?
+2. Does the pattern of seasonality shift over the course of a number of years?
+3. How well does this data support the birthday paradox, i.e. the stated probability that two people in a group share a birthday?
+
+## Data source
+
+### What data set will you use to answer your overarching question? 
+
+*Give the title of your data set and provide a link to your data.*
+
+https://github.com/fivethirtyeight/data/blob/master/births/US_births_2000-2014_SSA.csv
+"US births 2000-2014"
+
+### Where is this data from?
+
+*Describe the source of the data set--not just where you downloaded it, but
+the person or organization who gathered the data. Explain why you trust them.*
+
+The data is available through FiveThirtyEight, but the original data comes from the United States' CDC, NCHS, and SSA. FiveThirtyEight generally has a reputation for being politically centrist/neutral, and in this case there is little room for bias in the specific data presented. The data comes from numbers reported by government agencies that track births in the US, and while it may not account for some off-the-grid births, any baby with a birth certificate should be accounted for in this dataset.
+
+### What is this data about?
+
+*Describe the nature of the data in the dataset, including the number of rows 
+and some of the columns which will be important to you.*
+
+The dataset file has 5480 lines: a header plus 5479 rows, one for each date from 1/1/2000 through 12/31/2014. The columns give the year, month, and day of the month, as well as the day of the week (presented as a number 1-7) and the number of births occurring on that date. The day of the week is included because FiveThirtyEight used this column in an article about fewer babies being born on Friday the 13th than would be expected statistically, but it will not be of particular use to me here.
+
+## Methods 
+
+### How will you use your data set to answer your quantitative questions?
+
+*For each research question, explain what you will do with the data set 
+to answer the question, and how you will present your answer (e.g. a chart or a table).*
+
+Questions 1 and 2 will both involve creating a plot of birth frequency vs. date for each year available. Question 1
+will look at each year's plot as an individual dataset; question 2 will focus on relating those datasets to each other.
+
+Question 3 will involve calculating the probability of a shared birthday using a birthday paradox formula. The results will be presented in a few tables showing the probability for different numbers of people.
--- a/Untitled.ipynb
+++ b/Untitled.ipynb
@@ -0,0 +1,6 @@
+{
+ "cells": [],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/argument.ipynb
+++ b/argument.ipynb
@@ -42,26 +42,31 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
   "id": "technical-evans",
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
   "outputs": [],
   "source": [
    "#Include any import statements you will need\n",
    "import pandas as pd\n",
-    "import matplotlib.pyplot as plt"
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
   "id": "overhead-sigma",
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
   "outputs": [],
   "source": [
    "### 💻 FILL IN YOUR DATASET FILE NAME BELOW 💻 ###\n",
    "\n",
-    "file_name = \"YOUR_DATASET_FILE_NAME.csv\"\n",
+    "file_name = \"US_births_2000-2014_SSA.csv\"\n",
    "dataset_path = \"data/\" + file_name\n",
    "\n",
    "df = pd.read_csv(dataset_path)"
@@ -69,10 +74,99 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 6,
   "id": "heated-blade",
-   "metadata": {},
-   "outputs": [],
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>year</th>\n",
+       "      <th>month</th>\n",
+       "      <th>date_of_month</th>\n",
+       "      <th>day_of_week</th>\n",
+       "      <th>births</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>2000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>6</td>\n",
+       "      <td>9083</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>2000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>2</td>\n",
+       "      <td>7</td>\n",
+       "      <td>8006</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>3</td>\n",
+       "      <td>1</td>\n",
+       "      <td>11363</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>2000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>4</td>\n",
+       "      <td>2</td>\n",
+       "      <td>13032</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>2000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>5</td>\n",
+       "      <td>3</td>\n",
+       "      <td>12558</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   year  month  date_of_month  day_of_week  births\n",
+       "0  2000      1              1            6    9083\n",
+       "1  2000      1              2            7    8006\n",
+       "2  2000      1              3            1   11363\n",
+       "3  2000      1              4            2   13032\n",
+       "4  2000      1              5            3   12558"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": [
    "df.head()"
   ]
@@ -310,7 +404,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.7"
+   "version": "3.10.12"
  },
  "toc": {
   "base_numbering": 1,
--- a/data/US_births_2000-2014_SSA.csv
+++ b/data/US_births_2000-2014_SSA.csv
--- a/proposal.md
+++ b/proposal.md
@@ -10,30 +10,55 @@ argument.
 *This should be the big picture question that you ask; use at least 5
 sentences to describe why you are interested in it.*

+I originally planned to look at weather data, but after reviewing the available datasets I'm switching to
+birth data instead. I'm interested in trends in birth dates and whether there is a seasonality
+to when children are born. I'm also interested in whether this seasonality (if present) shifts over the course
+of several years (I have data from 2000-2014). Part of why I am interested in this trend is a statement I've heard
+about the birthday paradox: with 23 people in a room, there is roughly a 50/50 chance that two of them share a birthday. With
+75 people in a room, the chance is over 99.9%. I'm curious whether this is valid and, if it is, whether it is linked to
+a seasonality in which some range of dates (maybe a large one?) is substantially more common, which would lead to this statistical
+occurrence.
+
 ### What specific research questions will you investigate?

 *List 2-4 specific research questions. Each should be answerable 
 using your data set.*

+1. Is there a pattern of seasonality in data showing the number of births each day over a given year?
+2. Does the pattern of seasonality shift over the course of a number of years?
+3. How well does this data support the birthday paradox, i.e. the stated probability that two people in a group share a birthday?
+
 ## Data source

 ### What data set will you use to answer your overarching question? 

 *Give the title of your data set and provide a link to your data.*

+https://github.com/fivethirtyeight/data/blob/master/births/US_births_2000-2014_SSA.csv
+"US births 2000-2014"
+
 ### Where is this data from?

 *Describe the source of the data set--not just where you downloaded it, but
 the person or organization who gathered the data. Explain why you trust them.*

+The data is available through FiveThirtyEight, but the original data comes from the United States' CDC, NCHS, and SSA. FiveThirtyEight generally has a reputation for being politically centrist/neutral, and in this case there is little room for bias in the specific data presented. The data comes from numbers reported by government agencies that track births in the US, and while it may not account for some off-the-grid births, any baby with a birth certificate should be accounted for in this dataset.
+
 ### What is this data about?

 *Describe the nature of the data in the dataset, including the number of rows 
 and some of the columns which will be important to you.*

+The dataset file has 5480 lines: a header plus 5479 rows, one for each date from 1/1/2000 through 12/31/2014. The columns give the year, month, and day of the month, as well as the day of the week (presented as a number 1-7) and the number of births occurring on that date. The day of the week is included because FiveThirtyEight used this column in an article about fewer babies being born on Friday the 13th than would be expected statistically, but it will not be of particular use to me here.
+
 ## Methods 

 ### How will you use your data set to answer your quantitative questions?

 *For each research question, explain what you will do with the data set 
 to answer the question, and how you will present your answer (e.g. a chart or a table).*
+
+Questions 1 and 2 will both involve creating a plot of birth frequency vs. date for each year available. Question 1
+will look at each year's plot as an individual dataset; question 2 will focus on relating those datasets to each other.
+
+Question 3 will involve calculating the probability of a shared birthday using a birthday paradox formula. The results will be presented in a few tables showing the probability for different numbers of people.
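
One possible form of the Question 3 calculation is sketched below. It assumes every day of a 365-day year is equally likely, which is the textbook simplification and not something the proposal specifies; the function name `shared_birthday_probability` and the group sizes in the loop are illustrative choices, not part of the project.

```python
def shared_birthday_probability(group_size: int, days_in_year: int = 365) -> float:
    """Probability that at least two of `group_size` people share a birthday.

    Assumes all days are equally likely (an assumption; real births are not uniform).
    """
    if group_size > days_in_year:
        return 1.0  # pigeonhole: more people than days guarantees a match
    p_all_distinct = 1.0
    for i in range(group_size):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days_in_year - i) / days_in_year
    return 1.0 - p_all_distinct


# Small table of results for a few group sizes.
for n in (10, 23, 50, 75):
    print(f"{n:>3} people: {shared_birthday_probability(n):.4f}")
```

Even under the uniform assumption, 23 people already give a probability just above one half, so seasonality is not needed to explain the paradox. Uneven birth rates can only push the probability higher, and the dataset could be used to estimate by how much.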
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -7,7 +7,7 @@ readme = "README.md"
 packages = [{include = "project_argument"}]

 [tool.poetry.dependencies]
-python = "^3.11"
+python = "^3.10"
 jupyter = "^1.0.0"
 seaborn = "^0.12.2"
 pandas = "^2.0.3"