Project proposal

This planning document will also form the introduction of your argument.

Overarching Question

What central question are you interested in exploring? Why are you interested in exploring this question?

This should be the big picture question that you ask; use at least 5 sentences to describe why you are interested in it.

I was originally going to look at weather data, but after looking at the datasets available, I'm switching to birth data instead. I'm interested in trends in birth dates and whether there is a specific seasonality to when children are born. I'm also interested in whether this seasonality (if present) shifts over a set of years (I have data from 2000-2014). Part of why I am interested in this trend is a statement I've heard about the birthday paradox: with 23 people in a room, there is roughly a 50/50 chance that two of them share a birthday, and with 75 people the chance is over 99.9%. I'm curious whether this is valid, and if it is, whether it is linked to seasonality, where some range of dates (maybe a large one?) is substantially more common and contributes to this statistical occurrence.

What specific research questions will you investigate?

List 2-4 specific research questions. Each should be answerable using your data set.

  1. Is there a pattern of seasonality in data showing the number of births each day over a given year?
  2. Does the pattern of seasonality shift over the course of a number of years?
  3. How well does this data agree with the birthday paradox, i.e. the probability that two people in a group share a birthday?

Data source

What data set will you use to answer your overarching question?

Give the title of your data set and provide a link to your data.

https://github.com/fivethirtyeight/data/blob/master/births/US_births_2000-2014_SSA.csv "US births 2000-2014"

Where is this data from?

Describe the source of the data set--not just where you downloaded it, but the person or organization who gathered the data. Explain why you trust them.

The data is available through FiveThirtyEight, but the original data comes from the United States' CDC, NCHS, and SSA (the file name indicates that this particular file comes from the SSA). Generally, FiveThirtyEight has a reputation for being politically centrist/neutral, and in this case there is little room for bias in the specific data presented. The data itself comes from numbers reported by government agencies that track births in the US. While it may not account for some off-the-grid births, any baby with a birth certificate should be accounted for in this dataset.

What is this data about?

Describe the nature of the data in the dataset, including the number of rows and some of the columns which will be important to you.

The dataset has 5480 rows: one header row plus 5479 data rows, one for each date from 1/1/2000 through 12/31/2014. The columns give the year, month, and day of the month, the day of the week (as a number 1-7), and the number of births on that date. The day of the week is included because FiveThirtyEight used this column in an article about fewer babies being born on Friday the 13th than would be expected statistically, but it will not be of particular use to me here.
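As a quick sanity check on the row count, the number of calendar days in the range can be compared against the number of parsed rows. The sketch below is only an illustration: the column names (`year`, `month`, `date_of_month`, `births`) are assumptions to verify against the real header, the helper names are hypothetical, and the two sample rows are made-up placeholders rather than real values.

```python
import csv
import io
from datetime import date

def load_births(lines):
    """Parse CSV lines into a list of (date, births) tuples.

    Assumes columns named year, month, date_of_month and births;
    check these against the header of the real file.
    """
    records = []
    for row in csv.DictReader(lines):
        day = date(int(row["year"]), int(row["month"]), int(row["date_of_month"]))
        records.append((day, int(row["births"])))
    return records

def expected_day_count(first, last):
    """Number of calendar days from first to last, inclusive."""
    return (last - first).days + 1

# Placeholder rows with made-up birth counts, only to show the file's shape.
sample = io.StringIO(
    "year,month,date_of_month,day_of_week,births\n"
    "2000,1,1,6,100\n"
    "2000,1,2,7,90\n"
)
records = load_births(sample)
n_days = expected_day_count(date(2000, 1, 1), date(2014, 12, 31))
print(len(records), n_days)
```

With the real file, `load_births(open(path, newline=""))` should return as many records as `n_days` if no dates are missing or duplicated.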

Methods

How will you use your data set to answer your quantitative questions?

For each research question, explain what you will do with the data set to answer the question, and how you will present your answer (e.g. a chart or a table).

Questions 1 and 2 will both involve creating a plot of birth frequency vs. date for each year available. Question 1 will look at each plot as an individual dataset; question 2 will focus on relating those datasets to each other.
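One possible way to prepare the data for these plots is to reduce each year to twelve monthly values, scaled by that year's own average so that years with different overall birth counts can be compared directly. This is a minimal sketch, assuming the data has already been parsed into `(date, births)` pairs; the function name `monthly_index` and the demo values are hypothetical, not part of the dataset.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_index(records):
    """Return {year: [12 values]}.

    Each value is the month's mean daily births divided by that year's
    mean daily births, so 1.0 means an average month. Months with no
    data are reported as NaN.
    """
    totals = defaultdict(lambda: [0] * 12)  # births per (year, month)
    days = defaultdict(lambda: [0] * 12)    # days seen per (year, month)
    for day, births in records:
        totals[day.year][day.month - 1] += births
        days[day.year][day.month - 1] += 1
    index = {}
    for year in sorted(totals):
        year_mean = sum(totals[year]) / sum(days[year])
        index[year] = [
            (t / d) / year_mean if d else float("nan")
            for t, d in zip(totals[year], days[year])
        ]
    return index

# Made-up demo: 100 births on every January day, 120 on every July day.
demo = []
for offset in range(31):
    demo.append((date(2001, 1, 1) + timedelta(days=offset), 100))
    demo.append((date(2001, 7, 1) + timedelta(days=offset), 120))
demo_index = monthly_index(demo)
print(demo_index[2001][0], demo_index[2001][6])
```

Each year's list could then be drawn as one line on a shared chart: the shape of a single line speaks to question 1, and how the lines differ from year to year speaks to question 2.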

Question 3 will involve calculating the probability of a shared birthday using the birthday paradox formula. This will be presented in a few tables showing the results for different numbers of people.
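The calculation for question 3 could be sketched as follows. The first function uses the standard formula, which assumes all 365 birthdays are equally likely and ignores February 29: the chance that n people all have different birthdays is (365/365) × (364/365) × ... × ((365 − n + 1)/365), and the chance of at least one shared birthday is one minus that. The second function is a simple simulation that accepts unequal weights per day, which is one way the observed birth counts might be plugged in. Both function names and the default trial count are illustrative choices.

```python
import random

def shared_birthday_probability(n, days=365):
    """Exact chance that at least two of n people share a birthday,
    assuming all `days` birthdays are equally likely."""
    if n > days:
        return 1.0
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

def simulated_probability(n, weights, trials=20000, seed=0):
    """Monte Carlo estimate when day i is drawn with relative weight weights[i]."""
    rng = random.Random(seed)
    day_ids = range(len(weights))
    hits = 0
    for _ in range(trials):
        group = rng.choices(day_ids, weights=weights, k=n)
        if len(set(group)) < n:  # fewer distinct days than people: a match
            hits += 1
    return hits / trials

# Table of exact probabilities for a few group sizes.
for n in (10, 23, 50, 75):
    print(f"{n:>3} people: {shared_birthday_probability(n):.4f}")
```

Passing the average number of births on each calendar day as `weights` would show how much seasonality changes the result. Unequal weights can only push the probability above the equal-likelihood value, so the interesting question is by how much.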