{
"cells": [
{
"cell_type": "markdown",
"id": "worldwide-blood",
"metadata": {},
"source": [
"# A Data Science Investigation About Fatal Car Crashes in America "
]
},
{
"cell_type": "markdown",
"id": "understanding-numbers",
"metadata": {},
"source": [
"*✏️ Write 2-3 sentences describing your research.*\n",
"\n",
"This project uses state-level data on the reasons fatal car crashes occur in every state of America, such as speeding, alcohol, and distraction. I will use it to determine which region of America is the deadliest to drive in."
]
},
{
"cell_type": "markdown",
"id": "greater-circular",
"metadata": {},
"source": [
"## Overarching Question: What is the deadliest region in America to drive in?"
]
},
{
"cell_type": "markdown",
"id": "appreciated-testimony",
"metadata": {},
"source": [
"*✏️ Write 2-3 sentences explaining why this question.*\n",
"\n",
"I am interested in this question because I live on the Northeast Coast, where we have a lot of car \n",
"accidents. People drive very fast here, and the roads are not always properly paved or maintained. I want to know whether people get into accidents because of bad luck or because of their own actions."
]
},
{
"cell_type": "markdown",
"id": "permanent-pollution",
"metadata": {},
"source": [
"# Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "technical-evans",
"metadata": {},
"outputs": [],
"source": [
"#Include any import statements you will need\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "overhead-sigma",
"metadata": {},
"outputs": [],
"source": [
"### 💻 FILL IN YOUR DATASET FILE NAME BELOW 💻 ###\n",
"\n",
"file_name = \"B_D - bad-drivers.csv\"\n",
"dataset_path = \"data/\" + file_name\n",
"\n",
"df = pd.read_csv(dataset_path)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "heated-blade",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>State</th><th>number_drivers_fatal_billion_miles</th><th>percentage_drivers_fatal_speeding</th><th>percentage_drivers_fatal_alcohol_impaired</th><th>percentage_drivers_fatal_not_distracted</th><th>percentage_drivers_fatal_no_previous_accidents</th><th>car_insurance_premiums</th><th>region</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Alabama</td><td>18.8</td><td>39</td><td>30</td><td>96</td><td>80</td><td>784.55</td><td>Southeast</td></tr>\n",
"    <tr><th>1</th><td>Alaska</td><td>18.1</td><td>41</td><td>25</td><td>90</td><td>94</td><td>1053.48</td><td>West</td></tr>\n",
"    <tr><th>2</th><td>Arizona</td><td>18.6</td><td>35</td><td>28</td><td>84</td><td>96</td><td>899.47</td><td>Southeast</td></tr>\n",
"    <tr><th>3</th><td>Arkansas</td><td>22.4</td><td>18</td><td>26</td><td>94</td><td>95</td><td>827.34</td><td>Southeast</td></tr>\n",
"    <tr><th>4</th><td>California</td><td>12.0</td><td>35</td><td>28</td><td>91</td><td>89</td><td>878.41</td><td>West</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" State number_drivers_fatal_billion_miles \\\n",
"0 Alabama 18.8 \n",
"1 Alaska 18.1 \n",
"2 Arizona 18.6 \n",
"3 Arkansas 22.4 \n",
"4 California 12.0 \n",
"\n",
" percentage_drivers_fatal_speeding \\\n",
"0 39 \n",
"1 41 \n",
"2 35 \n",
"3 18 \n",
"4 35 \n",
"\n",
" percentage_drivers_fatal_alcohol_impaired \\\n",
"0 30 \n",
"1 25 \n",
"2 28 \n",
"3 26 \n",
"4 28 \n",
"\n",
" percentage_drivers_fatal_not_distracted \\\n",
"0 96 \n",
"1 90 \n",
"2 84 \n",
"3 94 \n",
"4 91 \n",
"\n",
" percentage_drivers_fatal_no_previous_accidents car_insurance_premiums \\\n",
"0 80 784.55 \n",
"1 94 1053.48 \n",
"2 96 899.47 \n",
"3 95 827.34 \n",
"4 89 878.41 \n",
"\n",
" region \n",
"0 Southeast \n",
"1 West \n",
"2 Southeast \n",
"3 Southeast \n",
"4 West "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "continental-franklin",
"metadata": {},
"source": [
"**Data Overview**\n",
"\n",
"*✏️ Write 2-3 sentences describing this dataset. Be sure to include where the data comes from and what it contains.*\n",
"\n",
"### Where is this data set from?\n",
"\n",
"I got the data set from FiveThirtyEight. It was used for an article called\n",
"\"Dear Mona, Which state has the worst drivers?\" in October 2014. The article was written by Mona Chalabi, a data editor at the Guardian US, \n",
"a columnist at New York Magazine, and a lead news writer for FiveThirtyEight.\n",
"\n",
"The data is about fatal collisions in each state. There are 8 columns:\n",
"\n",
"1. State\n",
"2. Number of drivers involved in fatal collisions per billion miles\n",
"3. Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding\n",
"4. Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired\n",
"5. Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted\n",
"6. Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents\n",
"7. Car Insurance Premiums ($)\n",
"8. Region\n",
"\n",
"### How was this data set cleaned?\n",
"\n",
"I did not need to do much cleaning of the data myself, but I did add a column called \"Region\" to group the states into 5 regions: Northwest, Midwest, Southeast, West, and Northeast. I also excluded the data on losses incurred by insurance companies for collisions per insured driver. Insurance companies are well known for finding ways to get out of paying customers for collisions, so I did not consider it an accurate representation of fatal car crashes. \n",
"\n",
"## What specific research questions will you investigate?\n",
"\n",
"1. What region has the highest percentage of drivers in fatal collisions who were alcohol-impaired?\n",
"\n",
"2. What region has the highest car insurance premiums?\n",
"\n",
"3. What region is the unluckiest for fatal collisions?\n",
"\n",
"4. Is there a connection between the percentage of drivers in fatal collisions who were speeding and the cost of car insurance premiums?\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f7bba5f3-5911-4a76-ad43-f6ce78cd4fb3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['State', 'number_drivers_fatal_billion_miles',\n",
" 'percentage_drivers_fatal_speeding',\n",
" 'percentage_drivers_fatal_alcohol_impaired',\n",
" 'percentage_drivers_fatal_not_distracted',\n",
" 'percentage_drivers_fatal_no_previous_accidents',\n",
" 'car_insurance_premiums', 'region'],\n",
" dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"id": "infinite-instrument",
"metadata": {},
"source": [
"# Methods and Results"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "basic-canadian",
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"# Call set_theme() so seaborn's default styling is actually applied\n",
"sns.set_theme()"
]
},
{
"cell_type": "markdown",
"id": "recognized-positive",
"metadata": {},
"source": [
"## First Research Question: What region has the highest percentage of drivers in fatal collisions who were alcohol-impaired?"
]
},
{
"cell_type": "markdown",
"id": "graduate-palmer",
"metadata": {},
"source": [
"### Methods"
]
},
{
"cell_type": "markdown",
"id": "endless-variation",
"metadata": {},
"source": [
"*Explain how you will approach this research question below. Consider the following:* \n",
" - *Which aspects of the dataset will you use?* \n",
" - *How will you reorganize/store the data?* \n",
" - *What data science tools/functions will you use and why?* \n",
" \n",
"✏️ *Write your answer below:*\n",
"\n",
"To answer this question, I will group the states by region and calculate each region's average percentage of drivers involved in fatal collisions who were alcohol-impaired. Finally, I will make a bar plot to compare these averages across the regions.\n",
"\n"
]
},
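{
"cell_type": "markdown",
"id": "sketch-alcohol-by-region",
"metadata": {},
"source": [
"The group-then-average approach described above can be sketched on a small sample. This is only an illustration, not the full analysis: `sample` is a hypothetical DataFrame that holds just the five rows shown by `df.head()`, so its averages will differ from the results for all states.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: only the five rows displayed by df.head()\n",
"sample = pd.DataFrame({\n",
"    \"State\": [\"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\"],\n",
"    \"percentage_drivers_fatal_alcohol_impaired\": [30, 25, 28, 26, 28],\n",
"    \"region\": [\"Southeast\", \"West\", \"Southeast\", \"Southeast\", \"West\"],\n",
"})\n",
"\n",
"# Average the alcohol-impaired percentage within each region\n",
"alcohol_by_region = (\n",
"    sample.groupby(\"region\")[\"percentage_drivers_fatal_alcohol_impaired\"]\n",
"    .mean()\n",
"    .sort_values()\n",
")\n",
"\n",
"# idxmax returns the region label with the highest average\n",
"top_region = alcohol_by_region.idxmax()\n",
"print(alcohol_by_region)\n",
"print(\"Highest average:\", top_region)\n",
"```\n",
"\n",
"Sorting the averages keeps the printed table in low-to-high order, which makes the highest region easy to spot at the bottom."
]
},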
{
"cell_type": "markdown",
"id": "portuguese-japan",
"metadata": {},
"source": [
"### Results "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "negative-highlight",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"region\n",
"Southeast 29.687500\n",
"West 30.363636\n",
"Northwest 31.000000\n",
"Northeast 31.444444\n",
"Midwest 31.666667\n",
"Name: percentage_drivers_fatal_alcohol_impaired, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#######################################################################\n",
"### 💻 YOUR WORK GOES HERE TO ANSWER THE FIRST RESEARCH QUESTION 💻 \n",
"### \n",
"### Your data analysis may include a statistic and/or a data visualization\n",
"#######################################################################\n",
"\n",
"region = df.groupby(\"region\").percentage_drivers_fatal_alcohol_impaired.mean().sort_values()\n",
"region\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "victorian-burning",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
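{
"data": {
"image/png": "",
"text/plain": [
"[Figure: bar plot of mean percentage_drivers_fatal_alcohol_impaired by region, with standard-deviation error bars]"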
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(data=df, x=\"region\", y=\"percentage_drivers_fatal_alcohol_impaired\", errorbar=\"sd\")"
]
},
{
"cell_type": "markdown",
"id": "collectible-puppy",
"metadata": {},
"source": [
"## Second Research Question: What region has the highest car insurance premiums?\n"
]
},
{
"cell_type": "markdown",
"id": "demographic-future",
"metadata": {},
"source": [
"### Methods"
]
},
{
"cell_type": "markdown",
"id": "incorporate-roller",
"metadata": {},
"source": [
"*Explain how you will approach this research question below. Consider the following:* \n",
" - *Which aspects of the dataset will you use?* \n",
" - *How will you reorganize/store the data?* \n",
" - *What data science tools/functions will you use and why?* \n",
"\n",
"✏️ *Write your answer below:*\n",
"\n",
"To answer this question, I will group the states by region. Then I will compare the average cost of car insurance premiums in each region to see which is the highest.\n"
]
},
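{
"cell_type": "markdown",
"id": "sketch-premiums-by-region",
"metadata": {},
"source": [
"As a rough sketch of this comparison, the premiums can be averaged per region and sorted from most to least expensive. The `sample` DataFrame below is hypothetical and contains only the five rows shown by `df.head()`, so the ranking it gives is not the ranking for the whole country.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: only the five rows displayed by df.head()\n",
"sample = pd.DataFrame({\n",
"    \"State\": [\"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\"],\n",
"    \"car_insurance_premiums\": [784.55, 1053.48, 899.47, 827.34, 878.41],\n",
"    \"region\": [\"Southeast\", \"West\", \"Southeast\", \"Southeast\", \"West\"],\n",
"})\n",
"\n",
"# Average premium per region, most expensive first\n",
"premiums_by_region = (\n",
"    sample.groupby(\"region\")[\"car_insurance_premiums\"]\n",
"    .mean()\n",
"    .sort_values(ascending=False)\n",
")\n",
"\n",
"# After a descending sort, the first index label is the most expensive region\n",
"most_expensive = premiums_by_region.index[0]\n",
"print(premiums_by_region.round(2))\n",
"print(\"Most expensive region in the sample:\", most_expensive)\n",
"```"
]
},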
{
"cell_type": "markdown",
"id": "juvenile-creation",
"metadata": {},
"source": [
"### Results "
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "pursuant-surrey",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"region\n",
"Midwest 756.630833\n",
"West 855.624545\n",
"Southeast 905.472500\n",
"Northeast 975.038889\n",
"Northwest 1160.163333\n",
"Name: car_insurance_premiums, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#######################################################################\n",
"### 💻 YOUR WORK GOES HERE TO ANSWER THE SECOND RESEARCH QUESTION 💻 \n",
"###\n",
"### Your data analysis may include a statistic and/or a data visualization\n",
"#######################################################################\n",
"\n",
"df.groupby(\"region\").car_insurance_premiums.mean().sort_values()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "located-night",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"[Figure: bar plot of mean car_insurance_premiums by region, with standard-deviation error bars]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(data=df, x=\"region\", y=\"car_insurance_premiums\", errorbar=\"sd\")"
]
},
{
"cell_type": "markdown",
"id": "8ab785a8-ac72-4fec-8d4b-7ad7f93de32d",
"metadata": {},
"source": [
"## Third Research Question: What region is the unluckiest for fatal collisions?"
]
},
{
"cell_type": "markdown",
"id": "810dc600-da04-437d-a546-4d6c5bec01c6",
"metadata": {},
"source": [
"### Methods"
]
},
{
"cell_type": "markdown",
"id": "be64d030-0f40-4c32-ac3e-be494e64b3a7",
"metadata": {},
"source": [
"*Explain how you will approach this research question below. Consider the following:* \n",
" - *Which aspects of the dataset will you use?* \n",
" - *How will you reorganize/store the data?* \n",
" - *What data science tools/functions will you use and why?* \n",
"\n",
"✏️ *Write your answer below:*\n",
"\n",
"To answer this question, I will group the states by region. Then I will compare each region's average percentage of drivers involved in fatal collisions who were not distracted with its average percentage of drivers involved in fatal collisions who had not been involved in any previous accidents. I am treating drivers who were not distracted and had no previous accidents as more likely to have crashed through bad luck than through their own fault."
]
},
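{
"cell_type": "markdown",
"id": "sketch-unlucky-by-region",
"metadata": {},
"source": [
"One possible way to carry out this comparison is sketched below. The `sample` DataFrame is hypothetical and holds only the five rows shown by `df.head()`. The `combined` column, a plain average of the two percentages, is an assumption made for this sketch and is not a measure defined by the dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: only the five rows displayed by df.head()\n",
"sample = pd.DataFrame({\n",
"    \"State\": [\"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\"],\n",
"    \"percentage_drivers_fatal_not_distracted\": [96, 90, 84, 94, 91],\n",
"    \"percentage_drivers_fatal_no_previous_accidents\": [80, 94, 96, 95, 89],\n",
"    \"region\": [\"Southeast\", \"West\", \"Southeast\", \"Southeast\", \"West\"],\n",
"})\n",
"\n",
"cols = [\n",
"    \"percentage_drivers_fatal_not_distracted\",\n",
"    \"percentage_drivers_fatal_no_previous_accidents\",\n",
"]\n",
"\n",
"# Average both percentages within each region\n",
"luck_by_region = sample.groupby(\"region\")[cols].mean()\n",
"\n",
"# Assumed summary score: the plain average of the two regional means\n",
"luck_by_region[\"combined\"] = luck_by_region[cols].mean(axis=1)\n",
"\n",
"unluckiest = luck_by_region[\"combined\"].idxmax()\n",
"print(luck_by_region.round(2))\n",
"print(\"Highest combined score in the sample:\", unluckiest)\n",
"```\n",
"\n",
"Keeping the two columns side by side also shows when a region is high on one measure but not the other, which a single score would hide."
]
},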
{
"cell_type": "code",
"execution_count": 24,
"id": "096fe314-2953-4644-86e0-cd717f77eb8f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"[Figure: bar plot of percentage_drivers_fatal_no_previous_accidents by region]"
]
},
"metadata": {},
"output_type": "display_data"
}
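],
"source": [
"region_mean = df.groupby(\"region\", as_index=False).mean(numeric_only=True)  # assumed definition; region_mean is not created earlier in this notebook\n",
"sns.barplot(data=region_mean, x=\"region\", y=\"percentage_drivers_fatal_no_previous_accidents\")"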
]
},
{
"cell_type": "markdown",
"id": "d66967db-fe78-4889-824e-f7bce4e02cc8",
"metadata": {},
"source": [
"## Fourth Research Question: Is there a connection between the average Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding and the region with the most expensive car insurance premiums?"
]
},
{
"cell_type": "markdown",
"id": "9661b6d4-3c4f-42a2-8916-b9df38375760",
"metadata": {},
"source": [
"### Methods"
]
},
{
"cell_type": "markdown",
"id": "cc44ade7-3ae9-44a0-b821-29ecb1b66385",
"metadata": {},
"source": [
"*Explain how you will approach this research question below. Consider the following:* \n",
" - *Which aspects of the dataset will you use?* \n",
" - *How will you reorganize/store the data?* \n",
" - *What data science tools/functions will you use and why?* \n",
"\n",
"✏️ *Write your answer below:*\n",
"\n",
"To answer this question, I will group the states by region and calculate each region's average percentage of drivers involved in fatal collisions who were speeding. Then I will compare those averages with each region's average car insurance premium to see whether the regions with more speeding also have more expensive insurance."
]
},
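{
"cell_type": "markdown",
"id": "sketch-speeding-vs-premiums",
"metadata": {},
"source": [
"One way to put a number on a \"connection\" is a Pearson correlation between the two columns. This is a swapped-in technique for illustration, not something the analysis above already does. The `sample` DataFrame is hypothetical and holds only the five rows shown by `df.head()`, which is far too few to draw a conclusion from.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample: only the five rows displayed by df.head()\n",
"sample = pd.DataFrame({\n",
"    \"State\": [\"Alabama\", \"Alaska\", \"Arizona\", \"Arkansas\", \"California\"],\n",
"    \"percentage_drivers_fatal_speeding\": [39, 41, 35, 18, 35],\n",
"    \"car_insurance_premiums\": [784.55, 1053.48, 899.47, 827.34, 878.41],\n",
"    \"region\": [\"Southeast\", \"West\", \"Southeast\", \"Southeast\", \"West\"],\n",
"})\n",
"\n",
"# Regional averages of both measures, side by side\n",
"by_region = sample.groupby(\"region\")[\n",
"    [\"percentage_drivers_fatal_speeding\", \"car_insurance_premiums\"]\n",
"].mean()\n",
"print(by_region.round(2))\n",
"\n",
"# State-level Pearson correlation: near +1 or -1 suggests a strong linear link, near 0 a weak one\n",
"r = sample[\"percentage_drivers_fatal_speeding\"].corr(sample[\"car_insurance_premiums\"])\n",
"print(\"Correlation between speeding and premiums:\", round(r, 2))\n",
"```\n",
"\n",
"A correlation only describes how the two columns move together. It would not show that speeding causes higher premiums."
]
},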
{
"cell_type": "code",
"execution_count": 27,
"id": "b921b74c-951a-4f30-a42e-292f011fd61a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"