{ "cells": [ { "cell_type": "markdown", "id": "4647d855", "metadata": {}, "source": [ "# Lab 04: Data Science Tools\n", "\n", "## 0. Intro to Jupyter Notebooks\n", "\n", "Welcome to your first Jupyter notebook! Notebooks are made up of cells. Some cells contain text (like this one) and others contain Python code.\n", "\n", "Each cell can be in two different modes: editing or running. To edit a cell, double-click on it. When you're done editing, press **Shift+Enter** to run it. You can use [Markdown](https://www.markdownguide.org/cheat-sheet/) to add basic formatting to the text. Before you go on, try editing the text in this cell." ] }, { "cell_type": "code", "execution_count": 1, "id": "355492e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Other cells are code cells, containing Python code. (This is a comment, of course!)\n", "# Try running this cell (again, shift+Enter). You'll see the result of the final statement \n", "# printed below the cell. \n", "# Then try changing the Python code and re-run it.\n", "\n", "30+20" ] }, { "cell_type": "markdown", "id": "03fc2d21", "metadata": {}, "source": [ "## 0.1 Cells share state\n", "\n", "Even though code cells run one at a time, anything that happens in a cell (like declaring a variable or running a function) affects the whole notebook. Try running these two cells a few times, in different orders. What happens when you run **Cell B** over and over?" 
] }, { "cell_type": "code", "execution_count": null, "id": "76cfb7a4", "metadata": {}, "outputs": [], "source": [ "# Cell A\n", "x = 10\n", "x" ] }, { "cell_type": "code", "execution_count": null, "id": "6de0894c", "metadata": {}, "outputs": [], "source": [ "# Cell B\n", "x = x * 2\n", "x" ] }, { "cell_type": "markdown", "id": "36e850fd", "metadata": {}, "source": [ "## 0.2 Saving your work\n", "\n", "When you finish working on a notebook, save your work using the save icon at the top left of the menu bar above. Your notebook is stored in the file `lab_04.ipynb` in the lab directory. You can commit your changes to `ipynb` files just like any other file. Once you finish with Jupyter, you can stop the server by pressing **Ctrl + C** in the terminal. \n", "\n", "*If you're doing this lab on a cloud-based platform like Binder, then you can't save your work. So don't close the tab!*" ] }, { "cell_type": "markdown", "id": "88057cce", "metadata": {}, "source": [ "---\n", "\n", "## 1. Pandas\n", "\n", "Pandas is probably the most important Python library for data science. Pandas provides an object called a **DataFrame**, which is basically a table with rows and columns. Most of the time, you will load data into Pandas from a `.csv` file. CSV files can be exported from Excel or Google Sheets, and are a common format for public data sets. \n", "\n", "In this lab, we'll be working with two data sets: the first contains Pokémon characteristics and the second comes from a wide-scale survey conducted by the US Centers for Disease Control and Prevention (CDC) ([details](https://www.cdc.gov/brfss/annual_data/annual_2020.html)). We will demonstrate techniques with Pokémon; your job is to replicate these tasks with the CDC dataset. \n", "\n", "**Note:** Pandas has *extensive* capabilities, and there's no way we could possibly present them all here. If you have a clearly formed idea of what you want to do with tabular data, there's almost certainly a way to do it. 
This lab introduces *some* of what Pandas can do, but expect to spend time reading the documentation and Stack Overflow when you start working on new tasks. \n", "\n", "## 1.0 Getting started\n", "\n", "First, we'll import pandas (using the conventional variable name `pd`) and load the two datasets. *Run these cells and every code cell you encounter in this notebook.*" ] }, { "cell_type": "code", "execution_count": 4, "id": "43f4949a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd \n", "pokemon = pd.read_csv(\"pokemon.csv\")\n", "people = pd.read_csv(\"brfss_2020.csv\")\n", "#find another csv file!" ] }, { "cell_type": "markdown", "id": "5179cb62", "metadata": {}, "source": [ "## 1.1 A first look\n", "\n", "#### Demo\n", "\n", "Let's start by learning the *shape* of the data. How many columns are there? How many rows? What kinds of data are included?" ] }, { "cell_type": "code", "execution_count": 5, "id": "420d195a", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name type subtype total hp attack defense \\\n", "0 Bulbasaur Grass Poison 318 45 49 49 \n", "1 Ivysaur Grass Poison 405 60 62 63 \n", "2 Venusaur Grass Poison 525 80 82 83 \n", "3 VenusaurMega Venusaur Grass Poison 625 80 100 123 \n", "4 Charmander Fire NaN 309 39 52 43 \n", ".. ... ... ... ... .. ... ... \n", "795 Diancie Rock Fairy 600 50 100 150 \n", "796 DiancieMega Diancie Rock Fairy 700 50 160 110 \n", "797 HoopaHoopa Confined Psychic Ghost 600 80 110 60 \n", "798 HoopaHoopa Unbound Psychic Dark 680 80 160 60 \n", "799 Volcanion Fire Water 600 80 110 120 \n", "\n", " special_attack special_defense speed generation legendary \n", "0 65 65 45 1 False \n", "1 80 80 60 1 False \n", "2 100 100 80 1 False \n", "3 122 120 80 1 False \n", "4 60 50 65 1 False \n", ".. ... ... ... ... ... \n", "795 100 150 50 6 True \n", "796 160 110 110 6 True \n", "797 150 130 70 6 True \n", "798 170 130 80 6 True \n", "799 130 90 70 6 True \n", "\n", "[800 rows x 12 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon" ] }, { "cell_type": "markdown", "id": "2cb67dff", "metadata": {}, "source": [ "OK, 800 Pokémon, with 12 columns for each. And you can see all the columns. Not all the data is shown in this preview, of course. If there were more columns than could be displayed, you could see them all by typing `pokemon.columns`. \n", "\n", "---\n", "\n", "#### Your turn!\n", "\n", "Now do the same for your data set, `people`." ] }, { "cell_type": "code", "execution_count": null, "id": "dab2c1e7", "metadata": {}, "outputs": [], "source": [ "#Your code here" ] }, { "cell_type": "markdown", "id": "9bbf59e9", "metadata": {}, "source": [ "---\n", "## 1.2 Descriptive Statistics\n", "\n", "#### Demo\n", "\n", "Let's get a sense of the data contained in some of the columns. For categorical data like `generation`, it makes sense to look at value counts--showing us how many of each category there are. 
Pass the optional keyword argument `normalize=True` inside the parentheses to see each category's share of the total (a fraction between 0 and 1) instead of raw counts." ] }, { "cell_type": "code", "execution_count": 8, "id": "7707f6f2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 166\n", "5 165\n", "3 160\n", "4 121\n", "2 106\n", "6 82\n", "Name: generation, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.generation.value_counts()" ] }, { "cell_type": "markdown", "id": "421adbdd", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "\n", "**1.2.0.** In this survey, people are grouped into age bands of 18, 25, 35, 45, 55, and 65. Using the people survey, what percentage of people are in each age band? (When we talk about \"people\" in this lab, we're referring to the people who responded to the survey, not the whole US population.)" ] }, { "cell_type": "code", "execution_count": 7, "id": "ccbc1a21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "65 0.336326\n", "55 0.206369\n", "45 0.157669\n", "35 0.135527\n", "25 0.108866\n", "18 0.055244\n", "Name: age, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Hint: pokemon.generation.value_counts()\n", "people.age.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "id": "1c120fcc", "metadata": {}, "source": [ "**1.2.1.** The `exercise` column indicates whether a person has done any physical activity or exercise in the last 30 days, outside of work. What percentage of people have exercised?" 
] }, { "cell_type": "code", "execution_count": null, "id": "b13a99c2", "metadata": {}, "outputs": [], "source": [ "#Hint: pokemon.generation.value_counts()\n", "exercise_share = people.exercise.value_counts(normalize=True)\n", "exercise_share" ] }, { "cell_type": "markdown", "id": "cec02e11", "metadata": {}, "source": [ "--- \n", "#### Demo\n", "For numeric data, we could start by looking at the mean value. We can select multiple columns and get all the column means at once." ] }, { "cell_type": "code", "execution_count": null, "id": "58aa56db", "metadata": {}, "outputs": [], "source": [ "pokemon[[\"hp\", \"attack\", \"defense\", \"speed\"]].mean()" ] }, { "cell_type": "markdown", "id": "a8a08607", "metadata": {}, "source": [ "We can also compute the mean of boolean data. In this case, True maps to 1 and False maps to 0, so the mean equals the fraction of values that are True (multiply by 100 for a percentage). " ] }, { "cell_type": "code", "execution_count": null, "id": "1d165f28", "metadata": {}, "outputs": [], "source": [ "pokemon.legendary.mean()" ] }, { "cell_type": "markdown", "id": "2e64d3a1", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "**1.2.2.** What are the mean height and weight of people in this survey?" ] }, { "cell_type": "code", "execution_count": null, "id": "c9338ff0", "metadata": {}, "outputs": [], "source": [ "#Hint: pokemon[[\"hp\", \"attack\", \"defense\", \"speed\"]].mean()\n", "people[['height', 'weight']].mean()" ] }, { "cell_type": "markdown", "id": "8b7738ef", "metadata": {}, "source": [ "---\n", "## 1.3 Filtering\n", "\n", "Sometimes we're only interested in part of the data set. To select it, create a boolean series (one True/False value per row) and use it to choose which rows to include. Vocabulary note: A dataframe is two-dimensional, with rows and columns. A series (a single row or a single column) is one-dimensional. 
\n", "\n", "#### Demo\n", "`pokemon.legendary` is already boolean, so we can use this to select just the legendary pokémon. " ] }, { "cell_type": "code", "execution_count": 9, "id": "2b1b1c85", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name type subtype total hp attack defense \\\n", "156 Articuno Ice Flying 580 90 85 100 \n", "157 Zapdos Electric Flying 580 90 90 85 \n", "158 Moltres Fire Flying 580 90 100 90 \n", "162 Mewtwo Psychic NaN 680 106 110 90 \n", "163 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 100 \n", ".. ... ... ... ... ... ... ... \n", "795 Diancie Rock Fairy 600 50 100 150 \n", "796 DiancieMega Diancie Rock Fairy 700 50 160 110 \n", "797 HoopaHoopa Confined Psychic Ghost 600 80 110 60 \n", "798 HoopaHoopa Unbound Psychic Dark 680 80 160 60 \n", "799 Volcanion Fire Water 600 80 110 120 \n", "\n", " special_attack special_defense speed generation legendary \n", "156 95 125 85 1 True \n", "157 125 90 100 1 True \n", "158 125 85 90 1 True \n", "162 154 90 130 1 True \n", "163 154 100 130 1 True \n", ".. ... ... ... ... ... \n", "795 100 150 50 6 True \n", "796 160 110 110 6 True \n", "797 150 130 70 6 True \n", "798 170 130 80 6 True \n", "799 130 90 70 6 True \n", "\n", "[65 rows x 12 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "legendary = pokemon[pokemon.legendary]\n", "legendary" ] }, { "cell_type": "markdown", "id": "7ece74db", "metadata": {}, "source": [ "Let's get all the ice pokémon. We can create a boolean series from another series..." ] }, { "cell_type": "code", "execution_count": 11, "id": "d3ffce2d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "795 False\n", "796 False\n", "797 False\n", "798 False\n", "799 False\n", "Name: type, Length: 800, dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pokemon.type == \"Ice\"" ] }, { "cell_type": "markdown", "id": "53210011", "metadata": {}, "source": [ "And then use this series to select just the ice pokémon. 
" ] }, { "cell_type": "code", "execution_count": 12, "id": "0572dd7d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name type subtype total hp attack defense \\\n", "133 Jynx Ice Psychic 455 65 50 35 \n", "156 Articuno Ice Flying 580 90 85 100 \n", "238 Swinub Ice Ground 250 50 50 40 \n", "239 Piloswine Ice Ground 450 100 100 80 \n", "243 Delibird Ice Flying 330 45 55 45 \n", "257 Smoochum Ice Psychic 305 45 30 15 \n", "395 Snorunt Ice NaN 300 50 50 50 \n", "396 Glalie Ice NaN 480 80 80 80 \n", "397 GlalieMega Glalie Ice NaN 580 80 120 80 \n", "398 Spheal Ice Water 290 70 40 50 \n", "399 Sealeo Ice Water 410 90 60 70 \n", "400 Walrein Ice Water 530 110 80 90 \n", "415 Regice Ice NaN 580 80 50 100 \n", "522 Glaceon Ice NaN 525 65 60 110 \n", "524 Mamoswine Ice Ground 530 110 130 80 \n", "530 Froslass Ice Ghost 480 70 80 70 \n", "643 Vanillite Ice NaN 305 36 50 50 \n", "644 Vanillish Ice NaN 395 51 65 65 \n", "645 Vanilluxe Ice NaN 535 71 95 85 \n", "674 Cubchoo Ice NaN 305 55 70 40 \n", "675 Beartic Ice NaN 485 95 110 80 \n", "676 Cryogonal Ice NaN 485 70 50 30 \n", "788 Bergmite Ice NaN 304 55 69 85 \n", "789 Avalugg Ice NaN 514 95 117 184 \n", "\n", " special_attack special_defense speed generation legendary \n", "133 115 95 95 1 False \n", "156 95 125 85 1 True \n", "238 30 30 50 2 False \n", "239 60 60 50 2 False \n", "243 65 45 75 2 False \n", "257 85 65 65 2 False \n", "395 50 50 50 3 False \n", "396 80 80 80 3 False \n", "397 120 80 100 3 False \n", "398 55 50 25 3 False \n", "399 75 70 45 3 False \n", "400 95 90 65 3 False \n", "415 100 200 50 3 True \n", "522 130 95 65 4 False \n", "524 70 60 80 4 False \n", "530 80 70 110 4 False \n", "643 65 60 44 5 False \n", "644 80 75 59 5 False \n", "645 110 95 79 5 False \n", "674 60 40 40 5 False \n", "675 70 80 50 5 False \n", "676 95 135 105 5 False \n", "788 32 35 28 6 False \n", "789 44 46 28 6 False " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ice = pokemon[pokemon.type == \"Ice\"]\n", "ice" ] }, { "cell_type": "markdown", "id": "c01f16a5", "metadata": 
{}, "source": [ "---\n", "#### Your turn!\n", "\n", "**1.3.0.** `no_doctor` indicates whether there was a time in the last year when the person needed to see a doctor, but could not afford to do so. Create a dataframe containing only these people. " ] }, { "cell_type": "code", "execution_count": 10, "id": "e6f35a92", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex income education sexual_orientation height weight \\\n", "0 55 female 5 2 other 1.55 83.01 \n", "2 35 female 8 4 heterosexual 1.65 77.11 \n", "24 35 male 8 3 heterosexual 1.73 94.35 \n", "50 35 female 4 2 heterosexual 1.78 81.65 \n", "66 45 female 6 4 heterosexual 1.57 72.57 \n", "... ... ... ... ... ... ... ... \n", "166407 18 male 5 2 heterosexual 1.68 68.04 \n", "166409 25 male 6 2 heterosexual 1.57 58.51 \n", "166414 55 female 8 3 heterosexual 1.63 88.45 \n", "166416 65 female 5 2 heterosexual 1.50 55.34 \n", "166423 35 female 5 4 heterosexual 1.60 68.04 \n", "\n", " health no_doctor exercise sleep \n", "0 2 True True 7 \n", "2 4 True True 7 \n", "24 4 True False 8 \n", "50 4 True False 10 \n", "66 4 True True 7 \n", "... ... ... ... ... \n", "166407 3 True True 8 \n", "166409 4 True False 7 \n", "166414 3 True False 6 \n", "166416 3 True False 6 \n", "166423 4 True True 6 \n", "\n", "[13784 rows x 11 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Hint ice = pokemon[pokemon.type == \"Ice\"]\n", "noDoc = people[people.no_doctor]\n", "noDoc" ] }, { "cell_type": "markdown", "id": "c72e2155", "metadata": {}, "source": [ "Let's get the high-speed ice pokémon. You can join conditions together using the `&` (and) and `|` (or) operators. `~` means \"not\", so `pokemon[~(pokemon.type == \"Ice\")]` would select all the non-ice pokémon. Due to order of operations, each condition needs to be wrapped in parentheses.\n", "\n", "You could get the pokémon who are fire or ice by selecting `pokemon[(pokemon.type == \"Fire\") | (pokemon.type == \"Ice\")]`, but lets get the high-speed ice pokemon now." ] }, { "cell_type": "code", "execution_count": 15, "id": "97f3332e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name type subtype total hp attack defense \\\n", "133 Jynx Ice Psychic 455 65 50 35 \n", "156 Articuno Ice Flying 580 90 85 100 \n", "396 Glalie Ice NaN 480 80 80 80 \n", "397 GlalieMega Glalie Ice NaN 580 80 120 80 \n", "524 Mamoswine Ice Ground 530 110 130 80 \n", "530 Froslass Ice Ghost 480 70 80 70 \n", "676 Cryogonal Ice NaN 485 70 50 30 \n", "\n", " special_attack special_defense speed generation legendary \n", "133 115 95 95 1 False \n", "156 95 125 85 1 True \n", "396 80 80 80 3 False \n", "397 120 80 100 3 False \n", "524 70 60 80 4 False \n", "530 80 70 110 4 False \n", "676 95 135 105 5 False " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "high_speed_ice = pokemon[(pokemon.type == \"Ice\") & (pokemon.speed >= 80)]\n", "high_speed_ice" ] }, { "cell_type": "markdown", "id": "22215027", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "**1.3.1.** `health` asks people for their general health, with the meanings of numbers shown below. Create a dataframe which contains people whose general health is good or better. \n", "\n", "| number | health status | \n", "| ------ | ----------- |\n", "| 1 | Poor |\n", "| 2 | Fair |\n", "| 3 | Good |\n", "| 4 | Very good |\n", "| 5 | Excellent |" ] }, { "cell_type": "code", "execution_count": 13, "id": "a80f0adc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age sex income education sexual_orientation height weight \\\n", "2 35 female 8 4 heterosexual 1.65 77.11 \n", "3 55 male 8 4 heterosexual 1.83 81.65 \n", "4 55 female 8 4 heterosexual 1.80 76.66 \n", "5 55 male 8 4 heterosexual 1.80 74.84 \n", "8 55 female 6 4 heterosexual 1.73 63.50 \n", "... ... ... ... ... ... ... ... \n", "166411 65 male 7 1 heterosexual 1.78 117.93 \n", "166415 35 male 8 4 heterosexual 1.75 99.79 \n", "166417 35 female 8 2 heterosexual 1.73 95.25 \n", "166421 25 male 7 2 heterosexual 1.78 86.18 \n", "166423 35 female 5 4 heterosexual 1.60 68.04 \n", "\n", " health no_doctor exercise sleep \n", "2 4 True True 7 \n", "3 5 False True 8 \n", "4 4 False True 8 \n", "5 5 False True 7 \n", "8 5 False True 7 \n", "... ... ... ... ... \n", "166411 4 False True 7 \n", "166415 4 False True 7 \n", "166417 4 False False 4 \n", "166421 4 False True 6 \n", "166423 4 True True 6 \n", "\n", "[93585 rows x 11 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Hint = high_speed_ice = pokemon[(pokemon.type == \"Ice\") & (pokemon.speed >= 80)]\n", "goodHealth = people[(people.health >3)]\n", "goodHealth" ] }, { "cell_type": "markdown", "id": "0e3ec616", "metadata": {}, "source": [ "**1.3.2.** `education` indicates the highest level of education completed, with codes as follows. Create a dataframe which only contains female college graduates who needed a doctor but couldn't afford one. 
(The survey asked people for their current sex, and only had options for male and female.)\n", "\n", "| number | education level | \n", "| ------ | ----------- |\n", "| 1 | Did not graduate from high school |\n", "| 2 | Graduated from high school |\n", "| 3 | Attended some college |\n", "| 4 | Graduated from college |" ] }, { "cell_type": "code", "execution_count": null, "id": "1ee0b68f", "metadata": {}, "outputs": [], "source": [ "#Hint: high_speed_ice = pokemon[(pokemon.type == \"Ice\") & (pokemon.speed >= 80)]\n", "female_grads_no_doctor = people[(people.no_doctor) & (people.education == 4) & (people.sex == 'female')]\n", "female_grads_no_doctor" ] }, { "cell_type": "markdown", "id": "b41ee22d", "metadata": {}, "source": [ "---\n", "## 1.4. Grouping\n", "\n", "Now things get crazy. You can group a dataframe by one or more columns, and then compare statistics across the groups. \n", "\n", "#### Demo\n", "\n", "Do different types of pokémon move at different speeds? We'll use `sort_values` to put the mean speeds in order from slow to fast." ] }, { "cell_type": "code", "execution_count": null, "id": "d8c9a61c", "metadata": {}, "outputs": [], "source": [ "pokemon.groupby(\"type\").speed.mean().sort_values()" ] }, { "cell_type": "markdown", "id": "bbecbb71", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "\n", "**1.4.0.** `income` records people's annual income, in the following bands. `sleep` records the average hours of sleep someone gets per night. 
Is there a difference in the average hours of sleep by income level?\n", "\n", "| number | annual income (thousands of USD) | \n", "| ------ | ----------- |\n", "| 1 | Less than 10 |\n", "| 2 | 10-15 |\n", "| 3 | 15-20 |\n", "| 4 | 20-25 |\n", "| 5 | 25-35 |\n", "| 6 | 35-50 |\n", "| 7 | 50-75 |\n", "| 8 | More than 75 |" ] }, { "cell_type": "code", "execution_count": null, "id": "9ce2d00f", "metadata": {}, "outputs": [], "source": [ "#Hint: pokemon.groupby(\"type\").speed.mean().sort_values()\n", "sleep_by_income = people.groupby(\"income\").sleep.mean()\n", "sleep_by_income" ] }, { "cell_type": "markdown", "id": "3ff40dde", "metadata": {}, "source": [ "---\n", "#### Demo\n", "Do types differ in other stats? Let's compare mean hit points, attack, and defense, sorted by hit points. " ] }, { "cell_type": "code", "execution_count": 16, "id": "1f1dd086", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
| type | hp | attack | defense |
| ---- | -- | ------ | ------- |
| Bug | 56.884058 | 70.971014 | 70.724638 |
| Electric | 59.795455 | 69.090909 | 66.295455 |
| Ghost | 64.437500 | 73.781250 | 81.187500 |
| Steel | 65.222222 | 92.703704 | 126.370370 |
| Rock | 65.363636 | 92.863636 | 100.795455 |
| Dark | 66.806452 | 88.387097 | 70.225806 |
| Poison | 67.250000 | 74.678571 | 68.821429 |
| Grass | 67.271429 | 73.214286 | 70.800000 |
| Fighting | 69.851852 | 96.777778 | 65.925926 |
| Fire | 69.903846 | 84.769231 | 67.769231 |
| Psychic | 70.631579 | 71.456140 | 67.684211 |
| Flying | 70.750000 | 78.750000 | 66.250000 |
| Ice | 72.000000 | 72.750000 | 71.416667 |
| Water | 72.062500 | 74.151786 | 72.946429 |
| Ground | 73.781250 | 95.750000 | 84.843750 |
| Fairy | 74.117647 | 61.529412 | 65.705882 |
| Normal | 77.275510 | 73.469388 | 59.846939 |
| Dragon | 83.312500 | 112.125000 | 86.375000 |
\n", "
" ], "text/plain": [ " hp attack defense\n", "type \n", "Bug 56.884058 70.971014 70.724638\n", "Electric 59.795455 69.090909 66.295455\n", "Ghost 64.437500 73.781250 81.187500\n", "Steel 65.222222 92.703704 126.370370\n", "Rock 65.363636 92.863636 100.795455\n", "Dark 66.806452 88.387097 70.225806\n", "Poison 67.250000 74.678571 68.821429\n", "Grass 67.271429 73.214286 70.800000\n", "Fighting 69.851852 96.777778 65.925926\n", "Fire 69.903846 84.769231 67.769231\n", "Psychic 70.631579 71.456140 67.684211\n", "Flying 70.750000 78.750000 66.250000\n", "Ice 72.000000 72.750000 71.416667\n", "Water 72.062500 74.151786 72.946429\n", "Ground 73.781250 95.750000 84.843750\n", "Fairy 74.117647 61.529412 65.705882\n", "Normal 77.275510 73.469388 59.846939\n", "Dragon 83.312500 112.125000 86.375000" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ptypes = pokemon.groupby(\"type\")\n", "ptypes[[\"hp\", \"attack\", \"defense\"]].mean().sort_values(\"hp\")" ] }, { "cell_type": "markdown", "id": "d2f5ce5c", "metadata": {}, "source": [ "Which type/subtype combinations are most likely to have legendary pokémon?" ] }, { "cell_type": "code", "execution_count": null, "id": "7e1b2e97", "metadata": {}, "outputs": [], "source": [ "legendary_percentages = pokemon.groupby([\"type\", \"subtype\"]).legendary.mean().sort_values() \n", "legendary_percentages[legendary_percentages > 0.5] " ] }, { "cell_type": "markdown", "id": "4697836f", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "**1.4.1.** Is there a difference in people's general health by sex and education level? 
" ] }, { "cell_type": "code", "execution_count": 22, "id": "b9f40fc7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sex education\n", "female 1 2.848040\n", "male 1 3.031525\n", "female 2 3.315797\n", "male 2 3.440818\n", "female 3 3.483379\n", "male 3 3.549105\n", " 4 3.826512\n", "female 4 3.844340\n", "Name: health, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Hint: legendary_percentages = pokemon.groupby([\"type\", \"subtype\"]).legendary.mean().sort_values() \n", "\n", "health_by_sex_education = people.groupby([\"sex\", \"education\"]).health.mean().sort_values()\n", "health_by_sex_education" ] }, { "cell_type": "markdown", "id": "93ca2aa9", "metadata": {}, "source": [ "---\n", "### 1.5. Plotting \n", "\n", "Pandas has excellent built-in plotting capabilities, but \n", "we are going to use the [seaborn](https://seaborn.pydata.org/) library because it's a bit \n", "more intuitive and produces more beautiful plots. `set_theme`, called here without any arguments, applies seaborn's default theme (plot style, color palette, and font settings). " ] }, { "cell_type": "code", "execution_count": 24, "id": "dadbe683", "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "sns.set_theme()" ] }, { "cell_type": "markdown", "id": "76bab223", "metadata": {}, "source": [ "#### Demo\n", "\n", "**When you want to visualize the distribution of a series**, a [histogram](https://seaborn.pydata.org/generated/seaborn.histplot.html) puts data into bins and plots the number of data points in each bin.\n", "\n", "Let's see the distribution of pokémon attack values. Note how assigning `x=\"attack\"` spreads attack values over the x-axis, while `y=\"attack\"` spreads attack values over the y-axis. The number of bins is selected automatically, but you can set it yourself with the optional `bins` argument. 
" ] }, { "cell_type": "code", "execution_count": 25, "id": "242e70d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(data=pokemon, x=\"attack\")" ] }, { "cell_type": "code", "execution_count": 27, "id": "6ae16707", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(data=pokemon, y=\"attack\", bins=5)" ] }, { "cell_type": "markdown", "id": "d9b4cf0e", "metadata": {}, "source": [ "---\n", "#### Your turn!\n", "\n", "**1.5.0.** Plot a histogram of people's heights." ] }, { "cell_type": "code", "execution_count": null, "id": "379ad264", "metadata": {}, "outputs": [], "source": [ "#Hint: sns.histplot(data=pokemon, x=\"attack\")\n", "sns.histplot(data=people, x=\"height\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }