Lesson 5: Pandas as a dataframe API

Open In Colab

Learning objectives

Students will gain a better understanding of how to use pandas for manipulating data frames

Specific coding skills learned: - Subsetting data frames - Joining data frames together - Other useful pandas functions (incuding groupby, statistics, and conditional subsetting)

%pip install pandas 
%pip install matplotlib 
%pip install numpy 
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

Collecting pandas
  Downloading pandas-2.3.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.3.0-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (62 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from pandas) (2.9.0.post0)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Downloading pandas-2.3.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.0 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 210.1 MB/s eta 0:00:00
Downloading numpy-2.3.0-cp313-cp313-manylinux_2_28_x86_64.whl (16.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/16.6 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.6/16.6 MB 221.0 MB/s eta 0:00:00
Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, numpy, pandas
   ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 [tzdata]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 [numpy]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 [pandas]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 [pandas]
Successfully installed numpy-2.3.0 pandas-2.3.0 pytz-2025.2 tzdata-2025.2
Note: you may need to restart the kernel to use updated packages.
Collecting matplotlib
  Downloading matplotlib-3.10.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.58.2-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (106 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.2 kB)
Requirement already satisfied: numpy>=1.23 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from matplotlib) (2.3.0)
Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from matplotlib) (25.0)
Collecting pillow>=8 (from matplotlib)
  Downloading pillow-11.2.1-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (8.9 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.2.3-py3-none-any.whl.metadata (5.0 kB)
Requirement already satisfied: python-dateutil>=2.7 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)
Downloading matplotlib-3.10.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/8.6 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 120.7 MB/s eta 0:00:00
Downloading contourpy-1.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (322 kB)
Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Downloading fonttools-4.58.2-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/4.9 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.9/4.9 MB 126.4 MB/s eta 0:00:00
Downloading kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/1.5 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 143.1 MB/s eta 0:00:00
Downloading pillow-11.2.1-cp313-cp313-manylinux_2_28_x86_64.whl (4.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/4.6 MB ? eta -:--:--   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.6/4.6 MB 136.6 MB/s eta 0:00:00
Downloading pyparsing-3.2.3-py3-none-any.whl (111 kB)
Installing collected packages: pyparsing, pillow, kiwisolver, fonttools, cycler, contourpy, matplotlib
   ━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/7 [pillow]   ━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/7 [pillow]   ━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━ 3/7 [fonttools]   ━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━ 3/7 [fonttools]   ━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━ 3/7 [fonttools]   ━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━ 3/7 [fonttools]   ━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━ 3/7 [fonttools]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 6/7 [matplotlib]   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 [matplotlib]
Successfully installed contourpy-1.3.2 cycler-0.12.1 fonttools-4.58.2 kiwisolver-1.4.8 matplotlib-3.10.3 pillow-11.2.1 pyparsing-3.2.3
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: numpy in /opt/hostedtoolcache/Python/3.13.4/x64/lib/python3.13/site-packages (2.3.0)
Note: you may need to restart the kernel to use updated packages.

Introduction

In our last class, we discussed the basics of how to use pandas to read in our data. When we load in data from a CSV file with Pandas using a function such as reas_csv, we get a data structure known as a data frame. Basically, this is a two-dimensional table of rows and columns.

This is useful in terms of allowing us to visualize our data, but most users will want to utilize the data for their own research purposes. Therefore, users will find it helpful to use pandas as an application programming interface (or API), which is basically a set of functions that allow users access to the features of the data.

For this class, we are going to be using daily weather history data from multiple U.S. cities from July 2014 - June 2015 (downloaded from FiveThirtyEight’s public repository of data found here: https://github.com/fivethirtyeight/data/tree/master/us-weather-history).

The first set of data we are going to read in is from Philadelphia.

!wget https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KPHL.csv
philly_weather = pd.read_csv('KPHL.csv',index_col='date')
philly_weather.head()

--2025-06-11 17:25:17--  https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KPHL.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20560 (20K) [text/plain]
Saving to: ‘KPHL.csv’

KPHL.csv              0%[                    ]       0  --.-KB/s               KPHL.csv            100%[===================>]  20.08K  --.-KB/s    in 0s      

2025-06-11 17:25:17 (113 MB/s) - ‘KPHL.csv’ saved [20560/20560]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.00	0.13	1.04
2014-7-2	86	75	96	68	86	52	103	1965	1901	0.21	0.13	1.73
2014-7-3	83	74	92	68	87	54	104	1957	1966	0.09	0.12	3.66
2014-7-4	73	68	78	68	87	52	103	1986	1966	0.04	0.13	2.08
2014-7-5	74	64	83	69	87	55	100	1963	1999	0.00	0.13	4.38

Subsetting Data Frames

The rows of this data frame represent each day during this period, and the columns represent the weather data that has been collected.

As you can see, there are numerous types of data that has been collected, including the temperatures that occurred that day, what the average temperatures have been for that day over the years, the records for that day, and the amount of precipitation that occurred both on that day and on average for that day.

Question 1: Let’s say that you do not care about the temperature data, so most of these columns are not neccessary. How can you get the data you want to have?

Answer: You can subset the data frame to only contain the columns you need

In pandas, one of the ways you can select certain subsets of the data is to specify them by the row and/or column names.

To select a particular column (for example, the actual preciptation for each day) , you use square brackets [] along with the column name of the column of interest in quotation marks. Alternatively, this is known as “slicing”, since you are taking a slice of the original data frame.

philly_weather["actual_precipitation"]

date
2014-7-1     0.00
2014-7-2     0.21
2014-7-3     0.09
2014-7-4     0.04
2014-7-5     0.00
             ... 
2015-6-26    0.00
2015-6-27    1.34
2015-6-28    0.22
2015-6-29    0.00
2015-6-30    1.50
Name: actual_precipitation, Length: 365, dtype: float64

In order to get both columns of precipitation data, we need to use double brackets, then within the brackets we list the columns of interest with a comma between them, as shown below.

philly_weather[["actual_precipitation","average_precipitation","record_precipitation"]]

	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	0.00	0.13	1.04
2014-7-2	0.21	0.13	1.73
2014-7-3	0.09	0.12	3.66
2014-7-4	0.04	0.13	2.08
2014-7-5	0.00	0.13	4.38
...	...	...	...
2015-6-26	0.00	0.12	2.77
2015-6-27	1.34	0.10	3.27
2015-6-28	0.22	0.11	2.48
2015-6-29	0.00	0.12	4.62
2015-6-30	1.50	0.12	1.56

365 rows × 3 columns

Another way of selecting these columns is to add to the end of our data frame object “.loc” (which looks for the names of the columns by labels). Since our matrix is rows by columns, the “:” symbol tells the computer that we want all of the rows, but only the columns with specific labels.

philly_weather.loc[:,["actual_precipitation","average_precipitation","record_precipitation"]]

	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	0.00	0.13	1.04
2014-7-2	0.21	0.13	1.73
2014-7-3	0.09	0.12	3.66
2014-7-4	0.04	0.13	2.08
2014-7-5	0.00	0.13	4.38
...	...	...	...
2015-6-26	0.00	0.12	2.77
2015-6-27	1.34	0.10	3.27
2015-6-28	0.22	0.11	2.48
2015-6-29	0.00	0.12	4.62
2015-6-30	1.50	0.12	1.56

365 rows × 3 columns

Likewise, we can do the same for the rows, only placing the “:” after the comma (in the columns position), and selecting the rows by label before the comma, as shown below.

philly_weather.loc[["2014-7-1"],:]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.0	0.13	1.04

philly_weather.loc[["2014-7-1","2014-7-2"],:]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.00	0.13	1.04
2014-7-2	86	75	96	68	86	52	103	1965	1901	0.21	0.13	1.73

While this is not too difficult if we only need a handful of rows and/or columns, this could become time-consuming if we want to get a series of rows and/or columns.

Question 2: Instead of using the labels of the rows/columns, what is another attribute you can use to subset the data?

Answer: You can use their numerical location (ex: the 1st row or 2nd column)

A little earlier in the lesson, we used “.loc” to select rows and columns by their label. To do this by their index, we can instead use “.iloc” to select the row and/or column index value(s).

Given that the “actual_precipitation” and “average_precipitation” columns represent the 9th through 11th rows respectively, this is how we subset the data to obtain the values in these columns .

(note: the first piece of code means “select the data from the 9th column up to, but not including, the 12th column”).

philly_weather.iloc[:,9:12]

	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	0.00	0.13	1.04
2014-7-2	0.21	0.13	1.73
2014-7-3	0.09	0.12	3.66
2014-7-4	0.04	0.13	2.08
2014-7-5	0.00	0.13	4.38
...	...	...	...
2015-6-26	0.00	0.12	2.77
2015-6-27	1.34	0.10	3.27
2015-6-28	0.22	0.11	2.48
2015-6-29	0.00	0.12	4.62
2015-6-30	1.50	0.12	1.56

365 rows × 3 columns

philly_weather.iloc[0:2,:]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.00	0.13	1.04
2014-7-2	86	75	96	68	86	52	103	1965	1901	0.21	0.13	1.73

If the numbers are not in a sequence (ex: columns 0, 5, and 7), you can get the subset of the data like this

philly_weather.iloc[:,[0,5,7]]

	actual_mean_temp	record_min_temp	record_min_temp_year
date
2014-7-1	83	52	1988
2014-7-2	86	52	1965
2014-7-3	83	54	1957
2014-7-4	73	52	1986
2014-7-5	74	55	1963
...	...	...	...
2015-6-26	76	51	1960
2015-6-27	67	50	1965
2015-6-28	73	54	1961
2015-6-29	72	52	1888
2015-6-30	77	52	1988

365 rows × 3 columns

Joining Data Frames

Of course, we might not only be interested in the data from one city. Imagine that we want to do side-by-side comparisons of the average precipitation in Seattle vs. Philadelphia.

Let’s first read in the weather pattern data for Seattle as well.

!wget https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KSEA.csv
seattle_weather = pd.read_csv('KSEA.csv',index_col='date')
seattle_weather.head()

--2025-06-11 17:25:18--  https://raw.githubusercontent.com/fivethirtyeight/data/master/us-weather-history/KSEA.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20603 (20K) [text/plain]
Saving to: ‘KSEA.csv’

KSEA.csv              0%[                    ]       0  --.-KB/s               KSEA.csv            100%[===================>]  20.12K  --.-KB/s    in 0s      

2025-06-11 17:25:18 (136 MB/s) - ‘KSEA.csv’ saved [20603/20603]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	77	60	94	54	73	45	94	1948	2014	0.0	0.03	0.75
2014-7-2	70	58	81	54	73	43	93	1954	2015	0.0	0.03	0.50
2014-7-3	64	57	71	54	74	44	92	1954	2015	0.0	0.03	0.47
2014-7-4	66	57	75	54	74	43	92	1949	2015	0.0	0.03	0.57
2014-7-5	66	56	76	55	74	47	91	1951	1958	0.0	0.03	0.64

Using what we learned earlier, we can subset our data to only look at the columns dealing with precipitation:

philly_weather_precip = philly_weather.iloc[:,9:12]
seattle_weather_precip = seattle_weather.iloc[:,9:12]

Rather than having to flip between both data frames, we can instead combine them into one separate data frame using the function “join”.

There are four different ways that we can join our data, which we will go through below: “left”, “right”, “inner”, and “outer”

For the sake of an example, let us sort our data such that Philadelphia’s precipitation data is sorted by the lowest to highest record precipitation, while Seattle’s precipitation data is sorted from the highest to lowest record precipitation.

philly_weather_precip_reordered = philly_weather_precip.sort_values(by=["record_precipitation"], axis=0, ascending=True)
philly_weather_precip_reordered

	actual_precipitation	average_precipitation	record_precipitation
date
2015-4-29	0.00	0.12	0.85
2014-10-13	0.02	0.10	0.90
2015-2-25	0.00	0.10	0.90
2015-3-9	0.00	0.11	0.94
2014-11-30	0.00	0.12	0.96
...	...	...	...
2014-8-13	0.00	0.11	5.21
2014-10-8	0.00	0.11	5.53
2014-8-3	0.23	0.12	5.63
2014-9-16	0.20	0.13	6.63
2014-7-28	0.23	0.14	8.02

365 rows × 3 columns

seattle_weather_precip_reordered = seattle_weather_precip.sort_values(by=["record_precipitation"], axis=0, ascending=False)
seattle_weather_precip_reordered

	actual_precipitation	average_precipitation	record_precipitation
date
2014-10-20	0.46	0.12	5.02
2014-12-3	0.00	0.20	3.77
2014-11-20	0.14	0.23	3.41
2014-11-6	0.16	0.21	3.29
2015-2-8	0.14	0.13	3.06
...	...	...	...
2014-7-11	0.00	0.03	0.25
2014-7-28	0.00	0.02	0.22
2014-7-29	0.00	0.02	0.17
2014-8-8	0.00	0.02	0.13
2014-7-15	0.00	0.02	0.11

365 rows × 3 columns

A left join indicates that we preserve the order of the data frame is getting another data frame joined to it (in this case, the Philadelphia precipitation data). Since the suffix names are the same for both dataframes, we use lsuffix and rsuffix to specify which columns came from which dataframe.

(Note: ‘left’ is the default setting)

philly_weather_precip_reordered.join(seattle_weather_precip_reordered, how ='left', lsuffix = '_philly', rsuffix = '_seattle')

	actual_precipitation_philly	average_precipitation_philly	record_precipitation_philly	actual_precipitation_seattle	average_precipitation_seattle	record_precipitation_seattle
date
2015-4-29	0.00	0.12	0.85	0.00	0.07	1.06
2014-10-13	0.02	0.10	0.90	0.30	0.10	0.66
2015-2-25	0.00	0.10	0.90	0.16	0.13	0.98
2015-3-9	0.00	0.11	0.94	0.00	0.12	1.47
2014-11-30	0.00	0.12	0.96	0.00	0.22	1.50
...	...	...	...	...	...	...
2014-8-13	0.00	0.11	5.21	0.85	0.03	0.85
2014-10-8	0.00	0.11	5.53	0.00	0.08	1.16
2014-8-3	0.23	0.12	5.63	0.00	0.01	0.47
2014-9-16	0.20	0.13	6.63	0.00	0.05	0.76
2014-7-28	0.23	0.14	8.02	0.00	0.02	0.22

365 rows × 6 columns

Likewise, a right join indicates that we preserve the order of the data frame that is being joined to the first data frame (in this case, the Seattle precipitation data).

philly_weather_precip_reordered.join(seattle_weather_precip_reordered, how ='right', lsuffix = '_philly', rsuffix = '_seattle')

	actual_precipitation_philly	average_precipitation_philly	record_precipitation_philly	actual_precipitation_seattle	average_precipitation_seattle	record_precipitation_seattle
date
2014-10-20	0.00	0.10	2.72	0.46	0.12	5.02
2014-12-3	0.11	0.12	1.60	0.00	0.20	3.77
2014-11-20	0.00	0.10	2.59	0.14	0.23	3.41
2014-11-6	0.52	0.09	1.41	0.16	0.21	3.29
2015-2-8	0.00	0.09	1.15	0.14	0.13	3.06
...	...	...	...	...	...	...
2014-7-11	0.00	0.14	1.92	0.00	0.03	0.25
2014-7-28	0.23	0.14	8.02	0.00	0.02	0.22
2014-7-29	0.00	0.14	3.53	0.00	0.02	0.17
2014-8-8	0.00	0.12	4.40	0.00	0.02	0.13
2014-7-15	0.81	0.15	2.96	0.00	0.02	0.11

365 rows × 6 columns

If we want to order the dates lexographically, we can change the how parameter to ‘outer’ (or the union of the datasets).

philly_weather_precip_reordered.join(seattle_weather_precip_reordered, how ='outer', lsuffix = '_philly', rsuffix = '_seattle')

	actual_precipitation_philly	average_precipitation_philly	record_precipitation_philly	actual_precipitation_seattle	average_precipitation_seattle	record_precipitation_seattle
date
2014-10-1	0.00	0.11	3.00	0.00	0.07	0.56
2014-10-10	0.00	0.11	2.10	0.01	0.09	1.02
2014-10-11	0.48	0.10	2.04	0.29	0.09	0.89
2014-10-12	0.00	0.11	1.62	0.00	0.09	0.70
2014-10-13	0.02	0.10	0.90	0.30	0.10	0.66
...	...	...	...	...	...	...
2015-6-5	0.13	0.13	2.31	0.00	0.06	0.63
2015-6-6	0.05	0.12	1.95	0.00	0.06	1.54
2015-6-7	0.00	0.13	3.50	0.00	0.06	0.81
2015-6-8	0.88	0.12	2.00	0.00	0.07	0.81
2015-6-9	0.02	0.12	2.03	0.00	0.06	0.72

365 rows × 6 columns

When we combine these datasets together, we can easily compare data from both data frames, for example plotting the record precipitation in Seattle vs. Philadelphia

all_precipitation = philly_weather_precip.join(seattle_weather_precip,lsuffix = '_philly', rsuffix = '_seattle')
ax = plt.gca()
all_precipitation.plot(kind='line',y='record_precipitation_philly',ax=ax, color = 'purple',figsize=(12,5))
all_precipitation.plot(kind='line',y='record_precipitation_seattle',ax=ax, color = 'green',figsize=(12,5))
plt.show()

Let us pretend that we want to add information about each month (stored in the months object)

months = pd.Series(['July', 'August','September','October','November','December','January','February','March','April','May','June'])
months = months.repeat([31,31,30,31,30,31,31,28,31,30,31,30])
months = pd.DataFrame(months, columns = ['months'])
months.index = all_precipitation.index
months.index.name = 'date'
months

	months
date
2014-7-1	July
2014-7-2	July
2014-7-3	July
2014-7-4	July
2014-7-5	July
...	...
2015-6-26	June
2015-6-27	June
2015-6-28	June
2015-6-29	June
2015-6-30	June

365 rows × 1 columns

Now, let us say that in addition to the rainfall data, we also started to collect information about the average dewpoint on each day for the first three months in Philadelphia. However, due to statewide budget cuts scientists could not afford to gather data for the rest of the year.

dewpoint = np.random.randint(40,70,size=92)
dewpoint = pd.DataFrame(dewpoint, columns=['dewpoint'])
dewpoint.index = all_precipitation.index[0:92]
dewpoint.index.name = 'date'
dewpoint

	dewpoint
date
2014-7-1	47
2014-7-2	59
2014-7-3	67
2014-7-4	41
2014-7-5	45
...	...
2014-9-26	46
2014-9-27	65
2014-9-28	51
2014-9-29	68
2014-9-30	59

92 rows × 1 columns

As done previously, we can join the datasets together, but you will notice below that all of the dates where there was no dewpoint recorded are labeled ‘NaN’ (“Not a number”, or undefined, data points).

philly_weather_precip.join(dewpoint)

	actual_precipitation	average_precipitation	record_precipitation	dewpoint
date
2014-7-1	0.00	0.13	1.04	47.0
2014-7-2	0.21	0.13	1.73	59.0
2014-7-3	0.09	0.12	3.66	67.0
2014-7-4	0.04	0.13	2.08	41.0
2014-7-5	0.00	0.13	4.38	45.0
...	...	...	...	...
2015-6-26	0.00	0.12	2.77	NaN
2015-6-27	1.34	0.10	3.27	NaN
2015-6-28	0.22	0.11	2.48	NaN
2015-6-29	0.00	0.12	4.62	NaN
2015-6-30	1.50	0.12	1.56	NaN

365 rows × 4 columns

What if you only want to keep the values where dewpoints are recorded? To do this, you can use join with the how parameter set to ‘inner’ (or intersection of the data), as shown below.

philly_weather_precip.join(dewpoint, how='inner')

	actual_precipitation	average_precipitation	record_precipitation	dewpoint
date
2014-7-1	0.00	0.13	1.04	47
2014-7-2	0.21	0.13	1.73	59
2014-7-3	0.09	0.12	3.66	67
2014-7-4	0.04	0.13	2.08	41
2014-7-5	0.00	0.13	4.38	45
...	...	...	...	...
2014-9-26	0.00	0.13	2.79	46
2014-9-27	0.00	0.14	1.85	65
2014-9-28	0.00	0.13	3.22	51
2014-9-29	0.00	0.14	1.87	68
2014-9-30	0.00	0.13	2.41	59

92 rows × 4 columns

Other useful pandas functions: conditional subsetting

There are other ways to subset datasets by their index or their rownames/colnames.

Let us say that you are only looking for days where the max temperature was greater than 90. How would we be able to get only those rows?

To do this, we can do something called conditional subsetting.

Remember in lesson 1 when we learned about comparisons, such as “greater than” ( > ), “lesser than” ( < ), “equals” (==)? We can subset our data this way as well.

To pick the rows where the max temperature was greater than 90, we can subset the data as follows

philly_weather[philly_weather["actual_max_temp"] > 90]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.00	0.13	1.04
2014-7-2	86	75	96	68	86	52	103	1965	1901	0.21	0.13	1.73
2014-7-3	83	74	92	68	87	54	104	1957	1966	0.09	0.12	3.66
2014-7-7	84	74	93	69	87	55	103	1968	2010	0.00	0.13	2.97
2014-7-8	84	72	95	69	87	53	100	1960	1993	0.17	0.14	2.28
2014-7-13	82	72	91	69	87	57	98	1998	1994	0.05	0.14	2.80
2014-7-15	82	73	91	70	87	56	103	1940	1995	0.81	0.15	2.96
2014-7-23	83	72	93	70	87	58	101	1950	2011	0.05	0.14	3.05
2014-7-27	81	70	91	70	87	53	100	1962	1940	1.25	0.15	2.16
2014-8-31	83	72	93	66	83	45	101	1965	1953	1.11	0.10	3.41
2014-9-2	84	74	93	65	83	47	100	1963	1953	0.18	0.12	1.89
2014-9-6	83	72	93	64	81	48	95	1962	1983	0.01	0.11	2.19
2015-5-31	81	70	92	59	78	42	97	1965	1991	0.00	0.13	1.65
2015-6-11	81	70	92	63	82	44	95	1972	1986	0.00	0.11	2.04
2015-6-12	85	75	95	63	82	48	95	1979	1947	0.00	0.11	3.05
2015-6-16	83	74	91	64	83	49	100	1963	1991	0.00	0.11	3.30
2015-6-21	82	73	91	66	84	48	99	1968	1923	0.30	0.11	1.62
2015-6-22	82	73	91	66	85	48	100	1963	1988	0.00	0.11	2.47
2015-6-23	83	70	95	66	85	47	97	1963	1888	0.55	0.11	2.12

As another example, what if we wanted to get all the days where it did not rain?

philly_weather[philly_weather["actual_precipitation"] == 0.00]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.0	0.13	1.04
2014-7-5	74	64	83	69	87	55	100	1963	1999	0.0	0.13	4.38
2014-7-6	75	64	86	69	87	52	102	1960	2010	0.0	0.13	1.36
2014-7-7	84	74	93	69	87	55	103	1968	2010	0.0	0.13	2.97
2014-7-9	80	71	89	69	87	53	103	1963	1936	0.0	0.15	1.79
...	...	...	...	...	...	...	...	...	...	...	...	...
2015-6-17	78	69	86	64	83	44	98	1964	1957	0.0	0.10	2.41
2015-6-22	82	73	91	66	85	48	100	1963	1988	0.0	0.11	2.47
2015-6-24	77	69	85	66	85	52	99	1947	1923	0.0	0.11	2.61
2015-6-26	76	68	84	67	85	51	100	1960	1952	0.0	0.12	2.77
2015-6-29	72	63	81	68	86	52	102	1888	1934	0.0	0.12	4.62

243 rows × 12 columns

What if we want to omit years where the high occurred that year? (Hint: remember the boolean operators AND (&) and OR (|)

Answer:

philly_weather[(philly_weather["record_max_temp_year"] != 2014) & (philly_weather["record_max_temp_year"] != 2015)]

	actual_mean_temp	actual_min_temp	actual_max_temp	average_min_temp	average_max_temp	record_min_temp	record_max_temp	record_min_temp_year	record_max_temp_year	actual_precipitation	average_precipitation	record_precipitation
date
2014-7-1	83	72	93	68	86	52	102	1988	1901	0.00	0.13	1.04
2014-7-2	86	75	96	68	86	52	103	1965	1901	0.21	0.13	1.73
2014-7-3	83	74	92	68	87	54	104	1957	1966	0.09	0.12	3.66
2014-7-4	73	68	78	68	87	52	103	1986	1966	0.04	0.13	2.08
2014-7-5	74	64	83	69	87	55	100	1963	1999	0.00	0.13	4.38
...	...	...	...	...	...	...	...	...	...	...	...	...
2015-6-26	76	68	84	67	85	51	100	1960	1952	0.00	0.12	2.77
2015-6-27	67	63	71	67	86	50	97	1965	1963	1.34	0.10	3.27
2015-6-28	73	66	79	67	86	54	97	1961	1963	0.22	0.11	2.48
2015-6-29	72	63	81	68	86	52	102	1888	1934	0.00	0.12	4.62
2015-6-30	77	67	86	68	86	52	100	1988	1964	1.50	0.12	1.56

364 rows × 12 columns

Other useful pandas functions: groupby

Let us go back to to the all_precip_wmonths object, when we added information about which month each day was in for the combined Philly and Seattle precipitation data. One neat thing we can do is to group data based on certain column values (such as months), as shown below.

all_precip_wmonths.groupby(["months"])

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 all_precip_wmonths.groupby(["months"])

NameError: name 'all_precip_wmonths' is not defined

We can do functions such as mean, min, and max for each month, rather than every day.

all_precip_wmonths.groupby(["months"]).mean()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 all_precip_wmonths.groupby(["months"]).mean()

NameError: name 'all_precip_wmonths' is not defined

all_precip_wmonths.groupby(["months"]).min()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[28], line 1
----> 1 all_precip_wmonths.groupby(["months"]).min()

NameError: name 'all_precip_wmonths' is not defined

all_precip_wmonths.groupby(['months']).max()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 all_precip_wmonths.groupby(['months']).max()

NameError: name 'all_precip_wmonths' is not defined

Remember when we plotted the record precipitation in Philadelphia vs. Seattle? What if we instead plotted the mean month-to-month instead of daily?

mean_months_precip = all_precip_wmonths.groupby(["months"]).mean()
ax2 = plt.gca()
mean_months_precip.plot(kind='line',y='record_precipitation_philly',ax=ax2, color = 'purple',figsize=(12,5))
mean_months_precip.plot(kind='line',y='record_precipitation_seattle',ax=ax2, color = 'green',figsize=(12,5))
plt.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[30], line 1
----> 1 mean_months_precip = all_precip_wmonths.groupby(["months"]).mean()
      2 ax2 = plt.gca()
      3 mean_months_precip.plot(kind='line',y='record_precipitation_philly',ax=ax2, color = 'purple',figsize=(12,5))

NameError: name 'all_precip_wmonths' is not defined

From this graph, we can see that months where the record precipitation was on average the highest in Philadelphia (such as August and July) are the lowest precipitation times in Seattle!

Other useful pandas functions: statistics

Lastly, we can perform statistics on our data as a whole. We have already gone through how certain mathematical functions such as min, max, or mean can be calculated on different groups, but this can also be done to columns of the data frame as well.

For example, we can get the mean value of the actual precipitation in Seattle and Philadelphia, as shown below:

all_precip_wmonths["actual_precipitation_seattle"].mean()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 all_precip_wmonths["actual_precipitation_seattle"].mean()

NameError: name 'all_precip_wmonths' is not defined

all_precip_wmonths["actual_precipitation_philly"].mean()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[32], line 1
----> 1 all_precip_wmonths["actual_precipitation_philly"].mean()

NameError: name 'all_precip_wmonths' is not defined

As you can see, on average the daily precipitation in Philadelphia is only 0.02 inches greater than Seattle.

If you wanted to look at both at the same time, we follow the same rules that we did in label-based subsetting:

all_precip_wmonths[["actual_precipitation_seattle","actual_precipitation_philly"]].mean()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[33], line 1
----> 1 all_precip_wmonths[["actual_precipitation_seattle","actual_precipitation_philly"]].mean()

NameError: name 'all_precip_wmonths' is not defined

In general, if you wanted to look at multiple statistics at once, you can call the function ‘describe’, as shown below:

all_precip_wmonths[["actual_precipitation_seattle","actual_precipitation_philly"]].describe()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 all_precip_wmonths[["actual_precipitation_seattle","actual_precipitation_philly"]].describe()

NameError: name 'all_precip_wmonths' is not defined

In-class exercises

In-class exercise 1: How would we select the precipitation data from July 1st and July 2nd, 2014, in Philadelphia, using label-based subsetting?

Answer:

In-class exercise 2: What would we do to only select the precipitation data from July 1st and July 2nd, 2014 using index-based subsetting?

Answer:

In-class exercise 3: Using the ‘all_precipitation’ and ‘months’ objects called earlier, how can we join these dataframes together and save it as a new object titled ‘all_precip_wmonths’?

Answer:

In-class exercise 4: Go back and use the philly_weather and seattle_weather objects to figure out which days had a record minimum tempeature under 10 in either dataset

Answer:

In-class exercise 5: Using the all_precip_wmonths object, how would you find the differences in standard deviation in average precipitation between Philadelphia and Seattle for each month? (Hint: you may want to use the pandas function “values” at some point in this answer)

Answer:

Homework

There are 81 different exercises pertaining to manipulating Pandas data frames found here: https://www.w3resource.com/python-exercises/pandas/index-dataframe.php (with the predicted output as well as possible solutions to the problems).

I would suggest doing problems 3, 4, 5, 7, 9, 10, 12, 13, 14, 15, 24, and 31 (and if you have time/interest, problems 33, 44, 45, 49, and 52 are also good choices)