import pandas as pd


united_2022 = pd.read_csv('../data/united_2022.csv')


united_2022.head(n = 1)


united_2022.describe() # DataFrame.describe() provides an range of summary statistics


# Rather than using describe(), we can apply individual methods
united_2022['Departure delay (Minutes)'].mean()

14.024842309558467


united_2022['Departure delay (Minutes)'].median() # Median

0.0


united_2022['Departure delay (Minutes)'].std() # Standard deviation

46.833962555880554


import statistics ## We don't have to rely only on methods provided by `pandas`
statistics.stdev(united_2022.iloc[:,10])

46.833962555880554


united_2022.describe(include = 'all') # Adding include = 'all' tells pandas to summarize all variables


united_2022['Flight Number'].mode() # Mode, most frequent value

0    1561
Name: Flight Number, dtype: int64


united_2022['Origin Airport'].value_counts() # Counts of unique values

DEN    32245
EWR    25761
ORD    24767
IAH    22561
SFO    16511
IAD    11286
LAX     9986
MCO     5680
DFW     2658
ATL     1297
PHL     1221
JFK      602
Name: Origin Airport, dtype: int64


united_2022['Origin Airport'].value_counts(normalize = True) # We can further normalize them by the number of rows

DEN    0.208604
EWR    0.166657
ORD    0.160226
IAH    0.145955
SFO    0.106815
IAD    0.073013
LAX    0.064603
MCO    0.036746
DFW    0.017196
ATL    0.008391
PHL    0.007899
JFK    0.003895
Name: Origin Airport, dtype: float64


# Calculate crosstabulation between origin and destination airports
pd.crosstab(united_2022['Origin Airport'], united_2022['Destination Airport'])


# It is often useful to see the proportions/percentages rather than raw counts
pd.crosstab(united_2022['Origin Airport'], united_2022['Destination Airport'], normalize = 'columns')


from plotnine import *


origin_plot = ggplot(data = united_2022) + geom_bar(aes(x = 'Origin Airport'))
origin_plot

<ggplot: (8785261968216)>


# First we need to group dataset by 'Origin Airport' and summarize it with `size()`
united_2022_origin_grouped = united_2022.groupby(['Origin Airport']).size()


united_2022_origin_grouped.plot(kind = 'bar')

<AxesSubplot:xlabel='Origin Airport'>


import matplotlib.pyplot as plt


# `matplotlib` is more low-level library
# plots would need more work to be 'prettified'
plt.bar(x = united_2022_origin_grouped.index, height = united_2022_origin_grouped.values)

<BarContainer object of 12 artists>


# Here we change default axes' labels and then apply B&W theme
origin_plot_pretty = origin_plot +\
    labs(y = 'flights') +\
    theme_bw()
origin_plot_pretty

<ggplot: (8785259675203)>


origin_plot_pretty.save('../temp/q1_plot_pretty.pdf')

/home/tpaskhalis/Decrypted/Git/DS3/venv/lib/python3.8/site-packages/plotnine/ggplot.py:718: PlotnineWarning: Saving 6.4 x 4.8 in image.
/home/tpaskhalis/Decrypted/Git/DS3/venv/lib/python3.8/site-packages/plotnine/ggplot.py:719: PlotnineWarning: Filename: ../temp/q1_plot_pretty.pdf

	Flight Number	Scheduled elapsed time (Minutes)	Actual elapsed time (Minutes)	Departure delay (Minutes)	Taxi-Out time (Minutes)	Delay Carrier (Minutes)	Delay Weather (Minutes)	Delay National Aviation System (Minutes)	Delay Security (Minutes)	Delay Late Aircraft Arrival (Minutes)
count	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000
mean	1442.155795	192.808876	180.380618	14.024842	18.978451	5.010209	0.552172	2.590923	0.001844	5.616122
std	732.636240	86.048770	88.502329	46.833963	10.081536	25.454662	12.039743	13.370774	0.281220	28.164622
min	42.000000	42.000000	0.000000	-29.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	691.000000	134.000000	123.000000	-4.000000	14.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	1546.000000	173.000000	164.000000	0.000000	17.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	2060.000000	239.000000	230.000000	12.000000	21.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	3010.000000	685.000000	711.000000	1403.000000	184.000000	1215.000000	876.000000	999.000000	70.000000	1280.000000

	Carrier Code	Date (MM/DD/YYYY)	Flight Number	Tail Number	Origin Airport	Destination Airport	Scheduled departure time	Actual departure time	Scheduled elapsed time (Minutes)	Actual elapsed time (Minutes)	Departure delay (Minutes)	Wheels-off time	Taxi-Out time (Minutes)	Delay Carrier (Minutes)	Delay Weather (Minutes)	Delay National Aviation System (Minutes)	Delay Security (Minutes)	Delay Late Aircraft Arrival (Minutes)
count	154575	154575	154575.000000	151157	154575	154575	154575	154575	154575.000000	154575.000000	154575.000000	154575	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000	154575.000000
unique	1	151	NaN	747	12	110	1066	1307	NaN	NaN	NaN	1305	NaN	NaN	NaN	NaN	NaN	NaN
top	UA	05/26/2022	NaN	N32404	DEN	EWR	06:00:00	00:00:00	NaN	NaN	NaN	00:00:00	NaN	NaN	NaN	NaN	NaN	NaN
freq	154575	1143	NaN	337	32245	9800	1686	3418	NaN	NaN	NaN	3473	NaN	NaN	NaN	NaN	NaN	NaN
mean	NaN	NaN	1442.155795	NaN	NaN	NaN	NaN	NaN	192.808876	180.380618	14.024842	NaN	18.978451	5.010209	0.552172	2.590923	0.001844	5.616122
std	NaN	NaN	732.636240	NaN	NaN	NaN	NaN	NaN	86.048770	88.502329	46.833963	NaN	10.081536	25.454662	12.039743	13.370774	0.281220	28.164622
min	NaN	NaN	42.000000	NaN	NaN	NaN	NaN	NaN	42.000000	0.000000	-29.000000	NaN	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	NaN	NaN	691.000000	NaN	NaN	NaN	NaN	NaN	134.000000	123.000000	-4.000000	NaN	14.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	NaN	NaN	1546.000000	NaN	NaN	NaN	NaN	NaN	173.000000	164.000000	0.000000	NaN	17.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	NaN	NaN	2060.000000	NaN	NaN	NaN	NaN	NaN	239.000000	230.000000	12.000000	NaN	21.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	NaN	NaN	3010.000000	NaN	NaN	NaN	NaN	NaN	685.000000	711.000000	1403.000000	NaN	184.000000	1215.000000	876.000000	999.000000	70.000000	1280.000000

Method	Numeric	Categorical	Description
`count`	yes	yes	Number of non-NA observations
`value_counts`	yes	yes	Number of unique observations by value
`describe`	yes	yes	Set of summary statistics for Series/DataFrame
`min`, `max`	yes	yes (caution)	Minimum and maximum values
`quantile`	yes	no	Sample quantile ranging from 0 to 1
`sum`	yes	yes (caution)	Sum of values
`prod`	yes	no	Product of values
`mean`	yes	no	Mean
`median`	yes	no	Median (50% quantile)
`var`	yes	no	Sample variance
`std`	yes	no	Sample standard deviation
`skew`	yes	no	Sample skewness (third moment)
`kurt`	yes	no	Sample kurtosis (fourth moment)

Destination Airport	ABQ	ALB	ANC	ATL	AUS	BDL	BIL	BNA	BOI	BOS	...	SLC	SMF	SNA	SRQ	STL	STT	SYR	TPA	TUL	TUS
Origin Airport
ATL	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
DEN	267	0	154	313	585	151	300	151	300	670	...	422	633	617	151	257	0	110	565	108	232
DFW	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
EWR	0	0	0	586	553	0	0	346	0	1280	...	221	150	375	290	0	153	1	1157	0	0
IAD	0	0	0	147	270	340	0	162	0	480	...	0	88	0	2	0	151	123	400	0	0
IAH	68	0	0	23	897	86	0	112	0	464	...	139	175	537	0	96	130	0	637	118	36
JFK	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
LAX	0	0	0	0	111	0	0	0	0	250	...	0	0	0	0	0	0	0	142	0	0
MCO	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ORD	29	199	66	205	518	314	0	141	0	643	...	105	280	497	224	0	40	108	576	0	72
PHL	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
SFO	81	0	0	26	421	0	0	150	0	425	...	63	0	488	0	0	0	0	248	0	0

Destination Airport	ABQ	ALB	ANC	ATL	AUS	BDL	BIL	BNA	BOI	BOS	...	SLC	SMF	SNA	SRQ	STL	STT	SYR	TPA	TUL	TUS
Origin Airport
ATL	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
DEN	0.600000	0.0	0.7	0.240769	0.174367	0.169473	1.0	0.142185	1.0	0.159069	...	0.444211	0.477376	0.245426	0.226387	0.728045	0.000000	0.321637	0.151678	0.477876	0.682353
DFW	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
EWR	0.000000	0.0	0.0	0.450769	0.164829	0.000000	0.0	0.325800	0.0	0.303894	...	0.232632	0.113122	0.149165	0.434783	0.000000	0.322785	0.002924	0.310604	0.000000	0.000000
IAD	0.000000	0.0	0.0	0.113077	0.080477	0.381594	0.0	0.152542	0.0	0.113960	...	0.000000	0.066365	0.000000	0.002999	0.000000	0.318565	0.359649	0.107383	0.000000	0.000000
IAH	0.152809	0.0	0.0	0.017692	0.267362	0.096521	0.0	0.105461	0.0	0.110161	...	0.146316	0.131976	0.213604	0.000000	0.271955	0.274262	0.000000	0.171007	0.522124	0.105882
JFK	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
LAX	0.000000	0.0	0.0	0.000000	0.033085	0.000000	0.0	0.000000	0.0	0.059354	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.038121	0.000000	0.000000
MCO	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
ORD	0.065169	1.0	0.3	0.157692	0.154396	0.352413	0.0	0.132768	0.0	0.152659	...	0.110526	0.211161	0.197693	0.335832	0.000000	0.084388	0.315789	0.154631	0.000000	0.211765
PHL	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.0	0.000000	0.0	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
SFO	0.182022	0.0	0.0	0.020000	0.125484	0.000000	0.0	0.141243	0.0	0.100902	...	0.066316	0.000000	0.194113	0.000000	0.000000	0.000000	0.000000	0.066577	0.000000	0.000000

Part 4: Exploratory Data Analysis & Data Visualization¶

Introduction to Python¶

Tom Paskhalis¶

2022-07-27¶

Data Science Summer School 2022¶

Exploratory data analysis¶

Measurement scales¶

Measurement scales in Pandas¶

Loading the dataset¶

Summarizing numeric variables¶

Methods for summarizing numeric variables¶

Summarizing categorical variables¶

Methods for summarizing categorical variables¶

Summary of descriptive statistics methods¶

Crosstabulation¶

Crosstabulation in pandas¶

Margins in crosstab¶

Data visualization¶

Data visualization in Python¶

`plotnine` - `ggplot` for Python¶

Grammar of graphics¶

Structure of ggplot calls in `plotnine`¶

Creating a ggplot in `plotnine`¶

Compare to base `pandas`¶

Compare to `matplotlib`¶

Prettifying ggplot in `plotnine`¶

Other geometric objects (`geom_`)¶

Writing plots out in `plotnine`¶

Additional visualization materials¶

The end¶

Method	Description
`geom_bar()`, `geom_col()`	Bar charts
`geom_boxplot()`	Box and whisker plot
`geom_histogram()`	Histogram
`geom_point()`	Scatterplot
`geom_line()`, `geom_path()`	Lines
`geom_map()`	Geographic areas
`geom_smooth()`	Smoothed conditional means
`geom_violin()`	Violin plots

Part 4: Exploratory Data Analysis & Data Visualization¶

Introduction to Python¶

Tom Paskhalis¶

2022-07-27¶

Data Science Summer School 2022¶

Exploratory data analysis¶

Measurement scales¶

Measurement scales in Pandas¶

Loading the dataset¶

Summarizing numeric variables¶

Methods for summarizing numeric variables¶

Summarizing categorical variables¶

Methods for summarizing categorical variables¶

Summary of descriptive statistics methods¶

Crosstabulation¶

Crosstabulation in pandas¶

Margins in crosstab¶

Data visualization¶

Data visualization in Python¶

plotnine - ggplot for Python¶

Grammar of graphics¶

Structure of ggplot calls in plotnine¶

Creating a ggplot in plotnine¶

Compare to base pandas¶

Compare to matplotlib¶

Prettifying ggplot in plotnine¶

Other geometric objects (geom_)¶

Writing plots out in plotnine¶

Additional visualization materials¶

The end¶

`plotnine` - `ggplot` for Python¶

Structure of ggplot calls in `plotnine`¶

Creating a ggplot in `plotnine`¶

Compare to base `pandas`¶

Compare to `matplotlib`¶

Prettifying ggplot in `plotnine`¶

Other geometric objects (`geom_`)¶

Writing plots out in `plotnine`¶