# Using 'as' allows to avoid typing full name each time the module is referred to
import pandas as pd


sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1

0     150.0
1     120.0
2    3000.0
dtype: float64


sr1[0] # Slicing is simiar to standard Python objects

150.0


sr1[sr1 > 200]

2    3000.0
dtype: float64


d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}


sr2 = pd.Series(d)
sr2

apple          150.0
banana         120.0
watermelon    3000.0
dtype: float64


sr2[0] # Recall that this slicing would be impossible for standard dictionary

150.0


sr2.index

Index(['apple', 'banana', 'watermelon'], dtype='object')


data = {'fruit': ['apple', 'banana', 'watermelon'], # DataFrame can be constructed from
        'weight': [150.0, 120.0, 3000.0],           # a dict of equal-length lists/arrays
        'berry': [False, True, True]}           
df = pd.DataFrame(data)
df


df.iloc[0] # First row

fruit     apple
weight    150.0
berry     False
Name: 0, dtype: object


df.iloc[:,0] # First column

0         apple
1        banana
2    watermelon
Name: fruit, dtype: object


df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)


df[:2]  # Shortcut


df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'


df[['fruit', 'berry']] # Shortcut


df.columns # Retrieve the names of all columns

Index(['fruit', 'weight', 'berry'], dtype='object')


df.columns[0] # This Index object is subsettable

'fruit'


df.columns.str.startswith('fr') # As column names are strings, we can apply str methods

array([ True, False, False])


df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria


df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries


df[df['berry'] == False] # The same can be achieved with more concise syntax


weight200 = df[df['weight'] > 200] # Create new dataset with rows where weight is higher than 200
weight200


df['fruit'].map(lambda x: x.upper())

0         APPLE
1        BANANA
2    WATERMELON
Name: fruit, dtype: object


transform = lambda x: x.capitalize()


transformed = df['fruit'].map(transform)


transformed

0         Apple
1        Banana
2    Watermelon
Name: fruit, dtype: object


f = open('../temp/test.txt', 'w') # Create a new file object in write mode


f.write('This is a test file.') # Write a string of characters to it

20


f.close() # Flush output buffers to disk and close the connection


with open('../temp/test.txt', 'r') as f: # Note that we use 'r' mode for reading
    text = f.read()


text

'This is a test file.'


united_2022 = pd.read_csv('../data/united_2022.csv')


# Data dimensionality, 2d for tabular
united_2022.ndim

2


# Size of each dimension (analogous to R's dim())
# (n_rows, n_columns)
united_2022.shape

(154575, 18)


# Total number of cells (n_rows * n_columns)
united_2022.size

2782350


united_2022.head() # Returns the top n (n=5 default) rows


united_2022.tail() # Returns the bottom n (n=5 default) rows


united_2022.to_csv('../temp/united_2022.csv')

Expression	Selection Operation
`df[val]`	Column or sequence of columns +convenience (e.g. slice)
`df.loc[lab_i]`	Row or subset of rows by label
`df.loc[:, lab_j]`	Column or subset of columns by label
`df.loc[lab_i, lab_j]`	Both rows and columns by label
`df.iloc[i]`	Row or subset of rows by integer position
`df.iloc[:, j]`	Column or subset of columns by integer position
`df.iloc[i, j]`	Both rows and columns by integer position
`df.at[lab_i, lab_j]`	Single scalar value by row and column label
`df.iat[i, j]`	Single scalar value by row and column integer position

	Carrier Code	Date (MM/DD/YYYY)	Flight Number	Tail Number	Origin Airport	Destination Airport	Scheduled departure time	Actual departure time	Scheduled elapsed time (Minutes)	Actual elapsed time (Minutes)	Departure delay (Minutes)	Wheels-off time	Taxi-Out time (Minutes)	Delay National Aviation System (Minutes)	Delay Late Aircraft Arrival (Minutes)
0	UA	01/01/2022	225	N488UA	ATL	DEN	16:15:00	17:23:00	211	240	68	17:35:00	12	29	68
1	UA	01/01/2022	282	N447UA	ATL	IAH	19:00:00	19:02:00	138	126	2	19:15:00	13	0	0
2	UA	01/01/2022	340	N809UA	ATL	DEN	08:20:00	08:17:00	211	283	-3	08:33:00	16	69	0
3	UA	01/02/2022	225	N463UA	ATL	DEN	16:15:00	16:36:00	211	193	21	16:48:00	12	0	0
4	UA	01/02/2022	282	N63899	ATL	IAH	19:00:00	18:54:00	138	129	-6	19:07:00	13	0	0

	Carrier Code	Date (MM/DD/YYYY)	Flight Number	Tail Number	Origin Airport	Destination Airport	Scheduled departure time	Actual departure time	Scheduled elapsed time (Minutes)	Actual elapsed time (Minutes)	Departure delay (Minutes)	Wheels-off time	Taxi-Out time (Minutes)	Delay Carrier (Minutes)	Delay National Aviation System (Minutes)
154570	UA	05/31/2022	2652	N47524	SFO	BOS	13:25:00	13:19:00	354	318	-6	13:38:00	19	0	0
154571	UA	05/31/2022	2655	N15969	SFO	EWR	08:50:00	10:19:00	327	343	89	10:37:00	18	89	16
154572	UA	05/31/2022	2657	N77431	SFO	PHX	19:00:00	18:52:00	119	115	-8	19:09:00	17	0	0
154573	UA	05/31/2022	2669	N76523	SFO	SAN	10:54:00	10:57:00	101	85	3	11:11:00	14	0	0
154574	UA	05/31/2022	2670	N37253	SFO	TPA	09:59:00	09:52:00	310	305	-7	10:08:00	16	0	0

Part 3: Pandas and Data I/O¶

Introduction to Python¶

Tom Paskhalis¶

2022-07-27¶

Data Science Summer School 2022¶

Rectangular data¶

Tidy data¶

Data in Python¶

Pandas¶

Series¶

Indexing in Series¶

DataFrame - the workhorse of data analysis¶

Indexing in DataFrame¶

Summary of indexing in DataFrame¶

Subsetting in DataFrame¶

Columns in DataFrame¶

Filtering in DataFrame¶

Variable transformation¶

File object¶

Data input and output¶

Data output example¶

Data input example¶

Reading and writing data in `pandas`¶

Flights statistics¶

Reading data in `pandas` example¶

Basic DataFrame info¶

Visual data inspection¶

Visual data inspection continued¶

Reading in other (non-`.csv`) data files¶

Writing data out in `pandas`¶

Additional pandas materials¶

Next¶

Part 3: Pandas and Data I/O¶

Introduction to Python¶

Tom Paskhalis¶

2022-07-27¶

Data Science Summer School 2022¶

Rectangular data¶

Tidy data¶

Data in Python¶

Pandas¶

Series¶

Indexing in Series¶

DataFrame - the workhorse of data analysis¶

Indexing in DataFrame¶

Summary of indexing in DataFrame¶

Subsetting in DataFrame¶

Columns in DataFrame¶

Filtering in DataFrame¶

Variable transformation¶

File object¶

Data input and output¶

Data output example¶

Data input example¶

Reading and writing data in pandas¶

Flights statistics¶

Reading data in pandas example¶

Basic DataFrame info¶

Visual data inspection¶

Visual data inspection continued¶

Reading in other (non-.csv) data files¶

Writing data out in pandas¶

Additional pandas materials¶

Next¶

Reading and writing data in `pandas`¶

Reading data in `pandas` example¶

Reading in other (non-`.csv`) data files¶

Writing data out in `pandas`¶