Source: R for Data Science
pandas
library has become the de facto standard for data manipulationnumpy
(array data type), scipy
(linear algebra) and scikit-learn
(machine learning)# Using 'as' allows to avoid typing full name each time the module is referred to
import pandas as pd
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1
0 150.0 1 120.0 2 3000.0 dtype: float64
sr1[0] # Slicing is simiar to standard Python objects
150.0
sr1[sr1 > 200]
2 3000.0 dtype: float64
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}
sr2 = pd.Series(d)
sr2
apple 150.0 banana 120.0 watermelon 3000.0 dtype: float64
sr2[0] # Recall that this slicing would be impossible for standard dictionary
150.0
sr2.index
Index(['apple', 'banana', 'watermelon'], dtype='object')
data = {'fruit': ['apple', 'banana', 'watermelon'], # DataFrame can be constructed from
'weight': [150.0, 120.0, 3000.0], # a dict of equal-length lists/arrays
'berry': [False, True, True]}
df = pd.DataFrame(data)
df
fruit | weight | berry | |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
2 | watermelon | 3000.0 | True |
DataFrame.loc()
provides method for label locationDataFrame.iloc()
provides method for index locationdf.iloc[0] # First row
fruit apple weight 150.0 berry False Name: 0, dtype: object
df.iloc[:,0] # First column
0 apple 1 banana 2 watermelon Name: fruit, dtype: object
Expression | Selection Operation |
---|---|
df[val] |
Column or sequence of columns +convenience (e.g. slice) |
df.loc[lab_i] |
Row or subset of rows by label |
df.loc[:, lab_j] |
Column or subset of columns by label |
df.loc[lab_i, lab_j] |
Both rows and columns by label |
df.iloc[i] |
Row or subset of rows by integer position |
df.iloc[:, j] |
Column or subset of columns by integer position |
df.iloc[i, j] |
Both rows and columns by integer position |
df.at[lab_i, lab_j] |
Single scalar value by row and column label |
df.iat[i, j] |
Single scalar value by row and column integer position |
df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)
fruit | weight | berry | |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df[:2] # Shortcut
fruit | weight | berry | |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'
fruit | berry | |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df[['fruit', 'berry']] # Shortcut
fruit | berry | |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df.columns # Retrieve the names of all columns
Index(['fruit', 'weight', 'berry'], dtype='object')
df.columns[0] # This Index object is subsettable
'fruit'
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods
array([ True, False, False])
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
fruit | |
---|---|
0 | apple |
1 | banana |
2 | watermelon |
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries
fruit | weight | berry | |
---|---|---|---|
0 | apple | 150.0 | False |
df[df['berry'] == False] # The same can be achieved with more concise syntax
fruit | weight | berry | |
---|---|---|---|
0 | apple | 150.0 | False |
weight200 = df[df['weight'] > 200] # Create new dataset with rows where weight is higher than 200
weight200
fruit | weight | berry | |
---|---|---|---|
2 | watermelon | 3000.0 | True |
map()
methoddf['fruit'].map(lambda x: x.upper())
0 APPLE 1 BANANA 2 WATERMELON Name: fruit, dtype: object
transform = lambda x: x.capitalize()
transformed = df['fruit'].map(transform)
transformed
0 Apple 1 Banana 2 Watermelon Name: fruit, dtype: object
open()
:<variable_name> = open(<filepath>, <mode>)
r
)ead a file (default)w
)rite an object to a filex
)clusively create, failing if a file existsa
)ppend to a filer+
mode if you need to read and write to filef = open('../temp/test.txt', 'w') # Create a new file object in write mode
f.write('This is a test file.') # Write a string of characters to it
20
f.close() # Flush output buffers to disk and close the connection
with
statement can be usedwith open('../temp/test.txt', 'r') as f: # Note that we use 'r' mode for reading
text = f.read()
text
'This is a test file.'
pandas
¶pandas
provides high-level methods that takes care of file connectionsread_<format>
and to_<format>
name patterns<variable_name> = pd.read_<format>(<filepath>)
<variable_name>.to_<format>(<filepath>)
pandas
example¶united_2022 = pd.read_csv('../data/united_2022.csv')
# Data dimensionality, 2d for tabular
united_2022.ndim
2
# Size of each dimension (analogous to R's dim())
# (n_rows, n_columns)
united_2022.shape
(154575, 18)
# Total number of cells (n_rows * n_columns)
united_2022.size
2782350
united_2022.head() # Returns the top n (n=5 default) rows
Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Origin Airport | Destination Airport | Scheduled departure time | Actual departure time | Scheduled elapsed time (Minutes) | Actual elapsed time (Minutes) | Departure delay (Minutes) | Wheels-off time | Taxi-Out time (Minutes) | Delay Carrier (Minutes) | Delay Weather (Minutes) | Delay National Aviation System (Minutes) | Delay Security (Minutes) | Delay Late Aircraft Arrival (Minutes) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UA | 01/01/2022 | 225 | N488UA | ATL | DEN | 16:15:00 | 17:23:00 | 211 | 240 | 68 | 17:35:00 | 12 | 0 | 0 | 29 | 0 | 68 |
1 | UA | 01/01/2022 | 282 | N447UA | ATL | IAH | 19:00:00 | 19:02:00 | 138 | 126 | 2 | 19:15:00 | 13 | 0 | 0 | 0 | 0 | 0 |
2 | UA | 01/01/2022 | 340 | N809UA | ATL | DEN | 08:20:00 | 08:17:00 | 211 | 283 | -3 | 08:33:00 | 16 | 0 | 0 | 69 | 0 | 0 |
3 | UA | 01/02/2022 | 225 | N463UA | ATL | DEN | 16:15:00 | 16:36:00 | 211 | 193 | 21 | 16:48:00 | 12 | 0 | 0 | 0 | 0 | 0 |
4 | UA | 01/02/2022 | 282 | N63899 | ATL | IAH | 19:00:00 | 18:54:00 | 138 | 129 | -6 | 19:07:00 | 13 | 0 | 0 | 0 | 0 | 0 |
united_2022.tail() # Returns the bottom n (n=5 default) rows
Carrier Code | Date (MM/DD/YYYY) | Flight Number | Tail Number | Origin Airport | Destination Airport | Scheduled departure time | Actual departure time | Scheduled elapsed time (Minutes) | Actual elapsed time (Minutes) | Departure delay (Minutes) | Wheels-off time | Taxi-Out time (Minutes) | Delay Carrier (Minutes) | Delay Weather (Minutes) | Delay National Aviation System (Minutes) | Delay Security (Minutes) | Delay Late Aircraft Arrival (Minutes) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
154570 | UA | 05/31/2022 | 2652 | N47524 | SFO | BOS | 13:25:00 | 13:19:00 | 354 | 318 | -6 | 13:38:00 | 19 | 0 | 0 | 0 | 0 | 0 |
154571 | UA | 05/31/2022 | 2655 | N15969 | SFO | EWR | 08:50:00 | 10:19:00 | 327 | 343 | 89 | 10:37:00 | 18 | 89 | 0 | 16 | 0 | 0 |
154572 | UA | 05/31/2022 | 2657 | N77431 | SFO | PHX | 19:00:00 | 18:52:00 | 119 | 115 | -8 | 19:09:00 | 17 | 0 | 0 | 0 | 0 | 0 |
154573 | UA | 05/31/2022 | 2669 | N76523 | SFO | SAN | 10:54:00 | 10:57:00 | 101 | 85 | 3 | 11:11:00 | 14 | 0 | 0 | 0 | 0 | 0 |
154574 | UA | 05/31/2022 | 2670 | N37253 | SFO | TPA | 09:59:00 | 09:52:00 | 310 | 305 | -7 | 10:08:00 | 16 | 0 | 0 | 0 | 0 | 0 |
pandas
¶df.to_csv(path)
as opposed to df = pd.read_csv(path)
df.to_excel(path)
, df.to_stata(path)
united_2022.to_csv('../temp/united_2022.csv')
Books:
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 2nd ed. Sebastopol, CA: O'Reilly Media
From the original author of the library!
Online: