Source: R for Data Science
The pandas library has become the de facto standard for data manipulation.
It is typically used alongside numpy (array data type), scipy (linear algebra) and scikit-learn (machine learning).
# Using 'as' avoids typing the full name each time the module is referred to
import pandas as pd
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1
0     150.0
1     120.0
2    3000.0
dtype: float64
sr1[0] # Indexing is similar to standard Python objects
150.0
sr1[sr1 > 200]
2    3000.0
dtype: float64
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}
sr2 = pd.Series(d)
sr2
apple          150.0
banana         120.0
watermelon    3000.0
dtype: float64
sr2.iloc[0] # Positional access; recall that this would be impossible for a standard dictionary
150.0
sr2.index
Index(['apple', 'banana', 'watermelon'], dtype='object')
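Because a Series carries an index, the same values can be reached either by label or by position. The snippet below is a minimal sketch that rebuilds the sr2 data from above; the variable names by_label, by_position and heavy are illustrative only.

```python
import pandas as pd

# Same data as sr2 above
sr2 = pd.Series({'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0})

by_label = sr2.loc['banana']   # explicit label-based access, like a dict lookup
by_position = sr2.iloc[0]      # explicit position-based access, like a list
heavy = sr2[sr2 > 200]         # boolean filtering keeps the original labels

print(by_label, by_position)
print(list(heavy.index))
```

Using .loc and .iloc explicitly avoids any ambiguity between labels and positions.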
data = {'fruit': ['apple', 'banana', 'watermelon'], # DataFrame can be constructed from
'weight': [150.0, 120.0, 3000.0], # a dict of equal-length lists/arrays
'berry': [False, True, True]}
df = pd.DataFrame(data)
df
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
2 | watermelon | 3000.0 | True |
DataFrame.loc[] provides label-based selection
DataFrame.iloc[] provides integer position-based selection
df.iloc[0] # First row
fruit     apple
weight    150.0
berry     False
Name: 0, dtype: object
df.iloc[:,0] # First column
0         apple
1        banana
2    watermelon
Name: fruit, dtype: object
Expression | Selection Operation |
---|---|
df[val] | Column or sequence of columns + convenience (e.g. slice) |
df.loc[lab_i] | Row or subset of rows by label |
df.loc[:, lab_j] | Column or subset of columns by label |
df.loc[lab_i, lab_j] | Both rows and columns by label |
df.iloc[i] | Row or subset of rows by integer position |
df.iloc[:, j] | Column or subset of columns by integer position |
df.iloc[i, j] | Both rows and columns by integer position |
df.at[lab_i, lab_j] | Single scalar value by row and column label |
df.iat[i, j] | Single scalar value by row and column integer position |
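The difference between label-based and position-based selection is easier to see when row labels are not simply 0, 1, 2. The sketch below assumes a hypothetical copy of the fruit data, df_lab, indexed by fruit name; it is an illustration of the table above, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Hypothetical variant: use the fruit names as row labels
df_lab = df.set_index('fruit')

row_by_label = df_lab.loc['banana']             # row selected by label
row_by_pos = df_lab.iloc[1]                     # the same row selected by position
cell_by_label = df_lab.at['banana', 'weight']   # single value by labels
cell_by_pos = df_lab.iat[1, 0]                  # single value by positions
sub = df_lab.loc[['apple', 'banana'], ['weight']]  # subset of rows and columns

print(cell_by_label, cell_by_pos)
print(sub)
```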
df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df[:2] # Shortcut
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'
| | fruit | berry |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df[['fruit', 'berry']] # Shortcut
| | fruit | berry |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df.columns # Retrieve the names of all columns
Index(['fruit', 'weight', 'berry'], dtype='object')
df.columns[0] # This Index object is subsettable
'fruit'
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods
array([ True, False, False])
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
| | fruit |
---|---|
0 | apple |
1 | banana |
2 | watermelon |
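Boolean masks built from column names can also be combined, which is where this approach starts to pay off. A minimal sketch, assuming the same fruit data; the particular criteria (prefix 'fr' or suffix 'y') are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Element-wise OR of two boolean arrays, one entry per column
mask = df.columns.str.startswith('fr') | df.columns.str.endswith('y')

# .loc accepts a boolean mask on the column axis as well
selected = df.loc[:, mask]
print(list(selected.columns))
```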
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
df[df['berry'] == False] # The same can be achieved with more concise syntax
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
weight200 = df[df['weight'] > 200] # Create new dataset with rows where weight is higher than 200
weight200
| | fruit | weight | berry |
---|---|---|---|
2 | watermelon | 3000.0 | True |
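Row filters can be combined with the element-wise operators & (and), | (or) and ~ (not). The sketch below assumes the same fruit data; each condition must be wrapped in parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

heavy_berries = df[(df['weight'] > 200) & df['berry']]        # both conditions hold
not_berries = df[~df['berry']]                                # negation of a boolean column
light_or_not_berry = df[(df['weight'] < 130) | ~df['berry']]  # either condition holds

print(heavy_berries['fruit'].tolist())
print(light_or_not_berry['fruit'].tolist())
```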
The map() method applies a function to each element of a Series
df['fruit'].map(lambda x: x.upper())
0         APPLE
1        BANANA
2    WATERMELON
Name: fruit, dtype: object
transform = lambda x: x.capitalize()
transformed = df['fruit'].map(transform)
transformed
0         Apple
1        Banana
2    Watermelon
Name: fruit, dtype: object
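Besides lambdas, map() also accepts a named function or a dictionary used as a lookup table. A minimal sketch, assuming the same fruit column; the colours dictionary is a made-up example, and values missing from it come back as NaN.

```python
import pandas as pd

fruit = pd.Series(['apple', 'banana', 'watermelon'], name='fruit')

# Hypothetical lookup table; 'watermelon' is deliberately left out
colours = {'apple': 'red', 'banana': 'yellow'}

mapped = fruit.map(colours)   # dictionary lookup, unmatched values become NaN
lengths = fruit.map(len)      # any callable works, including built-ins
upper = fruit.str.upper()     # vectorised string method, same result as map with str.upper

print(mapped.tolist())
print(lengths.tolist())
```

For simple string operations the .str accessor is usually the more readable choice.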
Files are opened with the built-in open() function: <variable_name> = open(<filepath>, <mode>)
Modes:
r: (r)ead a file (default)
w: (w)rite to a file, overwriting existing contents
x: e(x)clusively create, failing if a file exists
a: (a)ppend to a file
Use r+ mode if you need to both read and write to a file
f = open('../temp/test.txt', 'w') # Create a new file object in write mode
f.write('This is a test file.') # Write a string of characters to it
20
f.close() # Flush output buffers to disk and close the connection
Alternatively, a with statement can be used, which closes the file automatically
with open('../temp/test.txt', 'r') as f: # Note that we use 'r' mode for reading
    text = f.read()
text
'This is a test file.'
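The file modes listed above can be compared side by side. The sketch below writes to a temporary directory created with tempfile rather than to ../temp, which is an assumption made so that the example runs anywhere.

```python
import os
import tempfile

# Temporary location used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'test.txt')

with open(path, 'w') as f:      # 'w' creates the file (or overwrites it)
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' adds to the end of the existing file
    f.write('second line\n')

try:
    with open(path, 'x') as f:  # 'x' refuses to touch an existing file
        f.write('never written')
    created = True
except FileExistsError:
    created = False

with open(path, 'r') as f:      # 'r' reads the file back
    lines = f.read().splitlines()

print(lines, created)
```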
Reading and writing files in pandas
pandas provides high-level methods that take care of file connections
These methods follow the read_<format> and to_<format> name patterns
<variable_name> = pd.read_<format>(<filepath>)
<variable_name>.to_<format>(<filepath>)
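The naming pattern can be tried out without touching the disk. This is a minimal sketch that uses in-memory buffers (io.StringIO) in place of file paths, which is an assumption made for portability; CSV and JSON stand in for any supported format.

```python
import io
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# to_csv / read_csv
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                   # rewind before reading the buffer back
df_csv = pd.read_csv(buffer)

# to_json / read_json follow the same pattern
json_text = df.to_json(orient='records')
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```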
Reading files in pandas: example
# We specify that we want to combine first two rows as a header
kaggle2021 = pd.read_csv('../data/kaggle_survey_2021_responses.csv', header = [0,1])
DtypeWarning: Columns (195,201) have mixed types. Specify dtype option on import or set low_memory=False.
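The warning appears when pandas, reading a large file in chunks, infers different types for different parts of the same column. The message suggests two remedies, sketched below on a tiny made-up CSV (the column names id and code are hypothetical and unrelated to the survey file).

```python
import io
import pandas as pd

# Hypothetical stand-in for a file whose 'code' column mixes numbers and text
csv_text = "id,code\n1,7\n2,abc\n3,9\n"

# Option 1: state the type of the ambiguous column explicitly
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'code': str})

# Option 2: let pandas infer types from the whole file at once
whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(as_str.dtypes)
print(whole['code'].tolist())
```

Stating the dtype is the more explicit of the two, since the resulting column type no longer depends on the contents of the file.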
kaggle2021.head() # Returns the top n (n=5 default) rows
| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - ClearML | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - MLflow | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other |
0 | 910 | 50-54 | Man | India | Bachelor’s degree | Other | 5-10 years | Python | R | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 784 | 50-54 | Man | Indonesia | Master’s degree | Program/Project Manager | 20+ years | NaN | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |
2 | 924 | 22-24 | Man | Pakistan | Master’s degree | Software Engineer | 1-3 years | Python | NaN | NaN | ... | NaN | NaN | TensorBoard | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 575 | 45-49 | Man | Mexico | Doctoral degree | Research Scientist | 20+ years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |
4 | 781 | 45-49 | Man | India | Doctoral degree | Other | < 1 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 369 columns
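Passing header=[0, 1] makes pandas build two-level (MultiIndex) column labels, here the question code on top and the question text below. The sketch uses a made-up two-column miniature of the survey, so the strings in csv_text are illustrative only.

```python
import io
import pandas as pd

# Hypothetical miniature of the survey file: two header rows, then data
csv_text = (
    "Q1,Q2\n"
    "What is your age?,What is your gender?\n"
    "50-54,Man\n"
    "22-24,Woman\n"
)

survey = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

codes = survey.columns.get_level_values(0)   # top level: question codes
ages = survey['Q1']                          # selecting by the top level alone

# Flatten to a single header level when the question text is not needed
flat = survey.copy()
flat.columns = codes

print(survey.columns.nlevels)
print(flat['Q1'].tolist())
```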
kaggle2021.tail() # Returns the bottom n (n=5 default) rows
| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - ClearML | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - MLflow | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other |
25968 | 1756 | 30-34 | Man | Egypt | Bachelor’s degree | Data Analyst | 1-3 years | Python | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25969 | 253 | 22-24 | Man | China | Master’s degree | Student | 1-3 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25970 | 494 | 50-54 | Man | Sweden | Doctoral degree | Research Scientist | I have never written code | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |
25971 | 277 | 45-49 | Man | United States of America | Master’s degree | Data Scientist | 5-10 years | Python | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25972 | 255 | 18-21 | Man | India | Bachelor’s degree | Business Analyst | I have never written code | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |
5 rows × 369 columns
Writing files in pandas
Writing mirrors reading: df.to_csv(path) as opposed to df = pd.read_csv(path)
Other formats follow the same pattern, e.g. df.to_excel(path), df.to_stata(path)
kaggle2021.to_csv('../temp/kaggle2021.csv')
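One detail worth knowing about to_csv() is that it writes the row index as the first column by default, so reading the file back yields an extra unnamed column. A minimal sketch on a small made-up DataFrame, using in-memory buffers instead of files:

```python
import io
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana'], 'weight': [150.0, 120.0]})

# Default behaviour: the index is written as an unlabeled first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# index=False omits the row labels from the output
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread.columns))
print(list(reread_clean.columns))
```

Alternatively, the index column can be restored on import with pd.read_csv(path, index_col=0).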
Books:
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 2nd ed. Sebastopol, CA: O'Reilly Media
From the original author of the library!
Online: