- Tidy data is a specific subset of rectangular data, where:
- Each variable is in a column
- Each observation is in a row
- Each value is in a cell

Source: R for Data Science

- Python can hold and manipulate > 1 dataset at the same time
- Python stores objects in memory
- The limit on the size of data is determined by your computer memory
- Most functionality for dealing with data is provided by external libraries

- Standard Python library does not have data type for tabular data
- However,
`pandas`

library has become the de facto standard for data manipulation - pandas is built upon (and often used in conjuction with) other computational libraries
- E.g.
`numpy`

(array data type),`scipy`

(linear algebra) and`scikit-learn`

(machine learning)

In [1]:

```
# Using 'as' allows to avoid typing full name each time the module is referred to
import pandas as pd
```

*Series*is a one-dimensional array-like object

In [2]:

```
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1
```

Out[2]:

0 150.0 1 120.0 2 3000.0 dtype: float64

In [3]:

```
sr1[0] # Slicing is simiar to standard Python objects
```

Out[3]:

150.0

In [4]:

```
sr1[sr1 > 200]
```

Out[4]:

2 3000.0 dtype: float64

- Another way to think about Series is as a ordered dictionary

In [5]:

```
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}
```

In [6]:

```
sr2 = pd.Series(d)
sr2
```

Out[6]:

apple 150.0 banana 120.0 watermelon 3000.0 dtype: float64

In [7]:

```
sr2[0] # Recall that this slicing would be impossible for standard dictionary
```

Out[7]:

150.0

In [8]:

```
sr2.index
```

Out[8]:

Index(['apple', 'banana', 'watermelon'], dtype='object')

*DataFrame*is a rectangular table of data

In [9]:

```
data = {'fruit': ['apple', 'banana', 'watermelon'], # DataFrame can be constructed from
'weight': [150.0, 120.0, 3000.0], # a dict of equal-length lists/arrays
'berry': [False, True, True]}
df = pd.DataFrame(data)
df
```

Out[9]:

fruit | weight | berry | |
---|---|---|---|

0 | apple | 150.0 | False |

1 | banana | 120.0 | True |

2 | watermelon | 3000.0 | True |

- DataFrame has both row and column indices
`DataFrame.loc()`

provides method for*label*location`DataFrame.iloc()`

provides method for*index*location

In [10]:

```
df.iloc[0] # First row
```

Out[10]:

fruit apple weight 150.0 berry False Name: 0, dtype: object

In [11]:

```
df.iloc[:,0] # First column
```

Out[11]:

0 apple 1 banana 2 watermelon Name: fruit, dtype: object

Expression | Selection Operation |
---|---|

`df[val]` |
Column or sequence of columns +convenience (e.g. slice) |

`df.loc[lab_i]` |
Row or subset of rows by label |

`df.loc[:, lab_j]` |
Column or subset of columns by label |

`df.loc[lab_i, lab_j]` |
Both rows and columns by label |

`df.iloc[i]` |
Row or subset of rows by integer position |

`df.iloc[:, j]` |
Column or subset of columns by integer position |

`df.iloc[i, j]` |
Both rows and columns by integer position |

`df.at[lab_i, lab_j]` |
Single scalar value by row and column label |

`df.iat[i, j]` |
Single scalar value by row and column integer position |

In [12]:

```
df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)
```

Out[12]:

fruit | weight | berry | |
---|---|---|---|

0 | apple | 150.0 | False |

1 | banana | 120.0 | True |

In [13]:

```
df[:2] # Shortcut
```

Out[13]:

fruit | weight | berry | |
---|---|---|---|

0 | apple | 150.0 | False |

1 | banana | 120.0 | True |

In [14]:

```
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'
```

Out[14]:

fruit | berry | |
---|---|---|

0 | apple | False |

1 | banana | True |

2 | watermelon | True |

In [15]:

```
df[['fruit', 'berry']] # Shortcut
```

Out[15]:

fruit | berry | |
---|---|---|

0 | apple | False |

1 | banana | True |

2 | watermelon | True |

In [16]:

```
df.columns # Retrieve the names of all columns
```

Out[16]:

Index(['fruit', 'weight', 'berry'], dtype='object')

In [17]:

```
df.columns[0] # This Index object is subsettable
```

Out[17]:

'fruit'

In [18]:

```
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods
```

Out[18]:

array([ True, False, False])

In [19]:

```
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
```

Out[19]:

fruit | |
---|---|

0 | apple |

1 | banana |

2 | watermelon |

In [20]:

```
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries
```

Out[20]:

fruit | weight | berry | |
---|---|---|---|

0 | apple | 150.0 | False |

In [21]:

```
df[df['berry'] == False] # The same can be achieved with more concise syntax
```

Out[21]:

fruit | weight | berry | |
---|---|---|---|

0 | apple | 150.0 | False |

In [22]:

```
weight200 = df[df['weight'] > 200] # Create new dataset with rows where weight is higher than 200
weight200
```

Out[22]:

fruit | weight | berry | |
---|---|---|---|

2 | watermelon | 3000.0 | True |

- Lambda functions can be used to transform data with
`map()`

method

In [23]:

```
df['fruit'].map(lambda x: x.upper())
```

Out[23]:

0 APPLE 1 BANANA 2 WATERMELON Name: fruit, dtype: object

In [24]:

```
transform = lambda x: x.capitalize()
```

In [25]:

```
transformed = df['fruit'].map(transform)
```

In [26]:

```
transformed
```

Out[26]:

0 Apple 1 Banana 2 Watermelon Name: fruit, dtype: object

- File object in Python provides the main interface to external files
- In contrast to other core types, file objects are created not with a literal,
- But with a function,
`open()`

:

`<variable_name> = open(<filepath>, <mode>)`

- Modes of file objects allow to:
- (
`r`

)ead a file (default) - (
`w`

)rite an object to a file - e(
`x`

)clusively create, failing if a file exists - (
`a`

)ppend to a file

- (
- You can
`r+`

mode if you need to read and write to file

In [27]:

```
f = open('../temp/test.txt', 'w') # Create a new file object in write mode
```

In [28]:

```
f.write('This is a test file.') # Write a string of characters to it
```

Out[28]:

20

In [29]:

```
f.close() # Flush output buffers to disk and close the connection
```

- To avoid keeping track of open file connections,
`with`

statement can be used

In [30]:

```
with open('../temp/test.txt', 'r') as f: # Note that we use 'r' mode for reading
text = f.read()
```

In [31]:

```
text
```

Out[31]:

'This is a test file.'

`pandas`

¶`pandas`

provides high-level methods that takes care of file connections- These methods all follow the same
`read_<format>`

and`to_<format>`

name patterns - CSV (comma-separated value) files are the standard of interoperability

`<variable_name> = pd.read_<format>(<filepath>)`

`<variable_name>.to_<format>(<filepath>)`

`pandas`

example¶- We will use the data from Kaggle 2021 Machine Learning and Data Science Survey
- For more information you can read the executive summary
- Or explore the winning Python Jupyter Notebooks

In [32]:

```
# We specify that we want to combine first two rows as a header
kaggle2021 = pd.read_csv('../data/kaggle_survey_2021_responses.csv', header = [0,1])
```

In [33]:

```
kaggle2021.head() # Returns the top n (n=5 default) rows
```

Out[33]:

Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - ClearML | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - MLflow | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other | |

0 | 910 | 50-54 | Man | India | Bachelor’s degree | Other | 5-10 years | Python | R | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

1 | 784 | 50-54 | Man | Indonesia | Master’s degree | Program/Project Manager | 20+ years | NaN | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

2 | 924 | 22-24 | Man | Pakistan | Master’s degree | Software Engineer | 1-3 years | Python | NaN | NaN | ... | NaN | NaN | TensorBoard | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

3 | 575 | 45-49 | Man | Mexico | Doctoral degree | Research Scientist | 20+ years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

4 | 781 | 45-49 | Man | India | Doctoral degree | Other | < 1 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 369 columns

In [34]:

```
kaggle2021.tail() # Returns the bottom n (n=5 default) rows
```

Out[34]:

Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - ClearML | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - MLflow | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other | |

25968 | 1756 | 30-34 | Man | Egypt | Bachelor’s degree | Data Analyst | 1-3 years | Python | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

25969 | 253 | 22-24 | Man | China | Master’s degree | Student | 1-3 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

25970 | 494 | 50-54 | Man | Sweden | Doctoral degree | Research Scientist | I have never written code | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

25971 | 277 | 45-49 | Man | United States of America | Master’s degree | Data Scientist | 5-10 years | Python | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

25972 | 255 | 18-21 | Man | India | Bachelor’s degree | Business Analyst | I have never written code | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

5 rows × 369 columns

`pandas`

¶- Note that when writing data out we start with the object name storing the dataset
- I.e.
`df.to_csv(path)`

as opposed to`df = pd.read_csv(path)`

- Pandas can also write out into other data formats
- E.g.
`df.to_excel(path)`

,`df.to_stata(path)`

In [35]:

```
kaggle2021.to_csv('../temp/kaggle2021.csv')
```

Books:

McKinney, Wes. 2017.

*Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. 2nd ed. Sebastopol, CA: O'Reilly Media**From the original author of the library!**

Online:

- Pandas Getting Started Tutorials
- Pandas Documentation (intermediate and advanced)

- Exploratory data analysis
- Data visualization