- Numerical analysis in Python
- Tabular data
- Pandas object types
- Working with dataframes in pandas
- Data input and output

- As opposed to other programming languages (Julia, R, MATLAB), Python provides very bare-bones functionality for numeric analysis.
  - E.g. no built-in matrix/array object type, limited mathematical and statistical functions

In [1]:

```
# Representing 3x3 matrix with list
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
```

In [2]:

```
# Subsetting 3rd element of 2nd row
mat[1][2]
```

Out[2]:

6

In [3]:

```
# Naturally, this representation
# quickly breaks down
mat * 2
```

Out[3]:

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
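With plain lists, doubling every element means traversing the rows explicitly. A minimal sketch using a nested list comprehension (the name `doubled` is only illustrative):

```python
# The same 3x3 matrix represented as a list of lists
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Multiplying a list by 2 repeats it, so it now has 6 rows
repeated = mat * 2

# Element-wise multiplication requires looping over rows and elements
doubled = [[x * 2 for x in row] for row in mat]
print(doubled)
```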

- The NumPy (**Num**eric **Py**thon) package provides the basis of numerical computing in Python:
  - multidimensional array
  - mathematical functions for arrays
  - array data I/O
  - linear algebra, RNG, FFT, ...

In [4]:

```
# Using 'as' lets us avoid typing the full module name
# each time the module is referred to
import numpy as np
```

- The multidimensional (N-dimensional) array object, `ndarray`, is the principal container for datasets in Python.
- It is the backbone of data frames, operating behind the scenes

In [5]:

```
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
```

In [6]:

```
arr[1][2]
```

Out[6]:

6

In [7]:

```
arr * 2
```

Out[7]:

array([[ 2, 4, 6], [ 8, 10, 12], [14, 16, 18]])

In [8]:

```
# Object type
type(arr)
```

Out[8]:

numpy.ndarray

In [9]:

```
# Array dimensionality
arr.ndim
```

Out[9]:

2

In [10]:

```
# Array shape (size along each dimension)
arr.shape
```

Out[10]:

(3, 3)

In [11]:

```
# Creation from list of lists
np.array(mat)
```

Out[11]:

array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
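Beyond element-wise arithmetic, the NumPy features listed earlier (mathematical functions, linear algebra, random number generation) can be sketched as follows. The variable names and the seed are arbitrary choices for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Mathematical functions operate on whole arrays at once
col_means = arr.mean(axis=0)   # mean of each column
roots = np.sqrt(arr)           # element-wise square root

# Linear algebra: matrix product of arr with its transpose
product = arr @ arr.T

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=42)
sample = rng.normal(size=(2, 3))

print(col_means)
print(product)
```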

- Tidy data is a specific subset of rectangular data, where:
- Each variable is in a column
- Each observation is in a row
- Each value is in a cell

Source: R for Data Science
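To make the three rules concrete, here is a small sketch that reshapes a "wide" table into tidy form with `DataFrame.melt()`. The fruit prices are invented for illustration only.

```python
import pandas as pd

# Wide layout: the variable 'year' is spread across column names (made-up data)
wide = pd.DataFrame({'fruit': ['apple', 'banana'],
                     '2019': [1.2, 0.5],
                     '2020': [1.3, 0.6]})

# Tidy layout: one column per variable, one row per (fruit, year) observation
tidy = wide.melt(id_vars='fruit', var_name='year', value_name='price')
print(tidy)
```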

- The Python standard library does not have a data type for rectangular data
- However, the `pandas` library has become the de facto standard for data manipulation
- `pandas` is built upon (and often used in conjunction with) other computational libraries
  - E.g. `numpy` (array data type), `scipy` (linear algebra) and `scikit-learn` (machine learning)

In [12]:

```
import pandas as pd
```

`pandas` object types

- *Series* - one-dimensional sequence of values
- *DataFrame* - (typically) two-dimensional rectangular table

*Series* is a one-dimensional array-like object

In [13]:

```
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1
```

Out[13]:

0 150.0 1 120.0 2 3000.0 dtype: float64

In [14]:

```
sr1[0] # Indexing is similar to standard Python objects
```

Out[14]:

150.0

In [15]:

```
sr1[sr1 > 200] # But subsetting is also available
```

Out[15]:

2 3000.0 dtype: float64

- Another way to think about Series is as an ordered dictionary

In [16]:

```
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}
```

In [17]:

```
sr2 = pd.Series(d)
sr2
```

Out[17]:

apple 150.0 banana 120.0 watermelon 3000.0 dtype: float64

In [18]:

```
sr2.iloc[0] # Positional access; recall that this would be impossible for a standard dictionary
```

Out[18]:

150.0

In [19]:

```
sr2.index # Sequence of labels is converted into an Index object
```

Out[19]:

Index(['apple', 'banana', 'watermelon'], dtype='object')
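Because a Series carries both labels and positions, it helps to be explicit about which one is meant. A minimal sketch (variable names are illustrative) contrasting label-based and position-based access:

```python
import pandas as pd

sr = pd.Series({'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0})

by_label = sr['banana']                  # dictionary-style access by label
by_position = sr.iloc[0]                 # access by integer position

# Label slices include both endpoints, positional slices exclude the end
label_slice = sr.loc['apple':'banana']   # apple and banana
pos_slice = sr.iloc[0:1]                 # apple only

print(label_slice)
print(pos_slice)
```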

*DataFrame* is a rectangular table of data

In [20]:

```
data = {'fruit': ['apple', 'banana', 'watermelon'],  # DataFrame can be constructed from
        'weight': [150.0, 120.0, 3000.0],            # a dict of equal-length lists/arrays
        'berry': [False, True, True]}
df = pd.DataFrame(data)
df
```

Out[20]:

| | fruit | weight | berry |
|---|---|---|---|
| 0 | apple | 150.0 | False |
| 1 | banana | 120.0 | True |
| 2 | watermelon | 3000.0 | True |

- DataFrame has both row and column indices
  - `DataFrame.loc[]` selects rows and columns by *label*
  - `DataFrame.iloc[]` selects rows and columns by integer *position*

In [21]:

```
df.iloc[0] # First row
```

Out[21]:

fruit apple weight 150.0 berry False Name: 0, dtype: object

In [22]:

```
df.iloc[:,0] # First column
```

Out[22]:

0 apple 1 banana 2 watermelon Name: fruit, dtype: object

| Expression | Selection Operation |
|---|---|
| `df[val]` | Column or sequence of columns, plus convenience shortcuts (e.g. slice) |
| `df.loc[lab_i]` | Row or subset of rows by label |
| `df.loc[:, lab_j]` | Column or subset of columns by label |
| `df.loc[lab_i, lab_j]` | Both rows and columns by label |
| `df.iloc[i]` | Row or subset of rows by integer position |
| `df.iloc[:, j]` | Column or subset of columns by integer position |
| `df.iloc[i, j]` | Both rows and columns by integer position |
| `df.at[lab_i, lab_j]` | Single scalar value by row and column label |
| `df.iat[i, j]` | Single scalar value by row and column integer position |
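The difference between label and position is easiest to see when the row labels are not integers. The sketch below assumes the fruit table from above and uses `set_index()` to turn the `fruit` column into row labels; the name `by_fruit` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Use fruit names as row labels; remaining columns are 'weight' and 'berry'
by_fruit = df.set_index('fruit')

row = by_fruit.loc['banana']                      # whole row by label
cell = by_fruit.loc['banana', 'weight']           # row and column by label
same_cell = by_fruit.iloc[1, 0]                   # same cell by integer position
fast_label = by_fruit.at['watermelon', 'berry']   # scalar access by label
fast_pos = by_fruit.iat[2, 1]                     # scalar access by position

print(row)
print(cell, same_cell, fast_label, fast_pos)
```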

In [23]:

```
df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)
```

Out[23]:

| | fruit | weight | berry |
|---|---|---|---|
| 0 | apple | 150.0 | False |
| 1 | banana | 120.0 | True |

In [24]:

```
df[:2] # Shortcut
```

Out[24]:

| | fruit | weight | berry |
|---|---|---|---|
| 0 | apple | 150.0 | False |
| 1 | banana | 120.0 | True |

In [25]:

```
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'
```

Out[25]:

| | fruit | berry |
|---|---|---|
| 0 | apple | False |
| 1 | banana | True |
| 2 | watermelon | True |

In [26]:

```
df[['fruit', 'berry']] # Shortcut
```

Out[26]:

| | fruit | berry |
|---|---|---|
| 0 | apple | False |
| 1 | banana | True |
| 2 | watermelon | True |

In [27]:

```
df.columns # Retrieve the names of all columns (index object)
```

Out[27]:

Index(['fruit', 'weight', 'berry'], dtype='object')

In [28]:

```
df.columns[0] # This Index object is subsettable
```

Out[28]:

'fruit'

In [29]:

```
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods
```

Out[29]:

array([ True, False, False])

In [30]:

```
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
```

Out[30]:

| | fruit |
|---|---|
| 0 | apple |
| 1 | banana |
| 2 | watermelon |
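A Boolean mask over column names also works with `.loc`, and `DataFrame.filter()` offers a regular-expression shortcut for the same kind of selection. A small sketch, assuming the fruit table from above:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Boolean mask: True for column names ending in 't' ('fruit' and 'weight')
mask = df.columns.str.endswith('t')
by_mask = df.loc[:, mask]

# The same selection expressed as a regular expression on column names
by_regex = df.filter(regex='t$')

print(by_mask)
```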

In [31]:

```
df = df.rename(columns = {'weight': 'weight_g'}) # Columns can be renamed with dictionary mapping
```

In [32]:

```
df
```

Out[32]:

| | fruit | weight_g | berry |
|---|---|---|---|
| 0 | apple | 150.0 | False |
| 1 | banana | 120.0 | True |
| 2 | watermelon | 3000.0 | True |

In [33]:

```
df['weight_oz'] = 0 # Columns can be added or modified by assignment
```

In [34]:

```
df
```

Out[34]:

| | fruit | weight_g | berry | weight_oz |
|---|---|---|---|---|
| 0 | apple | 150.0 | False | 0 |
| 1 | banana | 120.0 | True | 0 |
| 2 | watermelon | 3000.0 | True | 0 |

In [35]:

```
df['weight_oz'] = df['weight_g'] * 0.04 # Rough conversion: 1 g is about 0.035 oz, rounded here to 0.04
```

In [36]:

```
df
```

Out[36]:

| | fruit | weight_g | berry | weight_oz |
|---|---|---|---|---|
| 0 | apple | 150.0 | False | 6.0 |
| 1 | banana | 120.0 | True | 4.8 |
| 2 | watermelon | 3000.0 | True | 120.0 |

In [37]:

```
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries
```

Out[37]:

| | fruit | weight_g | berry | weight_oz |
|---|---|---|---|---|
| 0 | apple | 150.0 | False | 6.0 |

In [38]:

```
df[df['berry'] == False] # The same can be achieved with more concise syntax
```

Out[38]:

| | fruit | weight_g | berry | weight_oz |
|---|---|---|---|---|
| 0 | apple | 150.0 | False | 6.0 |

In [39]:

```
weight200 = df[df['weight_g'] > 200] # Create new dataset with rows where weight is higher than 200 grams
weight200
```

Out[39]:

| | fruit | weight_g | berry | weight_oz |
|---|---|---|---|---|
| 2 | watermelon | 3000.0 | True | 120.0 |
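Row conditions can also be combined. In pandas the element-wise operators are `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses. A minimal sketch on the same fruit data (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight_g': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# '~' negates a Boolean column; an alternative to comparing with False
not_berries = df[~df['berry']]

# Both conditions must hold
heavy_berries = df[(df['weight_g'] > 200) & df['berry']]

# At least one condition must hold
light_or_not_berry = df[(df['weight_g'] < 130) | ~df['berry']]

print(heavy_berries)
```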

- Lambda functions can be used to transform data with the `map()` method

In [40]:

```
df['fruit'].map(lambda x: x.upper())
```

Out[40]:

0 APPLE 1 BANANA 2 WATERMELON Name: fruit, dtype: object

In [41]:

```
transform = lambda x: x.capitalize()
```

In [42]:

```
transformed = df['fruit'].map(transform)
```

In [43]:

```
transformed
```

Out[43]:

0 Apple 1 Banana 2 Watermelon Name: fruit, dtype: object
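`map()` is not limited to lambdas: it also accepts a named function or a dictionary, and for common string operations the `.str` accessor avoids writing a function at all. In the sketch below the helper `shout` and the colour lookup are invented for illustration.

```python
import pandas as pd

fruit = pd.Series(['apple', 'banana', 'watermelon'], name='fruit')

def shout(x):
    """Hypothetical helper: upper-case a string and add an exclamation mark."""
    return x.upper() + '!'

shouted = fruit.map(shout)                      # named function

# Dictionary lookup (made-up colours); values without a key become NaN
colours = fruit.map({'apple': 'red', 'banana': 'yellow'})

upper = fruit.str.upper()                       # vectorised string method

print(shouted)
print(colours)
```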

- The file object in Python provides the main interface to external files
- In contrast to other core types, file objects are created not with a literal, but with a function, `open()`:

```
<variable_name> = open(<filepath>, <mode>)
```

- Modes of file objects allow you to:
  - (`r`)ead a file (default)
  - (`w`)rite an object to a file
  - e(`x`)clusively create, failing if a file exists
  - (`a`)ppend to a file
- You can use `r+` mode if you need to both read from and write to a file
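The modes above can be tried out safely in a temporary directory. This is only a sketch: the file name `modes.txt` and the variable names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes.txt')

    # 'x' creates a new file and fails if it already exists
    with open(path, 'x') as f:
        f.write('first line\n')
    try:
        with open(path, 'x') as f:
            pass
        created_twice = True
    except FileExistsError:
        created_twice = False

    # 'a' appends to the end of the existing file
    with open(path, 'a') as f:
        f.write('second line\n')

    # 'r+' allows reading and then writing through the same file object
    with open(path, 'r+') as f:
        before = f.read()
        f.write('third line\n')

    # 'r' is the default mode
    with open(path) as f:
        lines = f.read().splitlines()

print(lines)
```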

In [44]:

```
f = open('../temp/test.txt', 'w') # Create a new file object in write mode
```

In [45]:

```
f.write('This is a test file.') # Write a string of characters to it
```

Out[45]:

20

In [46]:

```
f.close() # Flush output buffers to disk and close the connection
```

- To avoid keeping track of open file connections, the `with` statement can be used

In [47]:

```
with open('../temp/test.txt', 'r') as f: # Note that we use 'r' mode for reading
    text = f.read()
```

In [48]:

```
text
```

Out[48]:

'This is a test file.'
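What `with` does can be seen by checking the `closed` attribute and comparing with a manual `try`/`finally`. The sketch below writes to a temporary directory rather than `../temp`, purely so that it runs anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    # The file is closed automatically when the 'with' block ends
    with open(path, 'w') as f:
        f.write('This is a test file.')
    closed_after_with = f.closed

    # Roughly equivalent manual version: close even if reading fails
    f = open(path)
    try:
        text = f.read()
    finally:
        f.close()

print(closed_after_with, text)
```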

`pandas`

- `pandas` provides high-level methods that take care of file connections
- These methods all follow the same `read_<format>` and `to_<format>` name patterns
- CSV (comma-separated value) files are the standard of interoperability

```
<variable_name> = pd.read_<format>(<filepath>)
```

```
<variable_name>.to_<format>(<filepath>)
```
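As a concrete instance of the two patterns, here is a minimal CSV round trip. The file name `fruit.csv` and the temporary directory are assumptions made so the sketch is self-contained.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fruit.csv')
    df.to_csv(path, index=False)   # <variable_name>.to_<format>(<filepath>)
    restored = pd.read_csv(path)   # <variable_name> = pd.read_<format>(<filepath>)

print(restored)
```

No explicit `open()` or `close()` is needed: `pandas` opens and closes the file itself.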

`pandas` example

- We will use the data from the Kaggle 2020 Machine Learning and Data Science Survey
- For more information you can read the executive summary
- Or explore the winning Python Jupyter Notebooks

In [49]:

```
# We specify that we want to combine first two rows as a header
kaggle2020 = pd.read_csv('../data/kaggle_survey_2020_responses.csv', header = [0,1])
```

In [50]:

```
kaggle2020.head() # Returns the top n (n=5 default) rows
```

Out[50]:

Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q35_B_Part_2 | Q35_B_Part_3 | Q35_B_Part_4 | Q35_B_Part_5 | Q35_B_Part_6 | Q35_B_Part_7 | Q35_B_Part_8 | Q35_B_Part_9 | Q35_B_Part_10 | Q35_B_OTHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Weights & Biases | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? 
(Select all that apply) - Selected Choice - Trains | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other | |

0 | 1838 | 35-39 | Man | Colombia | Doctoral degree | Student | 5-10 years | Python | R | SQL | ... | NaN | NaN | NaN | TensorBoard | NaN | NaN | NaN | NaN | NaN | NaN |

1 | 289287 | 30-34 | Man | United States of America | Master’s degree | Data Engineer | 5-10 years | Python | R | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 | 860 | 35-39 | Man | Argentina | Bachelor’s degree | Software Engineer | 10-20 years | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

3 | 507 | 30-34 | Man | United States of America | Master’s degree | Data Scientist | 5-10 years | Python | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

4 | 78 | 30-34 | Man | Japan | Master’s degree | Software Engineer | 3-5 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 355 columns

In [51]:

```
kaggle2020.tail() # Returns the bottom n (n=5 default) rows
```

Out[51]:

Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q35_B_Part_2 | Q35_B_Part_3 | Q35_B_Part_4 | Q35_B_Part_5 | Q35_B_Part_6 | Q35_B_Part_7 | Q35_B_Part_8 | Q35_B_Part_9 | Q35_B_Part_10 | Q35_B_OTHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Weights & Biases | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? 
(Select all that apply) - Selected Choice - Trains | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other | |

20031 | 126 | 18-21 | Man | Turkey | Some college/university study without earning ... | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

20032 | 566 | 55-59 | Woman | United Kingdom of Great Britain and Northern I... | Master’s degree | Currently not employed | 20+ years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | NaN |

20033 | 238 | 30-34 | Man | Brazil | Master’s degree | Research Scientist | < 1 years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

20034 | 625 | 22-24 | Man | India | Bachelor’s degree | Software Engineer | 3-5 years | Python | NaN | SQL | ... | Weights & Biases | NaN | NaN | TensorBoard | NaN | NaN | Trains | NaN | NaN | NaN |

20035 | 1031 | 22-24 | Man | Pakistan | Master’s degree | Machine Learning Engineer | < 1 years | Python | NaN | NaN | ... | Weights & Biases | NaN | NaN | NaN | NaN | NaN | Trains | NaN | NaN | NaN |

5 rows × 355 columns

- Note that when writing data out we start with the name of the object storing the dataset
  - I.e. `df.to_csv(path)` as opposed to `df = pd.read_csv(path)`
- Pandas can also write out into other data formats
  - E.g. `df.to_excel(path)`, `df.to_stata(path)`

In [52]:

```
kaggle2020.to_csv('../temp/kaggle2020.csv')
```
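One detail worth knowing: by default `to_csv()` also writes the row index as an unnamed first column. The sketch below uses a small made-up table and an in-memory buffer instead of a file to show the effect and two ways of handling it. For the survey data, `header=[0, 1]` would presumably be needed again when reading the file back.

```python
import io

import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana'],
                   'weight': [150.0, 120.0]})

buffer = io.StringIO()
df.to_csv(buffer)                  # row index is written as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)        # index comes back as an extra column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # treat first column as the index

no_index = df.to_csv(index=False)  # alternatively, skip the index when writing

print(list(naive.columns))
print(list(restored.columns))
```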

- DataFrame methods in pandas can automatically handle (exclude) missing data (`NaN`)
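A minimal sketch of this behaviour on a small made-up Series, including the `skipna` argument that switches the exclusion off:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

n_valid = s.count()                  # counts only non-missing values
n_missing = s.isna().sum()           # number of missing values
mean_default = s.mean()              # NaN is excluded by default
mean_strict = s.mean(skipna=False)   # NaN propagates when not skipped

print(n_valid, n_missing, mean_default, mean_strict)
```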

In [53]:

```
kaggle2020.describe() # DataFrame.describe() provides a range of summary statistics
```

Out[53]:

Time from Start to Finish (seconds) | |
---|---|

Duration (in seconds) | |

count | 2.003600e+04 |

mean | 9.155865e+03 |

std | 6.136760e+04 |

min | 2.000000e+01 |

25% | 3.980000e+02 |

50% | 6.260000e+02 |

75% | 1.030250e+03 |

max | 1.144493e+06 |

In [54]:

```
kaggle2020.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods
```

Out[54]:

9155.864843282092

In [55]:

```
kaggle2020.iloc[:,0].median() # Median
```

Out[55]:

626.0

In [56]:

```
kaggle2020.iloc[:,0].std() # Standard deviation
```

Out[56]:

61367.59967471586

In [57]:

```
import statistics ## We don't have to rely only on methods provided by `pandas`
statistics.stdev(kaggle2020.iloc[:,0])
```

Out[57]:

61367.599674715864

In [58]:

```
kaggle2020.describe(include = 'all') # Adding include = 'all' tells pandas to summarize all variables
```

Out[58]:

Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q35_B_Part_2 | Q35_B_Part_3 | Q35_B_Part_4 | Q35_B_Part_5 | Q35_B_Part_6 | Q35_B_Part_7 | Q35_B_Part_8 | Q35_B_Part_9 | Q35_B_Part_10 | Q35_B_OTHER | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL | ... | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Weights & Biases | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Comet.ml | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Sacred + Omniboard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - TensorBoard | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Guild.ai | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Polyaxon | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? 
(Select all that apply) - Selected Choice - Trains | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Domino Model Monitor | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - None | In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? (Select all that apply) - Selected Choice - Other | |

count | 2.003600e+04 | 20036 | 20036 | 20036 | 19569 | 19277 | 19120 | 15530 | 4277 | 7535 | ... | 1177 | 494 | 430 | 3199 | 557 | 480 | 846 | 519 | 3082 | 251 |

unique | NaN | 11 | 5 | 55 | 7 | 13 | 7 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

top | NaN | 25-29 | Man | India | Master’s degree | Student | 3-5 years | Python | R | SQL | ... | Weights & Biases | Comet.ml | Sacred + Omniboard | TensorBoard | Guild.ai | Polyaxon | Trains | Domino Model Monitor | None | Other |

freq | NaN | 4011 | 15789 | 5851 | 7859 | 5171 | 4546 | 15530 | 4277 | 7535 | ... | 1177 | 494 | 430 | 3199 | 557 | 480 | 846 | 519 | 3082 | 251 |

mean | 9.155865e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

std | 6.136760e+04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

min | 2.000000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

25% | 3.980000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

50% | 6.260000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

75% | 1.030250e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

max | 1.144493e+06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

11 rows × 355 columns

In [59]:

```
kaggle2020.iloc[:,2].mode() # Mode, most frequent value
```

Out[59]:

0 Man Name: (Q2, What is your gender? - Selected Choice), dtype: object

In [60]:

```
kaggle2020.iloc[:,2].value_counts() # Counts of unique values
```

Out[60]:

Man 15789 Woman 3878 Prefer not to say 263 Prefer to self-describe 54 Nonbinary 52 Name: (Q2, What is your gender? - Selected Choice), dtype: int64

In [61]:

```
kaggle2020.iloc[:,2].value_counts(normalize = True) # We can further normalize them by the number of rows
```

Out[61]:

Man 0.788032 Woman 0.193552 Prefer not to say 0.013126 Prefer to self-describe 0.002695 Nonbinary 0.002595 Name: (Q2, What is your gender? - Selected Choice), dtype: float64

| Method | Numeric | Categorical | Description |
|---|---|---|---|
| `count` | yes | yes | Number of non-NA observations |
| `value_counts` | yes | yes | Number of unique observations by value |
| `describe` | yes | yes | Set of summary statistics for Series/DataFrame |
| `min`, `max` | yes | yes (caution) | Minimum and maximum values |
| `quantile` | yes | no | Sample quantile ranging from 0 to 1 |
| `sum` | yes | yes (caution) | Sum of values |
| `prod` | yes | no | Product of values |
| `mean` | yes | no | Mean |
| `median` | yes | no | Median (50% quantile) |
| `var` | yes | no | Sample variance |
| `std` | yes | no | Sample standard deviation |
| `skew` | yes | no | Sample skewness (third moment) |
| `kurt` | yes | no | Sample kurtosis (fourth moment) |
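A few of the methods from the table, applied to two small made-up Series. The categorical example illustrates the "caution" entries: `min` on strings compares them alphabetically rather than numerically.

```python
import pandas as pd

num = pd.Series([1, 2, 3, 4, 10])
cat = pd.Series(['pear', 'apple', 'pear'])

median = num.quantile(0.5)     # 50% quantile, same as num.median()
variance = num.var()           # sample variance (divides by n - 1)
product = num.prod()           # product of all values

counts = cat.value_counts()    # frequency of each unique value
first_alpha = cat.min()        # alphabetical minimum for strings

print(median, variance, product)
print(counts)
print(first_alpha)
```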

- Data Analysis and Communicating Results