Core Python provides only bare-bones functionality for numeric analysis.
# Representing a 3x3 matrix with a list of lists
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
# Subsetting 2nd row, 3rd element
mat[1][2]
6
# Naturally, this representation
# breaks down rather quickly
mat * 2
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Using 'as' lets us avoid typing the full module name
# each time the module is referred to
import numpy as np
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
arr[1][2]
6
arr * 2
array([[ 2, 4, 6], [ 8, 10, 12], [14, 16, 18]])
# Object type
type(arr)
numpy.ndarray
# Array dimensionality
arr.ndim
2
# Array size
arr.shape
(3, 3)
# Calculating summary statistics on array
# axis indicates the dimension to aggregate over
# note that every list within a list is a row,
# so axis = 0 averages down the rows (one mean per column)
arr.mean(axis = 0)
array([4., 5., 6.])
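To make the role of `axis` concrete, here is a minimal sketch that computes the mean along both dimensions of the same 3x3 array. The variable names `col_means` and `row_means` are illustrative and not part of the original notebook.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# axis=0 collapses the rows: one mean per column
col_means = arr.mean(axis=0)
# axis=1 collapses the columns: one mean per row
row_means = arr.mean(axis=1)

print(col_means)  # [4. 5. 6.]
print(row_means)  # [2. 5. 8.]
```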
Source: R for Data Science
- pandas library has become the de facto standard for data manipulation
- pandas is built upon (and often used in conjunction with) other computational libraries such as numpy (array data type), scipy (linear algebra) and scikit-learn (machine learning)
import pandas as pd
pandas object types
sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1
0     150.0
1     120.0
2    3000.0
dtype: float64
sr1[0] # Slicing is similar to standard Python objects
150.0
sr1[sr1 > 200] # But subsetting is also available
2    3000.0
dtype: float64
d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}
sr2 = pd.Series(d)
sr2
apple          150.0
banana         120.0
watermelon    3000.0
dtype: float64
sr2.iloc[0] # Positional access, impossible for a standard dictionary (plain sr2[0] on a labelled Series is deprecated in recent pandas)
150.0
sr2.index # Sequence of labels is converted into an Index object
Index(['apple', 'banana', 'watermelon'], dtype='object')
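As a small illustration of the two ways to address a labelled Series, the sketch below uses the explicit `.loc` (label) and `.iloc` (position) indexers. The variable names are illustrative only; the data repeat the fruit weights used above.

```python
import pandas as pd

sr = pd.Series({'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0})

# Label-based lookup
by_label = sr.loc['banana']
# Position-based lookup of the same element
by_position = sr.iloc[1]

# Label slices include the end label
label_slice = sr.loc['apple':'banana']
# Positional slices exclude the end position
pos_slice = sr.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(pos_slice))
```

Being explicit about `.loc` versus `.iloc` avoids ambiguity when the index itself consists of integers.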
# DataFrame can be constructed from
# a dict of equal-length lists/arrays
data = {'fruit': ['apple', 'banana', 'watermelon'],
        'weight': [150.0, 120.0, 3000.0],
        'berry': [False, True, True]}
df = pd.DataFrame(data)
df
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
2 | watermelon | 3000.0 | True |
- DataFrame.loc provides label-based selection
- DataFrame.iloc provides integer position-based selection
df.iloc[0] # First row
fruit     apple
weight    150.0
berry     False
Name: 0, dtype: object
df.iloc[:,0] # First column
0         apple
1        banana
2    watermelon
Name: fruit, dtype: object
Expression | Selection Operation |
---|---|
df[val] | Column or sequence of columns + convenience (e.g. slice) |
df.loc[lab_i] | Row or subset of rows by label |
df.loc[:, lab_j] | Column or subset of columns by label |
df.loc[lab_i, lab_j] | Both rows and columns by label |
df.iloc[i] | Row or subset of rows by integer position |
df.iloc[:, j] | Column or subset of columns by integer position |
df.iloc[i, j] | Both rows and columns by integer position |
df.at[lab_i, lab_j] | Single scalar value by row and column label |
df.iat[i, j] | Single scalar value by row and column integer position |
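The selection expressions in the table can be tried out on a small frame. This is a sketch only: the row labels `'a'`, `'b'`, `'c'` are hypothetical and were added here so that label-based and position-based selection are visibly different.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]},
                  index=['a', 'b', 'c'])  # hypothetical row labels

row_by_label = df.loc['b']               # one row, selected by label
cell_by_label = df.loc['b', 'weight']    # row and column by label
rows_by_pos = df.iloc[:2]                # first two rows by position
cell_by_pos = df.iloc[2, 0]              # row and column by position
scalar_label = df.at['c', 'weight']      # single scalar by labels
scalar_pos = df.iat[0, 2]                # single scalar by positions

print(cell_by_label, cell_by_pos, scalar_label, scalar_pos)
```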
df.iloc[:2] # Select the first two rows (with convenience shortcut for slicing)
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df[:2] # Shortcut
| | fruit | weight | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'
| | fruit | berry |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df[['fruit', 'berry']] # Shortcut
| | fruit | berry |
---|---|---|
0 | apple | False |
1 | banana | True |
2 | watermelon | True |
df.columns # Retrieve the names of all columns (index object)
Index(['fruit', 'weight', 'berry'], dtype='object')
df.columns[0] # This Index object is subsettable
'fruit'
df.columns.str.startswith('fr') # As column names are strings, we can apply str methods
array([ True, False, False])
df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
| | fruit |
---|---|
0 | apple |
1 | banana |
2 | watermelon |
df = df.rename(columns = {'weight': 'weight_g'}) # Columns can be renamed with dictionary mapping
df
| | fruit | weight_g | berry |
---|---|---|---|
0 | apple | 150.0 | False |
1 | banana | 120.0 | True |
2 | watermelon | 3000.0 | True |
df['weight_oz'] = 0 # Columns can be added or modified by assignment
df
| | fruit | weight_g | berry | weight_oz |
---|---|---|---|---|
0 | apple | 150.0 | False | 0 |
1 | banana | 120.0 | True | 0 |
2 | watermelon | 3000.0 | True | 0 |
df['weight_oz'] = df['weight_g'] * 0.04 # Rough grams-to-ounces conversion (1 g is about 0.035 oz)
df
| | fruit | weight_g | berry | weight_oz |
---|---|---|---|---|
0 | apple | 150.0 | False | 6.0 |
1 | banana | 120.0 | True | 4.8 |
2 | watermelon | 3000.0 | True | 120.0 |
df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries
| | fruit | weight_g | berry | weight_oz |
---|---|---|---|---|
0 | apple | 150.0 | False | 6.0 |
df[df['berry'] == False] # The same can be achieved with more concise syntax
| | fruit | weight_g | berry | weight_oz |
---|---|---|---|---|
0 | apple | 150.0 | False | 6.0 |
weight200 = df[df['weight_g'] > 200] # Create new dataset with rows where weight is higher than 200 grams
weight200
| | fruit | weight_g | berry | weight_oz |
---|---|---|---|---|
2 | watermelon | 3000.0 | True | 120.0 |
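Row filters like the ones above can be combined. The following sketch assumes the same fruit data and shows negation with `~` and element-wise `&` / `|`; the result names are illustrative. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `<`.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight_g': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# ~ negates a boolean Series, an alternative to `== False`
not_berries = df[~df['berry']]

# & requires both conditions to hold for a row
heavy_berries = df[df['berry'] & (df['weight_g'] > 200)]

# | keeps rows where at least one condition holds
light_or_not_berry = df[(df['weight_g'] < 130) | ~df['berry']]

print(not_berries['fruit'].tolist())
print(heavy_berries['fruit'].tolist())
print(light_or_not_berry['fruit'].tolist())
```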
map() method
df['fruit'].map(lambda x: x.upper())
0         APPLE
1        BANANA
2    WATERMELON
Name: fruit, dtype: object
transform = lambda x: x.capitalize()
transformed = df['fruit'].map(transform)
transformed
0         Apple
1        Banana
2    Watermelon
Name: fruit, dtype: object
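Besides lambdas, `Series.map()` also accepts an existing function or a dictionary. The sketch below illustrates both; the `colours` mapping is a hypothetical example, and values missing from the dictionary come back as `NaN`.

```python
import pandas as pd

fruit = pd.Series(['apple', 'banana', 'watermelon'], name='fruit')

# Mapping with a built-in function instead of a lambda
lengths = fruit.map(len)

# Mapping with a dictionary (hypothetical lookup table);
# 'watermelon' has no entry, so it is mapped to NaN
colours = {'apple': 'red', 'banana': 'yellow'}
mapped = fruit.map(colours)

print(lengths.tolist())
print(mapped.tolist())
```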
open():
<variable_name> = open(<filepath>, <mode>)
- r: (r)ead a file (default)
- w: (w)rite an object to a file
- x: e(x)clusively create, failing if a file exists
- a: (a)ppend to a file
- r+: mode if you need to read from and write to a file
# Create a new file object in write mode
f = open('../temp/test.txt', 'w')
# Write a string of characters to it
f.write('This is a test file.')
20
# Flush output buffers to disk and close the connection
f.close()
The with statement can be used to close the file automatically
# Note that we use 'r' mode for reading
with open('../temp/test.txt', 'r') as f:
    text = f.read()
text
'This is a test file.'
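The file modes can be exercised together in one short sketch. It writes into a temporary directory created with `tempfile` rather than the `../temp/` path used above, so the file name and location here are assumptions made for the sake of a self-contained example.

```python
import os
import tempfile

# Work in a throwaway directory (an assumption for this sketch)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'test.txt')

# 'w' creates the file (or truncates an existing one)
with open(path, 'w') as f:
    f.write('This is a test file.\n')

# 'a' appends to the end without touching existing content
with open(path, 'a') as f:
    f.write('A second line.\n')

# 'r' reads; the file is closed automatically on leaving the block
with open(path, 'r') as f:
    lines = f.readlines()

# 'x' refuses to overwrite a file that already exists
try:
    with open(path, 'x') as g:
        created = True
except FileExistsError:
    created = False

print(lines)
print(f.closed, created)
```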
pandas
- pandas provides high-level methods that take care of file connections
- These methods follow the read_<format> and to_<format> name patterns
<variable_name> = pd.read_<format>(<filepath>)
<variable_name>.to_<format>(<filepath>)
pandas example
# We specify that we want to combine first two rows as a header
kaggle2022 = pd.read_csv(
    '../data/kaggle_survey_2022_responses.csv',
    header = [0,1]
)
/tmp/ipykernel_776815/2279990677.py:2: DtypeWarning: Columns (208,225,255,257,260,270,271,277) have mixed types. Specify dtype option on import or set low_memory=False. kaggle2022 = pd.read_csv(
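To see what `header = [0,1]` does without the full survey file, here is a minimal sketch that reads a tiny two-header CSV from an in-memory string. The column codes and question texts are made up for illustration and are not taken from the Kaggle data.

```python
import io
import pandas as pd

# Hypothetical miniature of a survey file: the first row holds
# question codes, the second row holds the question text
raw = io.StringIO(
    "Q1,Q2\n"
    "Age group,Country\n"
    "30-34,India\n"
    "18-21,Egypt\n"
)

survey = pd.read_csv(raw, header=[0, 1])

# Columns form a two-level MultiIndex of (code, text) pairs
print(list(survey.columns))

# A column can be selected with the full tuple
country = survey[('Q2', 'Country')]

# Dropping the second level leaves only the short codes
short = survey.droplevel(1, axis=1)
print(list(short.columns))
```

Keeping both levels preserves the full question text, while dropping one gives names that are easier to type.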
kaggle2022.head() # Returns the top n (n=5 default) rows
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | Are you currently a student? (high school, university, or graduate) | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai | ... | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other |
0 | 121 | 30-34 | Man | India | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 462 | 30-34 | Man | Algeria | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 293 | 18-21 | Man | Egypt | Yes | Coursera | edX | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | NaN | NaN | NaN | NaN |
3 | 851 | 55-59 | Man | France | No | Coursera | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | Course Forums (forums.fast.ai, Coursera forums... | NaN | NaN | Blogs (Towards Data Science, Analytics Vidhya,... | NaN | NaN | NaN | NaN |
4 | 232 | 45-49 | Man | India | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Blogs (Towards Data Science, Analytics Vidhya,... | NaN | NaN | NaN | NaN |
5 rows × 296 columns
kaggle2022.tail() # Returns the bottom n (n=5 default) rows
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | Are you currently a student? (high school, university, or graduate) | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai | ... | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other |
23992 | 331 | 22-24 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | Journal Publications (peer-reviewed journals, ... | NaN | NaN | NaN |
23993 | 330 | 60-69 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
23994 | 860 | 25-29 | Man | Turkey | No | NaN | NaN | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
23995 | 597 | 35-39 | Woman | Israel | No | NaN | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
23996 | 303 | 18-21 | Man | India | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Other |
5 rows × 296 columns
- df.to_csv(path) writes to a file, as opposed to df = pd.read_csv(path), which reads from one
- Similarly, df.to_excel(path), df.to_stata(path)
kaggle2022.to_csv('../temp/kaggle2022.csv')
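A round trip through `to_csv()` and `read_csv()` shows one detail worth knowing: by default `to_csv()` also writes the row index. The sketch below uses an in-memory buffer and a small made-up frame instead of the survey data, so no files are touched.

```python
import io
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana'],
                   'weight_g': [150.0, 120.0]})

# Default behaviour: the index is written as an unnamed first column
with_index_buf = io.StringIO()
df.to_csv(with_index_buf)
with_index_buf.seek(0)
with_index = pd.read_csv(with_index_buf)
print(list(with_index.columns))  # ['Unnamed: 0', 'fruit', 'weight_g']

# index=False writes only the data columns
no_index_buf = io.StringIO()
df.to_csv(no_index_buf, index=False)
no_index_buf.seek(0)
restored = pd.read_csv(no_index_buf)
print(list(restored.columns))    # ['fruit', 'weight_g']
```

Whether to keep the index depends on whether it carries information; for a default integer index, `index=False` usually gives a cleaner file.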
kaggle2022.describe() # DataFrame.describe() provides a range of summary statistics (NaN values are excluded)
| | Duration (in seconds) |
---|---|
| | Duration (in seconds) |
count | 2.399700e+04 |
mean | 1.009010e+04 |
std | 1.115403e+05 |
min | 1.200000e+02 |
25% | 2.640000e+02 |
50% | 4.140000e+02 |
75% | 7.150000e+02 |
max | 2.533678e+06 |
kaggle2022.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods
10090.095845313997
kaggle2022.iloc[:,0].median() # Median
414.0
kaggle2022.iloc[:,0].std() # Standard deviation
111540.30746801202
import statistics ## We don't have to rely only on methods provided by `pandas`
statistics.stdev(kaggle2022.iloc[:,0])
111540.30746801202
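The two results agree because both `Series.std()` and `statistics.stdev()` compute the sample standard deviation (dividing by n - 1). NumPy's `np.std()` divides by n by default, so it can give a different number. A minimal sketch on made-up values:

```python
import statistics
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample standard deviation (ddof=1 is the pandas default)
sample_sd = values.std()
stdlib_sample_sd = statistics.stdev(values.tolist())

# Population standard deviation (ddof=0 is the NumPy default)
population_sd = values.std(ddof=0)
numpy_sd = np.std(values.to_numpy())
stdlib_population_sd = statistics.pstdev(values.tolist())

print(sample_sd, stdlib_sample_sd)
print(population_sd, numpy_sd, stdlib_population_sd)
```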
kaggle2022.describe(include = 'all') # Adding include = 'all' tells pandas to summarize all variables
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | Are you currently a student? (high school, university, or graduate) | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp | On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai | ... | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc) | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None | Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other |
count | 2.399700e+04 | 23997 | 23997 | 23997 | 23997 | 9699 | 2474 | 6628 | 3718 | 944 | ... | 2678 | 11181 | 4006 | 11957 | 2120 | 7766 | 3804 | 1726 | 1268 | 835 |
unique | NaN | 11 | 5 | 58 | 2 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
top | NaN | 18-21 | Man | India | No | Coursera | edX | Kaggle Learn Courses | DataCamp | Fast.ai | ... | Reddit (r/machinelearning, etc) | Kaggle (notebooks, forums, etc) | Course Forums (forums.fast.ai, Coursera forums... | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | Blogs (Towards Data Science, Analytics Vidhya,... | Journal Publications (peer-reviewed journals, ... | Slack Communities (ods.ai, kagglenoobs, etc) | None | Other |
freq | NaN | 4559 | 18266 | 8792 | 12036 | 9699 | 2474 | 6628 | 3718 | 944 | ... | 2678 | 11181 | 4006 | 11957 | 2120 | 7766 | 3804 | 1726 | 1268 | 835 |
mean | 1.009010e+04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1.115403e+05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 1.200000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 2.640000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 4.140000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 7.150000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 2.533678e+06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 rows × 296 columns
kaggle2022.iloc[:,2].mode() # Mode, most frequent value
0    Man
Name: (Q3, What is your gender? - Selected Choice), dtype: object
kaggle2022.iloc[:,2].value_counts() # Counts of unique values
Man                        18266
Woman                       5286
Prefer not to say            334
Nonbinary                     78
Prefer to self-describe       33
Name: (Q3, What is your gender? - Selected Choice), dtype: int64
kaggle2022.iloc[:,2].value_counts(normalize = True) # We can further normalize them by the number of rows
Man                        0.761178
Woman                      0.220278
Prefer not to say          0.013918
Nonbinary                  0.003250
Prefer to self-describe    0.001375
Name: (Q3, What is your gender? - Selected Choice), dtype: float64
Method | Numeric | Categorical | Description |
---|---|---|---|
count | yes | yes | Number of non-NA observations |
value_counts | yes | yes | Number of unique observations by value |
describe | yes | yes | Set of summary statistics for Series/DataFrame |
min, max | yes | yes (caution) | Minimum and maximum values |
quantile | yes | no | Sample quantile ranging from 0 to 1 |
sum | yes | yes (caution) | Sum of values |
prod | yes | no | Product of values |
mean | yes | no | Mean |
median | yes | no | Median (50% quantile) |
var | yes | no | Sample variance |
std | yes | no | Sample standard deviation |
skew | yes | no | Sample skewness (third moment) |
kurt | yes | no | Sample kurtosis (fourth moment) |
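A few of the methods from the table can be tried on toy data. The sketch below uses made-up Series (one numeric with a missing value, one of strings) to illustrate that missing values are skipped and why `min` needs caution on categorical data: for strings it simply returns the alphabetically first value.

```python
import pandas as pd

weights = pd.Series([150.0, 120.0, 3000.0, None])  # one missing value
fruit = pd.Series(['apple', 'banana', 'apple'])

n_obs = weights.count()          # missing value is not counted
avg = weights.mean()             # missing value is skipped
mid = weights.median()
q50 = weights.quantile(0.5)      # same as the median

counts = fruit.value_counts()    # observations per unique value
first_alpha = fruit.min()        # alphabetical minimum for strings

print(n_obs, avg, mid, q50)
print(counts['apple'], counts['banana'], first_alpha)
```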