# Representing a 3x3 matrix as a list of lists
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]


# Subsetting 2nd row, 3rd element
mat[1][2]

6


# This representation breaks down rather quickly:
# multiplying a list repeats it instead of scaling its elements
mat * 2

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
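
Multiplying a list by an integer repeats it rather than scaling its elements. A minimal pure-Python sketch of element-wise doubling with a nested list comprehension (the name `doubled` is introduced here only for illustration):

```python
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Visit every row, then every element within that row
doubled = [[2 * x for x in row] for row in mat]
print(doubled)
```

This works, but the loop has to be rewritten for every operation, which is the gap NumPy arrays fill.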


# Using 'as' creates a short alias, so we avoid typing
# the full module name each time it is referred to
import numpy as np


arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])


arr[1][2]

6
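
Besides chained brackets, NumPy arrays accept a comma-separated index with one entry per dimension, and each entry can be a slice. A short sketch (the variable names are illustrative only):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

elem = arr[1, 2]        # same element as arr[1][2], in a single indexing step
second_row = arr[1, :]  # all columns of the 2nd row
third_col = arr[:, 2]   # all rows of the 3rd column
top_left = arr[:2, :2]  # 2x2 block from the upper-left corner

print(elem, second_row, third_col, top_left, sep='\n')
```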


arr * 2

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])


# Object type
type(arr)

numpy.ndarray


# Array dimensionality
arr.ndim

2


# Array size
arr.shape

(3, 3)


# Calculating summary statistics on array
# axis indicates the dimension that is collapsed
# note that every list within a list is a row;
# axis = 0 averages down the rows, giving one mean per column
arr.mean(axis = 0)

array([4., 5., 6.])
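
To make the role of `axis` concrete, the sketch below compares the two axes of the same 3x3 array with the overall mean (variable names are illustrative only):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

col_means = arr.mean(axis=0)  # collapse rows: one mean per column -> [4., 5., 6.]
row_means = arr.mean(axis=1)  # collapse columns: one mean per row -> [2., 5., 8.]
overall = arr.mean()          # no axis: mean of all nine elements -> 5.0

print(col_means, row_means, overall)
```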


import pandas as pd


sr1 = pd.Series([150.0, 120.0, 3000.0])
sr1

0     150.0
1     120.0
2    3000.0
dtype: float64


sr1[0] # Indexing is similar to standard Python objects

150.0


sr1[sr1 > 200] # But conditional (Boolean) subsetting is also available

2    3000.0
dtype: float64


d = {'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0}


sr2 = pd.Series(d)
sr2

apple          150.0
banana         120.0
watermelon    3000.0
dtype: float64


sr2.iloc[0] # Positional indexing like this is impossible for a standard dictionary (plain sr2[0] is deprecated in recent pandas when the index holds labels)

150.0


sr2.index # Sequence of labels is converted into an Index object

Index(['apple', 'banana', 'watermelon'], dtype='object')
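
Because a Series carries both labels and positions, it helps to say explicitly which one is meant. A minimal sketch using `.loc` (labels) and `.iloc` (integer positions); the variable names are illustrative only:

```python
import pandas as pd

sr2 = pd.Series({'apple': 150.0, 'banana': 120.0, 'watermelon': 3000.0})

by_label = sr2.loc['banana']             # look up by index label
by_position = sr2.iloc[1]                # look up by integer position
label_slice = sr2.loc['apple':'banana']  # label slices include the end label
pos_slice = sr2.iloc[0:1]                # position slices exclude the end

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```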


# DataFrame can be constructed from
# a dict of equal-length lists/arrays
data = {'fruit': ['apple', 'banana', 'watermelon'], 
        'weight': [150.0, 120.0, 3000.0],
        'berry': [False, True, True]}           
df = pd.DataFrame(data)
df


df.iloc[0] # First row

fruit     apple
weight    150.0
berry     False
Name: 0, dtype: object


df.iloc[:,0] # First column

0         apple
1        banana
2    watermelon
Name: fruit, dtype: object


df.iloc[:2] # Select the first two rows by integer position


df[:2]  # Shortcut: a slice inside [] selects rows


df.loc[:, ['fruit', 'berry']] # Select the columns 'fruit' and 'berry'


df[['fruit', 'berry']] # Shortcut


df.columns # Retrieve the names of all columns (an Index object)

Index(['fruit', 'weight', 'berry'], dtype='object')


df.columns[0] # This Index object is subsettable

'fruit'


df.columns.str.startswith('fr') # As column names are strings, we can apply str methods

array([ True, False, False])


df.iloc[:,df.columns.str.startswith('fr')] # This is helpful with more complicated column selection criteria
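
The same column selection can be written in other ways. The sketch below shows two alternatives that should give the same result on this toy data: a Boolean mask passed to `.loc`, and `DataFrame.filter` with a regular expression (variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Boolean mask over column names, used with label-based indexing
by_mask = df.loc[:, df.columns.str.startswith('fr')]

# Regular expression matched against column names
by_regex = df.filter(regex='^fr')

print(by_mask.columns.tolist(), by_regex.columns.tolist())
```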


df = df.rename(columns = {'weight': 'weight_g'}) # Columns can be renamed with a dictionary mapping

df


df['weight_oz'] = 0  # Columns can be added or modified by assignment

df


df['weight_oz'] = df['weight_g'] * 0.04 # Rough conversion (1 g is about 0.035 oz)

df
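
Assignment with `df[...] = ...` modifies the DataFrame in place. As an alternative sketch, `DataFrame.assign` returns a new DataFrame and leaves the original untouched; the column name `weight_kg` is a hypothetical addition used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight_g': [150.0, 120.0, 3000.0]})

# assign() returns a copy with the extra column; df itself is unchanged
df_kg = df.assign(weight_kg=df['weight_g'] / 1000)

print(df.columns.tolist())
print(df_kg)
```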


df[df.loc[:,'berry'] == False] # Select rows where fruits are not berries


df[df['berry'] == False] # The same can be achieved with more concise syntax


weight200 = df[df['weight_g'] > 200] # Create a new DataFrame with rows where weight is higher than 200 grams
weight200
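
Filters can be combined. A minimal sketch, assuming the same toy data: `&` means "and", `~` negates a Boolean column, and each comparison needs its own parentheses because these operators bind more tightly than `>` (variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight_g': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

# Rows that are berries AND heavier than 200 grams
heavy_berries = df[(df['weight_g'] > 200) & df['berry']]

# Rows that are NOT berries (equivalent to df['berry'] == False)
not_berries = df[~df['berry']]

print(heavy_berries)
print(not_berries)
```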


df['fruit'].map(lambda x: x.upper())

0         APPLE
1        BANANA
2    WATERMELON
Name: fruit, dtype: object


transform = lambda x: x.capitalize()


transformed = df['fruit'].map(transform)


transformed

0         Apple
1        Banana
2    Watermelon
Name: fruit, dtype: object
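
For common string operations, the `.str` accessor offers vectorized methods that avoid writing a lambda. A short sketch comparing both approaches on the same column (variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon']})

upper_map = df['fruit'].map(lambda x: x.upper())  # element-wise function
upper_str = df['fruit'].str.upper()               # built-in string method

print(upper_str)
```

`map` stays useful when the transformation is not available as a `.str` method.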


# Create a new file object in write mode
f = open('../temp/test.txt', 'w')


# Write a string of characters to it
f.write('This is a test file.')

20


# Flush output buffers to disk and close the connection
f.close()


# Note that we use 'r' mode for reading
with open('../temp/test.txt', 'r') as f:
    text = f.read()


text

'This is a test file.'
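
The `with` statement can be used for writing as well, so the file is closed automatically even if an error occurs. A minimal sketch; the file name `test_with.txt` and the use of the system temporary directory are assumptions made so the example runs anywhere:

```python
import os
import tempfile

# Hypothetical path inside the system temporary directory
path = os.path.join(tempfile.gettempdir(), 'test_with.txt')

# The file is flushed and closed when the block ends
with open(path, 'w') as f:
    n_chars = f.write('This is a test file.')

with open(path, 'r') as f:
    text = f.read()

print(n_chars, text)
```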


# We specify that we want to combine first two rows as a header
kaggle2022 = pd.read_csv(
    '../data/kaggle_survey_2022_responses.csv',
    header = [0,1]
)

/tmp/ipykernel_776815/2279990677.py:2: DtypeWarning: Columns (208,225,255,257,260,270,271,277) have mixed types. Specify dtype option on import or set low_memory=False.
  kaggle2022 = pd.read_csv(
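
The warning above names two remedies: specifying `dtype` or setting `low_memory=False`. The sketch below illustrates the two-row header together with `low_memory=False` on a tiny in-memory CSV; the data and the name `survey` are made up for illustration and are not the Kaggle file:

```python
import io
import pandas as pd

# Made-up miniature file with two header rows, mimicking the survey layout
raw = io.StringIO(
    "Q1,Q2\n"
    "Duration (in seconds),What is your age?\n"
    "121,30-34\n"
    "462,18-21\n"
)

# header=[0, 1] combines both rows into a two-level column index;
# low_memory=False reads each column in one pass before inferring its type
survey = pd.read_csv(raw, header=[0, 1], low_memory=False)

print(survey.columns.tolist())
print(survey)
```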


kaggle2022.head() # Returns the top n (n=5 default) rows


kaggle2022.tail() # Returns the bottom n (n=5 default) rows


kaggle2022.to_csv('../temp/kaggle2022.csv')
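
By default `to_csv` also writes the row index as an unnamed first column. A small sketch on made-up data showing the difference with `index=False`; calling `to_csv()` without a path returns the text instead of writing a file:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana'],
                   'weight_g': [150.0, 120.0]})

with_index = df.to_csv()                # first column holds the row index
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)
```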


kaggle2022.describe() # DataFrame.describe() provides a range of summary statistics


kaggle2022.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods

10090.095845313997


kaggle2022.iloc[:,0].median() # Median

414.0


kaggle2022.iloc[:,0].std() # Standard deviation

111540.30746801202


import statistics # We don't have to rely only on methods provided by `pandas`
statistics.stdev(kaggle2022.iloc[:,0])

111540.30746801202
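
The two results agree because both `Series.std` and `statistics.stdev` compute the sample standard deviation (dividing by n - 1). NumPy's `np.std` divides by n unless `ddof=1` is passed. A sketch on made-up values:

```python
import statistics

import numpy as np
import pandas as pd

values = [150.0, 120.0, 3000.0]

sample_sd_pd = pd.Series(values).std()  # sample SD (ddof=1 by default)
sample_sd_py = statistics.stdev(values)  # sample SD
pop_sd_np = np.std(values)               # population SD (ddof=0 by default)
sample_sd_np = np.std(values, ddof=1)    # sample SD once ddof=1 is set

print(sample_sd_pd, sample_sd_py, sample_sd_np, pop_sd_np)
```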


kaggle2022.describe(include = 'all') # Adding include = 'all' tells pandas to summarize all variables


kaggle2022.iloc[:,2].mode() # Mode, most frequent value

0    Man
Name: (Q3, What is your gender? - Selected Choice), dtype: object


kaggle2022.iloc[:,2].value_counts() # Counts of unique values

Man                        18266
Woman                       5286
Prefer not to say            334
Nonbinary                     78
Prefer to self-describe       33
Name: (Q3, What is your gender? - Selected Choice), dtype: int64


kaggle2022.iloc[:,2].value_counts(normalize = True) # We can further normalize them to proportions of the non-missing values

Man                        0.761178
Woman                      0.220278
Prefer not to say          0.013918
Nonbinary                  0.003250
Prefer to self-describe    0.001375
Name: (Q3, What is your gender? - Selected Choice), dtype: float64
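
Missing values are excluded from `value_counts` by default, which changes the denominator when `normalize=True`. A sketch on made-up data showing the effect of `dropna=False`:

```python
import pandas as pd

answers = pd.Series(['Man', 'Woman', 'Man', None])

# Default: the missing answer is dropped, so shares are out of 3
shares = answers.value_counts(normalize=True)

# dropna=False keeps the missing answer, so shares are out of 4
shares_with_na = answers.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_na)
```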

Expression	Selection Operation
`df[val]`	Column or sequence of columns; as a convenience, a slice or Boolean array selects rows
`df.loc[lab_i]`	Row or subset of rows by label
`df.loc[:, lab_j]`	Column or subset of columns by label
`df.loc[lab_i, lab_j]`	Both rows and columns by label
`df.iloc[i]`	Row or subset of rows by integer position
`df.iloc[:, j]`	Column or subset of columns by integer position
`df.iloc[i, j]`	Both rows and columns by integer position
`df.at[lab_i, lab_j]`	Single scalar value by row and column label
`df.iat[i, j]`	Single scalar value by row and column integer position
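
The table can be tried out on the toy fruit data. In this sketch the default integer row labels happen to coincide with positions, so the label-based and position-based calls return the same values (variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'watermelon'],
                   'weight': [150.0, 120.0, 3000.0],
                   'berry': [False, True, True]})

cell_by_label = df.at[1, 'weight']  # scalar by row label and column label
cell_by_pos = df.iat[1, 1]          # scalar by row and column position

sub_by_label = df.loc[[0, 2], ['fruit', 'weight']]  # rows and columns by label
sub_by_pos = df.iloc[[0, 2], [0, 1]]                # same block by position

print(cell_by_label, cell_by_pos)
print(sub_by_label)
```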

	Duration (in seconds)	Q2	Q3	Q4	Q5	Q6_1	Q6_2	Q6_3	Q6_4	Q6_5	...	Q44_3	Q44_4	Q44_5	Q44_6	Q44_7	Q44_8	Q44_9	Q44_10	Q44_11	Q44_12
	Duration (in seconds)	What is your age (# years)?	What is your gender? - Selected Choice	In which country do you currently reside?	Are you currently a student? (high school, university, or graduate)	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai	...	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc)	Who/what are your favorite media sources that report on data science topics? 
(Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
0	121	30-34	Man	India	No	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	462	30-34	Man	Algeria	No	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	293	18-21	Man	Egypt	Yes	Coursera	edX	NaN	DataCamp	NaN	...	NaN	Kaggle (notebooks, forums, etc)	NaN	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	Podcasts (Chai Time Data Science, O’Reilly Dat...	NaN	NaN	NaN	NaN	NaN
3	851	55-59	Man	France	No	Coursera	NaN	Kaggle Learn Courses	NaN	NaN	...	NaN	Kaggle (notebooks, forums, etc)	Course Forums (forums.fast.ai, Coursera forums...	NaN	NaN	Blogs (Towards Data Science, Analytics Vidhya,...	NaN	NaN	NaN	NaN
4	232	45-49	Man	India	Yes	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	Blogs (Towards Data Science, Analytics Vidhya,...	NaN	NaN	NaN	NaN

	Duration (in seconds)	Q2	Q3	Q4	Q5	Q6_1	Q6_2	Q6_3	Q6_4	Q6_5	...	Q44_3	Q44_4	Q44_5	Q44_6	Q44_7	Q44_8	Q44_9	Q44_10	Q44_11	Q44_12
	Duration (in seconds)	What is your age (# years)?	What is your gender? - Selected Choice	In which country do you currently reside?	Are you currently a student? (high school, university, or graduate)	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai	...	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc)	Who/what are your favorite media sources that report on data science topics? 
(Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
23992	331	22-24	Man	United States of America	Yes	NaN	NaN	NaN	NaN	NaN	...	NaN	Kaggle (notebooks, forums, etc)	NaN	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	Podcasts (Chai Time Data Science, O’Reilly Dat...	NaN	Journal Publications (peer-reviewed journals, ...	NaN	NaN	NaN
23993	330	60-69	Man	United States of America	Yes	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	NaN	NaN	NaN	NaN	NaN	NaN
23994	860	25-29	Man	Turkey	No	NaN	NaN	NaN	DataCamp	NaN	...	NaN	Kaggle (notebooks, forums, etc)	NaN	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	NaN	NaN	NaN	NaN	NaN	NaN
23995	597	35-39	Woman	Israel	No	NaN	NaN	Kaggle Learn Courses	NaN	NaN	...	NaN	NaN	NaN	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	NaN	NaN	NaN	NaN	NaN	NaN
23996	303	18-21	Man	India	Yes	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Other

	Duration (in seconds)
	Duration (in seconds)
count	2.399700e+04
mean	1.009010e+04
std	1.115403e+05
min	1.200000e+02
25%	2.640000e+02
50%	4.140000e+02
75%	7.150000e+02
max	2.533678e+06

	Duration (in seconds)	Q2	Q3	Q4	Q5	Q6_1	Q6_2	Q6_3	Q6_4	Q6_5	...	Q44_3	Q44_4	Q44_5	Q44_6	Q44_7	Q44_8	Q44_9	Q44_10	Q44_11	Q44_12
	Duration (in seconds)	What is your age (# years)?	What is your gender? - Selected Choice	In which country do you currently reside?	Are you currently a student? (high school, university, or graduate)	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp	On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai	...	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc)	Who/what are your favorite media sources that report on data science topics? 
(Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None	Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
count	2.399700e+04	23997	23997	23997	23997	9699	2474	6628	3718	944	...	2678	11181	4006	11957	2120	7766	3804	1726	1268	835
unique	NaN	11	5	58	2	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
top	NaN	18-21	Man	India	No	Coursera	edX	Kaggle Learn Courses	DataCamp	Fast.ai	...	Reddit (r/machinelearning, etc)	Kaggle (notebooks, forums, etc)	Course Forums (forums.fast.ai, Coursera forums...	YouTube (Kaggle YouTube, Cloud AI Adventures, ...	Podcasts (Chai Time Data Science, O’Reilly Dat...	Blogs (Towards Data Science, Analytics Vidhya,...	Journal Publications (peer-reviewed journals, ...	Slack Communities (ods.ai, kagglenoobs, etc)	None	Other
freq	NaN	4559	18266	8792	12036	9699	2474	6628	3718	944	...	2678	11181	4006	11957	2120	7766	3804	1726	1268	835
mean	1.009010e+04	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	1.115403e+05	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	1.200000e+02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	2.640000e+02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	4.140000e+02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	7.150000e+02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	2.533678e+06	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Week 5: Data Wrangling

Tom Paskhalis

Overview

Numerical analysis in Python

NumPy - numerical analysis in Python

NumPy array

Working with arrays

Array indexing and slicing

Tidy data

Data preparation

Pandas

Core `pandas` object types

Series

Indexing in Series

DataFrame - the workhorse of data analysis

Indexing in DataFrame

Summary of indexing in DataFrame

Subsetting in DataFrame

Columns in DataFrame

Manipulating columns in DataFrame

Filtering in DataFrame

Variable transformation

File object

Data input and output

Data output example

Data input example

Reading and writing data in `pandas`

Reading data in `pandas` example

Visual data inspection

Visual data inspection continued

Reading in other (non-`.csv`) data files

Writing data out in Python

Summarizing numeric variables

Methods for summarizing numeric variables

Summarizing categorical variables

Methods for summarizing categorical variables

Summary of descriptive statistics methods

Next Week

	fruit	weight_g	berry	weight_oz
0	apple	150.0	False	6.0
1	banana	120.0	True	4.8
2	watermelon	3000.0	True	120.0

Method	Numeric	Categorical	Description
`count`	yes	yes	Number of non-NA observations
`value_counts`	yes	yes	Number of observations for each unique value
`describe`	yes	yes	Set of summary statistics for Series/DataFrame
`min`, `max`	yes	yes (caution)	Minimum and maximum values
`quantile`	yes	no	Sample quantile ranging from 0 to 1
`sum`	yes	yes (caution)	Sum of values
`prod`	yes	no	Product of values
`mean`	yes	no	Mean
`median`	yes	no	Median (50% quantile)
`var`	yes	no	Sample variance
`std`	yes	no	Sample standard deviation
`skew`	yes	no	Sample skewness (third moment)
`kurt`	yes	no	Sample kurtosis (fourth moment)
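
A few of the methods from the table applied to made-up data. The string example illustrates the "caution" entries: `min` and `max` on text compare alphabetically, which is rarely a meaningful summary (variable names are illustrative only):

```python
import pandas as pd

weights = pd.Series([150.0, 120.0, 3000.0])
fruits = pd.Series(['banana', 'apple', 'watermelon'])

n_obs = weights.count()           # number of non-NA observations
total = weights.sum()             # sum of values
middle = weights.quantile(0.5)    # 50% quantile
median = weights.median()         # same value as the 50% quantile

first_alpha = fruits.min()        # alphabetical minimum, not a numeric one
last_alpha = fruits.max()         # alphabetical maximum

print(n_obs, total, middle, median, first_alpha, last_alpha)
```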

Python for Social Data Science