POP77001 - 2025/26 – Week 10 Tutorial: Data Wrangling in Python

Loading the Dataset

Replace filepath with the location of the file on your computer

import pandas as pd

# This time let's skip the 2nd row, which contains questions
PATH = '../data/kaggle_survey_2022_responses.csv'

kaggle2022 = pd.read_csv(PATH, skiprows = [1])
kaggle2022.head(n = 1)

	Duration (in seconds)	Q2	Q3	Q4	Q5	Q6_1	Q6_2	Q6_3	Q6_4	Q6_5	...	Q44_3	Q44_4	Q44_5	Q44_6	Q44_7	Q44_8	Q44_9	Q44_10	Q44_11	Q44_12
0	121	30-34	Man	India	No	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 296 columns

# We will load the questions as a separate dataset
kaggle2022_qs = pd.read_csv(PATH, nrows = 1)
kaggle2022_qs

	Duration (in seconds)	Q2	Q3	Q4	Q5	Q6_1	Q6_2	Q6_3	Q6_4	Q6_5	...	Q44_3	Q44_4	Q44_5	Q44_6	Q44_7	Q44_8	Q44_9	Q44_10	Q44_11	Q44_12
0	Duration (in seconds)	What is your age (# years)?	What is your gender? - Selected Choice	In which country do you currently reside?	Are you currently a student? (high school, uni...	On which platforms have you begun or completed...	On which platforms have you begun or completed...	On which platforms have you begun or completed...	On which platforms have you begun or completed...	On which platforms have you begun or completed...	...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...	Who/what are your favorite media sources that ...

1 rows × 296 columns

Exercise: Categorical Variables

Load the dataset (as local file)
Consider country of residence reported by respondents (question Q4).
Make sure you can select the column both using label and index.
Calculate the percentages of top 3 countries of residence in the sample.

Crosstabulation in `pandas`

# Calculate crosstabulation between 'Age group' (Q2) and 'Gender' (Q3)
pd.crosstab(kaggle2022['Q2'], kaggle2022['Q3'])

Q3	Man	Nonbinary	Prefer not to say	Prefer to self-describe	Woman
Q2
18-21	3310	13	69	7	1160
22-24	3168	15	50	6	1044
25-29	3425	14	56	6	971
30-34	2248	12	43	6	663
35-39	1791	6	36	1	519
40-44	1480	6	29	2	410
45-49	997	6	10	1	239
50-54	759	0	14	1	140
55-59	506	3	13	0	89
60-69	470	3	10	1	42
70+	112	0	4	2	9

Margins in Crosstab

# It is often useful to see the proportions/percentages rather than raw counts
pd.crosstab(kaggle2022['Q2'], kaggle2022['Q3'], normalize = 'columns')

Q3	Man	Nonbinary	Prefer not to say	Prefer to self-describe	Woman
Q2
18-21	0.181211	0.166667	0.206587	0.212121	0.219448
22-24	0.173437	0.192308	0.149701	0.181818	0.197503
25-29	0.187507	0.179487	0.167665	0.181818	0.183693
30-34	0.123070	0.153846	0.128743	0.181818	0.125426
35-39	0.098051	0.076923	0.107784	0.030303	0.098184
40-44	0.081025	0.076923	0.086826	0.060606	0.077563
45-49	0.054582	0.076923	0.029940	0.030303	0.045214
50-54	0.041553	0.000000	0.041916	0.030303	0.026485
55-59	0.027702	0.038462	0.038922	0.000000	0.016837
60-69	0.025731	0.038462	0.029940	0.030303	0.007946
70+	0.006132	0.000000	0.011976	0.060606	0.001703

Crosstab with `pivot_table`

# For `values` variable we use `Q4`, but any other would work equally well 
pd.pivot_table(
  kaggle2022,
  index = 'Q2',
  columns = 'Q3',
  values = 'Q4',
  aggfunc = 'count',
  fill_value = 0
)

Q3	Man	Nonbinary	Prefer not to say	Prefer to self-describe	Woman
Q2
18-21	3310	13	69	7	1160
22-24	3168	15	50	6	1044
25-29	3425	14	56	6	971
30-34	2248	12	43	6	663
35-39	1791	6	36	1	519
40-44	1480	6	29	2	410
45-49	997	6	10	1	239
50-54	759	0	14	1	140
55-59	506	3	13	0	89
60-69	470	3	10	1	42
70+	112	0	4	2	9

Exercise: Manipulating Columns

Let’s take a look at the first column of the dataset.
It lists the time it took respondents to complete the survey (in seconds).
First, change column’s long name to duration_min.
Now modify the column such that it shows time in minutes.
Filter dataset leaving only respondents who took more than 3 mins to respond.
How many are dropped?

Pivoting Data in `pandas`

Recall pivoting from R.
The two main operations are:
- Spreading some variable across columns (pd.DataFrame.pivot())
- Gathering some columns in a variable pair (pd.DataFrame.melt())

pd.DataFrame.pivot()

pd.DataFrame.melt()

Source

R for Data Science

Pivoting Data to Long

df_wide = pd.DataFrame({
  'country': ['Afghanistan', 'Brazil'],
  '1999': [745, 2666],
  '2000': [37737, 80488]
})
df_wide

	country	1999	2000
0	Afghanistan	745	37737
1	Brazil	2666	80488

# Pivoting longer
df_long = df_wide.melt(
    id_vars = 'country',
    var_name = 'year',
    value_name = 'cases'
)
df_long

	country	year	cases
0	Afghanistan	1999	745
1	Brazil	1999	2666
2	Afghanistan	2000	37737
3	Brazil	2000	80488

Pivoting Data to Wide

# Pivoting wider
df_wide = df_long.pivot(
    index = 'country',
    columns = 'year',
    values = 'cases'
)
df_wide

year	1999	2000
country
Afghanistan	745	37737
Brazil	2666	80488

# As using pivot creates an index from
# the column used as the row labels, we
# may want to use reset_index to move 
# the data back into a column
df_wide.reset_index()

year	country	1999	2000
0	Afghanistan	745	37737
1	Brazil	2666	80488

Week 10 Exercise (unassessed)

Try replicating Exercise 5 from Assignment 2 using pandas.
You can use pd.DataFrame.isna() or pd.DataFrame.notna() for filtering.

Week 10 Tutorial: Data Wrangling in Python

Loading the Dataset

Exercise: Categorical Variables

Crosstabulation in pandas

Margins in Crosstab

Crosstab with pivot_table

Exercise: Manipulating Columns

Pivoting Data in pandas

Pivoting Data to Long

Pivoting Data to Wide

Week 10 Exercise (unassessed)

Crosstabulation in `pandas`

Crosstab with `pivot_table`

Pivoting Data in `pandas`