Week 6: Data Analysis and Communicating Results¶

Python for Social Data Science¶

Tom Paskhalis¶

Exploratory data analysis¶

Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step.

Tukey, 1977

  • Exploratory data analysis (EDA) is the first and often most important step.
  • A study's feasibility, scope and framing usually depend on the results of EDA.
  • Specific details of EDA depend upon the type and quality of data available.

Measurement scales¶

Source: Stevens (1946)

Measurement scales in Pandas¶

  • The 4 measurement scales defined by Stevens (1946) can be roughly represented in pandas as follows:
    • Interval and ratio -> numeric
    • Nominal and ordinal -> categorical
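
As a rough illustration of this mapping, the sketch below builds a small made-up DataFrame (the column names and values are hypothetical, not taken from the survey) and stores a nominal variable as an unordered categorical, an ordinal variable as an ordered categorical, and a ratio variable as a plain numeric column.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the mapping
df = pd.DataFrame({
    "country": ["India", "Ireland", "India"],           # nominal
    "education": ["Bachelor", "Master", "Doctorate"],   # ordinal
    "duration": [910, 656, 443],                        # ratio
})

# Nominal: categories without any order
df["country"] = df["country"].astype("category")

# Ordinal: categories with an explicit order
edu_levels = pd.CategoricalDtype(
    categories=["Bachelor", "Master", "Doctorate"],
    ordered=True,
)
df["education"] = df["education"].astype(edu_levels)

print(df.dtypes)
# Order-aware operations are only meaningful for the ordinal variable
print(df["education"].min())
print(df["education"] > "Bachelor")
```

Declaring the order explicitly is what lets pandas compare and sort ordinal values by rank rather than alphabetically.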

Loading the dataset¶

In [1]:
import pandas as pd
In [2]:
# Skip the 2nd line of the file, which contains the question text (the header line is kept)
kaggle2022 = pd.read_csv(
    '../data/kaggle_survey_2021_responses.csv',
    skiprows = [1]
)
kaggle2022.head(n = 1)
/tmp/ipykernel_779556/1478523914.py:2: DtypeWarning: Columns (195,201) have mixed types. Specify dtype option on import or set low_memory=False.
  kaggle2022 = pd.read_csv(
Out[2]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
0 910 50-54 Man India Bachelor’s degree Other 5-10 years Python R NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 369 columns
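
The DtypeWarning above suggests two remedies: setting low_memory=False or specifying dtype. A minimal sketch of both follows, using a tiny in-memory CSV as a stand-in for the survey file (the text and column names here are made up for illustration).

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file: header, question line, responses
csv_text = (
    "duration,Q1\n"
    "Duration (in seconds),What is your age (# years)?\n"
    "910,50-54\n"
    "656,25-29\n"
)

responses = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=[1],        # drop the line with the question text
    low_memory=False,    # infer each column's type from the whole file at once
    dtype={"Q1": str},   # or state the type of a problematic column explicitly
)
print(responses.dtypes)
```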

Loading the dataset continued¶

In [3]:
# We will load the questions as a separate dataset
kaggle2022_qs = pd.read_csv(
    '../data/kaggle_survey_2021_responses.csv',
    nrows = 1
)
kaggle2022_qs
Out[3]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
0 Duration (in seconds) What is your age (# years)? What is your gender? - Selected Choice In which country do you currently reside? What is the highest level of formal education ... Select the title most similar to your current ... For how many years have you been writing code ... What programming languages do you use on a reg... What programming languages do you use on a reg... What programming languages do you use on a reg... ... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor...

1 rows × 369 columns
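
One possible way to use the questions row is as a lookup from column name to question text. The sketch below assumes the same layout as the survey file (header line followed by a question line), but reads a small made-up CSV from memory.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file: header, question line, one response
csv_text = (
    "duration,Q1,Q2\n"
    "Duration (in seconds),What is your age (# years)?,What is your gender?\n"
    "910,50-54,Man\n"
)

# nrows=1 keeps only the first line after the header; iloc[0] turns it into a Series
questions = pd.read_csv(io.StringIO(csv_text), nrows=1).iloc[0]

# Look up the wording of a question by its column name
print(questions["Q1"])
```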

Summarizing numeric variables¶

  • Most summary methods in pandas exclude missing data (NaN) by default
In [4]:
# pd.DataFrame.describe() provides a range of summary statistics
kaggle2022.describe()
Out[4]:
Time from Start to Finish (seconds) Q30_B_Part_1 Q30_B_Part_2 Q30_B_Part_3 Q30_B_Part_4 Q30_B_Part_5 Q30_B_Part_6 Q30_B_Part_7 Q30_B_OTHER
count 2.597300e+04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mean 1.105466e+04 NaN NaN NaN NaN NaN NaN NaN NaN
std 1.014716e+05 NaN NaN NaN NaN NaN NaN NaN NaN
min 1.200000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
25% 4.430000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
50% 6.560000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.038000e+03 NaN NaN NaN NaN NaN NaN NaN NaN
max 2.488653e+06 NaN NaN NaN NaN NaN NaN NaN NaN
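
The handling of missing values can be checked on a small made-up Series. The sketch below shows that count() and mean() skip NaN by default and that skipna=False switches this off. Columns that contain only NaN are typically read as floats, which is presumably why the Q30_B columns appear in the output above with a count of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical durations with one missing value
s = pd.Series([120.0, 443.0, np.nan, 656.0])

n_valid = s.count()                   # number of non-missing values
n_missing = s.isna().sum()            # number of missing values
mean_default = s.mean()               # NaN is excluded by default
mean_strict = s.mean(skipna=False)    # NaN propagates to the result

print(n_valid, n_missing, mean_default, mean_strict)
```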

Methods for summarizing numeric variables¶

In [5]:
kaggle2022.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods
Out[5]:
11054.66492126439
In [6]:
kaggle2022.iloc[:,0].median() # Median
Out[6]:
656.0
In [7]:
kaggle2022.iloc[:,0].std() # Standard deviation
Out[7]:
101471.6221245172
In [8]:
import statistics # We don't have to rely only on methods provided by pandas
statistics.stdev(kaggle2022.iloc[:,0])
Out[8]:
101471.6221245172
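
The two results agree because both pandas std() and statistics.stdev() compute the sample standard deviation (dividing by n - 1). NumPy, by contrast, defaults to the population version (dividing by n). A short sketch on made-up values:

```python
import math
import statistics

import numpy as np
import pandas as pd

# Hypothetical durations in seconds
durations = pd.Series([120, 443, 656, 1038])

sample_sd = durations.std()            # pandas default: ddof=1 (sample)
population_sd = durations.std(ddof=0)  # population standard deviation

# statistics.stdev is the sample version, statistics.pstdev the population one
print(math.isclose(sample_sd, statistics.stdev(durations.tolist())))
print(math.isclose(population_sd, statistics.pstdev(durations.tolist())))

# np.std on a NumPy array uses ddof=0 unless told otherwise
print(math.isclose(population_sd, np.std(durations.to_numpy())))
```

When comparing results across libraries, it is worth checking which denominator each one uses.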

Summarizing categorical variables¶

In [9]:
# Adding include = 'all' tells pandas to summarize all variables
kaggle2022.describe(include = 'all')
Out[9]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
count 2.597300e+04 25973 25973 25973 25973 25973 25973 21860 5334 10756 ... 633 591 4239 729 737 1020 666 2747 4542 377
unique NaN 11 5 66 7 15 7 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
top NaN 25-29 Man India Master’s degree Student 1-3 years Python R SQL ... Comet.ml Sacred + Omniboard TensorBoard Guild.ai Polyaxon ClearML Domino Model Monitor MLflow None Other
freq NaN 4931 20598 7434 10132 6804 7874 21860 5334 10756 ... 633 591 4239 729 737 1020 666 2747 4542 377
mean 1.105466e+04 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1.014716e+05 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 1.200000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 4.430000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 6.560000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.038000e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 2.488653e+06 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

11 rows × 369 columns
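
With include = 'all' the numeric and categorical summaries are mixed, so many cells are NaN. One alternative, sketched here on a made-up DataFrame, is to summarize only the non-numeric columns by excluding numeric ones.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
df = pd.DataFrame({
    "duration": [910, 656, 443, 1038],
    "gender": ["Man", "Woman", "Man", "Man"],
})

# Summarize only non-numeric columns: count, unique, top and freq
cat_summary = df.describe(exclude="number")
print(cat_summary)
```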

Methods for summarizing categorical variables¶

In [10]:
kaggle2022.iloc[:,2].mode() # Mode, most frequent value
Out[10]:
0    Man
Name: Q2, dtype: object
In [11]:
kaggle2022.iloc[:,2].value_counts() # Counts of unique values
Out[11]:
Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: Q2, dtype: int64
In [12]:
# We can further normalize the counts into proportions of the non-missing values
kaggle2022.iloc[:,2].value_counts(normalize = True)
Out[12]:
Man                        0.793054
Woman                      0.188272
Prefer not to say          0.013668
Nonbinary                  0.003388
Prefer to self-describe    0.001617
Name: Q2, dtype: float64
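
Q2 has no missing values, so the proportions above are also shares of all rows. When a column does contain missing values, value_counts() leaves them out unless dropna = False is set. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical responses with one missing value
s = pd.Series(["Man", "Woman", None, "Man"])

counts = s.value_counts()                  # missing values are excluded
with_na = s.value_counts(dropna=False)     # missing values get their own row
shares = s.value_counts(normalize=True)    # proportions of non-missing values

print(counts)
print(with_na)
print(shares)
```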

Summary of descriptive statistics methods¶

| Method | Numeric | Categorical | Description |
|---|---|---|---|
| count | yes | yes | Number of non-NA observations |
| value_counts | yes | yes | Number of observations for each unique value |
| describe | yes | yes | Set of summary statistics for Series/DataFrame |
| min, max | yes | yes (caution) | Minimum and maximum values |
| quantile | yes | no | Sample quantile at a given level between 0 and 1 |
| sum | yes | yes (caution) | Sum of values |
| prod | yes | no | Product of values |
| mean | yes | no | Mean |
| median | yes | no | Median (50% quantile) |
| var | yes | no | Sample variance |
| std | yes | no | Sample standard deviation |
| skew | yes | no | Sample skewness (third moment) |
| kurt | yes | no | Sample kurtosis (fourth moment) |
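
A few of these methods are sketched below on made-up values. The last lines illustrate one reason for caution with min and max on text data: strings are compared alphabetically, not numerically.

```python
import pandas as pd

# Hypothetical right-skewed durations (one very large value)
durations = pd.Series([120, 443, 656, 1038, 2488653])

median = durations.quantile(0.5)             # same as durations.median()
quartiles = durations.quantile([0.25, 0.75]) # several quantiles at once
skewness = durations.skew()                  # positive for a long right tail

print(median, skewness)
print(quartiles)

# Caution: on strings, max() is alphabetical, so "9" ranks above "10"
text_numbers = pd.Series(["9", "10"])
print(text_numbers.max())
```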

Pivoting data in pandas¶

  • Pivoting is transforming datasets from wide to long format (and vice versa)
  • The two main operations are:
    • Spreading the values of a variable across several columns (pd.DataFrame.pivot())
    • Gathering several columns into a variable-value pair of columns (pd.DataFrame.melt())
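
The two operations can be sketched as a round trip on a small made-up table (the column names and counts are hypothetical): melt() gathers the year columns into a long format, and pivot() spreads them back out.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "country": ["India", "Ireland"],
    "2020": [10, 20],
    "2021": [30, 40],
})

# Wide -> long: gather the year columns into a (year, respondents) pair
long = wide.melt(
    id_vars="country",
    var_name="year",
    value_name="respondents",
)
print(long)

# Long -> wide: spread the values of "year" back across columns
back = long.pivot(index="country", columns="year", values="respondents")
print(back)
```

Note that pivot() moves the index variable (here country) into the row index, so reset_index() would be needed to recover it as an ordinary column.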