Week 6: Data Analysis and Communicating Results¶

Python for Social Data Science¶

Tom Paskhalis¶

Exploratory data analysis¶

Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step.

Tukey, 1977

  • Exploratory data analysis (EDA) is the first and often most important step.
  • A study's feasibility, scope and framing usually depend on its results.
  • Specific details of EDA depend upon the type and quality of data available.

Measurement scales¶

Source: Stevens (1946)

Measurement scales in Pandas¶

  • The 4 measurement scales defined by Stevens (1946) can be roughly represented in pandas as follows:
  • Interval and ratio -> numeric
  • Nominal and ordinal -> categorical
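
A minimal sketch of this mapping (the column names below are made up for illustration): numeric dtypes cover interval and ratio scales, while the pandas categorical dtype, with or without an ordering, covers nominal and ordinal scales.

import pandas as pd

scales = pd.DataFrame({
    'reaction_ms': [512.0, 430.0],         # ratio -> numeric (float64)
    'temperature_c': [21.5, 18.0],         # interval -> numeric (float64)
    'party': ['Labour', 'Conservative'],   # nominal -> unordered categorical
    'education': ['Bachelor', 'Master']    # ordinal -> ordered categorical
})
scales['party'] = scales['party'].astype('category')
scales['education'] = pd.Categorical(
    scales['education'], categories = ['Bachelor', 'Master'], ordered = True
)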

Loading the dataset¶

In [1]:
import pandas as pd
In [2]:
# This time let's skip the 2nd row, which contains questions
kaggle2022 = pd.read_csv(
    '../data/kaggle_survey_2021_responses.csv',
    skiprows = [1]
)
kaggle2022.head(n = 1)
/tmp/ipykernel_779556/1478523914.py:2: DtypeWarning: Columns (195,201) have mixed types. Specify dtype option on import or set low_memory=False.
  kaggle2022 = pd.read_csv(
Out[2]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
0 910 50-54 Man India Bachelor’s degree Other 5-10 years Python R NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 369 columns

Loading the dataset continued¶

In [3]:
# We will load the questions as a separate dataset
kaggle2022_qs = pd.read_csv(
    '../data/kaggle_survey_2021_responses.csv',
    nrows = 1
)
kaggle2022_qs
Out[3]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
0 Duration (in seconds) What is your age (# years)? What is your gender? - Selected Choice In which country do you currently reside? What is the highest level of formal education ... Select the title most similar to your current ... For how many years have you been writing code ... What programming languages do you use on a reg... What programming languages do you use on a reg... What programming languages do you use on a reg... ... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor... In the next 2 years, do you hope to become mor...

1 rows × 369 columns
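
Keeping the question wordings in a separate one-row DataFrame makes it easy to look up the full text of any question by its column name; a small sketch:

kaggle2022_qs.loc[0, 'Q1']   # full wording of the age question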

Summarizing numeric variables¶

  • DataFrame methods in pandas can automatically handle (exclude) missing data (NaN)
In [4]:
# pd.DataFrame.describe() provides a range of summary statistics
kaggle2022.describe()
Out[4]:
Time from Start to Finish (seconds) Q30_B_Part_1 Q30_B_Part_2 Q30_B_Part_3 Q30_B_Part_4 Q30_B_Part_5 Q30_B_Part_6 Q30_B_Part_7 Q30_B_OTHER
count 2.597300e+04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mean 1.105466e+04 NaN NaN NaN NaN NaN NaN NaN NaN
std 1.014716e+05 NaN NaN NaN NaN NaN NaN NaN NaN
min 1.200000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
25% 4.430000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
50% 6.560000e+02 NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.038000e+03 NaN NaN NaN NaN NaN NaN NaN NaN
max 2.488653e+06 NaN NaN NaN NaN NaN NaN NaN NaN
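
The skipna argument controls this behaviour and defaults to True for these reduction methods; a minimal sketch (Q30_B_Part_1 is one of the all-NaN columns above):

kaggle2022.iloc[:, 0].mean(skipna = True)   # equivalent to .mean()
kaggle2022['Q30_B_Part_1'].mean()           # all values missing, so the result is NaN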

Methods for summarizing numeric variables¶

In [5]:
kaggle2022.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods
Out[5]:
11054.66492126439
In [6]:
kaggle2022.iloc[:,0].median() # Median
Out[6]:
656.0
In [7]:
kaggle2022.iloc[:,0].std() # Standard deviation
Out[7]:
101471.6221245172
In [8]:
import statistics ## We don't have to rely only on methods provided by `pandas`
statistics.stdev(kaggle2022.iloc[:,0])
Out[8]:
101471.6221245172

Summarizing categorical variables¶

In [9]:
# Adding include = 'all' tells pandas to summarize all variables
kaggle2022.describe(include = 'all')
Out[9]:
Time from Start to Finish (seconds) Q1 Q2 Q3 Q4 Q5 Q6 Q7_Part_1 Q7_Part_2 Q7_Part_3 ... Q38_B_Part_3 Q38_B_Part_4 Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9 Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
count 2.597300e+04 25973 25973 25973 25973 25973 25973 21860 5334 10756 ... 633 591 4239 729 737 1020 666 2747 4542 377
unique NaN 11 5 66 7 15 7 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
top NaN 25-29 Man India Master’s degree Student 1-3 years Python R SQL ... Comet.ml Sacred + Omniboard TensorBoard Guild.ai Polyaxon ClearML Domino Model Monitor MLflow None Other
freq NaN 4931 20598 7434 10132 6804 7874 21860 5334 10756 ... 633 591 4239 729 737 1020 666 2747 4542 377
mean 1.105466e+04 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1.014716e+05 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 1.200000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 4.430000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 6.560000e+02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.038000e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 2.488653e+06 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

11 rows × 369 columns

Methods for summarizing categorical variables¶

In [10]:
kaggle2022.iloc[:,2].mode() # Mode, most frequent value
Out[10]:
0    Man
Name: Q2, dtype: object
In [11]:
kaggle2022.iloc[:,2].value_counts() # Counts of unique values
Out[11]:
Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: Q2, dtype: int64
In [12]:
# We can further normalize them by the number of rows
kaggle2022.iloc[:,2].value_counts(normalize = True)
Out[12]:
Man                        0.793054
Woman                      0.188272
Prefer not to say          0.013668
Nonbinary                  0.003388
Prefer to self-describe    0.001617
Name: Q2, dtype: float64

Summary of descriptive statistics methods¶

Method Numeric Categorical Description
count yes yes Number of non-NA observations
value_counts yes yes Number of unique observations by value
describe yes yes Set of summary statistics for Series/DataFrame
min, max yes yes (caution) Minimum and maximum values
quantile yes no Sample quantile ranging from 0 to 1
sum yes yes (caution) Sum of values
prod yes no Product of values
mean yes no Mean
median yes no Median (50% quantile)
var yes no Sample variance
std yes no Sample standard deviation
skew yes no Sample skewness (third moment)
kurt yes no Sample kurtosis (fourth moment)
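
A short sketch of a few methods from the table that are not demonstrated above, applied to the same duration column:

duration = kaggle2022.iloc[:, 0]
duration.quantile(0.9)   # 90th percentile
duration.skew()          # sample skewness (third moment)
duration.kurt()          # sample kurtosis (fourth moment)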

Pivoting data in pandas¶

  • Pivoting is transforming datasets between wide and long formats
  • The two main operations are:
    • Spreading some variable across columns (pd.DataFrame.pivot())
    • Gathering some columns in a variable pair (pd.DataFrame.melt())

Source: R for Data Science

Pivoting data example¶

In [13]:
df_wide = pd.DataFrame({
  'country': ['Afghanistan', 'Brazil'],
  '1999': [745, 2666],
  '2000': [37737, 80488]
})
df_wide
Out[13]:
country 1999 2000
0 Afghanistan 745 37737
1 Brazil 2666 80488
In [14]:
# Pivoting long
df_long = df_wide.melt(
    id_vars = 'country',
    var_name = 'year',
    value_name = 'cases'
)
df_long
Out[14]:
country year cases
0 Afghanistan 1999 745
1 Brazil 1999 2666
2 Afghanistan 2000 37737
3 Brazil 2000 80488

Pivoting data example continued¶

In [15]:
# Pivoting wide
df_wide = df_long.pivot(
    index = 'country',
    columns = 'year',
    values = 'cases'
)
df_wide
Out[15]:
year 1999 2000
country
Afghanistan 745 37737
Brazil 2666 80488
In [16]:
# As using pivot creates an index from
# the column used as the row labels, we
# may want to use reset_index to move 
# the data back into a column
df_wide.reset_index()
Out[16]:
year country 1999 2000
0 Afghanistan 745 37737
1 Brazil 2666 80488

Crosstabulation¶

  • When working with survey data it is often useful to perform simple crosstabulations.
  • Crosstabulation (or crosstab for short) is a computation of group frequencies.
  • It is usually used for working with categorical variables that have a limited number of categories.
  • In pandas, the pd.crosstab() function is a special case of pd.pivot_table().

Crosstabulation in pandas¶

In [17]:
# Calculate crosstabulation between 'Age group' (Q1) and 'Gender' (Q2)
pd.crosstab(kaggle2022['Q1'], kaggle2022['Q2'])
Out[17]:
Q2 Man Nonbinary Prefer not to say Prefer to self-describe Woman
Q1
18-21 3696 16 60 12 1117
22-24 3643 13 66 9 963
25-29 3859 12 61 5 994
30-34 2765 17 34 7 618
35-39 1993 7 42 7 455
40-44 1537 4 31 1 317
45-49 1171 4 24 1 175
50-54 811 3 14 0 136
55-59 509 4 7 0 72
60-69 504 4 10 0 35
70+ 110 4 6 0 8

Margins in crosstab¶

In [18]:
# It is often useful to see proportions (here, within columns) rather than raw counts
pd.crosstab(kaggle2022['Q1'], kaggle2022['Q2'], normalize = 'columns')
Out[18]:
Q2 Man Nonbinary Prefer not to say Prefer to self-describe Woman
Q1
18-21 0.179435 0.181818 0.169014 0.285714 0.228425
22-24 0.176862 0.147727 0.185915 0.214286 0.196933
25-29 0.187348 0.136364 0.171831 0.119048 0.203272
30-34 0.134236 0.193182 0.095775 0.166667 0.126380
35-39 0.096757 0.079545 0.118310 0.166667 0.093047
40-44 0.074619 0.045455 0.087324 0.023810 0.064826
45-49 0.056850 0.045455 0.067606 0.023810 0.035787
50-54 0.039373 0.034091 0.039437 0.000000 0.027812
55-59 0.024711 0.045455 0.019718 0.000000 0.014724
60-69 0.024468 0.045455 0.028169 0.000000 0.007157
70+ 0.005340 0.045455 0.016901 0.000000 0.001636
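
The margins that give this slide its title can be added with margins = True; a minimal sketch:

pd.crosstab(kaggle2022['Q1'], kaggle2022['Q2'], margins = True)   # adds an 'All' row and column with totals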

Crosstabulation with pivot_table¶

In [19]:
# For the `values` argument we use `Q3`, but any other column would work equally well
pd.pivot_table(
    kaggle2022, index = 'Q1', columns = 'Q2', values = 'Q3',
    aggfunc = 'count', fill_value = 0
)
Out[19]:
Q2 Man Nonbinary Prefer not to say Prefer to self-describe Woman
Q1
18-21 3696 16 60 12 1117
22-24 3643 13 66 9 963
25-29 3859 12 61 5 994
30-34 2765 17 34 7 618
35-39 1993 7 42 7 455
40-44 1537 4 31 1 317
45-49 1171 4 24 1 175
50-54 811 3 14 0 136
55-59 509 4 7 0 72
60-69 504 4 10 0 35
70+ 110 4 6 0 8

Data visualization¶

Source: Tufte (2001), based on Marey (1885)

Data visualization in Python¶

  • As with data handling, Python has no built-in, 'base' plotting functionality
  • matplotlib has become one of the standard solutions
  • It is often used in combination with pandas
  • Other popular alternatives include seaborn and plotnine
  • pandas itself also has some limited plotting facilities

plotnine - ggplot for Python¶

  • plotnine implements the Grammar of Graphics data visualisation scheme (Wilkinson, 2005)
  • It mimics the syntax of the well-known R library ggplot2 (Wickham, 2010)
  • In doing so, it makes the code (almost) seamlessly portable between the two languages

Grammar of graphics¶

  • Grammar of Graphics is a powerful conceptualization of plotting
  • Graphs are broken into multiple layers
  • Layers can be recycled across multiple plots

Structure of ggplot calls in plotnine¶

  • Creation of ggplot objects in plotnine has the following structure:
ggplot(data = <DATA>) +\
    <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
  • If the mappings are re-used across geometric objects (e.g. scatterplot and line):
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +\
    <GEOM_FUNCTION>() +\
    <GEOM_FUNCTION>()
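
A minimal, self-contained sketch of the second pattern (the toy data frame here is made up for illustration):

from plotnine import ggplot, aes, geom_point, geom_smooth
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2.1, 3.9, 6.2, 8.1, 9.8]})
ggplot(data = toy, mapping = aes(x = 'x', y = 'y')) +\
    geom_point() +\
    geom_smooth(method = 'lm')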

Creating a ggplot in plotnine¶

In [20]:
from plotnine import *
In [21]:
q1_plot = ggplot(data = kaggle2022) + geom_bar(aes(x = 'Q1')) # Basic 'Age group' (Q1) bar chart
q1_plot
Out[21]:
<ggplot: (8791535819521)>

Compare to base pandas¶

In [22]:
# First we need to group dataset by 'Age group' (Q1) and summarize it with `size()`
kaggle2022_q1_grouped = kaggle2022.groupby(['Q1']).size() 
kaggle2022_q1_grouped.head(n = 3)
Out[22]:
Q1
18-21    4901
22-24    4694
25-29    4931
dtype: int64
In [23]:
%matplotlib inline
kaggle2022_q1_grouped.plot(kind = 'bar')
Out[23]:
<AxesSubplot: xlabel='Q1'>

Compare to matplotlib¶

In [24]:
import matplotlib.pyplot as plt
In [25]:
# `matplotlib` is a lower-level library,
# so plots need more work to be 'prettified'
plt.bar(
    x = kaggle2022_q1_grouped.index,
    height = kaggle2022_q1_grouped.values
)
Out[25]:
<BarContainer object of 11 artists>

Prettifying ggplot in plotnine¶

In [26]:
# Here we change the default axis labels and then apply the black-and-white theme
q1_plot_pretty = q1_plot +\
    labs(x = 'Age group', y = 'respondents') +\
    theme_bw()
q1_plot_pretty
Out[26]:
<ggplot: (8791533522590)>
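
As noted under 'Grammar of graphics', layers are ordinary Python objects, so they can be stored and recycled across plots; a minimal sketch (the q4_plot name and 'Education' labels are made up for illustration):

bw_theme = theme_bw()
q4_plot = ggplot(kaggle2022) + geom_bar(aes(x = 'Q4')) +\
    labs(x = 'Education', y = 'respondents') +\
    bw_theme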

Other geometric objects (geom_)¶

Method Description
geom_bar(), geom_col() Bar charts
geom_boxplot() Box and whisker plot
geom_histogram() Histogram
geom_point() Scatterplot
geom_line(), geom_path() Lines
geom_map() Geographic areas
geom_smooth() Smoothed conditional means
geom_violin() Violin plots

Writing plots out in plotnine¶

  • The output format is automatically determined from the file extension
  • Commonly used formats are PDF, PNG and EPS
In [27]:
q1_plot_pretty.save('../temp/q1_plot_pretty.pdf')
/home/tp1587/Decrypted/Git/Python_Social_Data_Science/venv/lib/python3.10/site-packages/plotnine/ggplot.py:718: PlotnineWarning: Saving 6.4 x 4.8 in image.
/home/tp1587/Decrypted/Git/Python_Social_Data_Science/venv/lib/python3.10/site-packages/plotnine/ggplot.py:719: PlotnineWarning: Filename: ../temp/q1_plot_pretty.pdf
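
The save() method also accepts explicit size and resolution arguments; a minimal sketch writing the same plot as a PNG (the width/height/dpi values here are arbitrary choices):

q1_plot_pretty.save('../temp/q1_plot_pretty.png', width = 8, height = 6, dpi = 300)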

Additional visualization materials¶

Books:

  • Healy, Kieran. 2019. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton University Press
  • Tufte, Edward. 2001. The Visual Display of Quantitative Information. 2nd ed. Cheshire, CT: Graphics Press

Online:

  • Plotnine: Grammar of Graphics for Python
  • Plotnine Documentation
  • Ggplot2 Documentation

Regression analysis¶

Linear regression

Source: Twitter

Anscombe's quartet¶

  • 4 artificial datasets constructed by Anscombe (1973)
  • All of them have nearly identical summary statistics
  • But show dramatically different relationships between variables
  • Designed to illustrate the importance of data visualization

Data for Anscombe's quartet¶

In [28]:
import pandas as pd
anscombe_quartet = pd.read_csv('../data/anscombes_quartet.csv')
In [29]:
anscombe_quartet.head()
Out[29]:
dataset x y
0 I 10 8.04
1 I 8 6.95
2 I 13 7.58
3 I 9 8.81
4 I 11 8.33

Summary statistics for Anscombe's quartet¶

In [30]:
# Here we use `groupby` method to create summary by a variable ('dataset')
anscombe_quartet.groupby(['dataset']).describe()
Out[30]:
x y
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
dataset
I 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 2.031568 4.26 6.315 7.58 8.57 10.84
II 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 2.031657 3.10 6.695 8.14 8.95 9.26
III 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500000 2.030424 5.39 6.250 7.11 7.98 12.74
IV 11.0 9.0 3.316625 8.0 8.0 8.0 8.0 19.0 11.0 7.500909 2.030579 5.25 6.170 7.04 8.19 12.50

Plotting Anscombe's quartet¶

In [31]:
from plotnine import *

ggplot(anscombe_quartet, aes(x = 'x', y = 'y')) +\
    geom_point(colour = 'red') +\
    geom_smooth(method = 'lm', se = False, fullrange = True) +\
    facet_wrap('dataset') +\
    theme_bw()                                     
Out[31]:
<ggplot: (8791533512158)>

Linear regression¶

  • Linear regression is the classical tool of statistical analysis.
  • It allows us to estimate the degree of association between variables.
  • Typically, it is the association between one or more independent variables (IV) and one dependent variable (DV).
  • The main quantities of interest are usually the direction and magnitude of the association and its statistical significance.

Linear regression in Python¶

  • As with tabular data and visualization, we need external libraries to run regressions.
  • statsmodels library provides tools for estimating many statistical models.
  • Another useful library is scikit-learn.
  • It is more focussed on machine-learning applications.
In [32]:
import statsmodels.api as sm
import statsmodels.formula.api as smf # Formula API provides R-style formula specification

Data transformation¶

In [33]:
kaggle2022 = pd.read_csv('../data/kaggle_survey_2021_responses.csv', skiprows = [1])
/tmp/ipykernel_779556/2819004467.py:1: DtypeWarning: Columns (195,201) have mixed types. Specify dtype option on import or set low_memory=False.
In [34]:
# Let's give more intuitive names to our variables
kaggle2022 = kaggle2022.rename(columns = {
                'Q1': 'age',
                'Q2': 'gender',
                'Q3': 'country',
                'Q4': 'education',
                'Q25': 'compensation'})
In [35]:
kaggle2022['compensation'].head(n = 2)
Out[35]:
0    25,000-29,999
1    60,000-69,999
Name: compensation, dtype: object
In [36]:
from statistics import mean
# Here we are replacing the compensation range by its midpoint (i.e. 112499.5 for $100,000-$124,999)
# This variable requires some cleaning before the transformation:
# extraneous symbols ('$', ',', '>') have to be removed
kaggle2022['compensation'] = kaggle2022['compensation'].map(
    lambda x: mean([float(part.replace(',','').replace('$','').replace('>','')) for part in str(x).split('-')])
)

Pandas and linear regression¶

In [37]:
# Level of compensation (in USD, our DV)
kaggle2022['compensation']
Out[37]:
0        27499.5
1        64999.5
2          499.5
3        34999.5
4        34999.5
          ...   
25968    17499.5
25969        NaN
25970      499.5
25971        NaN
25972      499.5
Name: compensation, Length: 25973, dtype: float64
In [38]:
# Frequencies of gender categories (our IV)
kaggle2022['gender'].value_counts()
Out[38]:
Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: gender, dtype: int64

Formula specification¶

In [39]:
# Formula specification allows us to write
# 'DV ~ IV_1 + IV_2 + ... + IV_N' as model specification
fit1 = smf.ols('compensation ~ gender', data = kaggle2022).fit()

Model summary¶

In [40]:
fit1.summary()
Out[40]:
OLS Regression Results
Dep. Variable: compensation R-squared: 0.007
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 26.46
Date: Sun, 20 Nov 2022 Prob (F-statistic): 6.64e-22
Time: 15:32:20 Log-Likelihood: -1.9707e+05
No. Observations: 15391 AIC: 3.941e+05
Df Residuals: 15386 BIC: 3.942e+05
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.593e+04 782.805 58.672 0.000 4.44e+04 4.75e+04
gender[T.Nonbinary] 7.016e+04 1.29e+04 5.455 0.000 4.49e+04 9.54e+04
gender[T.Prefer not to say] 2.166e+04 6335.402 3.419 0.001 9242.046 3.41e+04
gender[T.Prefer to self-describe] 2.498e+04 1.8e+04 1.389 0.165 -1.03e+04 6.02e+04
gender[T.Woman] -1.466e+04 1932.351 -7.588 0.000 -1.84e+04 -1.09e+04
Omnibus: 18458.954 Durbin-Watson: 2.001
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2343660.164
Skew: 6.472 Prob(JB): 0.00
Kurtosis: 62.051 Cond. No. 25.7


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
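
The individual quantities of interest can also be pulled out of the fitted model directly, rather than read off the summary table; a short sketch:

fit1.params       # coefficient estimates (direction and magnitude)
fit1.conf_int()   # 95% confidence intervals
fit1.pvalues      # p-values (statistical significance)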

Multiple linear regression¶

In [41]:
# Let's now also control for age and education
fit2 = (
    smf
    .ols('compensation ~ gender + age + education', data = kaggle2022)
    .fit()
)

Multiple linear regression continued¶

In [42]:
fit2.summary()
Out[42]:
OLS Regression Results
Dep. Variable: compensation R-squared: 0.068
Model: OLS Adj. R-squared: 0.067
Method: Least Squares F-statistic: 55.98
Date: Sun, 20 Nov 2022 Prob (F-statistic): 1.77e-216
Time: 15:32:20 Log-Likelihood: -1.9658e+05
No. Observations: 15391 AIC: 3.932e+05
Df Residuals: 15370 BIC: 3.934e+05
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.483e+04 2910.746 5.093 0.000 9120.353 2.05e+04
gender[T.Nonbinary] 6.443e+04 1.25e+04 5.163 0.000 4e+04 8.89e+04
gender[T.Prefer not to say] 2.033e+04 6157.104 3.302 0.001 8261.253 3.24e+04
gender[T.Prefer to self-describe] 3.486e+04 1.74e+04 1.998 0.046 667.732 6.91e+04
gender[T.Woman] -1.1e+04 1884.889 -5.838 0.000 -1.47e+04 -7309.568
age[T.22-24] 2313.5562 3387.043 0.683 0.495 -4325.450 8952.562
age[T.25-29] 8811.2235 3235.996 2.723 0.006 2468.289 1.52e+04
age[T.30-34] 2.427e+04 3347.112 7.251 0.000 1.77e+04 3.08e+04
age[T.35-39] 3.405e+04 3485.133 9.770 0.000 2.72e+04 4.09e+04
age[T.40-44] 4.192e+04 3656.035 11.465 0.000 3.48e+04 4.91e+04
age[T.45-49] 5.368e+04 3872.345 13.862 0.000 4.61e+04 6.13e+04
age[T.50-54] 5.354e+04 4211.610 12.713 0.000 4.53e+04 6.18e+04
age[T.55-59] 6.666e+04 4792.714 13.910 0.000 5.73e+04 7.61e+04
age[T.60-69] 5.718e+04 4971.283 11.502 0.000 4.74e+04 6.69e+04
age[T.70+] 6.918e+04 9180.635 7.535 0.000 5.12e+04 8.72e+04
education[T.Doctoral degree] 1.187e+04 2322.458 5.111 0.000 7318.325 1.64e+04
education[T.I prefer not to answer] -1.029e+04 4857.239 -2.118 0.034 -1.98e+04 -769.130
education[T.Master’s degree] 7168.1892 1666.521 4.301 0.000 3901.611 1.04e+04
education[T.No formal education past high school] -1.059e+04 5801.400 -1.825 0.068 -2.2e+04 782.116
education[T.Professional doctorate] 8857.2561 5216.498 1.698 0.090 -1367.698 1.91e+04
education[T.Some college/university study without earning a bachelor’s degree] -912.0023 3378.159 -0.270 0.787 -7533.593 5709.588
Omnibus: 19216.780 Durbin-Watson: 2.008
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2958551.621
Skew: 6.892 Prob(JB): 0.00
Kurtosis: 69.509 Cond. No. 30.2


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
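
A minimal sketch of comparing the two fitted models on overall fit statistics:

fit1.rsquared_adj, fit2.rsquared_adj   # adjusted R-squared
fit1.aic, fit2.aic                     # Akaike information criterion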

Markdown - a language of reports¶

  • Markdown is a markup language for formatting text with simple syntax.
  • The key goal of Markdown is readability.
  • Only a limited set of formatting options is supported.
  • Markdown is used in online documentation, blogging and instant messaging.

Formatting text in Markdown¶

  • For italics *one star on each side*
  • For bold **two stars on each side**
  • For strikethrough ~~two tildes on each side~~

Lists in Markdown¶

For bulleted or unordered list of items:

- Just add a dash first and then write the text.
- If you add another dash in the following line, you will have another item in the list.
  - If you add four spaces or use a tab key, you will create an indented list.

For numbered or ordered list of items:

1. Just type a number and then write the text.
2. If you want to add a second item, just type in another number.
1. If you make a mistake when typing numbers, fear not, Markdown will correct it for you.
    1. If you press a tab key or type four spaces, you will get an indented list and the numbering
    will start from scratch.

Headers in Markdown¶

Headers or section titles are created with hashes (#)

# This is a first-tier header

## This is a second-tier header

### This is a third-tier header

Images and links in Markdown¶

  • To add an image you can write ![some text](image_path)
  • To add a link you can write [some text](URL)
  • For more complex cases HTML code can be used

Tables in Markdown¶

  • Tables in Markdown can be created using the following syntax (there are a few variants)
| Header1 | Header2 |
|:--------|:--------|
| content | content |
  • :--- produces left-aligned text in cells
  • ---: produces right-aligned text in cells
  • :--: produces centered text in cells

Markdown tables in pandas¶

  • pandas can generate Markdown tables from a DataFrame
In [43]:
# Let's revisit the summary statistics of Anscombe's quartet
anscombe_quartet.groupby(['dataset']).describe().iloc[:,0:3]
Out[43]:
x
count mean std
dataset
I 11.0 9.0 3.316625
II 11.0 9.0 3.316625
III 11.0 9.0 3.316625
IV 11.0 9.0 3.316625
In [44]:
print(anscombe_quartet.groupby(['dataset']).describe().iloc[:,0:3].to_markdown(index = False))
|   ('x', 'count') |   ('x', 'mean') |   ('x', 'std') |
|-----------------:|----------------:|---------------:|
|               11 |               9 |        3.31662 |
|               11 |               9 |        3.31662 |
|               11 |               9 |        3.31662 |
|               11 |               9 |        3.31662 |
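
The tuple-style headers come from the MultiIndex columns produced by describe(), and index = False drops the 'dataset' labels; a minimal sketch of flattening the columns and keeping the dataset as a regular column before writing out the Markdown:

anscombe_x = anscombe_quartet.groupby(['dataset']).describe().iloc[:,0:3]
anscombe_x.columns = ['_'.join(col) for col in anscombe_x.columns]   # e.g. 'x_count'
print(anscombe_x.reset_index().to_markdown(index = False))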

The end¶