Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step.
import pandas as pd
# This time let's skip the 2nd row, which contains questions
kaggle2022 = pd.read_csv(
'../data/kaggle_survey_2021_responses.csv',
skiprows = [1]
)
kaggle2022.head(n = 1)
/tmp/ipykernel_779556/1478523914.py:2: DtypeWarning: Columns (195,201) have mixed types. Specify dtype option on import or set low_memory=False.
  kaggle2022 = pd.read_csv(
| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 910 | 50-54 | Man | India | Bachelor’s degree | Other | 5-10 years | Python | R | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 369 columns
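The DtypeWarning above arises because pandas, by default, parses large files in chunks and infers a type for each chunk separately. One common way to avoid the guesswork is to state the type of the affected columns up front. The following is a minimal sketch on a small in-memory CSV; the column names and values are made up for illustration and are not taken from the survey file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file:
# a header row, a row of question text, then two responses
csv_text = (
    "Duration,Q1,Q2\n"
    "Duration (in seconds),What is your age?,What is your gender?\n"
    "910,50-54,Man\n"
    "784,25-29,Woman\n"
)

# Skip the question row and declare the answer columns as strings,
# so pandas only has to infer the type of the Duration column
toy = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=[1],
    dtype={"Q1": str, "Q2": str},
)

print(toy.dtypes)
```

Passing `low_memory=False`, as the warning suggests, is the other option: pandas then reads the whole file before deciding on a type, at the cost of more memory.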
# We will load the questions as a separate dataset
kaggle2022_qs = pd.read_csv(
'../data/kaggle_survey_2021_responses.csv',
nrows = 1
)
kaggle2022_qs
| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education ... | Select the title most similar to your current ... | For how many years have you been writing code ... | What programming languages do you use on a reg... | What programming languages do you use on a reg... | What programming languages do you use on a reg... | ... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... |
1 rows × 369 columns
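A one-row data frame of questions is awkward to read because the text gets truncated on display. One possible way to make it easier to consult is to turn that row into a dictionary keyed by column code. The sketch below uses a small made-up CSV rather than the survey file, and `question_lookup` is a name chosen here for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file:
# a header row, a row of question text, then two responses
csv_text = (
    "Duration,Q1,Q2\n"
    "Duration (in seconds),What is your age?,What is your gender?\n"
    "910,50-54,Man\n"
    "784,25-29,Woman\n"
)

# nrows=1 reads only the first row after the header: the question text
questions = pd.read_csv(io.StringIO(csv_text), nrows=1)

# Map each column code to the full text of its question
question_lookup = questions.iloc[0].to_dict()

print(question_lookup["Q2"])
```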
# pd.DataFrame.describe() provides a range of summary statistics
kaggle2022.describe()
| | Time from Start to Finish (seconds) | Q30_B_Part_1 | Q30_B_Part_2 | Q30_B_Part_3 | Q30_B_Part_4 | Q30_B_Part_5 | Q30_B_Part_6 | Q30_B_Part_7 | Q30_B_OTHER |
---|---|---|---|---|---|---|---|---|---|
count | 2.597300e+04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
mean | 1.105466e+04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1.014716e+05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 1.200000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 4.430000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 6.560000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 1.038000e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 2.488653e+06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
kaggle2022.iloc[:,0].mean() # Rather than using describe(), we can apply individual methods
11054.66492126439
kaggle2022.iloc[:,0].median() # Median
656.0
kaggle2022.iloc[:,0].std() # Standard deviation
101471.6221245172
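The gap between the mean (about 11,055 seconds) and the median (656 seconds) in the output above suggests that a few very long durations pull the mean upwards. A small sketch with made-up numbers shows the same effect; the values below are invented and do not come from the survey.

```python
import pandas as pd

# Made-up completion times in seconds; the last value plays the role of
# a respondent who left the survey open for days
durations = pd.Series([400, 500, 650, 700, 900, 250_000])

mean_duration = durations.mean()      # dragged far upwards by one extreme value
median_duration = durations.median()  # stays close to the typical respondent

print(mean_duration, median_duration)
```

For heavily skewed variables like this one, the median is usually the safer summary of a typical value.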
import statistics ## We don't have to rely only on methods provided by `pandas`
statistics.stdev(kaggle2022.iloc[:,0])
101471.6221245172
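The two results agree because `pandas.Series.std()` and `statistics.stdev()` both compute the sample standard deviation, dividing by n - 1. Not every library shares that default: `numpy.std()` divides by n unless told otherwise. A short sketch on made-up values illustrates the difference.

```python
import statistics

import numpy as np
import pandas as pd

# Made-up values, chosen so the population standard deviation is exactly 2
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = pd.Series(values)

sample_sd = s.std()                    # pandas default: ddof=1 (divide by n - 1)
stdlib_sd = statistics.stdev(values)   # standard library: also n - 1
population_sd = np.std(values)         # NumPy default: ddof=0 (divide by n)
pandas_population_sd = s.std(ddof=0)   # pandas can match NumPy via ddof

print(sample_sd, stdlib_sd, population_sd, pandas_population_sd)
```

When comparing numbers across libraries, it is worth checking which divisor each one uses before concluding that the results disagree.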
# Adding include = 'all' tells pandas to summarize all variables
kaggle2022.describe(include = 'all')
| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.597300e+04 | 25973 | 25973 | 25973 | 25973 | 25973 | 25973 | 21860 | 5334 | 10756 | ... | 633 | 591 | 4239 | 729 | 737 | 1020 | 666 | 2747 | 4542 | 377 |
unique | NaN | 11 | 5 | 66 | 7 | 15 | 7 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
top | NaN | 25-29 | Man | India | Master’s degree | Student | 1-3 years | Python | R | SQL | ... | Comet.ml | Sacred + Omniboard | TensorBoard | Guild.ai | Polyaxon | ClearML | Domino Model Monitor | MLflow | None | Other |
freq | NaN | 4931 | 20598 | 7434 | 10132 | 6804 | 7874 | 21860 | 5334 | 10756 | ... | 633 | 591 | 4239 | 729 | 737 | 1020 | 666 | 2747 | 4542 | 377 |
mean | 1.105466e+04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1.014716e+05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 1.200000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 4.430000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 6.560000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 1.038000e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 2.488653e+06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 rows × 369 columns
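With include = 'all' the numeric and categorical statistics share one table, so many cells are NaN. If only the categorical summary is of interest, the `exclude` argument of `describe()` can drop the numeric columns. The sketch below uses a small made-up data frame, not the survey data.

```python
import numpy as np
import pandas as pd

# Made-up data with one numeric and two categorical columns
toy = pd.DataFrame({
    "duration": [910, 784, 1020, 656],
    "age": ["50-54", "25-29", "25-29", "30-34"],
    "country": ["India", "India", "Brazil", "India"],
})

# Default behaviour: numeric columns only
numeric_summary = toy.describe()

# Everything except numeric columns: count, unique, top, freq
categorical_summary = toy.describe(exclude=[np.number])

print(categorical_summary)
```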
kaggle2022.iloc[:,2].mode() # Mode, most frequent value
0    Man
Name: Q2, dtype: object
kaggle2022.iloc[:,2].value_counts() # Counts of unique values
Man                        20598
Woman                       4890
Prefer not to say            355
Nonbinary                     88
Prefer to self-describe       42
Name: Q2, dtype: int64
# We can normalize the counts into proportions of the non-missing values
kaggle2022.iloc[:,2].value_counts(normalize = True)
Man                        0.793054
Woman                      0.188272
Prefer not to say          0.013668
Nonbinary                  0.003388
Prefer to self-describe    0.001617
Name: Q2, dtype: float64
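Q2 has no missing answers, so these proportions are shares of all rows. For a column with missing values the denominator matters: by default `value_counts()` drops missing values before counting, and `dropna=False` keeps them. A minimal sketch on made-up answers:

```python
import pandas as pd

# Made-up answers with one missing value
answers = pd.Series(["Man", "Woman", "Man", None, "Man", "Woman"])

counts = answers.value_counts()                       # missing value is ignored
shares = answers.value_counts(normalize=True)         # shares of the 5 non-missing answers
shares_with_na = answers.value_counts(normalize=True, dropna=False)  # shares of all 6 rows

print(shares)
print(shares_with_na)
```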
Method | Numeric | Categorical | Description |
---|---|---|---|
count | yes | yes | Number of non-NA observations |
value_counts | yes | yes | Number of occurrences of each unique value |
describe | yes | yes | Set of summary statistics for Series/DataFrame |
min, max | yes | yes (caution) | Minimum and maximum values |
quantile | yes | no | Sample quantile for a probability between 0 and 1 |
sum | yes | yes (caution) | Sum of values |
prod | yes | no | Product of values |
mean | yes | no | Mean |
median | yes | no | Median (50% quantile) |
var | yes | no | Sample variance |
std | yes | no | Sample standard deviation |
skew | yes | no | Sample skewness (third standardized moment) |
kurt | yes | no | Sample excess kurtosis (fourth standardized moment) |
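The "caution" entries in the table deserve a short illustration. On string data, `min` and `max` compare values alphabetically and `sum` joins the strings together, which is rarely what an analyst wants. The sketch below uses made-up values and an explicit `dtype=object` so the string column behaves like the survey columns shown earlier.

```python
import pandas as pd

numeric = pd.Series([3, 1, 2])
text = pd.Series(["b", "a", "c"], dtype=object)

numeric_min = numeric.min()  # smallest number
numeric_sum = numeric.sum()  # arithmetic sum

text_min = text.min()        # first value in alphabetical order
text_sum = text.sum()        # strings joined end to end

print(numeric_min, numeric_sum, text_min, text_sum)
```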
pd.DataFrame.pivot()
pd.DataFrame.melt()
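The two methods are inverses of each other: `melt()` reshapes a wide table into a long one, and `pivot()` spreads a long table back out. A minimal sketch on a made-up table follows; the `respondent` column and the values are hypothetical, with column names borrowed from the survey for flavour.

```python
import pandas as pd

# Made-up wide table: one row per respondent, one column per answer option
wide = pd.DataFrame({
    "respondent": [1, 2],
    "Q7_Part_1": ["Python", "Python"],
    "Q7_Part_2": ["R", None],
})

# Wide to long: one row per respondent and question
long = wide.melt(id_vars="respondent", var_name="question", value_name="answer")

# Long back to wide: one column per question again
back = long.pivot(index="respondent", columns="question", values="answer")

print(long)
print(back)
```

The long form is often more convenient for multiple-choice questions such as Q7, where each answer option occupies its own column in the raw file.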