POP77001 Computer Programming for Social Scientists
The pandas library has become the de facto standard for data manipulation. pandas is built upon (and often used in conjunction with) other computational libraries: scipy (linear algebra), statsmodels (statistical models) and scikit-learn (machine learning).

pandas Object Types

fruit weight berry
0 apple 150.0 False
1 banana 120.0 True
2 watermelon 3000.0 True
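
The table printed above can be rebuilt with a few lines of code. The following is a minimal sketch, assuming the data were created from a dictionary of columns; the variable names `df` and `weight` are illustrative and not taken from the original slides.

```python
import pandas as pd

# A Series is a one-dimensional labelled array
weight = pd.Series([150.0, 120.0, 3000.0], name="weight")

# A DataFrame is a two-dimensional table; each column is a Series
df = pd.DataFrame({
    "fruit": ["apple", "banana", "watermelon"],
    "weight": [150.0, 120.0, 3000.0],
    "berry": [False, True, True],
})

print(df)
print(type(df["weight"]))  # selecting one column returns a Series
print(df.index)            # row labels (a default integer range here)
```

Printing `df.index` shows the third object type: the Index that holds row (or column) labels.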
DataFrame.iloc[] selects by integer position (index location).
DataFrame.loc[] selects by label (label location).
Both are indexers used with square brackets rather than methods called with parentheses.

| Expression | Selection Operation |
|---|---|
| df[val] | Column or sequence of columns, plus convenience shortcuts (e.g. slicing rows) |
| df.loc[lab_i] | Row or subset of rows by label |
| df.loc[:, lab_j] | Column or subset of columns by label |
| df.loc[lab_i, lab_j] | Both rows and columns by label |
| df.iloc[i] | Row or subset of rows by integer position |
| df.iloc[:, j] | Column or subset of columns by integer position |
| df.iloc[i, j] | Both rows and columns by integer position |
| df.at[lab_i, lab_j] | Single scalar value by row and column label |
| df.iat[i, j] | Single scalar value by row and column integer position |
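
To make the difference between label-based and position-based selection visible, here is a short sketch that applies the expressions from the table to the fruit data. The string row labels `"a"`, `"b"`, `"c"` are an assumption added for illustration (the original data uses the default integer index, where labels and positions coincide).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "fruit": ["apple", "banana", "watermelon"],
        "weight": [150.0, 120.0, 3000.0],
        "berry": [False, True, True],
    },
    index=["a", "b", "c"],  # hypothetical labels, chosen to differ from positions
)

col = df["weight"]                                  # column by name
row_lab = df.loc["b"]                               # row by label
sub_lab = df.loc[["a", "b"], ["fruit", "weight"]]   # rows and columns by label
row_pos = df.iloc[1]                                # row by integer position
sub_pos = df.iloc[:2, :2]                           # rows and columns by position
val_lab = df.at["c", "fruit"]                       # scalar by label
val_pos = df.iat[2, 0]                              # scalar by position

print(sub_lab)
print(val_lab, val_pos)
```

Note that slicing with `.iloc` excludes the end position, as in ordinary Python slicing, whereas label slices with `.loc` include the end label.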
fruit weight berry
0 apple 150.0 False
1 banana 120.0 True
Index(['fruit', 'weight', 'berry'], dtype='object')
fruit weight berry
0 apple 150.0 False
fruit weight_g berry
0 apple 150.0 False
1 banana 120.0 True
2 watermelon 3000.0 True
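
The code that produced the four outputs above did not survive, so the following is only a plausible reconstruction. It assumes the outputs came from `head()`, the `columns` attribute, a boolean filter, and `rename()`; in particular, the filter on the `berry` column is a guess that happens to return the same single row.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "watermelon"],
    "weight": [150.0, 120.0, 3000.0],
    "berry": [False, True, True],
})

first_two = df.head(2)             # first two rows
column_labels = df.columns         # Index of column names
not_berries = df[~df["berry"]]     # assumed filter: keep rows where berry is False
renamed = df.rename(columns={"weight": "weight_g"})  # returns a new DataFrame

print(first_two)
print(column_labels)
print(not_berries)
print(renamed)
```

`rename()` does not modify `df` in place; the result has to be assigned to a name to be kept.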
map() method

Like the pipe operators in R (%>% and |>), pandas provides a pipe() method to chain operations.

fruit weight_g berry weight_oz
0 Apple 150.0 False 6.0
1 Banana 120.0 True 4.8
2 Watermelon 3000.0 True 120.0
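
A chain that yields the output above might look like the sketch below. The helper `add_weight_oz` and its `g_per_oz` parameter are hypothetical names. The divisor of 25 is inferred from the printed values (150.0 g shown as 6.0 oz) and is a rough conversion factor, since one ounce is closer to 28.35 g.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "watermelon"],
    "weight_g": [150.0, 120.0, 3000.0],
    "berry": [False, True, True],
})

def add_weight_oz(data, g_per_oz=25.0):
    """Hypothetical helper: return a copy with an approximate weight in ounces."""
    return data.assign(weight_oz=data["weight_g"] / g_per_oz)

result = (
    df
    # map() applies a function to each element of a Series
    .assign(fruit=lambda d: d["fruit"].map(str.capitalize))
    # pipe() passes the whole DataFrame to a function and returns its result
    .pipe(add_weight_oz)
)

print(result)
```

Because each step returns a new DataFrame, the original `df` is left unchanged and the chain can be read top to bottom.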
| Format | Readability | Platform | Speed | Compression | Persistence |
|---|---|---|---|---|---|
| csv | ✅ Human-readable | ✅ Cross-platform | ❌ Slow | ❌ No | ✅ Long-term |
| rds | ❌ Binary | ❌ R only | ✅ Fast | ✅ Yes | ✅ Long-term |
| pickle | ❌ Binary | ❌ Python only | ✅ Fast | ✅ Yes | ✅ Long-term |
| parquet | ❌ Binary | ✅ Cross-platform | ✅ Fast | ✅ Yes | ✅ Long-term |
| feather | ❌ Binary | ✅ Cross-platform | ✅ Fast | ✅ Yes | ❌ Short-term |
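
The readability column can be checked directly by writing the same data in two formats and looking at the raw file contents. This is a small sketch using a temporary directory; the file names are illustrative, and only csv and pickle are shown because they need no extra dependencies.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "watermelon"],
    "weight": [150.0, 120.0, 3000.0],
    "berry": [False, True, True],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "fruits.csv"   # illustrative file names
    pkl_path = Path(tmp) / "fruits.pkl"

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    csv_text = csv_path.read_text()       # plain text, readable in any editor
    pkl_bytes = pkl_path.read_bytes()     # binary, only meaningful to Python
    from_pickle = pd.read_pickle(pkl_path)

print(csv_text)
print(pkl_bytes[:20])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.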
File connections are opened with the built-in function open():

<variable_name> = open(<filepath>, <mode>)

The mode can be:

- r: read a file (default);
- w: write an object to a file;
- x: exclusively create, failing if a file exists;
- a: append to a file;
- r+: read and write to a file.

A with statement can be used to close the file automatically once the block finishes.

pandas provides high-level methods that take care of file connections. These methods follow the read_<format> and to_<format> name patterns:

<variable_name> = pd.read_<format>(<filepath>)
<variable_name>.to_<format>(<filepath>)
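
As a concrete illustration of the file modes and of the read_<format>/to_<format> pattern, the sketch below writes and reads a small text file with `open()` and then round-trips a DataFrame through csv. File names and contents are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana"], "weight": [150.0, 120.0]})

with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "notes.txt"    # illustrative file name

    with open(txt_path, "w") as f:        # "w": create or overwrite
        f.write("first line\n")
    with open(txt_path, "a") as f:        # "a": append to the end
        f.write("second line\n")
    with open(txt_path, "r") as f:        # "r": read (the default mode)
        lines = f.readlines()
    file_closed = f.closed                # the with block closed the file

    exclusive_failed = False
    try:
        open(txt_path, "x")               # "x": fails because the file exists
    except FileExistsError:
        exclusive_failed = True

    # pandas opens and closes the file itself
    csv_path = Path(tmp) / "fruits.csv"
    df.to_csv(csv_path, index=False)
    df_back = pd.read_csv(csv_path)

print(lines)
print(df_back)
```

Passing `index=False` to `to_csv()` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.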
Duration (in seconds) ... Q44_12
Duration (in seconds) ... Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
0 121 ... NaN
1 462 ... NaN
2 293 ... NaN
3 851 ... NaN
4 232 ... NaN
[5 rows x 296 columns]
Duration (in seconds) ... Q44_12
Duration (in seconds) ... Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
23992 331 ... NaN
23993 330 ... NaN
23994 860 ... NaN
23995 597 ... NaN
23996 303 ... Other
[5 rows x 296 columns]
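
The two header lines in the output above suggest that the survey file has two header rows (a question code and the question text). The call that loaded it is not preserved, so the following sketch only assumes `pd.read_csv()` with `header=[0, 1]`, applied to a tiny in-memory stand-in whose rows are made up.

```python
import io

import pandas as pd

# Made-up stand-in for the survey file: two header rows, then data rows
raw = io.StringIO(
    "Duration (in seconds),Q3\n"
    "Duration (in seconds),What is your gender? - Selected Choice\n"
    "121,Man\n"
    "462,Woman\n"
    "293,Man\n"
    "303,Man\n"
)

# header=[0, 1] turns the first two rows into a two-level column index
survey = pd.read_csv(raw, header=[0, 1])

print(survey.head(2))   # first two rows
print(survey.tail(2))   # last two rows
print(survey.columns.nlevels)
```

Columns are then addressed by a tuple of both levels, for example `survey[("Q3", "What is your gender? - Selected Choice")]`.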
Duration (in seconds) ... Q44_11
Duration (in seconds) ... Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - None
count 2.399700e+04 ... 0.0
mean 1.009010e+04 ... NaN
std 1.115403e+05 ... NaN
min 1.200000e+02 ... NaN
25% 2.640000e+02 ... NaN
50% 4.140000e+02 ... NaN
75% 7.150000e+02 ... NaN
max 2.533678e+06 ... NaN
[8 rows x 17 columns]
Duration (in seconds) ... Q44_12
Duration (in seconds) ... Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Other
count 2.399700e+04 ... 835
unique NaN ... 1
top NaN ... Other
freq NaN ... 835
mean 1.009010e+04 ... NaN
std 1.115403e+05 ... NaN
min 1.200000e+02 ... NaN
25% 2.640000e+02 ... NaN
50% 4.140000e+02 ... NaN
75% 7.150000e+02 ... NaN
max 2.533678e+06 ... NaN
[11 rows x 296 columns]
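
The 8-row and 11-row summaries above match what `describe()` returns with its default settings and with `include="all"`. A minimal sketch on a made-up two-column table (the durations reuse the first five values shown earlier, the gender answers are invented):

```python
import pandas as pd

survey = pd.DataFrame({
    "duration": [121, 462, 293, 851, 232],
    "gender": ["Man", "Woman", "Man", "Man", "Woman"],  # invented answers
})

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = survey.describe()

# include="all": adds unique, top and freq for non-numeric columns
full_summary = survey.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column's type are filled with NaN, which is why the output above shows NaN for mean on text columns and for unique on numeric ones.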
(Q3, What is your gender? - Selected Choice)
Man 18266
Woman 5286
Prefer not to say 334
Nonbinary 78
Prefer to self-describe 33
Name: count, dtype: int64
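
A frequency table like the one above is what `value_counts()` returns for a single column. A short sketch on invented answers:

```python
import pandas as pd

# Invented answers standing in for the survey's gender question
gender = pd.Series(
    ["Man", "Woman", "Man", "Nonbinary", "Man", "Woman"],
    name="gender",
)

counts = gender.value_counts()                 # sorted from most to least frequent
shares = gender.value_counts(normalize=True)   # proportions instead of counts

print(counts)
print(shares)
```

Missing values are excluded by default; passing `dropna=False` counts them as a separate category.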
| Method | Numeric | Categorical | Description |
|---|---|---|---|
| count | yes | yes | Number of non-NA observations |
| value_counts | yes | yes | Number of observations for each unique value |
| describe | yes | yes | Set of summary statistics for Series/DataFrame |
| min, max | yes | yes (caution) | Minimum and maximum values |
| quantile | yes | no | Sample quantile, for a probability between 0 and 1 |
| sum | yes | yes (caution) | Sum of values |
| prod | yes | no | Product of values |
| mean | yes | no | Mean |
| median | yes | no | Median (50% quantile) |
| var | yes | no | Sample variance |
| std | yes | no | Sample standard deviation |
| skew | yes | no | Sample skewness (third moment) |
| kurt | yes | no | Sample kurtosis (fourth moment) |
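
Several of the methods in the table can be tried on a single column. The sketch below uses the first five durations shown earlier as the numeric example and a few invented text answers to show why `min` and `max` need caution on categorical data.

```python
import pandas as pd

duration = pd.Series([121, 462, 293, 851, 232], name="duration")
answers = pd.Series(["Woman", "Man", "Nonbinary"], name="gender")  # invented

n = duration.count()            # number of non-NA observations
avg = duration.mean()
med = duration.median()         # same as the 0.5 quantile
q25 = duration.quantile(0.25)   # quantile takes a probability between 0 and 1
var = duration.var()            # sample variance (divides by n - 1)
std = duration.std()            # sample standard deviation

# Caution: on text data min and max follow alphabetical order,
# which rarely has a substantive meaning
first_alpha = answers.min()
last_alpha = answers.max()

print(n, avg, med, q25, var, std)
print(first_alpha, last_alpha)
```

The sample variance and standard deviation use `ddof=1` by default, unlike the NumPy equivalents, which default to `ddof=0`.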