x y z
1 1 a TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d TRUE
POP77001 Computer Programming for Social Scientists
tidyverse packages
A data frame is created with the data.frame() function, with named vectors as input.
'data.frame': 4 obs. of 3 variables:
$ x: int 1 2 3 4
$ y: chr "a" "b" "c" "d"
$ z: logi TRUE FALSE FALSE TRUE
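The structure printed above can be reproduced with a short sketch. The object name df is an assumption made for illustration, and stringsAsFactors = FALSE is spelled out only so that y stays a character vector on older versions of R as well.

```r
# Build a data frame from three named vectors of equal length.
df <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep y as character (default since R 4.0.0)
)

# Inspect dimensions and column types.
dim(df)
str(df)
```

Each vector becomes one column, and the vector names become the column names.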
$x
[1] 1 2 3 4 5
$y
[1] "a" "b" "c" "d" "e"
$z
[1] TRUE FALSE TRUE FALSE TRUE
data.frame[[index]]
data.frame[index]
data.frame$name
data.frame[vector_1, vector_2]
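As a minimal sketch of the four subsetting forms listed above, assuming a small data frame called df (a name chosen here for illustration):

```r
df <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

col_vec  <- df[["y"]]                # [[index]] returns the column as a vector
col_df   <- df["y"]                  # [index] returns a one-column data frame
col_name <- df$z                     # $name returns the column as a vector
sub      <- df[c(1, 4), c("x", "z")] # [rows, columns] selects a block
true_rows <- df[df$z, ]              # logical vector selects matching rows
```

The difference between [[ ]] and [ ] matters in practice: the first drops the data frame wrapper, the second keeps it.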
rbind() (row bind) - appends a row to a data frame
cbind() (column bind) - appends a column to a data frame
Note that a row has to be a list, as it contains different data types.
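A short sketch of both functions follows. The names df, new_row and w are assumptions made for illustration; the new row is passed as a named list so that each value keeps its own type.

```r
df <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A row mixes types, so it is supplied as a list rather than a vector.
new_row <- list(x = 5L, y = "e", z = TRUE)
df_rows <- rbind(df, new_row)

# A column is a single vector with one value per row.
df_cols <- cbind(df_rows, w = df_rows$x * 2)
```

Passing the row as c(5, "e", TRUE) instead would coerce every value to character before binding.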
tibble from the tibble package (part of the tidyverse package ecosystem)
data.table from the data.table package
tibble provides features enhancing user experience (readability, ease of manipulation)
data.table provides speed
data.table() might be a bit confusing at first.
tidyverse packages
tidyverse package ecosystem - rich collection of data science packages.
readr - data input/output (also readxl for spreadsheets, haven for SPSS/Stata)
dplyr - data manipulation (also tidyr for pivoting)
ggplot2 - data visualisation
lubridate - working with dates and time
tibble - enhanced data frame
tidyverse ecosystem.
tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
$ x: int [1:4] 1 2 3 4
$ y: chr [1:4] "a" "b" "c" "d"
$ z: logi [1:4] TRUE FALSE FALSE TRUE
dplyr
dplyr is one of the core packages for data manipulation in tidyverse
filter() - subset rows from data
mutate() - add new/modify existing variables
rename() - rename existing variable
select() - subset columns from data
arrange() - order data by some variable
group_by() - aggregate data by some variable
summarise() - create a summary of aggregated variables
dplyr: Subsetting
dplyr: Modifying Columns
%>% Operator
Users of tidyverse packages are encouraged to use the pipe operator %>%.
Base R has had its own pipe operator |> since version 4.1.0, but it is still not used as widely.
<result> <- <input> %>%
  <function_name>(., arg_1, arg_2, ..., arg_n)
<result> <- <input> %>%
  <function_name>(arg_1, arg_2, ..., arg_n)
%>% Operator: Example
# A tibble: 4 × 4
x y z random
<int> <chr> <lgl> <dbl>
1 1 a TRUE 0.775
2 2 b FALSE -1.60
3 3 c FALSE -1.97
4 4 d TRUE -0.881
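A pipeline of this kind could be sketched as follows, assuming the dplyr package is installed and R is at least version 4.1.0 (needed for |>). The object names and the seed are illustrative, and the random values will differ from those printed above.

```r
library(dplyr)  # also makes %>% and tibble() available

df <- tibble(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE)
)

# The left-hand side becomes the first argument, with or without the dot.
with_dot    <- df %>% filter(., z)
without_dot <- df %>% filter(z)

set.seed(42)  # arbitrary seed, only for reproducibility
result <- df %>%
  mutate(random = rnorm(n())) %>%  # add a new column
  filter(z) %>%                    # keep rows where z is TRUE
  select(x, y, random) %>%         # keep three columns
  arrange(desc(x))                 # sort by x, descending

# The same filtering step written with the native pipe.
native <- df |> filter(z)
```

Reading the chain top to bottom gives the order of operations, which is the main reason the pipe is preferred over nested function calls.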
%>% vs |> Operator
R 4.1.0 introduced the native |> (pipe) operator.
Pivoting data
long to wide (tidyr::pivot_wider())
wide to long (tidyr::pivot_longer())
# A tibble: 2 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 37737
2 Brazil 2666 80488
# A tibble: 4 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 37737
3 Brazil 1999 2666
4 Brazil 2000 80488
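The two tables above can be converted into each other with a sketch like the one below, assuming the tidyr package is installed. The wide table is typed in by hand from the values shown, and the object names are chosen here for illustration.

```r
library(tidyr)

# Wide format: one row per country, one column per year.
wide <- data.frame(
  country = c("Afghanistan", "Brazil"),
  `1999` = c(745, 2666),
  `2000` = c(37737, 80488),
  check.names = FALSE,      # keep the year column names as they are
  stringsAsFactors = FALSE
)

# Wide to long: year columns are stacked into a year/cases pair.
long <- pivot_longer(
  wide,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

# Long to wide: the reverse operation.
wide_again <- pivot_wider(long, names_from = year, values_from = cases)
```

Note that year comes out as a character column after pivot_longer(), as in the output above, because it is built from column names.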
| Format | Readability | Platform | Speed | Compression | Persistence |
|---|---|---|---|---|---|
| csv | ✅ Human-readable | ✅ Cross-platform | ❌ Slow | ❌ No | ✅ Long-term |
| rds | ❌ Binary | ❌ R only | ✅ Fast | ✅ Yes | ✅ Long-term |
| pickle | ❌ Binary | ❌ Python only | ✅ Fast | ✅ Yes | ✅ Long-term |
| parquet | ❌ Binary | ✅ Cross-platform | ✅ Fast | ✅ Yes | ✅ Long-term |
| feather | ❌ Binary | ✅ Cross-platform | ✅ Fast | ✅ Yes | ❌ Short-term |
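Two of the formats in the table, csv and rds, can be tried out with base R alone. The sketch below writes to temporary files; the object and path names are assumptions made for illustration.

```r
df <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# csv: plain text, column types are guessed again when reading.
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# rds: binary, stores the R object exactly as it is.
saveRDS(df, rds_path)
from_rds <- readRDS(rds_path)

readLines(csv_path)  # the csv file can be read as text
```

The rds round trip returns an identical object, while the csv round trip depends on the reader inferring the same column types.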
.csv (Comma-separated value)
read.csv()/write.csv() - base R functions
readr::read_csv()/readr::write_csv() - functions from readr package in tidyverse
.rds (R data serialization)
readRDS()/saveRDS() - base R functions
readr::read_rds()/readr::write_rds() - functions from readr (no default compression)
.rda (R data)
save()/load() - base R functions
.feather/.parquet
arrow::read_feather()/arrow::write_feather() and arrow::read_parquet()/arrow::write_parquet() - functions from the arrow package in Apache Arrow
# A tibble: 6 × 10
`Duration (in seconds)` `What is your age (# years)?` What is your gender? -…¹
<dbl> <chr> <chr>
1 121 30-34 Man
2 462 30-34 Man
3 293 18-21 Man
4 851 55-59 Man
5 232 45-49 Man
6 277 18-21 Woman
# ℹ abbreviated name: ¹`What is your gender? - Selected Choice`
# ℹ 7 more variables: `In which country do you currently reside?` <chr>,
# `Are you currently a student? (high school, university, or graduate)` <chr>,
# `On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera` <chr>,
# `On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX` <chr>,
# `On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses` <chr>,
# `On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp` <chr>, …
Duration (in seconds) What is your age (# years)?
Min. : 120 Length:23997
1st Qu.: 264 Class :character
Median : 414 Mode :character
Mean : 10090
3rd Qu.: 715
Max. :2533678
What is your gender? - Selected Choice
Length:23997
Class :character
Mode :character
In which country do you currently reside?
Length:23997
Class :character
Mode :character
Are you currently a student? (high school, university, or graduate)
Length:23997
Class :character
Mode :character
On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera
Length:23997
Class :character
Mode :character
On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX
Length:23997
Class :character
Mode :character
On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses
Length:23997
Class :character
Mode :character
On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp
Length:23997
Class :character
Mode :character
On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai
Length:23997
Class :character
Mode :character
What is your gender? - Selected Choice
Man Nonbinary Prefer not to say
18266 78 334
Prefer to self-describe Woman
33 5286