POP77142 Quantitative Text Analysis for Social Scientists
File download (download.file() in R, wget in Python/Terminal)
Web scraping (rvest in R, beautifulsoup in Python)
APIs (httr2 in R, requests in Python)
(Eur-Lex)
<!DOCTYPE html>
<html>
<head>
<title>A title</title>
</head>
<body>
<h1 style="color:Red;">A heading</h1>
<p>A paragraph.</p>
</body>
</html>
Extra
Opening tag (<h1>)
Closing tag (</h1>)
Attribute (style="color:Red;")
<html>, <body>, <header>
<h1>, <title>, <div>
<b>, <i>
<a>
{xml_nodeset (2)}
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n <h1 style="color:Red;">A heading</h1> \n <p>A para ...
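A node set like the one above is what you get when the example page is parsed and the children of its root element are listed. The following is a minimal sketch, assuming the xml2 package is installed; the variable names (html_txt, page, top_nodes) are hypothetical, since the original code is not shown.

```r
library(xml2)

# The example page from above, stored as a string (html_txt is a hypothetical name)
html_txt <- '<!DOCTYPE html>
<html>
<head>
<title>A title</title>
</head>
<body>
<h1 style="color:Red;">A heading</h1>
<p>A paragraph.</p>
</body>
</html>'

page <- read_html(html_txt)       # parse the string into a document tree
top_nodes <- xml_children(page)   # children of the root <html> element
tag_names <- xml_name(top_nodes)  # tag names of those children
print(tag_names)
```

The root <html> element has two children, <head> and <body>, which matches the two entries in the node set.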
<?xml version="1.0" encoding="UTF-8" ?>
<courses>
<course>
<title>Computer Programming for Social Scientists</title>
<code>POP77001</code>
<year>2024</year>
<term>Michaelmas</term>
<description>Course on computer programming in Python and R.</description>
</course>
<course>
<title>Quantitative Text Analysis for Social Scientists</title>
<code>POP77142</code>
<year>2025</year>
<term>Hilary</term>
<description>Introduction to text analysis.</description>
</course>
</courses>
xml_txt <-
'<?xml version="1.0" encoding="UTF-8" ?>
<courses>
<course>
<title>Computer Programming for Social Scientists</title>
<code>POP77001</code>
<year>2024</year>
<term>Michaelmas</term>
<description>Course on computer programming in Python and R.</description>
</course>
<course>
<title>Quantitative Text Analysis for Social Scientists</title>
<code>POP77142</code>
<year>2025</year>
<term>Hilary</term>
<description>Introduction to text analysis.</description>
</course>
</courses>'
{xml_nodeset (2)}
[1] <course>\n <title>Computer Programming for Social Scientists</title>\n ...
[2] <course>\n <title>Quantitative Text Analysis for Social Scientists</titl ...
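One possible way to get from an XML string to the course nodes, and then to a rectangular data set, is sketched below. It assumes the xml2 package; the XML is a condensed version of the course data above (two fields per course), and the names doc, courses and course_df are hypothetical.

```r
library(xml2)

# Same course data as above, condensed to two fields per course
xml_txt <- '<?xml version="1.0" encoding="UTF-8" ?>
<courses>
  <course><title>Computer Programming for Social Scientists</title><code>POP77001</code></course>
  <course><title>Quantitative Text Analysis for Social Scientists</title><code>POP77142</code></course>
</courses>'

doc <- read_xml(xml_txt)         # parse the string into an XML document
courses <- xml_children(doc)     # the two <course> nodes under <courses>

# Pull the text out of every <code> and <title> element
codes <- xml_text(xml_find_all(doc, ".//code"))
titles <- xml_text(xml_find_all(doc, ".//title"))

# Assemble a data frame with one row per course
course_df <- data.frame(code = codes, title = titles)
print(course_df)
```

Collecting each field as a separate vector works here because every course has exactly one <code> and one <title>; with missing fields it would be safer to loop over the course nodes instead.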
(.docx, .xlsx, .pptx, OpenOffice/LibreOffice)
/ - select element at the root node (e.g. /html/body)
// - select element at any depth (e.g. //h1)
//<tag>/* - select all children of tag (e.g. //body/*)
//<tag>[@<attr>] - select all elements that have the given attribute (e.g. //h1[@style])
//<tag>[@<attr>='<value>'] - select all elements whose attribute has the given value (e.g. //h1[@style='color:Red;'])
Extra
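The XPath patterns can be tried out on the small HTML page from earlier. This is a sketch assuming the xml2 package; the page is a condensed copy of the earlier example, and the result variable names are hypothetical.

```r
library(xml2)

# Condensed copy of the example page used earlier
page <- read_html('<html><head><title>A title</title></head>
<body><h1 style="color:Red;">A heading</h1><p>A paragraph.</p></body></html>')

# Path starting at the root node
p_text <- xml_text(xml_find_all(page, "/html/body/p"))

# Element at any depth
h1_text <- xml_text(xml_find_all(page, "//h1"))

# All child elements of <body>
body_children <- xml_name(xml_find_all(page, "//body/*"))

# Elements that carry a style attribute, and the attribute's value
h1_style <- xml_attr(xml_find_all(page, "//h1[@style]"), "style")

# Elements whose style attribute has exactly this value
n_red <- length(xml_find_all(page, "//h1[@style='color:Red;']"))
```

The attribute test compares the whole attribute string, so `//h1[@style='color:Red']` (without the semicolon) would return an empty node set for this page.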
{xml_nodeset (8)}
[1] <table class="box-More_citations_needed plainlinks metadata ambox ambox-c ...
[2] <table class="infobox vevent"><tbody>\n<tr><th colspan="2" class="infobox ...
[3] <table style="width:100%; border-collapse:collapse"><tbody><tr style="ver ...
[4] <table class="wikitable" style="font-size: 95%;"><tbody>\n<tr>\n<th colsp ...
[5] <table class="wikitable sortable"><tbody>\n<tr>\n<th rowspan="2">Constitu ...
[6] <table class="wikitable"><tbody>\n<tr>\n<th>Constituency\n</th>\n<th>Outg ...
[7] <table class="wikitable"><tbody>\n<tr>\n<th>Winner\n</th>\n<th colspan="2 ...
[8] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...
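Selecting table nodes and converting them to data frames might look like the sketch below. It assumes the rvest package and uses a tiny inline table (a hypothetical stand-in with values borrowed from the output in this section) rather than the live page, whose URL is not shown here.

```r
library(rvest)

# Hypothetical stand-in for a scraped page containing one table
page <- read_html('<table class="wikitable">
<tr><th>Constituency</th><th>Name</th></tr>
<tr><td>Antrim East</td><td>Robert McCalmont</td></tr>
</table>')

# Select table nodes by attribute value using XPath
tables <- html_elements(page, xpath = "//table[@class='wikitable']")

# html_table() returns a list with one tibble per selected table
tbl <- html_table(tables)[[1]]
print(tbl)
```

The exact-match test on class would not select a table whose class is "wikitable sortable"; for that case an XPath such as `//table[contains(@class, 'wikitable')]` is one option.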
[[1]]
# A tibble: 109 × 8
Constituency Name Portrait `Party affiliation` `Party affiliation`
<chr> <chr> <chr> <chr> <chr>
1 Constituency Name "Portra… "Start of Dáil ter… Start of Dáil term
2 Antrim East Robert McCal… "" "" Irish Unionist
3 Antrim East George Hanna "" "Elected in 1919 b… Elected in 1919 by…
4 Antrim Mid Hugh O'Neill "" "" Irish Unionist
5 Antrim North Peter Kerr-S… "" "" Irish Unionist
6 Antrim South Charles Curt… "" "" Irish Unionist
7 Armagh Mid James Rolsto… "" "" Irish Unionist
8 Armagh North William Allen "" "" Irish Unionist
9 Armagh South Patrick Donn… "" "" Irish Parliamentary
10 Belfast Cromac William Arth… "" "" Irish Unionist
# ℹ 99 more rows
# ℹ 3 more variables: `Party affiliation` <chr>, `Party affiliation` <chr>,
# `Assumed office` <chr>
List of 1
$ : tibble [109 × 8] (S3: tbl_df/tbl/data.frame)
..$ Constituency : chr [1:109] "Constituency" "Antrim East" "Antrim East" "Antrim Mid" ...
..$ Name : chr [1:109] "Name" "Robert McCalmont" "George Hanna" "Hugh O'Neill" ...
..$ Portrait : chr [1:109] "Portrait" "" "" "" ...
..$ Party affiliation: chr [1:109] "Start of Dáil term" "" "Elected in 1919 by-electionas Independent Unionist" "" ...
..$ Party affiliation: chr [1:109] "Start of Dáil term" "Irish Unionist" "Elected in 1919 by-electionas Independent Unionist" "Irish Unionist" ...
..$ Party affiliation: chr [1:109] "End of Dáil term" "Resigned in 1919" "" "" ...
..$ Party affiliation: chr [1:109] "End of Dáil term" "Resigned in 1919" "Ulster Unionist" "Ulster Unionist" ...
..$ Assumed office : chr [1:109] "Assumed office" "Abstained" "Abstained" "Abstained" ...
# A tibble: 6 × 8
Constituency Name Portrait `Party affiliation` `Party affiliation`
<chr> <chr> <chr> <chr> <chr>
1 Constituency Name "Portra… "Start of Dáil ter… Start of Dáil term
2 Antrim East Robert McCalmont "" "" Irish Unionist
3 Antrim East George Hanna "" "Elected in 1919 b… Elected in 1919 by…
4 Antrim Mid Hugh O'Neill "" "" Irish Unionist
5 Antrim North Peter Kerr-Smil… "" "" Irish Unionist
6 Antrim South Charles Curtis … "" "" Irish Unionist
# ℹ 3 more variables: `Party affiliation` <chr>, `Party affiliation` <chr>,
# `Assumed office` <chr>
# A tibble: 6 × 8
Constituency Name Portrait `Start of Dáil term` `Start of Dáil term`
<chr> <chr> <chr> <chr> <chr>
1 Antrim East Robert McCalm… "" "" Irish Unionist
2 Antrim East George Hanna "" "Elected in 1919 by… Elected in 1919 by-…
3 Antrim Mid Hugh O'Neill "" "" Irish Unionist
4 Antrim North Peter Kerr-Sm… "" "" Irish Unionist
5 Antrim South Charles Curti… "" "" Irish Unionist
6 Armagh Mid James Rolston… "" "" Irish Unionist
# ℹ 3 more variables: `End of Dáil term` <chr>, `End of Dáil term` <chr>,
# `Assumed office` <chr>
tibble [108 × 7] (S3: tbl_df/tbl/data.frame)
$ Constituency : chr [1:108] "Antrim East" "Antrim East" "Antrim Mid" "Antrim North" ...
$ Name : chr [1:108] "Robert McCalmont" "George Hanna" "Hugh O'Neill" "Peter Kerr-Smiley" ...
$ Start of Dáil term: chr [1:108] "" "Elected in 1919 by-electionas Independent Unionist" "" "" ...
$ Start of Dáil term: chr [1:108] "Irish Unionist" "Elected in 1919 by-electionas Independent Unionist" "Irish Unionist" "Irish Unionist" ...
$ End of Dáil term : chr [1:108] "Resigned in 1919" "" "" "" ...
$ End of Dáil term : chr [1:108] "Resigned in 1919" "Ulster Unionist" "Ulster Unionist" "Ulster Unionist" ...
$ Assumed office : chr [1:108] "Abstained" "Abstained" "Abstained" "Abstained" ...
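The clean-up steps visible in the output above (promoting the first row to column names, dropping that row, removing the Portrait column) can be sketched in base R. The data frame below is a small hypothetical stand-in for the scraped table, and the names raw and clean are not from the original code.

```r
# Toy stand-in for the scraped table: the real header sits in the first row
raw <- data.frame(
  Constituency = c("Constituency", "Antrim East", "Antrim Mid"),
  Name = c("Name", "Robert McCalmont", "Hugh O'Neill"),
  Portrait = c("Portrait", "", ""),
  `Party affiliation` = c("Start of Dáil term", "Irish Unionist", "Irish Unionist"),
  check.names = FALSE
)

# Promote the first row to column names
names(raw) <- as.character(unlist(raw[1, ]))

# Drop the row that held the header
clean <- raw[-1, ]

# Remove the Portrait column, which holds no text
clean <- clean[, names(clean) != "Portrait"]

# Renumber rows from 1
rownames(clean) <- NULL
str(clean)
```

In the real table several columns end up sharing a name (for example two "Start of Dáil term" columns), so selecting by name afterwards is ambiguous; selecting by position or renaming the duplicates first avoids that.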