Week 1: Introduction to QTA

POP77142 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

Text as data
Data collection
Web technologies
HTML fundamentals
XPath

Text As Data

Textual Data

Ubiquitous
Yet often underutilized

Web Scraping

Online Data Sources

Data downloadable in tabular format (e.g. CSV/TSV, XLS, DTA)
Data available online as a table (e.g. webpages with rendered tables)
Unstructured data available online (e.g. simple webpages)
Interactive webpages with user input (e.g. webpages with logins, dropdown menus)
Web APIs (special interfaces for querying, e.g. Twitter, Google)

Online Data Collection

Tabular format: download single or multiple files (automate with download.file() in R, wget in Python/Terminal)
Online tables and unstructured data: simple web scraping (HTML with XPath, rvest in R, BeautifulSoup in Python)
Interactive webpages: web scraping with a headless browser (Selenium, Playwright; Python bindings recommended)
Web API: sending requests and processing responses (HTTP queries, httr2 in R, requests in Python)
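
For the tabular case, the download step can be sketched in R as below. This is a minimal sketch, not course code: the two CSV files are created locally and addressed through file:// URLs purely so the example runs offline (an assumption for illustration; on Windows the file:// prefix may need adjusting). In real use, urls would hold the remote http(s) addresses.

```r
# Create two small CSV files that stand in for remote sources (hypothetical data)
src_dir <- tempfile("source_")
dir.create(src_dir)
for (name in c("a.csv", "b.csv")) {
  write.csv(data.frame(id = 1:3), file.path(src_dir, name), row.names = FALSE)
}

# In real use these would be http(s) URLs; file:// URLs keep the sketch offline
src_paths <- normalizePath(file.path(src_dir, c("a.csv", "b.csv")), winslash = "/")
urls <- paste0("file://", src_paths)

# Download each file into a local folder, keeping the original file name
dest_dir <- tempfile("downloads_")
dir.create(dest_dir)
dest_files <- file.path(dest_dir, basename(urls))
for (i in seq_along(urls)) {
  download.file(urls[i], destfile = dest_files[i], quiet = TRUE, mode = "wb")
}

# Read the downloaded files back in as data frames
downloaded <- lapply(dest_files, read.csv)
```

Keeping the list of URLs and destination paths in vectors makes the loop easy to rerun, which is what makes this kind of collection reproducible.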

Web Tables

(Wikipedia)

Unstructured Data

(Eur-Lex)

Interactive Webpages

(Izbori.ba)

Automated Data Collection

Manual scraping (copy-pasting) can be:
- Extremely laborious and time-consuming
- Very error-prone
- Often impossible to reproduce exactly
Automated data collection
- Easy to scale up (computer time is cheap)
- Less error-prone
- Usually, perfectly reproducible
There is a trade-off (time invested in automation vs time saved)
- However, it is good to err on the side of automation

Web Technologies

Key technologies used to disseminate content on the Web:
- XML/HTML (Extensible Markup Language/Hypertext Markup Language)
- CSS (Cascading Style Sheets)
- JavaScript
- API (Application Programming Interface)
- JSON (JavaScript Object Notation)

Static vs Dynamic Websites

The critical feature of a website, which determines the approach to scraping its content
Static websites have prebuilt source code which is served at the user’s request
- No real-time processing of user’s input
- They can contain elements that change the appearance of a website
- Example: POP77142 website
Dynamic websites render pages in real time in response to the user’s input
- They can use a range of technologies to achieve it (JavaScript, Python Django, PHP)
- Example: Google Maps

HTML: Hypertext Markup Language

HTML (Hypertext Markup Language) is a markup language for webpages
Forms the basis of static websites
Your browser renders (interprets) HTML for viewing
Current version is HTML5

<!DOCTYPE html> 
<html>
    <head>
        <title>A title</title> 
    </head>
    <body>
        <h1 style="color:Red;">A heading</h1> 
        <p>A paragraph.</p> 
    </body>
</html>

Extra

W3Schools: Try HTML

HTML Basics

Basic unit of HTML is an element (aka node)
Elements typically begin with a start tag (e.g. <h1>)
And finish with an end tag (e.g. </h1>)
Content of an element is found between the start and end tags
Attributes are special words used within a start tag to control the element’s behaviour (e.g. style="color:Red;")
Some HTML tag examples:
- Document structure: <html>, <body>, <header>
- Document components: <h1>, <title>, <div>
- Text style: <b>, <i>
- Hyperlinks: <a>
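
The pieces of an element (tag name, content, attributes) can each be pulled out with rvest. The sketch below is illustrative only: the page content and the https://example.org/ address are made up for the example.

```r
library("rvest")

# A small page with one hyperlink (hypothetical content for illustration)
page <- rvest::read_html("
<html>
    <body>
        <h1 style='color:Red;'>A heading</h1>
        <p>A paragraph with <a href='https://example.org/'>a link</a>.</p>
    </body>
</html>")

# Select the <a> element
link <- rvest::html_element(page, xpath = "//a")

link_tag  <- rvest::html_name(link)          # tag name of the element
link_text <- rvest::html_text(link)          # content between start and end tags
link_href <- rvest::html_attr(link, "href")  # value of a single attribute

link_tag
link_text
link_href
```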

HTML tree

HTML Tree Relationships

All elements (nodes) in HTML tree are connected by relationships
These relationships can be of the following types:
- Ancestors (parents)
- Descendants (children)
- Siblings
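
These relationships can be explored directly with xml2, the package that rvest builds on. The following is a minimal sketch on a small hand-written page; the variable names are chosen for illustration.

```r
library("xml2")

doc <- xml2::read_html("
<html>
    <head><title>A title</title></head>
    <body>
        <h1>A heading</h1>
        <p>A paragraph.</p>
    </body>
</html>")

# Start from the <h1> element and the <body> element
h1   <- xml2::xml_find_first(doc, "//h1")
body <- xml2::xml_find_first(doc, "//body")

parent    <- xml2::xml_name(xml2::xml_parent(h1))     # immediate parent
ancestors <- xml2::xml_name(xml2::xml_parents(h1))    # all ancestors up to the root
siblings  <- xml2::xml_name(xml2::xml_siblings(h1))   # nodes sharing the same parent
children  <- xml2::xml_name(xml2::xml_children(body)) # immediate children

parent
ancestors
siblings
children
```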

HTML Parent/Ancestor

HTML Children/Descendants

HTML Siblings

Parsing HTML Tree: Example

library("rvest")

html_txt <- "
<!DOCTYPE html> 
<html>
    <head>
        <title>A title</title> 
    </head>
    <body>
        <h1 style='color:Red;'>A heading</h1> 
        <p>A paragraph.</p> 
    </body>
</html>"

html <- rvest::read_html(html_txt)

str(html)

List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Parsing HTML Tree: Example

children <- rvest::html_children(html)
children

{xml_nodeset (2)}
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n        <h1 style="color:Red;">A heading</h1> \n        <p>A para ...

body <- children[2]
rvest::html_name(body)

[1] "body"

children2 <- rvest::html_children(body)
children2

{xml_nodeset (2)}
[1] <h1 style="color:Red;">A heading</h1>
[2] <p>A paragraph.</p>

rvest::html_attrs(children2[1])

[[1]]
       style 
"color:Red;"

rvest::html_text(children2[1])

[1] "A heading"

XML: Extensible Markup Language

XML (Extensible Markup Language) is a more general form of markup language
Allows sharing structured data of tree-like form
Relative to HTML:
- Tags are user-defined
- End tags are always required
- Stricter (no inconsistencies permitted)

<?xml version="1.0" encoding="UTF-8" ?>
<courses> 
    <course> 
        <title>Computer Programming for Social Scientists</title> 
        <code>POP77001</code> 
        <year>2024</year> 
        <term>Michaelmas</term> 
        <description>Course on computer programming in Python and R.</description> 
    </course> 
    <course> 
        <title>Quantitative Text Analysis for Social Scientists</title> 
        <code>POP77142</code> 
        <year>2025</year> 
        <term>Hilary</term> 
        <description>Introduction to text analysis.</description> 
    </course> 
</courses>
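
The difference in strictness can be seen by feeding both parsers markup with a missing end tag. This is a small sketch under the assumption that xml2 is installed with its default parser options; the input strings are made up for the example.

```r
library("xml2")

# XML: <b> is never closed, so the parser is expected to refuse the document
xml_parsed <- tryCatch({
  xml2::read_xml("<a><b>text</a>")
  TRUE
}, error = function(e) FALSE)

# HTML: the first <p> is never closed, but the parser repairs the tree
html_doc <- xml2::read_html("<p>one<p>two")
n_paragraphs <- length(xml2::xml_find_all(html_doc, "//p"))

xml_parsed
n_paragraphs
```

This leniency is why real-world webpages with sloppy markup can usually still be parsed, while a malformed XML file has to be fixed first.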

Parsing XML Tree: Example

library("xml2")

xml_txt <- 
'<?xml version="1.0" encoding="UTF-8" ?>
<courses> 
    <course> 
        <title>Computer Programming for Social Scientists</title> 
        <code>POP77001</code> 
        <year>2024</year> 
        <term>Michaelmas</term> 
        <description>Course on computer programming in Python and R.</description> 
    </course> 
    <course> 
        <title>Quantitative Text Analysis for Social Scientists</title> 
        <code>POP77142</code> 
        <year>2025</year> 
        <term>Hilary</term> 
        <description>Introduction to text analysis.</description> 
    </course> 
</courses>'

xml <- xml2::read_xml(xml_txt)

str(xml)

List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Parsing XML Tree: Example

children3 <- xml2::xml_children(xml)
children3

{xml_nodeset (2)}
[1] <course>\n  <title>Computer Programming for Social Scientists</title>\n   ...
[2] <course>\n  <title>Quantitative Text Analysis for Social Scientists</titl ...

pop77001 <- children3[1]
xml2::xml_children(pop77001)

{xml_nodeset (5)}
[1] <title>Computer Programming for Social Scientists</title>
[2] <code>POP77001</code>
[3] <year>2024</year>
[4] <term>Michaelmas</term>
[5] <description>Course on computer programming in Python and R.</description>

xml2::xml_text(xml2::xml_children(children3[1]))

[1] "Computer Programming for Social Scientists"     
[2] "POP77001"                                       
[3] "2024"                                           
[4] "Michaelmas"                                     
[5] "Course on computer programming in Python and R."
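
A common next step is to turn such a tree into a rectangular dataset with one row per record. The sketch below shows one possible way to do this; the helper field() is hypothetical (not part of xml2) and the XML is a shortened copy of the course list.

```r
library("xml2")

xml_doc <- xml2::read_xml('<courses>
    <course>
        <title>Computer Programming for Social Scientists</title>
        <code>POP77001</code>
        <year>2024</year>
    </course>
    <course>
        <title>Quantitative Text Analysis for Social Scientists</title>
        <code>POP77142</code>
        <year>2025</year>
    </course>
</courses>')

courses <- xml2::xml_find_all(xml_doc, "//course")

# Hypothetical helper: text of the first child <tag> within each node.
# xml_find_first() returns one result per input node, so the output
# has the same length as `nodes`.
field <- function(nodes, tag) {
  xml2::xml_text(xml2::xml_find_first(nodes, paste0("./", tag)))
}

courses_df <- data.frame(
  code  = field(courses, "code"),
  title = field(courses, "title"),
  year  = as.integer(field(courses, "year")),
  stringsAsFactors = FALSE
)
courses_df
```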

Examples of XML

RSS (Really Simple Syndication) feeds
SVG (Scalable Vector Graphics) images
Modern office documents (Microsoft Office .docx, .xlsx, .pptx, OpenOffice/LibreOffice)

Parsing XML/HTML with XPath

XPath (XML Path Language) is a language for selecting parts of XML/HTML tree
Basic syntax:
- / - select element at the root node (e.g. /html/body)
- // - select element at any depth (e.g. //h1)
- //<tag>/* - select all children of tag (e.g. //body/*); use //<tag>//* for all descendants
- //<tag>[@<attr>] - select all elements that have given attribute (e.g. //h1[@style])
- //<tag>[@<attr>='<value>'] - select all elements, whose attribute has given value (e.g. //h1[@style='color:Red;'])
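
Each of these patterns can be tried on a small document. The following sketch uses a hand-written page with one nested paragraph (made up for illustration) to show in particular how /* differs from //*.

```r
library("xml2")

doc <- xml2::read_html("
<html>
    <head><title>A title</title></head>
    <body>
        <h1 style='color:Red;'>A heading</h1>
        <div><p>A nested paragraph.</p></div>
    </body>
</html>")

# Helper for brevity: tag names of all elements matched by an XPath expression
match_names <- function(xpath) xml2::xml_name(xml2::xml_find_all(doc, xpath))

from_root        <- match_names("/html/body")    # absolute path from the root
any_depth        <- match_names("//p")           # <p> anywhere in the tree
body_children    <- match_names("//body/*")      # direct children of <body> only
body_descendants <- match_names("//body//*")     # children, grandchildren, ...
with_attr        <- match_names("//h1[@style]")  # <h1> elements that have a style attribute
with_value       <- match_names("//h1[@style='color:Red;']")  # attribute equals a value
no_match         <- match_names("//p[@style]")   # no <p> has a style attribute

body_children
body_descendants
```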

Extra

XPath syntax

Parsing XML/HTML with XPath: Example

rvest::html_elements(html, xpath = "//p")

{xml_nodeset (1)}
[1] <p>A paragraph.</p>

rvest::html_elements(html, xpath = "//h1[@style='color:Red;']")

{xml_nodeset (1)}
[1] <h1 style="color:Red;">A heading</h1>

xml2::xml_find_all(xml, xpath = "//code")

{xml_nodeset (2)}
[1] <code>POP77001</code>
[2] <code>POP77142</code>

# We can also find elements by text
xml2::xml_find_all(xml, xpath = "//code[text()='POP77001']")

{xml_nodeset (1)}
[1] <code>POP77001</code>

Scraping Webpage

(Wikipedia)

Scraping Webpage with XPath: Example

html <- rvest::read_html("https://en.wikipedia.org/wiki/Members_of_the_1st_D%C3%A1il")

tables <- rvest::html_elements(html, xpath = "//table")
tables

{xml_nodeset (8)}
[1] <table class="box-More_citations_needed plainlinks metadata ambox ambox-c ...
[2] <table class="infobox vevent"><tbody>\n<tr><th colspan="2" class="infobox ...
[3] <table style="width:100%; border-collapse:collapse"><tbody><tr style="ver ...
[4] <table class="wikitable" style="font-size: 95%;"><tbody>\n<tr>\n<th colsp ...
[5] <table class="wikitable sortable"><tbody>\n<tr>\n<th rowspan="2">Constitu ...
[6] <table class="wikitable"><tbody>\n<tr>\n<th>Constituency\n</th>\n<th>Outg ...
[7] <table class="wikitable"><tbody>\n<tr>\n<th>Winner\n</th>\n<th colspan="2 ...
[8] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...

tbody <- rvest::html_children(tables[5])
tbody

{xml_nodeset (1)}
[1] <tbody>\n<tr>\n<th rowspan="2">Constituency\n</th>\n<th rowspan="2">Name\ ...

tds <- rvest::html_table(tbody)
tds

[[1]]
# A tibble: 109 × 8
   Constituency   Name          Portrait `Party affiliation` `Party affiliation`
   <chr>          <chr>         <chr>    <chr>               <chr>              
 1 Constituency   Name          "Portra… "Start of Dáil ter… Start of Dáil term 
 2 Antrim East    Robert McCal… ""       ""                  Irish Unionist     
 3 Antrim East    George Hanna  ""       "Elected in 1919 b… Elected in 1919 by…
 4 Antrim Mid     Hugh O'Neill  ""       ""                  Irish Unionist     
 5 Antrim North   Peter Kerr-S… ""       ""                  Irish Unionist     
 6 Antrim South   Charles Curt… ""       ""                  Irish Unionist     
 7 Armagh Mid     James Rolsto… ""       ""                  Irish Unionist     
 8 Armagh North   William Allen ""       ""                  Irish Unionist     
 9 Armagh South   Patrick Donn… ""       ""                  Irish Parliamentary
10 Belfast Cromac William Arth… ""       ""                  Irish Unionist     
# ℹ 99 more rows
# ℹ 3 more variables: `Party affiliation` <chr>, `Party affiliation` <chr>,
#   `Assumed office` <chr>

Scraping Webpage with XPath: Example

str(tds)

List of 1
 $ : tibble [109 × 8] (S3: tbl_df/tbl/data.frame)
  ..$ Constituency     : chr [1:109] "Constituency" "Antrim East" "Antrim East" "Antrim Mid" ...
  ..$ Name             : chr [1:109] "Name" "Robert McCalmont" "George Hanna" "Hugh O'Neill" ...
  ..$ Portrait         : chr [1:109] "Portrait" "" "" "" ...
  ..$ Party affiliation: chr [1:109] "Start of Dáil term" "" "Elected in 1919 by-electionas Independent Unionist" "" ...
  ..$ Party affiliation: chr [1:109] "Start of Dáil term" "Irish Unionist" "Elected in 1919 by-electionas Independent Unionist" "Irish Unionist" ...
  ..$ Party affiliation: chr [1:109] "End of Dáil term" "Resigned in 1919" "" "" ...
  ..$ Party affiliation: chr [1:109] "End of Dáil term" "Resigned in 1919" "Ulster Unionist" "Ulster Unionist" ...
  ..$ Assumed office   : chr [1:109] "Assumed office" "Abstained" "Abstained" "Abstained" ...

tds <- tds[[1]]
head(tds)

# A tibble: 6 × 8
  Constituency Name             Portrait `Party affiliation` `Party affiliation`
  <chr>        <chr>            <chr>    <chr>               <chr>              
1 Constituency Name             "Portra… "Start of Dáil ter… Start of Dáil term 
2 Antrim East  Robert McCalmont ""       ""                  Irish Unionist     
3 Antrim East  George Hanna     ""       "Elected in 1919 b… Elected in 1919 by…
4 Antrim Mid   Hugh O'Neill     ""       ""                  Irish Unionist     
5 Antrim North Peter Kerr-Smil… ""       ""                  Irish Unionist     
6 Antrim South Charles Curtis … ""       ""                  Irish Unionist     
# ℹ 3 more variables: `Party affiliation` <chr>, `Party affiliation` <chr>,
#   `Assumed office` <chr>

Scraping Webpage with XPath: Example

colnames(tds) <- tds[1,]
tds <- tds[-1,]
head(tds)

# A tibble: 6 × 8
  Constituency Name           Portrait `Start of Dáil term` `Start of Dáil term`
  <chr>        <chr>          <chr>    <chr>                <chr>               
1 Antrim East  Robert McCalm… ""       ""                   Irish Unionist      
2 Antrim East  George Hanna   ""       "Elected in 1919 by… Elected in 1919 by-…
3 Antrim Mid   Hugh O'Neill   ""       ""                   Irish Unionist      
4 Antrim North Peter Kerr-Sm… ""       ""                   Irish Unionist      
5 Antrim South Charles Curti… ""       ""                   Irish Unionist      
6 Armagh Mid   James Rolston… ""       ""                   Irish Unionist      
# ℹ 3 more variables: `End of Dáil term` <chr>, `End of Dáil term` <chr>,
#   `Assumed office` <chr>

tds <- tds[,-3]
str(tds)

tibble [108 × 7] (S3: tbl_df/tbl/data.frame)
 $ Constituency      : chr [1:108] "Antrim East" "Antrim East" "Antrim Mid" "Antrim North" ...
 $ Name              : chr [1:108] "Robert McCalmont" "George Hanna" "Hugh O'Neill" "Peter Kerr-Smiley" ...
 $ Start of Dáil term: chr [1:108] "" "Elected in 1919 by-electionas Independent Unionist" "" "" ...
 $ Start of Dáil term: chr [1:108] "Irish Unionist" "Elected in 1919 by-electionas Independent Unionist" "Irish Unionist" "Irish Unionist" ...
 $ End of Dáil term  : chr [1:108] "Resigned in 1919" "" "" "" ...
 $ End of Dáil term  : chr [1:108] "Resigned in 1919" "Ulster Unionist" "Ulster Unionist" "Ulster Unionist" ...
 $ Assumed office    : chr [1:108] "Abstained" "Abstained" "Abstained" "Abstained" ...
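
The repeated column names in the output above come from header cells that span several columns. One possible clean-up, sketched here on a two-row miniature of the table rather than the live Wikipedia page, is to make the names unique with base R's make.unique(); this is an illustrative follow-up, not necessarily how the data are cleaned later in the course.

```r
library("rvest")

# A miniature table with a header cell spanning two columns
page <- rvest::read_html("
<table>
    <tr><th>Constituency</th><th>Name</th><th colspan='2'>Party affiliation</th></tr>
    <tr><td>Antrim East</td><td>Robert McCalmont</td><td></td><td>Irish Unionist</td></tr>
    <tr><td>Antrim Mid</td><td>Hugh O'Neill</td><td></td><td>Irish Unionist</td></tr>
</table>")

# html_table() on a single <table> node returns a single tibble
tbl <- rvest::html_table(rvest::html_element(page, xpath = "//table"))
original_names <- names(tbl)

# Append numeric suffixes to duplicated names so columns can be selected by name
names(tbl) <- make.unique(names(tbl))
names(tbl)
```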

Web Scraping in Practice

Always check first whether an API for querying exists.
It is the most robust (and sanctioned) way of obtaining data.
Check copyright terms and respect them when using scraped data.
Limit your scraping bandwidth (introduce waiting times between queries).
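
Waiting times between queries can be added with Sys.sleep(). The function below is a minimal sketch: scrape_politely() and its arguments are hypothetical names, and the fetching function is passed in as an argument so that the same loop works with rvest::read_html or any other downloader.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
scrape_politely <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # No need to wait after the last request
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(results) <- urls
  results
}

# Offline demonstration with a stand-in fetch function and made-up URLs;
# in real use this could be e.g.
#   scrape_politely(urls, rvest::read_html, delay = 2)
demo_urls <- c("https://example.org/page1", "https://example.org/page2")
demo <- scrape_politely(demo_urls, fetch = toupper, delay = 0.1)
```

A fixed delay is the simplest option; the appropriate length depends on the website and any limits it publishes.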

Next

Tutorial: HTML and web scraping
Assignment 1: Due 15:59 on Wednesday, 26th March (submission on Blackboard)
