Week 2: Quantifying Texts

POP77142 Quantitative Text Analysis for Social Scientists

Tom Paskhalis

Overview

Motivation
Character Encoding
Text Preprocessing
APIs
JSON

Motivation

Parliamentary Power in 17th c. England

(Rodon & Paskhalis, 2024)

Ideological Positions in Germany

(Slapin & Proksch, 2008)

Complexity of US State of the Union Addresses

(Benoit, Munger & Spirling, 2019)

Character Encoding

Foundations of Computer Memory

Bits:
- The smallest unit of digital data.
- Can be either 0 or 1.
- n bits can represent \(2^n\) different values.
- E.g. 2 bits can represent 4 values: 00, 01, 10, 11.
Bytes:
- 8 bits = 1 byte
- Thus, 1 byte can represent 256 values: \([00000000, 00000001, ..., 11111111]\).
- Larger units are the kilobyte (KB), megabyte (MB), gigabyte (GB), etc.
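The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
# The number of distinct values representable with n bits is 2 ** n.
for n_bits in (1, 2, 7, 8):
    print(n_bits, "bits ->", 2 ** n_bits, "values")

# Enumerate every 2-bit pattern, zero-padded to a width of 2.
two_bit_patterns = [format(value, "02b") for value in range(2 ** 2)]
print(two_bit_patterns)  # ['00', '01', '10', '11']

# The smallest and largest values that fit into one byte (8 bits).
smallest_byte = format(0, "08b")
largest_byte = format(2 ** 8 - 1, "08b")
print(smallest_byte, largest_byte)  # 00000000 11111111
```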

Character Encoding

Character - “the smallest component of written language that has semantic value” (https://unicode.org/glossary/#character).
- E.g. “h”, “ε”, “4”, “&”, “!”, “€”, “🤖”.
Character set - a collection of characters.
- E.g. Latin alphabet, Greek alphabet, Arabic numerals, punctuation marks, etc.
Code point - the unique value assigned to each character in a set.
- How it is written depends on the numeral system: binary - 101101, decimal - 45, hexadecimal - 2D, etc.
A mapping between code points and characters is called an encoding.
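In Python, the built-in `ord()` and `chr()` functions convert between characters and their Unicode code points. The sketch below prints the code points of the example characters in decimal, hexadecimal and binary; the choice of the hyphen for the 45 / 2D / 101101 example is an assumption about which character that value was meant to illustrate.

```python
# ord() returns the code point of a character as an integer,
# chr() goes the other way (code point -> character).
for character in ["h", "ε", "4", "&", "!", "€", "🤖"]:
    code_point = ord(character)
    print(character, code_point, hex(code_point), bin(code_point))

# The same code point written in three numeral systems.
hyphen_code_point = ord("-")
hyphen_decimal = str(hyphen_code_point)
hyphen_hex = format(hyphen_code_point, "X")
hyphen_binary = format(hyphen_code_point, "b")
print(hyphen_decimal, hyphen_hex, hyphen_binary)  # 45 2D 101101

# Round trip: code point back to character.
print(chr(hyphen_code_point))  # -
```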

ASCII

ASCII (American Standard Code for Information Interchange) - one of the earliest widespread character encodings.
Only encodes \(2^7 = 128\) characters (of which 95 are printable).
- Essentially, English alphabet, Arabic numerals, and some punctuation marks.
Later extended to \(2^8 = 256\) characters (aka ISO-8859-1 or Latin-1, closely related to Windows-1252).
- Added support for most Western European languages.
A lot more needed to support all the world’s languages…

ASCII

(Wikipedia & US DoD)

E.g., the decimal code point for “A” is 65, represented by these bits:
- 1000001 (original ASCII)
- 01000001 (ISO-8859-1)
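The two bit patterns can be reproduced in Python, along with the limits of 7-bit ASCII. This is a sketch using only built-in string methods; the character "é" is chosen here purely as an illustration of a letter that is outside ASCII but inside ISO-8859-1.

```python
letter = "A"
code_point = ord(letter)

seven_bit = format(code_point, "07b")  # 7 bits, as in original ASCII
eight_bit = format(code_point, "08b")  # 8 bits, as in ISO-8859-1
print(code_point, seven_bit, eight_bit)  # 65 1000001 01000001

e_acute = "\u00e9"  # the letter "é" as a single code point

# Characters outside the 128 ASCII code points cannot be encoded as ASCII...
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...but "é" fits into one byte in the ISO-8859-1 extension.
e_acute_latin1 = e_acute.encode("iso-8859-1")
print(ascii_ok, e_acute_latin1)  # False b'\xe9'
```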

Unicode

Designed to support all the world’s writing systems that can be digitized.
Variable-length in UTF-8: between 1 and 4 bytes (8 and 32 bits) per character.
First 128 code points are the same as in ASCII (backward compatibility).
UTF-8 - the most common Unicode encoding (UTF-16 also exists, but is rarer):
- 1 byte for ASCII characters.
- 2 bytes for most other Latin-script letters, Greek, Cyrillic, Hebrew, Arabic, etc.
- 3 bytes for the rest of the BMP (Basic Multilingual Plane), including most CJK characters.
- 4 bytes for the rest of Unicode (e.g. many emoji).
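The number of bytes UTF-8 needs for a given character can be inspected directly. The following sketch encodes a handful of sample characters (picked here for illustration only) and reports the length of each encoded form.

```python
samples = ["h", "\u00e9", "ε", "д", "€", "中", "🤖"]

utf8_lengths = {}
for character in samples:
    encoded = character.encode("utf-8")
    utf8_lengths[character] = len(encoded)
    # Print the character, its code point, byte count and raw bytes in hex.
    print(character, f"U+{ord(character):04X}", len(encoded), encoded.hex())

# The same text takes a different number of bytes in different encodings.
word = "ε"
print(len(word.encode("utf-8")), len(word.encode("utf-16-le")))
```

In this sample the ASCII letter takes 1 byte, the accented Latin, Greek and Cyrillic letters take 2 bytes, the euro sign and the Chinese character take 3 bytes, and the emoji takes 4 bytes.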

Digital Storage of Text

Plain Text & Binary Files

Plain text files contain only human-readable characters.
- “Simple” text, e.g. .txt
- Markup languages, e.g. .md, .Rmd, .qmd, .html
- Data storage, e.g. .csv, .tsv, .tab, .json, .xml
- Images, e.g. .svg, .eps
- Computer code, e.g. .py, .R, .tex, .sh
- Some other: .ipynb (effectively, .json), .docx (effectively, zipped .xml)
Binary files contain computer-readable data (parts can be human-readable).
- Text, e.g. .doc, .rtf, .pdf
- Data storage, e.g. .pickle, .rds, .feather
- Images, e.g. .png, .jpg, .gif
Not always dichotomous (e.g. .docx, .pdf, .svg).

Text Encoding Caveats

Plain text files don’t contain information about encoding.
Instead, each program “guesses” (often, assumes its default).
If the guess is wrong, text can be displayed incorrectly (mojibake).
UTF-8 is the most common encoding (the one you should use).
However, many texts still use other encodings.
Windows is often a problem (can use Windows-1252 or UTF-16).
In general, no easy way to know the encoding of a text file.
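Mojibake is easy to reproduce: encode text with one encoding and decode the resulting bytes with another. The sketch below does this in memory with a single word; the choice of UTF-8 and Windows-1252 as the mismatched pair is just one common scenario.

```python
original = "Cr\u00faachanr\u00e1ith"  # "Crúachanráith"

# The bytes that a UTF-8 aware program would write to disk.
utf8_bytes = original.encode("utf-8")

# A program that wrongly assumes Windows-1252 reads every byte as one character.
mojibake = utf8_bytes.decode("windows-1252")
print(mojibake)  # CrÃºachanrÃ¡ith

# No information was lost, so reversing the wrong step recovers the text.
recovered = mojibake.encode("windows-1252").decode("utf-8")
print(recovered == original)  # True
```

Note that the wrong decoding raises no error here, which is why such problems often go unnoticed until the text is inspected.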

Text Encoding: Example

Write out text using Python in ISO-8859-1 encoding.

tain_bo_cualinge = "Fecht n-óen do Ailill & do Meidb íar ndérgud a rígleptha dóib i Crúachanráith Chonnacht, arrecaim comrád chind cherchailli eturru."

with open("../temp/latin1.txt", "w", encoding = "ISO-8859-1") as f:
    f.write(tain_bo_cualinge)

Read in text in R using the default (UTF-8) encoding.

tain_bo_cualinge <- readLines("../temp/latin1.txt")
tain_bo_cualinge

[1] "Fecht n-\xf3en do Ailill & do Meidb \xedar nd\xe9rgud a r\xedgleptha d\xf3ib i Cr\xfaachanr\xe1ith Chonnacht, arrecaim comr\xe1d chind cherchailli eturru."

Read in the text again, this time specifying ISO-8859-1 (note that R uses a different name for this encoding, “latin1”).

tain_bo_cualinge <- readLines("../temp/latin1.txt", encoding = "latin1")
tain_bo_cualinge

[1] "Fecht n-óen do Ailill & do Meidb íar ndérgud a rígleptha dóib i Crúachanráith Chonnacht, arrecaim comrád chind cherchailli eturru."
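For comparison, a similar experiment can be run entirely in Python. This is a sketch under a few assumptions: it writes a shortened version of the sentence to a temporary directory (rather than `../temp/`), and it relies on the fact that Python's strict UTF-8 decoder raises an error on invalid bytes instead of printing escapes.

```python
import os
import tempfile

text = "Fecht n-\u00f3en do Ailill & do Meidb"  # shortened sample sentence

# Write the text in ISO-8859-1 to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(text)

# Reading it back as UTF-8 fails: the byte 0xF3 is not valid UTF-8 here.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

# errors="replace" avoids the exception but substitutes U+FFFD for bad bytes.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced_text = f.read()

# Specifying the correct encoding recovers the original text.
with open(path, encoding="ISO-8859-1") as f:
    latin1_text = f.read()

print(strict_utf8_failed, replaced_text, latin1_text, sep="\n")
```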

Text Encoding: Things to Try

Pick a movie you like.
Go to OpenSubtitles.
Find subtitles for that movie in a language that uses a different script.
Download the subtitles and try to open them in a text editor.
Check the ‘guessed’ encoding of the file.
Are all characters displayed correctly?
Try to open the file programmatically in R or Python.

Designing a Text Analysis Study

Workflow in Text Analysis

(Grimmer & Stewart, 2013)

Sample vs Population

Basic Idea: Observed text is a stochastic realization.
Systematic features shape most of observed verbal content.
Non-systematic, random features also shape verbal content.

Implications of a Stochastic View

Observed text is not the only text that could have been generated.
Research (system) design would depend on the question and quantity of interest.
Very different if you are trying to monitor something like hate speech, where what you actually say matters, not the value of your “expected statement”.
Means that having “all the text” is still not a “population”.

Sampling Strategies

Be clear about what your sample and your population are.
May not be feasible to perform any sampling.
Different types of sampling vary from random to purposive:
- Random sampling (e.g. politician’s speeches)
- Non-random sampling (e.g. messages containing hate speech on a social media platform)
Key is to make sure that what is being analyzed is a valid representation of the phenomenon as a whole - a question of research design.
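The two ends of the spectrum can be sketched in Python. Everything in the example is hypothetical: the document identifiers, the messages and the keyword are invented for illustration, and a real study would need a far more careful selection rule.

```python
import random

# Hypothetical population: identifiers of 100 speeches.
population = [f"speech_{i:03d}" for i in range(1, 101)]

# Random sampling: every speech has the same chance of selection.
# A fixed seed makes the draw reproducible.
rng = random.Random(42)
random_sample = rng.sample(population, k=10)
print(random_sample)

# Purposive (non-random) sampling: keep only messages matching a criterion.
# The messages and the keyword are made up for this sketch.
messages = [
    "great match last night",
    "they are vermin and should leave",
    "new budget announced today",
]
keyword = "vermin"
purposive_sample = [m for m in messages if keyword in m]
print(purposive_sample)
```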

Text Preprocessing

Quantifying Texts

Some QTA Terminology

Corpus - a collection of texts for analysis.
- E.g. SOTU addresses, Hansard debates, party manifestos, etc.
Document - a single text in the corpus.
- E.g. a single SOTU address, one speech, a specific party manifesto, etc.
Token - a single unit of text.
- E.g. typically a word, but can include punctuation, numbers, hashtags, etc.
Type - a unique token.
- E.g. “the” counts as a single type, however many times it appears in the corpus.
Tokenization - the process of breaking a text into tokens.
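These terms can be made concrete with a toy corpus in Python. The two documents and the regular expression used for tokenization are assumptions made for this sketch; dedicated tokenizers handle many more edge cases.

```python
import re

# A hypothetical corpus of two short documents.
corpus = {
    "doc1": "We the people",
    "doc2": "The people voted, and the people spoke.",
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Tokenize each document, lowercasing so that "The" and "the" match.
tokens_by_document = {
    name: [token.lower() for token in tokenize(text)]
    for name, text in corpus.items()
}

all_tokens = [t for tokens in tokens_by_document.values() for t in tokens]
types = set(all_tokens)

n_tokens = len(all_tokens)
n_types = len(types)
print(tokens_by_document)
print(n_tokens, n_types)  # 12 8
```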

Some Linguistic Terminology

Tokens constitute the basic unit of analysis (particularly in NLP applications).
But how tokens are constructed can vary.
It might be useful to consider different tokens as the same.
- E.g. “runs”, “running”, “ran” are all forms of the same word.
Stemming - mechanically removing affixes (usually, suffixes) from tokens.
- E.g. “running” -> “run”, “runs” -> “run”.
Lemmatization - reducing tokens to their base (root) or dictionary form.
- E.g. “ran” -> “run”, “better” -> “good”.
While lemmatization is more accurate, it also requires more built-in knowledge about a language.
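The contrast can be sketched with two toy functions in Python. Neither is a real algorithm: `naive_stem` is a deliberately crude suffix stripper and `LEMMAS` is a tiny hand-written lookup table, both invented for this illustration. Real stemmers and lemmatizers use far richer rules and dictionaries.

```python
SUFFIXES = ("ing", "s")  # toy rule set, checked in this order

def naive_stem(token):
    """Strip one known suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: -len(suffix)]
            # "running" -> "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token

# Lemmatization needs knowledge about the language, here a lookup table.
LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

def naive_lemma(token):
    """Return the dictionary form if it is known, else the token itself."""
    return LEMMAS.get(token, token)

for token in ["runs", "running", "ran"]:
    print(token, naive_stem(token), naive_lemma(token))
```

The irregular form “ran” is left untouched by the stemmer but mapped to “run” by the lookup, which is the trade-off described above.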

Tokenization: Example

library("quanteda")

text <- "The quick brown fox jumps over the lazy dog."
tokens <- quanteda::tokens(text)
tokens

Tokens consisting of 1 document.
text1 :
 [1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"  
[10] "."

quanteda::ntoken(tokens)

text1 
   10

quanteda::ntype(tokens)

text1 
   10

tokens <- quanteda::tokens_tolower(tokens)
quanteda::ntoken(tokens)

text1 
   10

quanteda::ntype(tokens)

text1 
    9

tokens_stemmed <- quanteda::tokens_wordstem(tokens, language = "english")
tokens_stemmed

Tokens consisting of 1 document.
text1 :
 [1] "the"   "quick" "brown" "fox"   "jump"  "over"  "the"   "lazi"  "dog"  
[10] "."

Stopwords

Not all words can be assumed to be equally informative.
E.g. “the”, “a”, “and”, etc. are common in most texts.
Stopwords are words that are removed from the text before analysis.

library("stopwords")

stopwords::stopwords(language = "en")

  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"       "will"
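Removing stopwords amounts to filtering tokens against such a list. The Python sketch below uses a small hand-picked subset of the list above and an already lowercased, tokenized sentence; both are assumptions made to keep the example self-contained.

```python
# A small subset of the English stopword list shown above.
STOPWORDS = {"the", "a", "an", "and", "over"}

tokens = ["the", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog", "."]

# Keep only the tokens that are not stopwords.
content_tokens = [token for token in tokens if token not in STOPWORDS]
print(content_tokens)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']
```

Whether removing stopwords is appropriate depends on the analysis: for some tasks function words carry the signal of interest.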

API

APIs

API - Application Programming Interface.
Programmatic way to interact with a software application/service.
Widely used in computing (even on a single machine).
Here, we focus on web APIs, which provide an interface to web services.
A set of structured HTTP/S requests returns some responses.
- E.g. in XML or JSON format.

APIs vs Web Scraping

Advantages:

Cleaner data collection: no malformed HTML, consistency, fewer legal issues, etc.
Standardised data access processes.
Scalability.
Potentially, pre-existing robust packages for handling common tasks.

Disadvantages:

Limited availability.
Dependency on API providers.
Access limits.
Rate limits.
Price.

Principles of APIs

Working with Web APIs

Types of APIs:
- RESTful APIs - queries for static information at a given moment.
- Streaming APIs - tracking real-time changes (e.g. posts, economic indicators, etc.).
API documentation varies by provider, but it is usually rather technical in nature (written for developers).
Some key terms:
- Endpoint - URL (web location) that receives requests and sends responses.
- Parameters - custom information that can be passed to the API.
- Response - the data returned by the API.
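How an endpoint and parameters combine into a single request can be shown without any network access. In the Python sketch below the endpoint `https://api.example.com/search` and the parameter names are hypothetical placeholders; a real API defines its own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, for illustration only.
endpoint = "https://api.example.com/search"
params = {
    "api-key": "YOUR-API-KEY",
    "q": "northern ireland",
    "page-size": 1,
}

# Parameters are appended to the endpoint as a URL-encoded query string.
request_url = endpoint + "?" + urlencode(params)
print(request_url)
# https://api.example.com/search?api-key=YOUR-API-KEY&q=northern+ireland&page-size=1

# The server side (or a curious user) can split the URL apart again.
parsed_query = parse_qs(urlsplit(request_url).query)
print(parsed_query["q"])  # ['northern ireland']
```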

Authentication

Many APIs require a key or token.
Most APIs are rate-limited.
- E.g. restrictions by user/key/IP address/time period.
Make sure that you understand the terms of service/use.
Even providers of public free APIs can impose some restrictions.
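Two habits help in practice: keeping the key out of the script and pacing requests. The Python sketch below shows one possible way to do both. The environment variable name `EXAMPLE_API_KEY`, the helper names and the interval are all assumptions made for this illustration; the actual limits are set by each provider's terms.

```python
import os
import time

def load_api_key(variable_name="EXAMPLE_API_KEY"):
    """Read the key from an environment variable instead of the script."""
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(f"Set the {variable_name} environment variable first.")
    return key

def throttled(items, min_interval_seconds):
    """Yield items no faster than one per min_interval_seconds."""
    last_time = None
    for item in items:
        if last_time is not None:
            wait = min_interval_seconds - (time.monotonic() - last_time)
            if wait > 0:
                time.sleep(wait)
        last_time = time.monotonic()
        yield item

# Usage: space out (hypothetical) queries by at least 0.1 seconds.
for query in throttled(["ireland", "dublin"], min_interval_seconds=0.1):
    print("would send a request for:", query)
```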

Example: Guardian API

library("httr")

# It is a good idea to not hard-code the API key in the script
# and, instead, load it dynamically from a file
api_key <- readLines("../temp/guardian_api_key.txt")

# Endpoint
base_url <- "https://content.guardianapis.com/search"

# Parameters
params <- list(
  "api-key" = api_key,
  "q" = "ireland",
  "page-size" = 1
)

# Make the request and receive the response
response <- httr::GET(url = base_url, query = params)

# Check the status of the response (200 means successful)
# Status codes 4xx are client errors; 5xx are server errors
response$status_code

[1] 200

response

Response [https://content.guardianapis.com/search?api-key=<redacted>&q=ireland&page-size=1]
  Date: 2025-04-10 09:15
  Status: 200
  Content-Type: application/json
  Size: 684 B
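The status code classes mentioned in the comments can be captured in a small helper. This Python sketch is independent of any HTTP library; the function name `describe_status` is made up for the illustration.

```python
def describe_status(status_code):
    """Map an HTTP status code to its broad class."""
    if 100 <= status_code < 200:
        return "informational"
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirection"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "unknown"

# 429 (Too Many Requests) is the code typically returned when a rate limit is hit.
for code in (200, 404, 429, 503):
    print(code, describe_status(code))
```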

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format.
It is commonly used in web APIs (as well as elsewhere, e.g. Jupyter Notebooks).
At its core, a JSON object is a set of key-value pairs (often deeply nested).
Keys have to be strings in double quotes.
Values can be one of the following types:
- String (e.g., "example")
- Number (e.g., 42, 3.141)
- Object (e.g., {"key": "value"})
- Array (e.g., ["a", "b", "c"])
- Boolean (e.g., true, false)
- null
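Python's standard `json` module shows how each JSON value type is mapped to a native type. The JSON document in this sketch is invented for the illustration.

```python
import json

document = """
{
    "string": "example",
    "integer": 42,
    "decimal": 3.141,
    "array": ["a", "b", "c"],
    "boolean": true,
    "nothing": null,
    "object": {"nested": "value"}
}
"""

parsed = json.loads(document)
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)

# Keys in single quotes are not valid JSON and are rejected by the parser.
try:
    json.loads("{'key': 1}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)  # False
```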

Extra

JSON Syntax

JSON: Example

library("jsonlite")

json <- httr::content(response, as = "text", encoding = "UTF-8")

json |>
  jsonlite::prettify()

{
    "response": {
        "status": "ok",
        "userTier": "developer",
        "total": 113007,
        "startIndex": 1,
        "pageSize": 1,
        "currentPage": 1,
        "pages": 113007,
        "orderBy": "relevance",
        "results": [
            {
                "id": "us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland",
                "type": "article",
                "sectionId": "us-news",
                "sectionName": "US news",
                "webPublicationDate": "2025-04-03T10:53:49Z",
                "webTitle": "Trump tariffs could undermine Brexit deal in Northern Ireland",
                "webUrl": "https://www.theguardian.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland",
                "apiUrl": "https://content.guardianapis.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland",
                "isHosted": false,
                "pillarId": "pillar/news",
                "pillarName": "News"
            }
        ]
    }
}

JSON Representation

As JSON is just a text format, its representation in code will vary by language.
- E.g. in R JSON is parsed into a list, in Python into a dictionary.

json_parsed <- jsonlite::fromJSON(json)

str(json_parsed)

List of 1
 $ response:List of 9
  ..$ status     : chr "ok"
  ..$ userTier   : chr "developer"
  ..$ total      : int 113007
  ..$ startIndex : int 1
  ..$ pageSize   : int 1
  ..$ currentPage: int 1
  ..$ pages      : int 113007
  ..$ orderBy    : chr "relevance"
  ..$ results    :'data.frame': 1 obs. of  11 variables:
  .. ..$ id                : chr "us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland"
  .. ..$ type              : chr "article"
  .. ..$ sectionId         : chr "us-news"
  .. ..$ sectionName       : chr "US news"
  .. ..$ webPublicationDate: chr "2025-04-03T10:53:49Z"
  .. ..$ webTitle          : chr "Trump tariffs could undermine Brexit deal in Northern Ireland"
  .. ..$ webUrl            : chr "https://www.theguardian.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland"
  .. ..$ apiUrl            : chr "https://content.guardianapis.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland"
  .. ..$ isHosted          : logi FALSE
  .. ..$ pillarId          : chr "pillar/news"
  .. ..$ pillarName        : chr "News"

JSON Representation

dim(json_parsed$response$results)

[1]  1 11

head(json_parsed$response$results)

                                                              id    type
1 us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland article
  sectionId sectionName   webPublicationDate
1   us-news     US news 2025-04-03T10:53:49Z
                                                       webTitle
1 Trump tariffs could undermine Brexit deal in Northern Ireland
                                                                                      webUrl
1 https://www.theguardian.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
                                                                                           apiUrl
1 https://content.guardianapis.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
  isHosted    pillarId pillarName
1    FALSE pillar/news       News
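The Python counterpart works with nested dictionaries and lists rather than a data frame. The sketch below parses a trimmed copy of the response shown earlier (several fields are omitted to keep it short) and navigates to the first result.

```python
import json

# Trimmed copy of the API response; most fields are left out.
json_text = """
{
    "response": {
        "status": "ok",
        "total": 113007,
        "pageSize": 1,
        "results": [
            {
                "id": "us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland",
                "type": "article",
                "webTitle": "Trump tariffs could undermine Brexit deal in Northern Ireland",
                "isHosted": false
            }
        ]
    }
}
"""

parsed = json.loads(json_text)

# JSON objects become dicts and arrays become lists.
results = parsed["response"]["results"]
first_result = results[0]

print(type(parsed).__name__, type(results).__name__)  # dict list
print(len(results), first_result["webTitle"])
print(first_result["isHosted"])  # False
```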

Next

Tutorial: Text Preprocessing and APIs
Next week: Classifying texts
Assignment 1: Due 15:59 on Wednesday, 26th March (submission on Blackboard)
