# Week 2 Tutorial: Preprocessing and APIs

POP77142 Quantitative Text Analysis for Social Scientists

## PDF Files

-   PDF (**P**ortable **D**ocument **F**ormat) is a file format that
    captures all the elements of a printable document.
-   It is ubiquitous for digital documents, and PDF files often contain
    a large amount of textual data.
-   However, extracting text from PDFs can be challenging, as a document
    contains a lot of additional information (e.g. images, layouts,
    fonts). That is before we even consider non-machine-readable PDFs
    that store text as images.
-   Here we will consider a data source commonly used in political
    science: party election manifestos, which are often distributed in
    PDF format.

> **Extra**
>
> [Intro to PDF by Leonard
> Rosenthol](https://www.youtube.com/watch?v=KmP7pbcAl-8)

## PDF as Text File

In [None]:
# Extract manifestos from the archive in the terminal
# Alternatively, you can use a GUI for this
unzip -o ../data/ireland_ge_2024_manifestos.zip -d ../temp/

Archive:  ../data/ireland_ge_2024_manifestos.zip
  inflating: ../temp/ireland_ge_2024_manifestos/AO.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/FF.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/FG.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/GR.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/II.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/LAB.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/PBP.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/SD.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/SF.pdf  

. . .

In [None]:
# List files and show their sizes (-s) in human-readable format (-h)
ls -sh ../temp/ireland_ge_2024_manifestos

total 71M
9.8M AO.pdf
 14M FF.pdf
6.5M FG.pdf
 22M GR.pdf
2.0M II.pdf
592K LAB.pdf
1.4M PBP.pdf
 11M SD.pdf
6.1M SF.pdf

. . .

In [None]:
manifestos=$(ls ../temp/ireland_ge_2024_manifestos)

. . .

In [None]:
echo $manifestos

## PDF as Text File

-   As with other file formats, it is a good idea to first try opening a
    file in a text editor.
-   Here the simple `more` command is used to display the content of the
    first manifesto.

In [None]:
# $manifestos holds one file name per line, so take the first line
more "../temp/ireland_ge_2024_manifestos/$(echo "$manifestos" | head -n 1)"

. . .

-   For a more user-friendly and text-focussed view, you might also try
    the `less` command (note that it is not directly available on
    Windows).

In [None]:
less "../temp/ireland_ge_2024_manifestos/$(echo "$manifestos" | head -n 1)"
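
Opened as text, a PDF typically starts with a short header such as `%PDF-1.7`, followed by objects whose content streams are usually compressed, which is why most of what a pager shows is unreadable. As a minimal sketch, the header can be read directly from the shell. The helper name `pdf_version` is hypothetical, and the sample file is a stand-in rather than a real manifesto.

```shell
# Hypothetical helper: print the PDF version declared in the file header
pdf_version() {
  prefix='%PDF-'
  # The header occupies the first bytes of the file, e.g. "%PDF-1.7"
  header=$(head -c 8 "$1")
  case "$header" in
    "$prefix"*) printf '%s\n' "${header#"$prefix"}" ;;
    *) echo "not a PDF header" >&2; return 1 ;;
  esac
}

# Stand-in file for illustration; replace with a real path such as
# ../temp/ireland_ge_2024_manifestos/AO.pdf
sample=$(mktemp)
printf '%%PDF-1.7\n' > "$sample"
pdf_version "$sample"   # prints 1.7
```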

## Exercise 1: Working with PDFs

-   In this exercise we will consider party manifestos from the last
    Irish General Election.
-   The ZIP archive with all main party manifestos is available on
    Blackboard.
-   Download the archive and extract all PDF files to a temporary
    directory.
-   Use the `pdftools` package (or Python’s `pypdf` if you prefer a
    Python approach) to extract text from the first PDF file in the
    directory.
-   Tokenise the text and count the number of tokens.
-   What if you stem the tokens?
-   Now repeat this for all party manifestos and create a data frame of
    the following form:

| party | year | text | npages | ntokens | ntypes |
|-------|------|------|--------|---------|--------|
| …     | …    | …    | …      | …       | …      |

where:

-   `party` is the party name,
-   `year` is the year of the election,
-   `text` is the text of the manifesto,
-   `npages` is the number of pages in the manifesto,
-   `ntokens` is the number of tokens in the manifesto,
-   `ntypes` is the number of types in the manifesto.
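
To make the distinction between tokens and types concrete, here is a rough shell sketch. It lower-cases the text and splits on anything that is not a letter or digit, which is much cruder than a dedicated tokeniser, so its counts will generally differ from those produced in R. The function names are hypothetical.

```shell
# Split standard input into lower-case alphanumeric tokens, one per line
tokenise() {
  tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sed '/^$/d'
}

# Tokens: every occurrence counts
count_tokens() {
  tokenise | awk 'END { print NR }'
}

# Types: each distinct token counts once
count_types() {
  tokenise | sort -u | awk 'END { print NR }'
}

text='The cat saw the other cat.'
printf '%s\n' "$text" | count_tokens   # prints 6
printf '%s\n' "$text" | count_types    # prints 4
```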

##

In [None]:
library("pdftools")

Using poppler version 24.02.0

. . .

In [None]:
# List all PDF files in the directory
manifestos <- list.files(
  "../temp/ireland_ge_2024_manifestos",
  pattern = "\\.pdf$",
  full.names = TRUE
)
manifestos

[1] "../temp/ireland_ge_2024_manifestos/AO.pdf" 
[2] "../temp/ireland_ge_2024_manifestos/FF.pdf" 
[3] "../temp/ireland_ge_2024_manifestos/FG.pdf" 
[4] "../temp/ireland_ge_2024_manifestos/GR.pdf" 
[5] "../temp/ireland_ge_2024_manifestos/II.pdf" 
[6] "../temp/ireland_ge_2024_manifestos/LAB.pdf"
[7] "../temp/ireland_ge_2024_manifestos/PBP.pdf"
[8] "../temp/ireland_ge_2024_manifestos/SD.pdf" 
[9] "../temp/ireland_ge_2024_manifestos/SF.pdf" 

. . .

In [None]:
# Extract text from the first PDF file
txt <- pdftools::pdf_text(manifestos[1])

# Note that this created a vector whose length
# is equal to the number of pages in the PDF
length(txt)

[1] 104

##

In [None]:
head(txt)

[1] "       Our\nCommon Sense\n  Manifesto 2024\n"

##

In [None]:
library("quanteda")

Package version: 4.2.0
Unicode version: 15.1
ICU version: 74.2

Parallel computing: disabled

See https://quanteda.io for tutorials and examples.

. . .

In [None]:
# Tokenise the text
toks <- quanteda::tokens(txt, remove_punct = TRUE)
ntokens <- quanteda::ntoken(toks)
sum(ntokens)

[1] 28423

. . .

In [None]:
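# Types per page; the sum counts a type once for every page it
# appears on, so it overstates the manifesto-level number of types
ntypes <- quanteda::ntype(toks)
sum(ntypes)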

[1] 16084
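
Because the text vector has one element per page, type counts are computed page by page, and a word that occurs on several pages contributes to each of them. The toy example below uses two made-up one-line "pages" and a hypothetical helper `n_types` to show how the page-level sum can exceed the number of distinct types in the whole document.

```shell
# Hypothetical helper: count distinct lower-cased alphanumeric tokens
n_types() {
  tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sed '/^$/d' |
    sort -u | awk 'END { print NR }'
}

# Two made-up "pages" that share the words "the" and "sat"
page1='the cat sat'
page2='the dog sat'

# Page-level counts, summed
p1=$(printf '%s\n' "$page1" | n_types)
p2=$(printf '%s\n' "$page2" | n_types)
page_sum=$((p1 + p2))

# Count over the pages pasted together
doc_types=$(printf '%s\n%s\n' "$page1" "$page2" | n_types)

echo "sum of page-level types: $page_sum"    # 6
echo "document-level types:    $doc_types"   # 4
```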

## Exercise 2: Working with APIs

-   In this part we will try using the [Guardian
    API](https://open-platform.theguardian.com/) to collect newspaper
    articles.
-   First, you will need to go through a simple registration process to
    obtain an API key.
-   The main **endpoint** of the Guardian API is the content or [search
    endpoint](https://open-platform.theguardian.com/documentation/search).
-   Study the documentation of the search endpoint.
-   Below is a simple example of working with the Guardian API to search
    for articles related to Ireland.
-   Modify the search query to search for articles related to elections
    in Ireland.
-   Set the time frame to around the time of the last GE (e.g. between
    1 November and 31 December 2024).
-   How many articles are returned?
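
One way to approach the date restriction is through the search endpoint's `from-date` and `to-date` query parameters, which take dates in `YYYY-MM-DD` format. The sketch below only assembles the request URL. The helper `build_search_url`, the query string and the `YOUR_API_KEY` placeholder are illustrative assumptions, and the actual request is left commented out.

```shell
base_url="https://content.guardianapis.com/search"

# Hypothetical helper: assemble a search URL from query, dates and key
build_search_url() {
  # Encode spaces only; a full URL encoder would handle more characters
  query=$(printf '%s' "$1" | sed 's/ /%20/g')
  printf '%s?q=%s&from-date=%s&to-date=%s&page-size=10&api-key=%s\n' \
    "$base_url" "$query" "$2" "$3" "$4"
}

url=$(build_search_url "ireland election" "2024-11-01" "2024-12-31" "YOUR_API_KEY")
echo "$url"

# To send the request (requires a valid key and network access):
# curl -s "$url"
```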

##

In [None]:
library("httr")

. . .

In [None]:
# It is a good idea to not hard-code the API key in the script
# and, instead, load it dynamically from a file
api_key <- readLines("../temp/guardian_api_key.txt")
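
The same idea carries over to the shell. A minimal sketch, assuming the key is stored on the first line of a plain-text file; the helper name `read_api_key` and the demo key are made up.

```shell
# Hypothetical helper: read an API key from the first line of a file
read_api_key() {
  # Strip whitespace, including a trailing newline or carriage return
  head -n 1 "$1" | tr -d '[:space:]'
}

# Stand-in key file for illustration; a real key would live in a file
# such as ../temp/guardian_api_key.txt
key_file=$(mktemp)
printf 'not-a-real-key\r\n' > "$key_file"

api_key=$(read_api_key "$key_file")
echo "key length: ${#api_key}"
```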

. . .

In [None]:
# Define the URL for the API
base_url <- "https://content.guardianapis.com/search"

. . .

In [None]:
# Define the query parameters
params <- list(
  "api-key" = api_key,
  "q" = "ireland",
  "page-size" = 10
)

##

In [None]:
# Make the request and receive the response
response <- httr::GET(url = base_url, query = params)

# Check the status of the response
response$status_code

[1] 200
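
A status code of 200 means the request succeeded. More generally, codes in the 2xx range signal success, 4xx a problem with the request (such as a missing or invalid key), and 5xx a problem on the server side. A small shell sketch of that classification, with a hypothetical function name:

```shell
# Hypothetical helper: map an HTTP status code to its broad class
describe_status() {
  case "$1" in
    2[0-9][0-9]) echo "success" ;;
    4[0-9][0-9]) echo "client error" ;;
    5[0-9][0-9]) echo "server error" ;;
    *) echo "other" ;;
  esac
}

describe_status 200   # prints success

# With curl, the code alone can be captured like this (needs network access):
# curl -s -o /dev/null -w '%{http_code}' "https://content.guardianapis.com/search?api-key=YOUR_API_KEY"
```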

. . .

In [None]:
# Extract the content of the response
json <- httr::content(response, as = "text", encoding = "UTF-8")

. . .

In [None]:
# Parse the JSON string into a nested R list
json_parsed <- jsonlite::fromJSON(json)

. . .

In [None]:
str(json_parsed)

List of 1
 $ response:List of 9
  ..$ status     : chr "ok"
  ..$ userTier   : chr "developer"
  ..$ total      : int 113007
  ..$ startIndex : int 1
  ..$ pageSize   : int 10
  ..$ currentPage: int 1
  ..$ pages      : int 11301
  ..$ orderBy    : chr "relevance"
  ..$ results    :'data.frame': 10 obs. of  11 variables:
  .. ..$ id                : chr [1:10] "us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland" "world/2024/dec/02/ireland-election-2024-latest-results" "commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump" "sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live" ...
  .. ..$ type              : chr [1:10] "article" "article" "article" "liveblog" ...
  .. ..$ sectionId         : chr [1:10] "us-news" "world" "commentisfree" "sport" ...
  .. ..$ sectionName       : chr [1:10] "US news" "World news" "Opinion" "Sport" ...
  .. ..$ webPublicationDate: chr [1:10] "2025-04-03T10:53:49Z" "2024-12-03T07:45:44Z" "2025-03-14T11:40:1
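
The `pages` field follows from `total` and `pageSize`: 113007 results at 10 per page need 11301 pages, since the last page is only partly filled. The integer arithmetic can be checked in the shell (variable names are illustrative):

```shell
total=113007
page_size=10

# Ceiling division using integer arithmetic
pages=$(( (total + page_size - 1) / page_size ))
echo "$pages"   # prints 11301
```

Retrieving further results would mean repeating the request with the search endpoint's `page` parameter set to 2, 3, and so on.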

##

In [None]:
head(json_parsed$response$results)

                                                                                      id
1                         us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
2                                 world/2024/dec/02/ireland-election-2024-latest-results
3        commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump
4                      sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live
5 sport/2025/feb/01/ireland-england-six-nations-rugby-union-player-ratings-aviva-stadium
6             uk-news/2024/dec/26/police-surveillance-of-journalists-in-northern-ireland
      type     sectionId sectionName   webPublicationDate
1  article       us-news     US news 2025-04-03T10:53:49Z
2  article         world  World news 2024-12-03T07:45:44Z
3  article commentisfree     Opinion 2025-03-14T11:40:10Z
4 liveblog         sport       Sport 2025-02-22T16:41:09Z
5  article         sport       Sport 2025-02-01T20:14:35Z
6  article       uk-news     