Week 2 Tutorial:
Preprocessing and APIs

POP77142 Quantitative Text Analysis for Social Scientists

PDF Files

  • PDF (Portable Document Format) is a file format that captures all the elements of a printable document.
  • It is ubiquitous for digital documents and often contains a large amount of textual data.
  • However, extracting text from PDFs can be challenging because a document contains a lot of additional information (e.g. images, layouts, fonts). That is before we even consider non-machine-readable PDFs that store text as images.
  • Here we will consider a data source commonly used in political science: electoral manifestos of parties, which are often distributed in PDF format.
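
A rough way to spot pages that store text as images is to look for pages from which no text could be extracted. The sketch below uses a made-up vector pages in place of real extraction output (one string per page, the shape that pdftools::pdf_text() returns) and assumes that image-only pages typically come back as empty or whitespace-only strings.

```r
# Toy stand-in for text extracted from a PDF: one string per page.
# Pages 2 and 3 mimic scanned pages without a text layer.
pages <- c("Manifesto 2024\nHousing policy", "", "   \n")

# Flag pages whose extracted text is empty or whitespace only
no_text <- !nzchar(trimws(pages))

# Page numbers that would need OCR or manual inspection
which(no_text)
```

Pages flagged this way would need optical character recognition (for example pdftools::pdf_ocr_text(), which relies on the tesseract package) before any text analysis.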

PDF as Text File

# Extract manifestos from the archive in terminal
# Alternatively, you can use a GUI for this
unzip -o ../data/ireland_ge_2024_manifestos.zip -d ../temp/
Archive:  ../data/ireland_ge_2024_manifestos.zip
  inflating: ../temp/ireland_ge_2024_manifestos/AO.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/FF.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/FG.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/GR.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/II.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/LAB.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/PBP.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/SD.pdf  
  inflating: ../temp/ireland_ge_2024_manifestos/SF.pdf  
# List files and show their sizes (-s) in human-readable format (-h)
ls -sh ../temp/ireland_ge_2024_manifestos
total 71M
9.8M AO.pdf
 14M FF.pdf
6.5M FG.pdf
 22M GR.pdf
2.0M II.pdf
592K LAB.pdf
1.4M PBP.pdf
 11M SD.pdf
6.1M SF.pdf
# Store the file names in a Bash array (word splitting is safe here
# because the file names contain no spaces)
manifestos=($(ls ../temp/ireland_ge_2024_manifestos))
echo "${manifestos[@]}"

PDF as Text File

  • As with other file formats, it is a good idea to first try opening a file in a text editor.
  • Here the simple more command is used to display the content of the first manifesto (Bash arrays are indexed from 0).
more "../temp/ireland_ge_2024_manifestos/${manifestos[0]}"
  • For a more user-friendly, text-focused view you might try the less command as well (note that it is not directly available on Windows).
less "../temp/ireland_ge_2024_manifestos/${manifestos[0]}"

Exercise 1: Working with PDFs

  • In this exercise we will consider party manifestos from the last Irish General Election.
  • The ZIP archive with all main party manifestos is available on Blackboard.
  • Download the archive and extract all PDF files to a temporary directory.
  • Use the pdftools package (you might try Python’s pypdf if you are interested in a Python approach) to extract text from the first PDF file in the directory.
  • Tokenise the text and count the number of tokens.
  • What if you stem the tokens?
  • Now repeat this for all party manifestos and create a data frame of the following form:
party year text npages ntokens ntypes

where:

  • party is the party name,
  • year is the year of the election,
  • text is the text of the manifesto,
  • npages is the number of pages in the manifesto,
  • ntokens is the number of tokens in the manifesto,
  • ntypes is the number of types in the manifesto.

library("pdftools")
# List all PDF files in the directory
manifestos <- list.files(
  "../temp/ireland_ge_2024_manifestos",
  pattern = "\\.pdf$",
  full.names = TRUE
)
manifestos
[1] "../temp/ireland_ge_2024_manifestos/AO.pdf" 
[2] "../temp/ireland_ge_2024_manifestos/FF.pdf" 
[3] "../temp/ireland_ge_2024_manifestos/FG.pdf" 
[4] "../temp/ireland_ge_2024_manifestos/GR.pdf" 
[5] "../temp/ireland_ge_2024_manifestos/II.pdf" 
[6] "../temp/ireland_ge_2024_manifestos/LAB.pdf"
[7] "../temp/ireland_ge_2024_manifestos/PBP.pdf"
[8] "../temp/ireland_ge_2024_manifestos/SD.pdf" 
[9] "../temp/ireland_ge_2024_manifestos/SF.pdf" 
# Extract text from the first PDF file
txt <- pdftools::pdf_text(manifestos[1])

# Note that this created a vector whose length
# is equal to the number of pages in the PDF
length(txt)
[1] 104

head(txt)
[1] "       Our\nCommon Sense\n  Manifesto 2024\n"
[2] "                Opening statement\nIn the last year Aontú has come of age. Our vote has surpassed Labour, the Soc\nDens and PbP in the last election. We were the only party that stood with the\npeople in the recent referendums. We made the common-sense argument and we\nwon.\n\nIndeed, Aontú has, for the last five years, Aontú has often stood alone in the Dáil\nfor you. We have held our ground in your interest. Whether it has been raising real\nconcerns about the phenomenal government waste, the disastrous government\nimmigration policy, why 100,000 homes remain empty in the middle of a housing\ncrisis, censorship legislation or carbon taxes on fuel, we have been on the right side\nof the argument – your side.\n\nWe are offering a step change in how this country is governed. We want to end the\nculture of waste, lack of accountability and indifference shown to many across the\nstate who need government to deliver for them.\n\nWe have done that with one TD. Imagine the impact we can have with a team of\nAontú TDs fighting your corner and standing firm on issues that matter to you,\nyour business and your family. That power is now in your hands. You can change\nthe direction of this country with your vote.\n\nWe have proven that we can deliver, and more importantly, that no matter what\nway the wind blows or how many other parties follow the latest trend, we will be\nresolute in what we believe in and won't be afraid to use our voice for you in the\nbest interests of the country.\n\nI am immensely grateful to all of our activists, our newly elected representatives\nand our supporters for standing with us throughout the lifetime of the 33rd Dáil. As\nwe look into the 34th, I am confident that 'Team Common Sense' will challenge the\npolitical establishment like never before.\n\nThe is a wave of support coming towards Aontú at the doors. Opinion polls are\nshowing arise in our support. More and more people are hearing our message. 
For\nthe first time, there is an Aontú candidate running in every constituency, giving\nevery citizen the opportunity to vote Aontú #1.\n\nIn many constituencies the last seat will be between the government and the Aontú\ncandidate. Don't spread your vote so thin over independents and micro-\norganisations that the government candidate still slips through. Concentrate your\nvote on the one party that is listening to you.\n\nThe government is wasting your money. Do not waste your vote. I am asking you\nto vote Aontú #1.\nLe gach deá ghuí,\n\nPeadar Tóibín TD\nLeader of Aontú\n"                                                                                                                                                                                                                            
[3] "                       Ráiteas Tosaigh\nLe bliain anuas tá Aontú tagtha in aois. Tá ár vóta tar éis dul thar Páirtí an Lucht\nOibre, na Daonlaithe Sóisialta agus Daoine Roimh Brabach sa toghchán\ndeireanach. Ba sinne an t-aon pháirtí a sheas leis na daoine sna reifrinn le déanaí.\nRinneamar an argóint chiallmhar agus bhuaigh muid.\n\nGo deimhin, is minic a sheas Aontú, go haonarach le cúig bliana anuas san\nOireachtas agus d'fhán siad leo chun do leasa. Cibé imní a bhí I gceist-imní faoi\ndhramhaíl uafásach an rialtais, faoi bheartas tubaisteach inimirce an rialtais, faoin\nbhfáth go bhfuil 100,000 teach folamh i lár géarchéime tithíochta, faoin\nreachtaíocht chinsireachta nó faoi chánacha carbóin ar bhreosla, bhíomar ar an\ntaobh cheart den argóint – bhur dtaobh.\n\nTáimid ag tairiscint athrú céime ar an dóigh a bhfuil an tír seo á rialú. Ba mhaith\nlinn deireadh a chur le cultúr na dramhaíola, na heaspa cuntasachta agus na\nneamhshuime a léiríodh do go leor ar fud an stáit a bhfuil gá acu le rialtas a bheith\nníos fearr agus seachadadh a dhéanamh ar a son.\n\nTáimid tar éis é sin a dhéanamh le TD amháin. Samhlaigh cad is féidir linn a\ndhéanamh le foireann de Theachtaí Dála Aontú ag troid i do chúinne agus ag\nseasamh go daingean ar cheisteanna atá tábhachtach duitse, do do ghnó agus do\ndo theaghlach. 
Tá an chumhacht sin i do lámha.\n\nTá sé cruthaithe againn gur féidir linn a sheachadadh, agus níos tábhachtaí fós, is\ncuma cén bealach a shéideann an ghaoth nó cé mhéad páirtí eile a leanann an\ntreocht is déanaí, go mbeimid diongbháilte lena gcreidimid ann agus nach mbeidh\neagla orainn ár nguth a úsáid le labhairt ar son á gcreidimid, ar mhaithe le leas na\ntíre.\n\nTáim thar a bheith buíoch dár ngníomhaithe go léir, dár n-ionadaithe nua-thofa\nagus dár lucht tacaíochta as seasamh linn ar feadh shaolré an 33ú Dáil.\nAgus muid ag breathnú isteach ar an 34ú, táim lánchinnte go dtabharfaidh\n'Foireann na Céille' dúshlán don bhunaíocht pholaitiúil mar nach raibh riamh ann.\n\nIs tonn tacaíochta é seo ag teacht i dtreo Aontú ag na doirse. Tá pobalbhreith ag\nteacht chun cinn inár dtacaíocht. Tá níos mó daoine ag éisteacht lenár\ndteachtaireacht. Don chéad uair, tá iarrthóir Aontú ag rith i ngach toghcheantar,\nrud a thugann deis do gach saoránach vóta a chaitheamh in Aontú #1.\n\nBeidh an suíochan deireanach I gcuid mhaith toghcheantar idir an rialtas agus\niarrthóir Aontú. Ná caith do vóta chomh tanaí sin thar eagraíochtaí\nneamhspleácha agus micrea-eagraíochtaí go sleamhnaíonn iarrthóir an rialtais\ntríd fós. Dírigh do vóta ar an bpáirtí amháin atá ag éisteacht leat.\n\nTá do chuid airgid curtha amú ag an rialtas. Ná cuir do vóta amú. Táim ag iarraidh\nort vótáil Aontú #1.\n\nPeadar Tóibín TD\nCeannaire Aontú\n"
[4] "            Contents\n\n\n\n         Housing .... 5\n         Health .... 15\n      Immigration .... 23\n     Cost of Living .... 35\n       Enterprise .... 41\n     Accountability .... 47\n       Economics .... 57\n     Human Rights .... 62\n         Gaeilge .... 72\nDefence & Foreign Affairs .... 77\n Agriculture & Rural Life.... 82\n          Crime .... 94\n    Our candidates .... 104\n"
[5] "        Our\nCommon Sense\n   Housing Policy\n\n\n\n\n  Manifesto 2024\n"
[6] "          8.1%                  100,000\n   increase in rent for\n                                 empty homes\n      new tenants\n\n\n\n\n       9.843                        4,561\nadults who are homeless     children who are homeless\n\n\n\n\n of 18-34 year olds are     Ireland has the 4th highest\n living in their parents’       number of homeless\n           home                  people in Europe\n"

library("quanteda")
# Tokenise the text
toks <- quanteda::tokens(txt, remove_punct = TRUE)
ntokens <- quanteda::ntoken(toks)
sum(ntokens)
[1] 28423
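# Note: ntype() counts types per page, so the sum below counts a word
# once for every page on which it appears; it is an upper bound on the
# number of types in the manifesto as a whole
ntypes <- quanteda::ntype(toks)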
sum(ntypes)
[1] 16084
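
For the stemming question, one possible approach is quanteda::tokens_wordstem(). The sketch below is a minimal illustration on a made-up two-page vector (the object pages and its contents are assumptions, not manifesto text). It collapses the pages into a single document first, so that types are counted once for the whole text rather than once per page.

```r
library("quanteda")

# Toy stand-in for the output of pdftools::pdf_text(): one string per page
pages <- c(
  "Housing costs are rising. We will build houses.",
  "Rising rents hurt renters. We will build more housing."
)

# Per-page tokenisation: each page is treated as its own document
toks_pages <- quanteda::tokens(pages, remove_punct = TRUE)
ntypes_by_page <- sum(quanteda::ntype(toks_pages))

# Collapse all pages into a single document before tokenising
doc <- paste(pages, collapse = "\n")
toks_doc <- quanteda::tokens(doc, remove_punct = TRUE)
ntokens_doc <- unname(quanteda::ntoken(toks_doc))
ntypes_doc <- unname(quanteda::ntype(toks_doc))

# Stem the tokens with the English Snowball stemmer
toks_stem <- quanteda::tokens_wordstem(toks_doc, language = "english")
ntokens_stem <- unname(quanteda::ntoken(toks_stem))
ntypes_stem <- unname(quanteda::ntype(toks_stem))

# Compare the counts side by side
c(
  tokens = ntokens_doc,
  types_summed_over_pages = ntypes_by_page,
  types = ntypes_doc,
  stemmed_types = ntypes_stem
)
```

Stemming leaves the number of tokens unchanged but can only reduce (or keep) the number of types, because different word forms are mapped to a common stem.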

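The last part of the exercise, one row per manifesto, could be approached along the following lines. This is a sketch rather than a reference solution: summarise_manifesto() is a hypothetical helper, the party label is simply taken from the file name (e.g. "AO") and the year is assumed to be 2024 for every file.

```r
# Requires the quanteda and pdftools packages to be installed

# Hypothetical helper: summarise one manifesto from its vector of page texts
summarise_manifesto <- function(pages, party, year) {
  # Collapse pages so that tokens and types are counted for the whole document
  text <- paste(pages, collapse = "\n")
  toks <- quanteda::tokens(text, remove_punct = TRUE)
  data.frame(
    party = party,
    year = year,
    text = text,
    npages = length(pages),
    ntokens = unname(quanteda::ntoken(toks)),
    ntypes = unname(quanteda::ntype(toks)),
    stringsAsFactors = FALSE
  )
}

manifestos <- list.files(
  "../temp/ireland_ge_2024_manifestos",
  pattern = "\\.pdf$",
  full.names = TRUE
)

# Only run the extraction if the PDF files have been unpacked
if (length(manifestos) > 0) {
  rows <- lapply(manifestos, function(path) {
    summarise_manifesto(
      pages = pdftools::pdf_text(path),
      # Party label assumed to be the file name without its extension
      party = tools::file_path_sans_ext(basename(path)),
      year = 2024L
    )
  })
  manifestos_df <- do.call(rbind, rows)
  # Inspect everything except the (very long) text column
  print(manifestos_df[, setdiff(names(manifestos_df), "text")])
}
```

Keeping the counting logic in a helper that takes page texts, rather than a file path, makes it easy to check on a small hand-written example before running it on the full set of PDFs.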
Exercise 2: Working with APIs

  • In this part we will try using the Guardian API to collect newspaper articles.
  • First, you will need to go through a simple registration process to obtain an API key.
  • The main endpoint of the Guardian API is the content or search endpoint.
  • Study the documentation of the search endpoint.
  • Below is a simple example of working with the Guardian API to search for articles related to Ireland.
  • Modify the search query to search for articles related to elections in Ireland.
  • Set the time frame to around the time of the last GE (e.g. between 1 November and 31 December 2024).
  • How many articles are returned?

library("httr")
# It is a good idea to not hard-code the API key in the script
# and, instead, load it dynamically from a file
api_key <- readLines("../temp/guardian_api_key.txt")
# Define the URL for the API
base_url <- "https://content.guardianapis.com/search"
# Define the query parameters
params <- list(
  "api-key" = api_key,
  "q" = "ireland",
  "page-size" = 10
)

# Make the request and receive the response
response <- httr::GET(url = base_url, query = params)

# Check the status of the response
response$status_code
[1] 200
# Extract the content of the response
json <- httr::content(response, as = "text", encoding = "UTF-8")
json_parsed <- jsonlite::fromJSON(json)
str(json_parsed)
List of 1
 $ response:List of 9
  ..$ status     : chr "ok"
  ..$ userTier   : chr "developer"
  ..$ total      : int 113007
  ..$ startIndex : int 1
  ..$ pageSize   : int 10
  ..$ currentPage: int 1
  ..$ pages      : int 11301
  ..$ orderBy    : chr "relevance"
  ..$ results    :'data.frame': 10 obs. of  11 variables:
  .. ..$ id                : chr [1:10] "us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland" "world/2024/dec/02/ireland-election-2024-latest-results" "commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump" "sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live" ...
  .. ..$ type              : chr [1:10] "article" "article" "article" "liveblog" ...
  .. ..$ sectionId         : chr [1:10] "us-news" "world" "commentisfree" "sport" ...
  .. ..$ sectionName       : chr [1:10] "US news" "World news" "Opinion" "Sport" ...
  .. ..$ webPublicationDate: chr [1:10] "2025-04-03T10:53:49Z" "2024-12-03T07:45:44Z" "2025-03-14T11:40:10Z" "2025-02-22T16:41:09Z" ...
  .. ..$ webTitle          : chr [1:10] "Trump tariffs could undermine Brexit deal in Northern Ireland" "Ireland election 2024: full results" "Europe is rapidly rearming. Will that leave neutral Ireland defenceless? | Brigid Laffan" "Wales 18-27 Ireland: Six Nations – as it happened" ...
  .. ..$ webUrl            : chr [1:10] "https://www.theguardian.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland" "https://www.theguardian.com/world/2024/dec/02/ireland-election-2024-latest-results" "https://www.theguardian.com/commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump" "https://www.theguardian.com/sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live" ...
  .. ..$ apiUrl            : chr [1:10] "https://content.guardianapis.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland" "https://content.guardianapis.com/world/2024/dec/02/ireland-election-2024-latest-results" "https://content.guardianapis.com/commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump" "https://content.guardianapis.com/sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live" ...
  .. ..$ isHosted          : logi [1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
  .. ..$ pillarId          : chr [1:10] "pillar/news" "pillar/news" "pillar/opinion" "pillar/sport" ...
  .. ..$ pillarName        : chr [1:10] "News" "News" "Opinion" "Sport" ...
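
The metadata fields in this output indicate how many requests would be needed to retrieve every match. The sketch below copies total, pageSize and pages from the str() output above into a plain list (meta is an illustrative object, not part of the API response handling) and checks the arithmetic. The page size of 50 in the second calculation is an assumption about what the API accepts.

```r
# Metadata copied from the str() output above
meta <- list(total = 113007L, pageSize = 10L, pages = 11301L)

# Number of pages implied by the total and the page size
n_pages <- ceiling(meta$total / meta$pageSize)
n_pages == meta$pages

# A larger page size (assumed here to be 50) means fewer requests
n_requests_50 <- ceiling(meta$total / 50)
n_requests_50

# Later pages would be requested by adding a "page" query parameter,
# e.g. for the second page of results:
params_page2 <- list("q" = "ireland", "page-size" = 10, "page" = 2)
```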

head(json_parsed$response$results)
                                                                                      id
1                         us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
2                                 world/2024/dec/02/ireland-election-2024-latest-results
3        commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump
4                      sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live
5 sport/2025/feb/01/ireland-england-six-nations-rugby-union-player-ratings-aviva-stadium
6             uk-news/2024/dec/26/police-surveillance-of-journalists-in-northern-ireland
      type     sectionId sectionName   webPublicationDate
1  article       us-news     US news 2025-04-03T10:53:49Z
2  article         world  World news 2024-12-03T07:45:44Z
3  article commentisfree     Opinion 2025-03-14T11:40:10Z
4 liveblog         sport       Sport 2025-02-22T16:41:09Z
5  article         sport       Sport 2025-02-01T20:14:35Z
6  article       uk-news     UK news 2024-12-26T17:26:05Z
                                                                                  webTitle
1                            Trump tariffs could undermine Brexit deal in Northern Ireland
2                                                      Ireland election 2024: full results
3 Europe is rapidly rearming. Will that leave neutral Ireland defenceless? | Brigid Laffan
4                                        Wales 18-27 Ireland: Six Nations – as it happened
5                                        Ireland 27-22 England: Six Nations player ratings
6                         Police surveillance of journalists in Northern Ireland | Letters
                                                                                                              webUrl
1                         https://www.theguardian.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
2                                 https://www.theguardian.com/world/2024/dec/02/ireland-election-2024-latest-results
3        https://www.theguardian.com/commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump
4                      https://www.theguardian.com/sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live
5 https://www.theguardian.com/sport/2025/feb/01/ireland-england-six-nations-rugby-union-player-ratings-aviva-stadium
6             https://www.theguardian.com/uk-news/2024/dec/26/police-surveillance-of-journalists-in-northern-ireland
                                                                                                                   apiUrl
1                         https://content.guardianapis.com/us-news/2025/apr/03/trump-tariffs-brexit-deal-northern-ireland
2                                 https://content.guardianapis.com/world/2024/dec/02/ireland-election-2024-latest-results
3        https://content.guardianapis.com/commentisfree/2025/mar/14/ireland-neutral-europe-rearm-triple-lock-donald-trump
4                      https://content.guardianapis.com/sport/live/2025/feb/22/wales-ireland-six-nations-rugby-union-live
5 https://content.guardianapis.com/sport/2025/feb/01/ireland-england-six-nations-rugby-union-player-ratings-aviva-stadium
6             https://content.guardianapis.com/uk-news/2024/dec/26/police-surveillance-of-journalists-in-northern-ireland
  isHosted       pillarId pillarName
1    FALSE    pillar/news       News
2    FALSE    pillar/news       News
3    FALSE pillar/opinion    Opinion
4    FALSE   pillar/sport      Sport
5    FALSE   pillar/sport      Sport
6    FALSE    pillar/news       News
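
For the exercise, the query can be narrowed with a search expression and a publication-date window. The sketch below is one possible formulation rather than the definitive answer: the expression "ireland AND election" is an assumption about what counts as election coverage, and the from-date and to-date parameters take dates in YYYY-MM-DD format. The request is only sent if the key file used above exists.

```r
# Requires the httr and jsonlite packages to be installed
base_url <- "https://content.guardianapis.com/search"
key_file <- "../temp/guardian_api_key.txt"

# Query parameters: search terms plus a publication-date window
params <- list(
  "q" = "ireland AND election",
  "from-date" = "2024-11-01",
  "to-date" = "2024-12-31",
  "page-size" = 10
)

# Only send the request if the API key file is available
if (file.exists(key_file)) {
  params[["api-key"]] <- readLines(key_file, n = 1, warn = FALSE)
  response <- httr::GET(url = base_url, query = params)
  # Stop with an informative error on a non-success status code
  httr::stop_for_status(response)
  json <- httr::content(response, as = "text", encoding = "UTF-8")
  json_parsed <- jsonlite::fromJSON(json)
  # Total number of matching articles across all pages
  print(json_parsed$response$total)
}
```

The total field, not the number of rows in results, answers the question of how many articles are returned, since results only holds the current page.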