Module Overview

POP77032 Quantitative Text Analysis for Social Scientists

Author

Tom Paskhalis, Department of Political Science

Introduction Introduction (pdf)

At no time in human history has more textual information been produced than in the present day. Researchers now have access to massive collections of texts by different societal actors: parliamentary speeches and blog posts, corporate press releases and social media posts, newspaper articles and archival documents, to name just a few. At the same time, computational power has reached unprecedented levels and has enabled the development and use of practical software to process and analyze huge datasets of text.

This module focuses on a range of computational tools – stemming from the fields of machine learning and natural language processing (NLP) – that are essential for large-scale analyses of text information. The aim is to provide students with a hands-on introduction to processing and analyzing ‘text-as-data’ for the purpose of answering important social science research questions.

Instructors

Tom Paskhalis, Office Hours: Friday 11:00-13:00 in-person or online (booking required)
Teaching Fellows:
- Sara Cid

Module Meetings

2-hour lecture
- Until Reading Week - Wednesday 16:00-18:00 in 4050A Arts Building
- After Reading Week - Wednesday 16:00-18:00 in 5052 Arts Building
2-hour tutorial
- Until Reading Week - Friday 14:00-16:00 in AP0.09 Aras an Phiarsaigh
- After Reading Week - Friday 14:00-16:00 in 1.24 D’Olier Street

Week	Date	Topic	Released	Due
1	21 January	Introduction
2	28 January	Words and Tokens	Assignment 1
3	4 February	Quantifying Texts
4	11 February	Dictionaries and Sentiment		Assignment 1
5	18 February	Supervised Modelling	Assignment 2
6	25 February	Unsupervised Modelling
7	4 March	-		Assignment 2
8	11 March	Beyond Bag-of-Words
9	18 March	Embeddings	Assignment 3
10	25 March	Neural Networks I
11	1 April	Neural Networks II		Assignment 3
12	8 April	Large Language Models

Prerequisites

This is an intermediate-level class focussing on representing and analysing text as quantitative data. The course assumes that you have a basic understanding of statistics and quantitative research methods (up to generalised linear models such as logistic and multinomial regression, which matter particularly for the second part of the module) and are comfortable with key programming concepts in both R and Python.

Software

In this class we will use both the R and Python programming languages, with the former being widely used in social science research and the latter being the dominant language in the field of natural language processing.

Both R and Python are freely available for all major operating systems (Windows, macOS, Linux). You should install them on your personal computer prior to attending tutorials. Use these links to download the installation files:

R - https://cran.r-project.org/
Python - https://www.python.org/downloads/
Jupyter - https://jupyter.org/install
RStudio - https://posit.co/download/rstudio-desktop/
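
Once everything is installed, a quick check along the following lines may help confirm that the tools can be found from the command line. This is only a minimal sketch, not part of the module materials: the function name check_tools is hypothetical, and it assumes that the R and Jupyter installers have put the Rscript and jupyter commands on your PATH, which is not always the case (on Windows in particular, R is often not added to PATH by default). RStudio is a desktop application and is not checked here.

```python
import shutil
import sys


def check_tools(tools=("Rscript", "jupyter")):
    """Map each command name to its location on PATH, or None if it is not found.

    The default command names are an assumption about a typical installation.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Version of the Python interpreter running this script
    print("Python", sys.version.split()[0])
    for tool, path in check_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A result of "not found on PATH" does not necessarily mean the installation failed; it may only mean that the program has to be started from its own shortcut or that PATH needs adjusting.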

Materials

We will primarily be relying on the following core texts for this module:

Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press
Daniel Jurafsky and James H. Martin. 2026. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Draft. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan26.pdf

Some other useful texts on natural language processing and text analysis:

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press
Jacob Eisenstein. 2019. Introduction to Natural Language Processing. Cambridge, MA: The MIT Press. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

Finally, I highly recommend taking a look at the foundational content analysis text, which was largely developed in the pre-digital era but nevertheless provides an in-depth overview of many topics (largely pertaining to manual annotation of text data) that are still highly relevant today:

Klaus Krippendorff. 2019. Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE Publications

Additional online resources:

See syllabus for further details.

Assessment

3 programming exercises (40% total)
Research paper (60%)
- Approximately 5-10 pages and 3,000 words (references excluded)

The final research paper will be due by 23:59 Wednesday, 22 April 2026.

See syllabus for further details.

Plagiarism Policy

Plagiarism – defined by the College as the act of presenting the work of others as one’s own work, without acknowledgement – is unacceptable under any circumstances. All submitted coursework must be individual and original (you should not re-use parts of a paper you wrote for another module, for example). You need to reference any literature you use in the correct manner. This applies to quotations as well as to summarising someone else’s ideas in your own words. When in doubt, consult the lecturer before you hand in an assignment. Plagiarism is regarded as a major offence that will have serious implications. For more information on the College policy on plagiarism, please see the avoiding plagiarism guide. All students must complete the online tutorial on avoiding plagiarism, which can be found on this webpage.

Generative AI Policy

The use of generative AI tools is permitted for this module, provided that their use is fully and transparently acknowledged in any submitted work. Every submission must state the specific version of the model used to assist in the creation of the work. Keep in mind that all information that you share with generative AI will be stored and re-used, so under no circumstances should private, personal or copyrighted information (including any part of this module’s materials) be used in prompts or submitted to any generative AI service. If you have access to hardware that allows you to run generative AI models locally, this should be preferred to using third-party services.