Module Overview
POP77032 Quantitative Text Analysis for Social Scientists
Syllabus Blackboard
Introduction Introduction (pdf)
At no time in human history has more textual information been produced than in the present day. Researchers now have access to massive collections of texts by different societal actors: parliamentary speeches and blog posts, corporate press releases and social media posts, newspaper articles and archival documents, to name just a few. At the same time, computational power has reached unprecedented levels, enabling the development and use of practical software to process and analyze huge datasets of text.
This module focuses on a range of computational tools – stemming from the fields of machine learning and natural language processing (NLP) – that are essential for large-scale analyses of text information. The aim is to provide students with a hands-on introduction to processing and analyzing ‘text-as-data’ for the purpose of answering important social science research questions.
Instructors
- Tom Paskhalis, Office Hours: Friday 11:00-13:00 in person or online (booking required)
- Teaching Fellows:
Module Meetings
- 2-hour lecture
  - Until Reading Week - Wednesday 16:00-18:00 in 4050A Arts Building
  - After Reading Week - Wednesday 16:00-18:00 in 5052 Arts Building
- 2-hour tutorial
  - Until Reading Week - Friday 14:00-16:00 in AP0.09 Aras an Phiarsaigh
  - After Reading Week - Friday 14:00-16:00 in 1.24 D’Olier Street
| Week | Date | Topic | Released | Due |
|---|---|---|---|---|
| 1 | 21 January | Introduction | | |
| 2 | 28 January | Words and Tokens | Assignment 1 | |
| 3 | 4 February | Quantifying Texts | | |
| 4 | 11 February | Dictionaries and Sentiment | | Assignment 1 |
| 5 | 18 February | Supervised Modelling | Assignment 2 | |
| 6 | 25 February | Unsupervised Modelling | | |
| 7 | 4 March | - | | Assignment 2 |
| 8 | 11 March | Beyond Bag-of-Words | | |
| 9 | 18 March | Embeddings | Assignment 3 | |
| 10 | 25 March | Neural Networks | | |
| 11 | 1 April | Transformers | | Assignment 3 |
| 12 | 8 April | Large Language Models | | |
Prerequisites
This is an intermediate-level class focussing on representing text as quantitative data. The course assumes that you have a basic understanding of statistics and are comfortable with key programming concepts in R and/or Python.
Software
In this class we will use R to work with data. R is a free, open-source, interactive programming language for statistical analysis. RStudio is a versatile editor for working with R code and data that provides a more intuitive interface to many features of the language.
Both R and RStudio are widely available for all major operating systems (Windows, Mac OS, Linux). You should install them on your personal computer prior to attending tutorials. Use these links to download the installation files:
Materials
We will primarily be relying on the following core texts for this module:
- Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press
- Daniel Jurafsky and James H. Martin. 2026. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Draft. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan26.pdf
Some other useful texts on natural language processing and text analysis:
- Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press
- Jacob Eisenstein. 2019. Introduction to Natural Language Processing. Cambridge, MA: The MIT Press. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Finally, I highly recommend taking a look at the foundational content analysis text that was largely developed in the pre-digital era but nevertheless provides an in-depth overview of many topics (largely pertaining to manual coding of text data) that are still highly relevant today:
- Klaus Krippendorff. 2019. Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE Publications
Additional online resources:
See syllabus for further details.
Assessment
3 programming exercises (40% total)
Research paper (60%)
- Approximately 5-10 pages and 3,000 words (references excluded)
The final research paper will be due by 23:59 Wednesday, 22 April 2026.
See syllabus for further details.
Plagiarism Policy
Plagiarism, defined by the College as the act of presenting the work of others as one’s own work without acknowledgement, is unacceptable under any circumstances. All submitted coursework must be individual and original (you should not re-use parts of a paper you wrote for another module, for example). You need to reference any literature you use in the correct manner. This applies to quotations as well as to summarising someone else’s ideas in your own words. When in doubt, consult with the lecturer before you hand in an assignment. Plagiarism is regarded as a major offence that will have serious implications. For more information on the College policy on plagiarism, please see the avoiding plagiarism guide. All students must complete the online tutorial on avoiding plagiarism, which can be found on this webpage.
Generative AI Policy
The use of generative AI tools is permitted for this module, provided that their use is fully and transparently acknowledged in any submitted work. Every submission must state the specific version of the model used to assist in the creation of the work. Keep in mind that all information that you share with generative AI will be stored and re-used, so under no circumstances should private, personal or copyrighted information (including any part of this module’s materials) be used in prompts or submitted to any generative AI service. If you have access to hardware that allows you to run generative AI models locally, this should be preferred to using third-party services.