Module Overview

POP77142 Quantitative Text Analysis for Social Scientists

Author

Tom Paskhalis, Department of Political Science


Introduction

At no time in human history has more textual information been produced than today. Researchers now have access to massive collections of texts by different societal actors: parliamentary speeches and blog posts, corporate press releases and social media posts, newspaper articles and archival documents, to name just a few. At the same time, computational power has reached unprecedented levels, enabling the development and use of practical software to process and analyze huge datasets of text.

This module focuses on a range of computational tools – stemming from the fields of machine learning and natural language processing (NLP) – that are essential for large-scale analyses of text information. The aim is to provide students with a hands-on introduction to processing and analyzing ‘text-as-data’ for the purpose of answering important social science research questions.

Instructors

Module Meetings

Week  Date      Topic              Released      Due
8     12 March  Introduction       Assignment 1
9     19 March  Quantifying Texts
10    26 March  Classifying Texts  Assignment 2  Assignment 1
11    2 April   Modelling Texts
12    9 April   Beyond BOW                       Assignment 2

Prerequisites

This is an intermediate-level class focussing on representing text as quantitative data. The course assumes that you have a basic understanding of statistics and are comfortable with key programming concepts in R and/or Python.

Software

In this class we will use R to work with data. R is a free, open-source, interactive programming language for statistical analysis. RStudio is a versatile editor for working with R code and data that provides a more intuitive interface to many features of the language.

Both R and RStudio are available for all major operating systems (Windows, macOS, Linux). You should install them on your personal computer prior to attending tutorials. Use these links to download the installation files:

Materials

We will primarily be relying on the following core texts for this module:

  • Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, NJ: Princeton University Press
  • Daniel Jurafsky and James H. Martin. 2025. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. draft. https://web.stanford.edu/~jurafsky/slp3/ed3book_Jan25.pdf

Some other useful texts on natural language processing and text analysis:

Finally, I highly recommend taking a look at the foundational content analysis text, which was largely developed in the pre-digital era but nevertheless provides an in-depth overview of many topics (largely pertaining to manual coding of text data) that are still highly relevant today:

  • Klaus Krippendorff. 2019. Content Analysis: An Introduction to Its Methodology. 4th ed. Thousand Oaks, CA: SAGE Publications

Additional online resources:

See syllabus for further details.

Assessment

  • 2 programming exercises (20% each)

  • Research paper (60%)

    • Approximately 10 pages and 3,000-4,000 words (references excluded)

The final research paper will be due by 23:59 Wednesday, 23 April 2025.

See syllabus for further details.

Plagiarism

Plagiarism, defined by the College as the act of presenting the work of others as one’s own work without acknowledgement, is unacceptable under any circumstances. All submitted coursework must be individual and original (for example, you should not re-use parts of a paper you wrote for another module). You need to reference any literature you use in the correct manner. This applies to quotations as well as to summaries of someone else’s ideas in your own words. When in doubt, consult the lecturer before you hand in an assignment. Plagiarism is regarded as a major offence that will have serious implications. For more information on the College policy on plagiarism, please see the avoiding plagiarism guide. All students must complete the online tutorial on avoiding plagiarism, which can be found on this webpage.