Words & Tokens

Lecture Slides: pdf, ipynb

Tutorial Exercise: pdf, ipynb

This week we focus on the basic building blocks of text analysis: words and tokens. We will look at how to define and identify tokens in text data, as well as techniques for matching and comparing strings, which are essential for tasks such as data cleaning, deduplication, and record linkage.

Required Readings

Jurafsky and Martin. 2026. Ch. 2, “Words and Tokens.”

Additional Readings

Gonzalo Navarro. 2001. “A Guided Tour to Approximate String Matching.” ACM Computing Surveys (CSUR) 33 (1): 31–88. https://doi.org/10.1145/375360.375365
Aaron R. Kaufman and Aja Klevs. 2022. “Adaptive Fuzzy String Matching: How to Merge Data Sets with Only One (Messy) Identifying Field.” Political Analysis 30 (4): 590–596. https://doi.org/10.1017/pan.2021.38

Tutorial

Regular expressions
String distances
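
As a rough preview of the two tutorial topics, here is a minimal Python sketch. It is not taken from the tutorial notebook. The function names `tokenize`, `levenshtein`, and `similarity`, the specific regular expression, and the choice of Levenshtein edit distance as the string distance are all illustrative assumptions. The tutorial itself may use different patterns, distances, or libraries.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens with a regular expression.

    Assumption: a token is a run of ASCII letters/digits, optionally joined
    by single internal apostrophes or hyphens (e.g. "don't", "over-think").
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def levenshtein(a, b):
    """Levenshtein edit distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical helper: rescale edit distance to a 0-1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(tokenize("Don't over-think it!"))   # ['don't', 'over-think', 'it']
print(levenshtein("kitten", "sitting"))   # 3
print(similarity("Jon Smith", "John Smith"))
```

In a record-linkage setting, a score like `similarity` would typically be compared against a threshold to decide whether two names refer to the same entity. The right threshold depends on the data and is not something this sketch tries to settle.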