Words & Tokens
Lecture Slides: (pdf) (ipynb)
Tutorial Exercise: (pdf) (ipynb)
This week we focus on the basic building blocks of text analysis: words and tokens. We will look at how to define and identify tokens in text data, as well as techniques for matching and comparing strings, which are essential for tasks such as data cleaning, deduplication, and record linkage.
Required Readings
- Jurafsky and Martin. 2026. Ch. 2, “Words and Tokens.”
Additional Readings
- Jeffrey E. F. Friedl. 2006. Mastering Regular Expressions. 3rd ed. Sebastopol, CA: O’Reilly.
- Gonzalo Navarro. 2001. “A Guided Tour to Approximate String Matching.” ACM Computing Surveys (CSUR) 33 (1): 31–88. https://doi.org/10.1145/375360.375365
Tutorial
- Regular expressions
- String distances
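
As a rough preview of the two tutorial topics, the sketch below tokenizes a sentence with a regular expression and then compares two strings by edit (Levenshtein) distance. The function names `tokenize` and `levenshtein` and the specific regex pattern are illustrative assumptions, not taken from the course materials; the tutorial itself may use different tools or definitions.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens.

    Illustrative pattern only (an assumption, not the course's definition):
    \\w+(?:'\\w+)* keeps contractions such as "don't" together,
    [^\\w\\s] matches any single non-word, non-space character.
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(tokenize("Don't stop, please!"))
    print(levenshtein("kitten", "sitting"))
```

A small distance between two strings (for example, two spellings of the same name) is the kind of signal that deduplication and record linkage can build on, though a usable threshold depends on the data.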