Words & Tokens


This week we focus on the basic building blocks of text analysis: words and tokens. We will look at how to define and identify tokens in text data, and at techniques for matching and comparing strings. These techniques underpin tasks such as data cleaning, deduplication, and record linkage.

Required Readings

Additional Readings

  • Gonzalo Navarro. 2001. “A Guided Tour to Approximate String Matching.” ACM Computing Surveys (CSUR) 33 (1): 31–88. https://doi.org/10.1145/375360.375365
  • Aaron R. Kaufman and Aja Klevs. 2022. “Adaptive Fuzzy String Matching: How to Merge Data Sets with Only One (Messy) Identifying Field.” Political Analysis 30 (4): 590–596. https://doi.org/10.1017/pan.2021.38

Tutorial

  • Regular expressions
  • String distances
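
As a preview of the two tutorial topics, here is a minimal sketch in Python. The language is an assumption, since this page does not say which one the tutorials use. The function names `tokenize` and `levenshtein` are illustrative, not taken from the course materials. The first function uses a regular expression to split text into tokens. The second computes the Levenshtein edit distance, which is one common string distance among several.

```python
import re


def tokenize(text):
    """Lowercase the text and return each run of word characters as a token.

    This is a deliberately simple rule: punctuation is dropped, and a word
    such as "don't" is split at the apostrophe.
    """
    return re.findall(r"\w+", text.lower())


def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; it starts as the distance from "".
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(tokenize("Words & Tokens"))        # ['words', 'tokens']
print(levenshtein("kitten", "sitting"))  # 3
```

A small distance between two tokens, relative to their length, suggests they may be variants of the same string. That is the basic idea behind fuzzy matching for deduplication and record linkage.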