Words & Tokens

Lecture Slides: pdf, ipynb

Tutorial Exercise: pdf, ipynb


This week we focus on the basic building blocks of text analysis: words and tokens. We will look at how to define and identify tokens in text data, as well as techniques for matching and comparing strings, which are essential for tasks such as data cleaning, deduplication, and record linkage.
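As a rough illustration of what identifying tokens can look like in practice, the sketch below splits a string into word and punctuation tokens with a single regular expression. The pattern and the names `TOKEN_PATTERN` and `tokenize` are illustrative assumptions, not the definition of a token used in the lecture; what counts as a token depends on the language and on the task.

```python
import re

# Illustrative pattern (an assumption, not the course's definition):
#   \w+(?:'\w+)*  a run of word characters, optionally continued by
#                 apostrophe + word characters (e.g. "don't")
#   [^\w\s]       otherwise, any single non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Return the word and punctuation tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Don't panic: it's only text-as-data!")
print(tokens)
```

Under this pattern, hyphenated words are split at the hyphen and contractions are kept whole. Both choices are debatable and would need to be adapted to the data at hand.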

Required Readings

Additional Readings

  • Jeffrey E. F. Friedl. 2006. Mastering Regular Expressions. 3rd ed. Sebastopol, CA: O’Reilly
  • Gonzalo Navarro. 2001. “A Guided Tour to Approximate String Matching.” ACM Computing Surveys (CSUR) 33 (1): 31–88. https://doi.org/10.1145/375360.375365

Tutorial

  • Regular expressions
  • String distances
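To give a sense of the string distances topic, here is a minimal sketch of the Levenshtein (edit) distance, one common measure for approximate string matching. It may or may not be the measure used in the tutorial. The function names `levenshtein` and `similarity` are hypothetical, and the normalisation by the longer string's length is only one of several conventions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

For deduplication or record linkage, a score like this is typically compared against a threshold. A suitable threshold depends on the data and has to be chosen empirically.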