Basic Idea: Observed text is a stochastic realization.
Systematic features shape most of observed verbal content.
Non-systematic, random features also shape verbal content.
Implications of a Stochastic View
Observed text is not the only text that could have been generated.
Research (system) design would depend on the question and quantity of interest.
The design looks very different if you are monitoring something like hate speech, where what is actually said matters, not the value of an “expected statement”.
Means that having “all the text” is still not a “population”.
Sampling Strategies
Be clear about what your sample and your population are.
May not be feasible to perform any sampling.
Different types of sampling vary from random to purposive:
Random sampling (e.g. politician’s speeches)
Non-random sampling (e.g. messages containing hate speech on a social media platform)
Key is to make sure that what is being analyzed is a valid representation of the phenomenon as a whole - a question of research design.
Text Preprocessing
Quantifying Texts
Some QTA Terminology
Corpus - a collection of texts for analysis.
E.g. SOTU addresses, Hansard debates, party manifestos, etc.
Document - a single text in the corpus.
E.g. a single SOTU address, one speech, a specific party manifesto, etc.
Token - a single unit of text.
E.g. typically a word, but can include punctuation, numbers, hashtags, etc.
Type - a unique token.
E.g. the article “the” may occur many times in a corpus (many tokens) but counts as a single type (see the sketch below).
Tokenization - the process of breaking a text into tokens.
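The token/type distinction is easiest to see in code. A minimal sketch with quanteda (the two toy documents below are invented for illustration):
library("quanteda")

# A toy corpus of two documents
toy_corpus <- corpus(c(doc1 = "The fox jumps over the dog.",
                       doc2 = "The dog sleeps."))

# Tokenize and lowercase so that "The" and "the" count as one type
toks <- tokens_tolower(tokens(toy_corpus))

ntoken(toks)  # tokens per document: "the" in doc1 is counted twice
ntype(toks)   # types per document: "the" is counted only once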
Some Linguistic Terminology
Tokens constitute the basic unit of analysis (particularly in NLP applications).
But how tokens are constructed can vary.
It might be useful to treat different tokens as the same word.
E.g. “runs”, “running”, “ran” are all forms of the same word.
Stemming - mechanically removing affixes (usually suffixes) from tokens.
E.g. “running” -> “run”, “runs” -> “run”.
Lemmatization - reducing tokens to their base (root) or dictionary form.
E.g. “ran” -> “run”, “runner” -> “run”.
While lemmatization is more accurate, it also requires more built-in knowledge about a language.
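A rough sketch of the difference with quanteda: tokens_wordstem() applies the rule-based Snowball stemmer, while a lemma-style mapping needs an external lookup (the tiny lemma table below is invented for illustration; in practice a resource such as spaCy via the spacyr package would supply it):
library("quanteda")

toks <- tokens("She runs while the other runners ran yesterday")

# Stemming: mechanically strip suffixes
tokens_wordstem(toks, language = "english")
# "runs" -> "run", "runners" -> "runner", but the irregular "ran" is untouched

# Lemma-style replacement via a (toy) lookup table
lemmas <- c(runs = "run", ran = "run", runners = "runner")
tokens_replace(toks, pattern = names(lemmas), replacement = unname(lemmas))
# now "ran" -> "run" as well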
Tokenization: Example
library("quanteda")
text <-"The quick brown fox jumps over the lazy dog."tokens <- quanteda::tokens(text)tokens
API documentation is usually rather technical in nature (written for developers).
Some key terms:
Endpoint - URL (web location) that receives requests and sends responses.
Parameters - custom information that can be passed to the API.
Response - the data returned by the API.
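These terms map directly onto an httr call. A minimal sketch against httpbin.org, a public echo service used here only to show where the endpoint, parameters, and response appear (real APIs differ in their endpoints and parameter names):
library("httr")

# Endpoint: the URL that receives the request
endpoint <- "https://httpbin.org/get"

# Parameters: custom information passed along with the request
resp <- GET(endpoint, query = list(q = "search term", page = 1))

# Response: the data returned by the API
status_code(resp)             # e.g. 200 if the request succeeded
content(resp, as = "parsed")  # parsed body (JSON -> R list)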
Authentication
Many APIs require a key or token.
Most APIs are rate-limited.
E.g. restrictions by user/key/IP address/time period.
Make sure that you understand the terms of service/use.
Even providers of public free APIs can impose some restrictions.
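Hitting a rate limit usually shows up as an HTTP 429 status. A rough sketch of a polite retry with httr (the api_url below is a placeholder, and not every provider sends a Retry-After header):
library("httr")

api_url <- "https://api.example.com/search"  # placeholder endpoint

resp <- GET(api_url, query = list(q = "protest"))

if (status_code(resp) == 429) {
  # Rate limit hit: wait before retrying,
  # using the Retry-After header if the provider supplies one
  wait <- headers(resp)[["retry-after"]]
  wait <- if (is.null(wait)) 60 else as.numeric(wait)
  Sys.sleep(wait)
  resp <- GET(api_url, query = list(q = "protest"))
}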
Example: Guardian API
library("httr")
# It is a good idea to not hard-code the API key in the script
# and, instead, load it dynamically from a file
api_key <- readLines("../temp/guardian_api_key.txt")
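With the key loaded, a request can be sent to the Guardian search endpoint. A minimal sketch continuing the snippet above (the endpoint and the q / api-key parameter names follow the Guardian's public documentation, but verify against the current docs; the response structure may also change):
# Endpoint for searching Guardian content
endpoint <- "https://content.guardianapis.com/search"

# Pass the search query and the API key as parameters
resp <- GET(endpoint,
            query = list(q = "climate change",
                         "api-key" = api_key))

status_code(resp)  # 200 indicates success

# Parse the JSON body into an R list and inspect the first result
parsed <- content(resp, as = "parsed")
str(parsed$response$results[[1]], max.level = 1)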