POP77032 Quantitative Text Analysis for Social Scientists
| Decimal | Binary | Hexadecimal | Unicode code point | Character | Description |
|---|---|---|---|---|---|
| 65 | 01000001 | 0x41 | U+0041 | A | Latin Capital Letter A |
| 66 | 01000010 | 0x42 | U+0042 | B | Latin Capital Letter B |
| 67 | 01000011 | 0x43 | U+0043 | C | Latin Capital Letter C |
| 68 | 01000100 | 0x44 | U+0044 | D | Latin Capital Letter D |
| 69 | 01000101 | 0x45 | U+0045 | E | Latin Capital Letter E |
| 70 | 01000110 | 0x46 | U+0046 | F | Latin Capital Letter F |
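
The rows of the table above can be reproduced with Python built-ins. This is a minimal sketch, not necessarily the code used to build the table; the variable names `rows`, `cp` and `row` are illustrative assumptions.

```python
import unicodedata

# Rebuild the table rows for the characters A-F.
rows = []
for ch in "ABCDEF":
    cp = ord(ch)  # integer code point of the character
    row = (
        cp,                    # decimal
        format(cp, "08b"),     # binary, zero-padded to 8 bits
        f"0x{cp:02X}",         # hexadecimal
        f"U+{cp:04X}",         # Unicode code point notation
        ch,                    # the character itself
        unicodedata.name(ch),  # official Unicode character name
    )
    rows.append(row)
    print(*row, sep=" | ")
```

For these ASCII characters the code point and the single UTF-8 byte have the same numeric value, which is why the decimal, binary and hexadecimal columns all describe the same number.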
Character: H, Code point: 72
Character: e, Code point: 101
Character: l, Code point: 108
Character: l, Code point: 108
Character: o, Code point: 111
Character: ,, Code point: 44
Character: , Code point: 32
Character: 世, Code point: 19990
Character: 界, Code point: 30028
Character: !, Code point: 33
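
Output of this form can be generated by looping over a string and calling `ord()` on each character. The sketch below assumes the input string is `Hello, 世界!` (inferred from the output); the variable names are illustrative.

```python
greeting = "Hello, 世界!"

# ord() maps each character to its integer Unicode code point.
code_points = [ord(ch) for ch in greeting]
for ch, cp in zip(greeting, code_points):
    print(f"Character: {ch}, Code point: {cp}")

# The number of characters and the number of UTF-8 bytes differ,
# because each of the two CJK characters needs three bytes in UTF-8.
n_chars = len(greeting)
n_bytes = len(greeting.encode("utf-8"))
print(n_chars, n_bytes)
```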
Hohohoho, Mister Finn, you’re going to be Mister Finnagain!
Hello, world!
And in:
Hello, 世界!
Imagine that you want to identify all instances of the word ‘times’ in the following text:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…
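
One possible way to do this in Python is with the standard `re` module. The sketch below is an illustration rather than the only approach; the variable name `text` and the use of word boundaries (`\b`) are assumptions.

```python
import re

text = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of Light, it was the season of Darkness, "
    "it was the spring of hope, it was the winter of despair..."
)

# \b marks a word boundary, so 'times' is matched only as a whole word.
matches = re.findall(r"\btimes\b", text)
print(len(matches))

# finditer() additionally gives the position of each match in the string.
positions = [m.start() for m in re.finditer(r"\btimes\b", text)]
print(positions)
```

For a fixed word like this, `text.count("times")` would also work, but it would count occurrences inside longer words such as "sometimes" as well.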
tale = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair..."""
[Ii]

| Regex | Name | Description | Example | Matches |
|---|---|---|---|---|
| . | Wildcard | Matches any single character (except newline, usually) | c.t | cat, cut |
| * | Zero or more | Matches 0 or more of the preceding token | lo*l | ll, lol, looool |
| + | One or more | Matches 1 or more of the preceding token | lo+l | lol, looool |
| ? | Optional | Matches 0 or 1 of the preceding token | colou?r | color, colour |
| {n} | Exact count | Matches exactly n occurrences | a{3} | aaa |
| {n,} | At least n | Matches n or more occurrences | a{2,} | aa, aaa |
| {n,m} | Range | Matches between n and m occurrences | a{2,4} | aa, aaa, aaaa |
| ^ | Start anchor | Matches start of string | ^Hello | Hello world |
| $ | End anchor | Matches end of string | world$ | Hello world |
| [] | Character class | Matches any one character inside brackets | [aeiou] | a, e, i |
| [^ ] | Negated class | Matches any character not in brackets | [^0-9] | a, # |
| \| | Alternation | Logical OR | cat\|dog | cat, dog |
| () | Grouping | Groups tokens and captures matches | (ab)+ | ab, abab |
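
Several rows of the table can be checked directly in Python. This is a small sketch using `re.fullmatch`, which succeeds only if the whole string matches the pattern; the dictionary `examples` simply restates pattern and string pairs from the table.

```python
import re

# Patterns from the table, each with strings the table says they match.
examples = {
    r"c.t": ["cat", "cut"],
    r"lo*l": ["ll", "lol", "looool"],
    r"lo+l": ["lol", "looool"],
    r"colou?r": ["color", "colour"],
    r"a{2,4}": ["aa", "aaa", "aaaa"],
    r"cat|dog": ["cat", "dog"],
    r"(ab)+": ["ab", "abab"],
}

results = {
    pattern: [re.fullmatch(pattern, s) is not None for s in strings]
    for pattern, strings in examples.items()
}
for pattern, flags in results.items():
    print(pattern, flags)

# The difference between * and +: 'll' contains zero o's,
# so it matches lo*l but not lo+l.
star_matches_ll = re.fullmatch(r"lo*l", "ll") is not None
plus_matches_ll = re.fullmatch(r"lo+l", "ll") is not None
print(star_matches_ll, plus_matches_ll)
```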
Going back to the original quote, let’s find all the attributes of the period that Charles Dickens describes:
'It was the best of times, it was the worst of times,\nit was the age of wisdom, it was the age of foolishness,\nit was the epoch of belief, it was the epoch of incredulity,\nit was the season of Light, it was the season of Darkness,\nit was the spring of hope, it was the winter of despair...'
We could start by constructing a regex part that captures age, epoch and season.
Since we don’t want to extract (capture) these words as such, we will re-write it as a non-capturing group:
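
The difference between the two kinds of group can be seen by inspecting what a match object captures. The following is a minimal sketch on a short fragment of the quote; the variable names are illustrative.

```python
import re

fragment = "it was the age of wisdom"

# Capturing group: the matched alternative is stored as group 1.
capturing = re.search(r"(age|epoch|season)", fragment)
print(capturing.group(0), capturing.groups())

# Non-capturing group: same match, but nothing is stored.
non_capturing = re.search(r"(?:age|epoch|season)", fragment)
print(non_capturing.group(0), non_capturing.groups())
```

Both patterns match the same text. Only the first one records `age` as a captured group, which matters once a second group is added for the word we actually want to extract.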
Finally, we can add the search pattern for the actual word with the attribute of the period:
'(?:age|epoch|season)\\s+of\\s+(\\w+)'
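
Applied to the full quote, the pattern can be used with `re.findall`, which returns the contents of the single capturing group for each match. A sketch, assuming the text is stored in `tale` as above and using `pattern` and `attributes` as illustrative names:

```python
import re

tale = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair..."""

# Non-capturing group for the period word, capturing group for its attribute.
pattern = r"(?:age|epoch|season)\s+of\s+(\w+)"

attributes = re.findall(pattern, tale)
print(attributes)
# ['wisdom', 'foolishness', 'belief', 'incredulity', 'Light', 'Darkness']
```

Note that this pattern only covers age, epoch and season. Phrases such as "best of times" or "spring of hope" are not matched unless the alternation is extended.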