Let’s assume the following string is given:
s = "this is a test"
I want to get a list of all the 2-word pairs from the string.
Here is one attempt:
import re
re.findall('[a-z]+\s[a-z]+', s)
The result is not satisfactory:
['this is', 'a test']
findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.
This returns a complete list of pairings:
words = re.findall("[a-z]+", s)
maxPos = len(words) - 1
currentPos = 0
while currentPos < maxPos:
print words[currentPos: currentPos + 2]
currentPos += 1
Here is the result:
['this', 'is']
['is', 'a']
['a', 'test']
Generalizing this to make it usable for n-grams of sizes other than two:
def getNGrams(words, n):
maxPos = len(words) - n + 1
currentPos = 0
while currentPos < maxPos:
print words[currentPos: currentPos + 3]
currentPos += 1
Here is how that works for n = 3:
>>> getNGrams(words, 3)
['this', 'is', 'a']
['is', 'a', 'test']
It is probably more interesting to have that functions return a list of n-grams along with frequency information.
Also, this seems reasonably quick, using larger strings (> 200,000 words). Still I wonder, if this can be accomplished using only regular expressions.