overlapping matches
Let’s assume the following string is given:
s = "this is a test"
I want to get a list of all the 2-word pairs from the string.
Here is one attempt:
import re re.findall('[a-z]+\s[a-z]+', s)
The result is not satisfactory:
['this is', 'a test']
findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.
This returns a complete list of pairings:
words = re.findall("[a-z]+", s) maxPos = len(words) - 1 currentPos = 0 while currentPos < maxPos: print words[currentPos: currentPos + 2] currentPos += 1
Here is the result:
['this', 'is'] ['is', 'a'] ['a', 'test']
Generalizing this to make it usable for n-grams of sizes other than two:
def getNGrams(words, n): maxPos = len(words) - n + 1 currentPos = 0 while currentPos < maxPos: print words[currentPos: currentPos + 3] currentPos += 1
Here is how that works for n = 3:
>>> getNGrams(words, 3) ['this', 'is', 'a'] ['is', 'a', 'test']
It is probably more interesting to have that functions return a list of n-grams along with frequency information.
Also, this seems reasonably quick, using larger strings (> 200,000 words). Still I wonder, if this can be accomplished using only regular expressions.
August 15th, 2007 at 2:03 pm
Thought I might comment on this. After some research I found a regex solution using Perl
my $s = ‘This is a test’;
#Create an array
my @words;
while ($s =~ /(\w+)(?=\s+(\w+))/g){
#Push each found word into an array
push @words, [$1,$2];
}
#Print the array/list.
use Data::Dumper;
print Dumper(\@words);
=cut
Output:
$VAR1 = [
[
'This',
'is'
],
[
'is',
'a'
],
[
'a',
'test'
]
];
=end