not just random

July 3, 2007

overlapping matches

Filed under: python — Alex @ 8:39 pm

Let’s assume the following string is given:

s = "this is a test"

I want to get a list of all the 2-word pairs from the string.

Here is one attempt:

import re
re.findall('[a-z]+\s[a-z]+', s)

The result is not satisfactory:

['this is', 'a test']

findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.

This returns a complete list of pairings:

words = re.findall("[a-z]+", s)
maxPos = len(words) - 1
currentPos = 0
while currentPos < maxPos:
    print words[currentPos: currentPos + 2]
    currentPos += 1

Here is the result:

['this', 'is']
['is', 'a']
['a', 'test']

Generalizing this to make it usable for n-grams of sizes other than two:

def getNGrams(words, n):
    maxPos = len(words) - n + 1
    currentPos = 0
    while currentPos < maxPos:
        print words[currentPos: currentPos + 3]
        currentPos += 1

Here is how that works for n = 3:

>>> getNGrams(words, 3)
['this', 'is', 'a']
['is', 'a', 'test']

It is probably more interesting to have that functions return a list of n-grams along with frequency information.

Also, this seems reasonably quick, using larger strings (> 200,000 words). Still I wonder, if this can be accomplished using only regular expressions.

One Response to “overlapping matches”

  1. logan Says:

    Thought I might comment on this. After some research I found a regex solution using Perl :)

    my $s = ‘This is a test’;
    #Create an array
    my @words;
    while ($s =~ /(\w+)(?=\s+(\w+))/g){
    #Push each found word into an array
    push @words, [$1,$2];
    }

    #Print the array/list.
    use Data::Dumper;
    print Dumper(\@words);

    =cut
    Output:
    $VAR1 = [
    [
    'This',
    'is'
    ],
    [
    'is',
    'a'
    ],
    [
    'a',
    'test'
    ]
    ];
    =end

Leave a Reply