overlapping matches

Posted: July 3rd, 2007 | Author: Alex | Filed under: python | 1 Comment »

Let’s assume the following string is given:

s = "this is a test"

I want to get a list of all the 2-word pairs from the string.

Here is one attempt:

import re
re.findall('[a-z]+\s[a-z]+', s)

The result is not satisfactory:

['this is', 'a test']

findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.

This returns a complete list of pairings:

words = re.findall("[a-z]+", s)
maxPos = len(words) - 1
currentPos = 0
while currentPos < maxPos:
    print words[currentPos: currentPos + 2]
    currentPos += 1

Here is the result:

['this', 'is']
['is', 'a']
['a', 'test']

Generalizing this to make it usable for n-grams of sizes other than two:

def getNGrams(words, n):
    maxPos = len(words) - n + 1
    currentPos = 0
    while currentPos < maxPos:
        print words[currentPos: currentPos + 3]
        currentPos += 1

Here is how that works for n = 3:

>>> getNGrams(words, 3)
['this', 'is', 'a']
['is', 'a', 'test']

It is probably more interesting to have that functions return a list of n-grams along with frequency information.

Also, this seems reasonably quick, using larger strings (> 200,000 words). Still I wonder, if this can be accomplished using only regular expressions.


Turning a pickle string into a pickle list

Posted: July 2nd, 2007 | Author: Alex | Filed under: python | No Comments »

Suppose we have a few pickled items, maybe even using different protocols:

import pickle
allPickles = ''
allPickles += pickle.dumps('test0', protocol = 0)
allPickles += pickle.dumps('test1', protocol = 1)
allPickles += pickle.dumps('test2', protocol = 2)

This would result in the following data:

"S'test0'\np0\n.U\x05test1q\x00.\x80\x02U\x05test2q\x00."

Here is an easy way to break the string apart into individual pickles:

pickles = []
import StringIO
sio = StringIO.StringIO(allPickles)
while sio.len > sio.pos:
    currentPos = sio.pos
    pickle.load(sio)
    pickles.append(sio.buf[currentPos:sio.pos])

Here is the resulting list:

["S'test0'\np0\n.", 
'U\x05test1q\x00.', 
'\x80\x02U\x05test2q\x00.']

Just to verify:

for item in pickles:
    print pickle.loads(item)

The result is correct:

test0
test1
test2

If this displays with correct syntax highlighting on the blog, then this was also a successful test of the wp-syntax plugin.


Computing the day of week

Posted: June 27th, 2006 | Author: Alex | Filed under: Algorithms, python | No Comments »

Here is a nice gem of an algorithm by Kim S. Larsen to compute the day of the week, where m, d and y represent month, day and year, respectively.

def dow(m, d, y):
    if ((m == 1) or (m == 2)):
        m += 12
        y -= 1
    return (d + 2 * m + 3 * (m + 1) / 5 +
    y + y / 4 - y / 100 + y / 400) % 7

The returned integer is a value between 0 (Monday) and 6 (Sunday).

Combine it with a list of day strings:

days = ('Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday', 'Sunday')

Like in the following example to display the name of day, given a date:

days[dow(6, 28, 2006)]

I found the algorithm in Dr. Dobbs Journal issue 229 (April 1995) and the article features a very nice description of how the formula was derived. The original C implementation of the algorithm is about as readable as the above pieces of code.