Posted: July 3rd, 2007 | Author: Alex | Filed under: python | 1 Comment »
Let’s assume the following string is given:
s = "this is a test"
I want to get a list of all the 2-word pairs from the string.
Here is one attempt:
import re
re.findall('[a-z]+\s[a-z]+', s)
The result is not satisfactory:
['this is', 'a test']
findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.
This returns a complete list of pairings:
words = re.findall("[a-z]+", s)
maxPos = len(words) - 1
currentPos = 0
while currentPos < maxPos:
print words[currentPos: currentPos + 2]
currentPos += 1
Here is the result:
['this', 'is']
['is', 'a']
['a', 'test']
Generalizing this to make it usable for n-grams of sizes other than two:
def getNGrams(words, n):
maxPos = len(words) - n + 1
currentPos = 0
while currentPos < maxPos:
print words[currentPos: currentPos + 3]
currentPos += 1
Here is how that works for n = 3:
>>> getNGrams(words, 3)
['this', 'is', 'a']
['is', 'a', 'test']
It is probably more interesting to have that functions return a list of n-grams along with frequency information.
Also, this seems reasonably quick, using larger strings (> 200,000 words). Still I wonder, if this can be accomplished using only regular expressions.
Posted: July 2nd, 2007 | Author: Alex | Filed under: python | No Comments »
Suppose we have a few pickled items, maybe even using different protocols:
import pickle
allPickles = ''
allPickles += pickle.dumps('test0', protocol = 0)
allPickles += pickle.dumps('test1', protocol = 1)
allPickles += pickle.dumps('test2', protocol = 2)
This would result in the following data:
"S'test0'\np0\n.U\x05test1q\x00.\x80\x02U\x05test2q\x00."
Here is an easy way to break the string apart into individual pickles:
pickles = []
import StringIO
sio = StringIO.StringIO(allPickles)
while sio.len > sio.pos:
currentPos = sio.pos
pickle.load(sio)
pickles.append(sio.buf[currentPos:sio.pos])
Here is the resulting list:
["S'test0'\np0\n.",
'U\x05test1q\x00.',
'\x80\x02U\x05test2q\x00.']
Just to verify:
for item in pickles:
print pickle.loads(item)
The result is correct:
If this displays with correct syntax highlighting on the blog, then this was also a successful test of the wp-syntax plugin.
Posted: June 27th, 2006 | Author: Alex | Filed under: Algorithms, python | No Comments »
Here is a nice gem of an algorithm by Kim S. Larsen to compute the day of the week, where m, d and y represent month, day and year, respectively.
def dow(m, d, y):
if ((m == 1) or (m == 2)):
m += 12
y -= 1
return (d + 2 * m + 3 * (m + 1) / 5 +
y + y / 4 - y / 100 + y / 400) % 7
The returned integer is a value between 0 (Monday) and 6 (Sunday).
Combine it with a list of day strings:
days = ('Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday', 'Sunday')
Like in the following example to display the name of day, given a date:
days[dow(6, 28, 2006)]
I found the algorithm in Dr. Dobbs Journal issue 229 (April 1995) and the article features a very nice description of how the formula was derived. The original C implementation of the algorithm is about as readable as the above pieces of code.