not just random

July 16, 2007

Blue Scholars: Back Home

Filed under: Music, seattle — Alex @ 4:34 pm

Blue Scholars' new album Bayani just came out. It has been on heavy rotation in my CD player for the past several days, and I am pretty excited about it. Conscious and thoughtful, it is probably one of the best rap CDs I have found lately.

Oh yeah: They’ll be playing at the Capitol Hill Block Party on July 27.

July 12, 2007

unavailable advertisement

Filed under: Uncategorized — Alex @ 11:45 pm

Advertising tends to matter much more to the website provider than to the website visitor. Still, is there really any good excuse for the following?

[Image: amznad.jpg]

Painful, no?

no sleep

Filed under: Music — Alex @ 11:24 pm

The last time I felt like complaining about the summer heat, it did not take long after the thought formed before winter was upon us, with ice storms and snow. Wind chill. That was Minnesota.

Seattle is different, but my AC does not seem to be doing much, and it is hot, too hot. Rain would be a welcome way to cool things down.

In the meantime, no sleep.

Here is Insomnia by Faithless:

Imagine driving. Observing the lights of the city in the rear-view mirror. At night.

July 3, 2007

overlapping matches

Filed under: python — Alex @ 8:39 pm

Let’s assume the following string is given:

s = "this is a test"

I want to get a list of all the 2-word pairs from the string.

Here is one attempt:

import re
re.findall(r'[a-z]+\s[a-z]+', s)

The result is not satisfactory:

['this is', 'a test']

findall() returns non-overlapping matches. In this context this means that the pair “is a” will not be returned, since “is” was already matched in the “this is” string.

This returns a complete list of pairings:

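words = re.findall("[a-z]+", s)
maxPos = len(words) - 1  # a pair cannot start at the last word
currentPos = 0
while currentPos < maxPos:
    # slice out the word at currentPos together with its right neighbor
    print(words[currentPos: currentPos + 2])
    currentPos += 1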

Here is the result:

['this', 'is']
['is', 'a']
['a', 'test']

Generalizing this to make it usable for n-grams of sizes other than two:

def getNGrams(words, n):
    # number of positions at which a full n-gram can start
    maxPos = len(words) - n + 1
    currentPos = 0
    while currentPos < maxPos:
        print(words[currentPos: currentPos + n])
        currentPos += 1

Here is how that works for n = 3:

>>> getNGrams(words, 3)
['this', 'is', 'a']
['is', 'a', 'test']

It would probably be more interesting to have that function return a list of n-grams along with frequency information.
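Here is a minimal sketch of what returning n-grams with frequency information could look like, assuming collections.Counter is acceptable for the counting. The name countNGrams and the longer sample string are made up for illustration; they are not part of the code above.

```python
import re
from collections import Counter

def countNGrams(words, n):
    # Hypothetical helper: build every run of n consecutive words as a tuple
    # (tuples are hashable, so they can be counted), then tally them.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)

# Sample input chosen so that one pair repeats.
words = re.findall(r"[a-z]+", "this is a test this is")
counts = countNGrams(words, 2)
print(counts.most_common(1))
```

The Counter maps each n-gram to the number of times it occurs, and most_common() returns the n-grams sorted by frequency.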

The loop-based approach also seems reasonably quick on larger strings (> 200,000 words). Still, I wonder whether this can be accomplished using only regular expressions.
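One possible regex-only answer, offered as a sketch rather than something measured on large inputs, is to wrap the pattern in a lookahead. A lookahead consumes no characters, so findall() can start a new match at every word even though the n-grams overlap, and the capturing group inside the lookahead is what gets returned. The name ngramsByRegex is hypothetical.

```python
import re

def ngramsByRegex(s, n):
    # (?=...) is a zero-width lookahead: nothing is consumed, so matches may overlap.
    # \b restricts each attempt to the start of a word; the group captures n words.
    pattern = r"(?=\b([a-z]+(?:\s[a-z]+){%d}))" % (n - 1)
    return re.findall(pattern, s)

s = "this is a test"
print(ngramsByRegex(s, 2))
print(ngramsByRegex(s, 3))
```

For n = 2 this includes the pair "is a" that the first attempt missed. Whether it is faster than the loop is something I have not checked.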
