Finding frequent items in a data stream

Posted: November 13th, 2009 | Author: Alex | Filed under: Algorithms, Data Mining, python | No Comments »

In Finding the Frequent Items in Streams of Data [PDF], Graham Cormode and Marios Hadjieleftheriou discuss the frequent items problem and some of the algorithms that are used to solve it:

The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties.

Some of the interesting bits here are that the data stream will easily contain millions (or billions) of items and the algorithm will typically only get to take one look at each item as it comes up in the stream.

Space-Saving

In this post I focus on the Space-Saving algorithm and provide an implementation in Python. Read the rest of this entry »


Matching circular string rotations

Posted: November 12th, 2008 | Author: Alex | Filed under: Algorithms, python | No Comments »

This post examines two questions:

  1. Is one string a circular rotation of a second string?
  2. Is one string a substring of a (potentially rotated) second string?

Previously discussed Z-values are used for a solution.

Circular string rotation?

Given a prefix p of string t and a suffix s of string t, such that p + s = t, then a circular string rotation r is a string of the form s + p. Thus, it is a string that consists of the suffix of string t, directly followed by the prefix of string t, such that the resulting string r has the same length as the original string t: |r| = |t|.

Example:

Original:
t = “abcd”

Rotation:
r = “cdab”, with p = “ab” and s = “cd”

How can one string be shown to be a circular rotation of a second string?

If it is, then the lengths of the two strings t and r have to be identical. Also, since t = p + s and the rotation r = s + p, then 2r = r + r = s + p + s + p = s + t + p. Thus, if r is a rotation of t, then t is included completely within 2r. Not only that, but if it is shown that r is a rotation of t, then if t is found within 2r, more specific information can be shown about the rotation, because p and s can easily be indicated, too.

Assuming access to a fast string matching routine, the mechanism to answer the question should be straightforward:

If |t| = |r| and t in r+r
Then Circular Rotation
Else No Circular Rotation

Saving space

It may not be desirable to create a new string that contains twice the data of r, particularly assuming very large instances of strings r and t. Since r = s + p is available though, it is easy to imagine what 2r would look like, without actually creating a new string consisting of 2r.

The previously mentioned Z algorithm can be heavily modified to

  1. operate on two strings, and
  2. allow for Z-values up to maximum of the string length, regardless of the index position, effectively allowing for a substring to begin towards the end of the string and continue at the beginning of the string.

getMaxSuffixZ

Let getMaxSuffixZ be a process that takes as input two strings, t and r and returns a maximum Z-value <= |t|. This involves calculating Z-values for each position in r and returning the largest one. In this case the Z-values are not calculated based on a prefix of r. Rather, string t is considered the prefix. So, if r is a rotation of t and t = p + s, then r = s + p, so maxZ should return |p|. It then only makes sense for maxZ to pick that maximum Z-Value that represents a substring that stretches to the end of r.

Here is the modified implementation:

def getMaxSuffixZ(p, s):
	result = {}
	l = 0
	r =  -1
	for k in range(0, len(s)):
		if k > r:
			zk = 0
			for si in range(0, len(s)):
				if k + si < len(s) and \
					si < len(p) and \
					p[si] == s[k + si]:
					zk += 1
				else:
					break
			if zk > 0:
				r = zk + k - 1
				l = k
		else:
			kOld = k - l - 1
			zOld = result[kOld]
			b = r - k + 1
			if zOld < b:
				zk = zOld
			else:
				zk = b
				for si in range(b, len(s)):
					if k + si < len(s) \
						and si < len(p) \
						and p[si] \
						== s[k + si]:
						pass
					else:
						break
				zk = si
				r = zk + k - 1
				l = k
		result[k] = zk
		if zk == len(p) - k:
			return zk
	return 0

This returns the expected results at least for this sample set:

assert getMaxSuffixZ(s, "") == 0
assert getMaxSuffixZ(s, "a") == 0
assert getMaxSuffixZ(s, "b") == 0
assert getMaxSuffixZ(s, "defabc") == 3
assert getMaxSuffixZ(s, "defabcd") == 0
assert getMaxSuffixZ(s, "abcdef") == 6
assert getMaxSuffixZ(s, "fabcde") == 5
print 'all tests passed.'
 
all tests passed.

Let z = getMaxSuffixZ(t, r). z is the prefix of t (and the suffix of r). If z > 0 then t[0..z-1] = r[z..|r|]. Similarly getMaxSuffixZ(r, t) yields the prefix of r (and the suffix of t).

s = "abcdef"
assert getMaxSuffixZ("", s) == 0
assert getMaxSuffixZ("a", s) == 1
assert getMaxSuffixZ("b", s) == 0
assert getMaxSuffixZ("defabc", s) == 3
assert getMaxSuffixZ("defabcd", s) == 0
assert getMaxSuffixZ("abcdef", s) == 6
assert getMaxSuffixZ("fabcde", s) == 1
print 'all tests passed.'
 
all tests passed.

If r is a rotation of t, then getMaxSuffixZ(t, r) + getMaxSuffixZ(r, t) = |t| = |r|. If t = r, then getMaxSuffixZ(t, r) = getMaxSuffixZ(r, t) = |t| = |r|. This leads to the following flow:

If |t| = |r|
Then
    zTr = getMaxSuffixZ(t, r)
    If zTr = |t|:
    Then Circular Rotation (t = r)
    Else
        If zTr = getMaxSuffixZ(r, t)
        Then Circular Rotation
        Else No Circular Rotation
Else No Circular Rotation

getMaxZ

Let getMaxZ have the same attributes of getMaxSuffixZ, with one exception: When finding matching substrings that begin at position k and reach to the end of the string s, the algorithm continues comparisons at the beginning of string s up to a maximum of position k-1. It no longer makes sense to stop at the first substring that reaches to the end of the string, as later substrings could have a longer reach into the beginning of the string. It does make sense to return a result as soon as a z-value is found that equals the length of the prefix.

def getMaxZ(p, s):
	result = {}
	l = 0
	r =  -1
	maxZk = 0
	for k in range(0, len(s)):
		if k > r:
			zk = 0
			for si in range(0, len(s)):
				if k + si < len(s) and \
					si < len(p) and \
					p[si] == s[k + si]:
					zk += 1
				else:
					break
			if zk > 0:
				r = zk + k - 1
				l = k
				if r == len(s) - 1:
					r2 = 0
					for si in range(0, k):
						if zk < len(p) and \
						p[zk] == s[si]:
							zk += 1
							r2 += 1
						else:
							break
		else:
			kOld = k - l - 1
			zOld = result[kOld]
			b = r - k + 1
			if zOld < b:
				zk = zOld
			else:
				zk = b
				for si in range(b, len(s)):
					if k + si < len(s) \
						and si < len(p) \
						and p[si] \
						== s[k + si]:
						pass
					else:
						break
				zk = si
				r = zk + k - 1
				l = k
				if r == len(s) - 1:
					r2 = 0
					for si in range(0, k):
						if zk < len(p) and \
						p[zk] == s[si]:
							zk += 1
							r2 += 1
						else:
							break
		result[k] = zk
		if zk >= len(s) - k:
			if zk > maxZk:
				maxZk = zk
		if zk == len(p):
			return zk
	return maxZk

Tests show the correct results:

s = "abcdef"
assert getMaxZ(s, s) == 6
assert getMaxZ(s, "fabcde") == 6
assert getMaxZ(s, "fgabcde") == 6
assert getMaxZ(s, "a") == 1
assert getMaxZ(s, "ba") == 2
assert getMaxZ(s, "") == 0
assert getMaxZ(s, "xyz") == 0
assert getMaxZ(s, "defabcabcdabc") == 6
assert getMaxZ("", "") == 0
assert getMaxZ(s, "defabc") == 6
print 'all tests passed.'
 
all tests passed.

Following the above manner, the resulting flow looks like this:

If |t| = |r| and maxZ(t, r) = |t|
Then Circular Rotation
Else No Circular Rotation

Matching substrings in string rotations

Given the above discussion, a mechanism can be shown that allows the matching of substrings in circular string rotations. To illustrate the problem, in a string t = “abcd” it would then be possible to find substrings such as “abc”, “cda”, “da”, “dab”, etc.

The above algorithm can be applied directly:

Let t = text and p = substring to search
If getMaxZ(p, t) = |p|
Then p found in t
Else p not found in t

Likewise, getMaxZ can be applied to this problem directly:

def showFound(p, s):
	if getMaxZ(p, s) == len(p):
		print "%s found in %s" % (p, s)
	else:
		print "%s not found in %s" % (p, s)
 
showFound("ab", s)
showFound("fa", s)
showFound("abcdef", s)
showFound("x", s)
showFound("abc", "")
showFound("efabcd", s)
showFound("efgabc", s)
showFound("abd", s)

The results look correct.

ab found in abcdef
fa found in abcdef
abcdef found in abcdef
x not found in abcdef
abc not found in
efabcd found in abcdef
efgabc not found in abcdef
abd not found in abcdef

It would make sense to add modifications to allow detecting all instances (up to a specifiable number) of substring matches along with their positions.

Next?

Run-time analysis. And lots of code cleanup!


The Z Algorithm

Posted: November 1st, 2008 | Author: Alex | Filed under: Algorithms, Z, python | 1 Comment »

In Algorithms on Strings, Trees and Sequences, Dan Gusfield presents the Z algorithm.

A string prefix P is a substring of S that begins at S[0]. The Z algorithm calculates for each position k > 0 in S the maximum length of P that is matched by a substring starting at k. The algorithm performs in linear time by keeping track of previously calculated values and recognizing, if the character at a currently examined start position is within a previously detected substring.

An example

Here is a string S and its Z-values for each index position k:

S = ‘aabcaabxaaaz’

k s[k] Z

0 a n/a

1 a 1

2 b 0

3 c 0

4 a 3

5 a 1

6 b 0

7 x 0

8 a 2

9 a 2

10 a 1

11 z 0

Step by step

Let’s look at a these one at a time to show how easily the Z-values can be computed.

k = 1

Then s[0] = s[1], but s[1] != [2], so Z[1] = 1. The matched substring has the boundaries l = 1 and r = 1, so s[l..r] = s[1..1] = ‘a’

k = 2

Then k > r, which means s[2] is outside the previously discovered substring. s[0] != s[2], so
Z[2] = 0.

k = 3

In the same manner, if k = 3, s[0] != s[3], so Z[3] = 0.

k = 4

Then s[0] = s[4], s[1] = s[5], s[2] = s[6], but s[3] != s[7], so Z[4] = 3. The matched substring has the boundaries l = 4 and r = 6, so s[l..r] = s[4..6] = ‘aab’

k = 5

Then k <= r: s[5] is within a previously discovered substring. The previously discovered substring starts at k - l = 5 - 4 = 1. The Z-value found at that position is Z[1] = 1. Since that value is smaller than the length of the remaining substring S[k..r], there is no need to perform other character comparisons. Z[5] = Z[1] = 1. l and r remain unchanged.

k = 6

In the same manner, if k = 6, then Z[6] = z[2] = 0.

k = 7

Then k > r, s[0] != s[7], so Z[7] = 0.

k = 8

Then k > r, s[0] = s[8], s[1] = s[9] and s[2] != s[10], so Z[8] = 2. The matched substring has the boundaries l = 8 and r = 9, so s[l..r] = s[8..9] = ‘aa’.

k = 9

Then k <= r, so s[k] is within a previously discovered substring. The previously discovered substring starts at k - l = 9 - 8 = 1. The Z-value found at that position is Z[1] = 1. That value is not smaller than the length of the remaining substring S[k..r] = s[9..9] = 'a', so additional character comparisons need to be performed. Let b = r - k + 1 = 9 - 9 + 1. Z[9] will be >= b. s[b] = s[1] = s[k + b] = s[10], but s[2] != s[11], so Z[9] = 2. The new substring has the boundaries l = 9 and r = 10.

k = 10

Then k = r, so s[k] is within the previously discovered substring of s[l..r]. The z-value for that string is Z[k - l] = Z[10-9] = Z[1] = 1. Since that value is equal to the length of the remainder of the substring s[k..r], additional comparisons are performed, but since s[2] != s[11], Z[10] = Z[1] = 1.

k = 11

Then k < r and since s[0] != s[11], Z[11] = 0.

Implementation

Here is the implementation of the algorithm (as outlined in the book) in Python:

def getZ(s):
        result = {}
 
        l = r = 0
        for k in range(1, len(s)):
                if k > r:
                        zk = 0
                        for si in range(0, len(s)):
                                if k + si < len(s) and \
                                        s[si] == s[k + si]:
                                        pass
                                else:
                                        break
                        if si > 0:
                                zk = si
                                r = zk + k - 1
                                l = k
                else:
                        kOld = k - l
                        zOld = result[kOld]
                        b = r - k + 1
                        if zOld < b:
                                zk = zOld
                        else:
                                zk = b
 
                                for si in range(b, len(s)):
                                        if k + si < len(s) \
                                                and s[si] \
                                                == s[k + si]:
                                                pass
                                        else:
                                                break
                                zk = si
                                r = zk + k - 1
                                l = k
                result[k] = zk
        return result

That code deserves some cleanup, but it does yield the correct result:

s = 'aabcaabxaaaz'
z = getZ(s)
for k in range(1, len(s)):
	print k, z[k]
1 1
2 0
3 0
4 3
5 1
6 0
7 0
8 2
9 2
10 1
11 0

Next?

The Z algorithm can be used as a precursor for additional string analysis. With little modification it can also be changed to work as a simple, linear exact matching algorithm.


Computing the day of week

Posted: June 27th, 2006 | Author: Alex | Filed under: Algorithms, python | No Comments »

Here is a nice gem of an algorithm by Kim S. Larsen to compute the day of the week, where m, d and y represent month, day and year, respectively.

def dow(m, d, y):
    if ((m == 1) or (m == 2)):
        m += 12
        y -= 1
    return (d + 2 * m + 3 * (m + 1) / 5 +
    y + y / 4 - y / 100 + y / 400) % 7

The returned integer is a value between 0 (Monday) and 6 (Sunday).

Combine it with a list of day strings:

days = ('Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday', 'Sunday')

Like in the following example to display the name of day, given a date:

days[dow(6, 28, 2006)]

I found the algorithm in Dr. Dobbs Journal issue 229 (April 1995) and the article features a very nice description of how the formula was derived. The original C implementation of the algorithm is about as readable as the above pieces of code.