not just random

November 12, 2008

Matching circular string rotations

Filed under: Algorithms, python — Alex @ 8:11 pm

This post examines two questions:

  1. Is one string a circular rotation of a second string?
  2. Is one string a substring of a (potentially rotated) second string?

Previously discussed Z-values are used for a solution.

Circular string rotation?

Given a prefix p of string t and a suffix s of string t, such that p + s = t, then a circular string rotation r is a string of the form s + p. Thus, it is a string that consists of the suffix of string t, directly followed by the prefix of string t, such that the resulting string r has the same length as the original string t: |r| = |t|.

Example:

Original:
t = “abcd”

Rotation:
r = “cdab”, with p = “ab” and s = “cd”

How can one string be shown to be a circular rotation of a second string?

If it is, then the lengths of the two strings t and r have to be identical. Also, since t = p + s and the rotation r = s + p, then 2r = r + r = s + p + s + p = s + t + p. Thus, if r is a rotation of t, then t is included completely within 2r. Not only that, but if it is shown that r is a rotation of t, then if t is found within 2r, more specific information can be shown about the rotation, because p and s can easily be indicated, too.

Assuming access to a fast string matching routine, the mechanism to answer the question should be straightforward:

If |t| = |r| and t in r+r
Then Circular Rotation
Else No Circular Rotation

Saving space

It may not be desirable to create a new string that contains twice the data of r, particularly assuming very large instances of strings r and t. Since r = s + p is available though, it is easy to imagine what 2r would look like, without actually creating a new string consisting of 2r.

The previously mentioned Z algorithm can be heavily modified to

  1. operate on two strings, and
  2. allow for Z-values up to maximum of the string length, regardless of the index position, effectively allowing for a substring to begin towards the end of the string and continue at the beginning of the string.

getMaxSuffixZ

Let getMaxSuffixZ be a process that takes as input two strings, t and r and returns a maximum Z-value <= |t|. This involves calculating Z-values for each position in r and returning the largest one. In this case the Z-values are not calculated based on a prefix of r. Rather, string t is considered the prefix. So, if r is a rotation of t and t = p + s, then r = s + p, so maxZ should return |p|. It then only makes sense for maxZ to pick that maximum Z-Value that represents a substring that stretches to the end of r.

Here is the modified implementation:

def getMaxSuffixZ(p, s):
	result = {}
	l = 0
	r =  -1
	for k in range(0, len(s)):
		if k > r:
			zk = 0
			for si in range(0, len(s)):
				if k + si < len(s) and \
					si < len(p) and \
					p[si] == s[k + si]:
					zk += 1
				else:
					break
			if zk > 0:
				r = zk + k - 1
				l = k
		else:
			kOld = k - l - 1
			zOld = result[kOld]
			b = r - k + 1
			if zOld < b:
				zk = zOld
			else:
				zk = b
				for si in range(b, len(s)):
					if k + si < len(s) \
						and si < len(p) \
						and p[si] \
						== s[k + si]:
						pass
					else:
						break
				zk = si
				r = zk + k - 1
				l = k
		result[k] = zk
		if zk == len(p) - k:
			return zk
	return 0

This returns the expected results at least for this sample set:

assert getMaxSuffixZ(s, "") == 0
assert getMaxSuffixZ(s, "a") == 0
assert getMaxSuffixZ(s, "b") == 0
assert getMaxSuffixZ(s, "defabc") == 3
assert getMaxSuffixZ(s, "defabcd") == 0
assert getMaxSuffixZ(s, "abcdef") == 6
assert getMaxSuffixZ(s, "fabcde") == 5
print 'all tests passed.'
 
all tests passed.

Let z = getMaxSuffixZ(t, r). z is the prefix of t (and the suffix of r). If z > 0 then t[0..z-1] = r[z..|r|]. Similarly getMaxSuffixZ(r, t) yields the prefix of r (and the suffix of t).

s = "abcdef"
assert getMaxSuffixZ("", s) == 0
assert getMaxSuffixZ("a", s) == 1
assert getMaxSuffixZ("b", s) == 0
assert getMaxSuffixZ("defabc", s) == 3
assert getMaxSuffixZ("defabcd", s) == 0
assert getMaxSuffixZ("abcdef", s) == 6
assert getMaxSuffixZ("fabcde", s) == 1
print 'all tests passed.'
 
all tests passed.

If r is a rotation of t, then getMaxSuffixZ(t, r) + getMaxSuffixZ(r, t) = |t| = |r|. If t = r, then getMaxSuffixZ(t, r) = getMaxSuffixZ(r, t) = |t| = |r|. This leads to the following flow:

If |t| = |r|
Then
    zTr = getMaxSuffixZ(t, r)
    If zTr = |t|:
    Then Circular Rotation (t = r)
    Else
        If zTr = getMaxSuffixZ(r, t)
        Then Circular Rotation
        Else No Circular Rotation
Else No Circular Rotation

getMaxZ

Let getMaxZ have the same attributes of getMaxSuffixZ, with one exception: When finding matching substrings that begin at position k and reach to the end of the string s, the algorithm continues comparisons at the beginning of string s up to a maximum of position k-1. It no longer makes sense to stop at the first substring that reaches to the end of the string, as later substrings could have a longer reach into the beginning of the string. It does make sense to return a result as soon as a z-value is found that equals the length of the prefix.

def getMaxZ(p, s):
	result = {}
	l = 0
	r =  -1
	maxZk = 0
	for k in range(0, len(s)):
		if k > r:
			zk = 0
			for si in range(0, len(s)):
				if k + si < len(s) and \
					si < len(p) and \
					p[si] == s[k + si]:
					zk += 1
				else:
					break
			if zk > 0:
				r = zk + k - 1
				l = k
				if r == len(s) - 1:
					r2 = 0
					for si in range(0, k):
						if zk < len(p) and \
						p[zk] == s[si]:
							zk += 1
							r2 += 1
						else:
							break
		else:
			kOld = k - l - 1
			zOld = result[kOld]
			b = r - k + 1
			if zOld < b:
				zk = zOld
			else:
				zk = b
				for si in range(b, len(s)):
					if k + si < len(s) \
						and si < len(p) \
						and p[si] \
						== s[k + si]:
						pass
					else:
						break
				zk = si
				r = zk + k - 1
				l = k
				if r == len(s) - 1:
					r2 = 0
					for si in range(0, k):
						if zk < len(p) and \
						p[zk] == s[si]:
							zk += 1
							r2 += 1
						else:
							break
		result[k] = zk
		if zk >= len(s) - k:
			if zk > maxZk:
				maxZk = zk
		if zk == len(p):
			return zk
	return maxZk

Tests show the correct results:

s = "abcdef"
assert getMaxZ(s, s) == 6
assert getMaxZ(s, "fabcde") == 6
assert getMaxZ(s, "fgabcde") == 6
assert getMaxZ(s, "a") == 1
assert getMaxZ(s, "ba") == 2
assert getMaxZ(s, "") == 0
assert getMaxZ(s, "xyz") == 0
assert getMaxZ(s, "defabcabcdabc") == 6
assert getMaxZ("", "") == 0
assert getMaxZ(s, "defabc") == 6
print 'all tests passed.'
 
all tests passed.

Following the above manner, the resulting flow looks like this:

If |t| = |r| and maxZ(t, r) = |t|
Then Circular Rotation
Else No Circular Rotation

Matching substrings in string rotations

Given the above discussion, a mechanism can be shown that allows the matching of substrings in circular string rotations. To illustrate the problem, in a string t = “abcd” it would then be possible to find substrings such as “abc”, “cda”, “da”, “dab”, etc.

The above algorithm can be applied directly:

Let t = text and p = substring to search
If getMaxZ(p, t) = |p|
Then p found in t
Else p not found in t

Likewise, getMaxZ can be applied to this problem directly:

def showFound(p, s):
	if getMaxZ(p, s) == len(p):
		print "%s found in %s" % (p, s)
	else:
		print "%s not found in %s" % (p, s)
 
showFound("ab", s)
showFound("fa", s)
showFound("abcdef", s)
showFound("x", s)
showFound("abc", "")
showFound("efabcd", s)
showFound("efgabc", s)
showFound("abd", s)

The results look correct.

ab found in abcdef
fa found in abcdef
abcdef found in abcdef
x not found in abcdef
abc not found in
efabcd found in abcdef
efgabc not found in abcdef
abd not found in abcdef

It would make sense to add modifications to allow detecting all instances (up to a specifiable number) of substring matches along with their positions.

Next?

Run-time analysis. And lots of code cleanup!

November 3, 2008

AI Landscape

Filed under: Uncategorized — Alex @ 8:50 pm

AI Magazine recently published the following poster.

AI Landscape

It is a nice visualization of some of the many areas of life that are impacted by current AI.

November 1, 2008

The Z Algorithm

Filed under: Algorithms, Z, python — Alex @ 3:26 pm

In Algorithms on Strings, Trees and Sequences, Dan Gusfield presents the Z algorithm.

A string prefix P is a substring of S that begins at S[0]. The Z algorithm calculates for each position k > 0 in S the maximum length of P that is matched by a substring starting at k. The algorithm performs in linear time by keeping track of previously calculated values and recognizing, if the character at a currently examined start position is within a previously detected substring.

An example

Here is a string S and its Z-values for each index position k:

S = ‘aabcaabxaaaz’

k s[k] Z

0 a n/a

1 a 1

2 b 0

3 c 0

4 a 3

5 a 1

6 b 0

7 x 0

8 a 2

9 a 2

10 a 1

11 z 0

Step by step

Let’s look at a these one at a time to show how easily the Z-values can be computed.

k = 1

Then s[0] = s[1], but s[1] != [2], so Z[1] = 1. The matched substring has the boundaries l = 1 and r = 1, so s[l..r] = s[1..1] = ‘a’

k = 2

Then k > r, which means s[2] is outside the previously discovered substring. s[0] != s[2], so
Z[2] = 0.

k = 3

In the same manner, if k = 3, s[0] != s[3], so Z[3] = 0.

k = 4

Then s[0] = s[4], s[1] = s[5], s[2] = s[6], but s[3] != s[7], so Z[4] = 3. The matched substring has the boundaries l = 4 and r = 6, so s[l..r] = s[4..6] = ‘aab’

k = 5

Then k <= r: s[5] is within a previously discovered substring. The previously discovered substring starts at k - l = 5 - 4 = 1. The Z-value found at that position is Z[1] = 1. Since that value is smaller than the length of the remaining substring S[k..r], there is no need to perform other character comparisons. Z[5] = Z[1] = 1. l and r remain unchanged.

k = 6

In the same manner, if k = 6, then Z[6] = z[2] = 0.

k = 7

Then k > r, s[0] != s[7], so Z[7] = 0.

k = 8

Then k > r, s[0] = s[8], s[1] = s[9] and s[2] != s[10], so Z[8] = 2. The matched substring has the boundaries l = 8 and r = 9, so s[l..r] = s[8..9] = ‘aa’.

k = 9

Then k <= r, so s[k] is within a previously discovered substring. The previously discovered substring starts at k - l = 9 - 8 = 1. The Z-value found at that position is Z[1] = 1. That value is not smaller than the length of the remaining substring S[k..r] = s[9..9] = 'a', so additional character comparisons need to be performed. Let b = r - k + 1 = 9 - 9 + 1. Z[9] will be >= b. s[b] = s[1] = s[k + b] = s[10], but s[2] != s[11], so Z[9] = 2. The new substring has the boundaries l = 9 and r = 10.

k = 10

Then k = r, so s[k] is within the previously discovered substring of s[l..r]. The z-value for that string is Z[k - l] = Z[10-9] = Z[1] = 1. Since that value is equal to the length of the remainder of the substring s[k..r], additional comparisons are performed, but since s[2] != s[11], Z[10] = Z[1] = 1.

k = 11

Then k < r and since s[0] != s[11], Z[11] = 0.

Implementation

Here is the implementation of the algorithm (as outlined in the book) in Python:

def getZ(s):
	result = {}
 
	l = r = 0
	for k in range(1, len(s)):
		if k > r:
			zk = 0
			for si in range(0, len(s)):
				if k + si < len(s) and \
					s[si] == s[k + si]:
					pass
				else:
					break
			if si > 0:
				zk = si
				r = zk + k - 1
				l = k
		else:
			kOld = k - l
			zOld = result[kOld]
			b = r - k + 1
			if zOld < b:
				zk = zOld
			else:
				zk = b
 
				for si in range(b, len(s)):
					if k + si < len(s) \
						and s[si] \
						== s[k + si]:
						pass
					else:
						break
				zk = si
				r = zk + k - 1
				l = k
		result[k] = zk
	return result

That code deserves some cleanup, but it does yield the correct result:

s = 'aabcaabxaaaz'
z = getZ(s)
for k in range(1, len(s)):
	print k, z[k]
1 1
2 0
3 0
4 3
5 1
6 0
7 0
8 2
9 2
10 1
11 0

Next?

The Z algorithm can be used as a precursor for additional string analysis. With little modification it can also be changed to work as a simple, linear exact matching algorithm.

October 20, 2008

Holiday Inn Express Marketing

Filed under: Uncategorized — Alex @ 11:18 am

The freestyle battle rap version.

« Previous PageNext Page »