The Z Algorithm

Posted: November 1st, 2008 | Author: Alex | Filed under: Algorithms, Z, python | 1 Comment »

In Algorithms on Strings, Trees and Sequences, Dan Gusfield presents the Z algorithm.

A string prefix P is a substring of S that begins at S[0]. The Z algorithm calculates for each position k > 0 in S the maximum length of P that is matched by a substring starting at k. The algorithm performs in linear time by keeping track of previously calculated values and recognizing, if the character at a currently examined start position is within a previously detected substring.

An example

Here is a string S and its Z-values for each index position k:

S = ‘aabcaabxaaaz’

k s[k] Z

0 a n/a

1 a 1

2 b 0

3 c 0

4 a 3

5 a 1

6 b 0

7 x 0

8 a 2

9 a 2

10 a 1

11 z 0

Step by step

Let’s look at a these one at a time to show how easily the Z-values can be computed.

k = 1

Then s[0] = s[1], but s[1] != [2], so Z[1] = 1. The matched substring has the boundaries l = 1 and r = 1, so s[l..r] = s[1..1] = ‘a’

k = 2

Then k > r, which means s[2] is outside the previously discovered substring. s[0] != s[2], so
Z[2] = 0.

k = 3

In the same manner, if k = 3, s[0] != s[3], so Z[3] = 0.

k = 4

Then s[0] = s[4], s[1] = s[5], s[2] = s[6], but s[3] != s[7], so Z[4] = 3. The matched substring has the boundaries l = 4 and r = 6, so s[l..r] = s[4..6] = ‘aab’

k = 5

Then k <= r: s[5] is within a previously discovered substring. The previously discovered substring starts at k - l = 5 - 4 = 1. The Z-value found at that position is Z[1] = 1. Since that value is smaller than the length of the remaining substring S[k..r], there is no need to perform other character comparisons. Z[5] = Z[1] = 1. l and r remain unchanged.

k = 6

In the same manner, if k = 6, then Z[6] = z[2] = 0.

k = 7

Then k > r, s[0] != s[7], so Z[7] = 0.

k = 8

Then k > r, s[0] = s[8], s[1] = s[9] and s[2] != s[10], so Z[8] = 2. The matched substring has the boundaries l = 8 and r = 9, so s[l..r] = s[8..9] = ‘aa’.

k = 9

Then k <= r, so s[k] is within a previously discovered substring. The previously discovered substring starts at k - l = 9 - 8 = 1. The Z-value found at that position is Z[1] = 1. That value is not smaller than the length of the remaining substring S[k..r] = s[9..9] = 'a', so additional character comparisons need to be performed. Let b = r - k + 1 = 9 - 9 + 1. Z[9] will be >= b. s[b] = s[1] = s[k + b] = s[10], but s[2] != s[11], so Z[9] = 2. The new substring has the boundaries l = 9 and r = 10.

k = 10

Then k = r, so s[k] is within the previously discovered substring of s[l..r]. The z-value for that string is Z[k - l] = Z[10-9] = Z[1] = 1. Since that value is equal to the length of the remainder of the substring s[k..r], additional comparisons are performed, but since s[2] != s[11], Z[10] = Z[1] = 1.

k = 11

Then k < r and since s[0] != s[11], Z[11] = 0.

Implementation

Here is the implementation of the algorithm (as outlined in the book) in Python:

def getZ(s):
        result = {}
 
        l = r = 0
        for k in range(1, len(s)):
                if k > r:
                        zk = 0
                        for si in range(0, len(s)):
                                if k + si < len(s) and \
                                        s[si] == s[k + si]:
                                        pass
                                else:
                                        break
                        if si > 0:
                                zk = si
                                r = zk + k - 1
                                l = k
                else:
                        kOld = k - l
                        zOld = result[kOld]
                        b = r - k + 1
                        if zOld < b:
                                zk = zOld
                        else:
                                zk = b
 
                                for si in range(b, len(s)):
                                        if k + si < len(s) \
                                                and s[si] \
                                                == s[k + si]:
                                                pass
                                        else:
                                                break
                                zk = si
                                r = zk + k - 1
                                l = k
                result[k] = zk
        return result

That code deserves some cleanup, but it does yield the correct result:

s = 'aabcaabxaaaz'
z = getZ(s)
for k in range(1, len(s)):
	print k, z[k]
1 1
2 0
3 0
4 3
5 1
6 0
7 0
8 2
9 2
10 1
11 0

Next?

The Z algorithm can be used as a precursor for additional string analysis. With little modification it can also be changed to work as a simple, linear exact matching algorithm.


One Comment on “The Z Algorithm”

  1. 1 Peter said at 5:59 am on August 7th, 2009:

    Thanks for this step-by-step explanation.
    I could find any book that would explain algorithm (Z and other) in so straightforward way.


Leave a Reply