The Z Algorithm
Posted: November 1st, 2008 | Author: Alex | Filed under: Algorithms, python, Z | 1 Comment »In Algorithms on Strings, Trees and Sequences, Dan Gusfield presents the Z algorithm.
A string prefix P is a substring of S that begins at S[0]. The Z algorithm calculates for each position k > 0 in S the maximum length of P that is matched by a substring starting at k. The algorithm performs in linear time by keeping track of previously calculated values and recognizing, if the character at a currently examined start position is within a previously detected substring.
An example
Here is a string S and its Z-values for each index position k:
S = ‘aabcaabxaaaz’
| k | s[k] | Z |
|---|---|---|
| 0 | a | n/a |
| 1 | a | 1 |
| 2 | b | 0 |
| 3 | c | 0 |
| 4 | a | 3 |
| 5 | a | 1 |
| 6 | b | 0 |
| 7 | x | 0 |
| 8 | a | 2 |
| 9 | a | 2 |
| 10 | a | 1 |
| 11 | z | 0 |
Step by step
Let’s look at a these one at a time to show how easily the Z-values can be computed.
k = 1
Then s[0] = s[1], but s[1] != [2], so Z[1] = 1. The matched substring has the boundaries l = 1 and r = 1, so s[l..r] = s[1..1] = ‘a’
k = 2
Then k > r, which means s[2] is outside the previously discovered substring. s[0] != s[2], so
Z[2] = 0.
k = 3
In the same manner, if k = 3, s[0] != s[3], so Z[3] = 0.
k = 4
Then s[0] = s[4], s[1] = s[5], s[2] = s[6], but s[3] != s[7], so Z[4] = 3. The matched substring has the boundaries l = 4 and r = 6, so s[l..r] = s[4..6] = ‘aab’
k = 5
Then k <= r: s[5] is within a previously discovered substring. The previously discovered substring starts at k - l = 5 - 4 = 1. The Z-value found at that position is Z[1] = 1. Since that value is smaller than the length of the remaining substring S[k..r], there is no need to perform other character comparisons. Z[5] = Z[1] = 1. l and r remain unchanged.
k = 6
In the same manner, if k = 6, then Z[6] = z[2] = 0.
k = 7
Then k > r, s[0] != s[7], so Z[7] = 0.
k = 8
Then k > r, s[0] = s[8], s[1] = s[9] and s[2] != s[10], so Z[8] = 2. The matched substring has the boundaries l = 8 and r = 9, so s[l..r] = s[8..9] = ‘aa’.
k = 9
Then k <= r, so s[k] is within a previously discovered substring. The previously discovered substring starts at k - l = 9 - 8 = 1. The Z-value found at that position is Z[1] = 1. That value is not smaller than the length of the remaining substring S[k..r] = s[9..9] = 'a', so additional character comparisons need to be performed. Let b = r - k + 1 = 9 - 9 + 1. Z[9] will be >= b. s[b] = s[1] = s[k + b] = s[10], but s[2] != s[11], so Z[9] = 2. The new substring has the boundaries l = 9 and r = 10.
k = 10
Then k = r, so s[k] is within the previously discovered substring of s[l..r]. The z-value for that string is Z[k - l] = Z[10-9] = Z[1] = 1. Since that value is equal to the length of the remainder of the substring s[k..r], additional comparisons are performed, but since s[2] != s[11], Z[10] = Z[1] = 1.
k = 11
Then k < r and since s[0] != s[11], Z[11] = 0.
Implementation
Here is the implementation of the algorithm (as outlined in the book) in Python:
def getZ(s): result = {} l = r = 0 for k in range(1, len(s)): if k > r: zk = 0 for si in range(0, len(s)): if k + si < len(s) and \ s[si] == s[k + si]: pass else: break if si > 0: zk = si r = zk + k - 1 l = k else: kOld = k - l zOld = result[kOld] b = r - k + 1 if zOld < b: zk = zOld else: zk = b for si in range(b, len(s)): if k + si < len(s) \ and s[si] \ == s[k + si]: pass else: break zk = si r = zk + k - 1 l = k result[k] = zk return result
That code deserves some cleanup, but it does yield the correct result:
s = 'aabcaabxaaaz' z = getZ(s) for k in range(1, len(s)): print k, z[k] 1 1 2 0 3 0 4 3 5 1 6 0 7 0 8 2 9 2 10 1 11 0
Next?
The Z algorithm can be used as a precursor for additional string analysis. With little modification it can also be changed to work as a simple, linear exact matching algorithm.
Thanks for this step-by-step explanation.
I could find any book that would explain algorithm (Z and other) in so straightforward way.