Matching circular string rotations
This post examines two questions:
- Is one string a circular rotation of a second string?
- Is one string a substring of a (potentially rotated) second string?
Previously discussed Z-values are used for a solution.
Circular string rotation?
Given a prefix p of string t and a suffix s of string t, such that p + s = t, then a circular string rotation r is a string of the form s + p. Thus, it is a string that consists of the suffix of string t, directly followed by the prefix of string t, such that the resulting string r has the same length as the original string t: |r| = |t|.
Example:
Original:
t = “abcd”
Rotation:
r = “cdab”, with p = “ab” and s = “cd”
How can one string be shown to be a circular rotation of a second string?
If it is, then the lengths of the two strings t and r have to be identical. Also, since t = p + s and the rotation r = s + p, then 2r = r + r = s + p + s + p = s + t + p. Thus, if r is a rotation of t, then t is included completely within 2r. Not only that, but if it is shown that r is a rotation of t, then if t is found within 2r, more specific information can be shown about the rotation, because p and s can easily be indicated, too.
Assuming access to a fast string matching routine, the mechanism to answer the question should be straightforward:
If |t| = |r| and t in r+r Then Circular Rotation Else No Circular Rotation
Saving space
It may not be desirable to create a new string that contains twice the data of r, particularly assuming very large instances of strings r and t. Since r = s + p is available though, it is easy to imagine what 2r would look like, without actually creating a new string consisting of 2r.
The previously mentioned Z algorithm can be heavily modified to
- operate on two strings, and
- allow for Z-values up to maximum of the string length, regardless of the index position, effectively allowing for a substring to begin towards the end of the string and continue at the beginning of the string.
getMaxSuffixZ
Let getMaxSuffixZ be a process that takes as input two strings, t and r and returns a maximum Z-value <= |t|. This involves calculating Z-values for each position in r and returning the largest one. In this case the Z-values are not calculated based on a prefix of r. Rather, string t is considered the prefix. So, if r is a rotation of t and t = p + s, then r = s + p, so maxZ should return |p|. It then only makes sense for maxZ to pick that maximum Z-Value that represents a substring that stretches to the end of r.
Here is the modified implementation:
def getMaxSuffixZ(p, s): result = {} l = 0 r = -1 for k in range(0, len(s)): if k > r: zk = 0 for si in range(0, len(s)): if k + si < len(s) and \ si < len(p) and \ p[si] == s[k + si]: zk += 1 else: break if zk > 0: r = zk + k - 1 l = k else: kOld = k - l - 1 zOld = result[kOld] b = r - k + 1 if zOld < b: zk = zOld else: zk = b for si in range(b, len(s)): if k + si < len(s) \ and si < len(p) \ and p[si] \ == s[k + si]: pass else: break zk = si r = zk + k - 1 l = k result[k] = zk if zk == len(p) - k: return zk return 0
This returns the expected results at least for this sample set:
assert getMaxSuffixZ(s, "") == 0 assert getMaxSuffixZ(s, "a") == 0 assert getMaxSuffixZ(s, "b") == 0 assert getMaxSuffixZ(s, "defabc") == 3 assert getMaxSuffixZ(s, "defabcd") == 0 assert getMaxSuffixZ(s, "abcdef") == 6 assert getMaxSuffixZ(s, "fabcde") == 5 print 'all tests passed.' all tests passed.
Let z = getMaxSuffixZ(t, r). z is the prefix of t (and the suffix of r). If z > 0 then t[0..z-1] = r[z..|r|]. Similarly getMaxSuffixZ(r, t) yields the prefix of r (and the suffix of t).
s = "abcdef" assert getMaxSuffixZ("", s) == 0 assert getMaxSuffixZ("a", s) == 1 assert getMaxSuffixZ("b", s) == 0 assert getMaxSuffixZ("defabc", s) == 3 assert getMaxSuffixZ("defabcd", s) == 0 assert getMaxSuffixZ("abcdef", s) == 6 assert getMaxSuffixZ("fabcde", s) == 1 print 'all tests passed.' all tests passed.
If r is a rotation of t, then getMaxSuffixZ(t, r) + getMaxSuffixZ(r, t) = |t| = |r|. If t = r, then getMaxSuffixZ(t, r) = getMaxSuffixZ(r, t) = |t| = |r|. This leads to the following flow:
If |t| = |r|
Then
zTr = getMaxSuffixZ(t, r)
If zTr = |t|:
Then Circular Rotation (t = r)
Else
If zTr = getMaxSuffixZ(r, t)
Then Circular Rotation
Else No Circular Rotation
Else No Circular Rotation
getMaxZ
Let getMaxZ have the same attributes of getMaxSuffixZ, with one exception: When finding matching substrings that begin at position k and reach to the end of the string s, the algorithm continues comparisons at the beginning of string s up to a maximum of position k-1. It no longer makes sense to stop at the first substring that reaches to the end of the string, as later substrings could have a longer reach into the beginning of the string. It does make sense to return a result as soon as a z-value is found that equals the length of the prefix.
def getMaxZ(p, s): result = {} l = 0 r = -1 maxZk = 0 for k in range(0, len(s)): if k > r: zk = 0 for si in range(0, len(s)): if k + si < len(s) and \ si < len(p) and \ p[si] == s[k + si]: zk += 1 else: break if zk > 0: r = zk + k - 1 l = k if r == len(s) - 1: r2 = 0 for si in range(0, k): if zk < len(p) and \ p[zk] == s[si]: zk += 1 r2 += 1 else: break else: kOld = k - l - 1 zOld = result[kOld] b = r - k + 1 if zOld < b: zk = zOld else: zk = b for si in range(b, len(s)): if k + si < len(s) \ and si < len(p) \ and p[si] \ == s[k + si]: pass else: break zk = si r = zk + k - 1 l = k if r == len(s) - 1: r2 = 0 for si in range(0, k): if zk < len(p) and \ p[zk] == s[si]: zk += 1 r2 += 1 else: break result[k] = zk if zk >= len(s) - k: if zk > maxZk: maxZk = zk if zk == len(p): return zk return maxZk
Tests show the correct results:
s = "abcdef" assert getMaxZ(s, s) == 6 assert getMaxZ(s, "fabcde") == 6 assert getMaxZ(s, "fgabcde") == 6 assert getMaxZ(s, "a") == 1 assert getMaxZ(s, "ba") == 2 assert getMaxZ(s, "") == 0 assert getMaxZ(s, "xyz") == 0 assert getMaxZ(s, "defabcabcdabc") == 6 assert getMaxZ("", "") == 0 assert getMaxZ(s, "defabc") == 6 print 'all tests passed.' all tests passed.
Following the above manner, the resulting flow looks like this:
If |t| = |r| and maxZ(t, r) = |t| Then Circular Rotation Else No Circular Rotation
Matching substrings in string rotations
Given the above discussion, a mechanism can be shown that allows the matching of substrings in circular string rotations. To illustrate the problem, in a string t = “abcd” it would then be possible to find substrings such as “abc”, “cda”, “da”, “dab”, etc.
The above algorithm can be applied directly:
Let t = text and p = substring to search If getMaxZ(p, t) = |p| Then p found in t Else p not found in t
Likewise, getMaxZ can be applied to this problem directly:
def showFound(p, s): if getMaxZ(p, s) == len(p): print "%s found in %s" % (p, s) else: print "%s not found in %s" % (p, s) showFound("ab", s) showFound("fa", s) showFound("abcdef", s) showFound("x", s) showFound("abc", "") showFound("efabcd", s) showFound("efgabc", s) showFound("abd", s)
The results look correct.
ab found in abcdef fa found in abcdef abcdef found in abcdef x not found in abcdef abc not found in efabcd found in abcdef efgabc not found in abcdef abd not found in abcdef
It would make sense to add modifications to allow detecting all instances (up to a specifiable number) of substring matches along with their positions.
Next?
Run-time analysis. And lots of code cleanup!