ICS 311 #23: String Matching


Even if you aren't going to need to write string matching algorithms, these algorithms are part of the core knowledge of computer science, and it is worth being familiar with this topic for ideas that can be applied elsewhere, such as table-driven algorithms (the finite-state automata being an example), and how algorithms can be written to write other algorithms.


Outline

  1. String Matching Introduction
  2. Naive (Brute Force) String Matching
  3. Matching with Finite State Automata
  4. Knuth-Morris-Pratt Algorithm
  5. FYI: Rabin-Karp Algorithm

Readings and Screencasts


String Matching Introduction

The string matching problem is the problem of finding all occurrences of a pattern string P in a target text T, also a string.

We treat both T and P as arrays T[1 .. n] and P[1 .. m], where mn. The elements of the arrays are members of a finite alphabet Σ.

More precisely, we say that P occurs with shift s in T (P occurs beginning at position s + 1 in T) if

0   ≤   s   ≤   nm     (one cannot fall off the end of T), and
T[s + 1 .. s + m]   =   P[1 .. m].

The string matching problem is the problem of finding all valid shifts.

A simple example:

Are there any other valid shifts in this example?

Applications

String matching is obviously important for implementing search in text editing programs, and for document search, whether searching within a single document or searching multiple documents for a topic of interest, such as in internet search engines. Other specialized applications include processing DNA sequences in bioinformatics.

Brief History

In 1970 Cook proved that an O(M+N) machine is possible. His theorem was not intended to be practical, but Knuth and Pratt followed the structure of the theorem to design an algorithm. Around the same time, Morris derived the algorithm independently as part of his design of a text editor. Eventually they found out about each other's work and published it in 1976. Also in 1976, Boyer and Moore found another efficient algorithm not covered here: see the Sedgewick book. In 1980 Rabin & Karp came up with a modification to brute force that searches on chunks of text of size M using a hash function. Researchers continue to come up with new algorithms for string pattern matching.

Algorithms

There are a surprising number of variations on string searching algorithms. See http://www-igm.univ-mlv.fr/~lecroq/string/index.html for summary descriptions, code and animations in Java. Find out what a "Backward Nondeterministic Dawg Matching" algorithm is!

Most algorithms require preprocessing of the pattern before entering the matching phase. Analysis must consider both costs, and applications must determine whether the preprocessing is worth the speedup in matching.

The algorithms we cover are summarized by this table (m is the size of the pattern and n is the usually much larger size of the text to be searched).

  Algorithm     Preprocessing Time     Matching Time  
  Naive   0   O((nm+1)m)
  Rabin-Karp   Θ(m)   O((nm+1)m)
  expected O(n+m)  
  Finite State Automaton     O(m3 |Σ|)   Θ(n)
  Knuth-Morris-Pratt     Θ(m)   Θ(n)

(The entry for FSA preprocessing differs from CLRS as this table gives the time for the algorithm we show below, not the algorithm in the section on Knuth-Morris-Pratt.)

The following notes will cover how these algorithms work, but not any proofs of correctness, which you may find in the CLRS text.

Notation and Terminology

Prefix: pt (p is a prefix of t) if t = pw for some w ∈ Σ*

Suffix: st (s is a suffix of t) if t = ws for some w ∈ Σ*

The empty string is denoted ε.

The k-character prefix T[1 .. k] of any text or pattern T is denoted Tk.

Comment on String Matching Time: The test of whether "x == y" takes Θ(t + 1) time, where t is the length of the longest string z such that zx and zy. (It's based on the length of z because the loop for character-by-character equality test can exit as soon as the match fails. The "1" is included to cover the case where t = 0, since a positive amount of time must be expended to determine this fact.) This comparison loop will be implicit in some of our pseudocode.


Naive (Brute Force) String Matching

It is often instructive to start with a brute force algorithm, that we can then examine for possible improvements and also use as a baseline for comparison.

The obvious approach is to start at the first character of both strings, P[1] and T[1], and then step through T and P together, checking to see whether the characters match at P[i] and T[i].

If the match fails on some character, start this process over at P[1] and T[2], and so on, but quitting after start point T[n−m].

Once P has been matched at any given shift T[s], then go on to checking at T[s+1] (since we are looking for all matches), up until s = |T| − |P| (which is nm).

There is an implied loop in line 4, as discussed under "comment on string matching time" above.

Example

Suppose P = aab and T = acaabc. There are four passes:

You an see C code and a Java applet animation at http://www-igm.univ-mlv.fr/~lecroq/string/node3.html

Analysis

No preprocessing is required.

For each of nm + 1 start positions (line 3), potentially m pattern characters are compared to the target text in the matching phase (line 4). Thus, the naive algorithm is O((nm + 1)m) in the worst case.

Inefficiencies

The brute force method does not use information about what has been matched, and does not use information about recurrences within the pattern itself. Consideration of these factors leads to improvements.

For example, when we matched P = aab at s = 2, we found that T[5] = b:

Then it is not possible for a shift of s = 3 (or s = 4 if T were longer) to be valid, beause these shifts juxtapose P[2] = a (and P[1] = a if applicable) against T[5] = b:

This information is used in the finite state automata and Knuth-Morris-Pratt approaches.


Matching with Finite State Automata

Finite state automata are machines that have a finite number of states, and move between states as they read input one symbol at a time.

Algorithms that construct (or simulate) finite state automata can be very efficient matchers, but they require some preprocessing to construct.

Finite State Automata

Finite state automata are widely used in computability theory as well as in practical algorithms. It's worth knowing about them even if you are not doing string matching.

A finite state automaton (FSA) or finite automaton M is a 5-tuple (Q, q0, A, Σ, δ) where

The FSA starts in state q0. As each character (symbol) of the input string is read, it uses δ to determine what state to transition into. Whenever M is in a state of A, the input read so far is accepted.

Here is a simple example:   q0 = 0 and A = {1}.   (What are Q, Σ, and δ?)

What strings does this FSA accept?

We define the final state function φ : Σ*Q such that φ(w) is the final state M ends up in after reading w:

φ(ε)   =   q0
φ(wa)   =   δ(φ(w), a)   for w ∈ Σ*, a ∈ Σ

String Matching Automata

Let's see how they work by an example. This is the string matching automaton for P = ababaca:

The start state is 0 and the only accepting state is 7.

This automaton can be represented with the table to the right. The shading shows the sequence of states through which a successful match transitions. These transitions correspond to the thicker arrows in the diagram.

We can run this automaton continuously on a text T, and if and when state 7 is entered output the relevant positions: if we enter state 7 when at T[i] the match will start with shift im or at position im + 1.

The following code simulates any FSA on input text T, given the FSA's table δ and pattern length m:

Where is P? Which parameter provides information about the pattern? Which parameter tells us what the accepting state q0 will be?

For example, below is a run on T = abababacaba, which includes P = ababaca starting at position 3: ab ababaca ba.

State 7 is reached at i = 9, so the pattern occurs starting at im + 1, or 9 − 7 + 1 = 3. The FSA keeps going after a match, as it may find other occurrences of the pattern.

Unlike the brute force approach, past work is not thrown away: the transitions following either failure to match a character or success to match the entire pattern pick up with the input read so far.

For example,

This makes FSAs much more efficient than brute force in the matching phase. In fact, matching is Θ(n). But how do we build them?

Constructing Finite State Automata (Preprocessing Phase)

In general, the FSA is constructed so that the state number tells us how much of a prefix of P has been matched.

Another definition is needed to formalize this.

Definitions and Strategy

The suffix function corresponding to P of length m is σP : Σ* → {0, 1, ... m} such that σP (w) is the length of the longest prefix of P that is also a suffix of w:

σP(w) = max {k : Pkw}.

For example, if P = ab then

(For simplicity, we leave out the subscript P when it is clear from the context.) Then we can define the automaton for pattern P[1 .. m] as:

Now you see why we wanted the state numbers to be the length of the prefix matched. By defining δ(q, a) = σ(Pq a), the state of the FSA keeps track of the longest prefix of the pattern P that has matched the input text T so far.

In order for a substring of T ending at T[i] to match some prefix Pj, then this prefix Pj must be a suffix of T[i].

We design δ such that the state q = φ(T[i]) gives the length of the longest prefix of P that matches a suffix of T. We have:

q = φ(Ti) = σ(Ti), and PqTi (Pq is a suffix of Ti).

Given a match so far to Pq (which may be ε) and reading character a, there are two kinds of transitions:

An example of this second case was already noted above for δ(5, b) = 4.

Preprocessing Procedure

The following procedure computes the transition function δ from a pattern P[1 .. m], by matching P against itself. Try to explain how the algorithm works before I explain it below!

The nested loops process all combinations of states q and characters a needed for the cells of the table representing δ.

Lines 4-8 set δ(q, a) to the largest k such that PkPqa.

Analysis

Compute-Transition-Function requires m*|Σ| for the nested outer loops.

Within these loops, the inner repeat runs at most m + 1 times; and the test on line 7 can require comparing up to m characters. Together lines 5-7 contribute O(m2).

Therefore, Compute-Transition-Function is O(m3 |Σ|). This is rather expensive preprocessing, but the Θ(n) matching is the best that can be expected.

You an see C code and a Java applet animation at http://www-igm.univ-mlv.fr/~lecroq/string/nod43.html


Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm improves on the FSA algorithm by avoiding computation of δ.

Instead it precomputes an auxiliary prefix function πP, represented as an array πP[1 .. m], that can be computed from a pattern P of length m in Θ(m) time. (We'll leave off the P subscript from now on.)

π[q] is the length of the longest prefix of P that is a proper suffix of Pq. This information enables fast amortized computation of δ(q, a) on the fly.

The KMP matcher operates like the FSA matcher, but using π instead of δ. There is some additional overhead at matching time, but asymptotic performance of matching remains Θ(n).

How π Works

Given P[1 .. m], the prefix function π for P is π : {1, 2 ..., m} → {0, 1, ..., m-1} such that

π[q] = max{k : k < q and PkPq}

For example, for P = ababaca, π is as follows (try using the formula above to explain a few of the entries below, particularly for i=5.):

The table is structured such that a failure at a given position will drive the system to hop to the next smaller prefix that could be matched. (Here I am skipping over the textbook's discussion of π* and Lemma 32.5, and using examples instead).

For example, for P = ababaca, if q = 5 and we are reading the 6th character (to the right of the line), consider what happens when that character is c, b or a:

From P[6] we see that a c is expected next.

Thus, π helps us retain information from prior comparisions, shifting back to the state for the maximum prefix matched so far, rather than starting over.

The KMP Matcher

The KMP matching code is below. We can see the "skipping" discussed above happening in lines 6-7, with successful matching of the next character handled in lines 8-9. After a successful match, we jump to the appropriate state to continue looking for the next item in line 12.

Computing π

The code for computing π is very similar to that of KMP-Matcher, because the constructor is matching P against itself (take some time to compare them!):

You an see C code and a Java applet animation at http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

Analysis and Correctness

The text presents an analysis showing that Compute-Prefix-Function is Θ(m) -- a considerable improvement over the FSA's O(m3 |Σ|) -- and KMP-Matcher is Θ(n).

These are good results, but in practice the FSA approach is still used when a pattern will be used many times repeatedly, because Compute-Transition-Function need only be run once on relatively small patterns of size m, and the greater simplicity of the Finite-Automaton-Matcher code.

The proof of correctness is accomplished by showing how KMP-Matcher simulates the FSA matcher's operation. It's worth reading the informal discussion pages 1009-1010, even if you don't wade into the formal proof.


Rabin-Karp Algorithm

There is an entirely different approach that functions in some respects like the brute force approach, but instead of testing character by character it tests on whole substrings of length m by using a hash function. If the hash function matches, it then uses brute force checking to verify the match.

Comparing Strings as Radix-d Numbers

Given |Σ| = d, the Rabin-Karp algorithm treats each string in Σ* as if it were a number in radix d notation.

This idea can be extended to large but finite |Σ|.

Thus, we can treat P[1 .. m] and substrings T[s+1 .. s+m] (0 ≤ snm) as m digit numbers in radix-|Σ|.

Our running example will assume Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and d = 10.

If we can compare these numbers directly to each other, we can quickly determine whether P[1 .. m] matches T[s+1 .. s+m].

Two modifications need to be made to make this method suitable for a string matching algorithm: shifting and hashing.

Shifting

If a match to T[s+1 .. s+m] fails, we need to compute the next number for T[s+2 .. s+m+1].

Rather than recomputing the entire sum, this can be done efficiently by a mathematical "shift":

An example is shown below, shifting a 5-digit number. (The mod will be explained below.)

Hashing

These numbers can get very large. Even on a machine that allows arbitrarily large numbers, this is still a problem: we can no longer assume that comparing two numbers takes constant time.

The solution is to compute the numbers modulo the largest prime number q such that dq still fits in one computer word (which can be manipulated with one machine level instruction).

Now you see why it is called "hashing:" it is like the division method of hashing.

For example, suppose we are matching to a 5 digit pattern P = 31415 in radix d = 10, we choose q = 13, and we are comparing to string 14142. The hash code for the pattern is 7, and for the target is 8, as shown in the previous example. The hashes do not match, so the two patterns cannot be the same.

However, the converse is not necessarily true. Hashing introduces a secondary problem: collisions. Two different numbers may hash to the same value, as shown in the example below.

To solve this problem, when the hashes of P[1 .. m] and T[s+1 .. s+m] are equal, the Rabin-Karp algorithm verifies the match with a brute force comparison.

This is still saving on comparisons over the brute force method, as we do not compare characters for failed hash matches: the strings cannot be the same.

The Rabin-Karp Algorithm

Here is the pseudocode:

(The subscripts on t are just for exposition and can be ignored.)

You an see C code and a Java applet animation at http://www-igm.univ-mlv.fr/~lecroq/string/node5.html

Anaysis

Preprocessing is Θ(m): the loop 6-8 executes m times and with the modular arithmetic keeping the numbers within the size of one computer word the numerical work in the loop is O(1).

The worst case matching time is still Θ((nm + 1)m), like brute force, because it is possible that every hash code matches and therefore every brute force comparison has to be done.

But this is highly unlikely in any realistic application (where we are searching for somewhat rare patterns in large texts). For constant number of valid shifts the text offers an argument that the expected time is O(n + m), or O(n) since mn.

Extensions

A major advantage of Rabin-Karp is that it has natural extensions to other pattern matching problems, such as two-dimensional pattern matching (finding an m x m pattern in an array of n x n characters), or searching for multiple patterns at once. Therefore it remains an important pattern matching algorithm.


Dan Suthers
Last modified: Sat Dec 2 05:09:59 HST 2017
Images are from the instructor's material for Cormen et al. Introduction to Algorithms, Third Edition.