# Aho-Corasick algorithm¶

The Aho-Corasick algorithm allows us to quickly search for multiple patterns in a text. The set of pattern strings is also called a dictionary. We will denote the total length of its constituent strings by $m$ and the size of the alphabet by $k$. The algorithm constructs a finite state automaton based on a trie in $O(m k)$ time and then uses it to process the text.

The algorithm was proposed by Alfred Aho and Margaret Corasick in 1975.

## Construction of the trie¶

A trie based on words "Java", "Rad", "Rand", "Rau", "Raum" and "Rose".

Formally, a trie is a rooted tree, where each edge of the tree is labeled with some letter and outgoing edges of a vertex have distinct labels.

We will identify each vertex in the trie with the string formed by the labels on the path from the root to that vertex.

Each vertex will also have a flag $\text{output}$ which will be set if the vertex corresponds to a pattern in the dictionary.

Accordingly, a trie for a set of strings is a trie such that each $\text{output}$ vertex corresponds to one string from the set, and conversely, each string of the set corresponds to one $\text{output}$ vertex.

We now describe how to construct a trie for a given set of strings in linear time with respect to their total length.

We introduce a structure for the vertices of the tree:

const int K = 26;

struct Vertex {
int next[K];
bool output = false;

Vertex() {
fill(begin(next), end(next), -1);
}
};

vector<Vertex> trie(1);


Here, we store the trie as an array of $\text{Vertex}$. Each $\text{Vertex}$ contains the flag $\text{output}$ and the edges in the form of an array $\text{next}[]$, where $\text{next}[i]$ is the index of the vertex that we reach by following the character $i$, or $-1$ if there is no such edge. Initially, the trie consists of only one vertex - the root - with the index $0$.

Now we implement a function that will add a string $s$ to the trie. The implementation is simple: we start at the root node, and as long as there are edges corresponding to the characters of $s$ we follow them. If there is no edge for one character, we generate a new vertex and connect it with an edge. At the end of the process we mark the last vertex with the flag $\text{output}$.

void add_string(string const& s) {
int v = 0;
for (char ch : s) {
int c = ch - 'a';
if (trie[v].next[c] == -1) {
trie[v].next[c] = trie.size();
trie.emplace_back();
}
v = trie[v].next[c];
}
trie[v].output = true;
}


This implementation obviously runs in linear time, and since every vertex stores $k$ links, it will use $O(m k)$ memory.

It is possible to decrease the memory consumption to $O(m)$ by using a map instead of an array in each vertex. However, this will increase the time complexity to $O(m \log k)$.

## Construction of an automaton¶

Suppose we have built a trie for the given set of strings. Now let's look at it from a different side. If we look at any vertex, the string that corresponds to it is a prefix of one or more strings in the set, thus each vertex of the trie can be interpreted as a position in one or more strings from the set.

In fact, the trie vertices can be interpreted as states in a finite deterministic automaton. From any state we can transition - using some input letter - to other states, i.e., to another position in the set of strings. For example, if there is only one string $abc$ in the dictionary, and we are standing at vertex $ab$, then using the letter $c$ we can go to the vertex $abc$.

Thus we can understand the edges of the trie as transitions in an automaton according to the corresponding letter. However, in an automaton we need to have transitions for each combination of a state and a letter. If we try to perform a transition using a letter, and there is no corresponding edge in the trie, then we nevertheless must go into some state.

More precisely, suppose we are in a state corresponding to a string $t$, and we want to transition to a different state using the character $c$. If there is an edge labeled with this letter $c$, then we can simply go over this edge, and get the vertex corresponding to $t + c$. If there is no such edge, since we want to maintain the invariant that the current state is the longest partial match in the processed string, we must find the longest string in the trie that's a proper suffix of the string $t$, and try to perform a transition from there.

For example, let the trie be constructed by the strings $ab$ and $bc$, and we are currently at the vertex corresponding to $ab$, which is also an $\text{output}$ vertex. To transition with the letter $c$, we are forced to go to the state corresponding to the string $b$, and from there follow the edge with the letter $c$.

An Aho-Corasick automaton based on words "a", "ab", "bc", "bca", "c" and "caa".

A suffix link for a vertex $p$ is an edge that points to the longest proper suffix of the string corresponding to the vertex $p$. The only special case is the root of the trie, whose suffix link will point to itself. Now we can reformulate the statement about the transitions in the automaton like this: while there is no transition from the current vertex of the trie using the current letter (or until we reach the root), we follow the suffix link.

Thus we reduced the problem of constructing an automaton to the problem of finding suffix links for all vertices of the trie. However, we will build these suffix links, oddly enough, using the transitions constructed in the automaton.

The suffix links of the root vertex and all its immediate children point to the root vertex. For any vertex $v$ deeper in the tree, we can calculate the suffix link as follows: if $p$ is the ancestor of $v$ with $c$ being the letter labeling the edge from $p$ to $v$, go to $p$, then follow its suffix link, and perform the transition with the letter $c$ from there.

Thus, the problem of finding the transitions has been reduced to the problem of finding suffix links, and the problem of finding suffix links has been reduced to the problem of finding a suffix link and a transition, except for vertices closer to the root. So we have a recursive dependence that we can resolve in linear time.

Let's move to the implementation. Note that we now will store the ancestor $p$ and the character $pch$ of the edge from $p$ to $v$ for each vertex $v$. Also, at each vertex we will store the suffix link $\text{link}$ (or $-1$ if it hasn't been calculated yet), and in the array $\text{go}[k]$ the transitions in the machine for each symbol (again $-1$ if it hasn't been calculated yet).

const int K = 26;

struct Vertex {
int next[K];
bool output = false;
int p = -1;
char pch;