# Suffix Array

## Definition

Let $s$ be a string of length $n$. The $i$-th suffix of $s$ is the substring $s[i \ldots n - 1]$.

A suffix array will contain integers that represent the starting indexes of the all the suffixes of a given string, after the aforementioned suffixes are sorted.

As an example look at the string $s = abaab$. All suffixes are as follows $$\begin{array}{ll} 0. & abaab \\ 1. & baab \\ 2. & aab \\ 3. & ab \\ 4. & b \end{array}$$

After sorting these strings: $$\begin{array}{ll} 2. & aab \\ 3. & ab \\ 0. & abaab \\ 4. & b \\ 1. & baab \end{array}$$

Therefore the suffix array for $s$ will be $(2,~ 3,~ 0,~ 4,~ 1)$.

As a data structure it is widely used in areas such as data compression, bioinformatics and, in general, in any area that deals with strings and string matching problems.

## Construction

### $O(n^2 \log n)$ approach

This is the most naive approach. Get all the suffixes and sort them using quicksort or mergesort and simultaneously retain their original indices. Sorting uses $O(n \log n)$ comparisons, and since comparing two strings will additionally take $O(n)$ time, we get the final complexity of $O(n^2 \log n)$.

### $O(n \log n)$ approach

Strictly speaking the following algorithm will not sort the suffixes, but rather the cyclic shifts of a string. However we can very easily derive an algorithm for sorting suffixes from it: it is enough to append an arbitrary character to the end of the string which is smaller than any character from the string. It is common to use the symbol $\$$. Then the order of the sorted cyclic shifts is equivalent to the order of the sorted suffixes, as demonstrated here with the string dabbb.$$\begin{array}{lll} 1. & abbb\$d & abbb \\ 4. & b\$dabb & b \\ 3. & bb\$dab & bb \\ 2. & bbb\$da & bbb \\ 0. & dabbb\$ & dabbb \end{array}$$Since we are going to sort cyclic shifts, we will consider cyclic substrings. We will use the notation s[i \dots j] for the substring of s even if i > j. In this case we actually mean the string s[i \dots n-1] + s[0 \dots j]. In addition we will take all indices modulo the length of s, and will omit the modulo operation for simplicity. The algorithm we discuss will perform \lceil \log n \rceil + 1 iterations. In the k-th iteration (k = 0 \dots \lceil \log n \rceil) we sort the n cyclic substrings of s of length 2^k. After the \lceil \log n \rceil-th iteration the substrings of length 2^{\lceil \log n \rceil} \ge n will be sorted, so this is equivalent to sorting the cyclic shifts altogether. In each iteration of the algorithm, in addition to the permutation p[0 \dots n-1], where p[i] is the index of the i-th substring (starting at i and with length 2^k) in the sorted order, we will also maintain an array c[0 \dots n-1], where c[i] corresponds to the equivalence class to which the substring belongs. Because some of the substrings will be identical, and the algorithm needs to treat them equally. For convenience the classes will be labeled by numbers started from zero. In addition the numbers c[i] will be assigned in such a way that they preserve information about the order: if one substring is smaller than the other, then it should also have a smaller class label. The number of equivalence classes will be stored in a variable \text{classes}. Let's look at an example. Consider the string s = aaba. The cyclic substrings and the corresponding arrays p[] and c[] are given for each iteration:$$\begin{array}{cccc} 0: & (a,~ a,~ b,~ a) & p = (0,~ 1,~ 3,~ 2) & c = (0,~ 0,~ 1,~ 0)\\ 1: & (aa,~ ab,~ ba,~ aa) & p = (0,~ 3,~ 1,~ 2) & c = (0,~ 1,~ 2,~ 0)\\ 2: & (aaba,~ abaa,~ baaa,~ aaab) & p = (3,~ 0,~ 1,~ 2) & c = (1,~ 2,~ 3,~ 0)\\ \end{array}$$It is worth noting that the values of p[] can be different. For example in the 0-th iteration the array could also be p = (3,~ 1,~ 0,~ 2) or p = (3,~ 0,~ 1,~ 2). All these options permutation the substrings into a sorted order. So they are all valid. At the same time the array c[] is fixed, there can be no ambiguities. Let us now focus on the implementation of the algorithm. We will write a function that takes a string s and returns the permutations of the sorted cyclic shifts. vector<int> sort_cyclic_shifts(string const& s) { int n = s.size(); const int alphabet = 256;  At the beginning (in the 0-th iteration) we must sort the cyclic substrings of length 1, that is we have to sort all characters of the string and divide them into equivalence classes (same symbols get assigned to the same class). This can be done trivially, for example, by using counting sort. For each character we count how many times it appears in the string, and then use this information to create the array p[]. After that we go through the array p[] and construct c[] by comparing adjacent characters.  vector<int> p(n), c(n), cnt(max(alphabet, n), 0); for (int i = 0; i < n; i++) cnt[s[i]]++; for (int i = 1; i < alphabet; i++) cnt[i] += cnt[i-1]; for (int i = 0; i < n; i++) p[--cnt[s[i]]] = i; c[p] = 0; int classes = 1; for (int i = 1; i < n; i++) { if (s[p[i]] != s[p[i-1]]) classes++; c[p[i]] = classes - 1; }  Now we have to talk about the iteration step. Let's assume we have already performed the k-1-th step and computed the values of the arrays p[] and c[] for it. We want to compute the values for the k-th step in O(n) time. Since we perform this step O(\log n) times, the complete algorithm will have a time complexity of O(n \log n). To do this, note that the cyclic substrings of length 2^k consists of two substrings of length 2^{k-1} which we can compare with each other in O(1) using the information from the previous phase - the values of the equivalence classes c[]. Thus, for two substrings of length 2^k starting at position i and j, all necessary information to compare them is contained in the pairs (c[i],~ c[i + 2^{k-1}]) and (c[j],~ c[j + 2^{k-1}]).$$\dots \overbrace{ \underbrace{s_i \dots s_{i+2^{k-1}-1}}_{\text{length} = 2^{k-1},~ \text{class} = c[i]} \quad \underbrace{s_{i+2^{k-1}} \dots s_{i+2^k-1}}_{\text{length} = 2^{k-1},~ \text{class} = c[i + 2^{k-1}]} }^{\text{length} = 2^k} \dots \overbrace{ \underbrace{s_j \dots s_{j+2^{k-1}-1}}_{\text{length} = 2^{k-1},~ \text{class} = c[j]} \quad \underbrace{s_{j+2^{k-1}} \dots s_{j+2^k-1}}_{\text{length} = 2^{k-1},~ \text{class} = c[j + 2^{k-1}]} }^{\text{length} = 2^k} \dots $$This gives us a very simple solution: sort the substrings of length 2^k by these pairs of numbers. This will give us the required order p[]. However a normal sort runs in O(n \log n) time, with which we are not satisfied. This will only give us an algorithm for constructing a suffix array in O(n \log^2 n) times. How do we quickly perform such a sorting of the pairs? Since the elements of the pairs do not exceed n, we can use counting sort again. However sorting pairs with counting sort is not the most efficient. To achieve a better hidden constant in the complexity, we will use another trick. We use here the technique on which radix sort is based: to sort the pairs we first sort them by the second element, and then by the first element (with a stable sort, i.e. sorting without breaking the relative order of equal elements). However the second elements were already sorted in the previous iteration. Thus, in order to sort the pairs by the second elements, we just need to subtract 2^{k-1} from the indices in p[] (e.g. if the smallest substring of length 2^{k-1} starts at position i, then the substring of length 2^k with the smallest second half starts at i - 2^{k-1}). So only by simple subtractions we can sort the second elements of the pairs in p[]. Now we need to perform a stable sort by the first elements. As already mentioned, this can be accomplished with counting sort. The only thing left is to compute the equivalence classes c[], but as before this can be done by simply iterating over the sorted permutation p[] and comparing neighboring pairs. Here is the remaining implementation. We use temporary arrays pn[] and cn[] to store the permutation by the second elements and the new equivalent class indices.  vector<int> pn(n), cn(n); for (int h = 0; (1 << h) < n; ++h) { for (int i = 0; i < n; i++) { pn[i] = p[i] - (1 << h); if (pn[i] < 0) pn[i] += n; } fill(cnt.begin(), cnt.begin() + classes, 0); for (int i = 0; i < n; i++) cnt[c[pn[i]]]++; for (int i = 1; i < classes; i++) cnt[i] += cnt[i-1]; for (int i = n-1; i >= 0; i--) p[--cnt[c[pn[i]]]] = pn[i]; cn[p] = 0; classes = 1; for (int i = 1; i < n; i++) { pair<int, int> cur = {c[p[i]], c[(p[i] + (1 << h)) % n]}; pair<int, int> prev = {c[p[i-1]], c[(p[i-1] + (1 << h)) % n]}; if (cur != prev) ++classes; cn[p[i]] = classes - 1; } c.swap(cn); } return p; }  The algorithm requires O(n \log n) time and O(n) memory. However if we take the size of the alphabet k into account, then it uses O((n + k) \log n) time and O(n + k) memory. For simplicity we used the complete ASCII range as alphabet. If we know that the string only contains a subset of characters, e.g. only lowercase letters, then this implementation can obviously be optimized. However not by much, since the alphabet size only appears with a factor of O(\log n) in the complexity. Also note, that this algorithm only sorts the cycle shifts. As mentioned at the beginning of this section we can generate the sorted order of the suffixes by appending a character that is smaller than all other characters of the string, and sorting this resulting string by cycle shifts, e.g. by sorting the cycle shifts of s + \$$. This will obviously give the suffix array of $s$, however prepended with $|s|$.

vector<int> suffix_array_construction(string s) {