Trigram Indexes

As explained in the Indexes page, the eXtremeDB , Trigram (trigram) indexes are ideal for text searches when the exact spelling of the target object is not precisely known. It finds objects which match the maximum number of three-character strings in the entered search terms, i.e., near matches. A classical tree index allows exact match queries (eg. x='qqq'), range queries (eg. x >= 10 and x <= 100) and prefix match queries (eg. x >= 'abc’).

However, for more complex regular expressions like “%xyz%” a tree index would require a sequential search, which is significantly slower. In this case a trigram index can be optimal.

The idea of trigram index is rather simple. Assume that we have a string key 'qwerty'. The string is split into trigrams: sequences of three subsequent characters: "qwe", "wer", "ert", "rty". These trigrams are stored in a normal tree index. Now consider a query looking for the substring "wert". The query condition is also split into trigrams: "wer" and "ert". The search then is looking for these trigrams in the index. The result of the lookup is two lists of objects -- so called inverse lists. The next step is to join these lists: i.e. to find objects which are present in both lists.

There is no guarantee that the objects that were found through this algorithm actually contain the "wert" substring yet. For example, it could be "werrert". So the next step in the lookup process is to recheck that the object really matches the given regular expression.

The main limitation of the trigram index is clear from the description above: it cannot be used for substrings shorter than 3. For example the pattern '%01%' requires a sequential search even if the trigram index is present.

The trigram index can be used to locate objects containing a specified substring and regular expression matching; similar to the LIKE operator in SQL. In the case of the LIKE operator the runtime just extracts the longest substring from the pattern excluding wildcards and performs a search for this substring, and then performs a regular expression match for all selected objects. It is also possible to use the trigram index for an exact match lookup, especially if there are no other indexes declared for the data set.