eXtremeDB Indexes

eXtremeDB supports hash indexes, and tree indexes of the following types: tree (b-tree), patricia (Patricia Trie), rtree (R-tree spatial), kdtree (kd-tree multi-dimensional), trigram (text substring) and user-defined (“custom”). Hash and unique tree indexes can also be used to uniquely identify objects in the database. Also, specialized hash indexes of type oid and autoid are provided as unique object identifiers. Unlike oids and autoids, however, hash and tree index values are only required to be unique within a class; the nonunique declaration can be used to allow duplicate values for hash or tree indexes.

(See the Indexes and Cursors page for detailed descriptions and implementation details for each of the index types described below.)

Tree Index

An eXtremeDB tree index is a tree-like structure in the application’s memory that contains a collection of keys. A key can be simple (based on a single field) or compound (containing more than one field). The tree-index stores the keys in sorted order. In order to support sorting, the tree-index must have a compare function. Regardless of the key type (simple or compound), the compare function must be able to compare two keys and determine whether the first key is less than (compare returns -1), equal to (0) or greater than (1) the second key. The tree-index algorithm sorts the keys based on this comparison.

Search Methods

Search APIs are generated by the mcocomp schema compiler to locate desired objects or groups of objects by unique identifier or by index. While exact match lookups by unique identifier ( unique tree or hash index, oid and autoid ) using the generated _find() functions are extremely efficient for locating individual objects, the rich set of specialized indexes mentioned above employ cursors to navigate a group of objects as an ordered or unordered result set.

Find

By definition, an exact match lookup on a unique index returns exactly one result or zero if no match is found. So for C and C++ applications, the mcocomp schema compiler generates a _find() function for each unique index. Java and C# applications use the Cursor.Find() method.

Search

A cursor is essentially an iterator over the collection of objects in a result set. For each index declared for a class in the schema definition, the DDL compiler generates functions to instantiate a cursor, position it based on some value(s), and obtain a handle to the database object from the cursor. List and nonunique hash-based cursors allow navigation in sequential order (first to last, or last to first), though the order of the sequence is not defined; it is simply a mechanism to iterate over the unordered list of objects of a class.

For C or C++ applications, mcocomp generates a _search() function for each nonunique index in the schema. Java and C# applications use the Cursor.Search() method. The search function is used whenever it is necessary to establish:

  • a starting position in a sorted list with a known starting value and optionally retrieve subsequent results in ascending or descending sorted order.
  • a starting position in a sorted list when only part of the starting value is known, find the closest match, and optionally retrieve subsequent results in ascending or descending sorted order.
  • a starting position as above, iterate over the sorted list in ascending or descending sorted order until a lower/upper bound is reached, using the _compare() method to determine when the range limit is reached.
  • a starting position with a known starting value, iterate over the list to retrieve each duplicate, using the _compare() method to determine when the last duplicate was retrieved.

Pattern Search

eXtremeDB also supports "wild card" pattern matching ability. This is the capability to search tree index entries matching patterns specified with wild card characters for single character and multiple character matches. By default, the question mark “?” will match any single character in the specified position within the pattern, and the asterisk “*” will match any combination of characters (including no characters) in that position. If a match on the characters “?” or “*” is desired, the wild card characters themselves can be modified by specifying different characters in the pattern search policy (see below).

For example, “G*e*” would return “Graves” and “Gorine”, while “Gr?ve*” would match “Graves”, “Grove”, “Grover” and so on... In this example, ‘*’ matches zero, one, or more characters, while ‘?’ matches exactly one character. Further, the pattern “G*E*” would match all uppercase entries like “GRAVES”, “GORINE”. However, because the standard compare functions used to match index values with search keys use case-sensitive compare functions, the case specified in the search pattern will affect the search results.

Key-Value-Inclusive Indexes

eXtremeDB allows a “key-value-inclusive” or “cache-oblivious” implementation for in-memory (transient) database classes. This expedites access for certain access patterns. In particular experiments with very large B-Tree indexes show significant performance improvements. Also, the trigram index (described below) is implemented via the “key-value-inclusive” tree because it is much more efficient to keep 3 extra bytes (trigrams) on the index page than to simply store an 8-byte object address which requires further access steps.

Covering Indexes

If all the fields requested in an index search are available in the index tree node, then the runtime doesn't have to look up the object covered by the index again. This can significantly increase the performance of searches. Since all the requested fields are available within the index, the index is called a “covering index”. In addition, a “covering” index can be also declared inclusive to leverage CPU cache usage.

Patricia Trie Index

The eXtremeDB Patricia index uses a Patricia trie structure that is based on a radix tree using a radix of two. The term "trie" is derived from the word "retrieval". Patricia stands for "Practical Algorithm to Retrieve Information Coded as Alphanumeric”. This algorithm is optimized to perform quick IP address prefix matching for IP subnet, network or routing table lookups. So Patricia Trie indexes are particularly useful for network and telecommunications applications.

A Patricia index can be declared over scalar and boolean data types as well as arrays and vectors of those types. In fact, the boolean data type was introduced to facilitate Patricia index implementation where bit arrays are used to store IP addresses.

RTree Index

RTree indexes are commonly used to speed spatial searches, for example, find the rectangle that contains this point, find all rectangles that overlap this rectangle, or find all rectangles in the neighborhood of this point.

All manner of shapes can be stored and searched with the rtree index. For example, a point is represented as a rectangle with width and height = 1 and a line that has starting and ending coordinates of 15, 844 and 0, 3647 is stored as rectangle with its upper left corner at 15, 844 and its lower right corner at 0, 3647.

To determine if two lines intersect, or if a point is within a given area (described by a circle, rectangle, etc.), an rtree search is performed to find all overlapping rectangles.

Kd-Tree Index

A kdtree is a data structure for organizing points in a k-dimensional space. kdtrees are a useful data structure for several applications, such as lookups that involve a multidimensional search key. The kdtree is a binary tree in which every node is a k-dimensional point. Every non-leaf node generates a splitting hyperplane that divides the space into two subspaces. Points left of the hyperplane represent the left sub-tree of that node and the points right of the hyperplane represent the right sub-tree. The hyperplane direction is chosen in the following way: every node split to sub-trees is associated with one of the k-dimensions, such that the hyperplane is perpendicular to that dimension vector.

The kdtree uses a Query-By-Example approach to locate objects that match a given search condition. The application creates pattern object(s) and assigns values to the fields that are included in the search criteria. The kdtree supports simple exact matches as well as range lookups. In the latter case, two pattern objects will be specified: one for the lower and one for the upper bounds of the search condition. If a field value is defined only for one boundary, it is considered an open interval that corresponds to a greater-than-or-equal-to or less-than-or-equal-to search condition.

Trigram Index

Trigram search is a method of searching for text when the exact syntax or spelling of the target object is not precisely known. It finds objects which match the maximum number of three-character strings in the entered search terms, i.e. near matches. A classical B-Tree index allows exact match queries (eg. x='qqq'), range queries (eg. “x >= 10 and x <= 100”) and prefix match queries (eg. “x >= 'abc’).

However, for more complex regular expressions like “%xyz%” a B-Tree index would require a sequential search, which is significantly slower. In this case a trigram index can be ideal.

The idea of trigram index is rather simple. Assume that we have a string key 'qwerty'. The string is split into trigrams: sequences of three subsequent characters: "qwe", "wer", "ert", "rty". These trigrams are stored in a normal B-Tree index. Now consider a query looking for the substring "wert". The query condition is also split into trigrams: "wer" and "ert". The search then is looking for these trigrams in the index. The result of the lookup is two lists of objects -- so called inverse lists. The next step is to join these lists: i.e. to find objects which are present in both lists.

OID

An eXtremeDB oid is a unique object identifier which is implemented internally as a unique hash index. Generally, an oid is a user-defined structure which has fixed-length fields corresponding to identifying characteristics of a principle database class. C and C++ applications can use oids to quickly retrieve the object it identifies.

Autoid

An autoid is similar to oid in that it is a unique object identifier implemented internally as a unique hash index. However it differs in that its value is determined by the eXtremeDB runtime, and the value is unique for the class, not the entire database. Autoids are 8-byte signed integer values which means that the value can be incremented every microsecond for nearly 300,000 years without overflowing.

LIST

A sequential navigation method is available for unordered lists of objects by specifying a list index. This allows applications to iterate over the objects of a class without regard to any particular order.