C and C++ applications include support for custom collations. In some character sets, multiple-character combinations like “AE” (“labor lapsus” in the Danish and Norwegian alphabets or “ash” in Old-English) and “OE” (an alternate form of “Ö” or “O-umlaut” in the German alphabet) are treated as single letters. This poses a collation problem when strings containing these character combinations need to be ordered. Clearly, a collation algorithm to sort strings of these character sets must compare more than a single character at a time.
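The need to look at more than one character at a time can be illustrated with a small, self-contained sketch. It is not part of eXtremeDB: the names `next_weight` and `digraph_compare` are hypothetical, the weights (1 to 26 for A to Z, 27 for the pair "AE") are an illustrative assumption rather than an official Danish or Norwegian ordering, and the code assumes an ASCII-compatible character set and ignores case.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical weights: 1..26 for A..Z, 27 for the digraph "AE".
 * Placing "AE" after Z is an assumption made for illustration only. */
enum { WEIGHT_AE = 27 };

/* Read one collation element starting at s[*pos], advance *pos past it,
 * and return its weight. Must only be called while *pos < len.
 * Non-letters get weight 0 in this sketch. */
static int next_weight(const char *s, size_t len, size_t *pos)
{
    int c = toupper((unsigned char)s[*pos]);

    /* Look ahead one character: "AE" counts as a single element. */
    if (c == 'A' && *pos + 1 < len &&
        toupper((unsigned char)s[*pos + 1]) == 'E') {
        *pos += 2;
        return WEIGHT_AE;
    }
    *pos += 1;
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 1 : 0;
}

/* Compare two strings element by element; returns <0, 0 or >0. */
int digraph_compare(const char *a, size_t la, const char *b, size_t lb)
{
    size_t i = 0, j = 0;

    while (i < la && j < lb) {
        int wa = next_weight(a, la, &i);
        int wb = next_weight(b, lb, &j);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    /* All shared elements are equal: the shorter string sorts first. */
    if (i < la) return 1;
    if (j < lb) return -1;
    return 0;
}
```

With these assumed weights, `digraph_compare("AEble", 5, "Zebra", 5)` is positive because the element "AE" outranks "Z", whereas a byte-by-byte comparison would place "AEble" first.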
“Capitalization” is also a collation issue. In some cases strings are compared in a “case sensitive” manner where, for example, the letters “a-z” follow the uppercase letter “Z”. More often strings are compared in a “case insensitive” manner where “a” follows “A”, “b” follows “B”, etc. This can be accomplished by treating the uppercase and lowercase versions of each letter as equivalent: either by converting upper to lower case (or vice versa) before comparing strings, or by assigning both versions the same ordinal in a case-insensitive character set. (See page Case Sensitivity for an alternative method.)
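The "same ordinal" approach can be sketched with a lookup table. This is a minimal illustration, not eXtremeDB code: `build_ordinals` and `ordinal_compare` are hypothetical names, and the table assumes an ASCII-compatible single-byte character set.

```c
#include <stddef.h>

/* One ordinal per byte value; filled in by build_ordinals(). */
static unsigned char ordinal[256];

/* Give 'a'..'z' the same ordinal as 'A'..'Z'; every other byte keeps
 * its own value. Assumes an ASCII-compatible character set. */
void build_ordinals(void)
{
    int c;

    for (c = 0; c < 256; c++)
        ordinal[c] = (unsigned char)c;
    for (c = 'a'; c <= 'z'; c++)
        ordinal[c] = (unsigned char)(c - 'a' + 'A');
}

/* Compare two strings by ordinal; returns <0, 0 or >0. */
int ordinal_compare(const char *a, size_t la, const char *b, size_t lb)
{
    size_t n = la < lb ? la : lb;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char oa = ordinal[(unsigned char)a[i]];
        unsigned char ob = ordinal[(unsigned char)b[i]];
        if (oa != ob)
            return oa < ob ? -1 : 1;
    }
    /* Common prefix is equal: the shorter string sorts first. */
    if (la == lb)
        return 0;
    return la < lb ? -1 : 1;
}
```

After calling `build_ordinals()` once, `ordinal_compare("apple", 5, "Zebra", 5)` is negative, so "apple" sorts before "Zebra", while a case-sensitive byte comparison would order them the other way round. Changing the table is enough to change the ordering; the compare loop stays the same.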
eXtremeDB can compare strings using a variety of collations, and strings and character arrays with different character sets or collations can be mixed in the same database. Character sets and collations are specified at the application level.
The eXtremeDB DDL provides a collation declaration for indexes on string-type fields as follows:

    [unique] tree<string_field_name_1 [collate C1] [, string_field_name_2 [collate C2]], …> index_name;
    hash<string_field_name_1 [collate C1] [, string_field_name_2 [collate C2]], …> index_name;

If a collation is not explicitly specified for an index component, the default collation is used. Based on the DDL declaration, for each collation the DDL compiler will generate the following compare function placeholders for tree indexes and/or hash indexes using this collation:

    int2 collation_name_collation_compare ( mco_collate_h c1, uint2 len1, mco_collate_h c2, uint2 len2 )
    {
        /* TODO: add your implementation here */
        return 0;
    }

    mco_hash_counter_t collation_name_collation_hash ( mco_collate_h c, uint2 len )
    {
        /* TODO: add your implementation here */
        return (mco_hash_counter_t)0;
    }

For each defined collation, a separate API is generated. The actual implementation of the compare functions, including the definition of character sets, is the application’s responsibility. To facilitate compare function implementation, eXtremeDB provides the following set of functions:

    mco_collate_get_char(mco_collate_h s, char *buf, uint2 len);
    mco_collate_get_nchar(mco_collate_h s, nchar_t *buf, uint2 len);
    mco_collate_get_wchar(mco_collate_h s, wchar_t *buf, uint2 len);
    mco_collate_get_char_range(mco_collate_h s, char *buf, uint2 from, uint2 len);
    mco_collate_get_nchar_range(mco_collate_h s, nchar_t *buf, uint2 from, uint2 len);
    mco_collate_get_wchar_range(mco_collate_h s, wchar_t *buf, uint2 from, uint2 len);

Note that three different versions of the mco_collate_get_*char() and mco_collate_get_*char_range() functions are required because, in order to use the same collation, the arguments must be of the corresponding type for the field being accessed. In other words: for fields of type string and char<n>, the *char version (for example mco_collate_get_char()) will be called; for fields of type nstring and nchar<n>, the *nchar version; and for fields of type wstring and wchar<n>, the *wchar version.

The C/C++ application registers user-defined collations via the following function:

    mco_db_register_collations(dbname, mydb_get_collations());

This function must be called prior to mco_db_connect() or mco_db_connect_ctx() and must be called once for each process that accesses a shared memory database. The second argument, mydb_get_collations(), is a database-specific function similar to mydb_get_dictionary() that is generated by the DDL compiler in the files mydb.h and mydb.c. In addition, the DDL compiler generates the collation compare function stubs in mydb_coll.c. (Note that if the file mydb_coll.c already exists, the DDL compiler will display a warning and generate mydb_coll.c.new instead.)
Note: What is the difference between user-defined indexes and collations?

User-defined indexes are implemented through user-defined compare and hash functions that are passed objects and can compare key fields any way they like. However, collations can only be defined for character fields (string, char<>, nstring, nchar<>, wstring and wchar<>) and the key segments are compared in the sequence defined in the schema.

But collations are much simpler to implement. Whereas user-defined indexes require object-to-object and object-to-key function implementations for tree indexes, and hash_object and hash_external_key function implementations for hash indexes, collations require a single compare function for each collation. Furthermore, the same collation can be used in different classes and indexes. For example, for the case-insensitive collation it is necessary to implement a single function.
To summarize: collations are a better choice when the application needs to simply change how strings are sorted (compared) and user-defined indexes are appropriate for more complex user-defined data sort algorithms.
Example 1

File schema1.mco:

    declare database mydb;

    class A
    {
        string name;
        tree <name collate Cname> tname;
    };

The keyword collate declares that the tree index tname is to be generated on string field name, using collation Cname. This DDL instructs the database runtime to use a custom rule named Cname to compare the string field name. Note that the same collation (rule) can be used multiple times in the same index, in different indexes within the same class, or in different classes.

Example 2
File schema2.mco:

    declare database mydb;

    class A
    {
        string s;
        char<20> c;
        tree <s collate C1> sidx;
        hash <c collate C1> cidx[1000]; /* CORRECT: string and char<20> can be used with the same collation C1 */
    };

    class B
    {
        string s;
        nchar<20> nc;
        tree<s collate C1> sidx;
        // tree<nc collate C1> ncidx; /* INCORRECT: string and nchar<N> can’t be used with the same collation */
        tree<nc collate C2> ncidx2; /* CORRECT: different collation, C2 */
    };

Note that in class A the same collation (C1) can be used in a tree and a hash index, and in class B a new collation (C2) must be defined because its base field nc is of type nchar. To use the collation C1 in the tree indexes, the application must implement a compare function with the following signature:

    typedef int2 (*mco_compare_collation_f) ( mco_collate_h c1, uint2 len1, mco_collate_h c2, uint2 len2 );

The parameters are the collation descriptors c1 and c2 (representing the strings) and their lengths len1 and len2 (in number of symbols). The compare function must return an integer value indicating how the strings compare: negative if c1 < c2, zero if c1 == c2 and positive if c1 > c2. This function is called by the runtime to compare field values in two objects as well as to compare a field value with an external key value.

If a collation is used in a hash index, as is C1 in class A, the application must implement a hash function with the following signature:

    typedef uint4 (*mco_hash_collation_f) ( mco_collate_h c, uint2 len );

The parameters are a descriptor c (representing the string) and its length len (in number of symbols). The function must return an integer hash code for the string. (Note that if the compare function returns zero for two strings X and Y, i.e. X is equal to Y, the hash function must generate the same hash code for X and Y.)

For the sample schema2.mco, the DDL compiler generates these function stubs in mydb_coll.c:

    /* collation compare function */
    int2 C1_collation_compare ( mco_collate_h c1, uint2 len1, mco_collate_h c2, uint2 len2 )
    {
        /* TODO: add your implementation here */
        return 0;
    }

    /* collation hash function */
    uint4 C1_collation_hash ( mco_collate_h c, uint2 len )
    {
        /* TODO: add your implementation here */
        return 0;
    }

    /* collation compare function */
    int2 C2_collation_compare ( mco_collate_h c1, uint2 len1, mco_collate_h c2, uint2 len2 )
    {
        /* TODO: add your implementation here */
        return 0;
    }

The DDL compiler also generates, in mydb.h and mydb.c, the function that applications use to register the specified collations with the eXtremeDB database runtime:

    mco_collation_funcs_h mydb_get_collations(void);

Example 3
C/C++ example using a “case-insensitive” collation tree index:

File schema3.mco:

    declare database colldb;

    class Record
    {
        string name;
        unsigned<4> value;
        unique tree <name> tstd;
        unique tree <name collate C1> tcoll;
    };

Application code snippets:

    char * fruits[] = {
        "banana", "PEAR", "plum", "Peach", "apricot", "Kiwi",
        "QUINCE", "pineapple", "Lemon", "orange", "apple", "pawpaw",
        "Fig", "mango", "MANDARIN", "Persimmon", "Grapefruit", 0
    };

    /* collation compare function */
    int2 C1_collation_compare ( mco_collate_h c1, uint2 len1, mco_collate_h c2, uint2 len2 )
    {
        char buf1[16], buf2[16];

        /* copy the characters of both strings into local buffers */
        mco_collate_get_char(c1, buf1, sizeof(buf1));
        mco_collate_get_char(c2, buf2, sizeof(buf2));

        /* perform case-insensitive compare; stricmp() is not standard C
           (POSIX offers strcasecmp(), MSVC offers _stricmp()) */
        return stricmp(buf1, buf2);
    }

    int main(void)
    {
        MCO_RET rc;
        mco_db_h db = 0;
        mco_trans_h t;
        mco_cursor_t c;
        Record rec;      /* object handle filled by Record_from_cursor() */
        uint2 len;
        char buf[16];
        ...
        /* open the database */

        /* register the custom compare & hash functions */
        mco_db_register_collations(db_name, colldb_get_collations());

        /* connect to database */
        rc = mco_db_connect(db_name, &db);
        if ( MCO_S_OK == rc ) {
            /* fill database with records setting field name to fruit names */

            rc = mco_trans_start(db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND, &t);
            if (rc == MCO_S_OK) {
                /* using custom collate tree index iterate through the cursor */
                rc = Record_tcoll_index_cursor(t, &c);
                if (rc == MCO_S_OK) {
                    for (rc = mco_cursor_first(t, &c);
                         MCO_S_OK == rc;
                         rc = mco_cursor_next(t, &c)) {
                        Record_from_cursor(t, &c, &rec);
                        Record_name_get(&rec, buf, sizeof(buf), &len);
                        printf("\n\t%-15s", buf);
                    }
                }
                /* close the transaction even if the cursor could not be opened */
                rc = mco_trans_commit(t);
            }
            ...
        }
    }

Note that the only additional step the main application needs to perform in order to implement a specialized string collation is to register the collation prior to connecting to the database. The sorting logic is handled by the collation compare function. In this case the compare logic simply returns the value returned by the case-insensitive C runtime function stricmp().
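
In Example 3 the collation C1 is used only in a tree index. If it were also used in a hash index, the hash function would have to return the same code for any two strings that the compare function treats as equal. The following self-contained sketch shows one way to keep the two consistent by folding case in both. The names `ci_compare` and `ci_hash` are hypothetical, the functions work on plain character buffers (a real collation function would first copy the characters out of the `mco_collate_h` descriptor with mco_collate_get_char()), and FNV-1a is simply one convenient hash chosen for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Portable case-insensitive compare over explicit lengths, avoiding the
 * non-standard stricmp(). Returns <0, 0 or >0. */
int ci_compare(const char *a, size_t la, const char *b, size_t lb)
{
    size_t n = la < lb ? la : lb;
    size_t i;

    for (i = 0; i < n; i++) {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (la == lb)
        return 0;
    return la < lb ? -1 : 1;
}

/* 32-bit FNV-1a over the lower-cased bytes. Because it folds case the
 * same way ci_compare() does, strings that compare equal hash equal. */
uint32_t ci_hash(const char *s, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV-1a offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (uint32_t)tolower((unsigned char)s[i]);
        h *= 16777619u;                /* FNV-1a prime */
    }
    return h;
}
```

For instance, `ci_compare("PEAR", 4, "pear", 4)` returns 0 and `ci_hash("PEAR", 4)` equals `ci_hash("pear", 4)`. Hashing the raw bytes instead would break the requirement, because "PEAR" and "pear" would compare equal yet land in different hash buckets.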