Custom Collations

C and C++ applications include support for custom collations. In some alphabets, multiple-character combinations like “AE” (“labor lapsus” in the Danish and Norwegian alphabets or “ash” in Old English) and “OE” (an alternate form of “Ö” or “O-umlaut” in the German alphabet) are treated as single letters. This poses a collation problem when strings containing these character combinations need to be ordered: a collation algorithm that sorts such strings must compare more than a single character at a time.
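The idea of comparing more than one character at a time can be sketched in plain C. This is a minimal illustration and not eXtremeDB code: the rule that “AE” forms one letter sorting after every single character, and the names next_weight() and digraph_compare(), are assumptions made for the example.

```c
#include <ctype.h>

/* Hypothetical rule for illustration: "AE" (in any case) is one letter that
 * sorts after every single character, and therefore after 'Z'. All other
 * characters sort by their uppercase character code. */
enum { WEIGHT_AE = 256 };

/* Return the weight of the collation element at *p and advance *p past it.
 * Returns 0 at the end of the string. */
static int next_weight(const char **p)
{
    const unsigned char *s = (const unsigned char *)*p;

    if (s[0] == '\0')
        return 0;
    if (toupper(s[0]) == 'A' && toupper(s[1]) == 'E') {
        *p += 2;                /* consume both characters as one element */
        return WEIGHT_AE;
    }
    *p += 1;
    return toupper(s[0]);
}

/* Compare two NUL-terminated strings element by element.
 * Returns -1, 0 or 1. */
int digraph_compare(const char *a, const char *b)
{
    for (;;) {
        int wa = next_weight(&a);
        int wb = next_weight(&b);

        if (wa != wb)
            return (wa < wb) ? -1 : 1;
        if (wa == 0)
            return 0;           /* both strings ended together */
    }
}
```

Under this rule "AZ" sorts before "AE", whereas a character-by-character comparison would place "AE" first.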

“Capitalization” is also a collation issue. In some cases strings will be compared in a “case sensitive” manner where for example the letters “a-z” will follow the (uppercase) letter “Z”, while more often strings will be compared in a “case insensitive” manner where “a” follows “A”, “b” follows “B”, etc. This can be easily accomplished by treating uppercase and lowercase versions of each letter as equivalent, by converting upper to lower or vice versa before comparing strings, or by assigning them the same ordinal in a case-insensitive character set. (See page Case Sensitivity for an alternative method.)
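The two orderings can be contrasted with a short plain C sketch. It is independent of eXtremeDB; the function names are hypothetical, and case folding relies on the standard tolower() in the current C locale.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison: fold each character to lowercase first. */
int compare_fold_case(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == '\0')
            return 0;
    }
}

/* qsort() adapters: each array element is a pointer to a string. */
static int by_code(const void *x, const void *y)
{
    return strcmp(*(const char *const *)x, *(const char *const *)y);
}

static int by_folded(const void *x, const void *y)
{
    return compare_fold_case(*(const char *const *)x,
                             *(const char *const *)y);
}

/* Case-sensitive order: in ASCII, "Zebra" sorts before "apple". */
void sort_case_sensitive(const char **v, size_t n)
{
    qsort(v, n, sizeof *v, by_code);
}

/* Case-insensitive order: "apple" sorts before "Zebra". */
void sort_case_insensitive(const char **v, size_t n)
{
    qsort(v, n, sizeof *v, by_folded);
}
```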

eXtremeDB allows strings to be compared using a variety of collations, and allows strings and character arrays with different character sets or collations to be mixed in the same database; character sets and collations are specified at the application level.

The eXtremeDB DDL provides a collation declaration for indexes on string-type fields as follows:

 
    [unique] tree<string_field_name_1 [collate C1]
            [, string_field_name_2 [collate C2]], …> index_name;
             
    hash<string_field_name_1 [collate C1]
            [, string_field_name_2 [collate C2]], …> index_name;
             

If a collation is not explicitly specified for an index component, the default collation is used. Based on the DDL declaration, for each collation the DDL compiler will generate the following compare function placeholders for tree indexes and/or hash indexes using this collation:

 
    int2  collation_name_collation_compare ( mco_collate_h c1, uint2 len1,
                    mco_collate_h c2, uint2 len2 )
    {
        /* TODO: add your implementation here */
        return 0;
    }
 
    mco_hash_counter_t collation_name_collation_hash (mco_collate_h c, uint2 len)
    {
        /* TODO: add your implementation here */
        return (mco_hash_counter_t)0;
    }
     

For each defined collation, a separate API is generated. The actual implementation of the compare functions, including the definition of character sets, is the application’s responsibility. To facilitate compare function implementation, eXtremeDB provides the following set of functions:

 
    mco_collate_get_char(mco_collate_h s, char *buf, uint2 len);
    mco_collate_get_nchar(mco_collate_h s, nchar_t *buf, uint2 len);
    mco_collate_get_wchar(mco_collate_h s, wchar_t *buf, uint2 len);
    mco_collate_get_char_range(mco_collate_h s, char *buf,
                uint2 from, uint2 len);
    mco_collate_get_nchar_range(mco_collate_h s, nchar_t *buf,
                uint2 from, uint2 len);
    mco_collate_get_wchar_range(mco_collate_h s, wchar_t *buf,
                uint2 from, uint2 len);
                 

Note that three versions of the mco_collate_get_*char() and mco_collate_get_*char_range() functions are provided because the buffer argument must match the type of the field being accessed. In other words: for fields of type string and char<n>, the *char version (for example mco_collate_get_char()) is called; for fields of type nstring and nchar<n>, the *nchar version; and for fields of type wstring and wchar<n>, the *wchar version.
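The character type also determines which C library routines the comparison itself can use. As a hedged sketch, assume the wide characters have already been copied into wchar_t buffers (for instance with the *wchar function above). A case-insensitive comparison could then look like the following; wide_compare_fold_case() is a hypothetical helper, not part of the eXtremeDB API.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Case-insensitive comparison of two wide-character buffers of known
 * length. Returns -1, 0 or 1. */
int wide_compare_fold_case(const wchar_t *s1, size_t len1,
                           const wchar_t *s2, size_t len2)
{
    size_t n = (len1 < len2) ? len1 : len2;
    size_t i;

    for (i = 0; i < n; i++) {
        /* towlower() is the wide-character counterpart of tolower() */
        wint_t a = towlower((wint_t)s1[i]);
        wint_t b = towlower((wint_t)s2[i]);

        if (a != b)
            return (a < b) ? -1 : 1;
    }
    /* The common prefix is equal: the shorter buffer sorts first. */
    if (len1 == len2)
        return 0;
    return (len1 < len2) ? -1 : 1;
}
```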

The C/C++ application registers user-defined collations via the following function:

 
    mco_db_register_collations(dbname, mydb_get_collations());
     

This function must be called prior to mco_db_connect() or mco_db_connect_ctx() and must be called once for each process that accesses a shared memory database. The second argument mydb_get_collations() is a database specific function similar to mydb_get_dictionary() that is generated by the DDL compiler in the files mydb.h and mydb.c. In addition, the DDL compiler generates the collation compare function stubs in mydb_coll.c. (Note that if the file mydb_coll.c already exists, the DDL compiler will display a warning and generate mydb_coll.c.new instead.)

Note: What is the difference between user-defined indexes and collations?

User-defined indexes are implemented through user-defined compare and hash functions that are passed objects and can compare key fields any way they like. However, collations can only be defined for character fields (string, char<>, nstring, nchar<>, wstring and wchar<>) and the key segments are compared in the sequence defined in the schema.

But collations are much simpler to implement. Whereas user-defined indexes require object-to-object and object-to-key function implementations for tree indexes, and hash_object and hash_external_key function implementations for hash indexes, collations require a single compare function for each collation. Furthermore, the same collation can be used in different classes and indexes. For example, a case-insensitive collation requires only a single function, no matter how many indexes use it.

To summarize: collations are a better choice when the application needs to simply change how strings are sorted (compared) and user-defined indexes are appropriate for more complex user-defined data sort algorithms.

Collation examples

Example 1

File schema1.mco:

     
    declare database mydb;
     
    class A 
    {
        string   name;
 
        tree <name collate Cname> tname;
    };
     

The keyword collate declares that the tree index tname is to be generated on the string field name, using collation Cname. This DDL instructs the database runtime to use a custom rule named Cname to compare values of the field name. Note that the same collation (rule) can be used multiple times in the same index, in different indexes within the same class, or in different classes.

Example 2

File schema2.mco:

 
    declare database mydb;
     
    class A 
    {
        string   s;
        char<20> c;
 
        tree <s collate C1> sidx;
        hash <c collate C1> cidx[1000]; /* CORRECT: string and char<20> can be
                                    used with the same collation C1 */
    };
 
    class B 
    {
        string    s;
        nchar<20> nc;
 
        tree<s collate C1> sidx;
        /* tree<nc collate C1> ncidx;     INCORRECT: string and nchar<N> can't
                                be used with the same collation */
        tree<nc collate C2> ncidx2;    /* CORRECT: different collation, C2 */
    };
     

Note that in class A the same collation (C1) can be used in a tree and hash index, and in class B a new collation (C2) must be defined because its base field nc is of type nchar. To use the collation C1 in the tree indexes, the application must implement a compare function with the following signature:

     
    typedef int2 (*mco_compare_collation_f) ( mco_collate_h c1, uint2 len1, 
                mco_collate_h c2, uint2 len2);
                 

The parameters are the collation descriptors c1 and c2 (representing the strings) and their lengths (number of symbols) len1 and len2. The compare function must return an integer value indicating how the strings compare: negative if c1 < c2, zero if c1 == c2 and positive if c1 > c2. This function is called by the runtime to compare field values in two objects as well as to compare a field value with an external key value.
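The return-value contract can be illustrated on plain character buffers. This is a sketch under stated assumptions: the int2_t and uint2_t typedefs stand in for the 2-byte integer types in the signature above, and the characters are assumed to have already been copied out of the descriptors. The name compare_with_lengths() is hypothetical.

```c
#include <string.h>

/* Stand-ins for the 2-byte integer types used in the signature above
 * (assumption: they are 16-bit short integers). */
typedef short          int2_t;
typedef unsigned short uint2_t;

/* Compare s1[0..len1) with s2[0..len2) by character code.
 * Returns a negative value if s1 < s2, zero if they are equal and a
 * positive value if s1 > s2. */
int2_t compare_with_lengths(const char *s1, uint2_t len1,
                            const char *s2, uint2_t len2)
{
    uint2_t n = (len1 < len2) ? len1 : len2;
    int r = memcmp(s1, s2, n);

    /* Normalize to -1 or 1 so the result always fits in 2 bytes. */
    if (r != 0)
        return (int2_t)((r < 0) ? -1 : 1);

    /* The common prefix is equal: the shorter string sorts first. */
    if (len1 == len2)
        return 0;
    return (int2_t)((len1 < len2) ? -1 : 1);
}
```

Using the explicit lengths instead of relying on a terminating NUL means the buffers do not have to be NUL-terminated.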

If a collation is used in a hash index, as is C1 in class A, the application must implement a hash function with the following signature:

 
    typedef uint4 (*mco_hash_collation_f) ( mco_collate_h c, uint2 len);
     

The parameters are a descriptor c (as a string) and its length (number of symbols) len. The function must return an integer hash code for the string. (Note that if the compare function returns zero for two strings X and Y, i.e. X is equal to Y, the hash function must generate the same hash code for X and Y.)
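The consistency requirement between the compare and hash functions can be sketched as follows. For a case-insensitive collation, the hash must ignore case as well, so that "Kiwi" and "KIWI" produce the same code. The example uses the well-known FNV-1a hash on a plain character buffer; the function name hash_fold_case() and the choice of FNV-1a are assumptions for illustration, not something eXtremeDB prescribes.

```c
#include <ctype.h>
#include <stdint.h>

/* Case-insensitive 32-bit FNV-1a hash of s[0..len).
 * Every character is folded to lowercase before it is mixed in, so strings
 * that differ only in case hash to the same value. */
uint32_t hash_fold_case(const char *s, uint16_t len)
{
    uint32_t h = 2166136261u;       /* FNV-1a 32-bit offset basis */
    uint16_t i;

    for (i = 0; i < len; i++) {
        h ^= (uint32_t)tolower((unsigned char)s[i]);
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }
    return h;
}
```

The key design point is that the hash applies the same normalization as the compare function. Hashing the raw bytes would make equal keys land in different buckets.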

For the sample schema2.mco, the DDL compiler generates these compare and hash function stubs in mydb_coll.c:

 
    /* collation compare function */
    int2  C1_collation_compare ( mco_collate_h c1, uint2 len1,
                mco_collate_h c2, uint2 len2)
    {
        /* TODO: add your implementation here */
        return 0;
    }
 
    uint4 C1_collation_hash (mco_collate_h c, uint2 len)
    {
        /* TODO: add your implementation here */
        return 0;
    }
 
    /* collation compare function */
    int2  C2_collation_compare ( mco_collate_h c1, uint2 len1,
                mco_collate_h c2, uint2 len2)
    {
        /* TODO: add your implementation here */
        return 0;
    }
     

The DDL compiler also generates the function applications will use to register the specified collations with the eXtremeDB database runtime in mydb.h and mydb.c:

 
    mco_collation_funcs_h mydb_get_collations(void);
     

Example 3

C/C++ example using a “case-insensitive” collation tree index:

File schema3.mco:

 
    declare database colldb;
 
    class Record
    {
        string name;
        unsigned<4> value;
 
        unique tree <name> tstd;
        unique tree <name collate C1> tcoll;
    };
     

Application code snippets:

 
    char * fruits[] = {
        "banana", "PEAR", "plum", "Peach", "apricot", "Kiwi", 
        "QUINCE", "pineapple", "Lemon", "orange", "apple",
        "pawpaw", "Fig", "mango", "MANDARIN", "Persimmon", 
        "Grapefruit", 0
    };
     
    /* collation compare function */
    int2  C1_collation_compare ( mco_collate_h c1, uint2 len1,
                mco_collate_h c2, uint2 len2)
    {
        char buf1[16], buf2[16];
        mco_collate_get_char(c1, buf1, sizeof(buf1));
        mco_collate_get_char(c2, buf2, sizeof(buf2));
         
        // perform case-insensitive compare
        return stricmp(buf1, buf2);
    }
 
    int main(void)
    {
        MCO_RET rc;
        mco_db_h db = 0;
        mco_trans_h t;
        mco_cursor_t c;
        Record rec;     /* object handle filled in by Record_from_cursor() */
        uint2 len;
        char buf[16];
         
        ...
        /* open the database */
         
        /* register the custom compare & hash functions */
        mco_db_register_collations(db_name, colldb_get_collations());
 
        /* connect to database */
        rc = mco_db_connect(db_name, &db);
        if ( MCO_S_OK == rc ) 
        {
            /* fill database with records setting field name to fruit names */
             
            rc = mco_trans_start(db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND,&t);
            if (rc == MCO_S_OK) 
            {
                /* using custom collate tree index iterate through the cursor */
                rc = Record_tcoll_index_cursor(t, &c);
                if (rc == MCO_S_OK) 
                {
                    for (rc = mco_cursor_first(t, &c);
                            MCO_S_OK == rc;
                            rc = mco_cursor_next(t, &c))
                    {
                        Record_from_cursor(t, &c, &rec);
                        /* read the indexed field "name" declared in schema3.mco */
                        Record_name_get(&rec, buf, sizeof(buf), &len);
                        printf("\n\t%-15s", buf);
                    }
                }
                /* end the read-only transaction whether or not the cursor opened */
                rc = mco_trans_commit(t);
            }
        ...
        }
    }
     

Note that the only additional step the main application needs to perform in order to implement a specialized string collation is to register the collation prior to connecting to the database. The sorting logic is handled by the collation compare function, which in this case simply returns the result of the case-insensitive C runtime function stricmp(). (stricmp() is not part of standard C; POSIX systems provide the equivalent strcasecmp().)