Importing Data From External Files in SQL

As explained in the page xSQL Data Export and Import Operations, it can be useful to import external data from CSV files using the xSQL import command. Data can also be imported using the embedded eXtremeSQL APIs. The SQL statement syntax and implementation are described in the following sections.

Data file formats

eXtremeSQL allows importing data from external files in various formats. The first sample below shows a file with fixed-length records; the second shows a delimited (CSV) file preceded by header lines:

 
    N12182013 Record Count :    728040656
    040000459PA               000000000000000000000000000000000000R    PP000000000000001122 C
    070109890TA               000003600000000001000000000000000000R    TT000000000007250312 C
    080000362KA               000005480000000002000006000000000002R    KK000000000008758212 C
    ...
     
 
    20140109-04:49:26.003353392 [INFO] Build version 1.7.0.5
    TradeDate,TradeTime,SeqNum,SecurityID,GroupCode,ProductCode,InstrumentName,EntryPrice,EntryQuantity,AggressorSide,TickDirection
    20131215,19:29:58.432385699,501109,4143,CT,RBT,RBTF4,0,10,2,
    20131216,00:54:49.081829959,864415,4143,CT,RBT,RBTF4,1,2,1,0
    ...

Implementation

The data is loaded into the database through the use of a "foreign table" concept: the data in the file is viewed as a SQL data set and can be used in SQL statements. Optional data conversion, such as remapping columns, is supported, and the database schema does not have to reflect the CSV data layout precisely. eXtremeSQL allows specifying the database layout at runtime (see Dynamic DDL). The ability to load data from CSV files and convert it into various internal layouts is especially useful when importing historical market data from different market data providers (such as NYSE TAQ, Reuters, etc.).

The eXtremeSQL engine represents the CSV file as an external virtual (or foreign) table. The database table that is populated from the content of the file is referred to as the target table. The existing database table whose layout defines the layout of the imported data is referred to as the pattern table. It is important to note that the foreign table is not materialized; it merely creates a conveyor through which the data flows from the input text file into the database, formatted at runtime as desired by the application. The foreign table can be defined implicitly through a SQL statement by providing the input filename as a table name and the target table layout via an existing pattern table, for example:

 
    insert into TargetTable select * from foreign table (path='/some_directory/file.csv', delimiter='|', skip=2, nostop=1) 
    as PatternTable;
     

The foreign table parameters are as follows:

path       Input file path
delimiter  The field delimiter character, used instead of the default comma.*
skip       Skip N records at the beginning of the file (default=0)
nostop     Ignore errors, including embedded '\n' characters (default=0)
nullstr    The string representation of null values (the empty string by default)
*Note that when the delimiter character is not specified explicitly, the behavior depends on the skip parameter. If skip is zero (meaning that the CSV file has no header), the delimiter is a comma. If skip is greater than zero, the database runtime reads the first line of the input file and uses, as the delimiter for the following data, the first character encountered that belongs to the candidate set "; , | \t" (semicolon, comma, vertical bar or tab). The following example input strings illustrate how the first line is interpreted:

 
    A;B,C;D|E - the ';' will be used as delimiter, there are three fields: 'A', 'B,C' and 'D|E'
     
    A,B,C;D|E - the ',' will be used as delimiter, there are three fields: 'A', 'B' and 'C;D|E'
     
    A|B,C;D|E - the '|' will be used as delimiter, there are three fields: 'A', 'B,C;D'  and 'E'
     
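The detection rule above can be sketched as a short function (an illustration inferred from the examples; the actual engine implementation may differ):

```python
# Sketch of the delimiter auto-detection rule: scan the first line of the
# file and return the first character belonging to the candidate set
# ";", ",", "|", tab; fall back to a comma if none is found.
def detect_delimiter(first_line, candidates=';,|\t'):
    for ch in first_line:
        if ch in candidates:
            return ch
    return ','  # no candidate character present: default to a comma
```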

In the example insert statement above, the data is loaded from the /some_directory/file.csv file (the foreign table) and inserted into the target table TargetTable. The pattern table PatternTable specifies the target table layout, i.e. its column names and data types. Both TargetTable and PatternTable must be present in the database prior to the import operation. The pattern table is not modified by the statement.

If desired, both the target table and the pattern table could reference the same database table (but they don't have to). For example:

 
    insert into TAQTrade select * from foreign table (path='taqtrade20141030', skip=1, nostop=1) 
     as TAQTrade;
     

Here ‘taqtrade20141030’ is the input file and TAQTrade is used to define the data layout and the target database table.
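For intuition, the foreign-table "conveyor" can be pictured as a generator that streams rows from the input file while honoring the path, delimiter, skip and nullstr parameters described above. This is a hypothetical Python sketch, not the eXtremeSQL implementation (the nostop error handling is omitted for brevity):

```python
# Hypothetical sketch of a foreign table as a streaming row source; the
# real engine formats rows at runtime, this only illustrates the concept.
def foreign_table(path, delimiter=',', skip=0, nullstr=''):
    with open(path) as f:
        for lineno, line in enumerate(f):
            if lineno < skip:
                continue  # skip header/preamble records
            fields = line.rstrip('\n').split(delimiter)
            # map the null-string representation to an actual null
            yield [None if v == nullstr else v for v in fields]
```

Each yielded row would then feed the INSERT into the target table; nothing is materialized in between.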

Importing files with fixed-size records

To import files with fixed-length records, the pattern table fields must be defined as fixed-length character types (CHAR(N)) whose sizes match the input file format. For example:

 
    create table TAQTradeIN (
        tTime char(9), -- Time of trade with milliseconds (HHMMSSXXX)
        Exchange char(1), -- The Exchange where the trade took place
        Symbol char(16), -- Security symbol (six-character root, ten-character suffix)
        SaleCondition char(4), -- A Sale Condition (applies to all exchanges)
        TradeVolume char(9), -- The total volume of shares traded in the trade line
        TradePrice char(11), -- The price of the trade
        TradeStopStock char(1), -- Indicates a Stop Stock trade (NYSE only)
        TradeCorrectionIndicator char(2), -- Trade Correction Indications
        TradeSequenceNumber char(16),
        SourceofTrade char(1),
        TradeReportingFacility char(1)
    );
     
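For intuition, slicing one fixed-length record into the CHAR(N) fields declared above can be sketched as follows (illustrative Python, not product code; the field widths are those of TAQTradeIN):

```python
# Field names and widths matching the TAQTradeIN pattern table above.
FIELD_WIDTHS = [
    ('tTime', 9), ('Exchange', 1), ('Symbol', 16), ('SaleCondition', 4),
    ('TradeVolume', 9), ('TradePrice', 11), ('TradeStopStock', 1),
    ('TradeCorrectionIndicator', 2), ('TradeSequenceNumber', 16),
    ('SourceofTrade', 1), ('TradeReportingFacility', 1),
]

def parse_fixed(record):
    # slice the record into consecutive fixed-width character fields
    out, pos = {}, 0
    for name, width in FIELD_WIDTHS:
        out[name] = record[pos:pos + width]
        pos += width
    return out
```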

If the application requires a conversion, such as adding user-defined fields or removing some of the fields present in the input file, the target table columns should be specified explicitly. For example, given the TAQTradeIN fixed-size foreign table above, we can convert the input data into a table TAQTrade defined as follows:

 
    create table TAQTrade (
        tTime int, -- Time of trade with milliseconds (HHMMSSXXX)
        Exchange char(1), -- The Exchange where the trade took place
        Symbol char(16), -- Security symbol
        SaleCondition char(4), -- A Sale Condition
        TradeVolume int, -- The total volume of shares traded
        TradePrice DECIMAL(11,4), -- The price of the trade
        TradeStopStock char(1), -- Indicates a Stop Stock
        TradeCorrectionIndicator char(2), -- Trade Correction Indications
        TradeSequenceNumber int,
        SourceofTrade char(1),
        TradeReportingFacility char(1)
    );

The insert statement to perform this import and conversion would look like the following:

 
    insert into TAQTrade(tTime, Exchange, Symbol, SaleCondition,
        TradeVolume, TradePrice, TradeStopStock,
        TradeCorrectionIndicator,
        TradeSequenceNumber, SourceofTrade,
        TradeReportingFacility )
         
        select int(tTime), Exchange, Symbol, SaleCondition, int(TradeVolume),
            cast((substr(TradePrice, 1, 7) || '.' ||
            substr(TradePrice, 8,4)) as decimal),
            TradeStopStock, TradeCorrectionIndicator,
            int(TradeSequenceNumber),
            SourceofTrade, TradeReportingFacility
        from foreign table (path='taqtrade20141030', skip=2,nostop=1)
        as TAQTradeIN;
         
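The TradePrice expression in the SELECT above concatenates the first seven characters, a decimal point, and the last four characters, treating the CHAR(11) field as seven integer digits followed by four decimal digits. The same conversion can be sketched as follows (illustrative only):

```python
from decimal import Decimal

def convert_price(raw):
    # raw is the CHAR(11) TradePrice field: 7 integer digits + 4 decimals,
    # i.e. substr(raw, 1, 7) || '.' || substr(raw, 8, 4) in the SQL above
    return Decimal(raw[:7] + '.' + raw[7:11])
```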

Importing files into a vertical (sequence) representation

If the data being imported is to be inserted into fields of type sequence, then the INSERT OR UPDATE statement must be used. For example:

 
    INSERT OR UPDATE INTO TOBSeq (Book, SecurityID, EntryType,
        TradeDateTime, SeqNum, GroupCode, ProductCode,
        InstrumentName, EntryPrice,
        EntryQuantity, AggressorSide, TickDirection)
         
        SELECT Book, SecurityID, EntryType, cnvtime(TradeDate, TradeTime),
            SeqNum, GroupCode, ProductCode, InstrumentName,
            EntryPrice, EntryQuantity, AggressorSide, TickDirection
        FROM  foreign table(path='tob.csv', delimiter=',', skip=2,nostop=1)
        as FlatTOB;
         
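Conceptually, INSERT OR UPDATE into sequence fields appends each incoming row's values to per-key column sequences, creating the row the first time a key is seen. A simplified sketch of this vertical representation (an illustration, not the engine's storage format):

```python
from collections import defaultdict

def insert_or_update(table, key, row):
    # create the row on first sight of the key, then append each incoming
    # value to the corresponding sequence (vertical) column
    seqs = table.setdefault(key, defaultdict(list))
    for column, value in row.items():
        seqs[column].append(value)
```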

Importing autoid fields

By definition the autoid is a database-generated counter and cannot be imported. If the source foreign table (CSV file, etc.) includes an autoid column, its values are ignored; the autoid values for the target table are generated by the database runtime.

Some considerations

It is important to be aware of the transaction logging and commit policy settings, as they strongly affect import performance. For example, a configuration such as the following trades durability for bulk-load speed:

 
    db_params: {
        db_log_type: undo,
        log_params: {
            default_commit_policy : no_sync
        },
        mode_mask : bulk_write_modified_pages
    }
     

Batch Insert

Batch insert with partial sorting of inserted objects can greatly increase both insert and select speed for disk objects with a single index. If objects are inserted in roughly sorted order, the locality of reference for B-Tree pages increases significantly during the insert, and search operations using that index become much faster because the objects are stored mostly in the same order as they appear in the index.

When data is inserted into a database table from a CSV file, or from another table, the physical order in which the rows (objects) are stored on disk does not match the logical order defined for the table via its indexes. While inserting the data, the database runtime must build the index (usually tree-based), which causes it to read data and index pages from disk and re-sort the data on the fly, impacting insert speed. Furthermore, any lookup returns a result set whose physical layout does not match the order in which the objects are read, so when the objects are retrieved the database runtime must access pages scattered across the physical media.

The batch insert improves locality of reference by storing the data in the order defined by the single index. Whenever a search is done using that index, the lookup is much faster because the records in the result set are physically stored next to each other on disk.
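The idea can be sketched as buffering incoming rows and flushing each buffer in index-key order (a simplified illustration of partial sorting, not the engine's algorithm):

```python
# Illustrative batch insert: rows are accumulated in a buffer, each full
# buffer is sorted by the index key before being flushed to storage, so
# the stored order approximates the index order.
def batch_insert(rows, key, buffer_size):
    stored, buffer = [], []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            stored.extend(sorted(buffer, key=key))  # flush one sorted batch
            buffer.clear()
    stored.extend(sorted(buffer, key=key))  # flush the remaining tail
    return stored
```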

The size of the buffer used by the batch insert is controlled in the C++ API by the batchInsertBufferSize field of the SqlOptimizerParameters class. This buffer size must be set prior to creating the SQL engine (through the engine::open(database) method). Batch insert is disabled by assigning zero to the batchInsertBufferSize. (xSQL also supports batch insert through the sql_optimizer_params configuration file option).

To use batch insert, the data must be inserted from a CSV file (or other source table) using an insert select from statement into a persistent table (the target table). For example:

 
    insert into Quote 
        select * from foreign table (path='IBM-q1-2013.csv', skip=1 ) 
        as Quote;
     

or

 
    insert into Quote (symbol,date, time, price ) 
        select symbol, date, time, price
        from foreign table (path='IBM-q1-2013.csv', skip=1) 
        as Quote;
     

Note that the index used by the batch insert algorithm must be a B-Tree index. The algorithm uses the first B-Tree index in the list of all indexes created for the target table to pre-sort the input data stream. For example, if multiple indexes are created as follows:

 
    create table Quote (symbol varchar, date int, time int, price numeric);
    create index Quote.MyIndex on Quote (symbol, date, time);
    create index Quote.MyOtherIndex on Quote (symbol, price);
     

Here, the MyIndex index is used. The same would result from using the following schema file:

 
    class Quote {
        string symbol;
        uint4 date;
        uint4 time;
        float price;
        tree<symbol, date, time> MyIndex;
        tree<symbol, price> MyOtherIndex;
    };
     

Note that any autoid, list index or hash indexes defined for the table are not taken into account by the batch insert algorithm. As noted above, autoid values cannot be imported: if the source foreign table includes an autoid column, its values are ignored and the target table's autoid values are generated by the database runtime.

Null values

While loading data from CSV files it is sometimes necessary to define how null values are represented in the input. The nullstr parameter specifies the string representation of null values; by default the null string is the empty string, i.e. "". Applications can explicitly define a different null string. For example:

 
    INSERT INTO T select * from foreign table (path='/my.csv',nullstr='NULL',skip=0) as T;


Example

Please see SDK sample xSQL import for a practical example of importing securities trading data from external files.