KDB+ Database: From Finance to Formula 1

KDB+, a product of the company KX, is a database widely known in narrow circles: an exceptionally fast, columnar database designed for storing time series and for analytical computation over them. It was initially (and still is) very popular in the financial industry: it is used by all of the top 10 investment banks and by many well-known hedge funds, exchanges, and other organizations. Recently, KX decided to expand its customer base and now offers solutions in other areas with large volumes of data ordered by time or otherwise: telecom, bioinformatics, manufacturing, and so on. Among other things, they became a partner of the Aston Martin Red Bull Racing team in Formula 1, where they help collect and process data from car sensors and analyze wind-tunnel tests. In this article, I want to discuss which features of KDB+ make it so fast, why companies are willing to pay big money for it, and, finally, why it is not really a database.
 
 
In this article, I will try to explain in general terms what KDB+ is, what features and limitations it has, and what value it offers companies that want to process large amounts of data. I will not go into the details of KDB+'s implementation or of its Q programming language; both topics are very broad and merit separate articles. A lot of information on them can be found at code.kx.com, including the book Q For Mortals (see the links below).

Some terms

  • In-memory database. A database that keeps data in RAM for faster access. The advantages of such a database are clear; the disadvantages are the risk of data loss and the need for a large amount of RAM on the server.
  • Columnar database. A database where data is stored column by column rather than record by record. The main advantage is that the data of one column is stored together on disk and in memory, which greatly speeds up access to it; columns not used in a query need not be loaded at all. The main disadvantage is that modifying and deleting records is difficult.
  • Time series. Data with a date or time column. As a rule, ordering in time is important for such data, so that it is easy to determine which record precedes or follows the current one, or to apply functions whose result depends on the order of records. Classical databases are built on a completely different principle - the representation of a collection of records as a set, where the order of the records is in principle not defined.
  • Vector. In the context of KDB+, this is a list of elements of the same atomic type, such as numbers. In other words, an array of elements. Arrays, unlike lists, can be stored compactly and processed using processor vector instructions.
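To make the columnar idea concrete, here is a minimal sketch in Python (an analogy only, not how KDB+ is implemented): a table is a mapping from column names to equal-length vectors, so a query that touches one column never loads the others.

```python
import numpy as np

# A "table" stored column by column: each column is one contiguous vector.
# (Illustrative sketch; KDB+'s table is a native Q data type.)
trade = {
    "sym": np.array(["VOD.L", "IBM", "VOD.L"]),
    "price": np.array([104.1, 140.25, 104.3]),
    "qty": np.array([500, 200, 300]),
}

# A query over two columns scans two contiguous arrays;
# the price column is never touched.
total_vod_qty = int(trade["qty"][trade["sym"] == "VOD.L"].sum())
```

Because each column is contiguous, scans are cache-friendly and amenable to vector instructions, which is exactly the property the definitions above describe.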

 

Historical information

KX was founded in 1993 by Arthur Whitney, who previously worked at Morgan Stanley on A+, a successor to APL - a very original language that was once popular in the financial world. At KX, Arthur naturally continued in the same spirit and created the vector-functional language K, guided by ideas of radical minimalism. K programs look like a jumble of punctuation marks and special characters, the meaning of symbols and functions depends on context, and each operation carries far more meaning than in conventional programming languages. Because of this, a K program takes up a minimum of space - a few lines can replace pages of text in a verbose language like Java - and is a super-concentrated implementation of an algorithm.
 
A K function implementing most of an LL(1) parser generator for a given grammar:

pp:{q:{(x;p3(),y)};r:$[-11=@x;$x;11=@x;q[`N;$*x];10=abs@@x;q[`N;x]
  ($)~*x;(`P;p3 x 1);(1=#x)&11=@*x;pp[{(1#x;$[2=#x;;,:]1_x)}@*x]
     (?)~*x;(`Q;pp[x 1]);(*)~*x;(`M;pp[x 1]);(+)~*x;(`MP;pp[x 1]);(!)~*x;(`Y;p3 x 1)
     (2=#x)&(@x 1)in 100 101 107 7 -7h;($[(@x 1)in 100 101 107h;`Ff;`Fi];p3 x 1;pp[*x])
     (|)~*x;`S,(pp'1_x);2=#x;`C,{@[@[x;-1+#x;{x,")"}];0;"(",]}({$[".s.C"~4#x;6_-2_x;x]}'pp'x);'`pp];
  $[@r;r;($[1<#r;".s.";""],$*r),$[1<#r;"[",(";"/:1_r),"]";""]]}

Arthur embodied this philosophy of extreme efficiency with minimal effort in KDB+, which appeared in 2003 (I think it is clear by now where the K in the name comes from) and is essentially an interpreter for the fourth version of the K language. A more user-friendly layer called Q was added on top of K. Q also brought support for a specific SQL dialect, QSQL, while the interpreter gained tables as a native data type and tools for working with tables in memory and on disk.
 
Thus, from the user's point of view, KDB+ is simply a Q interpreter with support for tables and SQL-like expressions in the style of LINQ from C#. This is the most important distinguishing feature of KDB+ and its main competitive advantage, and it is often overlooked. It is not a database plus a crippled auxiliary language, but a full-fledged, powerful programming language with built-in database functionality. This distinction plays a decisive role in enumerating all the benefits of KDB+. For example…
 

Size

By today's standards, KDB+ is simply microscopic. It is literally one executable file under a megabyte plus one small text file with some system functions. Less than one megabyte in total - and companies pay tens of thousands of dollars a year per server processor for this program.

  • This size lets KDB+ feel at home on any hardware, from a Raspberry Pi microcomputer to servers with terabytes of memory. This does not affect functionality in any way; moreover, Q starts instantly, which allows it to be used, among other things, as a scripting language.
  • With this size, the Q interpreter fits completely into the processor cache, which speeds up the execution of programs.
  • With an executable this size, a Q process takes up negligible memory; you can run hundreds of them. At the same time, if necessary, a single Q process can operate on tens or hundreds of gigabytes of memory.

Versatility

Q is great for a wide variety of tasks. A Q process can act as a historical database and provide fast access to terabytes of information. For example, we have dozens of historical databases, in some of which a single uncompressed day of data takes more than 100 gigabytes. Nevertheless, with reasonable restrictions, a query to the database completes in tens to hundreds of milliseconds. In general, we have a universal timeout for user requests - 30 seconds - and it is triggered very rarely.
 
Q can just as easily serve as an in-memory database. Adding new data to in-memory tables is so fast that user queries become the limiting factor. Data in tables is stored by column, which means that any operation on a column uses the processor cache at full capacity. On top of this, KX has implemented all the basic operations, such as arithmetic, through vector processor instructions, maximizing their speed. Q can also perform tasks not typical for databases - for example, process streaming data and compute, in "real time" (with a delay from tens of milliseconds to several seconds, depending on the task), various aggregate functions over financial instruments for different time intervals, or build a model of the impact of completed transactions on the market and profile it almost immediately after completion. In such tasks, the main delay is most often introduced not by Q but by the need to synchronize data from different sources. High speed is achieved because the data and the functions that process it live in one process, and processing reduces to executing a few QSQL expressions and joins that are not interpreted but executed as binary code.
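The streaming-aggregation pattern described above can be sketched in Python (the names and the choice of VWAP are illustrative; a real-time Q process would keep this state in an in-memory table updated by its message handler): state is updated per incoming message, so a query reads a precomputed value instead of rescanning the raw data.

```python
from collections import defaultdict

# Running VWAP (volume-weighted average price) per instrument,
# updated incrementally as each trade message arrives.
state = defaultdict(lambda: {"notional": 0.0, "qty": 0})

def on_trade(sym, price, qty):
    # Message handler: O(1) work per incoming trade.
    s = state[sym]
    s["notional"] += price * qty
    s["qty"] += qty

def vwap(sym):
    # Query: reads the precomputed aggregate, no rescan of history.
    s = state[sym]
    return s["notional"] / s["qty"]

on_trade("VOD.L", 100.0, 200)
on_trade("VOD.L", 101.0, 100)
result = vwap("VOD.L")
```

The same shape - a handler that folds each message into per-instrument state - is what keeps the "real time" delay down to the cost of a few arithmetic operations per message.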
 
Finally, any service process can be written in Q - for example, gateway processes that automatically distribute user requests across the required databases and servers. The programmer has complete freedom to implement any algorithm for balancing, prioritization, fault tolerance, access rights, quotas, and whatever else your heart desires. The main problem here is that you have to implement all of this yourself.
 
For example, I will list what types of processes we have. All of them are actively used and work together, combining dozens of different databases into one whole, processing data from many sources and serving hundreds of users and applications.

  • Connectors (feedhandlers) to data sources. These processes usually use external libraries loaded into Q. The C interface of Q is extremely simple and makes it easy to create proxy functions for any C/C++ library. Q is fast enough to handle, for example, the flood of FIX messages from all the European stock exchanges at once.
  • Data distributors (tickerplants), which serve as an intermediate link between connectors and consumers. They write incoming data to a special binary log, protecting consumers against connection losses and restarts.
  • In-memory databases (rdb). These databases provide the fastest possible access to fresh, raw data by keeping it in memory. As a rule, they accumulate data in tables during the day and clear them overnight.
  • Persisting databases (pdb). These processes ensure that today's data ends up in the historical database. As a rule, unlike an rdb, they do not keep data in memory but use a special on-disk cache during the day and copy the data to the historical database at midnight.
  • Historical databases (hdb). These databases provide access to data from previous days, months, and years. Their size (in days) is limited only by disk capacity. The data can be located anywhere, in particular on different disks for faster access, and can be compressed with a choice of several algorithms. The database structure is well documented and simple: data is stored column by column in regular files, so it can be processed even with operating-system tools.
  • Databases with aggregated information. They store various aggregations, usually grouped by instrument name and time interval. In-memory versions update their state with each incoming message, and historical versions store precomputed data to speed up access to historical data.
  • Finally, the gateway processes serving applications and users. Q makes it possible to implement fully asynchronous processing of incoming messages, their distribution across databases, access-rights checks, and so on. Note that requests are not limited to SQL statements and most often are not, unlike in other databases. Most often, the SQL expression is hidden inside a special function and is constructed from the parameters the user requested: time conversion and filtering are performed, the data is normalized (for example, share prices are adjusted for dividend payments), and so on.

Typical architecture for one data type:

[diagram in the original article]

Speed

Although Q is an interpreted language, it is also a vector language. This means that many built-in functions, in particular arithmetic ones, accept arguments of any shape - numbers, vectors, matrices, lists - and the programmer is expected to implement a program as operations on arrays. In such a language, if you add two vectors of a million elements each, it no longer matters that the language is interpreted: the addition is performed by a super-optimized binary function. Since the lion's share of time in Q programs is spent on table operations built from these basic vectorized functions, the result is very decent performance, allowing a huge amount of data to be processed even in a single process. This is similar to the math libraries in Python: although Python itself is very slow, it has many excellent libraries, such as numpy, that process numeric data at the speed of compiled code (incidentally, numpy is ideologically close to Q).
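The numpy parallel can be made concrete. In the sketch below (Python, not Q), one interpreted statement triggers a million additions in optimized native code, so the per-element interpreter overhead disappears - the same effect the vector style gives Q:

```python
import numpy as np

# Two vectors of a million elements each.
n = 1_000_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# A single interpreted statement; the million additions run in
# optimized, vectorized native code, much as a+b would in Q.
c = a + b
```

Written as a Python for-loop over scalars, the same addition would be orders of magnitude slower; the vector form pays the interpreter cost once per operation, not once per element.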
 
In addition, KX took a very careful approach to designing tables and optimizing work with them. First, several types of indexes are supported, backed by built-in functions, and they can be applied not only to table columns but to any vector: grouping, sorting, a uniqueness attribute, and a special grouping for historical databases. An index is applied trivially and is adjusted automatically when elements are appended to the column or vector. Indexes can be applied equally well to table columns in memory and on disk. When a QSQL query is executed, indexes are used automatically where possible. Second, work with historical data goes through the OS memory-mapping mechanism (mmap). Large tables are never loaded into memory; instead, the necessary columns are mapped directly into the address space, and only the part of them that is actually needed (here, too, indexes help) is loaded. It makes no difference to the programmer whether the data is in memory or not; the mmap machinery is completely hidden in the depths of Q.
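As a rough illustration of the memory-mapping idea (a Python sketch only; the real KDB+ column-file format has its own header and is managed entirely inside Q), a column stored as a flat binary file can be mapped rather than read, so touching a small slice faults in only a few pages:

```python
import os
import tempfile

import numpy as np

# Write a "column" to disk as a flat vector of 8-byte floats.
# (Hypothetical file layout, for illustration of mmap only.)
path = os.path.join(tempfile.mkdtemp(), "price")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map the file instead of loading it: pages are pulled in lazily,
# so reading the last thousand values touches only a few pages
# of the 8 MB file.
col = np.memmap(path, dtype=np.float64, mode="r")
tail_mean = float(col[-1000:].mean())
```

From the caller's point of view `col` behaves like an ordinary in-memory array, which mirrors the point above: the programmer does not care whether the data has actually been loaded.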
 
KDB+ is not a relational database; tables can contain arbitrary data, and the order of rows does not change as new elements are added, which can and should be exploited when writing queries. This feature is essential for working with time series (exchange data, telemetry, event logs): if the data is sorted by time, the user needs no SQL tricks to find the first or last N rows of a table, determine which row follows the Nth one, and so on. Table joins are simplified even further; for example, finding the last quote for each of 500 VOD.L (Vodafone) trades in a table of 16 billion rows takes about a second on disk and ten milliseconds in memory.
 
An example of an as-of (time) join: the quote table is memory-mapped, so there is no need to restrict it with sym=`VOD.L in a where clause - the index on the sym column and the fact that the data is sorted by time are used implicitly. Almost all joins in Q are ordinary functions, not part of a select statement:

aj[`sym`time;select from trade where date=2019.03.26, sym=`VOD.L;select from quote where date=2019.03.26]
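For readers more at home in Python, the same as-of join can be sketched with pandas, where merge_asof plays the role of aj (an analogy, not the Q mechanism; the table contents here are invented):

```python
import pandas as pd

# Toy quote and trade tables, both sorted by time - a precondition
# for merge_asof, just as ordering is exploited by aj in Q.
quote = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-03-26 09:00:00", "2019-03-26 09:00:05", "2019-03-26 09:00:10",
    ]),
    "sym": ["VOD.L"] * 3,
    "bid": [100.0, 100.5, 101.0],
})
trade = pd.DataFrame({
    "time": pd.to_datetime(["2019-03-26 09:00:07"]),
    "sym": ["VOD.L"],
    "price": [100.6],
})

# For each trade, take the last quote at or before its timestamp -
# the pandas counterpart of aj[`sym`time;trade;quote].
joined = pd.merge_asof(trade, quote, on="time", by="sym")
```

Here the trade at 09:00:07 picks up the 09:00:05 quote (bid 100.5), exactly the "last quote before each trade" semantics described above.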

Finally, it is worth noting that the engineers at KX, starting with Arthur Whitney himself, are true efficiency fanatics and go out of their way to squeeze the most out of Q's standard features and to optimize the most common usage patterns.
 

Conclusion

KDB+ is popular with businesses primarily due to its exceptional versatility - it serves equally well as an in-memory database, as a base for storing terabytes of historical data, and as a platform for data analysis. Due to the fact that data processing takes place directly in the database, high speed and resource savings are achieved. A full-fledged programming language integrated with database functions allows you to implement the entire stack of necessary processes on one platform - from receiving data to processing user requests.
 

Additional information

Disadvantages

A significant disadvantage of KDB+/Q is the high barrier to entry. The language has an unusual syntax, and some functions are heavily overloaded (value, for example, has about 11 use cases). Most importantly, it requires a radically different approach to writing programs. In a vector language, you must always think in terms of array transformations, implement all loops through the several flavors of map/reduce functions (called adverbs in Q), and never try to economize by replacing vector operations with scalar ones. For example, to find the index of the Nth occurrence of an element in a vector, you would write:

(where element=vector)[N]

although this looks terribly inefficient by C/Java standards (= creates a boolean vector, and where returns the indices of its true elements). But such notation makes the meaning of the expression clearer, and you use fast vector operations instead of slow scalar ones. The conceptual distance between a vector language and the rest is comparable to that between the imperative and functional approaches to programming, and you need to be prepared for it.
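The idiom translates almost word for word into numpy, which, as noted above, is ideologically close to Q (an illustrative analogy, not Q itself):

```python
import numpy as np

vector = np.array([3, 1, 3, 2, 3, 1])
element = 3
N = 2  # zero-based: the third occurrence

# element == vector builds a boolean vector; np.where (like Q's where)
# yields the indices of its true elements; indexing then picks the Nth.
index_of_nth = int(np.where(element == vector)[0][N])
```

Occurrences of 3 sit at indices 0, 2, and 4, so the third occurrence is at index 4 - and, as in Q, the whole computation runs as a handful of vectorized operations.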
 
Some users are also unhappy with QSQL. The point is that it only looks like real SQL. In fact, it is just an interpreter of SQL-like expressions that does not perform query optimization. Users must write optimal queries themselves, in Q, which many are not prepared to do. On the other hand, you can always write the optimal query by hand rather than rely on a black-box optimizer.
 
On the plus side, the book Q For Mortals is freely available on the developer site (see the links below), which also collects many other useful materials.
 
Another big disadvantage is the cost of a license: tens of thousands of dollars a year per CPU. Only large firms can afford such expenses. Recently, KX has made its licensing policy more flexible, providing the option to pay only for usage time or to rent KDB+ in the Google and Amazon clouds. KX also offers a free version for non-commercial use (the 32-bit version, or 64-bit on request).
 

Competitors

There are quite a few specialized databases built on similar principles: columnar, in-memory, designed for very large volumes of data. The problem is that they are specialized databases. A prime example is ClickHouse. It stores data on disk and builds indexes in a way very similar to KDB+, and it performs some queries faster than KDB+, though not significantly. But even as a database, ClickHouse is more specialized than KDB+: web analytics versus arbitrary time series (this difference matters - because of it, for example, ClickHouse cannot exploit record ordering). Most importantly, ClickHouse lacks KDB+'s universality: a language that lets you process data directly in the database rather than first loading it into a separate application, build arbitrary SQL expressions, use arbitrary functions in a query, and create processes unrelated to the historical-database role. It is therefore difficult to compare KDB+ with other databases; they may be better in particular use cases, or simply better at classic database tasks, but I know of no other tool that is equally effective and versatile for processing time-series data.
 

Integration with Python

To make working with KDB+ easier for people unfamiliar with the technology, KX has created libraries for tight integration with Python within a single process. You can call any Python function from Q or, vice versa, any Q function from Python (in particular, QSQL expressions). The libraries convert data between the two languages' formats when necessary (not always, for the sake of efficiency). As a result, Q and Python live in such close symbiosis that the boundary between them blurs: the programmer gets full access to the numerous useful Python libraries on the one hand, and a fast database for big data integrated into Python on the other - especially useful for those doing machine learning or modeling.
 
Working with Q in Python:

>>> q()
q)trade:([]date:();sym:();qty:())
q)
>>> q.insert('trade', (date(2006,10,6), 'IBM', 200))
k(',0')
>>> q.insert('trade', (date(2006,10,6), 'MSFT', 100))
k(',1')

References

Company website - https://kx.com/
Developer site - https://code.kx.com/v2/
The book Q For Mortals (in English) - https://code.kx.com/q4m3/
Articles on KDB+/Q applications by KX staff - https://code.kx.com/v2/wp/

Source: habr.com
