DuckDB 0.6.0, an SQLite-like DBMS for analytical queries, released

The DuckDB 0.6.0 DBMS release is available. It combines SQLite properties such as compactness, the ability to be embedded as a library, single-file database storage, and a convenient CLI with tools and optimizations for analytical queries that touch a significant part of the stored data, for example queries that aggregate the entire contents of a table or join several large tables. The project code is distributed under the MIT license. Development is still at the stage of experimental releases, since the storage format is not yet stabilized and changes from version to version.

DuckDB provides an extended dialect of SQL with additional capabilities for handling very complex and long-running queries. It supports complex types (arrays, structures, unions), arbitrary and nested correlated subqueries, simultaneous execution of multiple queries, and running queries directly against files in CSV and Parquet format. Importing data from the PostgreSQL DBMS is also possible.
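
As an illustration of querying files directly, a minimal sketch (the file names events.parquet and measurements.csv are hypothetical; read_csv_auto and querying a Parquet file by its path are existing DuckDB features):

    -- Query a Parquet file directly, without importing it first
    SELECT count(*) FROM 'events.parquet';

    -- Read a CSV file with automatic schema detection
    SELECT * FROM read_csv_auto('measurements.csv') LIMIT 10;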

In addition to shell code from SQLite, the project uses a PostgreSQL parser split out into a separate library, the date math component from MonetDB, its own implementation of window functions (based on the Segment Tree Aggregation algorithm), a regular expression processor based on the RE2 library, its own query optimizer, an MVCC (Multi-Version Concurrency Control) mechanism for concurrent execution of tasks, and a vectorized query execution engine based on the Hyper-Pipelining Query Execution algorithm, which processes a large set of values at once in one operation.

Among the changes in the new release:

  • Work continued on improving the storage format. An optimistic disk write mode has been implemented: when a large data set is loaded in a single transaction, the data is compressed and streamed to the database file without waiting for the transaction to be confirmed by the COMMIT command. By the time COMMIT arrives, the data has already been written to disk, and on ROLLBACK it is discarded. Previously, the data was first held entirely in memory and written to disk only on commit. A sketch of this scenario follows the list.
  • Support for parallel loading of data into individual tables has been added, which can significantly increase loading speed on multi-core systems. For example, in the previous release, loading a database with 150 million rows took 91 seconds on a 10-core CPU; in the new version the same operation takes 17 seconds. Two parallel loading modes are available: one that preserves the order of the records and one that does not.
  • The FSST (Fast Static Symbol Table) algorithm is now used for data compression; it packs string values using a common dictionary of typical matches. The new algorithm reduced the size of a test database from 761 MB to 251 MB.
  • The Chimp and Patas algorithms are offered for compressing floating-point numbers (DOUBLE and FLOAT). Compared to the previously used Gorillas algorithm, Chimp provides a higher compression level and faster decompression. Patas lags behind Chimp in compression ratio, but decompresses much faster, at almost the speed of reading uncompressed data.
  • An experimental ability to load data from CSV files in several parallel streams has been added (SET experimental_parallel_csv=true), which significantly reduces loading time for large CSV files. For example, with the option enabled, loading a 720 MB CSV file dropped from 3.5 to 0.6 seconds (see the sketch after this list).
  • Parallel execution of index creation and management operations has been implemented. For example, a CREATE INDEX operation on a column with 16 million records went from 5.92 to 1.38 seconds.
  • Aggregation in queries containing the expression "COUNT(DISTINCT col)" is now parallelized.
  • Support for the UNION type has been added to SQL, allowing several types to be bound to the same element, for example "UNION(num INT, error VARCHAR)" (see the sketch after this list).
  • SQL queries may now begin with the word "FROM" instead of "SELECT", in which case the query is assumed to start with "SELECT *" (an example follows the list).
  • Added support for the "COLUMNS" expression in SQL, which applies an operation to multiple columns without duplicating the expression. For example, "SELECT MIN(COLUMNS(*)) FROM obs;" executes the MIN function for every column of the obs table, and "SELECT COLUMNS('val[0-9]+') FROM obs;" selects only the columns whose names consist of "val" followed by digits.
  • Added support for list comprehensions, such as "SELECT [x + 1 for x in [1, 2, 3]] AS l;".
  • Memory consumption has been optimized. On Linux, the jemalloc library is now used for memory management by default. Performance of hash join operations under limited memory has been significantly improved.
  • Added ".mode duckbox" output mode to CLI, discarding middle columns based on terminal window width lines). Using the ".maxrows X" parameter, you can additionally limit the number of output rows.
  • The CLI provides context-aware input auto-completion (keywords, table names, function names, column names, and file names are completed).
  • The CLI now displays a query progress indicator by default.
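
The bulk-load scenario from the first item, as a minimal sketch (the measurements table and big.csv file are hypothetical; the optimistic write behavior itself is internal and requires no special syntax):

    -- Load a large data set in a single transaction; with optimistic
    -- disk writes the rows are streamed to the database file before
    -- COMMIT instead of being held entirely in memory.
    BEGIN TRANSACTION;
    INSERT INTO measurements SELECT * FROM read_csv_auto('big.csv');
    COMMIT;  -- the data is already on disk at this point
    -- Executing ROLLBACK instead of COMMIT would discard the written data.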
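
The experimental parallel CSV reader, in a sketch under the same assumptions (hypothetical file and table names):

    -- Enable the experimental parallel CSV reader for this session
    SET experimental_parallel_csv=true;
    -- Subsequent CSV reads may use several parallel streams
    CREATE TABLE lineitem AS SELECT * FROM read_csv_auto('lineitem.csv');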
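
A sketch of the new UNION type (the table name and member names are illustrative):

    -- A column that holds either an INT or a VARCHAR per row
    CREATE TABLE results (u UNION(num INT, error VARCHAR));
    INSERT INTO results VALUES (1);
    INSERT INTO results VALUES ('parse failure');
    -- Each value is shown with the member that is actually set
    SELECT u FROM results;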
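
The FROM-first syntax (obs is the hypothetical table reused from the COLUMNS example above):

    -- The two queries below are equivalent in 0.6.0:
    SELECT * FROM obs;
    FROM obs;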
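
And the new CLI output settings, as a short session sketch (the dot commands are typed in the duckdb shell):

    -- duckbox mode omits middle columns to fit the terminal width
    .mode duckbox
    -- show at most 20 rows of output
    .maxrows 20
    SELECT * FROM obs;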

Source: opennet.ru
