DuckDB 1.2.0 Release

DuckDB 1.2.0, a DBMS focused on executing analytical queries and conceptually reminiscent of SQLite, has been released. DuckDB combines SQLite's strengths, such as compactness, embedding as a library, single-file database storage, and a CLI, with capabilities and optimizations for analytical queries that touch a significant portion of the stored data, for example aggregating the entire contents of a table or joining several large tables. The project is written in C++ and distributed under the MIT license.

DuckDB provides an extended dialect of SQL with additional capabilities for handling complex and long-running queries. Complex types (arrays, structs, unions) are supported, as are arbitrary and nested correlated subqueries. Several queries can be executed simultaneously, and queries can be run directly against files in CSV and Parquet formats. Importing data from the PostgreSQL DBMS is also supported.

The project uses the shell from SQLite, the parser from PostgreSQL, the date-math component from MonetDB, its own implementation of window functions (based on the segment tree aggregation algorithm), a regular expression engine based on the RE2 library, its own query optimizer, an MVCC (Multi-Version Concurrency Control) mechanism for managing concurrently executing tasks, and a vectorized query execution engine based on the Hyper-Pipelining Query Execution algorithm, which processes large sets of values in a single operation.

In the new version:

  • Implemented support for new compression methods; they are disabled by default to keep database files compatible with previous versions of DuckDB. To use the improved file format, the database can now be pinned to a version number: when opening a file, the "STORAGE_VERSION" parameter specifies the minimum DuckDB version that must be able to read it ("ATTACH 'file.db' (STORAGE_VERSION 'v1.2.0');"). To convert a file from the new format back to the old one, the COPY SQL command can be used, for example: ATTACH 'file1.db'; ATTACH 'converted_file.db' (STORAGE_VERSION 'v1.0.0'); COPY FROM DATABASE file1 TO converted_file;
  • Added support for the SQL command "ALTER TABLE … ADD PRIMARY KEY" to add a primary key to an existing table.
  • Removed a restriction that prevented re-inserting key values covered by a unique index if those values were deleted earlier in the same transaction. For example, the following SQL no longer produces an error: CREATE TABLE students (id INTEGER PRIMARY KEY, name VARCHAR); INSERT INTO students VALUES (1, 'John Doe'); BEGIN; DELETE FROM students WHERE id = 1; INSERT INTO students VALUES (1, 'Jane Doe');
  • Added support for loading CSV files in Latin-1 and UTF-16 encodings (previously only UTF-8 was supported), for example: FROM read_csv('cities-latin-1.csv', encoding = 'latin-1');
  • Added support for multi-byte delimiters (up to 4 bytes) in CSV files, which makes it possible to use an emoji as the column separator. For example, a file with the lines "a🦆b" and "hello🦆world" can be read with: FROM read_csv('example.dsv', sep = '🦆');
  • Strict mode is now enabled by default when parsing CSV files ("strict_mode = true"); it checks that the formatting complies with the RFC 4180 specification. In strict mode only a single line-delimiter style is allowed, so an error is raised when parsing files that mix the line feed character ("\n") with the carriage return/line feed sequence ("\r\n").
  • The CSV parser uses a new newline-detection algorithm that supports parallel processing; the new algorithm speeds up CSV parsing by about 15%.
  • Removed the limit on line length in CSV files (previously a line could not exceed 8 MB).
  • When exporting data to the Parquet format, dictionary encoding and Bloom-filter indexes are now supported. The DELTA_BINARY_PACKED encoding is implemented, allowing a significant reduction in the size of Parquet files.
  • A safe mode has been added to the command line interface, activated using the "-safe" option or the ".safe_mode" command. In this mode, access is only allowed to the initially specified database file, and an attempt to open other files will result in an error.
  • Improved autocompletion of input in the command line interface. The code for autocompletion has been converted to use PEG (Parsing Expression Grammar).
  • When executing commands in the command line interface, large numbers are now annotated for readability: for example, when the number 100000000 is displayed, "(100 million)" is appended.
  • SQL now supports a prefix-alias syntax in which a short name for a table or expression is written before the value it refers to (instead of the "expression AS name" syntax): SELECT e1: some_long_and_winding_expression, e2: t2.a_column_name FROM t1: long_schema.some_long_table_name, t2: short_s.tbl;
  • The "SELECT" command now supports a "RENAME" operation for renaming columns selected via the "*" expression: SELECT * RENAME (col1 AS new_col1) FROM integers;
  • The "SELECT" command now allows "LIKE" and "SIMILAR TO" operations to filter by name the columns selected via "*": SELECT * LIKE 'val%' FROM key_val;
  • Improved quality of pseudo-random number generation.
  • The query optimizer has been upgraded; performance on the TPC-H SF100 benchmark improved by 13%.
  • A new C-style API for developing extensions is introduced, which can be used, for example, to create new aggregate or table functions.
  • Added support for systems using the musl C standard library.

Source: opennet.ru
