OpenZL 0.2.0

After seven months of development, version 0.2.0 of OpenZL, a framework for building lossless data compressors, has been released.

The framework consists of a base library and tools for creating specialized compressors described in the SDDL language.
There are two steps to creating a good dedicated compressor:

  1. Data analysis to extract structure.
  2. Using good backend compressors that exploit the resulting structure to achieve good compression.

OpenZL provides tools for both stages.
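
The two steps above can be illustrated with a minimal sketch. It does not use OpenZL: the record layout (a 32-bit counter plus a 16-bit tag), the helper names, and the choice of zlib as the backend compressor are all assumptions made for illustration only.

```python
import struct
import zlib

# Hypothetical record layout: little-endian u32 counter followed by a u16 tag.
RECORD = struct.Struct("<IH")

def make_records(n):
    """Build n interleaved records, the raw form a generic compressor sees."""
    return b"".join(RECORD.pack(1000 + i, i % 7) for i in range(n))

def split_columns(blob):
    """Step 1: use knowledge of the layout to separate the two fields."""
    counters, tags = [], []
    for counter, tag in RECORD.iter_unpack(blob):
        counters.append(counter)
        tags.append(tag)
    return counters, tags

def delta_encode(values):
    """Replace each value by its difference from the previous one (mod 2**32)."""
    prev, out = 0, []
    for v in values:
        out.append((v - prev) & 0xFFFFFFFF)
        prev = v
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: a running sum (mod 2**32)."""
    total, out = 0, []
    for d in deltas:
        total = (total + d) & 0xFFFFFFFF
        out.append(total)
    return out

def compress_structured(blob):
    """Step 2: hand each homogeneous stream to a backend compressor (zlib here)."""
    counters, tags = split_columns(blob)
    n = len(counters)
    col_a = struct.pack(f"<{n}I", *delta_encode(counters))
    col_b = struct.pack(f"<{n}H", *tags)
    return zlib.compress(col_a), zlib.compress(col_b)

def decompress_structured(parts):
    """Undo both steps and re-interleave the fields into the original bytes."""
    col_a, col_b = (zlib.decompress(p) for p in parts)
    n = len(col_a) // 4
    counters = delta_decode(struct.unpack(f"<{n}I", col_a))
    tags = struct.unpack(f"<{n}H", col_b)
    return b"".join(RECORD.pack(c, t) for c, t in zip(counters, tags))
```

On this synthetic data the per-column streams compress far better than the interleaved bytes, because the deltas of the counter column are almost all identical. Real data will behave differently, which is why the analysis step matters.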

The project is written in C and C++ and is distributed under the BSD license.

Major changes

SDDL2

SDDL was completely rewritten from the ground up to achieve its original design goals. While the original demo was a simplified runtime environment, SDDL2 is a full-fledged compiler: the parser passes data to the semantic analyzer, which in turn passes a typed abstract syntax tree (AST) to the optimizer, and the optimizer controls the code generator, which generates virtual machine bytecode.
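
To make the staged design concrete, here is a toy pipeline for arithmetic expressions. It is not SDDL2 and shares no code or syntax with it: Python's own `ast` module stands in for the parser, and the function names and the three-instruction bytecode (`PUSH`, `LOAD`, `ADD`/`MUL`) are invented for this sketch.

```python
import ast

def analyze(tree, params):
    """Semantic pass: reject undefined names before any code is generated."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in params:
            raise NameError(
                f"undefined reference '{node.id}' "
                f"at line {node.lineno}, column {node.col_offset}"
            )

def fold(node):
    """Optimizer: collapse + and * when both operands are constants."""
    if isinstance(node, ast.BinOp):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, ast.Constant) and isinstance(right, ast.Constant):
            if isinstance(node.op, ast.Add):
                return ast.Constant(left.value + right.value)
            if isinstance(node.op, ast.Mult):
                return ast.Constant(left.value * right.value)
        return ast.BinOp(left, node.op, right)
    return node

OPCODES = {ast.Add: "ADD", ast.Mult: "MUL"}

def emit(node, code):
    """Code generator: post-order walk producing stack-machine bytecode."""
    if isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.Name):
        code.append(("LOAD", node.id))
    elif isinstance(node, ast.BinOp) and type(node.op) in OPCODES:
        emit(node.left, code)
        emit(node.right, code)
        code.append((OPCODES[type(node.op)], None))
    else:
        raise TypeError(f"unsupported construct: {type(node).__name__}")
    return code

def compile_expr(source, params):
    """Parser -> semantic analyzer -> optimizer -> code generator."""
    tree = ast.parse(source, mode="eval").body
    analyze(tree, params)
    return emit(fold(tree), [])

def run(code, env):
    """Tiny virtual machine executing the generated bytecode."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "ADD" else a * b)
    return stack.pop()
```

Compiling `"2 * 4 + n"` with `n` declared as a parameter yields three instructions, since the optimizer has already folded `2 * 4` into `8`. Compiling `"n + m"` fails in the semantic pass, before any bytecode exists.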

The key result is instant parsing. When a record's location can be fully determined using parameters and constants alone, the engine jumps directly to any field without scanning previous bytes, enabling copy-less access and throughput of several GB/s.
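
The idea of jumping straight to a field can be sketched as plain offset arithmetic. The layout below (a 16-byte header followed by fixed-size records) and all names are hypothetical, and this is Python rather than SDDL2 bytecode. The point is only that when every size is a constant or a parameter, a field's position is a closed-form expression.

```python
import struct

HEADER_SIZE = 16                  # hypothetical fixed-size file header
RECORD = struct.Struct("<IdH")    # hypothetical record: u32 id, f64 value, u16 flags
FIELD_OFFSET = {"id": 0, "value": 4, "flags": 12}
FIELD_FORMAT = {"id": "<I", "value": "<d", "flags": "<H"}

def read_field(buf, index, field):
    """Read one field of record `index` without touching earlier records."""
    offset = HEADER_SIZE + index * RECORD.size + FIELD_OFFSET[field]
    return struct.unpack_from(FIELD_FORMAT[field], buf, offset)[0]

# Build a sample buffer: an empty header followed by 1000 packed records.
data = bytes(HEADER_SIZE) + b"".join(
    RECORD.pack(i, i * 0.5, i & 0xFFFF) for i in range(1000)
)
view = memoryview(data)           # a view avoids copying the underlying bytes

value_750 = read_field(view, 750, "value")
```

If any field had a length that depended on the data itself, the offset of later records could no longer be computed this way and a sequential scan would be needed.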

The language itself has evolved alongside its toolset. It now supports when clauses for conditional statements, parameterized and anonymous records, access to record field members, and bitwise and logical operators.

On the developer side, the semantic analysis step now reports undefined references, type mismatches, and arity errors at compile time, with source locations, rather than at runtime. A VS Code extension for syntax highlighting of .sddl files has also been released.

New built-in LZ codec

OpenZL now includes its own LZ codec, represented as ZL_GRAPH_LZ, as well as a sequential compression profile in the zli utility. Work on the codec is ongoing, expanding its feature set and improving performance when processing small input data. Currently, it supports functionality equivalent to zstd level 1, with a 64 KB compression window.

OpenZL allows each stage of the LZ pipeline to be redesigned for speed. Its graph architecture also makes it possible to choose and combine entropy-coding stages per use case, rather than relying on a single one-size-fits-all pipeline. Multiple stages can then be fused into a single operation to improve processing speed. As a result, OpenZL achieves about 10% faster compression and 70% faster decompression than Zstandard level 1 on the Silesia corpus. Results from the project's tests:

Compressor                         Compression ratio  Compression speed  Decompression speed
OpenZL LZ level 1                  2.74               466 MB/s           2288 MB/s
Zstd level 1 with 64K window size  2.74               419 MB/s           1254 MB/s
Zstd level 1                       2.89               424 MB/s           1345 MB/s
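
The percentages quoted above can be checked directly against the figures in the table. The short script below only redoes that arithmetic; the dictionary keys and the helper name are chosen for this example.

```python
# (compression ratio, compression MB/s, decompression MB/s), taken from the table
results = {
    "OpenZL LZ level 1": (2.74, 466, 2288),
    "Zstd level 1, 64K window": (2.74, 419, 1254),
    "Zstd level 1": (2.89, 424, 1345),
}

def speedup_percent(new, baseline):
    """Relative speed gain of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0

_, openzl_comp, openzl_decomp = results["OpenZL LZ level 1"]

for name in ("Zstd level 1", "Zstd level 1, 64K window"):
    _, comp, decomp = results[name]
    print(
        f"vs {name}: "
        f"compression +{speedup_percent(openzl_comp, comp):.1f}%, "
        f"decompression +{speedup_percent(openzl_decomp, decomp):.1f}%"
    )
```

Against default Zstd level 1 this gives roughly 10% and 70%, matching the text. Against Zstd restricted to the same 64K window, where the compression ratios are equal, the gains are somewhat larger.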

Support for very large input data

zli now supports processing huge input data (several gigabytes in size). Before compression, such data is automatically split into manageably sized chunks (approximately 16 MB by default), which limits memory usage, improves data locality, and enables parallel processing. SDDL2 implements a similar automatic chunking feature when working with a schema. New segmenters were created or updated in the process (for CSV, Parquet, and standard numeric data), and all segmenters are now serializable and configurable, so the chosen layout can be saved in the compressor and reused later.

This is applied transparently during compression. Note that the training pipeline is different and remains unaffected, so it's not designed to accept gigantic input data as training material.
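
The chunking described in this section can be sketched in a few lines. This is not how zli or the OpenZL segmenters are implemented: zlib is a stand-in compressor, the fixed-size split ignores record boundaries that a real segmenter would respect, and the function names are invented. Only the 16 MB default comes from the text.

```python
import zlib

CHUNK_SIZE = 16 * 1024 * 1024  # roughly the default chunk size mentioned above

def iter_chunks(data, chunk_size=CHUNK_SIZE):
    """Yield consecutive slices of `data` without copying the input."""
    view = memoryview(data)
    for start in range(0, len(view), chunk_size):
        yield view[start:start + chunk_size]

def compress_chunked(data, chunk_size=CHUNK_SIZE):
    """Compress each chunk independently, bounding the working set per step."""
    return [zlib.compress(chunk) for chunk in iter_chunks(data, chunk_size)]

def decompress_chunked(frames):
    """Decompress the chunks in order and concatenate the results."""
    return b"".join(zlib.decompress(frame) for frame in frames)
```

Because each chunk is compressed on its own, the list comprehension could be replaced by a worker pool to process chunks in parallel, at the cost of losing matches that span chunk boundaries.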

Improvements in the online graph visualizer

The visualizer now recognizes compression and decompression traces from start to finish.

The stream preview panel lets you see the bytes actually flowing along each edge, and trimming controls keep even large streams easy to work with.

The settings panel brings all display options together in one place, and a full set of hotkeys—directional navigation, ordered traversal, expanding and collapsing, and node selection—allows you to conveniently work with the tool without a mouse.

Traces are now versioned, block-based compression is displayed correctly, and zli can finally generate its own traces using the new --trace and --trace-streams-dir flags.

Miscellanea

  • Several codecs have been added to the catalog. The partition and bitpack codecs now use a unified decoder. The floating-point bitsplit codec now includes dedicated encoders and decoders for the fp16, fp32, fp64, and bf16 formats with specialized acceleration. Range-aware splitting (split_byrange), a length multiplexer, the sentinel codec, an lz4 graph, and small helper functions such as tryParseInt and splitByParam have been added.
  • The API has been streamlined.
  • Fuzz testing has been improved.
  • The build and packaging process has been improved for more platforms.

Source: linux.org.ru