Google has released Magika 1.0, a toolkit for identifying file content types.

Magika 1.0 identifies file content types by analyzing file data: it can recognize programming languages, compression formats, installation packages, executable code, markup types, and audio, video, document, and image formats. The toolkit and its machine learning model are licensed under Apache 2.0, and bindings are available for Rust, Python, JavaScript/TypeScript, and Go.

Magika differs from similar projects that detect MIME types from content in its use of machine learning, its performance, and its detection accuracy. The model was trained with the Keras framework on 100 million sample files (over 3 TB of data) and recognizes about 200 content types with at least 99% accuracy. It is distributed in the ONNX format and is only a few megabytes in size. The switch to deep learning improved detection accuracy by 50% compared with Google's previous system, which relied on manually defined rules.
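One reason such a model can stay small and fast is that it is typically fed a fixed-size window of bytes rather than the whole file. The sketch below illustrates this general approach; the window size, padding value, and function names are illustrative assumptions, not Magika's actual implementation.

```python
# Sketch of fixed-size byte-window feature extraction, as used by
# content-type classifiers like Magika's. All constants here are
# illustrative assumptions, not Magika's real parameters.

def extract_features(data: bytes, window: int = 512, pad: int = 0) -> list[int]:
    """Take fixed-size byte windows from the start, middle, and end of
    the content, padding short segments so the model input shape is
    constant regardless of file size."""
    def take(segment: bytes) -> list[int]:
        return list(segment[:window]) + [pad] * max(0, window - len(segment))

    mid = max(0, len(data) // 2 - window // 2)
    return (
        take(data[:window])             # beginning of file
        + take(data[mid:mid + window])  # middle of file
        + take(data[-window:])          # end of file
    )

features = extract_features(b"#!/usr/bin/env python3\nprint('hi')\n")
assert len(features) == 3 * 512  # same input size for any file
```

Because the model only ever sees these fixed windows, inference cost does not grow with file size, which matches the behavior described below.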

At Google, the system is used to classify files in the Gmail, Drive, Code Insight, and Safe Browsing services during security and compliance checks. Magika is also integrated into the VirusTotal and abuse.ch platforms as a primary filtering layer before specific analyzers are run. The Magika configuration deployed in Google's infrastructure scans files at rates of up to several million per second, amounting to several hundred billion files per week. Once the model is loaded, inference takes about 5 ms on a single CPU core, and detection time is virtually independent of file size.
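A quick back-of-envelope check shows how these figures relate; the concrete numbers below (such as 500 billion files per week) are illustrative stand-ins for the article's "several hundred billion", not exact Google statistics.

```python
# Rough arithmetic on the deployment figures above; 500e9 is an
# illustrative stand-in for "several hundred billion files per week".
per_file_s = 0.005                       # ~5 ms inference on one CPU core
per_core_rate = 1 / per_file_s           # 200 files/sec per core

week_s = 7 * 24 * 3600                   # 604_800 seconds in a week
avg_rate = 500e9 / week_s                # ~827_000 files/sec on average
cores_needed = avg_rate / per_core_rate  # ~4_100 cores for inference alone
```

The average rate of roughly 0.8 million files per second is consistent with peaks of "several million per second", since the weekly figure averages over quiet periods.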

The project provides a command-line utility, packages for Python, Rust, and Go, and a JavaScript library that runs in the browser or in Node.js. The command-line interface and the API support batch operation, allowing multiple files to be scanned in a single request. There is also a recursive mode for scanning entire directories and three prediction modes for tuning error tolerance (high confidence, medium confidence, and best guess).
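The three prediction modes can be thought of as confidence thresholds on the model's score. The sketch below shows one plausible way such modes could work; the threshold values, mode names, and fallback label are assumptions for illustration, not Magika's actual internals.

```python
# Hypothetical sketch of prediction modes as confidence thresholds.
# Threshold values and the fallback label are assumptions, not
# Magika's real implementation details.

GENERIC_FALLBACK = "txt"  # illustrative generic label for low-confidence results

THRESHOLDS = {
    "high-confidence": 0.99,    # only report very confident predictions
    "medium-confidence": 0.50,  # tolerate moderate uncertainty
    "best-guess": 0.0,          # always report the top prediction
}

def resolve(label: str, score: float, mode: str = "high-confidence") -> str:
    """Return the predicted label, or a generic fallback when the
    model's score does not meet the selected mode's threshold."""
    return label if score >= THRESHOLDS[mode] else GENERIC_FALLBACK

assert resolve("python", 0.97, "high-confidence") == "txt"
assert resolve("python", 0.97, "medium-confidence") == "python"
```

Stricter modes trade recall for precision: "high confidence" suppresses uncertain answers, while "best guess" always commits to the model's top prediction.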

The project was initially developed in Python, but in preparation for the 1.0 release the content type detection engine was rewritten in Rust, achieving higher performance while preserving memory safety. The ONNX Runtime framework is used to run the machine learning model, and the Tokio library handles parallel, asynchronous request processing. On a MacBook Pro (M4), the engine processes approximately 1,000 files per second.

Beyond the new engine, the 1.0 release expands the number of supported types from approximately 100 to 200; adds a new command-line client written in Rust; improves precision when detecting text formats such as configuration files and source code; and reworks the Python and TypeScript modules to simplify integration with other projects. Newly supported content types include formats used in machine learning and AI; the Swift, Kotlin, TypeScript, Dart, Solidity, WebAssembly, and Zig programming languages; DevOps artifacts (Dockerfiles, TOML, HashiCorp configuration files, Bazel build files, and YARA rules); SQLite databases; AutoCAD files (dwg, dxf); Adobe Photoshop files (psd); and fonts (woff, woff2). Differentiation between C and C++ and between JavaScript and TypeScript has also been improved.

Source: opennet.ru
