Google has announced the open-source release of Magika, a project that identifies the type of content in a file by analyzing its data. Magika can accurately recognize programming languages, compression methods, installation packages, executable code, markup types, and audio, video, document, and image formats. The toolkit and its machine learning model are published under the Apache 2.0 license.
Magika differs from similar projects that detect MIME types from content in its use of machine learning, its high performance, and its detection accuracy. The model was trained with the Keras framework on 25 million sample files and recognizes 116 data types with at least 99% accuracy. It is distributed in the ONNX format and is only 1 MB in size. Moving to deep learning improved detection accuracy by 50% compared with the system Google used previously, which relied on manually defined rules.

At Google, the system is used to classify files in the Gmail, Drive, Code Insight, and Safe Browsing services as part of security and compliance checks. Work is underway to integrate Magika into the VirusTotal platform as a first-stage file filter that runs before specialized analyzers. The Magika configuration deployed in Google's infrastructure scans several million files per second and several hundred billion files per week. Once the model is loaded, inference takes 5-6 ms on a single CPU core, and detection time is virtually independent of file size.
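One plausible way for detection time to stay nearly constant regardless of file size is for the model to look at a fixed-size sample of each file rather than its full content. The Python sketch below only illustrates that idea: the `extract_features` helper, the 512-byte block size, and the padding token are illustrative assumptions and are not taken from Magika's implementation.

```python
import os
from typing import List

BLOCK_SIZE = 512  # hypothetical number of bytes sampled from each end of the file
PADDING = 256     # hypothetical token (outside the 0-255 byte range) meaning "no byte here"


def extract_features(path: str, block_size: int = BLOCK_SIZE) -> List[int]:
    """Build a fixed-length vector from the first and last bytes of a file.

    Only two small reads are performed, so the cost does not grow
    with the size of the file.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block_size)
        # Jump to the last block_size bytes (or to the start for small files).
        f.seek(max(0, size - block_size))
        tail = f.read(block_size)

    def pad(chunk: bytes) -> List[int]:
        # Pad short chunks so every file yields a vector of the same length.
        return list(chunk) + [PADDING] * (block_size - len(chunk))

    return pad(head) + pad(tail)
```

Under these assumptions, a 3-byte file and a multi-gigabyte file both produce a vector of 1024 values, so a model consuming that vector does the same amount of work for each.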
For use in other projects, Magika is available as a command-line utility, a Python package, and a JavaScript library that runs in the browser or in Node.js. The command-line interface and the API support batch operation, so multiple files can be scanned in a single request. There is also a recursive mode for scanning an entire directory, along with three prediction modes that adjust error tolerance: high confidence, medium confidence, and best guess.
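The three prediction modes can be read as different confidence cut-offs applied to the model's top-scoring prediction. The Python sketch below is a minimal illustration of that trade-off, not Magika's actual logic: the `resolve_label` helper, the threshold values, and the `unknown` fallback label are all hypothetical.

```python
from enum import Enum
from typing import Dict


class PredictionMode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical minimum scores required to accept the top prediction.
THRESHOLDS = {
    PredictionMode.HIGH_CONFIDENCE: 0.95,
    PredictionMode.MEDIUM_CONFIDENCE: 0.50,
    PredictionMode.BEST_GUESS: 0.0,
}

GENERIC_LABEL = "unknown"  # hypothetical fallback when the model is not sure enough


def resolve_label(scores: Dict[str, float], mode: PredictionMode) -> str:
    """Return the top-scoring label, or a generic fallback if it is below the cut-off."""
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= THRESHOLDS[mode] else GENERIC_LABEL


# Example: the same model output under the three modes.
scores = {"python": 0.70, "javascript": 0.30}
for mode in PredictionMode:
    print(mode.value, "->", resolve_label(scores, mode))
```

A stricter mode returns fewer wrong labels at the cost of more generic answers, while the best-guess mode always reports the most likely type.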

Source: opennet.ru
