Facebook publishes EnCodec audio codec using machine learning

Meta/Facebook (banned in Russia) has introduced a new audio codec called EnCodec, which uses machine learning to increase the compression ratio without losing quality. The codec can be used both for real-time audio streaming and for encoding audio for later storage in files. The EnCodec reference implementation is written in Python using the PyTorch framework and is distributed under the CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial) license, which permits non-commercial use only.

Two ready-made models are offered for download:

  • A causal model using a 24 kHz sample rate, supporting only monophonic audio, and trained on a variety of audio data (suitable for speech coding). The model can encode audio for transmission at bitrates of 1.5, 3, 6, 12 and 24 kbps.
  • A non-causal model using a 48 kHz sample rate, supporting stereo sound, and trained on music only. The model supports bitrates of 3, 6, 12 and 24 kbps.
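The bitrate steps listed above double from one setting to the next, which fits a design where each step doubles the number of quantizer codebooks in use. The following is a minimal sketch of that arithmetic, not the project's own code. The frame rates (75 frames per second for the 24 kHz model, 150 for the 48 kHz model) and the codebook size of 1024 entries are assumptions taken from the project's published description rather than from this article, and the function name is hypothetical.

```python
import math

def codebooks_for_bitrate(kbps, frame_rate_hz, codebook_size=1024):
    """Estimate how many codebooks fit into a target bitrate.

    Assumes every codebook index costs log2(codebook_size) bits and that
    one index per codebook is emitted for every frame.
    """
    bits_per_index = math.log2(codebook_size)        # 10 bits for 1024 entries
    bits_per_frame = kbps * 1000 / frame_rate_hz     # bit budget of one frame
    return round(bits_per_frame / bits_per_index)

# 24 kHz model, assumed 75 frames per second
for kbps in (1.5, 3, 6, 12, 24):
    print(f"24 kHz, {kbps} kbps: {codebooks_for_bitrate(kbps, 75)} codebooks")

# 48 kHz model, assumed 150 frames per second
for kbps in (3, 6, 12, 24):
    print(f"48 kHz, {kbps} kbps: {codebooks_for_bitrate(kbps, 150)} codebooks")
```

Under these assumptions the 48 kHz model uses half as many codebooks as the 24 kHz model at the same bitrate, because it emits twice as many frames per second.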

For each model, an additional language model is provided that further increases the compression ratio (by up to 40%) without loss of quality. Unlike earlier projects that applied machine learning to audio compression, EnCodec can be used not only for speech coding but also for music compression at a 48 kHz sampling rate, slightly above the 44.1 kHz used by audio CDs. According to the developers, EnCodec achieves roughly a tenfold higher compression ratio than MP3 at 64 kbps while maintaining the same level of quality: audio that requires 64 kbps in MP3 can be transmitted with the same quality in EnCodec at 6 kbps.
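The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers given in the article (64 kbps for MP3, 6 kbps for EnCodec, and a saving of up to 40% from the language model); the function names are hypothetical, and the 40% value is the stated best case, not a typical result.

```python
def compression_gain(reference_kbps, codec_kbps):
    """How many times smaller the stream is compared to the reference."""
    return reference_kbps / codec_kbps

def with_language_model(kbps, saving_fraction):
    """Bitrate after the extra entropy-coding stage removes a fraction of the bits."""
    return kbps * (1.0 - saving_fraction)

mp3_kbps = 64.0
encodec_kbps = 6.0

gain = compression_gain(mp3_kbps, encodec_kbps)
best_case_kbps = with_language_model(encodec_kbps, 0.40)  # upper bound from the article

print(f"gain over MP3: about {gain:.1f}x")
print(f"best case with the language model: about {best_case_kbps:.1f} kbps")
```

The ratio comes out slightly above ten, which matches the "about ten times" claim, and the best-case language-model saving would bring the 6 kbps stream down to roughly 3.6 kbps.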

The codec is built on a neural network and consists of four components: encoder, quantizer, decoder and discriminator (a small "transformer" network is used for the additional language model). The encoder extracts features from the audio data and converts them into a compact stream at a lower frame rate. The quantizer (RVQ, Residual Vector Quantizer) converts the stream produced by the encoder into sets of discrete codes, keeping as much information as the selected bitrate allows. The output of the quantizer is a compressed representation of the data suitable for transmission over the network or saving to disk.
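To illustrate the general idea behind a residual vector quantizer, here is a minimal pure-Python sketch. It is not EnCodec's implementation: the codebooks are random rather than learned, the vector size is tiny, and all function names are hypothetical. Each stage picks the nearest codebook entry to whatever the previous stages failed to capture, so using fewer stages gives a lower bitrate and a coarser approximation.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage codes the residual of the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries; a prefix of the indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(n_stages, size, dim, seed=0):
    """Toy random codebooks (assumption: real codebooks are learned in training).

    Each codebook contains the zero vector, so a stage can never make the
    residual larger, and later stages use a smaller scale for finer detail.
    """
    rng = random.Random(seed)
    books = []
    for stage in range(n_stages):
        scale = 0.5 ** stage
        entries = [[rng.uniform(-scale, scale) for _ in range(dim)]
                   for _ in range(size - 1)]
        books.append([[0.0] * dim] + entries)
    return books

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

codebooks = make_codebooks(n_stages=4, size=16, dim=3)
frame = [0.3, -0.7, 0.5]
codes = rvq_encode(frame, codebooks)

# Decoding with more stages should never increase the reconstruction error.
for k in range(1, len(codes) + 1):
    approx = rvq_decode(codes[:k], codebooks[:k])
    print(k, "stages, squared error:", round(sq_error(frame, approx), 6))
```

Because the encoding is greedy, the first k indices are the same whether k or more stages are used, which is what makes it possible to serve several bitrates from one set of codebooks in a scheme of this kind.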

The decoder takes the compressed representation and reconstructs the sound wave. The discriminator improves the quality of the generated samples, taking into account a model of human auditory perception. Regardless of the quality level and bitrate, the models used for encoding and decoding have rather modest resource requirements: the computation needed for real-time operation can be performed on a single CPU core.

Source: opennet.ru