Google publishes Lyra audio codec for voice transmission over poor-quality connections

Google has introduced Lyra, a new audio codec optimized to deliver the best possible voice quality even over very slow communication channels. The Lyra implementation is written in C++ and released under the Apache 2.0 license, but its runtime dependencies include a proprietary library, libsparse_inference.so, which implements the mathematical kernels. Google states that the proprietary library is temporary: an open replacement is planned, along with support for additional platforms.

At low bitrates, Lyra delivers significantly better voice quality than traditional codecs based on digital signal processing. To achieve high-quality voice transmission under tight bandwidth constraints, Lyra supplements conventional audio compression and signal transformation techniques with a speech model based on machine learning, which reconstructs missing information from typical speech characteristics. The model used to generate the sound was trained on several thousand hours of voice recordings in more than 70 languages.

The codec consists of an encoder and a decoder. Every 40 milliseconds the encoder extracts parameters from the voice signal, compresses them, and transmits them to the recipient over the network; a communication channel of 3 kilobits per second is sufficient. The extracted parameters are log-mel spectrograms, which capture the distribution of speech energy across frequency bands and are computed with the characteristics of human auditory perception in mind.
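
As a rough illustration of the numbers above, the following sketch (not Lyra's actual code: the 16 kHz sample rate, the 80 mel bins, and the naive O(N^2) DFT are assumptions made for readability) computes the bit budget of one 40 ms frame on a 3 kbit/s channel and a log-mel feature vector for that frame:

// Standalone sketch: per-frame bit budget and a naive log-mel computation.
// Assumptions: 16 kHz input, 80 mel bins, textbook DFT instead of an FFT.
#include <cmath>
#include <cstdio>
#include <vector>

constexpr double kPi = 3.14159265358979323846;
constexpr int kSampleRateHz = 16000;  // assumed wideband speech rate
constexpr int kFrameMs = 40;          // Lyra extracts features every 40 ms
constexpr int kBitrateBps = 3000;     // 3 kbit/s channel
constexpr int kFrameSamples = kSampleRateHz * kFrameMs / 1000;  // 640 samples
constexpr int kBitsPerFrame = kBitrateBps * kFrameMs / 1000;    // 120 bits

double HzToMel(double hz) { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double MelToHz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Power spectrum of one frame via the textbook O(N^2) DFT (an FFT in practice).
std::vector<double> PowerSpectrum(const std::vector<double>& frame) {
  const int n = static_cast<int>(frame.size());
  std::vector<double> power(n / 2 + 1);
  for (int k = 0; k <= n / 2; ++k) {
    double re = 0.0, im = 0.0;
    for (int t = 0; t < n; ++t) {
      const double phase = -2.0 * kPi * k * t / n;
      re += frame[t] * std::cos(phase);
      im += frame[t] * std::sin(phase);
    }
    power[k] = re * re + im * im;
  }
  return power;
}

// Triangular mel filterbank followed by log compression.
std::vector<double> LogMel(const std::vector<double>& power, int num_bins) {
  const int n_fft_bins = static_cast<int>(power.size());
  const double mel_lo = HzToMel(0.0), mel_hi = HzToMel(kSampleRateHz / 2.0);
  std::vector<double> edges(num_bins + 2);
  for (int i = 0; i < num_bins + 2; ++i)
    edges[i] = MelToHz(mel_lo + (mel_hi - mel_lo) * i / (num_bins + 1));
  std::vector<double> out(num_bins);
  for (int b = 0; b < num_bins; ++b) {
    double acc = 0.0;
    for (int k = 0; k < n_fft_bins; ++k) {
      const double hz = 0.5 * kSampleRateHz * k / (n_fft_bins - 1);
      double w = 0.0;
      if (hz >= edges[b] && hz <= edges[b + 1])
        w = (hz - edges[b]) / (edges[b + 1] - edges[b]);
      else if (hz > edges[b + 1] && hz <= edges[b + 2])
        w = (edges[b + 2] - hz) / (edges[b + 2] - edges[b + 1]);
      acc += w * power[k];
    }
    out[b] = std::log(acc + 1e-10);  // log of the mel-band energy
  }
  return out;
}

int main() {
  std::vector<double> frame(kFrameSamples, 0.0);
  frame[0] = 1.0;  // unit impulse as a stand-in for real speech samples
  const auto features = LogMel(PowerSpectrum(frame), 80);
  std::printf("frame: %d samples, budget: %d bits (%d bytes), %zu mel bins\n",
              kFrameSamples, kBitsPerFrame, kBitsPerFrame / 8, features.size());
}

The arithmetic makes the constraint concrete: at 3 kbit/s, each 40 ms frame leaves only 120 bits (15 bytes) for the compressed parameters, which is why compact spectral features are transmitted rather than waveform samples.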

The decoder uses a generative model that reconstructs the speech signal from the transmitted parameters. To keep the computational cost down, it is a lightweight recurrent neural network, a variant of the WaveRNN speech-synthesis model, that runs at a lower sampling rate but generates several signals in parallel in different frequency bands. The resulting subband signals are then combined into a single output signal at the target sampling rate.
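
The sketch below shows only the shape of this multi-band scheme (the band count is an assumption, GenerateSubbandStep is a stub standing in for the recurrent network, and the naive sample recombination replaces the synthesis filterbank a real implementation would use):

// Structural sketch of multi-band generation, not Lyra's implementation.
// Each model step emits one sample per band, so the output advances by
// kNumBands samples per step and the network runs at 1/kNumBands of the
// output sampling rate.
#include <array>
#include <cstddef>
#include <vector>

constexpr int kNumBands = 4;  // assumed number of frequency bands

// Stand-in for the recurrent network: one new sample per band, conditioned
// on the transmitted log-mel features.
std::array<float, kNumBands> GenerateSubbandStep(
    const std::vector<float>& features, std::size_t step) {
  (void)features;
  (void)step;
  return {};  // RNN inference would go here; zeros keep the sketch runnable
}

std::vector<float> Synthesize(const std::vector<float>& features,
                              std::size_t num_steps) {
  std::vector<float> output(num_steps * kNumBands, 0.0f);
  for (std::size_t step = 0; step < num_steps; ++step) {
    const auto bands = GenerateSubbandStep(features, step);
    for (int b = 0; b < kNumBands; ++b) {
      // Naive recombination for illustration; a real codec applies synthesis
      // filters so the subbands sum to the intended full-band spectrum.
      output[step * kNumBands + b] = bands[b];
    }
  }
  return output;
}

The point of the parallel-band design is that each sequential step of the recurrent model yields several output samples at once, which keeps the per-sample compute low enough for real-time synthesis.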

For additional acceleration, the implementation uses specialized instructions available in 64-bit ARM processors. As a result, despite its use of machine learning, Lyra can encode and decode speech in real time on mid-range smartphones, with an end-to-end signal delay of 90 milliseconds.
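
To make the real-time constraint concrete, here is a hedged sketch of an encode/decode loop (the commented-out Encode and Decode calls are hypothetical placeholders, not the published Lyra API): with 40 ms frames, each processing cycle must finish well within 40 ms, or playback falls behind live audio.

// Sketch of the real-time processing loop; Encode/Decode are hypothetical
// placeholders for the codec calls, not the actual Lyra API.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  using Clock = std::chrono::steady_clock;
  constexpr int kFrameMs = 40;
  std::vector<float> frame(640);          // 40 ms of audio at an assumed 16 kHz
  std::vector<std::uint8_t> payload(15);  // 120 bits per frame at 3 kbit/s

  for (int i = 0; i < 250; ++i) {         // 10 seconds of audio
    const auto start = Clock::now();
    // payload = Encode(frame);   // hypothetical codec calls go here
    // frame   = Decode(payload);
    const auto elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - start).count();
    if (elapsed_ms >= kFrameMs)  // cycle slower than real time: audio stalls
      std::printf("frame %d missed its %d ms deadline (%lld ms)\n",
                  i, kFrameMs, static_cast<long long>(elapsed_ms));
  }
  std::printf("payload per frame: %zu bytes\n", payload.size());
}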

Source: opennet.ru
