Mozilla unveils DeepSpeech 0.6 speech recognition engine

A release of DeepSpeech 0.6, the speech recognition engine developed by Mozilla, has been published. It implements the speech recognition architecture of the same name proposed by researchers from Baidu. The implementation is written in Python using the TensorFlow machine learning framework and is distributed under the free MPL 2.0 license. It supports Linux, Android, macOS, and Windows, and its performance is sufficient to run the engine on Le Potato, Raspberry Pi 3, and Raspberry Pi 4 boards.

The release also includes trained models, sample audio files, and tools for recognition from the command line. To embed speech recognition in your own programs, ready-to-use modules for Python, NodeJS, C++, and .NET are offered (third-party developers have prepared modules for Rust and Go). A ready-made model is supplied only for English; for other languages, following the attached instructions, you can train the system yourself using the voice data collected by the Common Voice project.
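
For example, here is a minimal sketch of transcribing a short 16 kHz mono WAV file with the Python module. The file paths are placeholders, and the beam width and language model weights are illustrative values; check deepspeech.readthedocs.io for the exact 0.6 signatures of Model and enableDecoderWithLM.

    import wave

    import numpy as np
    from deepspeech import Model

    # Placeholder paths for the files shipped with the 0.6 English model package.
    MODEL_PATH = "deepspeech-0.6.0-models/output_graph.pbmm"
    LM_PATH = "deepspeech-0.6.0-models/lm.binary"
    TRIE_PATH = "deepspeech-0.6.0-models/trie"

    # Illustrative beam width and language-model weights.
    model = Model(MODEL_PATH, 500)
    model.enableDecoderWithLM(LM_PATH, TRIE_PATH, 0.75, 1.85)

    # DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
    with wave.open("sample.wav", "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    print(model.stt(audio))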

DeepSpeech is much simpler than traditional systems while providing higher-quality recognition in the presence of extraneous noise. It does not use traditional acoustic models and the concept of phonemes; instead, it relies on a well-optimized neural-network-based machine learning system, which eliminates the need to develop separate components for modeling deviations such as noise, echo, and speech peculiarities.

The downside of this approach is that high-quality recognition and neural network training require a large amount of heterogeneous data, dictated in real-world conditions by different voices and in the presence of natural noise.
Such data is collected by the Common Voice project created by Mozilla, which provides a validated dataset with 780 hours of English, 325 of German, 173 of French, and 27 hours of Russian.

The ultimate goal of the Common Voice project is to accumulate 10 thousand hours of recordings of various pronunciations of typical human speech phrases, which should make it possible to reach an acceptable level of recognition errors. So far, project participants have dictated a total of 4.3 thousand hours, of which 3.5 thousand have been validated. Training the final English model for DeepSpeech used 3,816 hours of speech, covering, in addition to Common Voice, data from the LibriSpeech, Fisher, and Switchboard projects, as well as about 1,700 hours of transcribed radio show recordings.

With the ready-made English model offered for download, the recognition error rate in DeepSpeech is 7.5% as measured on the LibriSpeech test set. For comparison, the error rate of human recognition is estimated at 5.83%.
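
The error rate here is the word error rate (WER): the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words. A minimal illustrative computation (not the project's own evaluation code):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level edit distance divided by the reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Classic dynamic-programming edit distance over words.
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[len(ref)][len(hyp)] / len(ref)

    # One substitution out of six reference words -> WER of about 16.7%.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))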

DeepSpeech consists of two subsystems: an acoustic model and a decoder. The acoustic model uses deep machine learning methods to calculate the probability of certain characters being present in the input sound, and the decoder uses a beam search algorithm to convert the character probability data into a text representation.
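
As a rough illustration of that split, the sketch below decodes a toy matrix of per-timestep character probabilities (the acoustic model's output) with a simplified beam search. It is not DeepSpeech's actual decoder: the real one additionally merges prefixes that collapse to the same text and mixes in language model scores.

    import numpy as np

    ALPHABET = [" ", "a", "b", "c", "_"]   # toy alphabet; "_" is the CTC blank symbol

    def beam_search_decode(probs, beam_width=4):
        """Decode a (timesteps x symbols) probability matrix into text.

        Keeps the `beam_width` most probable label sequences at every
        timestep, then collapses repeats and blanks at the end.
        """
        beams = [("", 0.0)]                      # (label sequence, log probability)
        for step in probs:
            candidates = []
            for seq, logp in beams:
                for idx, p in enumerate(step):
                    candidates.append((seq + ALPHABET[idx], logp + np.log(p + 1e-12)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]
        best_seq, _ = beams[0]
        # CTC collapse: drop repeated symbols, then remove blanks.
        collapsed = [c for i, c in enumerate(best_seq) if i == 0 or c != best_seq[i - 1]]
        return "".join(c for c in collapsed if c != "_")

    # Toy acoustic-model output: 4 timesteps, probabilities over the alphabet.
    toy_probs = np.array([
        [0.1, 0.7, 0.1, 0.05, 0.05],   # mostly "a"
        [0.1, 0.6, 0.1, 0.1,  0.1],    # "a" again (repeat collapses)
        [0.1, 0.1, 0.1, 0.1,  0.6],    # blank
        [0.1, 0.1, 0.7, 0.05, 0.05],   # "b"
    ])
    print(beam_search_decode(toy_probs))        # -> "ab"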

Key innovations in DeepSpeech 0.6 (the 0.6 branch is not backwards compatible and requires code and models to be updated):

  • A new streaming decoder has been introduced that provides higher responsiveness and does not depend on the size of the processed audio data. As a result, the new version of DeepSpeech reduces recognition latency to 260 ms, 73% faster than before, allowing DeepSpeech to be used in on-the-fly speech recognition solutions (a usage sketch follows this list).
  • The API has been changed and function names have been unified. Functions have been added for obtaining additional timing metadata, making it possible not only to receive a text representation as output, but also to track how individual characters and sentences map to positions in the audio stream.
  • Support for the CuDNN library has been added to the model-training toolkit to optimize work with recurrent neural networks (RNN), giving a significant (roughly two-fold) increase in training performance, but requiring code changes that broke compatibility with previously prepared models.
  • The minimum required TensorFlow version has been raised from 1.13.1 to 1.14.0. Support for the lightweight TensorFlow Lite edition has been added, reducing the DeepSpeech package size from 98 MB to 3.7 MB. For embedded and mobile devices, the size of the packed model file has also been reduced from 188 MB to 47 MB (post-training quantization was used for compression).
  • The language model has been converted to a different data structure format that allows files to be memory-mapped at load time. Support for the old format has been discontinued.
  • The way the language model file is loaded has been changed, reducing memory consumption and the delay in processing the first request after the model is created. DeepSpeech now consumes 22 times less memory while running and starts up 500 times faster.

  • Rare words have been filtered out of the language model. The vocabulary has been reduced to the 500 thousand most popular words found in the text used to train the model. The cleanup reduced the size of the language model from 1800 MB to 900 MB, with almost no effect on the recognition error rate.
  • Added support for various techniques for creating additional variations (augmentation) of the audio data used in training (for example, adding distorted or noisy versions to the training set).
  • Added a library with bindings for integration with applications based on the .NET platform.
  • Redesigned documentation, which is now collected on a separate site deepspeech.readthedocs.io.
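
Below is a minimal sketch of the streaming mode mentioned in the first list item, feeding audio in chunks through the Python bindings. The method names (createStream, feedAudioContent, intermediateDecode, finishStream) follow the 0.6 release materials, but verify them against deepspeech.readthedocs.io; the WAV file here merely stands in for a live audio source.

    import wave

    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)

    # A stream context lets audio be fed as it arrives instead of all at once.
    stream = model.createStream()

    with wave.open("sample.wav", "rb") as wav:        # stand-in for a live source
        while True:
            frames = wav.readframes(1024)             # ~64 ms of audio at 16 kHz
            if not frames:
                break
            model.feedAudioContent(stream, np.frombuffer(frames, dtype=np.int16))
            print("partial:", model.intermediateDecode(stream))

    print("final:", model.finishStream(stream))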

Source: opennet.ru
