Mozilla Common Voice 7.0 Voice Update

NVIDIA and Mozilla have released an update to their Common Voice datasets, which now include speech samples from 182 thousand people, a 25% increase over the past six months. The data is published in the public domain (CC0). The datasets can be used in machine learning systems to build speech recognition and speech synthesis models.

Compared to the previous update, the amount of speech material in the collection has grown from 9 thousand to 13.9 thousand hours. The number of supported languages has increased from 60 to 76, adding support for Belarusian, Kazakh, Uzbek, Bulgarian, Armenian, Azerbaijani and Bashkir for the first time. The Russian-language set now covers 2136 participants and 173 hours of speech (up from 1412 participants and 111 hours), and the Ukrainian-language set 615 participants and 66 hours (up from 459 participants and 30 hours).

More than 75 thousand people took part in preparing the English-language material, contributing 2637 hours of verified speech (up from 66 thousand participants and 1686 hours). Interestingly, the language in second place by volume of accumulated data is Kinyarwanda, for which 2260 hours have been collected. It is followed by German (1040 hours), Catalan (920) and Esperanto (840). Among the fastest-growing languages are Thai (a 20-fold increase, from 12 to 250 hours), Luganda (from 8 to 80 hours), Esperanto (from 100 to 840 hours) and Tamil (from 24 to 220 hours).

As part of its participation in the Common Voice project, NVIDIA has prepared pre-trained models (built on PyTorch) from the collected data. The models are distributed as part of the free and open NVIDIA NeMo toolkit, which is already used, for example, in the automated voice services of MTS and Sberbank. The models are intended for speech recognition, speech synthesis and natural language processing systems, and may be useful to researchers building voice-driven dialogue systems, transcription platforms and automated call centers. Unlike previously available projects, the published models are not limited to English and cover a variety of languages, accents and speaking styles.
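For reference, a minimal sketch of how a pre-trained speech recognition model might be loaded with the NeMo toolkit; it assumes the nemo_toolkit[asr] package is installed, and the model name used below is purely illustrative:

import nemo.collections.asr as nemo_asr

# Download an illustrative pre-trained ASR checkpoint from NVIDIA's model catalog
# (the exact model name depends on the language and architecture you need).
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_quartznet15x5")

# Transcribe a local audio file (16 kHz mono WAV is the usual expectation).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])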

Recall that the Common Voice project aims to organize a collaborative effort to build a database of voice samples reflecting the full diversity of voices and manners of speech. Users are invited to read phrases shown on the screen or to rate the quality of recordings added by other users. The accumulated database of recordings of typical phrases, covering many pronunciations, can be used without restriction in machine learning systems and research projects.

According to the author of the Vosk continuous speech recognition library, the drawbacks of the Common Voice set are the skewed demographics of the voice material (a predominance of men aged 20-30 and a shortage of recordings of women, children and the elderly), the lack of vocabulary variability (the same phrases are repeated) and the distribution of recordings in the lossy MP3 format.

Source: opennet.ru
