Mozilla Common Voice 8.0 Update

Mozilla has released an update to the Common Voice speech datasets, which include pronunciation samples from around 200 thousand people. The data is released into the public domain (CC0). The datasets can be used in machine learning systems to build speech recognition and synthesis models. Compared to the previous update, the volume of speech material in the collection has grown by about 30%, from 13.9 to 18.2 thousand hours. The number of supported languages has increased from 67 to 87.

The Russian set includes 2452 participants and 193 hours of speech material (previously 2136 participants and 173 hours), the Belarusian set 6160 participants and 987 hours (previously 3831 participants and 356 hours), and the Ukrainian set 684 participants and 76 hours (previously 615 participants and 66 hours). More than 79 thousand people contributed to the English material, dictating 2886 hours of validated speech (previously 75 thousand participants and 2637 hours).
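The growth figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the hour counts quoted in this article; the names `HOURS`, `growth_percent`, and `growth_table` are made up for illustration and are not part of any Common Voice tooling.

```python
# Hours of speech as quoted in the article: (previous release, current release).
# The overall totals are given in thousands of hours (13.9 and 18.2) and are
# written out here in hours.
HOURS = {
    "all languages": (13900, 18200),
    "Russian": (173, 193),
    "Belarusian": (356, 987),
    "Ukrainian": (66, 76),
    "English": (2637, 2886),
}


def growth_percent(previous, current):
    """Return the relative growth from previous to current, in percent."""
    return (current - previous) / previous * 100.0


def growth_table(figures):
    """Map each dataset name to its growth in percent, rounded to one decimal."""
    return {
        name: round(growth_percent(prev, cur), 1)
        for name, (prev, cur) in figures.items()
    }


if __name__ == "__main__":
    for name, pct in growth_table(HOURS).items():
        print(f"{name}: {pct:+.1f}%")
```

With these inputs the overall growth comes out at roughly 31%, which is consistent with the rounded 30% quoted above, while the Belarusian set stands out with the number of hours nearly tripling.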

As a reminder, the Common Voice project aims to organize a collaborative effort to build a database of voice samples that reflects the full diversity of voices and speaking styles. Users are asked to read aloud phrases displayed on the screen or to rate the quality of recordings added by other users. The resulting database of recordings of various pronunciations of typical phrases can be used without restrictions in machine learning systems and research projects. According to the author of the Vosk continuous speech recognition library, the shortcomings of the Common Voice dataset are the one-sidedness of the voice material (a predominance of men aged 20 to 30 and a lack of material with the voices of women, children, and the elderly), the limited vocabulary variety (the same phrases are repeated), and the distribution of recordings in the lossy MP3 format, which introduces distortion.

Also worth noting is the release of the NVIDIA NeMo 1.6 toolkit, which provides machine learning methods for building speech recognition, speech synthesis, and natural language processing systems. NeMo includes ready-made, pre-trained PyTorch models prepared by NVIDIA using Common Voice speech data and covering various languages, accents, and forms of speech. The models can be useful for researchers building voice dialogue systems, transcription platforms, and automated call centers. For example, NVIDIA NeMo is used in the automated voice services of MTS and Sberbank. The NeMo code is written in Python using PyTorch and is distributed under the Apache 2.0 license.

Source: opennet.ru
