NVIDIA Invests $1.5M in Mozilla Common Voice Project

NVIDIA is investing $1.5 million in the Mozilla Common Voice project. The interest in speech recognition stems from the expectation that over the next ten years voice technology will become one of the main ways people interact with devices, from computers and phones to digital assistants and retail kiosks.

The performance of voice systems depends heavily on the volume and variety of voice data available for training machine learning models. Today's voice technologies focus mainly on recognizing English and leave out a huge number of languages, accents and speech patterns. The investment will help accelerate the growth of public voice data, engage more communities and volunteers, and increase the number of full-time project staff.

As a reminder, the Common Voice project organizes a collaborative effort to build a database of voice samples that reflects the full diversity of voices and speaking styles. Users are asked to read aloud phrases shown on the screen or to rate the quality of recordings submitted by other users. The resulting database of varied pronunciations of typical phrases can be used without restriction in machine learning systems and research projects.

The Common Voice set currently includes pronunciation samples from more than 164 thousand people. About 9 thousand hours of voice data have been collected in 60 different languages. The Russian set covers 1412 participants and 111 hours of speech, and the Ukrainian set covers 459 participants and 30 hours. For comparison, more than 66 thousand people contributed to the English set, dictating 1686 hours of confirmed speech. The published sets can be used in machine learning systems to build speech recognition and synthesis models. The data is released into the public domain (CC0).
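To put the per-language figures above in proportion, here is a minimal Python sketch that converts them into average recorded minutes per participant. The function and variable names are illustrative and are not part of any Common Voice tooling. The English participant count is quoted only as "more than 66 thousand", so the English result is an upper estimate, and the article specifies "confirmed" hours only for English, so the comparison is approximate.

```python
def minutes_per_participant(hours: float, participants: int) -> float:
    """Return the average number of recorded minutes per participant."""
    if participants <= 0:
        raise ValueError("participants must be positive")
    return hours * 60 / participants


# Figures quoted in the article: (hours of speech, number of participants).
# Assumption: 66_000 stands in for "more than 66 thousand", so the English
# average computed here is an upper bound on the real value.
FIGURES = {
    "Russian": (111, 1412),
    "Ukrainian": (30, 459),
    "English": (1686, 66_000),
}

# Average minutes per participant for each language.
AVERAGES = {
    lang: minutes_per_participant(hours, people)
    for lang, (hours, people) in FIGURES.items()
}

if __name__ == "__main__":
    for lang, avg in AVERAGES.items():
        print(f"{lang}: {avg:.1f} minutes per participant")
```

On these figures the averages come to roughly 4.7 minutes for Russian, 3.9 for Ukrainian and at most 1.5 for English, so the smaller sets rely on more recording time from each contributor.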

According to the author of the Vosk continuous speech recognition library, the shortcomings of the Common Voice set are the one-sidedness of the voice material (a predominance of men aged 20-30 and a lack of recordings from women, children and the elderly), limited vocabulary variety (the same phrases are repeated), and the distribution of recordings in the lossy MP3 format, which distorts the audio.

Source: opennet.ru
