Mozilla announced the release of the 18th Common Voice dataset, which is now available for download. This dataset is part of Mozilla's strategy to make voice technologies more accessible. It is a free dataset of multilingual voice clips and associated text data, released under CC0 (public domain dedication). The dataset is built through community collaboration among voice and text contributors, language activists, technologists, academics, and other members of the Common Voice community.
The Common Voice dataset now contains 31,841 hours of speech data, including 20,789 hours of community-verified speech data. Compared with the previous release, that is 700 more hours of speech data and 381 more hours of verified data. The 18th dataset contains samples in 129 languages, including five new languages added in this release.
The five new languages are Xhosa (South Africa), Kalenjin (Kenya), Kidaw'ida (Kenya), Dholuo (Kenya and Tanzania), and Tswana (Botswana, Zimbabwe, Namibia, South Africa). These languages are spoken by millions of people and can now benefit from better voice technology support.
If you're interested in Common Voice, there are many ways to join the community. You can donate your voice or write and contribute original sentences in your language to help create the next dataset. If your language isn't yet in Common Voice, you can request its addition using the dedicated form. Technical contributions to the open-source project on GitHub are also welcome.
Mozilla is always happy to receive feedback on new releases. You can reach the team on the Common Voice forums, chat with them on Matrix, or email them directly at commonvoice@mozilla.com. They are particularly interested in learning what users build or investigate with the dataset. A better understanding of users' needs helps the team choose a direction for the project that serves those users better.
Source: linux.org.ru
