Facebook has published a model for machine translation that supports 200 languages

Facebook (banned in the Russian Federation) has published the results of the NLLB (No Language Left Behind) project, which aims to create a universal machine learning model that translates text directly from one language to another, bypassing intermediate translation into English. The proposed model covers more than 200 languages, including rare African and Australian languages. The ultimate goal of the project is to provide a means of communication for all people, regardless of the language they speak.

The model is available under the Creative Commons BY-NC 4.0 license, which permits copying, distribution, inclusion in your own projects and creation of derivative works, provided that attribution is given, the license is retained and use is non-commercial only. The tools for working with the models are licensed under the MIT license. To stimulate development based on the NLLB model, 200 thousand dollars has been allocated for grants to researchers.

To simplify the creation of projects using the proposed model, the following have also been published: the code of the applications used to test and evaluate model quality (FLORES-200, NLLB-MD, Toxicity-200), the code for training models, and encoders based on the LASER3 library (Language-Agnostic SEntence Representation). The final model is offered in two versions: full and reduced. The reduced version requires fewer resources and is suitable for testing and for use in research projects.

Unlike other machine learning-based translation systems, Facebook's solution offers a single common model that covers all 200 languages and does not require separate models for each language. Translation is carried out directly from the source language to the target language, without intermediate translation into English. To build universal translation systems, an additional LID (Language IDentification) model is also offered, which determines the language being used. That is, the system can automatically recognize the language in which information is provided and translate it into the user's language.
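To give a sense of scale, the short sketch below counts how many translation directions a single model has to cover for 200 languages, and compares the number of translation steps for direct translation against translation through an intermediate language. The function names and the language labels ("eng", "zul", "yor") are illustrative assumptions and are not taken from the NLLB code.

```python
from typing import Optional


def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source differs from target."""
    return num_languages * (num_languages - 1)


def translation_steps(source: str, target: str, pivot: Optional[str] = None) -> int:
    """Return how many translation passes are needed for one request.

    With no pivot the text is translated directly (one pass). With a pivot,
    the text goes source -> pivot -> target (two passes), unless one end of
    the pair already is the pivot language.
    """
    if source == target:
        return 0
    if pivot is None or pivot in (source, target):
        return 1
    return 2


# Hypothetical language labels, used only for illustration.
print(translation_directions(200))                   # 39800
print(translation_steps("zul", "yor"))               # direct: 1
print(translation_steps("zul", "yor", pivot="eng"))  # via English: 2
```

Each additional pass is another chance for errors to accumulate, which is one reason direct translation is attractive for language pairs that do not include English.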

Translation is supported in any direction between any of the 200 supported languages. To verify translation quality between any languages, the FLORES-200 reference test set was prepared. It showed that the NLLB-200 model surpasses previously proposed machine learning-based research systems by an average of 44% in translation quality, as measured by BLEU, a metric that compares machine translation against a reference human translation. For rare African languages and Indian dialects, the advantage reaches 70%. Translation quality can be assessed visually on a specially prepared demo site.
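For readers unfamiliar with BLEU, here is a minimal sketch of the idea: it measures how many word n-grams of the machine translation also occur in the reference translation and penalizes output that is too short. This is a simplified sentence-level variant with a single reference, whitespace tokenization and no smoothing. It is not the evaluation code used in the project, and the example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    for n = 1..max_n, multiplied by a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty precision gives a zero score
        log_precisions.append(math.log(overlap / total))

    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))  # identical text scores 1.0
print(bleu("the cat sat on a mat", reference))    # one wrong word lowers the score
```

Published BLEU figures are normally computed over a whole test set rather than per sentence, so the scores from this sketch are not comparable with the numbers quoted above.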

Source: opennet.ru