Meta Open-Sources 200 Language Translation AI NLLB-200

Meta AI recently open-sourced NLLB-200, an AI model that can translate between more than 200 languages. NLLB-200 is a 54.5B-parameter Mixture-of-Experts (MoE) model trained on a dataset containing over 18 billion sentence pairs. In benchmark evaluations, NLLB-200 outperforms other state-of-the-art models by up to 44%.

The model was developed as part of Meta’s No Language Left Behind (NLLB) project, which aims to provide machine translation (MT) support for low-resource languages: languages with fewer than one million publicly available translated sentences. To develop NLLB-200, the researchers collected several multilingual training datasets, both by hiring professional human translators and by mining data from the web. The team also created and open-sourced an extensive benchmark dataset, FLORES-200, which can evaluate MT models in more than 40k translation directions. According to Meta,

Translation is one of the most exciting areas in AI due to its impact on people’s everyday lives. NLLB is about much more than just giving people better access to content on the web. It will make it easier for people to contribute and share information in different languages. We have more work ahead of us, but we are energized by our recent progress….

Meta AI researchers have been working on neural machine translation (NMT) for low-resource languages for many years. In 2018, Meta released Language-Agnostic SEntence Representations (LASER), a library for converting text into an embedding space that preserves sentence meaning across 50 languages. In 2019, the first iteration of the FLORES evaluation dataset was released; it was expanded to 100 languages in 2021. In 2020, InfoQ covered the release of Meta’s M2M-100, the first single model that could translate between any pair of 100 languages.

As part of the latest release, the FLORES benchmark has been updated to cover 200 languages. The researchers hired professional translators to translate the FLORES sentences into each new language, with an independent group of translators reviewing the work. In total, the benchmark contains translations of 3k sentences sampled from English-language Wikipedia.
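The roughly 40k translation directions mentioned earlier follow directly from pairwise coverage: with n languages there are n × (n − 1) ordered source-target pairs, so 200 languages already yield close to 40,000 directions. A two-line check:

```python
n = 200               # approximate number of FLORES-200 languages
print(n * (n - 1))    # ordered translation directions -> 39800
```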

For training the NLLB-200 model, Meta created several multilingual training datasets. NLLB-MD, a dataset for evaluating the model's generalization, contains 3k sentences from four non-Wikipedia sources, also professionally translated into six low-resource languages. NLLB-Seed contains 6k sentences from Wikipedia, professionally translated into 39 low-resource languages, and is used to bootstrap model training. Finally, the researchers built a data-mining pipeline to generate a multilingual training dataset containing more than 1B sentence pairs in 148 languages.
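The article does not detail the mining pipeline, but the general bitext-mining recipe is to embed sentences from two monolingual corpora into a shared multilingual space (LASER-style encoders, mentioned above, are one option) and pair nearest neighbours using a margin-based similarity score. The sketch below illustrates that idea with NumPy; the random vectors, the 1.06 threshold, and the helper names are illustrative placeholders, not Meta's released code.

```python
import numpy as np

def cosine_matrix(src_emb, tgt_emb):
    # Rows are assumed to be L2-normalized sentence embeddings,
    # so the dot product equals the cosine similarity.
    return src_emb @ tgt_emb.T

def margin_scores(src_emb, tgt_emb, k=4):
    """Margin-based (ratio) scoring: each pair's cosine is divided by the
    average similarity of both sentences to their k nearest neighbours."""
    sim = cosine_matrix(src_emb, tgt_emb)
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (n_tgt,)
    margin = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sim / margin

def mine_pairs(src_emb, tgt_emb, threshold=1.06, k=4):
    """Keep (src, tgt) index pairs whose margin score clears the threshold,
    taking the best-scoring target for each source sentence."""
    scores = margin_scores(src_emb, tgt_emb, k=k)
    best_tgt = scores.argmax(axis=1)
    best_scores = scores[np.arange(len(src_emb)), best_tgt]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(best_tgt, best_scores))
            if s >= threshold]

# Toy usage: random vectors stand in for a real multilingual sentence encoder.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(120, 64)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(f"{len(mine_pairs(src, tgt))} candidate pairs mined")
```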

The final NLLB-200 model is based on the Transformer encoder-decoder architecture; however, in every fourth Transformer block, the feed-forward layer is replaced with a Sparsely Gated Mixture-of-Experts layer. To compare the model against the existing state of the art, the team evaluated it on the older FLORES-101 benchmark. NLLB-200 outperformed other models by an average of 7.3 spBLEU points, a 44% improvement.
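As a rough illustration of the architectural idea, and not Meta's released implementation, the following PyTorch sketch shows a sparsely gated Mixture-of-Experts feed-forward layer with top-2 routing; the expert count, hidden sizes, and dense dispatch are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sparsely gated MoE feed-forward layer: a router picks the top-k experts
    per token and their outputs are combined by gate weight. Sizes and k are
    illustrative, not NLLB-200's actual configuration."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)           # (B, S, n_experts)
        topk_val, topk_idx = gates.topk(self.k, dim=-1)     # (B, S, k)
        topk_val = topk_val / topk_val.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        # For clarity every expert is applied densely; real MoE implementations
        # dispatch only the tokens routed to each expert.
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # tokens routed to expert e
            if mask.any():
                weight = (topk_val * mask).sum(dim=-1, keepdim=True)  # (B, S, 1)
                out = out + weight * expert(x)
        return out
```

In the full model, a layer of this kind would stand in for the standard feed-forward sublayer in every fourth Transformer block, with the remaining blocks unchanged.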

Several members of the NLLB team took part in a Reddit “Ask Me Anything” session to answer questions about the work. When a user asked about the challenges of low-resource languages, research scientist Philipp Koehn replied:

Our main push was towards languages that were not previously served by machine translation. We tend to have fewer pre-existing translated texts or even texts for them – which is a problem for our data-driven machine learning methods. Several scripts are a problem, especially for translating names. But there are also languages that express less explicit information (such as time or gender), so translating from those languages requires inferences about a broader context.

The NLLB-200 models and training code, as well as the FLORES-200 benchmark, are available on GitHub.
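The checkpoints have also been published on the Hugging Face Hub; assuming the transformers integration, a translation call looks roughly like the sketch below. The distilled 600M-parameter variant is used to keep the example small, and the language codes follow FLORES-200 conventions such as eng_Latn and fra_Latn.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Distilled 600M-parameter NLLB-200 checkpoint; the full MoE model is much larger.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("No language left behind.", return_tensors="pt")
translated = model.generate(
    **inputs,
    # Force the decoder to start with the target-language code (French here).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```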
