Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
Authors: Fan et al. Facebook AI 2020
–
Paper Summary
Introduction
- Multilingual Machine Translation (MMT): single model to translate between any pair of languages
- Issue: Existing MMT systems have not performed as well as bilingual models when trained on the same language pairs, as model capacity must be split across many languages
- Previous Solutions: (1) increase model capacity (2) focus on English-Centric datasets
- Goal: create more diverse multilingual machine translation models by building a large-scale Many-to-Many dataset for 100 languages
- automatic construction of parallel corpora using a novel data mining strategy
- backtranslation to improve quality of our model on zero-shot and low resource language pairs
- leverage progress in scaling to train larger models, along with other strategies to scale the number of parameters
- Result: direct translation between 100 languages without pivoting through English at a performance that is competitive with bilingual models on many competitive benchmarks
Preliminaries
- SentencePiece: train model with 0.9995 character coverage
- Multilingual Dictionary: SentencePiece produces subword units depending on their frequency in the training dataset. Naively applying it to our corpora would result in low resource languages and languages written in less frequent scripts being underrepresented in the resulting dictionary. We circumvent this problem by adding monolingual data for low resource languages and by using temperature sampling with $T = 5$ (see the sketch at the end of this list). The resulting dictionary contains 128k unique tokens that are well distributed across languages.
- Transformers: sequence-to-sequence Transformer architecture using both the encoder and decoder. The encoder transforms the source token sequence into a sequence of embeddings of the same length. Then, the decoder produces the target sentence sequentially, token by token, i.e., autoregressively.
- Target language token: MMT target language not fixed – add special token in the encoder indicating the source language and a special token in the decoder indicating the target language
- Training: 12 Encoder and 12 Decoder layers, with 8192 FFN size and 1024 embedding dimension; 1.2B parameters
- Training Data: split into 256 shards to manage memory consumption, divided based on resource level such that high resource languages have more shards and the lowest resource languages only have one shard.
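To make the temperature sampling used for dictionary construction concrete (referenced in the Multilingual Dictionary item above), here is a minimal numpy sketch; the languages and sentence counts are made-up examples, and only the $T = 5$ flattening itself follows the paper.

```python
import numpy as np

# Hypothetical per-language monolingual sentence counts (not the paper's real statistics).
counts = {"en": 1_000_000_000, "fr": 300_000_000, "sw": 2_000_000, "is": 1_000_000}

T = 5.0  # temperature used in the paper for dictionary construction

langs = list(counts)
p = np.array([counts[l] for l in langs], dtype=np.float64)
p /= p.sum()          # raw data distribution
q = p ** (1.0 / T)    # flatten the distribution with temperature T
q /= q.sum()          # re-normalize to a sampling distribution

for lang, raw, resampled in zip(langs, p, q):
    print(f"{lang}: raw={raw:.4f} -> sampled={resampled:.4f}")
```

With $T = 5$ the distribution is flattened, so low resource languages contribute proportionally more subwords to the SentencePiece dictionary.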
Building a Many-to-Many Parallel Dataset for 100 Languages
Creating a Multilingual Benchmark
Language Selection
- widely-spoken languages from geographically diverse language families with a diversity of scripts and resource levels
- languages for which public evaluation data exists
- languages for which monolingual data is available, as monolingual data is a critical resource for large-scale mining
Evaluation Benchmarks
- WMT, WAT, IWSLT, FLORES, TED, Autshumato, Tatoeba
- Evaluation using BLEU score
- Tokenized using the moses tokenizer for most languages
Covering the Language Matrix by Mining Relevant Parallel Data
In this work, we leverage and extend the corpus provided by two of these mining projects: CCMatrix (Schwenk et al., 2019) and CCAligned (El-Kishky et al., 2020)
Mining parallel data with LASER: the generic data mining pipeline works as follows (a code sketch follows this list):
- a large corpus of text is preprocessed and divided into different languages,
- candidate pairs of aligned sentences are embedded and stored in an index,
- indexed sentences are compared to form potential pairs,
- the resulting candidate pairs are filtered in post-processing.
- focus on embeddings generated by the LASER encoder, which enables the comparison of sentences in 94 different languages
- retrieve parallel corpora efficiently using FAISS indexing
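A minimal sketch of the index-and-search step of this pipeline, assuming LASER sentence embeddings have already been computed and L2-normalized; the single-sided margin filter below is a simplified stand-in for the post-processing actually used (which scores margins in both directions).

```python
import faiss
import numpy as np

def mine_candidates(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4, threshold: float = 1.06):
    """Return candidate (src_idx, tgt_idx, score) pairs using a margin criterion.

    src_emb, tgt_emb: float32 arrays of shape (n, d), assumed L2-normalized
    so that inner product equals cosine similarity.
    """
    d = src_emb.shape[1]

    # Index all target sentences; inner-product search on normalized vectors = cosine.
    index = faiss.IndexFlatIP(d)
    index.add(tgt_emb)

    sims, ids = index.search(src_emb, k)   # k nearest targets for every source sentence

    # Margin scoring: similarity of the best match relative to the average
    # similarity of its k-nearest neighborhood (a common filtering heuristic).
    avg_knn = sims.mean(axis=1)
    candidates = []
    for i in range(src_emb.shape[0]):
        margin = sims[i, 0] / (avg_knn[i] + 1e-8)
        if margin > threshold:
            candidates.append((i, int(ids[i, 0]), float(margin)))
    return candidates
```

FAISS makes this nearest-neighbor search tractable even when comparing very large monolingual corpora.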
CCMatrix Mining Approach
- global approach
- all unique sentences in one language are compared with all unique sentences in another language.
- Advantage: considering all possible documents when searching for the translation of a sentence
- Disadvantage: computationally demanding even with fast indexing from FAISS
- Apply to selected subset of relevant pairs
CCAligned Mining Approach
- local hierarchical approach
- document-level language identification finds whole documents that are likely to contain mutual translations -> parallel sentences are then mined using LASER-based alignment within the paired documents only
- Advantage: very fast, scalable, and retrieves parallel sentences with high precision; each English document is aligned to many non-English documents – mining non-English pairs can be quickly performed by joining the non-English documents paired to the same English document (a minimal sketch of this join follows below)
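The join described in the last bullet can be pictured as grouping non-English documents by the English document they were aligned to; the identifiers below are purely hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alignment output: (english_doc_id, foreign_doc_id, foreign_lang)
aligned = [
    ("en_doc_1", "fr_doc_9", "fr"),
    ("en_doc_1", "de_doc_3", "de"),
    ("en_doc_2", "fr_doc_7", "fr"),
]

# Group non-English documents by the English document they were aligned to ...
by_english_doc = defaultdict(list)
for en_doc, other_doc, lang in aligned:
    by_english_doc[en_doc].append((other_doc, lang))

# ... then every pair sharing an English anchor becomes a candidate non-English document pair.
for en_doc, docs in by_english_doc.items():
    for (doc_a, lang_a), (doc_b, lang_b) in combinations(docs, 2):
        if lang_a != lang_b:
            print(f"candidate pair via {en_doc}: {doc_a} ({lang_a}) <-> {doc_b} ({lang_b})")
```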
Bridge Language Group Mining Strategy
- Mining data for each and every language pair is prohibitive – we need sparse mining, i.e., mining only a select subset of pairs. Randomly selecting pairs to mine is straightforward but does not take linguistic information into consideration.
- Alternative: group 100 languages into 14 language groupings and all languages within a grouping are mined against each other
- people living in areas that speak multiple languages in the same grouping tend to communicate a lot with each other and would benefit from high quality direct translation
- systematically mining languages of the same grouping is helpful for training language-specific parameter models
- languages are grouped primarily by linguistic similarity, but grouping sizes vary greatly, leaving less mined data for languages in the smallest groups; to reduce this data discrepancy, languages are also grouped by geographic and cultural proximity
- define 1-3 bridge languages in each grouping, usually those with the most resources (pair selection is sketched after this list)
- all 100 languages are mined against English
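A rough sketch of how the set of pairs to mine could be enumerated under the bridge-language strategy; the groups and bridge languages below are illustrative stand-ins, not the paper's actual 14 groupings or 100 languages.

```python
from itertools import combinations

# Illustrative groupings (not the paper's real groups).
groups = {
    "germanic": ["en", "de", "nl", "sv"],
    "slavic": ["ru", "pl", "cs", "uk"],
    "indo_aryan": ["hi", "bn", "mr"],
}
# 1-3 high-resource bridge languages per group (illustrative).
bridges = {"germanic": ["en", "de"], "slavic": ["ru"], "indo_aryan": ["hi"]}

pairs = set()

# 1) Mine all pairs within a group.
for langs in groups.values():
    pairs.update(combinations(sorted(langs), 2))

# 2) Mine bridge languages of different groups against each other.
all_bridges = sorted({b for bs in bridges.values() for b in bs})
pairs.update(combinations(all_bridges, 2))

# 3) Mine every language against English.
all_langs = sorted({l for langs in groups.values() for l in langs})
pairs.update(tuple(sorted((l, "en"))) for l in all_langs if l != "en")

print(f"{len(pairs)} language pairs to mine")
```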
Training Data Statistics: final dataset has 7.5B parallel sentences, corresponding to 2200 directions
- 5–10 times more parallel data can be mined if using a Many-to-Many strategy, compared to an English-Centric one
Augmenting Bitext Data with Backtranslation
Backtranslation (BT) creates synthetic bitexts from unaligned monolingual data. The core idea is to translate monolingual sentences in the backward direction, and add the obtained synthetic translations to the training set.
Goal: because backtranslation is time consuming, use it only on specific pairs to supplement the mined data where needed
Selection: Since back-translation is computationally intensive, we focus on 100 directions with a BLEU score of between 2 and 10. For 50 of these directions, we do not have any bitext at all as we did not mine all 4,450 possible language pairs.
Training a Multilingual Model with Additional Backtranslations: For each of the 100 target languages, randomly sample 50 million unique monolingual sentences from the CommonCrawl corpus and generate synthetic translations with the 1.2B MMT model
Impact of Backtranslated Data: Backtranslation almost always improves performance for any direction, regardless of the original BLEU score.
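A minimal sketch of the backtranslation step, with a hypothetical translate() helper standing in for generation with the 1.2B MMT model (the actual fairseq generation code is not shown).

```python
# Backtranslation sketch for one low-resource direction src -> tgt.
# `reverse_model` and `translate` are hypothetical placeholders for the MMT
# model and its generation wrapper; they are not real library calls.

def backtranslate(reverse_model, monolingual_tgt, src_lang, tgt_lang, translate):
    """Create synthetic (source, target) pairs from target-side monolingual data.

    The monolingual target sentences are translated *backward* (tgt -> src);
    the synthetic sources are then paired with the original human-written targets.
    """
    synthetic_src = translate(reverse_model, monolingual_tgt, src_lang=tgt_lang, tgt_lang=src_lang)
    # Pair synthetic sources with the clean monolingual targets and (optionally)
    # tag them so the model can tell synthetic data from mined bitext.
    return [("<BT> " + s, t) for s, t in zip(synthetic_src, monolingual_tgt)]
```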
Balancing Languages in a Many-to-Many Setting
The data distribution produced by large-scale mining is not balanced between languages, so training a model would favor over-represented languages.
- Standard Solution: temperature sampling, but it is not directly applicable in the Many-to-Many case because language distributions are more interdependent
- Solution: Sinkhorn Temperature Sampling, which directly samples a pair of languages from a matrix of pair probabilities such that the marginal distributions of languages correspond to our target distribution. In practice, this means that each row and column of the matrix should sum to the probability of the corresponding language (sketched below).
- Result: constant improvement of 0.5 in BLEU.
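A minimal numpy sketch of this idea: start from the (normalized) matrix of mined pair counts and alternately rescale rows and columns until both marginals match the temperature-adjusted language distribution. The 3-language matrix and target distribution are illustrative.

```python
import numpy as np

def sinkhorn_pair_distribution(pair_counts: np.ndarray, lang_probs: np.ndarray, n_iters: int = 50):
    """Rescale a matrix of language-pair counts so that each row and each column
    sums to the target probability of the corresponding language."""
    P = pair_counts.astype(np.float64) + 1e-8      # avoid zero rows/columns
    P /= P.sum()
    for _ in range(n_iters):
        P *= (lang_probs / P.sum(axis=1))[:, None]   # match row marginals
        P *= (lang_probs / P.sum(axis=0))[None, :]   # match column marginals
    return P

# Illustrative 3-language example: counts of mined sentences per direction.
counts = np.array([[0.0, 50.0, 5.0],
                   [50.0, 0.0, 1.0],
                   [5.0, 1.0, 0.0]])
target = np.array([1/3, 1/3, 1/3])   # e.g., a temperature-flattened language distribution
P = sinkhorn_pair_distribution(counts, target)
print(P.round(3), P.sum(axis=0).round(3), P.sum(axis=1).round(3))
```

This is just iterative proportional fitting (Sinkhorn-Knopp); sampling a (source, target) pair from the resulting matrix then respects the per-language budget.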
Many-to-Many Compared to English Centric
Performance is aggregated over 150 directions for To English and From English, and over 2500 directions for Non-English. On the pairs including English, both models achieve similar performance, suggesting that a 1.2B model does not underfit even though the additional non-English data represents 98% of the directions and 74% of the data.
While this result is not surprising, it confirms that a purely English-Centric model has limited potential on non-English pairs, and there is a fundamental need for training on Many-to-Many data.
Understanding the Source of Improvement
- Zero-shot: Many-to-Many performs nearly 11 BLEU better than the English-Centric model for direct translation, and with pivoting the gain remains over 6 BLEU.
- Impact of the quantity of training data: A hypothesis to explain the gain between English-Centric and Many-to-Many models is the effect of additional source and target side training data.
- Which Pairs Improve the Most? Pairs that have a large quantity of mined data, such as Spanish-Portuguese, greatly benefit from our Many-to-Many dataset. A second source of improvement is observed on languages for which the Many-to-Many dataset contains a large amount of data across many pairs. Finally, we also observe a third type of improvement arising from similarity in vocabulary and syntax between related languages (e.g., English-Belarusian).
Understanding the Performance of English-Centric Systems
- English-Centric model improves the most over bilingual models on the directions into English, while improvements in the other directions (From English) remain more modest.
- A hypothesis to explain this discrepancy between directions from and to English is that the decoder of an English-Centric model learns a better English language model by leveraging the aggregated English data across all through-English directions.
Components for Scaling Multilingual Translation Models
Dense Scaling
- How to Scale: Wide or Deep? Increasing the number of layers (deeper) vs. the dimensions of each layer (wider); wider models are found to scale better than deeper models
- Performance as a function of scale: as we increase the number of parameters, we observe that the performance increases, even on low-resource pairs. However, improvements increase roughly logarithmically in the number of parameters, and we need to scale model size by an order of magnitude to improve by a few BLEU points. As we scale models densely, their runtime and memory usage becomes too prohibitive to justify the gain in performance, and so, we consider alternatives to increase the capacity of our models more efficiently.
Scaling Model Capacity with Language-Specific Parameters
- we introduce a layer whose parameters are split by language or language group based on similarity in vocabulary.
- Each translation direction only accesses a subset of these parameters, allowing the model capacity to scale without significantly affecting the training and inference time.
- we focus on allocating entire language-specific layers and using this to scale model size while maintaining training speed.
- Parallel Transformer Layer: replace a seq2seq Transformer layer with a set of parallel Transformer layers, one for each pre-defined group of languages (see the sketch at the end of this section)
- if parallel layer is in the encoder, we select the sublayer according to source language
- if parallel layer is in the decoder, we select the sublayer according to target language
- In practice, we only add these layers to either the encoder or decoder, not both. This enables us to split translations along with their sublayers per GPU, leading to faster training and efficient memory usage.
- Grouping Languages by Frequency and Similarity: We group languages based on two criteria: the amount of training data and their overlapping vocabulary.
- First, each language with more than 100M sentences forms its own group and hence has its own sublayer. We have 28 languages that fit this criterion
- Second, we group the remaining languages by vocabulary overlap, leading to 18 additional groups.
- We cluster the remaining languages together and roughly balance the amount of training data for each group
- In total, we form 46 groups, each with its own sublayer in a language-specific layer
- Random Re-Routing between Sublayers: Deterministic routing does not share information between similar languages if not associated with the same sublayer. We mitigate this shortcoming by random re-routing of translations, i.e., randomly picking another sublayer instead of the designated one. This shares information between languages associated with different sublayers, benefiting low resource languages by training on similar high resource languages.
- Adding Language-Specific layers to Pre-Trained Transformers: We can integrate a language-specific layer into an already pre-trained Transformer by adding it either at the end of the decoder or at the beginning of the encoder.
- These additional language-specific layers train rapidly as the rest of the model already has strong performance. This strategy means it is straightforward to adapt pre-trained networks to a new domain or language by training a small number of dedicated parallel layers, and could easily be extended to various other applications.
- Evaluation:
- language-specific parameters are generally more effective when applied to the decoder network.
- With a re-routing rate of 20%, an improvement of about 0.8 BLEU can be achieved over no re-routing for low and mid resource languages
- language-specific layers improve results compared to baselines of similar parameter size, particularly for mid and high resource languages where there is sufficient data to train the language-specific sublayers.
- We show that adding language-specific layers for five languages improves results on the WMT evaluation datasets.
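To tie together the parallel language-specific layer and the random re-routing described above (see the pointer in the Parallel Transformer Layer item), here is a minimal PyTorch-style sketch; the group mapping, layer hyperparameters, and 20% re-routing rate are illustrative rather than the paper's exact configuration.

```python
import random
import torch
import torch.nn as nn

class LanguageSpecificLayer(nn.Module):
    """One Transformer encoder layer per language group, selected at run time.

    In the encoder the routing key is the source language; in the decoder it is
    the target language. A small re-routing probability occasionally sends a
    batch through another group's sublayer so that information is shared.
    """

    def __init__(self, num_groups: int, d_model: int = 1024, ffn_dim: int = 8192,
                 nhead: int = 16, reroute_prob: float = 0.2):
        super().__init__()
        self.sublayers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=ffn_dim, batch_first=True)
            for _ in range(num_groups)
        )
        self.reroute_prob = reroute_prob

    def forward(self, x: torch.Tensor, group_id: int) -> torch.Tensor:
        # Random re-routing (training only): occasionally pick another sublayer.
        if self.training and random.random() < self.reroute_prob:
            group_id = random.randrange(len(self.sublayers))
        return self.sublayers[group_id](x)

# Illustrative usage: route a batch of source-side states by source-language group.
lang_to_group = {"fr": 0, "de": 1, "sw": 2}          # hypothetical grouping
layer = LanguageSpecificLayer(num_groups=3, d_model=1024)
x = torch.randn(8, 20, 1024)                          # (batch, seq_len, d_model)
out = layer(x, group_id=lang_to_group["fr"])
```

In the encoder the group_id would come from the source language and in the decoder from the target language, matching the routing rule described above.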
My Thoughts
- A high-level look at Facebook’s M2M-100. The paper doesn’t include all the nitty gritty details of scaling and mining and makes the process seem easier than it actually probably is. While I was reading the paper, I kept thinking about how laborious this effort must have been… to collect that much data across 100 languages, even if you are using an algorithm that mines it for you, must have been an arduous and onerous task. Even just thinking about keeping track of that much data makes my head hurt. But I’m sure they had a whole team of engineers working on each language…
- I feel like there isn’t really a fantastic breakthrough in this paper in terms of theory or model architecture but the authors were able to cleverly leverage all the past and recent techniques in mining/translation/scaling to create this M2M-100. This gigantic task I’m sure is one that many thought about doing but couldn’t get up the courage to start and Facebook just managed to get their butts off their chairs and actually did it–really, really well.
- Reading the paper, you can really tell that a lot of thought and care went into the model architecture. They had great explanations and reasons for all of the model architecture decisions they made. It really seemed as if they were leaving nothing up to chance. I was especially impressed by the bridge language modeling technique and the amount of linguistic information that went into consideration in its formation.
- The paper also reminded me of the importance of data mining as a field, especially for specific data such as parallel datasets. I wonder if this mining method could be applicable in monolingual settings if you wanted data for a certain field, such as medical or conversation data. If it could be, it would be helpful for data augmentation in pretraining or even for fine-tuning for mid and low resource languages, including Korean.
- The paper wasn’t too difficult to comprehend because it was well written and high-level, but I would like to look up more on the following: CCAligned Mining Matrix, Temperature Sampling, Backtranslation