In this article, I explain how to select, check, and split datasets to build a machine translation system. I show with examples which properties of a dataset matter most for machine translation, and how to set the trade-off between the quality and the quantity of data depending on the objective of the machine translation system.
Train, validate, and evaluate
To build a machine translation system, we need as much data as possible for:
Training: A machine translation system must be trained to learn how to translate. If we plan to use a neural model, this step is by far the most costly one in terms of data and compute resources.
Validation: A validation dataset can be used during training to monitor the performance of the model being trained. For instance, if the performance doesn’t improve after some time, we can decide to stop the training early. Then, if we have saved models at different training steps, we can select the one performing the best on the validation data, and use this model for evaluation.
Evaluation: This step automatically yields the performance of our selected model on a dataset that is as close as possible to the text our system will translate once deployed. If the performance is satisfying, then we can deploy our model. If not, we would have to retrain the model with different hyperparameters or training data.
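The validation step described above (stop early when performance stalls, then keep the best saved model) can be sketched as follows. This is only a minimal illustration: the function name, the `patience` parameter, and the assumption that a higher score is better are mine, not tied to any particular toolkit.

```python
def select_checkpoint(validation_scores, patience=3):
    """Pick the best checkpoint and the step at which training would stop.

    validation_scores: list of (step, score) tuples in training order,
        where a higher score is assumed to be better.
    patience: hypothetical number of consecutive validations without
        improvement that we tolerate before stopping.
    Returns (best_step, stop_step); stop_step is None if training never
    triggers early stopping.
    """
    best_step, best_score = None, float("-inf")
    rounds_without_improvement = 0
    for step, score in validation_scores:
        if score > best_score:
            best_step, best_score = step, score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                # Stop early; the best checkpoint seen so far is kept.
                return best_step, step
    return best_step, None
```

The checkpoint returned as `best_step` is the one we would then run on the evaluation data.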
All these datasets are parallel corpora in source and target languages, and ideally in the target domain.
That’s a lot of keywords in a single sentence. Let’s explain them one by one.
Source language: This is the language of the text that will be translated by our machine translation system.
Target language: This is the language of the translation generated by the machine translation system.
Target domain: This notion is more complex to define. Let’s say that the data used to build our system should look as close as possible to the data that the system will translate once deployed: the same style, genre, and topic, for instance. If we want our system to translate tweets, it would be much better to train it on tweets than on scientific abstracts. It may seem obvious, but finding a large dataset in the target domain is usually challenging, so we have to approximate it.
Parallel corpora: These usually take the form of sentences or segments in the source language paired with their translations in the target language. We use parallel data to teach the system how to translate. This type of data has many other names: parallel data, bilingual corpora, bitext, etc. “Parallel data” is probably the most common one.
For example, the following dataset is parallel:
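As a minimal illustration, a parallel corpus can be represented as two aligned lists, where segment i on the source side is paired with segment i on the target side. The Spanish-English sentence pairs below are made up for this sketch and do not come from any real corpus.

```python
# Made-up sentence pairs, for illustration only.
source_es = ["Hola.", "¿Cómo estás?", "Gracias por tu ayuda."]
target_en = ["Hello.", "How are you?", "Thank you for your help."]

# Both sides must contain the same number of segments:
# line i of the source is the translation unit for line i of the target.
assert len(source_es) == len(target_en)

parallel_corpus = list(zip(source_es, target_en))
for src, tgt in parallel_corpus:
    print(f"{src}\t{tgt}")
```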
Quality
To get the best machine translation system, we need a large parallel corpus to train the system. But we shouldn’t sacrifice quality for quantity.
Depending on whether we talk about training or validation/evaluation data, the quality of the data used will have a different impact.
But first, let’s define the most important characteristics of good-quality parallel data for building a system from scratch.
Correct
The translations in the parallel data should be correct and natural. Ideally, this means that the translations were produced from scratch (i.e., not post-edited) by professional translators and independently checked. Very often, parallel corpora are produced via crowdsourcing by non-professional translators. The data can also simply be crawled from the web and automatically paired, which is far from perfect, especially for domains and language pairs with little data available. Even though the quality of such datasets is far from optimal, we may have no choice but to use them when they are the only resource available for a given language pair.
Aligned
The segments, or documents, in the parallel data, should be correctly aligned. If segments are not paired properly, the system will learn wrong translations at training time.
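One rough way to spot badly aligned pairs is a length-ratio check: if one side is many times longer than the other, the pair is suspicious. This is a common heuristic that I am adding here for illustration, not a method prescribed above; the function name and the threshold of 3.0 are assumptions, and whitespace tokenization does not suit languages written without spaces.

```python
def looks_misaligned(source, target, max_ratio=3.0):
    """Flag a segment pair whose token-count ratio is suspiciously high.

    max_ratio is a hypothetical threshold; it should be tuned per
    language pair. Empty segments are always flagged.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return True
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio
```

A check like this only catches gross misalignments. Two segments of similar length can still be unrelated, so it does not replace inspecting a sample of the data.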
Original
The source side of the parallel data should not be a translation from another language. This point may be a bit difficult to grasp. We want our system to learn how to translate text in the source language. But if, at training time, we provide our system with text that was not originally written in the source language, i.e., text that is already a translation from another language, then it will learn to translate translations better than original text. I’ll detail why this is important below.
In-domain
The data should be in the target domain. This is arguable and suits the ideal scenario. We can train a very good system on an out-of-domain dataset and fine-tune it later on a smaller dataset in the target domain.
Raw
The data should be close to raw. Using an already pre-processed dataset is often a bad idea. By pre-processing, I mean any process that altered the original text. It can be tokenization, truecasing, punctuation normalization, etc. Very often, all these pre-processing steps are under-specified with the consequence that we can’t exactly reproduce them on the text our system will actually translate once deployed. It is way safer, and sometimes faster, to define our own pre-processing steps.
To have a rough idea about the quality of the dataset, we should always know where the data comes from and how it was created. I’ll write more about this below.
At training time, the machine translation system will learn the properties of the parallel data. Neural models are rather robust to noise but if our training data is very noisy, i.e., misaligned or with many translation errors, the system will learn to generate translations with errors.
At validation/evaluation time, the quality of the parallel data used is even more critical. If our dataset is of poor quality, the evaluation step will only tell us how good our system is at translating poorly. In other words, the evaluation would be useless, yet it may still convince us to deploy a poorly trained machine translation system.
Quantity
In addition to quality, the quantity of data used is also critical.
“Quantity” often refers to the number of parallel segments in the parallel corpora. I will use this definition here.
For training, using as much data as possible is a good rule of thumb provided that the data is of a reasonable quality. I classify training scenarios into 3 categories:
low-resource: The training data contains fewer than 100,000 parallel segments (or so-called sentences)
medium-resource: The training data contains between 100,000 and 1,000,000 parallel segments
high-resource: The training data contains more than 1,000,000 parallel segments
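These three categories translate directly into a small helper. The function name is mine, and the handling of the exact boundaries (100,000 and 1,000,000 both count as medium-resource) is an assumption, since the list above leaves them open.

```python
def resource_category(num_segments):
    """Classify a training scenario by its number of parallel segments."""
    if num_segments < 100_000:
        return "low-resource"
    if num_segments <= 1_000_000:
        return "medium-resource"
    return "high-resource"
```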
For validation and evaluation, using a lot of data may seem to be the right choice to get an accurate evaluation of our models, but usually, we actually prefer to use more data for training rather than for validation and evaluation.
If you look at best practices in research and development, you will find that validation and evaluation datasets for machine translation usually contain between 1,000 and 3,000 parallel segments. Keep in mind that the quality of these datasets is much more important than their quantity, in contrast to the training dataset. We want the evaluation dataset to be perfectly translated and as close as possible to the text our system will translate.
Monolingual data
Monolingual data, as opposed to the parallel data I described above, is text in a single language. It can be in the source or the target language.
Since this data is monolingual, it is far easier to collect in very large quantities than parallel data.
It is usually exploited to generate synthetic parallel data that is then used to augment the training parallel data.
There are many strategies to generate synthetic data, such as backtranslation and forward translation. They can be quite complex techniques, with a negative impact on training if not handled properly.
I’ll discuss them in detail in another blog post. Stay tuned!
Data leakage prevention
If you are familiar with machine learning, you probably already know what data leakage is.
We want the training data to be as close as possible to the validation and evaluation data but without any overlapping.
If there is an overlap, we talk about data leakage.
It means that our system is partly trained on data also used for validation/evaluation. This is a critical issue since it artificially improves the results obtained for validation/evaluation. The system would indeed be particularly good at translating its validation/evaluation data since it saw it at training time, whereas in production the system will likely be exposed to unseen texts.
Preventing data leakage is much more difficult than it sounds, and to make things more complicated there are many different levels of data leakage.
The most obvious case of data leakage is when pairs of segments, or documents, from the evaluation data are also in the training data. These segments should be excluded.
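A minimal sketch of this exclusion step, assuming each dataset is a list of (source, target) string pairs. The function name is hypothetical, and only exact matches (after stripping surrounding whitespace) are caught; near-duplicates would need a fuzzier comparison.

```python
def remove_overlap(train_pairs, heldout_pairs):
    """Drop from the training data every pair also found in the held-out data.

    Both arguments are lists of (source, target) tuples. Matching is exact
    after stripping leading and trailing whitespace.
    """
    heldout = {(src.strip(), tgt.strip()) for src, tgt in heldout_pairs}
    return [
        (src, tgt)
        for src, tgt in train_pairs
        if (src.strip(), tgt.strip()) not in heldout
    ]
```

In practice, `heldout_pairs` would be the concatenation of the validation and evaluation data.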
Another form of data leakage is when training and evaluation data were made from the same documents. For instance, shuffling the order of the segments of a dataset, and then picking the first 95% for training and the last 5% for validation/evaluation can lead to data leakage. In this situation, we are potentially using pairs of segments that were originally from the same documents, probably created by the same translator, in both training and validation/evaluation data. It is also possible that segments in the training data were directly used as context to create the translations of the segments in the validation/evaluation data. Consequently, the validation/evaluation data artificially becomes easier to translate.
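One way to avoid this document-level leakage is to split by document rather than by segment, so that all segments of a document land on the same side. The sketch below assumes every segment carries a document identifier, which many corpora do not provide; the function name, the tuple layout, and the default 5% fraction are illustrative assumptions.

```python
import random

def split_by_document(segments, heldout_fraction=0.05, seed=0):
    """Split segments so that no document is shared between the two sides.

    segments: list of (doc_id, source, target) tuples.
    Returns (train, heldout), each a list of the same tuples.
    """
    doc_ids = sorted({doc_id for doc_id, _, _ in segments})
    rng = random.Random(seed)
    rng.shuffle(doc_ids)
    # Hold out whole documents, at least one.
    n_heldout = max(1, round(len(doc_ids) * heldout_fraction))
    heldout_docs = set(doc_ids[:n_heldout])
    train = [seg for seg in segments if seg[0] not in heldout_docs]
    heldout = [seg for seg in segments if seg[0] in heldout_docs]
    return train, heldout
```

Note that the fraction applies to documents, not segments, so the held-out share of segments depends on document lengths.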
To prevent data leakage, always know where the data comes from, and how the data was made and split into training/validation/evaluation datasets.
A word about translationese
Parallel corpora have two sides. Ideally, the source side is an original text written by a native speaker of the source language and the target side is a translation produced by native speakers of the target language.
The target side is not an original text: It is a translation. A translation can have errors. Studies have also demonstrated that translations are lexically less diverse and syntactically simpler than original texts. These translation artifacts define “translationese.”
Why is it important in machine translation?
Let’s say you have a parallel corpus with an original source side in Spanish and its translation in English. This is perfect for a Spanish-to-English machine translation system.
But if you want an English-to-Spanish system, you may be tempted to just swap both sides of the parallel corpus: The original text would be on the target side and the translation on the source side.
Then, your system will learn to translate… translations! Since translations are easier to translate than original text, the task is much simpler for the neural network to learn. But then, the machine translation system will underperform when translating the original texts input by users.
The bottom line is: Check the origin of the data to be sure, at least, that you don’t have translations on the source side.
Note that sometimes this situation is inevitable, especially when tackling low-resource languages.
Sources of parallel corpora
Fortunately, there are many parallel corpora available online in various domains and languages.
I mainly use the following websites to get what I need:
OPUS: This is probably the most extensive source of parallel corpora. There are dozens of corpora available for 300+ languages. They are downloadable in plain text (2 files: 1 for the source language and 1 for the target language) or in the TMX format, which is an XML format often used in the translation industry. For each corpus, the size and length (in number of segments and tokens) are also given.
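With the plain-text format (one file per language, one segment per line), loading a corpus amounts to pairing line i of one file with line i of the other. Here is a minimal sketch; the function name and the file names in the usage comment are hypothetical.

```python
from itertools import zip_longest

def pair_lines(source_lines, target_lines):
    """Pair line i of the source with line i of the target.

    Accepts any two iterables of lines (open files or lists) and raises
    if the two sides do not have the same number of lines.
    """
    pairs = []
    for src, tgt in zip_longest(source_lines, target_lines):
        if src is None or tgt is None:
            raise ValueError("source and target have different numbers of lines")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs

# Usage with hypothetical file names:
# with open("corpus.es", encoding="utf-8") as f_src, \
#         open("corpus.en", encoding="utf-8") as f_tgt:
#     pairs = pair_lines(f_src, f_tgt)
```

For very large corpora, it would be better to stream the pairs than to hold them all in a list.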
Datasets from Hugging Face: This one is not specialized in resources for machine translation, but you can find a lot of parallel corpora there if you select the “translation” tag. The overlap between OPUS and Datasets is huge, but you will find some parallel corpora that are not available on OPUS.
These are by far the two biggest sources of parallel corpora. If you know others, please indicate them in the comments.
Be aware that most of the parallel corpora you will find there can be used for research and academic purposes, but not for commercial purposes. OPUS doesn’t show the license for each dataset. If you need to know it, you will have to directly check the original source of the dataset or contact the people who created it.
Examples
Now let’s be more practical and manipulate some datasets. I created two tasks for which I need parallel data:
Task 1: A general machine translation system to translate Spanish into English (Es→En)
Task 2: A specialized machine translation system to translate COVID-19-related content from Swahili to English (Sw→En)
We will first focus on Task 1.
We can start to search on OPUS to find whether there are parallel corpora for this task.
Fortunately, Es→En is a high-resource task. Plenty of parallel corpora are available in various domains. For instance, from OPUS we can get:
The first one, “ParaCrawl v9”, is one of the largest. It has been automatically created but is good enough to train a machine translation system. We should always check the license to be sure we can use it for our target application. As I mentioned above, OPUS doesn’t provide license information, but it does provide the source of the dataset once you click on it. For license information, we have to check the original source of the data: https://www.paracrawl.eu/. This corpus is provided under a CC0 license. Academic and commercial uses are allowed.
This is a huge corpus containing 264M pairs of segments. This is more than enough to split it into training/validation/evaluation datasets. I would split the data like this to avoid data leakage:
Since this is a lot of segments, we can split the data into consecutive chunks of 10M segments. I would extract a chunk, the last one for instance, that I would resplit into smaller consecutive chunks of 1M. Finally, I would randomly extract 3,000 segments for validation, from the first smaller chunk, and another 3,000 segments for evaluation, from the last smaller chunk.
There is enough distance between training, validation, and evaluation datasets. This is a very simple way to do it but far from optimal. It doesn’t prevent data leakage if the segments in the corpus are already shuffled.
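The chunk-based split described above can be sketched on segment indices as follows. This is only an illustration under assumptions of mine: the function name is hypothetical, "the last chunk" is taken to be the last full chunk of 10M segments, and the segments are assumed to still be in their original order.

```python
import random

def chunk_split(num_segments, chunk_size=10_000_000,
                subchunk_size=1_000_000, sample_size=3_000, seed=0):
    """Return (validation_indices, evaluation_indices, heldout_start).

    The last full chunk is reserved for held-out data. Validation indices
    are sampled from its first sub-chunk and evaluation indices from its
    last sub-chunk. Training data should only use indices below
    heldout_start.
    """
    num_chunks = num_segments // chunk_size
    if num_chunks < 1:
        raise ValueError("corpus is smaller than one chunk")
    heldout_start = (num_chunks - 1) * chunk_size
    heldout_end = heldout_start + chunk_size
    rng = random.Random(seed)
    first_subchunk = range(heldout_start, heldout_start + subchunk_size)
    last_subchunk = range(heldout_end - subchunk_size, heldout_end)
    validation = sorted(rng.sample(first_subchunk, sample_size))
    evaluation = sorted(rng.sample(last_subchunk, sample_size))
    return validation, evaluation, heldout_start
```

With 264M segments and the default sizes, the held-out chunk starts at index 250,000,000, which leaves 25 chunks of 10M segments for training.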
There are other methods, that I won’t discuss here, to better guarantee the absence of data leakage while extracting the most useful segment pairs for each dataset.
For training, you can begin, for instance, with the first 2 chunks of 10M segments. If you are not satisfied with the translation quality, you can add more chunks to your training data.
If the quality of the translation doesn’t improve much, it means that you may not need to use the remaining 200M+ segment pairs.
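This incremental strategy can be written as a simple loop. In the sketch below, `train_and_score` is a hypothetical callable that trains a system on the given chunks and returns its validation score (higher is better), and `min_gain` is an arbitrary threshold for what counts as "improving much".

```python
def grow_training_data(chunks, train_and_score, initial_chunks=2, min_gain=0.5):
    """Add training chunks one by one until the validation gain gets too small.

    chunks: list of training chunks, in the order they should be added.
    train_and_score: hypothetical callable taking a list of chunks and
        returning a validation score (higher is better).
    Returns (chunks_used, best_score).
    """
    used = list(chunks[:initial_chunks])
    best = train_and_score(used)
    for chunk in chunks[initial_chunks:]:
        candidate = used + [chunk]
        score = train_and_score(candidate)
        if score - best < min_gain:
            # The extra chunk does not help enough: stop adding data.
            break
        used, best = candidate, score
    return used, best
```

Each call to `train_and_score` stands for a full training run, so in practice this loop is expensive and would usually be run by hand rather than automated.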
Task 2 is much more challenging.
We want to translate Swahili. African languages are notoriously low-resource. In addition, we target a relatively new domain, COVID-19, so we can expect the data available for this task to be extremely small.
As expected, on OPUS far fewer datasets are available:
A good point here is that ParaCrawl is also available for Sw→En, but it is fairly small with its 100,000 segment pairs. Yet, this is one of the largest resources available with a CC0 license. I would use it for training, and then try to add other sources of data (such as CCMatrix or CCAligned) to observe how the performance improves.
But how to evaluate a machine translation system specialized for translating COVID-19 content?
Following the COVID-19 outbreak, an effort has been made by the research community to make translation resources available in many languages. The TICO-19 corpus is one of them and is provided with a CC0 license. It is available on OPUS. It is small but provides the translations of 3,100 segments in Swahili and English. This is enough to make validation/evaluation datasets. Here, I would take 1,000 segments for validation and the remaining segments for evaluation. Then, you will know how your system trained on ParaCrawl performs in translating COVID-19 content.
Note that I didn’t talk about translationese for these two tasks. ParaCrawl is very likely to have non-original Spanish and Swahili on its source side. The TICO-19 corpus has been created in English. The Swahili side is non-original. In other words, we can’t avoid translationese for these two tasks.
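A minimal sketch of this split, assuming the corpus is loaded as a list of segment pairs. Taking the first 1,000 segments for validation is my assumption, since which 1,000 to take is not specified; if the corpus is organized by document, splitting at a document boundary would be safer.

```python
def split_validation_evaluation(pairs, validation_size=1_000):
    """Use the first validation_size pairs for validation, the rest for evaluation."""
    if len(pairs) <= validation_size:
        raise ValueError("not enough segments to leave any for evaluation")
    return pairs[:validation_size], pairs[validation_size:]
```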
Conclusion
In this article, I described how to select and split your datasets to make your own machine translation system.
To conclude, I would say that the most important point is to find the best trade-off between quality and quantity, especially if you target low-resource languages. It is also critical to know your datasets very well. If you leave them unchecked, you may obtain a system that totally misses its target while being biased and unfair.
In the next article, I will show you how to pre-process these datasets to improve them and facilitate the training of machine translation.