GPT-3.5 Translates Paragraphs Better
And outperforms Google Translate for the translation of literary works
According to previous studies, GPT models perform as well as standard machine translation systems, e.g., Google Translate.
These studies mostly focused on sentence-level translation: the default approach in machine translation, which translates sentences one by one without any surrounding context.
Translating paragraphs or entire documents is very challenging for standard machine translation systems. They usually have to split the input into sentences or require heavy engineering to accept and leverage longer inputs.
Yet, intuitively, and following the workflow of human translators, we can expect machine translation systems to perform better with context, e.g., translating entire documents or paragraphs.
This is where large language models such as GPT models can shine. They can accept prompts significantly longer than the inputs of typical machine translation systems.
But it remains to evaluate the following:
Whether exploiting more context is useful for improving GPT’s machine translation quality.
The performance of GPT models when translating long text compared to standard machine translation systems.
The evaluation of large language models for translating paragraphs poses several challenges.
The automatic metrics used for machine translation evaluation are not designed for paragraph-level evaluation.
The evaluation data must not have been seen during the training of the evaluated systems.
The evaluation should be conducted on a diverse set of language pairs to have an accurate overview of the large language model’s translation quality.
Prompts must be designed to exploit an entire paragraph, i.e., not only sentences as done in previous work.
These challenges are all tackled by Karpinska and Iyyer (2023): “Large language models effectively leverage document-level context for literary translation, but critical errors persist”.
In this blog article, I review and comment on their work. We will see how their evaluation of GPT-3.5 shows that “LLMs produce better translations when provided with paragraph-level context” and that GPT-3.5 can achieve better translation quality than state-of-the-art neural machine translation systems for very diverse language pairs.
Human evaluation of paragraph translations
The automatic evaluation metrics commonly used in machine translation are unsuitable here: their correlation with human judgments is unknown for paragraph-level translations.
We can’t rely on automatic metrics here.
Human evaluation remains the most credible option, so the authors of this study mainly relied on the MQM framework (Lommel et al., 2014), asking annotators to:
Mark translation error spans and categorize them
Indicate which of two candidate translations is of higher quality
Provide free-form justifications for their preference judgments.
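To make this annotation scheme concrete, here is a minimal sketch of how one MQM-style annotation record could be represented in Python. The field names are my own illustration, not the format released by the authors.

```python
from dataclasses import dataclass, field

@dataclass
class MQMAnnotation:
    """One annotator's judgment for a pair of candidate translations.

    Illustrative structure only; not the authors' actual data format.
    """
    source_paragraph: str
    translation_a: str  # e.g., GPT-3.5 output
    translation_b: str  # e.g., Google Translate output
    # Error spans marked by the annotator: (system, start, end, category),
    # e.g., ("a", 12, 25, "mistranslation")
    error_spans: list[tuple[str, int, int, str]] = field(default_factory=list)
    preferred: str = "a"     # which of the two translations is preferred
    justification: str = ""  # free-form explanation of the preference
```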
For this evaluation, they collected a total of 720 pairs of translated paragraphs for 18 language pairs.
That’s a lot of data! I can’t wait to have a look at the dataset, which will be released on GitHub.
Machine translation of literary works
For evaluation, this work chose to focus on translating literary works. It may seem like an odd choice since most previous work in machine translation focuses on other genres/domains (news, user-generated texts, …).
Machine translation of literary texts is understudied and extremely challenging, especially with machine translation systems working at the sentence level.
In this type of text, contextual nuances are important but can’t be captured if the system translates sentences independently. Often, human translators have to restructure entire paragraphs to accurately translate into the target language.
Translation of literary texts is intuitively a task where a system taking a document or a paragraph as input would perform better than a system only accepting shorter input.
However, a major constraint that we have when we evaluate large language models is that the data used for evaluation must be recent. This is important for the credibility of the evaluation. By using recently published data for evaluation, we avoid translating texts that could have been used for training the evaluated model, i.e., we avoid data contamination.
In this work, most of the translations used for evaluation were published after 2021. These particular translations were very probably absent from the training data of GPT-3.5, which was trained on data published before 2022 according to OpenAI.
However, the original texts that are translated are much older (published from 1884 to 2020). They were very likely seen by the systems evaluated in this work (GPT-3.5 and Google Translate).
Also, while it is unlikely that the evaluated systems have seen these specific translations, they may have seen other translations of the same works, either into other languages or into the same language but published earlier.
Data contamination is thus limited but not entirely avoided. I don’t think there is a better way to completely prevent it for literary texts, but for other genres, such as news, it is possible.
A very diverse set of language pairs
This is one of the strongest points of this work: The authors evaluated very diverse language pairs.
As source languages, they selected languages from various families: Indo-European (Romance, Germanic, Slavic), Sino-Tibetan, and Japonic. This way, they ensure that the evaluation will be able to identify more precisely the strengths and weaknesses of GPT-3.5 in translating languages with different morphological features and writing systems.
The source languages used for evaluation are English (en), Polish (pl), Russian (ru), Czech (cs), French (fr), German (de), Japanese (ja), and Chinese (zh).
For the target languages, they selected languages that form both “easy” source-target pairs (similar languages) and “difficult” ones (dissimilar languages).
For instance, Czech-Polish is an easy language pair since these languages have a lot in common. On the other hand, Japanese-Polish is an extremely difficult language pair since these two languages come from very distant language families with different grammar and writing systems. There are also very few machine translation studies for this language pair.
The selected target languages for each source language are English (en), Japanese (ja), and Polish (pl).
Prompt engineering for translating with GPT-3.5
One of the most critical steps when evaluating large language models is designing prompts.
There are many possible prompts for machine translation. Ideally, we should extensively evaluate several of them to assess how impactful the choice of prompt is.
We also have to keep in mind that the conclusions made by a scientific work may only be valid for the very particular prompts that we evaluate.
Including many prompts in an evaluation is costly since we have to run the inference with a large language model for each prompt. In practice, it means that we can only select a limited number of prompts to conduct the evaluation.
They used 5-shot in-context learning to translate with GPT-3.5: the prompt contains 5 translation examples to indicate more precisely what is expected from GPT-3.5.
The selected translation examples have a critical impact on the translation quality of a language model. As demonstrated by Vilar et al. (2022), the quality of the example translations matters most.
About the example selection, they wrote:
We manually curate the five demonstrations from literary texts for each of the 18 language pairs, resulting in 90 total demonstration examples. These demonstrations are sourced from novels that are not part of our translation dataset, resulting in potential differences in topic and style […]
This is not very detailed. In particular, I have no idea what “curate” involves here: the curation criteria are not provided.
Once selected, the examples are included in three prompt templates that exploit different amounts of context.
Sentence-level Prompt Template
With this template, the sentences of the paragraph to translate are given to GPT-3.5 one by one. This is how standard sequence-to-sequence neural machine translation systems work.
Original text in [SRC LANG]:
source sentence
Translation into [TRG LANG]:
target sentence
Note: [SRC LANG] and [TRG LANG] denote the source and target languages, respectively.
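As a rough illustration, here is how such a 5-shot, sentence-level prompt could be assembled in Python. The demonstration pairs and the function name are placeholders of my own; the paper’s actual demonstrations are manually curated from literary texts.

```python
def build_sentence_prompt(demos, src_sentence, src_lang="Japanese", trg_lang="Polish"):
    """Assemble a 5-shot sentence-level translation prompt.

    `demos` is a list of (source_sentence, target_sentence) pairs.
    """
    blocks = []
    for src, trg in demos:
        blocks.append(
            f"Original text in {src_lang}:\n{src}\n"
            f"Translation into {trg_lang}:\n{trg}\n"
        )
    # The last block leaves the translation empty for the model to complete.
    blocks.append(
        f"Original text in {src_lang}:\n{src_sentence}\n"
        f"Translation into {trg_lang}:\n"
    )
    return "\n".join(blocks)
```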
Sentence-level Translation with Context Prompt Template
The translation is still performed at the sentence level, but each sentence is given to GPT-3.5 with its context: what precedes and what follows the sentence in the paragraph are both included in the prompt.
Original text in [SRC LANG]:
source prefix
<translate> src sent </translate>
source suffix
Translation into [TRG LANG]:
target prefix
<translated> trg sent </translated>
I found this design quite imaginative but also risky. In my experience, GPT models can easily be confused if we don’t explicitly define the tags. In this situation, I wouldn’t be surprised if GPT just translated everything, including the tags (<translate> and <translated>).
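If I were to use this template, I would at least extract only the tagged span from the model output, for instance with a regular expression, and keep the raw output as a fallback. This is my own defensive sketch, not something described in the paper.

```python
import re

def extract_tagged_translation(model_output: str) -> str:
    """Keep only the text between <translated> and </translated>.

    If the model ignored the tags (a plausible failure mode), fall back
    to the raw output so nothing is silently lost.
    """
    match = re.search(r"<translated>(.*?)</translated>", model_output, re.DOTALL)
    return match.group(1).strip() if match else model_output.strip()
```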
Paragraph-level Prompt Template
The template is the same as the first one, but here they provide entire paragraphs instead of sentences.
Original text in [SRC LANG]:
source paragraph
Translation into [TRG LANG]:
target paragraph
Now that we have our prompts, we can use them to evaluate the translation quality of GPT-3.5.
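Here is a rough sketch of how one of these assembled prompts could be sent to GPT-3.5 with the OpenAI Python client. The model name and decoding settings are my assumptions; the paper may have used a different GPT-3.5 variant and parameters.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_with_gpt(prompt: str) -> str:
    """Send a fully assembled few-shot translation prompt to GPT-3.5."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,        # conservative decoding for evaluation
    )
    return response.choices[0].message.content
```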
Evaluation of GPT-3.5 for Paragraph Translations
This evaluation mainly aims to answer two questions:
Are large language models such as GPT-3.5 better at translation when translating entire paragraphs instead of sentences?
How does GPT-3.5 perform compared to Google Translate when translating entire paragraphs?
For this evaluation, the authors mainly rely on human evaluation using the MQM framework.
If you are familiar with my work, you already know how critical I can be when writing about machine translation evaluation.
For this work, the authors conducted their machine translation evaluation with very high scientific credibility. If you are searching for an example of a good machine translation evaluation, this is one of them. Note: I also recommend reading “Prompting PaLM for Translation: Assessing Strategies and Performance” (Vilar et al., 2022), which is another good example, as I detailed in my blog article “How Good Is Google PaLM at Translation?”.
Their conclusions don’t rely on automatic metrics, but they still provide metric scores for additional analysis. All the details needed to replicate these scores are also provided, which is extremely rare.
They have even tested the statistical significance of their human evaluation.
The results:
GPT-3.5 is better at translating paragraphs than individual sentences
GPT-3.5 is better than Google Translate
However, these results vary across language pairs.
For the German-to-Japanese translation direction, translating individual sentences yields better results. This is the only exception. According to the authors, this is because the data used for this translation direction has very long sentences.
What is most surprising to me is that GPT-3.5 is also better than Google Translate when translating individual sentences.
Automatic metrics also yield very similar results: COMET, BLEURT, BERTScore, and COMET-QE all agree that GPT-3.5 is better than Google Translate with any of the 3 prompt templates.
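For readers who want to compute such scores themselves, here is a minimal sketch using the unbabel-comet package. The checkpoint name is a commonly used reference-based COMET model and may not be the exact one used in the paper.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed checkpoint
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "source paragraph",
        "mt": "system translation",
        "ref": "human reference translation",
    }
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level COMET score
```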
The paper presents a very extended analysis of their human evaluation. I won’t discuss it more in this article but invite you to read it. It’s very insightful.
Limitations of GPT models for translation
The paper has a “limitations” section (Section 7) where the authors discuss the limits of using GPT models for translation.
The authors note that the translation errors made when translating paragraphs are different from the errors made when translating individual sentences.
When translating paragraphs, GPT-3.5 sometimes skips or forgets part of the content of the paragraph, leading to incomplete translations. I also observed similar behavior when playing with ChatGPT for translation.
This problem could be corrected by fine-tuning GPT-3.5 for machine translation. Note: Let’s not forget that the GPT-3.5 model evaluated here has not been fine-tuned for machine translation.
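A crude way to flag such omissions automatically would be to compare sentence counts on both sides of the translation. This heuristic is my own illustration, not something proposed in the paper.

```python
import re

def flag_possible_omission(src_paragraph: str, translation: str,
                           min_ratio: float = 0.7) -> bool:
    """Flag a translation whose sentence count is suspiciously low.

    Naive splitting on end punctuation; a real check should use a
    language-aware sentence splitter, e.g., for Japanese or Chinese.
    """
    def count_sentences(text: str) -> int:
        return len([s for s in re.split(r"[.!?。！？]+", text) if s.strip()])

    src_count = count_sentences(src_paragraph)
    trg_count = count_sentences(translation)
    return src_count > 0 and trg_count < min_ratio * src_count
```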
Other than that, GPT-3.5 still makes more common types of errors, such as mistranslations and grammatical errors, but far fewer than Google Translate, as shown by the evaluation.
Limitations of this work
I struggled to find limitations for this work but there is at least one in my opinion.
The impact of the prompt templates is not clear. The specific template chosen for paragraph translation performs better than the template chosen for sentence translation.
But can we conclude with this setting that GPT-3.5 performs better when translating entire paragraphs?
If we change the templates, do we still draw the same conclusion?
We can’t easily answer this question. I expect this limitation to be shared by all future work evaluating language models for machine translation.
Also, this work focuses on translating literary texts. We can’t be sure that this work's conclusion would apply to other genres. I’m eager to read future work that will address this gap.
Conclusion
This work is a milestone in machine translation.
It shows with very high scientific credibility that a large language model can outperform more standard neural machine translation systems such as Google Translate. It also demonstrates that paragraph-level translation with a large language model yields better translation quality than sentence-level translation.
With this work and the previous study of PaLM’s translation quality, we have more and more evidence that the future of machine translation will be based on large language models.