In translation, OpenAI's GPT-3 has been shown to perform on par with state-of-the-art machine translation systems (Brown et al., 2020).
GPT-3 needs only a few translation examples to learn how to translate reasonably well. This makes it a suitable alternative to standard machine translation systems for languages and domains for which little training data is available.
This is an impressive performance, given that GPT-3 has been trained mainly on English text.
In this article, I will describe how to get the best of GPT-3 for translation with different prompting strategies. I will focus on use cases where we only have a few or zero translation examples available.
Using GPT-3 is not free, and estimating the cost of a translation is not straightforward, so I will also provide you with a method to estimate the cost of a translation yourself.
This article doesn’t require any prior knowledge of machine learning.
Prompting GPT-3
GPT-3 is not trained for any specific natural language processing tasks.
Yet, if you submit to GPT-3 the description of a task, in natural language, GPT-3 will try to do it, with more or less success depending on the task’s complexity.
For instance, if you want GPT-3 to translate some English text into French, you have to describe this translation task in plain text and submit this description to GPT-3 along with the text to translate.
The set {task description,text} is called a prompt. For a translation task, the prompt submitted to GPT-3 could be:
Prompt:
Translate from English to French:
The cat is in the kitchen. =>
And then, ideally, GPT-3 would respond:
Response:
Le chat est dans la cuisine.
In this case, translation is a zero-shot task for GPT-3. We ask GPT-3 to perform a task for which it has not been trained.
The zero-shot configuration is especially suited to use cases where we don’t have any training data available. You can see it as an unsupervised machine translation task.
To improve the results, we can modify the prompt to include an example of the task, such as:
Prompt:
Translate from English to French:
I am in the kitchen. => Je suis dans la cuisine.
The cat is in the kitchen. =>
In this case, GPT-3 has seen one translation example. It becomes a one-shot task for GPT-3. If you provide more than one example, we usually talk about a few-shot task.
One-shot and few-shot configurations are useful for machine translation tasks for which we only have a few examples of translations. Few-shot translation with GPT-3 can perform as well as, or even better than, a state-of-the-art machine translation system for some languages.
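To make the difference between the three configurations concrete, here is a minimal sketch of how such prompts could be assembled. The helper name build_prompt and its arguments are my own illustration, not something provided by OpenAI; it simply reproduces the "source => translation" layout shown above.

```python
def build_prompt(task_description, source, examples=()):
    """Assemble a zero-, one-, or few-shot translation prompt.

    examples is a sequence of (source_text, translation) pairs:
    empty for zero-shot, one pair for one-shot, several for few-shot.
    """
    lines = [task_description]
    for example_source, example_translation in examples:
        lines.append(f"{example_source} => {example_translation}")
    # The sentence to translate goes last and ends with "=>", without a trailing space.
    lines.append(f"{source} =>")
    return "\n".join(lines)


zero_shot = build_prompt(
    "Translate from English to French:",
    "The cat is in the kitchen.",
)
one_shot = build_prompt(
    "Translate from English to French:",
    "The cat is in the kitchen.",
    examples=[("I am in the kitchen.", "Je suis dans la cuisine.")],
)
print(one_shot)
```

Passing more pairs in examples turns the same call into a few-shot prompt.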
Choosing a Model and Estimating a Translation Cost
OpenAI has set up a web page where the models and pricing are detailed.
Costs are given for 1,000 tokens and for each model:
Ada
Babbage
Curie
Davinci
Note: If you are curious to know how these models perform in natural language processing, EleutherAI evaluated them in various tasks (but not in translation).
It is possible to do translation with all these models, but the cheaper ones won’t give you good results as shown by Lin et al. (2022).
Curie or Davinci are likely to give you the best results.
Note that the costs are given for tokens, not words. OpenAI estimates that 1 token ≈ 0.75 words. So if you have a text of 1,000 words to translate, GPT-3 will first tokenize it into approximately 1,333 tokens (1,000/0.75).
If you use Curie, you will pay approximately $0.0027 (1.333*0.002) for just submitting 1,000 words. This is for English text. Expect the token-to-word ratio to be higher for other languages, since GPT-3 is likely to split words into smaller pieces for languages that were less represented in its original training data.
This is the cost for what you will submit to GPT-3. Unfortunately, this is not the cost of the translation itself.
Remember, we have to describe to GPT-3 what we want to do. Along with the text to translate, we have to describe the task. OpenAI charges for the entire prompt.
For instance, if you do zero-shot translation, you will pay for the task description “Translate from English to French:” for each request you submit to GPT-3. If you do a few-shot translation, you have to add the cost of submitting all the examples of translation given in your prompt. As we will see in the following parts of this article, the cost to obtain the best of GPT-3 can quickly rise if you don’t carefully design your prompts.
Then, OpenAI also charges for the generated tokens. The number of generated tokens can’t be predicted but there is an option to limit it.
To summarize, the cost of a query sent to GPT-3 will be determined given the number of tokens in:
the description of the task
the translation examples provided (if any)
the text to translate
the translation generated
Note that the “description of the task” could be skipped if you provide several translation examples in the prompt for a few-shot translation.
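The four components listed above can be turned into a rough estimator. This is only a sketch: the function names are hypothetical, the price of $0.002 per 1,000 tokens is the Curie price used in this article's calculations, and the number of generated tokens is a guess (here I assume the translation is about as long as the source).

```python
CURIE_PRICE_PER_1K = 0.002  # USD per 1,000 tokens, as used in this article's calculations


def estimate_tokens(num_words, words_per_token=0.75):
    """Rough token count for English text, using OpenAI's 1 token = 0.75 words estimate."""
    return round(num_words / words_per_token)


def estimate_query_cost(description_tokens, example_tokens, source_tokens,
                        generated_tokens, price_per_1k=CURIE_PRICE_PER_1K):
    """Cost in USD of one query: every prompt token and every generated token is billed."""
    total_tokens = description_tokens + example_tokens + source_tokens + generated_tokens
    return price_per_1k * total_tokens / 1000


# 750 English words correspond to about 1,000 tokens.
source_tokens = estimate_tokens(750)
# Hypothetical zero-shot query: 7-token task description, no examples,
# and a generated translation assumed to be as long as the source.
cost = estimate_query_cost(7, 0, source_tokens, source_tokens)
print(f"{source_tokens} source tokens, estimated cost: ${cost:.6f}")
```

For languages other than English, the words_per_token value is likely too optimistic, so treat the result as a lower bound.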
With the best model, Davinci, GPT-3 can process up to 4,000 tokens (or 2,048 with Curie). If the length of the prompt plus the maximum length of GPT-3’s response, in tokens, is greater than what the model can handle, GPT-3 returns an error asking you to reduce the size of your prompt (or to change some parameters). This limit also means that you can’t submit a huge amount of text at once.
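Before sending a request, you can check this constraint yourself. The sketch below is an illustration with hypothetical helper names; the limits are the ones quoted in this article, and the prompt length must come from your own token count.

```python
# Context limits as quoted in this article (prompt tokens + generated tokens).
CONTEXT_LIMITS = {"curie": 2048, "davinci": 4000}


def fits_context(prompt_tokens, max_tokens, context_limit):
    """True if the prompt plus the maximum response length fits in the model's limit."""
    return prompt_tokens + max_tokens <= context_limit


def largest_max_tokens(prompt_tokens, context_limit):
    """Largest response budget that still fits alongside the prompt."""
    return max(context_limit - prompt_tokens, 0)


# Example: a 641-token prompt sent to Curie.
prompt_tokens = 641
print(fits_context(prompt_tokens, 2048, CONTEXT_LIMITS["curie"]))
print(largest_max_tokens(prompt_tokens, CONTEXT_LIMITS["curie"]))
```

If the check fails, either shorten the prompt (fewer examples, fewer sentences per request) or lower the maximum number of generated tokens.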
Setting Up the API
Note: If you are not comfortable coding in Python, you can skip this part and use instead the “Playground” web interface provided by OpenAI that is available once you sign up on their website.
I will do some demonstrations of GPT-3 translation using the API. If you want to follow along, you will need to create an OpenAI account. You will get free credits ($18 at the time of writing) to practice with GPT-3.
You will also need to install the openai package:
pip install openai
To use OpenAI’s API, you will need an API key. You can generate one in your OpenAI account. Keep this key private.
Then I send my requests to the API as in the following Python script:
import os
import openai

#Your API key is loaded here
#It should be exported as an environment variable before running this script: export OPENAI_API_KEY=your_key
openai.api_key = os.getenv("OPENAI_API_KEY")

#Description of the task in natural language
task_description = "Translate English to French:"
#The text you want to translate
to_translate = "The cat is in the kitchen =>"
#The prompt is created by concatenating the task description and the text to translate
prompt = task_description+"\n"+to_translate

#The API call
#response will contain the generated translation
#Note: the prompt length plus max_tokens must fit within the model's token limit
response = openai.Completion.create(
    model="text-curie-001",
    prompt=prompt,
    temperature=0.3,
    max_tokens=2048,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)

#Print the JSON response from GPT-3
print(response)
More precisely, the call to the API is done with openai.Completion.create. You can find the API documentation on OpenAI’s website to better understand the parameters. In this article, I won’t modify them, except for the “prompt”.
The response given by GPT-3 is formatted in JSON, as follows:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": " Le chat est dans la cuisine"
}
],
"created": 1668449187,
"id": "xxxx",
"model": "text-curie-001",
"object": "text_completion",
"usage": {
"completion_tokens": 7,
"prompt_tokens": 14,
"total_tokens": 21
}
}
The most important keys here are “text”, which contains the translation generated by GPT-3, and “total_tokens”, which is the number of tokens for which you will be billed.
This API call costs $0.000042 (0.002*21/1000).
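In practice, you will want to pull these two values out of the response programmatically. The sketch below works on a plain dictionary that mirrors the JSON above, assuming the object returned by the API can be indexed the same way; the helper names are mine.

```python
CURIE_PRICE_PER_1K = 0.002  # USD per 1,000 tokens, as used in this article

# A plain dictionary mirroring the JSON response shown above.
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": None,
            "text": " Le chat est dans la cuisine",
        }
    ],
    "model": "text-curie-001",
    "usage": {"completion_tokens": 7, "prompt_tokens": 14, "total_tokens": 21},
}


def extract_translation(response):
    """Return the generated text without the leading space GPT-3 tends to emit."""
    return response["choices"][0]["text"].strip()


def call_cost(response, price_per_1k=CURIE_PRICE_PER_1K):
    """Cost in USD of this call, based on the billed total_tokens."""
    return price_per_1k * response["usage"]["total_tokens"] / 1000


def was_truncated(response):
    """A finish_reason of "length" means the output hit the max_tokens limit."""
    return response["choices"][0]["finish_reason"] == "length"


print(extract_translation(response))
print(f"${call_cost(response):.6f}")
print(was_truncated(response))
```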
Translation Settings
For this demonstration, I will use the ECB dataset compiled by TILDE (CC-BY).
I will translate the following 6 sentences from English to French (my native language, so I can manually evaluate the translations):
Today also marks the opening of the online "Euro Run" competition for children aged 9 to 12 living in the euro area.
Unlike Denmark and the United Kingdom, Croatia does not have the right to opt out of adopting the single currency.
When the number of NCB Governors exceeds 15 they will exercise a voting right on the basis of a rotation system.
Write-downs amounted to €64 million in 2015 (2014: €8 million).
€500 banknote remains legal tender and will always retain its value
While there are fewer than 22 euro area Member States the rotation will take place in accordance with the following rules:
Note that the sentences in ECB are not consecutive, i.e., they should be considered independently.
The reference translations, in French, produced by a human translator are as follows:
C'est aujourd'hui également qu'est lancé le concours en ligne Euro Run, destiné aux enfants âgés de 9 à 12 ans habitant dans la zone euro.
Contrairement au Danemark et au Royaume-Uni, la Croatie ne bénéficie pas d'une clause d'exemption concernant l'adoption de la monnaie unique.
Lorsque le nombre des gouverneurs des BCN sera supérieur à quinze, ces derniers exerceront leur droit de vote selon un système de rotation.
Les moins-values latentes se sont établies à 64 millions d'euros en 2015 (contre 8 millions d'euros en 2014).
Le billet de 500 euros continue d'avoir cours légal et gardera toujours sa valeur
Tant que la zone euro comptera moins de vingt-deux États membres, la rotation fonctionnera selon les règles suivantes:
Note that GPT-3 is trained on data published on the Web until 2021. The ECB dataset was compiled in 2017, so it is very likely that GPT-3 saw the English text to translate, and maybe its French translation, during training. This would make the task artificially easier and prevent any fair performance comparison with other models (so I won’t do it in this article).
To compare the zero-shot, one-shot, and few-shot performances, I will do manual and automatic evaluations of the translations using the metric COMET (Apache License 2.0).
Zero-Shot Translation
For zero-shot translation, we only need to describe the task in natural language. To show the impact of the prompt, I tried 3 different descriptions.
More precisely, each time I change the prompt, I only change the content of “task_description” in the Python script above.
I name each prompt “prompt_x” so that I can refer to it later in this article.
prompt_1:
Translate English to French:
[source] =>
In prompt_1 and all the following prompts, “[source]” is the English text to translate. This prompt is the one used by OpenAI in the paper presenting GPT-3.
Watch for the spaces. GPT-3 doesn’t like prompts that end with a space.
Of course, you can replace “English” and “French” with the languages you want.
I also tried the following variation, which omits the name of the source language.
prompt_2:
Translate to French:
[source] =>
The third prompt is minimalist and doesn’t mention the task.
prompt_3:
English
[source]
French
[source] can be formatted in various ways. I tried many possibilities but retained only the following for this article:
unit: Send only one sentence to translate per API call.
batch: Send all the sentences at once. This is the best way to reduce your bill, since you won’t have to repeat the task description for each sentence to translate. It is also ideal if the sentences are supposed to be consecutive. However, in the ECB corpus that I chose for this article, the sentences are unordered, and thus GPT-3 may be misled when translating a sentence by looking at previously translated ones that were picked from different contexts.
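As an illustration, the three prompts and the two formats can be generated as follows. The template strings follow the prompts above; the helper names are hypothetical, and the exact layout of the batch prompt (one sentence per line) is my assumption of a reasonable way to do it.

```python
PROMPT_TEMPLATES = {
    "prompt_1": "Translate English to French:\n{source} =>",
    "prompt_2": "Translate to French:\n{source} =>",
    "prompt_3": "English\n{source}\nFrench",
}


def unit_prompts(template, sentences):
    """unit: one prompt, and therefore one API call, per sentence."""
    return [template.format(source=sentence) for sentence in sentences]


def batch_prompt(template, sentences):
    """batch: a single prompt containing all the sentences, one per line."""
    return template.format(source="\n".join(sentences))


sentences = [
    "Write-downs amounted to €64 million in 2015 (2014: €8 million).",
    "€500 banknote remains legal tender and will always retain its value",
]

for prompt in unit_prompts(PROMPT_TEMPLATES["prompt_1"], sentences):
    print(prompt)
    print("----------")
print(batch_prompt(PROMPT_TEMPLATES["prompt_3"], sentences))
```

None of the templates ends with a space, which avoids the trailing-space issue mentioned above.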
Results
With prompt_1 and unit:
Note: In the following example, “----------” indicates that what follows has been generated by another call to the API. I also manually added “Response:” to indicate where GPT-3’s response begins.
Translate English to French:
Today also marks the opening of the online "Euro Run" competition for children aged 9 to 12 living in the euro area. =>
Response:
Aujourd'hui, c'est aussi le début du concours "Euro Run" destiné aux enfants de 9 à 12 ans vivant dans l'euro zone.
----------
Translate English to French:
Unlike Denmark and the United Kingdom, Croatia does not have the right to opt out of adopting the single currency. =>
Response:
Contrairement à la Danemark et à la Grande-Bretagne, la Croatie ne dispose pas de droit de déclencher une optique pour la monnaie unique.
----------
Translate English to French:
When the number of NCB Governors exceeds 15 they will exercise a voting right on the basis of a rotation system. =>
Response:
Lorsque le nombre de gouverneurs de NCB dépasse 15, ils exerceront un droit de vote en fonction d'un système de rotation.
----------
Translate English to French:
Write-downs amounted to €64 million in 2015 (2014: €8 million). =>
Response:
Les écritures ont atteint 64 millions d'euros en 2015 (2014: 8 millions).
----------
Translate English to French:
€500 banknote remains legal tender and will always retain its value =>
Response:
Le 500 € billets de banque restent toujours valables et garderont toujours leur valeur.
----------
Translate English to French:
While there are fewer than 22 euro area Member States the rotation will take place in accordance with the following rules: =>
Response:
Les États membres de l'euro zone qui ne seront pas présents à la réunion du Conseil européen du 22 juin 2012 auront droit à une rotation annuelle.
Total completion tokens: 48+42+47+24+30+50 = 241
Total prompt tokens: 35+30+31+25+21+30 = 172
Total tokens: 83+72+78+49+51+80 = 413
Total cost: 0.002*413/1000 = $0.000826
COMET score: 20.74
With prompt_1 and batch:
Translate English to French:
Today also marks the opening of the online “Euro Run” competition for children aged 9 to 12 living in the euro area.
Unlike Denmark and the United Kingdom, Croatia does not have the right to opt out of adopting the single currency.
When the number of NCB Governors exceeds 15 they will exercise a voting right on the basis of a rotation system.
Write-downs amounted to €64 million in 2015 (2014: €8 million).
€500 banknote remains legal tender and will always retain its value
While there are fewer than 22 euro area Member States the rotation will take place in accordance with the following rules:
Response:
Le concours européen de course en ligne "Euro Run" pour les enfants de 9 à 12 ans vivant dans l'espace européen ouvre ses portes aujourd'hui.
Contrairement à la Norvège et au Royaume-Uni, la Croatie n'a pas le droit de choisir d'opter pour la monnaie unique.
Lorsque le nombre de gouverneurs de la Banque centrale européenne dépasse 15, ils exerceront un droit de vote en fonction d'un système de rotation.
Le montant des dépréciations atteint €64 millions en 2015 (2014: €8 millions).
Le 500 euros de la note banque reste toujours une monnaie légale et conserve toujours son prix.
Même si il y a moins de 22 États membres de l'espace européen, le rythme de rotation se fera selon les règles suivantes :
Total completion tokens: 256
Total prompt tokens: 136
Total tokens: 392
Total cost: 0.002*392/1000 = $0.000784
COMET score: 37.45
For prompt_2 and prompt_3, I observed the following.
Note: I don’t write here the prompts and outputs for these two other prompts to shorten this article. If you want to see them, I’ll post them in the comments below.
With prompt_2 and unit:
Total completion tokens: 60+43+46+24+24+52 = 249
Total prompt tokens: 34+29+30+24+20+29 = 166
Total tokens: 94+72+76+48+44+81 = 415
Total cost: 0.002*415/1000 = $0.000830
COMET score: 5.41
With prompt_2 and batch:
Total completion tokens: 220
Total prompt tokens: 136
Total tokens: 356
Total cost: 0.002*356/1000 = $0.000712
COMET score: 43.82
With prompt_3 and unit:
Total completion tokens: 50+43+46+28+20+67 = 254
Total prompt tokens: 33+28+29+23+19+28 = 160
Total tokens: 83+71+75+51+39+95 = 414
Total cost: 0.002*414/1000 = $0.000828
COMET score: -24.69
With prompt_3 and batch:
Total completion tokens: 227
Total prompt tokens: 135
Total tokens: 362
Total cost: 0.002*362/1000 = $0.000724
COMET score: 35.09
If we look at the COMET scores, it clearly appears that the “batch” configurations yield far better translations than the “unit” configurations. I confirmed it through manual evaluation.
My assumption is that the “batch” configuration gives more context to GPT-3 during generation. In other words, as the French context grows during generation, GPT-3 becomes more confident that it should generate French text.
As for the prompt, prompt_3 yields significantly lower results and doesn’t seem to perform the task for some sentences. This is intuitive since we removed “translate” from the prompt. The difference between the translations obtained with prompt_1 and prompt_2 is not significant according to my manual evaluation.
What COMET doesn’t tell you here is that all these translations are terrible. Some translations are partially correct, while others are fluent but completely inaccurate.
Zero-shot GPT-3 generates French translations that are far from the quality obtained with standard machine translation systems. Note: You may get somewhat better results at zero-shot translation if you use the Davinci model and when translating into English.
I presented it to show you that even if GPT-3 is not trained to do translation, it is actually able to translate if we provide it with a simple description of the task.
If we compare the cost, in dollars, between all the configurations, we can’t see much difference.
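To check this, the small script below recomputes the cost of each zero-shot configuration from the total token counts reported above, using the Curie price of $0.002 per 1,000 tokens used throughout this article. The variable names are mine.

```python
CURIE_PRICE_PER_1K = 0.002  # USD per 1,000 tokens

# Total tokens reported above for the 6 sentences, per (prompt, format).
zero_shot_total_tokens = {
    ("prompt_1", "unit"): 413,
    ("prompt_1", "batch"): 392,
    ("prompt_2", "unit"): 415,
    ("prompt_2", "batch"): 356,
    ("prompt_3", "unit"): 414,
    ("prompt_3", "batch"): 362,
}

costs = {
    config: CURIE_PRICE_PER_1K * tokens / 1000
    for config, tokens in zero_shot_total_tokens.items()
}

cheapest = min(costs, key=costs.get)
most_expensive = max(costs, key=costs.get)
ratio = costs[most_expensive] / costs[cheapest]

for config, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(config, f"${cost:.6f}")
print(f"Most expensive / cheapest: {ratio:.2f}")
```

The gap between the cheapest and the most expensive configuration stays below 20%, which is negligible at this scale.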
One-Shot Translation
In the one-shot configuration, we add one example of an English-to-French translation produced by a human translator.
Here, I only present the results with prompt_1 since the results with prompt_2 are very similar.
With prompt_1 and unit:
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
Today also marks the opening of the online "Euro Run" competition for children aged 9 to 12 living in the euro area. =>
Response:
Aujourd'hui, c'est aussi le début du concours « Euro Run » destiné aux enfants de 9 à 12 ans vivant dans la zone euro.
----------
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
Unlike Denmark and the United Kingdom, Croatia does not have the right to opt out of adopting the single currency. =>
Response:
Contrairement à la Danemark et à la Grande-Bretagne, la Croatie ne dispose pas de droit de décider de ne pas adopter le système monétaire unique.
----------
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
When the number of NCB Governors exceeds 15 they will exercise a voting right on the basis of a rotation system. =>
Response:
Lorsque le nombre de gouverneurs NCB dépasse 15, ils exerceront un droit de vote en fonction d'un système de rotation.
----------
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
Write-downs amounted to €64 million in 2015 (2014: €8 million). =>
Response:
Les écritures à terme atteignaient 64 millions d'euros en 2015 (2014: 8 millions).
----------
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
€500 banknote remains legal tender and will always retain its value =>
Response:
Le billet de 500 euros reste toujours légalement valable et gardera toujours son prix
----------
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d'élèves à travers la zone euro.
While there are fewer than 22 euro area Member States the rotation will take place in accordance with the following rules: =>
Response:
Bien que les 22 États membres de la zone euro soient les seuls participants, le tour de rotation se déroulera selon les règles suivantes :
Total completion tokens: 47+46+45+27+26+42 = 233
Total prompt tokens: 98+93+94+88+84+93 = 550
Total tokens: 145+139+139+115+110+135 = 783
Total cost: 0.002*783/1000 = $0.001566
COMET score: 49.01
With prompt_1 and batch:
Translate English to French:
The competition gives us a unique opportunity to have a direct dialogue with thousands of students across the euro area. => Le jeu-concours nous offre une possibilité unique de dialoguer directement avec des milliers d’élèves à travers la zone euro.
Today also marks the opening of the online “Euro Run” competition for children aged 9 to 12 living in the euro area.
Unlike Denmark and the United Kingdom, Croatia does not have the right to opt out of adopting the single currency.
When the number of NCB Governors exceeds 15 they will exercise a voting right on the basis of a rotation system.
Write-downs amounted to €64 million in 2015 (2014: €8 million).
€500 banknote remains legal tender and will always retain its value
While there are fewer than 22 euro area Member States the rotation will take place in accordance with the following rules: =>
Response:
Aujourd’hui aussi est marqué le lancement du concours « Euro Run » destiné aux enfants de 9 à 12 ans vivant dans la zone euro.
Comme la Norvège et le Royaume-Uni, la Croatie n'a pas le droit de choisir de ne pas adopter le système monétaire unique.
Lorsque le nombre de gouverneurs de la Banque centrale européenne dépasse 15, ils exerceront un droit de vote en fonction d'un système de rotation.
Les write-downs atteignirent €64 millions en 2015 (2014: €8 millions).
Le 500 euros de banknote reste toujours valable monnaie et gardera toujours sa valeur.
Environ 22 États membres de la zone euro ont adhéré à la monnaie unique, tandis que la rotation aura lieu selon les règles suivantes :
Total completion tokens: 240
Total prompt tokens: 200
Total tokens: 440
Total cost: 0.002*440/1000 = $0.000880
COMET score: 44.88
For both configurations, the one-shot translation is better than the zero-shot translation, as expected. GPT-3 better “understood” the task thanks to the translation example.
Note also that with one-shot translation, the unit configuration performed better than batch.
The translations improved a lot compared to the zero-shot configuration, but some of them strangely remind me of the translations that were generated by phrase-based statistical machine translation systems, 10 years ago.
Few-Shot Translation
For few-shot translation, I put 10 examples of translations (selected from the ECB corpus) in the prompt. This is the only difference from the one-shot prompt.
I found that the batch format returns strange results in this configuration, most notably failing to translate some of the sentences. This is probably because the prompt became very long. It seems that GPT-3 doesn’t properly find where the text to translate begins.
I won’t display the prompt and the response from GPT-3 since the examples used for few-shot learning would take a large space in this article.
With prompt_1 and unit:
Total completion tokens: 48+44+45+23+29+41 = 230
Total prompt tokens: 641+636+637+631+627+636 = 3,808
Total tokens: 689+680+682+654+656+677 = 4,038
Total cost: 0.002*4038/1000 = $0.008076
COMET score: 59.39
As expected, we have another improvement in the translation quality. But I found it is still not as good as a standard machine translation system.
Note that I only use a few examples here. To obtain the translation quality mentioned by the authors of GPT-3, you would need to use at least 30 examples of translations. Moreover, you can further improve the results by switching to Davinci.
However, note that the cost has already increased a lot compared to the zero-shot configuration. Translating 6 sentences costs $0.008076, in other words, roughly 10 times more than with the zero-shot configurations. This is approximately $1.3 for 1,000 sentences. If you switch to Davinci, you have to multiply the cost by 10, i.e., $13 for 1,000 sentences.
This is because we included the translation examples for each sentence to translate. It’s not ideal, but this is the only way to obtain reasonably good results without fine-tuning GPT-3.
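The projection above can be reproduced with a few lines. This is a sketch using the token counts I reported for the few-shot run; the Davinci price is simply 10 times the Curie price, as stated above, and the function name is hypothetical.

```python
CURIE_PRICE_PER_1K = 0.002                     # USD per 1,000 tokens
DAVINCI_PRICE_PER_1K = 10 * CURIE_PRICE_PER_1K  # 10 times the Curie price

# Few-shot run with prompt_1 and unit: totals over the 6 sentences.
FEW_SHOT_PROMPT_TOKENS = 3808
FEW_SHOT_TOTAL_TOKENS = 4038
NUM_SENTENCES = 6


def cost_per_1000_sentences(total_tokens, num_sentences, price_per_1k):
    """Extrapolate the cost of a small run to 1,000 sentences of similar length."""
    run_cost = price_per_1k * total_tokens / 1000
    return run_cost / num_sentences * 1000


curie_cost = cost_per_1000_sentences(FEW_SHOT_TOTAL_TOKENS, NUM_SENTENCES, CURIE_PRICE_PER_1K)
davinci_cost = cost_per_1000_sentences(FEW_SHOT_TOTAL_TOKENS, NUM_SENTENCES, DAVINCI_PRICE_PER_1K)
prompt_share = FEW_SHOT_PROMPT_TOKENS / FEW_SHOT_TOTAL_TOKENS

print(f"Curie:   ${curie_cost:.2f} per 1,000 sentences")
print(f"Davinci: ${davinci_cost:.2f} per 1,000 sentences")
print(f"Share of the bill due to the prompt: {prompt_share:.0%}")
```

The last line shows where the money goes: the vast majority of the billed tokens are in the prompt, mostly the repeated translation examples.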
What About Fine-Tuning?
OpenAI proposes to fine-tune GPT-3 through the API.
Fine-tuning GPT-3 on translations would surely yield significantly better translations.
However, I wouldn’t recommend it unless you have the specific need to use GPT-3.
Fine-tuning GPT-3 and then using the fine-tuned model is more expensive and you would likely have to try several configurations before obtaining better results than with few-shot translation.
Conclusion
To conclude, it is neither easy nor cheap to obtain good translations from GPT-3, but it is definitely possible.
If you have a large enough budget, I would recommend using GPT-3 for translating specific domains and languages that are not well covered by standard machine translation systems. For such use cases, you may get better translations than with any other machine translation system.
Note that I wouldn’t consider GPT-3 as a machine translation system unless you fine-tune it to be a machine translation system.
GPT-3 is free in its interpretation of the prompt. This freedom has some consequences. GPT-3 may:
Output the wrong language
Skip sentences to translate
Not perform translation at all
To avoid these pitfalls, selecting the right prompt and providing useful translation examples is crucial.
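Some of these failures can also be caught after the fact with simple sanity checks. The sketch below is a crude heuristic of my own, not a feature of the API: it assumes that a batch response contains one translation per line, and it flags outputs that merely copy the source.

```python
def missing_translations(sources, response_text):
    """Number of source sentences without a matching output line.

    Heuristic: assumes GPT-3 returns one translation per non-empty line.
    """
    output_lines = [line for line in response_text.splitlines() if line.strip()]
    return max(len(sources) - len(output_lines), 0)


def looks_untranslated(source, translation):
    """True if the output is just a copy of the source (case and spacing aside)."""
    return source.strip().lower() == translation.strip().lower()


sources = [
    "The cat is in the kitchen.",
    "I am in the kitchen.",
    "The dog is in the garden.",
]
# Hypothetical batch response in which one sentence was skipped.
response_text = "Le chat est dans la cuisine.\n\nJe suis dans la cuisine."

print(missing_translations(sources, response_text))
print(looks_untranslated("The cat is in the kitchen.", "the cat is in the kitchen. "))
```

These checks won’t tell you whether a translation is good, only that something obviously went wrong and that the request should be inspected or resent.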