With the recent releases of ChatGPT and GPT-4, GPT models have drawn a lot of interest from the scientific community. These new versions of OpenAI’s GPT models are so powerful and versatile that it may take a lot of time before we can exploit their full potential.
Even though they are impressive, what you may not know is that the main ideas and algorithms behind GPT models are far from new.
Whether you are a seasoned data scientist or just someone curious about GPT, knowing how GPT models evolved gives particularly useful insight into the impact of data and into what to expect in the coming years.
In this article, I explain how GPT models became what they are today. I’ll mainly focus on how OpenAI scaled GPT models over the years. I’ll also give some pointers if you want to get started using GPT models.
Generative pre-trained language models
GPT models are language models.
Language models have existed for more than 50 years.
The first generation of language models was “n-gram based”. They modeled the probability of a word given some previous words.
For instance, if you have the sentence:
The cat sleeps in the kitchen.
With n=3, a 3-gram language model can give you the probability of seeing “in” after “cat sleeps”.
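To make this more concrete, here is a minimal sketch in Python (the language I’ll use for all the code examples in this article) of how a tiny 3-gram model estimates this probability from raw counts. The toy corpus is of course made up.

```python
from collections import Counter

# A toy corpus; a real n-gram model would be estimated on millions of sentences.
corpus = [
    "the cat sleeps in the kitchen",
    "the cat sleeps in the garden",
    "the dog sleeps in the kitchen",
]

trigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))
    trigrams.update(zip(words, words[1:], words[2:]))

# P("in" | "cat sleeps") = count("cat sleeps in") / count("cat sleeps")
p = trigrams[("cat", "sleeps", "in")] / bigrams[("cat", "sleeps")]
print(p)  # 1.0 on this toy corpus
```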
n-gram models remained useful in many natural language and speech processing tasks until the beginning of the 2010s.
They suffer from several limitations. Their computational cost increases dramatically with higher n, so these models were often limited to n=5 or lower.
Then, thanks to neural networks and more powerful machines, this main limitation was alleviated: it became possible to compute probabilities over much longer contexts, for instance n=20 or higher.
Generating text with these models was also possible, but their output was of such poor quality that they were rarely used for this purpose.
Then, in 2018, OpenAI proposed the first GPT model.
GPT stands for “Generative Pre-trained Transformer”. “Pre-trained” means that the model was simply trained on a large amount of text to model word probabilities, with no purpose other than language modeling. GPT models can then be fine-tuned, i.e., further trained, to perform more specific tasks.
For instance, you can use a small dataset of news summaries to obtain a GPT model very good at news summarization. Or fine-tune it on French-English translations to obtain a machine translation system capable of translating from French to English.
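To give you an idea of what fine-tuning looks like in practice, here is a hedged sketch using the Hugging Face transformers library and the publicly released GPT-2 (as we will see, the first GPT was never released). The file news_summaries.txt is hypothetical, and this is not OpenAI’s own fine-tuning procedure.

```python
# pip install transformers datasets
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # the publicly released 124M-parameter GPT-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical dataset: one news summary per line.
dataset = load_dataset("text", data_files={"train": "news_summaries.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-news", num_train_epochs=3),
    train_dataset=tokenized["train"],
    # The collator builds the language-modeling labels from the input tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```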
Note: The term “pre-training” suggests that the models are not fully trained and that another step is needed. With recent models, the need for fine-tuning tends to disappear. The pre-trained models are now directly used in applications.
GPT models are now very good at almost all natural language processing tasks. I particularly studied their ability to do machine translation, as you can read in the following article:
The scale of their training and the Transformer neural network architecture they exploit are the main reasons why they can generate such fluent text.
Since 2018 and the first GPT, several versions and subversions of GPT followed.
4 versions and many more subversions
GPT and GPT-2
GPT-2 came out only a few months after the first GPT was announced. Note: The term “GPT” was never mentioned in the scientific paper describing the first GPT. Arguably, we could say that “GPT-1” never existed. To the best of my knowledge, it was also never released.
What is the difference between GPT and GPT-2?
The scale. GPT-2 is much larger than GPT.
GPT was trained on BookCorpus, which contains about 7,000 books. The model has 120 million parameters.
What’s a parameter?
A parameter is a value learned during model training. A model with more parameters is bigger and, typically, more capable.
120 million was a huge number in 2018.
With GPT-2, OpenAI proposed an even bigger model containing 1.5 billion parameters.
It was trained on an undisclosed corpus called WebText. This corpus is 10 times larger than BookCorpus (according to the paper describing GPT-2).
OpenAI gradually released 4 versions of GPT-2:
small: 124 million parameters
medium: 355 million parameters
large: 774 million parameters
xl: 1.5 billion parameters
They are all publicly available and can be used in commercial products.
While GPT-2-XL excels at generating fluent text in the wild, i.e., without any particular instructions or fine-tuning, it remains far less powerful than more recent GPT models for specific tasks.
The release of GPT-2-XL was the last open release of a GPT model by OpenAI. GPT-3 and GPT-4 can only be used through OpenAI’s API.
GPT-3
GPT-3 was announced in 2020. With its 175 billion parameters, it was an even bigger jump from GPT-2 than GPT-2 was from the first GPT.
Today, there are 7 GPT-3 models available through OpenAI’s API, but we know very little about them.
With GPT-3, OpenAI demonstrated that GPT models can be extremely good at specific language generation tasks if users provide a few examples of the task they want the model to perform.
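This is what is now called few-shot prompting. Here is a hedged sketch of what such a prompt could look like with the openai Python package (pre-1.0 interface); the model name, the examples, and the settings are only illustrative.

```python
import openai  # pip install openai

openai.api_key = "YOUR_API_KEY"

# A few-shot prompt: we show the model two examples of the task
# (here, French-English translation) before asking it to complete a new one.
prompt = (
    "French: Bonjour, comment allez-vous ?\nEnglish: Hello, how are you?\n\n"
    "French: J'aime le café.\nEnglish: I like coffee.\n\n"
    "French: Le chat dort dans la cuisine.\nEnglish:"
)

response = openai.Completion.create(
    model="davinci",  # the largest base GPT-3 model exposed by the API
    prompt=prompt,
    max_tokens=50,
)
print(response["choices"][0]["text"].strip())
```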
GPT-3.5
With the GPT-3 models running in the API and attracting more and more users, OpenAI could collect a very large dataset of user inputs.
They exploited these inputs to further improve their models.
They used a technique called reinforcement learning from human feedback (RLHF). I won’t explain the details here but you can find them in a blog post published by OpenAI.
In a nutshell, thanks to RLHF, GPT-3.5 is much better at following user instructions than GPT-3. OpenAI denotes this class of GPT models as “InstructGPT”.
With GPT-3.5, you can “prompt” the model to perform a specific task without giving it any examples of the task. You just have to write the “right” prompt to get the best result. This is where “prompt engineering” becomes important and why skilled prompt engineers are receiving incredible job offers.
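To contrast with the few-shot sketch above, here is what an instruction-only prompt could look like with a GPT-3.5 model through the chat completions endpoint (again a hedged sketch with the pre-1.0 openai package): no example of the task is given, only the instruction.

```python
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # A single instruction, without any example of the task.
        {"role": "user",
         "content": "Translate into English: Le chat dort dans la cuisine."},
    ],
)
print(response["choices"][0]["message"]["content"])
```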
GPT-3.5 is the current model used to power ChatGPT.
GPT-4
GPT-4 was released in March 2023.
We know almost nothing about its training.
The main difference with GPT-3/GPT-3.5 is that GPT-4 is multimodal: it can take both images and text as input.
It can generate text but won’t directly generate images. Note: GPT-4 can generate the code that can generate an image, or retrieve one from the Web.
At the time of writing these lines, GPT-4 is still in a “limited beta”.
ChatGPT
ChatGPT is just a user interface with chat functionalities. When you write something with ChatGPT, it’s a GPT-3.5 model that generates the answer.
A particularity of ChatGPT is that it doesn’t take only the user’s current query as input, as an out-of-the-box GPT model would. To work properly as a chat engine, ChatGPT must keep track of the conversation: what has been said, what the user’s goal is, etc.
OpenAI didn’t disclose how it does that. Given that GPT models can only accept a prompt of limited length (I’ll explain this later), ChatGPT can’t simply concatenate all the dialogue turns into the same prompt: such a prompt would quickly become too large for GPT-3.5 to handle.
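A simple, admittedly simplistic, way to build such a chat engine yourself on top of the API is to resend the conversation history with every request and to drop the oldest turns once the history gets too long. Here is a hedged sketch; the truncation strategy is only an assumption, not how ChatGPT actually works.

```python
import openai

openai.api_key = "YOUR_API_KEY"

history = [{"role": "system", "content": "You are a helpful assistant."}]
MAX_MESSAGES = 20  # naive guard against exceeding the prompt length limit

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Naively drop the oldest user/assistant exchange (keeping the system message)
    # so that the prompt stays reasonably small.
    if len(history) > MAX_MESSAGES:
        del history[1:3]
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("Who proposed the first GPT model?"))
print(chat("And when was it announced?"))  # relies on the previous turns
```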
How to use GPT models?
You can easily get GPT-2 models online and use them on your computer, as shown in the sketch below. If you want to run large language models on your machine, you may be interested in reading my tutorial:
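For instance, with the Hugging Face transformers library, generating text with GPT-2 only takes a few lines. This is a sketch; the prompt is just an example.

```python
# pip install transformers torch
from transformers import pipeline

# "gpt2" is the 124M-parameter version; "gpt2-xl" is the 1.5B-parameter one.
generator = pipeline("text-generation", model="gpt2")

output = generator("The history of language models began", max_length=50)
print(output[0]["generated_text"])
```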
For GPT-3 and GPT-3.5, we have no choice but to use OpenAI’s API. You will first need to create an OpenAI account on their website.
Once you have an account, you can start playing with the models inside the “playground”, a sandbox that OpenAI provides to experiment with the models. You can access it only when you are logged in.
If you want to directly use the models in your application, OpenAI and the open-source community offer libraries in many languages, such as Python, Node.js, and PHP, to call the models through the OpenAI API.
You can create and get your OpenAI API key in your OpenAI account. Note: Keep this key secret. Anyone who has it can consume your OpenAI credits.
Each model has different settings that you can adjust. Be aware that GPT models are non-deterministic: if you send the same prompt twice, there is a high chance you will get two similar but different answers.
Note: If you want to reduce the variations between answers given the same prompt, you can set the model’s “temperature” parameter to 0. As a side effect, it will also significantly decrease the diversity of the answers; in other words, the generated text may be more repetitive.
You will also have to pay attention to the “maximum context length”. This is the combined length of your prompt and of the answer generated by GPT. For instance, GPT-3.5-turbo has a maximum context length of 4,096 tokens.
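Here is a hedged sketch of a call that sets the temperature and caps the length of the answer, again with the pre-1.0 openai package; the prompt and the values are only illustrative.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # keep this key secret

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Summarize the history of GPT models in two sentences."}],
    temperature=0,   # reduces the variation between runs (and the diversity)
    max_tokens=200,  # upper bound on the length of the answer, in tokens
)
print(response["choices"][0]["message"]["content"])
```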
A token is not a word.
A token is the minimal unit of text used by GPT models to generate text. Yes, GPT models are not exactly word generators but rather token generators. A token can be a character, a piece of a word, a word, or even a sequence of words for some languages.
OpenAI gives an example in the API documentation.
"ChatGPT is great!"
is encoded into six tokens:["Chat", "G", "PT", " is", " great", "!"]
.
As a rule of thumb, count that 750 English words yield roughly 1,000 tokens.
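If you want to count the tokens of a text yourself, OpenAI provides the tiktoken Python library. A minimal sketch:

```python
# pip install tiktoken
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

tokens = encoding.encode("ChatGPT is great!")
print(len(tokens))                              # number of tokens in the text
print([encoding.decode([t]) for t in tokens])   # the tokens as strings
```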
In my opinion, managing the “maximum context length” is the most tedious part of working with the OpenAI API. First, you have to tokenize your prompt just to know how many tokens it contains. Then, you can’t know in advance how many tokens the model’s answer will contain.
You have to guess. And you can only guess right if you have some experience with the models. I recommend experimenting with them a lot to better gauge how long the answers to your prompts can be.
If your prompt is too long, the answer will be cut off.
I won’t give more details about the API here as it can become quite technical.
Limitations of GPT models
GPT models are only token generators trained on the Web. They are biased by the content they were trained on and thus cannot be considered fully safe.
Since GPT-3.5, OpenAI has trained its models to avoid answering with harmful content. Because this “self-moderation” is itself obtained with machine learning techniques, it can’t be 100% trusted.
This self-moderation may work for a given prompt, but may then completely fail after changing just one word in that prompt.
I also recommend reading the Terms of Use of OpenAI products. In this document, the limitations of GPT models appear more clearly in my opinion.
If you plan to build your application with the API, you should particularly pay attention to this point:
You must be at least 13 years old to use the Services. If you are under 18 you must have your parent or legal guardian’s permission to use the Services. If you use the Services on behalf of another person or entity, you must have the authority to accept the Terms on their behalf. You must provide accurate and complete information to register for an account. You may not make your access credentials or account available to others outside your organization, and you are responsible for all activities that occur using your credentials.
Italy temporarily banned ChatGPT because it may generate inappropriate answers for people under 18, among other reasons.
If you are a developer building an application on top of OpenAI API, you must check the age of your users.
OpenAI also published a list of usage policies pointing out all the prohibited uses of the models.
Conclusion
GPT models are very simple models and their architecture hasn’t evolved much since 2018. But when you train a simple model at a large scale on the right data and with the right hyperparameters, you can get extremely powerful AI models such as GPT-3 and GPT-4.
They are so powerful that we have not yet come close to exploring their full potential.
While recent GPT models are not open-source, they remain easy to use with OpenAI’s API. You can also play with them through ChatGPT.