Recent base large language models (LLMs) are pre-trained on trillions of tokens. The pre-training data are usually raw text extracted from the Web, without targeting any specific domain or task. In contrast, fine-tuning a base LLM requires much less data and relies on data targeting specific tasks or domains.
“Continued” pre-training is yet another step that can be executed between pre-training and fine-tuning. Continued pre-training is especially helpful when we want to teach a pre-trained LLM a new language, or a very specific domain for which we have millions of tokens. You can see it as fine-tuning, but without any particular task in mind.
In this article, I show how to continue pre-training LLMs. We will review the main differences between fine-tuning and continued pre-training. I use Llama 3 8B and a recipe proposed by Unsloth that makes continued pre-training possible on consumer hardware.
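As a preview of the approach, here is a minimal sketch of how such a run can be set up with Unsloth. The checkpoint name and hyperparameters below are illustrative assumptions, not the exact values used later in the article:

```python
# A minimal sketch, not the full recipe: load Llama 3 8B with Unsloth
# and attach LoRA adapters for continued pre-training.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # 4-bit quantized base model
    max_seq_length=2048,
    load_in_4bit=True,  # keeps memory usage low enough for consumer GPUs
)

# For continued pre-training, Unsloth recommends making the token
# embeddings and the language modeling head trainable as well, not
# only the attention and MLP layers. Hyperparameters are illustrative.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # reduces activation memory
)
```

The resulting model can then be trained on raw text, exactly like during pre-training, which is what the rest of the article walks through.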
Code examples of continued pre-training with Llama 3 are available in this notebook: