The embedding model plays a key role in many applications, such as Retrieval-Augmented Generation (RAG) for large language models (LLMs). In a RAG system, it encodes both the knowledge base and the user query. I explained the RAG concept in this article:
Using an embedding model trained or fine-tuned on the same domain as the LLM can greatly improve a RAG system. With LLM2Vec, we can extract an embedding model directly from the LLM. On its own, this extracted model is inaccurate, but we can improve it with a two-stage training procedure combining masked next-token prediction (MNTP) and contrastive learning. We saw how to do this in previous articles with Llama 3 8B. However, because Llama 3 8B is a large model, it produces high-dimensional text embeddings, which can be costly to train on and deploy in downstream tasks.
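To make this concrete, here is a minimal sketch of loading an extracted model and encoding text with the llm2vec library. The pooling mode and sequence length are illustrative choices, and the checkpoint is the plain Llama 3.2 1B base model, i.e., the "inaccurate" starting point before the two training stages covered later in this article:

```python
# Minimal sketch: extract a text encoder from Llama 3.2 1B with llm2vec.
# Assumptions: the llm2vec package is installed and you have access to the
# "meta-llama/Llama-3.2-1B" checkpoint on the Hugging Face Hub.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "meta-llama/Llama-3.2-1B",   # base LLM, not yet trained as an encoder
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    pooling_mode="mean",          # average token vectors into one embedding
    max_length=512,
)

# Encode a small batch of sentences; each row is one embedding vector.
embeddings = l2v.encode([
    "LLM2Vec turns a decoder-only LLM into a text encoder.",
    "Retrieval-Augmented Generation needs good embeddings.",
])
print(embeddings.shape)  # (2, 2048): Llama 3.2 1B has a 2048-dim hidden size
```

Note how the smaller model pays off here: the 2048-dimensional vectors are much cheaper to store and search than the 4096-dimensional embeddings produced by Llama 3 8B.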
In this article, we will see how to make text embeddings from Llama 3.2 1B. We will go through all the steps in detail: masked next-token prediction training, contrastive learning, and evaluation of the resulting embeddings. I used an RTX 3090 from RunPod (currently $0.22/hour) (referral link) for the training steps and the evaluation.
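For the evaluation step, one common option is the MTEB library (Massive Text Embedding Benchmark). The sketch below is a hedged example, not the article's exact setup: the task names, the output folder, and the thin `L2VWrapper` adapter are assumptions for illustration, and `l2v` is the model from the previous snippet:

```python
# Hedged sketch: evaluate the embedding model on a couple of MTEB tasks.
# Assumes the mteb package is installed and `l2v` is an LLM2Vec model.
import mteb

class L2VWrapper:
    """Thin adapter so MTEB can call encode() with its extra kwargs."""
    def __init__(self, model):
        self.model = model

    def encode(self, sentences, **kwargs):
        # LLM2Vec's encode takes the sentence list; extra MTEB kwargs are
        # dropped, and the torch tensor is converted to a NumPy array.
        return self.model.encode(sentences).cpu().numpy()

# Example tasks; any MTEB task list works here.
tasks = mteb.get_tasks(tasks=["STSBenchmark", "Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Scores are written as JSON files under the output folder.
results = evaluation.run(L2VWrapper(l2v), output_folder="results/llama-3.2-1b")
```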
The notebook showing how to turn Llama 3.2 into an embedding model is here: