RAG for Mistral 7B Instruct with LlamaIndex and Transformers

RAG on budget

Benjamin Marie
Mar 25, 2024
Generated with DALL-E

While techniques like LoRA and QLoRA have dramatically reduced the cost of fine-tuning LLMs, repeatedly fine-tuning new adapters to keep an LLM's knowledge up-to-date is impractical. In the extreme case where the knowledge base is continuously updated, keeping the LLM's knowledge current through fine-tuning becomes impossible.

We could instead include all the knowledge unseen during fine-tuning in the LLM's context and instruct the model to exploit it when appropriate. Even though recent techniques such as LongLoRA, FlashAttention, and LongRoPE improve the efficiency and accuracy of LLMs on very long contexts, e.g., more than 2 million tokens with LongRoPE, handling long prompts is costly and significantly slows down inference.

Related reading: Use FlashAttention-2 for Faster Fine-tuning and Inference (The Kaitchup, November 16, 2023).

Instead, retrieval-augmented generation (RAG) retrieves, for a given query, only the relevant information from a database and injects it into the LLM's context. This keeps the context relatively short and helps the LLM focus on what is relevant.
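
To make the idea concrete, here is a deliberately naive sketch of the retrieve-then-generate pattern. The `retrieve` and `build_prompt` helpers are hypothetical illustrations; a real system, like the LlamaIndex setup previewed below, retrieves with vector embeddings rather than word overlap.

```python
# Toy retrieve-then-generate sketch (illustrative only, not the article's code).

def retrieve(query: str, store: dict[str, str], top_k: int = 3) -> list[str]:
    # Naive lexical retriever: rank stored passages by word overlap with the query.
    scores = {
        doc_id: len(set(query.lower().split()) & set(text.lower().split()))
        for doc_id, text in store.items()
    }
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [store[doc_id] for doc_id in best]

def build_prompt(query: str, passages: list[str]) -> str:
    # Only the retrieved passages are injected, so the context stays short.
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = {
    "release-notes": "Version 2.1 adds streaming responses and fixes a memory leak.",
    "faq": "The API is free for personal use, up to 1,000 requests per day.",
}
question = "What is new in version 2.1?"
prompt = build_prompt(question, retrieve(question, docs, top_k=1))
# `prompt` is what gets sent to the LLM, instead of the entire knowledge base.
```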


In this article, I explain the basics of RAG. Then, I show how to set up a simple RAG system using Mistral 7B Instruct as the base model, with LlamaIndex and Hugging Face Transformers. Thanks to GPTQ quantization, you can reproduce my setup on consumer hardware.

The notebook implementing RAG for Mistral 7B is available here:

Get the notebook (#55)
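
As a rough preview of that setup, here is a minimal sketch assuming LlamaIndex v0.10+ with its Hugging Face LLM and embedding integrations. The GPTQ checkpoint, embedding model, document folder, and generation settings are illustrative assumptions, not taken from the notebook.

```python
# Hedged sketch: a GPTQ-quantized Mistral 7B Instruct serving a LlamaIndex RAG pipeline.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Assumed GPTQ checkpoint; Transformers loads it through the auto-gptq/optimum backend.
llm = HuggingFaceLLM(
    model_name="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    tokenizer_name="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    context_window=4096,        # conservative window; the model supports more
    max_new_tokens=256,
    generate_kwargs={"do_sample": True, "temperature": 0.7},
    device_map="auto",
)

# Local embedding model (assumed choice) so retrieval also runs on consumer hardware.
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index a folder of documents, then query it: LlamaIndex embeds the question,
# retrieves the top-k most similar chunks, and injects only those chunks into
# Mistral's prompt before generating the answer.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What does the knowledge base say about pricing?"))
```

The key point is that only the retrieved chunks reach the model, so the prompt stays short even as the document folder grows.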
