Simple and Quick Fine-Tuning of Falcon Models with QLoRA
A one-command tool to adapt a Falcon model to your data
The Falcon models are among the most popular large language models right now, for several reasons:
They are very good, especially at problem-solving
They are smaller than many other LLMs while performing better
They are entirely free (Apache 2.0 license)
They are available in several versions, including instruct versions that mimic the behavior of ChatGPT
With recent techniques like QLoRA, you can fine-tune Falcon models on consumer hardware. I’ve already discussed QLoRA and Falcon fine-tuning in previous articles.
Fine-tuning Falcon models with QLoRA is relatively easy with the Hugging Face libraries. Yet, there is an even easier way that requires less coding: Falcontune.
Falcontune is an open-source project (Apache 2.0 license) developed by Rumen Mihaylov. We can read on the project page:
falcontune allows finetuning FALCONs (e.g., falcon-40b-4bit) on as little as one consumer-grade A100 40GB
Fine-tuning a 40B-parameter model on 40 GB of VRAM sounds great. “4bit” tells us the base model is quantized to 4-bit and only small LoRA adapters are trained on top of it, i.e., a QLoRA-style setup. But I wouldn’t call the A100 40GB a “consumer-grade” GPU: that’s still a $5,000+ GPU. On the other hand, the 7B-parameter version of Falcon that we will use here definitely fits on a consumer GPU, e.g., an RTX 3060 with 12 GB of VRAM (at 4-bit, the 7B weights alone take only about 3.5 GB).
Fine-tuning Falcon-7B and Falcon-40B with one command line
Note: The following commands are written for Falcon-7B. Replace “7b” with “40b” (and use the corresponding 40B model file) if you want to run them for Falcon-40B.
Requirements
I ran and tested everything on a free Google Colab instance.
We first need to get Falcontune:
git clone https://github.com/rmihaylov/falcontune
Then install it and all its dependencies:
cd falcontune
pip install -r requirements.txt
python setup.py install
Finally, we will need the Falcon model itself. I used the 4-bit GPTQ version of Falcon-7B Instruct quantized by TheBloke for this article:
wget https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ/resolve/main/gptq_model-4bit-64g.safetensors
(The 40B version is here: https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ/resolve/main/gptq_model-4bit--1g.safetensors)
Let’s also get a toy dataset:
wget https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json
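Before launching a long fine-tuning run, it helps to peek at the data and, optionally, carve out a small subset for a quick test. Here is a minimal sketch in plain Python (the subset file name is just an example):
import json

# Load the Alpaca dataset we just downloaded
with open("alpaca_data_cleaned.json") as f:
    data = json.load(f)

print(len(data), "examples")
print(data[0])  # each record has "instruction", "input", and "output" fields

# Optional: keep only the first 1,000 examples for a quick test run
with open("alpaca_data_small.json", "w") as f:
    json.dump(data[:1000], f, indent=2)
If you create such a subset, point the --dataset argument (shown below) at it instead of the full file.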
And we are now ready.
The command line for fine-tuning
We ran “setup.py install” earlier, which gave us the “falcontune” command.
You simply need to run the following command to fine-tune Falcon-7B on the Alpaca data:
falcontune finetune \
--model=falcon-7b-instruct-4bit \
--weights=./gptq_model-4bit-64g.safetensors \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-instruct-4bit-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]' \
--backend=triton
This should be quite slow: about 24 hours on a free Google Colab instance, split across two runtimes since Colab disconnects after 12 hours. The Alpaca dataset is large, so you may want to reduce its size for testing, for instance with the small subset created above. Thanks to LoRA, we actually fine-tune only 2,359,296 parameters.
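The 2,359,296 figure is easy to sanity-check. With --lora_r=8 and --target_modules='["query_key_value"]', LoRA adds two small matrices around the fused query_key_value projection of each of Falcon-7B's 32 layers. Assuming the published Falcon-7B configuration (hidden size 4544, 71 query heads plus one shared key/value head of dimension 64, hence a fused output of 4672), the arithmetic works out:
# Back-of-the-envelope count of LoRA trainable parameters for Falcon-7B
n_layers = 32    # transformer layers in Falcon-7B
d_in = 4544      # hidden size (input of query_key_value)
d_out = 4672     # fused query/key/value output: 71 * 64 + 2 * 64
r = 8            # --lora_r

# LoRA adds A (r x d_in) and B (d_out x r) per targeted module
trainable = n_layers * (r * d_in + d_out * r)
print(trainable)  # 2359296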
If you want to use your own dataset, have a look at “alpaca_data_cleaned.json” to see the data format expected by falcontune; a minimal sketch follows.
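An Alpaca-style file is a single JSON list of records, each with "instruction", "input" (possibly empty), and "output" fields. Here is a small sketch for writing your own dataset in that format (the two examples are made up):
import json

# Two toy records in the Alpaca format expected with --data_type=alpaca
my_data = [
    {
        "instruction": "Summarize the following text.",
        "input": "Falcon is a family of open large language models released under the Apache 2.0 license.",
        "output": "Falcon is a family of open LLMs with a permissive license.",
    },
    {
        "instruction": "How do you prepare pasta?",
        "input": "",
        "output": "Boil salted water, cook the pasta until al dente, then drain and serve.",
    },
]

with open("my_dataset.json", "w") as f:
    json.dump(my_data, f, indent=2)
Then pass --dataset=./my_dataset.json to the fine-tuning command above, keeping --data_type=alpaca.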
During fine-tuning, CPU RAM and GPU VRAM consumption peaked at 4.0 GB and 8.3 GB, respectively. This is a very affordable configuration for homemade fine-tuning.
Remember that if you use the 40B version of Falcon, you will need a much bigger machine.
To test inference you can run:
falcontune generate \
--interactive \
--model=falcon-7b-instruct-4bit \
--weights=./gptq_model-4bit-64g.safetensors \
--lora_apply_dir falcon-7b-instruct-4bit-alpaca/ \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?" \
--backend triton
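The directory falcon-7b-instruct-4bit-alpaca/ contains only the small LoRA weights. If falcontune saves them as a standard PEFT adapter, which I assume here but have not verified, you should also be able to apply them to the full-precision Falcon base outside of falcontune with Hugging Face's peft library, for example:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original (non-quantized) Falcon-7B Instruct base model
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")

# Apply the LoRA adapter produced by falcontune (assumption: standard PEFT format)
model = PeftModel.from_pretrained(base, "./falcon-7b-instruct-4bit-alpaca/")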
And that’s it! You now have a very cheap chat model on your machine.