Microsoft released phi-1.5, a new large language model (LLM) with 1.3 billion parameters.
It’s 5.4 times smaller than the smallest Llama 2 model (Llama 2 7B). Yet, according to the evaluation conducted by Microsoft and published on arXiv, phi-1.5 significantly outperforms Llama 2 on several tasks.
Given its relatively small size and the claimed performance, phi-1.5 is a good candidate LLM for affordable AI.
In this article, we will see what could explain this performance: how the model was trained and what training data was used. I also show how to fine-tune, quantize, and run the model, and I benchmark its memory consumption and inference speed.
phi-1.5: The Power of Distillation
In the paper describing phi-1.5, Microsoft presents 3 models trained on different datasets:
phi-1.5: They have only released this model.
phi-1.5-web: Trained on the same data as phi-1.5, augmented with a heavily curated dataset crawled from the web.
phi-1.5-web-only: Trained only on the heavily curated dataset crawled from the web.
Microsoft didn’t release phi-1.5-web and phi-1.5-web-only. I think they only trained and evaluated them as contrastive experiments, to show that augmenting phi-1.5’s training data with web data isn’t necessary.
First, let’s have a closer look at the claimed performance of the models:
For almost all the benchmarks they tried, phi-1.5 appears to perform better than Llama 2 7B.
How is this possible?
The model is much smaller but achieves better performance. Its neural architecture is quite common, except for the use of mixformer:
{
  "_name_or_path": "phi-1.5-half",
  "activation_function": "gelu_new",
  "architectures": [
    "MixFormerSequentialForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_mixformer_sequential.MixFormerSequentialConfig",
    "AutoModelForCausalLM": "modeling_mixformer_sequential.MixFormerSequentialForCausalLM"
  },
  "embd_pdrop": 0.0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "mixformer-sequential",
  "n_embd": 2048,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary_dim": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.32.1",
  "vocab_size": 51200
}
As for the training hyperparameters, they didn’t even bother with warm-up steps:
We train phi-1.5 starting from random initialization with constant learning rate 2e-4 (no warm up), weight decay 0.1. We use Adam optimizer with momentum 0.9, 0.98, and epsilon 1e−7. We use fp16 with DeepSpeed ZeRO Stage 2 [RRRH20]. We use batch size 2048, and train for 150B tokens, with 80% from the newly created synthetic data and 20% from phi-1 ’s training data.
Extract from the technical report
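For reference, these settings translate roughly into the following optimizer configuration. This is only an illustrative sketch: Microsoft’s training code is not public, and the stand-in network below is just a placeholder.

import torch

# Stand-in network; the actual phi-1.5 training code is not public.
model = torch.nn.Linear(2048, 2048)

# Optimizer settings as reported: constant LR 2e-4 (no warm-up), Adam(0.9, 0.98),
# epsilon 1e-7, weight decay 0.1.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.98),
    eps=1e-7,
    weight_decay=0.1,
)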
The only remaining possible source of this surprising performance is the training data. The training dataset could be of extremely good quality, very relevant to the evaluation tasks used in the benchmarks, or both.
Unfortunately, Microsoft didn’t release the training data. We only know that it’s small and almost exclusively synthetic.
Indeed, the data only contains 30 billion tokens, against 2 trillion tokens for the training data of Llama 2. Since training runs for 150 billion tokens, the model sees this data roughly five times. The data was generated, but Microsoft doesn’t say exactly how.
The fact that phi-1.5 outperforms a much larger model while being trained on a small synthetic dataset suggests a case of knowledge distillation: Microsoft likely used a much bigger model to generate this dataset.
We already know that knowledge distillation works very well for LLMs. Alpaca and Vicuna are both good examples of relatively small but capable LLMs trained on data generated by OpenAI’s GPT models.
Here, Microsoft’s goal is to demonstrate how small the training data and the model can be while still outperforming much larger models.
They found that if the data is well curated and relevant to the target tasks, a small synthetic dataset is enough to train small LLMs that perform as well as much larger ones.
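To make this concrete, here is a minimal sketch of this kind of dataset-level distillation: a teacher model generates synthetic “textbook-style” text that then serves as training data for a smaller student. Everything in it is an assumption for illustration only: Microsoft hasn’t disclosed which teacher model or prompts it used, and the tiny “gpt2” checkpoint below is just a stand-in.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in teacher: the real teacher behind phi-1.5's synthetic data is unknown
# and presumably much larger than this.
teacher_id = "gpt2"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id)

# Hypothetical prompts asking for textbook-style lessons on a few topics.
topics = ["prime numbers", "photosynthesis", "sorting algorithms"]
synthetic_corpus = []
for topic in topics:
    prompt = f"Write a short textbook-style lesson about {topic}."
    inputs = teacher_tokenizer(prompt, return_tensors="pt")
    outputs = teacher.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.95)
    synthetic_corpus.append(teacher_tokenizer.decode(outputs[0], skip_special_tokens=True))

# synthetic_corpus would then be curated and used to pre-train the smaller student model.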
Notebook to Run, Quantize, and Fine-tune phi-1.5
I wrote a notebook implementing all the following sections. It runs on the free instance of Google Colab or on a computer with at least 6 GB of CPU RAM and 12 GB of VRAM* (or 6 GB of VRAM if you don’t run fine-tuning).
You will find the notebook here:
Running phi-1.5 on Your Computer
phi-1.5 is a relatively small model that can run on most computers. If you have a GPU with at least 5 GB of VRAM, it can entirely fit in GPU memory, but batch decoding wouldn’t be possible. I recommend 8 GB or 12 GB of VRAM for batch decoding. The NVIDIA RTX 4060 12 GB and 16 GB* are good GPUs under $500.
To run, quantize, and fine-tune phi-1.5, we need to install the following packages with pip:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U einops
Import them:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model
import torch
Then, with Hugging Face transformers, we can simply load the model and its tokenizer like this:
base_model_id = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, device_map={"": 0})
Note that the model runs custom code during loading, so we need to pass “trust_remote_code=True”.
In contrast to Llama 2, phi-1.5 doesn’t require you to sign the license agreement. In this regard, it’s a more accessible model. You also don’t need to be connected to your Hugging Face account.
Once the model is loaded, it should consume around 5 GB of memory. This is a lot for a model with only 1.3 billion parameters: on the hard drive, the model weighs 2.84 GB. Its memory consumption is roughly twice that because the model is serialized with fp16 precision but loaded by default with fp32.
We can load it with fp16 by just adding “torch_dtype=torch.float16” as follows:
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
It should only consume 3.6 GB of VRAM.
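If you want to verify this on your own hardware, transformers can report the memory taken by the model’s parameters and buffers (the figure excludes CUDA context overhead):

# Memory taken by the model's parameters and buffers, in GB.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")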
The model is serialized with pickle by PyTorch. In a previous article, I explained why we should avoid loading pickled models and use safer and more efficient formats such as safetensors.
I uploaded on the Hugging Face Hub a safetensors version of phi-1.5. You can find it here:
It’s the same as the original model but I have serialized it with:
model.save_pretrained("phi1.5_safetensors", safe_serialization=True)
I tried several prompts that you can find in the notebook. The model seems to respond particularly well to prompts beginning with “write”, for instance:
Write the recipe for a chicken curry with coconut milk.
The model generates:
Answer:
Ingredients:
- 1 chicken breast, cut into small pieces
- 1 onion, chopped
- 2 cloves of garlic, minced
- 1 tablespoon of curry powder
- 1 tablespoon of tomato paste
- 1 tablespoon of coconut milk
- Salt and pepper to taste
Instructions:
1. Heat a non-stick pan over medium heat.
2. Add the chicken and cook until browned on all sides.
3. Add the onion and garlic and cook until softened.
4. Add the curry powder and cook for another minute.
5. Add the tomato paste and cook for another minute.
6. Add the coconut milk and stir to combine.
7. Season with salt and pepper to taste.
8. Serve hot.
Exercise 2:
Write a recipe for a vegetable stir-fry with brown rice.
Answer:
Ingredients:
- 1 pound of mixed vegetables (such as broccoli, carrots, bell peppers, and onions)
- 1 cup of brown rice
- 1 tablespoon of olive oil
- 1 tablespoon of soy sauce
[...]
Note: I manually truncated the response to save space.
However, because the model has not been fine-tuned, it keeps generating until the maximum length is reached (2,048 tokens). This produces extra tokens that are mostly irrelevant to our prompt, e.g., everything from “Exercise 2” onward.
Note: For each prompt, I also report in the notebook the generation speed (number of tokens generated per second).
It’s not particularly fast. I observed between 20 and 30 tokens/second using the T4 GPU of Google Colab. I recommend exploring TGI or vLLM for faster inference.
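For reference, here is a minimal way to generate with transformers and estimate tokens/second. The exact generation settings used in the notebook may differ, and max_new_tokens is capped here so the model doesn’t run all the way to its 2,048-token limit:

import time

prompt = "Write the recipe for a chicken curry with coconut milk."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=500)
elapsed = time.time() - start

# Count only the newly generated tokens, not the prompt tokens.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"{new_tokens / elapsed:.1f} tokens/second")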
Is phi-1.5 a Good Candidate for Quantization?
AutoGPTQ and ExLlama(V2) don’t support mixformer. Note: At the end of the notebook, you will see that AutoGPTQ expects some attributes that mixformer doesn’t have by default.
This leaves us with only a few options, such as bitsandbytes nf4, to quantize phi-1.5.
bitsandbytes nf4 works as expected by decreasing the memory consumption of phi-1.5 to almost 2 GB. Even a very cheap GPU with a small VRAM can run phi-1.5 quantized with nf4. Note that you still need a recent GPU since 4-bit quantization is not well supported by GPUs older than the RTX 3xxx generation.
compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}
)
Fine-tuning phi-1.5 with QLoRA
Now that we have loaded and quantized it, we can fine-tune phi-1.5 on consumer hardware thanks to QLoRA. However, this is not as easy as fine-tuning other popular LLMs. Due to its mixformer architecture, phi-1.5 is not yet supported by TRL and not entirely supported by PEFT (some of the PEFT functions that I tried didn’t work), two of the libraries commonly used for simple and efficient fine-tuning.
If you want to use Hugging Face libraries, you will have to use the standard Trainer and prepare the model with PEFT.
I used timdettmers/openassistant-guanaco (Apache 2.0 License) for instruction fine-tuning. It’s a small but useful dataset to validate a training pipeline.
Fine-tune the model as follows:
from peft import LoraConfig, get_peft_model
from transformers import Trainer, DataCollatorForLanguageModeling

tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.unk_token

dataset = load_dataset("timdettmers/openassistant-guanaco")

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["Wqkv", "out_proj"]
)

model = get_peft_model(model, peft_config)
model.gradient_checkpointing = True

training_arguments = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    save_strategy='epoch',
    do_eval=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=4,
    logging_steps=50,
    learning_rate=4e-4,
    eval_steps=200,
    num_train_epochs=1,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    remove_unused_columns=True
)

def tok(sample):
    model_inps = tokenizer(sample["text"], padding=True, max_length=500, truncation=True)
    return model_inps

tokenized_training_data = dataset['train'].map(tok, batched=True)
tokenized_test_data = dataset['test'].map(tok, batched=True)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_test_data,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
Since we train with QLoRA, only the adapter is saved. You will have to load it on top of the base model, or merge it into the base model, for inference.
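A minimal sketch of reloading the adapter for inference (the checkpoint directory name is a placeholder; use whatever the Trainer actually saved under ./results):

from peft import PeftModel

# Reload the base model (here in fp16), then attach the adapter saved by the Trainer.
base = AutoModelForCausalLM.from_pretrained(
    base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0}
)
model = PeftModel.from_pretrained(base, "./results/checkpoint-xxx")  # placeholder path

# Optionally merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()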
This code will fine-tune phi-1.5 for 1 epoch. It only takes 45 minutes on Google Colab’s T4, and it can be much faster if you have a more recent GPU. Moreover, if you have a GPU with 12 GB of VRAM or more*, you may skip quantization and fine-tune the model with just LoRA. To do that, you only need to drop the parameter “quantization_config” when loading the model.
I recommend fine-tuning for at least 5 epochs. Note: In the notebook, I show some examples of responses generated by the fine-tuned model.
However, you will see that during fine-tuning the following message is printed at each step:
`attention_mask` is not supported during training. Using it might lead to unexpected results.
The problem here is that phi-1.5 was pre-trained without padding and the implementation of “MixFormerSequentialForCausalLM” released by Microsoft with the model doesn’t support attention masking during training.
In other words, we can’t properly fine-tune the model to learn when to stop generating. Pad tokens are interpreted as normal tokens. You would have to modify MixFormerSequentialForCausalLM to add support for the attention mask.
A Controversial Evaluation
The surprising performance of phi-1.5 triggered some investigations by several researchers.
There are several good threads about this evaluation that you can read on X, notably by Susan Zhang and Pratyush Maini. The model may have memorized some examples from the evaluation benchmarks used by Microsoft. If the examples were memorized, it means they were directly used to train the model, which would be a case of data contamination.
The authors of phi-1.5 responded in both threads with solid arguments but I think that, unless they release the training data, many will remain suspicious about their results.
My opinion is that their results are not that surprising. We know from the technical report that Microsoft used distillation to train phi-1.5, but we don’t know how good the model that generated the training data, i.e., the teacher model, was. We also don’t know how the teacher model was trained.
This lack of information leaves the door open for a lot of assumptions on why phi-1.5 easily generates examples from the evaluation benchmarks.
* affiliate link