42 Comments

When I run the Colab notebook, the printed output is:

Prompt --- 13.577 tokens/seconds ---

GPU memory occupied: 7952 MB.

None

Write the recipe for a chicken curry with coconut milk.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Average --- 13.577 tokens/seconds ---

Instead of the actual recipe as shown in the article.

Advice?


Thanks for your feedback. It seems that Phi-2 has been heavily updated recently. I recommend checking that you have set revision="refs/pr/23" when loading the model for inference.
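For example, a loading call with the pinned revision could look like this (a minimal sketch; the dtype and device settings below just mirror the notebook and are not required for the fix itself):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin the revision so later updates to the model repository
# don't change the modeling code or weights under your feet
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    revision="refs/pr/23",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map={"": 0},
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", revision="refs/pr/23")
```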


Thanks for the quick response. That cleared up other errors as well.


Just to say that I had the same issue as Steve, but your revision assignment fixed it, so seconded thanks.


I guess the reliance on GPT is what made it hard to open-source this model further for commercial usage...


It looks like there are some issues with the Phi-2 model autocasting at this point.

I get "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half" on the forward pass using your notebook. It might be related to https://huggingface.co/microsoft/phi-2/discussions/109 but I can't quite figure it out. Would love your thoughts if you have an idea about what's going on. I'm fooling around with 12 GB VRAM so I can't work in float32. Would love any advice you have on ~2B models worth using for local only projects.


Did you try to wrap "generate" with autocast? Something like:

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model.generate(**model_inputs, ...)

As for 2B models, Phi-1.5 or Phi-2 are the best. Qwen1.5 1.8B might also be good.


Yeah - I tried it myself, and then with the notebook you provide. Both your notebook and my attempt had the same error I'm afraid. It might be a recent-ish disabling of autocast per the discussion on huggingface?

A little more detail:

with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)

Works but loads the full model (GPU usage goes to 11.5 GB)

with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True)

Fails with the error that I get with your notebook also (RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half)

However, the first part of your notebook

"model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})

print(print_gpu_utilization())"

works fine and loads the model in ~6 GB of VRAM.


And if you do autocast only for "model.generate" rather than when loading the model?

Another solution would be to use an older commit of the model, i.e., a version from before they modified the .py files.


Ah well - it looks like float16 just doesn't work (you can get it to work by modifying the model code to allow autocasting, but the output is garbage). bfloat16 will work, but for reasons that I cannot figure out, the loaded model is only ~200 MB smaller than the float32 version.

I may have hit a dead end with Phi-2 for now. Thank you for helping me try and figure this out. You might want to update your notebook for others if my issues aren't local to my machine (tl;dr: bfloat16 might work, float16 is unlikely to).


I just retried running my notebook. I obtained the same error that you mentioned, but after removing this line:

with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):

the inference completes and the VRAM consumption remains below 7 GB. I have changed nothing else. I used the V100 on Colab (using fp16) and the output looks good.


I never modified the model code, used float16, and it worked very well. It doesn't work now because Microsoft later modified the model code. I haven't tried since that modification, but I now plan to.

Thank you for the feedback! That's very useful.


Quick related question here: in a vanilla LoRA situation (not QLoRA), if we load the model in bf16, do we need to set bf16=True in TrainingArguments? This is now causing an error like the one described above. I can load the model in bf16 and get the trainer to run if I don't set bf16=True, but I assume that's not a good idea. Additionally, given we are using paged_adamw_8bit as the optimizer, what exactly does setting bf16=True do here? I believe bf16=True instantiates mixed-precision training, which has to do with the activations, whereas paged_adamw_8bit holds the gradients in 8-bit, correct?
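For reference, this is roughly the setup in question; everything here except bf16 and optim is a placeholder:

```python
from transformers import TrainingArguments

# Sketch of the configuration being discussed: model loaded in bf16,
# mixed-precision training enabled, 8-bit paged AdamW as the optimizer.
training_args = TrainingArguments(
    output_dir="./lora-phi2",         # placeholder
    per_device_train_batch_size=4,    # placeholder
    learning_rate=2e-4,               # placeholder
    num_train_epochs=1,               # placeholder
    bf16=True,                        # bf16 mixed-precision training
    optim="paged_adamw_8bit",         # paged AdamW with 8-bit optimizer states
)
```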


Hi Benjamin, please do let us know where you wind up with this. I was able to do a fine-tuning with float32 as I have the resources, but that seemed needlessly wasteful.


Many thanks Benjamin. A couple of quick questions:

1. For tokenization, do we also need to set tokenizer.pad_token_id = tokenizer.eos_token_id?

2. If we were to save this model for later use, would it be advised to save the tokenizer with all the adjustments as well?

Many thanks!


1. It is not necessary, I think, but it might be cleaner to set tokenizer.pad_token_id = tokenizer.eos_token_id.

2. Yes, saving the tokenizer is better.
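For example, a minimal sketch (the save directory is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Reuse the EOS token for padding (this also sets tokenizer.pad_token_id)
tokenizer.pad_token = tokenizer.eos_token

# Save the adjusted tokenizer next to the fine-tuned model for later reuse
tokenizer.save_pretrained("./phi-2-qlora")
```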


Hi, I am new here.

I have 1 million emails that I want to use to train/fine-tune one of the top 7B (or smaller) open-source models from the Hugging Face leaderboard, and then instruction-tune it afterwards if needed to use it as a chatbot. I want to do it on my PC (16 GB GPU RAM (CUDA) and 128 GB CPU RAM). Can you point me to a notebook or tutorial/article that you created that is most relevant? Thanks in advance!


You can follow this tutorial for fine-tuning on 16 GB of VRAM:

https://kaitchup.substack.com/p/mistral-7b-recipes-for-fine-tuning

And then to align it (if you want):

https://kaitchup.substack.com/p/fine-tune-your-own-instruct-version

The main difference with what you want to do is that in my tutorials I always use prepared datasets that are ready to download. In your case, you will have to format the dataset yourself. I recommend formatting it as JSON, where each training example is a separate row with a field named "text". Then, you just have to call Dataset.from_json("mydata.json") to load it and pass the result to the SFTTrainer.
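A rough sketch, assuming the model, tokenizer, and LoRA config are already set up as in the tutorial (file names and training arguments are placeholders):

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# mydata.json: one training example per line, e.g. {"text": "email body ..."}
dataset = Dataset.from_json("mydata.json")

trainer = SFTTrainer(
    model=model,                   # the quantized model loaded as in the tutorial
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",     # name of the column holding the raw text
    max_seq_length=1024,           # placeholder
    peft_config=peft_config,       # the LoRA configuration from the tutorial
    args=TrainingArguments(output_dir="./email-sft", per_device_train_batch_size=4),
)
trainer.train()
```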


Thanks a lot for the fast reply.

But can I train/fine-tune on unstructured emails as well?

Let's say I clean 1 million email bodies (remove the HTML syntax, and remove the email history/thread so each email only appears once in the dataset). I will then have a .json with 1 million lines, one email body per line.

I want the model to know about our company's products, since ChatGPT, for example, doesn't know anything about them, as there isn't much to find online about the products.

1. Find a good base model on the Hugging Face leaderboard

2. Train further on my 1 million emails

3. Fine-tune on instructions (to be able to chat with the model)

4. Set up a vector DB with all our manuals and FAQs

Can it be done this way?

Thanks in advance


Steps 1 and 2 are straightforward. For step 3, you will need to transform your dataset into an instruction dataset; I'm not sure how. For step 4, I don't have any experience with vector DBs.


Ah, so I can do step 2 with your notebook, but only use one column (the email body) and no syntax?

For step 3, can't I fine-tune on an existing instruct dataset made to turn a model into a chat model?

For step 4, I think I can manage to fix it myself. (I can share my code with you when I am done, if you like.)


For step 2, if you use my code, the one column should be named 'text'.

For step 3, if you do that, your model will forget what you did at step 2. I think you can skip step 2 and directly do step 3 by fine-tuning on your dataset transformed into an instruction dataset (basically a dataset of question/answer examples related to your email database; that's not simple, I think).
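For instance, one row of such an instruction dataset could look something like this; the prompt template and content are purely illustrative and any consistent format works:

```python
# A purely illustrative row: a question/answer pair packed into a single "text" field
example = {
    "text": (
        "### Instruction:\nWhat does product X do?\n\n"
        "### Response:\nProduct X is ... (answer drawn from your emails/manuals)"
    )
}
```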


Are you sure that it will forget about step 2?

It doesn't forget stuff trained into the base model... and I just want to continue that base training (unstructured).

It's not that simple to make an instruction dataset out of 1 million emails :p It would take an incredibly long time. I could probably automate it, but that is really not what I wanted.

For step 3, I was thinking of just using a pre-made instruction dataset, so I don't need to make anything.


I've been targeting all linear layers. Should I not?

`target_modules = ['fc1', 'fc2', 'Wqkv', 'out_proj', 'linear']`


It's a good idea to target all the modules. It yields better results, but then you have more parameters to train, i.e., you need more training data or more training epochs to get a good adapter.
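For reference, with PEFT that could look roughly like this, reusing the module names from your list (rank, alpha, and dropout are placeholders):

```python
from peft import LoraConfig

# LoRA applied to all the linear modules listed above (Phi-2 module names)
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['fc1', 'fc2', 'Wqkv', 'out_proj', 'linear'],
)
```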


Okay, thanks. When I get some free time, I'll train Phi-2 on the VMware/open-instruct dataset, merge the adapter, and then finetune with my wimpy 10k dataset.


You usually set `tokenizer.pad_token = tokenizer.unk_token`. Is there a particular reason you chose the eos_token this time?


I haven't looked at how the attention mask is implemented. But if it's like Phi-1.5, then pad tokens are not masked and are interpreted as normal tokens. If we set the EOS token as the pad token, the model will see that all the examples end with many EOS tokens, which could help teach the model when to stop generating.

But this is only an assumption.


FYI, it looks like flash_attn=True, flash_rotary=True, and fused_dense=True are not supported by the version of AutoModelForCausalLM.from_pretrained() that I installed today (transformers 4.36.2). The call works when those are commented out.


Never mind, I see the Medium article says they are only valid for Ampere GPUs.


After the latest update of Phi-2, when I merge and unload the model after fine-tuning in that manner, I get a broken model that answers with nonsense symbols. Any ideas?


Merging an adapter fine-tuned with QLoRA produces a model with unknown performance. Most of the time, the resulting model will perform close to the original model with the loaded adapter, but this is unpredictable.

I explain here why we shouldn't merge adapters fine-tuned with QLoRA:

https://kaitchup.substack.com/p/dont-merge-your-lora-adapter-into
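In practice, that means keeping the adapter separate: load the base model quantized the same way as during training and attach the adapter on top, for example (the adapter path is a placeholder and the quantization settings should match your training setup):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model with the same 4-bit quantization used for QLoRA fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map={"": 0},
)

# Attach the fine-tuned adapter instead of merging it into the weights
model = PeftModel.from_pretrained(base_model, "./my-phi2-adapter")
```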


Okay, I'm training with 300k examples as we speak. When I launched this session, I saw this in the terminal: "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."

I didn't add any new tokens, though. I did specify `tokenizer.pad_token = tokenizer.eos_token`, but I've been doing that with Mistral and Llama for a long time now...without seeing that warning in the terminal.

Many people say that you need to include `lm_head` in your target modules when adding new tokens. Another person, who actually did add new tokens, added `lm_head` and `embed_tokens` to the `modules_to_save` parameter of LoraConfig. https://medium.com/@geronimo7/phinetuning-2-0-28a2be6de110

Not sure what to do in this particular situation.


Yes, I have similar warnings with Phi-2. I simply ignore them, but for a clean fine-tuning we should indeed retrain the token embeddings. This means resizing the embeddings and then making the entire token embedding matrix trainable in the LoRA config (hence the suggestion you read about "modules_to_save").

If we do that:

- we need much more VRAM

- we need much more training data to retrain good embeddings for the tokens

Such a fine-tuning would get very costly.

In the case of Phi-2, the added tokens are technical (I think they are there for compatibility with HF Transformers). We can ignore them.
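For completeness, such a "clean" setup would look roughly like this, assuming the model and tokenizer are already loaded (the module names follow the Medium post linked above and may differ between model versions; LoRA hyperparameters are placeholders):

```python
from peft import LoraConfig

# Only needed if the tokenizer really gained new tokens:
# resize the embedding matrix to the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Train the token embeddings and LM head in full, alongside the LoRA layers
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    task_type="CAUSAL_LM",
    target_modules=['fc1', 'fc2', 'Wqkv', 'out_proj'],
    modules_to_save=["embed_tokens", "lm_head"],
)
```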


Wow, same result after all that. I'm going to give up and stick with Mistral for now. Maybe I'll take a crack at Phi-3 when it comes out. Thank you for your help.


Woo hoo! I fine-tuned the new Phi-3 Mini, and it stops generating correctly!! However, it took several attempts. It was spewing garbage like Phi-2 and Llama-3 do for me, until I changed the already-supplied pad_token to <unk> and set the padding_side to 'right'.
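In code, the change is roughly this (assuming the 4k instruct variant of Phi-3 Mini):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Replace the pre-set pad token with <unk> and pad on the right for fine-tuning
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"
```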


Sounds great! One question: What data type do you use? bfloat16 or float16?


I always use bfloat16—unless I'm forced to use float16 for some reason.
