When I run the Colab notebook, the printed output is:
Prompt --- 13.577 tokens/seconds ---
GPU memory occupied: 7952 MB.
None
Write the recipe for a chicken curry with coconut milk.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Average --- 13.577 tokens/seconds ---
Instead of the actual recipe as shown in the article.
Advice?
Thanks for your feedback. It seems that Phi-2 has been heavily updated recently. I recommend checking that you have set revision="refs/pr/23" when loading the model for inference.
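For reference, here is a minimal sketch of what pinning that revision could look like when loading the model for inference (the dtype and device_map values are assumptions, not necessarily what the article uses):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, revision="refs/pr/23", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    revision="refs/pr/23",       # pin the pre-update revision of the model code/weights
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map={"": 0},
)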
Thanks for the quick response. That cleared up other errors as well.
Just to say that I had the same issue as Steve, but your revision assignment fixed it, so seconded thanks.
I guess the reliance on GPT is what made it hard to open-source this model further towards commercial usage...
It looks like there are some issues with the Phi-2 model autocasting at this point.
I get "RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half" on the forward pass using your notebook. It might be related to https://huggingface.co/microsoft/phi-2/discussions/109 but I can't quite figure it out. Would love your thoughts if you have an idea about what's going on. I'm fooling around with 12 GB VRAM so I can't work in float32. Would love any advice you have on ~2B models worth using for local only projects.
Did you try wrapping "generate" with autocast? Something like:
with torch.autocast(device_type='cuda', dtype=torch.float16):
output = model.generate(**model_inputs....
As for 2B models, Phi-1.5 or Phi-2 are the best. Qwen1.5 1.8B might also be good.
Yeah, I tried it myself, and then with the notebook you provide. Both your notebook and my attempt hit the same error, I'm afraid. It might be a recent-ish disabling of autocast, per the discussion on Hugging Face?
A little more detail:
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
works, but loads the full model (GPU usage goes to 11.5 GB).
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True)
fails with the same error I get with your notebook (RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half).
However, the first part of your notebook:
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
print(print_gpu_utilization())
works fine and loads the model in ~6 GB of VRAM.
And if you do autocast only for "model.generate" rather than when loading the model?
Another solution would be to use an older commit of the model, i.e., a version from before they modified the .py files.
Ah well, it looks like float16 just doesn't work (you can get it to work by modifying the model code to allow autocasting, but the output is garbage). bfloat16 will work, but for reasons I cannot figure out, the model, once loaded, is only ~200 MB smaller than the float32 version.
I may have hit a dead end with Phi-2 for now. Thank you for helping me try to figure this out. You might want to update your notebook for others if my issues aren't local to my machine (TL;DR: bfloat16 might work, float16 is unlikely to).
I just retried running my notebook. I got the same error you mentioned, but after removing this line:
with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
inference completes and VRAM consumption stays below 7 GB. I have changed nothing else. I used the V100 on Colab (with fp16) and the output looks good.
I had never modified the model code, I used float16, and it worked very well. It doesn't work now because Microsoft modified the model code since then. I haven't tried since this modification but now plan to.
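To make the working configuration concrete, here is a hedged sketch of loading Phi-2 directly in float16 and calling generate without any autocast wrapper; the prompt and generation settings are assumptions, not values taken from the notebook:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer("Write the recipe for a chicken curry with coconut milk.", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=256)  # no torch.autocast context
print(tokenizer.decode(output[0], skip_special_tokens=True))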
Thank you for the feedback! That's very useful.
Quick related question here: in a vanilla LoRA situation (not QLoRA), if we load the model in bf16, do we need to set bf16=True in TrainingArguments? This is now causing an error like the one described above. I can load the model in bf16 and get the trainer to run if I don't set bf16=True, but I assume that's not a good idea. Additionally, given we are using paged_adamw_8bit as the optimizer, what exactly does setting bf16=True do here? I believe bf16=True enables mixed-precision training, which has to do with the activations, whereas paged_adamw_8bit holds the gradients in 8-bit, correct?
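For clarity, a minimal sketch of the configuration being asked about, with assumed model id, output directory, and hyperparameters: the model loaded in bf16, bf16=True for mixed-precision training, and the paged 8-bit AdamW optimizer.

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",            # assumed model id
    torch_dtype=torch.bfloat16,   # weights loaded in bf16
    trust_remote_code=True,
    device_map={"": 0},
)

training_args = TrainingArguments(
    output_dir="./phi2-lora",     # assumed output directory
    bf16=True,                    # bf16 mixed-precision training
    optim="paged_adamw_8bit",     # paged AdamW with 8-bit optimizer states
    per_device_train_batch_size=4,
    num_train_epochs=1,
)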
Hi Benjamin, please do let us know where you wind up with this. I was able to do a fine-tuning with float32 as I have the resources, but that seemed needlessly wasteful.
Many thanks Benjamin. A couple of quick questions:
1. For tokenization, do we also need to set tokenizer.pad_token_id=tokenizer.eos_token_id ?
2. If we were to save this model for later use, would it be advised to save the tokenizer with all the adjustments as well?
Many thanks!
1. It is not necessary, I think, but it is cleaner to also set tokenizer.pad_token_id = tokenizer.eos_token_id.
2. Yes, saving the tokenizer is better.
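In code, points 1 and 2 could look like this small sketch (the model id and output directory are assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id  # keep the pad token and its id consistent
tokenizer.save_pretrained("./phi2-finetuned")    # save the adjusted tokenizer alongside the model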
Hi, I am new here.
I have 1 million emails that I want to use to train/fine-tune one of the open-source Hugging Face leaderboard top 7B or smaller models, and then instruct-tune it afterwards if needed so it can be used as a chatbot. I want to do it on my PC (16 GB GPU RAM (CUDA) and 128 GB CPU RAM). Can you point me to a notebook or tutorial/article that you created which is most relevant? Thanks in advance!
You can follow this tutorial for fine-tuning on 16 GB of VRAM:
https://kaitchup.substack.com/p/mistral-7b-recipes-for-fine-tuning
And then to align it (if you want):
https://kaitchup.substack.com/p/fine-tune-your-own-instruct-version
The main difference from what you want to do is that in my tutorials I always use prepared datasets that are ready to download. In your case, you will have to format the dataset yourself. I recommend formatting it as JSON, where each training example is a separate row with a field named "text". Then you just have to call Dataset.from_json("mydata.json") to load it and pass the result to the SFTTrainer.
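As a rough sketch of that workflow (the file name "mydata.json" comes from the reply above; the model id, trl version, and hyperparameters are assumptions, not values from the tutorials):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer  # assumes a trl version where SFTTrainer accepts dataset_text_field
import torch

dataset = Dataset.from_json("mydata.json")   # one training example per row, field named "text"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(output_dir="./out", per_device_train_batch_size=4),
)
trainer.train()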
Thanks a lot for the fast reply.
But can I train/fine-tune on unstructured emails also?
Let's say I clean 1 million email bodies (remove the HTML syntax, and remove the email history/thread so each email appears only once in the dataset). I will then have a .json with 1 million lines, one email body per line.
I want the model to know about our company's products, because, for example, ChatGPT doesn't know anything about our products, as there isn't much to find online about them.
1. Find a good base model on the Hugging Face leaderboard
2. Train further on my 1 million emails
3. Fine-tune on instructions (to be able to chat with the model)
4. Set up a vector DB with all our manuals and FAQs.
Can it be done this way?
Thanks in advance
Steps 1 and 2 are straightforward. For step 3, you will need to transform your dataset into an instruction dataset; I'm not sure how. For step 4, I don't have any experience with vector DBs.
Ah, so I can do step 2 with your notebook, but only use one column (email body) and no special syntax?
For step 3, can't I fine-tune on an existing instruction dataset made to turn a base model into a chat model?
For step 4, I think I can manage to figure it out myself. (I can share my code with you when I am done if you like.)
For step 2, if you use my code, the one column should be named 'text'.
For step 3, if you do that, your model will forget what you did in step 2. I think you can skip step 2 and directly do step 3 by fine-tuning on your dataset transformed into an instruction dataset (basically a dataset of question/answer examples related to your email database; that's not simple, I think).
Are you sure that it will forget about step 2?
It doesn't forget what was trained into the base model... And I just want to continue that base training (unstructured).
It's not that simple to make an instruction dataset out of 1 million emails :p it would take an incredibly long time. I could probably automate it, but that is really not what I wanted.
For step 3, I was thinking of just using a pre-made instruction dataset, so I don't need to make anything.
I've been targeting all linear layers. Should I not?
`target_modules = ['fc1', 'fc2', 'Wqkv', 'out_proj', 'linear']`
It's a good idea to target all the modules. It yields better results but then you have more parameters to train, i.e., you need more training data or more training epochs to get a good adapter.
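For illustration, a LoRA configuration targeting all of those linear modules might look like this (the module names match the comment above and the older custom Phi-2 code; the rank, alpha, and dropout values are assumptions):

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['fc1', 'fc2', 'Wqkv', 'out_proj', 'linear'],  # all linear layers
)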
Okay, thanks. When I get some free time, I'll train Phi-2 on the VMware/open-instruct dataset, merge the adapter, and then finetune with my wimpy 10k dataset.
You usually set `tokenizer.pad_token = tokenizer.unk_token`. Is there a particular reason you chose the eos_token this time?
I didn't look at how the attention mask is implemented. But if it's like Phi-1.5, then pad tokens are not masked and are interpreted as normal tokens. If we set the EOS token as the pad token, the model will see that all the examples end with many EOS tokens, which could help teach the model when to stop generating.
But this is only an assumption.
FYI, it looks like flash_attn=True, flash_rotary=True, and fused_dense=True
are not supported by the version of AutoModelForCausalLM.from_pretrained() that I installed today (transformers 4.36.2). The call works when those are commented out.
Never mind, I see the Medium article says they are only valid for Ampere GPUs.
After the latest update of Phi-2, when I merge and unload the model after fine-tuning in that manner, I get a broken model that answers with nonsense symbols. Any ideas?
Merging an adapter fine-tuned with QLoRA produces a model with unpredictable performance. Most of the time, the resulting model will perform close to the original model with the adapter loaded, but this is not guaranteed.
I explain here why we shouldn't merge adapters fine-tuned with QLoRA:
https://kaitchup.substack.com/p/dont-merge-your-lora-adapter-into
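As an alternative to merging, here is a hedged sketch of serving the fine-tuned model by loading the quantized base model and attaching the adapter at inference time (the adapter path and quantization settings are assumptions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,  # same quantization as during QLoRA fine-tuning
    trust_remote_code=True,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(base, "./phi2-qlora-adapter")  # hypothetical adapter directory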
Okay, I'm training with 300k examples as we speak. When I launched this session, I saw this in the terminal: "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."
I didn't add any new tokens, though. I did specify `tokenizer.pad_token = tokenizer.eos_token`, but I've been doing that with Mistral and Llama for a long time now... without seeing that warning in the terminal.
Many people say that you need to include `lm_head` in your target modules when adding new tokens. Another person, who actually did add new tokens, added `lm_head` and `embed_tokens` to the `modules_to_save` parameter of LoraConfig. https://medium.com/@geronimo7/phinetuning-2-0-28a2be6de110
Not sure what to do in this particular situation.
Yes, I get similar warnings with Phi-2. I simply ignore them, but for a clean fine-tuning we should indeed retrain the token embeddings. This means resizing the embeddings and then making the entire token embedding matrix trainable in the LoRA config (hence the suggestion you read about "modules_to_save"; see the sketch below).
If we do that:
- we need much more VRAM
- we need much more training data to retrain good embeddings for the tokens
Such a fine-tuning would get very costly.
In the case of Phi-2, the tokens that are added are technical (I think it's for compatibility with HF transformers). We can ignore them.
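For illustration, the "modules_to_save" approach could look like the following sketch. The module names ("embed_tokens", "lm_head", and the attention/MLP projections) follow the HF-integrated Phi-2 implementation and may differ for the custom model code; the rank and alpha values are assumptions.

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    modules_to_save=["embed_tokens", "lm_head"],  # fully trained, not LoRA-adapted
)

As noted above, this makes both the embeddings and the LM head fully trainable, which is why the VRAM and training-data requirements go up.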
Wow, same result after all that. I'm going to give up and stick with Mistral for now. Maybe I'll take a crack at Phi-3 when it comes out. Thank you for your help.
Woo hoo! I finetuned the new Phi-3 Mini, and it stops generating correctly!! However, it took several attempts. It was spewing garbage like Phi-2 and Llama-3 do for me—until I changed the already-supplied pad_token to <unk> and set the padding_side to 'right'.
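In code, the tokenizer change described above amounts to something like this (the model id is assumed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
tokenizer.pad_token = tokenizer.unk_token          # replace the supplied pad token with <unk>
tokenizer.pad_token_id = tokenizer.unk_token_id
tokenizer.padding_side = 'right'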
Sounds great! One question: What data type do you use? bfloat16 or float16?
I always use bfloat16—unless I'm forced to use float16 for some reason.