Hi Benjamin, I am having trouble testing the model after training it for 3 epochs on my custom dataset. I followed the same steps as you did.
During generation at test time, the code throws this error:
ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.
How to fix this?
Hi!
Setting GenerationConfig(padding_side="left"... should work.
Thanks for getting back to me. I tried that and it did not work. Apparently there is a bug caused by use_cache=True:
-> https://github.com/huggingface/trl/issues/1217#issuecomment-1889282654
I set use_cache to False for both the base model and inference and tried it. It's working now.
I have a new question now; please pardon my ignorance, I am new to this field.
I want to fine-tune this 7B model for text generation related to fitness suggestions.
My input prompts will have around 200 words and the output will have 170 words at most.
How many training samples will be needed for this task?
I think one thousand samples would work. But more is better of course.
How did you set `use_cache=False` for both the base model and the LoRA weights?
I tried setting `model.config.use_cache = False` before and after loading the LoRA weights:
# load base model weights ...
model.config.use_cache = False
# load lora weights along with base weights
model = PeftModel.from_pretrained(model, "./results/checkpoint-20/")
model.config.use_cache = False
But the same error still persisted.
Are you referring to the warning printed at the beginning of fine-tuning? I think we can't get rid of that warning, even if we disable caching.
Hi Ben, no I'm not referring to the warning at the beginning of fine-tuning.
I'm seeing this ValueError at inference time whenever I set attn_implementation="flash_attention_2", no matter what value I set for model.config.use_cache:
```
File "/home/tianlu_zhang/.conda/envs/finetune/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 992, in forward
raise ValueError(
ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.
```
If I just load the base model without specifying flash_attention_2, the inference code works properly:
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={"": 0},
    torch_dtype=torch.bfloat16)
Are you using the most recent version of Transformers? I also noticed some bugs when using FlashAttention with Mistral 7B in recent weeks, but I don't have these bugs anymore.
In your case, the error message mentioned that you didn't set the padding side to left. Did you? Your error doesn't seem related to the cache.
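For reference, a minimal left-padded batched-generation setup looks like this (the checkpoint name and prompts are placeholders, and it assumes FlashAttention-2 is installed):
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

# Left padding is required for batched generation with FlashAttention-2.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.unk_token  # Mistral has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": 0},
)

prompts = ["### Human: Hello! ### Assistant:", "### Human: What is LoRA? ### Assistant:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```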
Hi Benjamin! I'm a newbie in quantization. Can I ask a very basic and a very general question? When I follow your instructions from this article and load a quantized model with
model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config
)
the loaded model has the expected size of 3.84 GB, but the number of model parameters is surprising to me:
Trainable parameters: 262410240
Total parameters: 3752071168
How come quantization reduced the total number of parameters from 7.2B to 3.7B? Shouldn't the total number of model parameters stay the same after quantization?
Great question!
Quantization does not reduce the number of parameters. It's a quirk in how the parameters are counted: bitsandbytes packs two 4-bit weights into each 8-bit element, so the quantized layers report roughly half of their real parameter count. More information here:
https://github.com/huggingface/transformers/issues/25978
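If you want a count that accounts for this, here is a rough sketch (count_parameters is my own helper, not from the article; it assumes the 4-bit bitsandbytes quantization used above and that `model` is the quantized model you loaded):
```
def count_parameters(model) -> int:
    total = 0
    for param in model.parameters():
        n = param.numel()
        # bitsandbytes stores 4-bit weights packed two per 8-bit element,
        # so Params4bit tensors report half of the real count.
        if param.__class__.__name__ == "Params4bit":
            n *= 2
        total += n
    return total

print(f"Total parameters: {count_parameters(model)}")
```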
Hi Benjamin, I noticed you used the Guanaco format for fine-tuning Mistral. I see different fine-tuning tutorials using different formats, e.g., using [INST] and [/INST]. I wonder if that somehow changes the fine-tuning performance? Or will I be fine as long as I use the same format for training and inference?
And I am assuming that if I want to fine-tune an already fine-tuned model like Zephyr, I will have to follow the format they used to fine-tune the Mistral base model? Thanks.
The format of the prompt is not very important. Choose a format that is clear and easy to use for your application. The only requirement is that it must be the same for fine-tuning and inference.
Zephyr is already fine-tuned. Fine-tuning on new data will possibly undo the previous fine-tuning. So if you fine-tune it again with a new prompt format, it will likely forget Zephyr's format.
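As an illustration of keeping the format consistent (these helper functions and the Guanaco-style template are just an example, not the article's exact code):
```
# Hypothetical helpers: the only rule is that training and inference
# must wrap the text with the exact same template.

def format_training_example(instruction: str, response: str) -> str:
    # What the model sees during fine-tuning (EOS is appended by the tokenizer).
    return f"### Human: {instruction} ### Assistant: {response}"

def format_inference_prompt(instruction: str) -> str:
    # Same template at inference, cut right where the response should begin.
    return f"### Human: {instruction} ### Assistant:"
```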
Thanks for your clarification
Hi, great post! I just have a quick question about the generation part. I checked the notebook and the generated result does not seem to know when to end until it reaches the maximum number of tokens. I also face a similar problem in my fine-tuning project. Do you know how to fix this kind of issue? Many thanks!
Yes, you have to set add_eos_token=True when you call AutoTokenizer.from_pretrained, but only for fine-tuning, not for inference.
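A minimal sketch of that distinction (the checkpoint name is a placeholder):
```
from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint

# Fine-tuning: append EOS so the model learns where a response ends.
train_tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True)

# Inference: do not append EOS to the prompt, or the model may stop immediately.
inference_tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=False)
```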
I think I have fixed it somehow; the problem seems to be that max_seq_length was set too long for my training data. I changed it from 1024 to 512 and the new model's generated results know how to stop. Still, I wonder if having too many padding tokens would affect my training?
I see. I'm not sure why decreasing max_seq_length helps.
The reverse would have been understandable: for instance, if your training examples are all longer than 512 tokens, then the EOS token might be truncated (depending on the implementation, i.e., whether EOS is added before or after truncation), in which case the model would never see EOS during training. In this situation, increasing to 1024 would help.
Padding tokens are basically ignored during training. Their number has no effect.
Yeah, I think I did do that, but the generation still does not stop. Really not sure what's going on :( Here's the code for the tokenizer:
tokenizer = AutoTokenizer.from_pretrained(
    base_model,
    padding_side="right",
    add_eos_token=True,
    add_bos_token=True,
    fast_tokenizer=True,
)
tokenizer.pad_token = tokenizer.unk_token
Hi kz, did you figure out what the reason was in the end? I've run into a similar issue... Thanks.
Change to device_map='auto' when you load the model with from_pretrained
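For reference, a minimal sketch of what that looks like (the checkpoint name is a placeholder):
```
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed checkpoint
    device_map="auto",            # let Accelerate spread layers across available GPUs
    torch_dtype=torch.bfloat16,
)
```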
That's interesting. I also usually set device_map to "auto" in multi-GPU settings. Mistral implements various optimisations for inference; maybe that's why we need to explicitly call Accelerate.
Thank you for pointing out this solution.
Hi Benjamin, it seems the original question was deleted. Could you give a high-level summary of what the issue was, if you still recall? Thanks.