24 Comments

Hi Benjamin, I'm having trouble testing the model after training it for 3 epochs on my custom dataset. I followed the same steps as you did.

At generation time during testing, the code throws an error:

ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.

How can I fix this?


Hi!

Setting `GenerationConfig(padding_side="left", ...)` should work.
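For reference, the error message itself asks for left padding on the tokenizer before batched generation. A minimal sketch of that setup (the checkpoint name and prompts are placeholders, not from the article):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"            # pad on the left for batched generation
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompts = ["Suggest a quick warm-up routine.", "How much water should I drink per day?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```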


Thanks for getting back to me. I tried that and it did not work. Apparently there is a bug caused by `use_cache=True`:

-> https://github.com/huggingface/trl/issues/1217#issuecomment-1889282654

I set `use_cache` to False both when loading the base model and at inference time, and it's working now.

I have a new question now; please pardon my ignorance, I'm new to this field. I want to fine-tune this 7B model for text generation related to fitness suggestions. My input prompt will have around 200 words and the output will be at most 170 words. How many training samples will be needed for this task?


I think one thousand samples would work, but more is better, of course.


How did you set `use_cache=False` for both the base and LoRA weights?

I tried setting `model.config.use_cache = False` before and after loading the LoRA weights:

```
# load base model weights ...
model.config.use_cache = False

# load LoRA weights on top of the base weights
model = PeftModel.from_pretrained(model, "./results/checkpoint-20/")
model.config.use_cache = False
```

But the same error still persisted.


Are you referring to the warning printed at the beginning of fine-tuning? I think we can't get rid of this warning, even if we disable caching.


Hi Ben, no, I'm not referring to the warning at the beginning of fine-tuning.

I'm seeing this ValueError at inference time whenever I set `attn_implementation="flash_attention_2"`, no matter what value I set for `model.config.use_cache`:

```
File "/home/tianlu_zhang/.conda/envs/finetune/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 992, in forward
    raise ValueError(
ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.
```

If I just load the base model without requesting flash_attention_2, the inference code works properly:

```
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={"": 0},
    torch_dtype=torch.bfloat16)
```


Are you using the most recent version of Transformers? I also noticed some bugs when using FlashAttention with Mistral 7B in recent weeks, but I don't have these bugs anymore.

In your case, the error message mentions that you didn't set the padding side to left. Did you? Your error doesn't seem related to the cache.


Hi Benjamin! I'm a newbie in quantization. Can I ask a very basic and general question? When I follow your instructions from this article and load a quantized model with

```
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)
```

the loaded model has an expected size of 3.84 GB, but the number of model parameters is surprising to me:

Trainable parameters: 262410240

Total parameters: 3752071168

How come quantization reduced the total number of parameters from 7.2B to 3.7B? Shouldn't the total number of model parameters stay the same after quantization?


Great question!

Quantization does not reduce the number of parameters. It's some kind of bug in the parameter counting. More information here:

https://github.com/huggingface/transformers/issues/25978
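To illustrate the counting quirk, here is a minimal sketch (the checkpoint name and quantization settings are assumptions): bitsandbytes packs two 4-bit weights into each int8 element, so summing numel() over the parameters reports roughly half the real count for the quantized linear layers.

```
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

reported = sum(p.numel() for p in model.parameters())
# Count each packed 4-bit parameter twice to estimate the true total.
corrected = sum(p.numel() * 2 if isinstance(p, bnb.nn.Params4bit) else p.numel()
                for p in model.parameters())
print(reported, corrected)  # roughly 3.75B vs. 7.24B
```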


Hi Benjamin, I noticed you used the Guanaco format to fine-tune Mistral. I see different fine-tuning tutorials using different formats, e.g. [INST] and [/INST]. I wonder if that somehow changes the fine-tuning performance? Or will I be fine as long as I use the same format for training and inference?

And I'm assuming that if I want to fine-tune an already fine-tuned model like Zephyr, I will have to follow the format they used to fine-tune the Mistral base model? Thanks.


The format of the prompt is not very important. Choose a format that is clear and easy to use for your application. The only requirement is that it must be the same for fine-tuning and inference (see the sketch below).

Zephyr is already fine-tuned. Fine-tuning on new data will possibly undo the previous fine-tuning, so if you fine-tune it again with a new prompt format, it will likely forget Zephyr's format.
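As an illustration, here are two common templates written as hypothetical helper functions (they are not the article's code); the key point is to apply the same one during training and at inference:

```
# Guanaco-style format (the one used in the article)
def guanaco_prompt(instruction, response=""):
    return f"### Human: {instruction}\n### Assistant: {response}"

# [INST] format used by the Mistral 7B Instruct models
def mistral_instruct_prompt(instruction, response=""):
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

print(guanaco_prompt("Suggest a 20-minute home workout."))
```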


Thanks for your clarification


Hi, great post! I just have a quick question about the generation part. I checked the notebook, and the generated result does not seem to know when to end until it reaches the maximum number of tokens. I also face a similar problem in my fine-tuning project. Do you know how to fix this kind of issue? Many thanks!


Yes, you have to set `add_eos_token=True` when you call `AutoTokenizer.from_pretrained`, but only for fine-tuning, not for inference.
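A minimal sketch of that setup (the checkpoint name is a placeholder and the pad-token choices are common conventions, not something prescribed here):

```
from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint

# Fine-tuning: append EOS so the model learns where answers end.
train_tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True)
train_tokenizer.pad_token = train_tokenizer.unk_token

# Inference: no EOS appended to the prompt; pad on the left for batched generation.
infer_tokenizer = AutoTokenizer.from_pretrained(model_name)
infer_tokenizer.padding_side = "left"
infer_tokenizer.pad_token = infer_tokenizer.eos_token
```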


I think I have fixed it somehow. The problem seems to be that max_seq_length was set too long for my training data. I changed it from 1024 to 512, and the new model's generated results know when to stop. Still, I wonder if having too many padding tokens would affect my training?


I see. I'm not sure why decreasing max_seq_length helps.

The reverse would have been understandable: for instance, if your training examples are all longer than 512 tokens, then the EOS token might be truncated (depending on the implementation, i.e., whether EOS is added before or after truncation), in which case the model would never see EOS during training. In that situation, increasing to 1024 would help. A quick check is sketched below.

Padding tokens are basically ignored during training. Their number has no effect.
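Here is a minimal sketch of that check (the checkpoint name and lengths are assumptions, not from the article):

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_eos_token=True)

long_text = "word " * 600  # deliberately longer than max_seq_length
ids = tok(long_text, truncation=True, max_length=512)["input_ids"]
# If this prints False, the appended EOS was cut off by truncation and the
# model would never see it during training on such examples.
print(ids[-1] == tok.eos_token_id)
```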


Yeah, I think I did do that, but the generation still does not stop. I'm really not sure what's going on :( Here's the code for the tokenizer:

```
tokenizer = AutoTokenizer.from_pretrained(
    base_model,
    padding_side="right",
    add_eos_token=True,
    add_bos_token=True,
    fast_tokenizer=True,
)
tokenizer.pad_token = tokenizer.unk_token
```


Hi kz, did you figure out what the reason was in the end? I've run into a similar issue. Thanks.

Comment deleted (Nov 17, 2023)

Change to `device_map='auto'` when you load the model with `from_pretrained`.
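A minimal sketch of that change (the checkpoint name is a placeholder):

```
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate spread the layers across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    device_map="auto",
)
```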

Comment deleted (Nov 18, 2023)

Comment deleted (Nov 18, 2023)

Comment deleted (Nov 18, 2023, edited)

That's interesting. I also usually set the device map to auto in multi-GPU settings. Mistral implements various optimisations for inference; maybe that's why we need to explicitly call accelerate.

Thank you for pointing out this solution.


Hi Benjamin, it seems the original question was deleted. Could you give a high-level summary of what the issue was, if you still recall? Thanks.
