21 Comments
Apr 19 · Liked by Benjamin Marie

Thanks for the update. What do you think is the right pad/unk token to use for Llama 3?

author

I would do:

tokenizer.pad_token = tokenizer.unk_token

Apr 20 · Liked by Benjamin Marie

However, there is no unk token in the Llama 3 tokenizer.

author

Indeed! For my current experiments with Llama 3, I'm setting the EOS token as the pad token, and it seems to work well.

I don't think there is any cheaper alternative.
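
Concretely, something like this with a Hugging Face tokenizer (checkpoint name assumed for illustration):

```
from transformers import AutoTokenizer

# Assumed checkpoint name, adjust to the model you are loading
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Llama 3 defines neither an UNK nor a PAD token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
```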

Apr 21 · edited Apr 21 · Liked by Benjamin Marie

When LoRA-tuning, I've found that using the eos_token as the pad_token makes an LLM unable to stop generating properly. It'll just keep spewing nonsense once your question has been addressed, until it reaches `max_tokens`. What if we instead used one of the 250 reserved special tokens in the Llama3 tokenizer?

tokenizer.pad_token = '<|reserved_special_token_250|>'

tokenizer.pad_token_id = 128255

This worked for me the first time I tried it on Thursday. I'm trying it again as we speak with a much more complex dataset.

I also tried adding a <|pad|> token to the tokenizer, calling model.resize_token_embeddings(len(tokenizer)), targeting lm_head, and saving the lm_head & embed_tokens modules, but that didn't go so well. I must have done something wrong; it was my first time trying that.
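
For reference, the pad-token part of that attempt was roughly this (a reconstructed sketch, so details may be off):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Register a brand-new pad token...
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# ...which grows the vocabulary, so the embedding matrix and lm_head must be resized to match
model.resize_token_embeddings(len(tokenizer))
```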

Apr 21 · Liked by Benjamin Marie

Argh. Using <|reserved_special_token_250|> didn't work with this second attempt. It too is unable to stop generating properly.

author

Did you fine-tune the model for long enough? Using the EOS token should work fine. In my current experiments, inference seems to stop when appropriate.


Nice weekly summary. Just some comments/Qs on your notebook 14:

1. Do you have a reference for the dequanting code? Or, did you have to develop it from scratch?

2. I notice these lines in the dequantizing function:

```
def dequantize_model(model, to='./dequantized_model', dtype=torch.float16, device="cuda"):
    """
    'model': the peftmodel you loaded with qlora.
```

When you say 'model', do you in fact mean the base model OR the peftmodel? It seems to me the dequanting function expects a base model (but maybe it works with a peft model too?)

3. The dequantization and merging cell doesn't specify an adapter as an input (so I assume the adapter has been specified earlier in the code). I wonder if it would be better to explicitly set (or reset) the adapter in that cell, to make things more clear?

author

Thanks!

1. The reference for the code is in the article. Following your comment, I also added it in the notebook.

2. This is the base model. The comment is misleading here.

3. I added the initialisation of the expected variables in the cell: base model, adapter, and compute dtype.
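
For anyone reading along without the notebook open, the dequantization helper is conceptually along these lines (a rough sketch assuming bitsandbytes NF4 Linear4bit layers; the actual code follows the reference given in the article and may differ in details):

```
import torch
from torch import nn
import bitsandbytes as bnb
from bitsandbytes.functional import dequantize_4bit


def dequantize_model(model, dtype=torch.float16, device="cuda"):
    """Replace every bitsandbytes Linear4bit layer with a plain nn.Linear holding dequantized weights."""
    model = model.to(device)
    for name, module in list(model.named_modules()):
        if isinstance(module, bnb.nn.Linear4bit):
            # Unpack the 4-bit weights back into a regular floating-point matrix
            weight = dequantize_4bit(module.weight.data, quant_state=module.weight.quant_state).to(dtype)
            new_layer = nn.Linear(module.in_features, module.out_features,
                                  bias=module.bias is not None, dtype=dtype, device=device)
            new_layer.weight = nn.Parameter(weight)
            if module.bias is not None:
                new_layer.bias = nn.Parameter(module.bias.data.to(dtype))
            # Swap the quantized layer for the dequantized one in its parent module
            parent_name, _, child_name = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            setattr(parent, child_name, new_layer)
    return model
```

Once the base model is back in float16, the adapter can be merged into it with peft's merge_and_unload().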


I'm getting nowhere fast with LoRA-tuning Llama3-8B. I'm going to give it a rest for now—until I see a notebook of yours. In the name of science, I might also try again with a product like Axolotl, Unsloth, or AutoTrain. Hmm, I had these same problems trying to train Phi. Maybe there's something fundamentally wrong with me and my roll-your-own "LoraTuner" class? Fine-tuning Mistral sure does work flawlessly for me, though—for all five of my PEFT adapters.


I set padding_side='right' because SFTTrainer complains if you don't. Not sure why I set add_bos_token=True. Why would you want an eos token without a bos token, though? Heh, I guess if you're conflating eos and pad tokens, then it could very well be a moot point?


Oh, duh. If you're padding on the left side with eos tokens, then the actual eos_token on the right wouldn't be conflated with padding tokens....unless you're working with a right-to-left language like Arabic.

author

In theory, yes. But in practice, no. Most frameworks completely ignore the pad token, whatever it is. You can try it with HF transformers: if you pad left with EOS tokens, in theory the model shouldn't generate anything, but in practice it works fine.
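
A quick way to see this with transformers (checkpoint name assumed):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The shorter prompt gets left-padded with EOS tokens
inputs = tokenizer(["Paris is", "The capital of France is"],
                   return_tensors="pt", padding=True).to(model.device)

# The attention mask marks the padding positions, so generation ignores them
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```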

Apr 21 · Liked by Benjamin Marie

/home/matt/miniconda3/envs/nlp/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.

author

I completely ignore this warning. I don't see any problem with padding left when using float32 or bfloat16 data types for training.


I did four epochs with 13,000 examples, which is more than enough when fine-tuning Mistral.

Apr 21 · Liked by Benjamin Marie

```
learning_rate = 2e-4
lr_scheduler_type = 'linear'
target_modules = 'all-linear'
```
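
For context, that corresponds to roughly this kind of peft/transformers setup (the rank, alpha, and other values not listed above are assumptions):

```
from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=16,                         # assumed rank, not stated above
    lora_alpha=16,                # assumed
    lora_dropout=0.05,            # assumed
    target_modules="all-linear",  # apply LoRA to every linear layer
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./llama3-lora",     # placeholder path
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    num_train_epochs=4,             # four epochs, as mentioned above
    per_device_train_batch_size=4,  # assumed
)
```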

author

How long are your examples and what is your max_seq_len? If it's too short, the EOS token will be truncated.

Also, I assume that you set add_eos_token=True when instantiating the tokenizer.


I set max_seq_len = 1_024 because my prompts and their responses are both rather short. Works great with Mistral.

```
self.tokenizer = AutoTokenizer.from_pretrained(
    self.model_id,
    trust_remote_code = True,
    add_bos_token = True,
    add_eos_token = True,
    padding_side = 'right'
)
```

author

I only see two significant differences from my current config: I set the padding side to left (to use FlashAttention) and I don't add the BOS token, roughly as in the sketch below. Is there a particular reason for add_bos_token = True?
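
A minimal sketch of that setup (checkpoint name assumed):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed checkpoint
    add_eos_token=True,            # append EOS so the model learns when to stop
    padding_side="left",           # left padding, compatible with FlashAttention
)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding, as discussed above
```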
