Discussion about this post

Matt:

Yep, I had already done that, but the problem remains. In your Medium article about Phi-1.5, you mentioned this:

"The problem here is that phi-1.5 was pre-trained without padding and the implementation of MixFormerSequentialForCausalLM released by Microsoft with the model doesn’t support attention masking during training. In other words, we can’t properly fine-tune the model to learn when to stop generating. Pad tokens are interpreted as normal tokens. You would have to modify MixFormerSequentialForCausalLM to add support for the attention mask."

Is the same true with Phi-2?

https://medium.com/@bnjmn_marie/how-to-fine-tune-quantize-and-run-microsoft-phi-1-5-e14a1e22ec12
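For reference, here is a minimal sketch of the usual tokenizer-side workaround: give the model a dedicated pad token (rather than reusing EOS) and keep the attention mask when tokenizing. Whether Phi-2's forward pass actually honours that mask during training is exactly the open question in this thread, so treat this as an assumption, not a confirmed fix; the `[PAD]` token and the example strings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"  # checkpoint name as published on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Give the tokenizer a dedicated pad token instead of reusing EOS,
# then resize the embedding matrix so the new token id has a row.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Tokenize with padding and keep the attention mask so padded positions
# can be ignored during training -- provided the forward pass actually uses it.
batch = tokenizer(
    ["Example prompt one.", "A second, somewhat longer example prompt."],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
```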

Matt:

I just LoRA-tuned Phi-2, but it refuses to stop generating until `max_new_tokens` is reached. Phi-1.5 suffered from the same problem. Do you know how to correct it?
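One common reason a LoRA-tuned model never stops is that the fine-tuning samples never end with the EOS token, so nothing in training teaches it when to emit one. Below is a hedged sketch of appending EOS to each sample and masking padded positions out of the loss; the `tokenize_with_eos` helper, the `"text"` field, and `max_length=512` are illustrative assumptions, not the author's recipe.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # simplest choice; see the padding caveat above

def tokenize_with_eos(example, max_length=512):
    # End every sample with EOS so "stop here" becomes part of the training target.
    text = example["text"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
    # Use input_ids as labels, but exclude padded positions from the loss (-100).
    enc["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(enc["input_ids"], enc["attention_mask"])
    ]
    return enc

# Usage with a datasets.Dataset that has a "text" column:
# tokenized = dataset.map(tokenize_with_eos)
```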
