Discussion about this post

Sachin:

I am looking to run a fine-tuned small language model on an edge device. The device is resource-constrained, so I am obviously looking to quantize.

To be efficient, I would prefer to keep the quantized base model on the hardware and, when I need to push updates or adjust the fine-tune, push only the LoRA adapters and let the ‘merge’ or ‘apply’ step take place on the edge.

This saves me from pushing an entire base model plus LoRA adapter and reduces each update to just the LoRA adapter.

I know this is possible, but I want to limit the performance degradation, since naively applying an adapter to a quantized model has repercussions, as you’ve noted.
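
Concretely, what I have in mind on the device side is roughly this, a sketch assuming a bitsandbytes 4-bit base and a PEFT-format adapter (the model id and adapter path are placeholders):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Placeholder identifiers, substitute your own.
BASE_ID = "my-org/small-base-model"
ADAPTER_DIR = "/opt/models/lora-adapter-v2"

# Load the 4-bit quantized base that already lives on the device.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the newly pushed adapter without merging it: the LoRA deltas
# run in bf16 alongside the frozen 4-bit weights, avoiding the
# requantization error that merging into quantized weights would incur.
model = PeftModel.from_pretrained(base, ADAPTER_DIR)

Keeping the adapter unmerged costs a little inference overhead but means each update never touches the quantized weights.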

Trelis Research:

Could you expand on this comment: "Note: If you plan to release your model fine-tuned with LoftQ, you will need to release the model along with your adapter. The model itself is also modified."

In what way is the base model changed? Could you link to the source for this?
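
For context, my tentative reading of the LoftQ paper (Li et al., 2023) is that initialization alternates between quantizing the residual and refitting the low-rank factors, roughly (q is the quantizer, SVD_r the rank-r truncated SVD):

    minimize over Q, A, B:  || W - Q - A B^T ||_F
    Q_t       = q(W - A_{t-1} B_{t-1}^T)
    A_t B_t^T = SVD_r(W - Q_t)

If that is right, the shipped quantized weights Q are fit jointly with the adapter and so differ from a plain quantization of W, but I could not find that stated explicitly.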
