10 Comments
Matt's avatar

Have you ever tried quantizing a BERT or RoBERTa classifier model with GGUF? I'm curious how its performance on CPU compares to that of an unquantized classifier on GPU.

Benjamin Marie's avatar

No, I have never tried. I guess these classifiers would be very fast on CPU but I would be worried about the accuracy of the models after quantization since BERT and RoBERTa are already small.

Matt's avatar

Good point. Do you know of a way to make a RoBERTa classifier run quickly on CPU without quantization?

Benjamin Marie's avatar

Convert it to GGUF in FP16 (so, no quantization, i.e., just run llama.cpp's HF-to-GGUF conversion script, convert_hf_to_gguf.py) and run it with llama.cpp. If you have a recent CPU, it should be quite fast.

Matt's avatar

ARGH! The GGUF conversion script doesn't support BERT or RoBERTa.

NotImplementedError: Architecture 'RobertaForSequenceClassification' not supported!

Matt's avatar

Okay, thanks. I'll give that a shot...and then report back how GGUF on Threadripper 3970X compares to non-GGUF on an A6000 GPU.

Matt's avatar

Using TEI with gRPC, I classified 500 strings on CPU and GPU.

CPU: 10.313s (Threadripper 3970X)

GPU: 0.233s (RTX A6000)

I think I'll stick to the GPU. :)
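
To put those two timings side by side, here is a small sketch that turns them into throughput, average time per string, and a speedup factor. It uses only the numbers reported above (500 strings, 10.313 s on CPU, 0.233 s on GPU). The helper names are made up for illustration, and the per-string figures are plain averages: they assume the total time divides evenly across the 500 strings and say nothing about batching or gRPC overhead.

```python
# Timings reported in the comment above: 500 strings classified with TEI over gRPC.
N_STRINGS = 500
TIMINGS_S = {
    "CPU (Threadripper 3970X)": 10.313,
    "GPU (RTX A6000)": 0.233,
}


def throughput(n_items: int, seconds: float) -> float:
    """Average number of items processed per second."""
    return n_items / seconds


def avg_latency_ms(n_items: int, seconds: float) -> float:
    """Average wall-clock time per item, in milliseconds."""
    return 1000.0 * seconds / n_items


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


for name, seconds in TIMINGS_S.items():
    print(
        f"{name}: {throughput(N_STRINGS, seconds):.1f} strings/s, "
        f"{avg_latency_ms(N_STRINGS, seconds):.3f} ms per string"
    )

gpu_speedup = speedup(
    TIMINGS_S["CPU (Threadripper 3970X)"], TIMINGS_S["GPU (RTX A6000)"]
)
print(f"GPU speedup over CPU: {gpu_speedup:.1f}x")
```

On these numbers the GPU comes out roughly 44 times faster, about 2,146 strings per second against about 48 on the CPU.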

Benjamin Marie's avatar

Yes, I think the CPU is only a good option if the GPU can't load the full model.