10 Comments

Have you ever tried quantizing a BERT or RoBERTa classifier model to GGUF? I'm curious how its performance on CPU compares to that of an unquantized classifier on GPU.

author

No, I have never tried. I guess these classifiers would be very fast on CPU, but I would be worried about their accuracy after quantization since BERT and RoBERTa are already small models.

Good point. Do you know of a way to make a RoBERTa classifier run quickly on CPU without quantization?

author

Convert it to GGUF in FP16 (so, no quantization, i.e., just run llama.cpp's convert_hf_to_gguf.py script) and run it with llama.cpp. If you have a recent CPU, it should be quite fast.
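
For anyone following along, here is a minimal sketch of that conversion step, assuming a local llama.cpp checkout at ./llama.cpp and a Hugging Face model directory at ./roberta-classifier (both paths are hypothetical). As the next comment shows, the converter rejects the RobertaForSequenceClassification architecture, so this only works for model types llama.cpp supports.

```python
# Sketch: convert a Hugging Face checkpoint to GGUF at FP16 (no quantization)
# using llama.cpp's convert_hf_to_gguf.py; the resulting file can then be
# loaded with llama.cpp tools such as llama-cli or llama-server.
import subprocess

MODEL_DIR = "./roberta-classifier"  # hypothetical local HF model directory
OUT_FILE = "roberta-f16.gguf"

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        MODEL_DIR,
        "--outtype", "f16",    # keep weights in FP16, i.e. no quantization
        "--outfile", OUT_FILE,
    ],
    check=True,  # raise if the converter exits with an error
)
```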

ARGH! The GGUF converter doesn't support BERT or RoBERTa classifiers:

NotImplementedError: Architecture 'RobertaForSequenceClassification' not supported!

Okay, thanks. I'll give that a shot... and then report back on how GGUF on a Threadripper 3970X compares to non-GGUF on an A6000 GPU.

Using TEI (Text Embeddings Inference) over gRPC, I classified 500 strings on both CPU and GPU:

CPU: 10.313s (Threadripper 3970X)
GPU: 0.233s (RTX A6000)

I think I'll stick to the GPU. :)
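
For context, here is a rough sketch of that kind of timing run. The comment above used TEI's gRPC API; this version calls TEI's HTTP /predict endpoint instead for simplicity, and the endpoint URL and test strings are assumptions.

```python
# Sketch: time how long a TEI instance takes to classify 500 strings.
# Assumes a sequence-classification model is being served locally by
# text-embeddings-inference; the URL and inputs are placeholders.
import time
import requests

TEI_URL = "http://localhost:8080/predict"  # hypothetical local TEI endpoint
texts = [f"example input number {i}" for i in range(500)]

start = time.perf_counter()
for text in texts:
    resp = requests.post(TEI_URL, json={"inputs": text}, timeout=30)
    resp.raise_for_status()  # each response carries the classifier's label scores
elapsed = time.perf_counter() - start

print(f"Classified {len(texts)} strings in {elapsed:.3f}s")
```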

author

Yes, I think the CPU is only a good option when the GPU doesn't have enough memory to load the full model.
