7 Comments
Giampiero Recco:

Hi Benjamin, thanks for the valuable index and content! Is there a note/notebook describing how to serve each model in the index? More specifically, for example, I'm having trouble serving google/gemma-3-27b-it-qat-q4_0-gguf on vLLM, and it would be convenient to know what configuration/version you used, or whether you perform any "preprocessing". Thank you

Benjamin Marie:

Do you have issues with GGUF models + vLLM specifically, or do other GGUF models work with your config? I didn't perform any specific preprocessing; all the models are evaluated with the same vLLM code.

Note that I don't "serve" the models. I run them directly, locally, with Python code. I load the model like this:

llm = LLM(model=model_id, tokenizer=tokenizer, load_format="gguf")
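For context, a minimal offline-inference sketch along those lines might look like the following. This is not the exact script used for the index; the model/tokenizer IDs and the prompt are illustrative placeholders, and running it requires a GPU and the model download.

```python
from vllm import LLM, SamplingParams

# Illustrative IDs; swap in the GGUF repo and matching tokenizer you actually use.
model_id = "google/gemma-3-27b-it-qat-q4_0-gguf"
tokenizer = "google/gemma-3-27b-it"

# Load the GGUF checkpoint directly (no server process involved).
llm = LLM(model=model_id, tokenizer=tokenizer, load_format="gguf")

# Run a single prompt offline and print the generated text.
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```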

Giampiero Recco:

I tried other GGUF models but never successfully, so I always reverted to different versions. I tried multiple versions of vLLM, including building from source, but had no success (on a dual RTX 4090 server). I also tried different tokenizers and options, but all returned (different) errors.

Some of the options I tried, in various combinations:

import os
from vllm import LLM

# Loading from a local folder (path truncated)
gguf_path = os.path.expanduser("~/.cache/huggingface/hub/.../gemma-3-27b-it-q4_0.gguf")

# Load with vLLM
llm = LLM(
    model=gguf_path,
    tokenizer="google/gemma-3-27b-it",
    # tokenizer="unsloth/gemma-3-27b-it",
    # tokenizer="unsloth/gemma-3-27b-it-qat-unsloth-bnb-4bit",
    tensor_parallel_size=2,
    load_format="gguf",
    dtype="bfloat16",
    trust_remote_code=True,
    # hf_config_path="google/gemma-3-27b-it",
    # hf_config_path="unsloth/gemma-3-27b-it",
    # hf_config_path="unsloth/gemma-3-27b-it-qat-unsloth-bnb-4bit",
    # hf_overrides={"architectures": ["Gemma3ForCausalLM"]},
    # hf_overrides={"architectures": ["Gemma3ForConditionalGeneration"]},
)

Thank you anyhow,

g.

Benjamin Marie:

It might be because of your multi-GPU setting; this is the only difference I see compared to my config. You can also try disabling the V1 engine.
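If it helps, falling back from the V1 engine is typically done via an environment variable, assuming a vLLM version where the legacy engine is still available (the script name below is a placeholder):

```shell
# Fall back to the legacy vLLM engine before launching the script
export VLLM_USE_V1=0
python run_gemma_gguf.py  # your inference script
```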

Giampiero Recco:

Interesting, thank you very much for confirming the config is not too far off. This helps. I'll keep investigating.

Max:

Awesome, this is perfect! More than a year ago we started with your guidance for local hardware configs, and your work has become foundational to our knowledge base for deploying LLMs. I just browsed your index and am in total agreement, because the models we selected as best performing for our app-development requirements are all on your index!! I definitely agree that google/gemma-3-27b-it-qat-q4_0-gguf is a good one. We also chose the Unsloth quantized versions as credible for fine-tuning, so seeing them on your list affirms our decision! Credibility behind the quantizing is extremely important, including protecting IP locally by not introducing security concerns, i.e. making unexpected outbound or telemetry connections. A "Quantization Fidelity" metric would be valuable.

Benjamin Marie:

Thank you! I was actually surprised by how good the QAT version of Gemma 3 is. I was very suspicious at first because Google didn't publish any official benchmark results when they released it. But I confirmed that this is a very good model and that QAT techniques are still useful.
