9 Comments
Brian Hostetler

I got excited, thinking you had released a quantized version of Qwen3.5-397B that would run on vLLM. But alas, I was mistaken.

Benjamin Marie

It's too large. It would probably cost somewhere between $2k and $4k for a good int4 quantization.

Brian Hostetler

I found some 4-bit versions on HF, including Qwen's INT4. I have 384 GB of VRAM, so I'll give it a try this week. Fingers crossed.

Benjamin Marie

Also double-check whether the linear attention is still 16-bit; if quantized to INT4, it's probably not worth trying.

Brian Hostetler

Looking at the config.json for Qwen3.5-397B-A17B-GPTQ-Int4 (https://huggingface.co/Qwen/Qwen3.5-397B-A17B-GPTQ-Int4/blob/main/config.json), it's not straightforward to me what it's using. I ran the config.json through qwen3-coder-next and it thinks the weights are 4-bit (GPTQ quantization) but that computation is "likely" done in higher precision. Clear as mud. Do you have an article on this that I could read to help me parse and understand this more fully?

Benjamin Marie

The answer is here, in the quantization config:

dynamic": {

"lm_head": {},

"model.language_model.embed_tokens": {},

"-:.*attn.*": {},

"-:.*shared_expert.*": {},

"-:.*mtp.*": {},

"-:.*visual.*": {}

},

These are all the parts left in 16-bit. So, basically, only the routed experts are quantized.
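If you want to check this programmatically, here is a minimal sketch (it assumes the usual GPTQ quantization_config layout, with "bits", "quant_method", and the "dynamic" per-module rules shown above, and it needs network access to the Hub):

import json
from huggingface_hub import hf_hub_download

# Repo from the link above.
repo_id = "Qwen/Qwen3.5-397B-A17B-GPTQ-Int4"
path = hf_hub_download(repo_id=repo_id, filename="config.json")

with open(path) as f:
    config = json.load(f)

# Assumes "dynamic" sits under "quantization_config", as in GPTQModel-style configs.
qcfg = config.get("quantization_config", {})
print("quant_method:", qcfg.get("quant_method"))
print("bits:", qcfg.get("bits"))

# Per the reading above, these patterns mark the modules kept in 16-bit.
for pattern, override in qcfg.get("dynamic", {}).items():
    print(f"  {pattern}: {override}")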

Brian Hostetler

I tried running this model in 384 GB of VRAM (8x A6000) and I'm getting 0.3 tokens/s for prompt processing and 3.7 tokens/s for generation...
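For reference, one quick way to reproduce that kind of measurement against a vLLM OpenAI-compatible endpoint (the URL, model name, and prompt below are placeholders, and this lumps prompt processing and generation into a single number, so it is only a rough check):

import time
from openai import OpenAI

# Placeholder endpoint; point this at your own vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Explain GPTQ quantization in one paragraph."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# Rough figure: total wall-clock time divided by generated tokens.
generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/s")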

Nick Jenkins

Quick follow-up question...

If Qwen3.5-397B quantizes beautifully with UD-IQ2_M, did you try Qwen3.5-122B-A10B or any of the mid-sized Qwens (35B-A3B or 27B)? The question being: does the quality largely remain intact? Because if so, that could be an amazing opportunity. 2-bit quantization could offset bandwidth limitations on a DGX Spark, or memory size constraints on a 5090.
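For a sense of scale on the bandwidth point, a rough back-of-envelope (the bandwidth figures and the active-parameter count are assumptions, and this ignores KV cache, activations, and quantization overhead): a memory-bandwidth-bound decoder can't generate faster than bandwidth divided by the bytes of active weights read per token.

GB = 1e9

# Assumed (approximate) memory bandwidths, GB/s.
hardware_bw = {
    "DGX Spark": 273,    # LPDDR5x, approximate
    "RTX 5090": 1792,    # GDDR7, approximate
}

# Assumed active parameters per token, e.g. ~10B for a 122B-A10B MoE.
active_params = 10e9

for name, bw in hardware_bw.items():
    for label, bytes_per_param in [("16-bit", 2.0), ("4-bit", 0.5), ("2-bit", 0.25)]:
        bytes_per_token = active_params * bytes_per_param
        ceiling = bw * GB / bytes_per_token
        print(f"{name:10s} {label:>6s}: ~{ceiling:6.0f} tokens/s upper bound")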

Benjamin Marie

I didn't evaluate the 122B. The 27B and 35B do well at Q4. Q3 can be bad. Q2 is really bad.