I got excited, thinking you had released a quantized version of Qwen3.5-397B that would run on vLLM. But alas, I was mistaken.
It's too large. It would probably cost somewhere between $2k and $4k for a good int4 quantization.
I found some 4-bit quants on HF, including Qwen's int4. I have 384 GB of VRAM, so I'll give it a try this week. Fingers crossed.
Also double-check whether the linear attention is still 16-bit; if quantized to INT4, it's probably not worth trying.
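One way to check that without loading the model is to read the config straight off the Hub and see which module patterns are excluded from quantization. A minimal sketch, assuming the "dynamic" block sits under "quantization_config" as in standard GPTQModel-style configs and that the "-:" keys are exclusion regexes; the sample module name at the end is hypothetical:

import json
import re

from huggingface_hub import hf_hub_download

# Fetch only config.json, not the weights.
path = hf_hub_download("Qwen/Qwen3.5-397B-A17B-GPTQ-Int4", "config.json")
with open(path) as f:
    cfg = json.load(f)

dynamic = cfg["quantization_config"]["dynamic"]  # assumed location of the block

# "-:<regex>" keys exclude matching modules from quantization; bare keys
# like "lm_head" are per-module entries likewise left unquantized here.
excluded = [k[2:] for k in dynamic if k.startswith("-:")]
print("patterns excluded from int4:", excluded)

# Hypothetical module name: does it hit an exclusion pattern (i.e., stay 16-bit)?
name = "model.language_model.layers.0.linear_attn.in_proj"
print("stays 16-bit:", any(re.search(p, name) for p in excluded))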
Looking at the config.json for Qwen3.5-397B-A17B-GPTQ-Int4
https://huggingface.co/Qwen/Qwen3.5-397B-A17B-GPTQ-Int4/blob/main/config.json, it's not straightforward to me what it's using. I ran the config.json through qwen3-coder-next, and it thinks the weights are 4-bit (GPTQ quantization) but the computation is "likely" done in higher precision. Clear as mud. Do you have an article on this that I could read to help me parse and understand this more fully?
The answer is here:

"dynamic": {
    "lm_head": {},
    "model.language_model.embed_tokens": {},
    "-:.*attn.*": {},
    "-:.*shared_expert.*": {},
    "-:.*mtp.*": {},
    "-:.*visual.*": {}
},
These are all the parts left in 16-bit. So, basically, only the routed experts are quantized.
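That also explains the "weights 4-bit, computation in higher precision" reading above: GPTQ here is weight-only quantization, so at matmul time the int4 weights are dequantized and the arithmetic runs in 16-bit or higher. A toy sketch of that idea, not Qwen's or GPTQ's actual kernel (real GPTQ packs eight 4-bit values per int32 and uses fused CUDA kernels; float32 is used here only so the toy runs anywhere):

import torch

GROUP_SIZE = 128

def dequantize_int4(qweight, scales, zeros):
    # qweight holds integer levels 0..15, one per int8 entry for clarity.
    in_features, _ = qweight.shape
    w = qweight.to(torch.float32)
    for g in range(in_features // GROUP_SIZE):
        rows = slice(g * GROUP_SIZE, (g + 1) * GROUP_SIZE)
        w[rows] = (w[rows] - zeros[g]) * scales[g]  # per-group affine dequant
    return w

def linear_w4a16(x, qweight, scales, zeros):
    # Dequantize, then run an ordinary full-precision matmul:
    # only the *storage* is 4-bit; the compute is not.
    return x @ dequantize_int4(qweight, scales, zeros)

# Toy usage: a 256 -> 64 layer with two quantization groups.
in_f, out_f = 256, 64
qweight = torch.randint(0, 16, (in_f, out_f), dtype=torch.int8)
scales = torch.rand(in_f // GROUP_SIZE, out_f) * 0.02
zeros = torch.full((in_f // GROUP_SIZE, out_f), 8.0)
x = torch.randn(1, in_f)
print(linear_w4a16(x, qweight, scales, zeros).shape)  # torch.Size([1, 64])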
I tried running this model in 384 GB of VRAM (8x A6000), and I'm getting 0.3 tokens/s for prompt processing and 3.7 tokens/s for generation...
Quick follow-up question...
If Qwen3.5-397B quantizes beautifully with UD-IQ2_M, did you try Qwen3.5-122B-A10B or any of the mid-sized Qwens (35B-A3B or 27B)? The question being: does the quality largely remain intact? Because if so, that could be an amazing opportunity. 2-bit quantization could offset bandwidth limitations on a DGX Spark, or memory size constraints on a 5090.
I didn't evaluate the 122B. The 27B and 35B do well at Q4. Q3 can be bad. Q2 is really bad.