More Qwen3.5 GGUF Evals and Speculative Speculative Decoding (SSD)
The Weekly Kaitchup #133
Hi everyone,
In this edition of The Weekly Kaitchup:
Qwen3.5: More GGUF evaluations and ongoing experiments
Speculative Speculative Decoding (SSD): run ahead, verify in parallel
Qwen3.5: More GGUF Evaluations and Ongoing Experiments
This week, just before what looks like some self-destruction, the Qwen team released a final batch of small Qwen3.5 models. I discussed them here:
I’ve also been regularly publishing evaluations of the GGUF versions of Qwen3.5 on my X account. Here are the results for Qwen3.5 9B:
The UD-Q3_K_L performs remarkably well, though still below the Q4 variants while saving only about 1 GB.
I’d recommend avoiding the Q2 versions. Across all the Qwen3.5 GGUF evaluations I’ve run so far, Q2 was only really usable for the largest model (397B). Nothing surprising here: quantizing a model mostly to 2-bit precision is very hard, and the damage is much worse for smaller models.
I haven’t evaluated the Qwen3.5 122B version yet.
By the way, if you’re unfamiliar with GGUF and want to run these models locally, I was invited to write a full tutorial for Michael Spencer’s newsletter explaining how to do it. You can read the article here:
Next week, I’ll publish a complete summary of all my Qwen3.5 GGUF evaluations, including results with KV cache quantized to Q4 and Q8.
In another article, I’ll also release a detailed analysis of Qwen3.5 quantization, focusing on quantization recipes and evaluations. Right now I’m running evaluations on many quantized Qwen3.5 variants (9B, 27B, 35B-A3B):
FP8
NVFP4
INT4
From multiple providers:
Official Qwen releases (FP8 + INT4)
Intel (AutoRound INT4)
Cyankiwi (AWQ INT4)
My own quantizations (NVFP4 + INT4 using AutoRound and LLM Compressor; see the sketch after this list)
I’m testing both thinking and non-thinking modes.
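For the AutoRound runs, the recipe is roughly the minimal sketch below. The checkpoint name, bit width, and group size are placeholders rather than my exact settings, and the AutoRound API may differ slightly between versions.

```python
# Hedged sketch: INT4 weight-only quantization with Intel AutoRound.
# The model name and hyperparameters below are placeholders, not the exact recipe used here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.5-9B"  # placeholder checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights, group size 128, symmetric quantization (common defaults)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./qwen3.5-9b-int4-autoround", format="auto_round")
```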
GPUs
Provided by Verda (compute sponsorship):
H200 ×1: Evaluation of Qwen3.5 27B and Qwen3.5 35B INT4 and FP8 models
RTX Pro 6000 ×1: Evaluation of Qwen3.5 9B models
B200 ×1: Evaluation of Qwen3.5 27B and Qwen3.5 35B NVFP4 models (I was very lucky to get this one. Most providers don’t have any available right now.)
I could run more experiments in parallel to speed things up, but at some point it becomes difficult to keep track of everything. I’ve already run into a few issues:
On H200 and H100, vLLM sometimes stops processing requests. The process stays alive, but nothing happens, and I have to manually restart the engine. That’s a vLLM issue.
Qwen3.5 35B (non-thinking) is broken for me on vLLM + H200: it gets stuck in a “thinking” loop for 90% of my benchmark prompts. Oddly, I can’t reproduce the issue on the RTX Pro 6000, and other Qwen3.5 models run fine. I also couldn’t find anyone else reporting the same problem, so I’m pausing debugging for now.
So next week will again be mostly focused on Qwen3.5.
And it probably won’t end there on my side. I also want to experiment with fine-tuning some of their base models with Unsloth.
Meanwhile, I’m still waiting for:
DeepSeek V4, which seems to be “coming soon” every week but never actually releases
NVIDIA’s large Nemotron 3 models (~100B and ~500B parameters; will Mamba beat Gated DeltaNet?)
Also, just when I was wondering what happened to the Microsoft Phi series, they released a new Phi-4 reasoning model with vision.
I haven’t tried it yet. The benchmark scores look good, but since they didn’t publish results on language tasks, not even in the report, I suspect it may not match recent models of similar size. The model card and config.json also mention a maximum context length of 16,384 tokens, which seems oddly small for a reasoning model. So it’s either very bad, or extremely efficient at reasoning.
Speculative Speculative Decoding (SSD): Draft and Verify in Parallel
Researchers from Stanford University, Princeton University, and Together AI have proposed a new way to speed up LLM text generation, building on speculative decoding. The method, called speculative speculative decoding (SSD), overlaps the work of a small “draft” model with the work of a larger “target” model. With their open-source implementation, the authors report up to 2x faster decoding than optimized speculative decoding and up to ~5x faster than standard autoregressive decoding.
The Bottleneck of Standard Speculative Decoding
LLMs generate text one token at a time. Even though GPUs can do lots of work in parallel, the decoding loop is fundamentally sequential: you can’t generate token t+1 until you know token t.
Speculative decoding (SD) improved this by using:
a fast draft model to guess a short run of next tokens, then
a slow target model to verify those guesses in one parallel forward pass, accepting as many as match what the target would have produced, then continuing.
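In pseudocode, one round of this draft-then-verify loop looks roughly like the sketch below. It uses greedy matching only (real implementations use rejection sampling to stay lossless under sampling), and `draft` and `target` are placeholder callables returning logits, not any particular library’s API.

```python
# Hedged sketch of one round of standard (greedy) speculative decoding.
# `draft` and `target` are placeholder callables mapping token ids -> logits
# of shape [batch, seq_len, vocab]; batch size 1 is assumed.
import torch

def speculative_decode_step(draft, target, prefix: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One draft-then-verify round. Returns the accepted tokens plus one bonus token."""
    # 1) Draft model guesses k tokens autoregressively (cheap but sequential).
    guesses, ctx = [], prefix
    for _ in range(k):
        next_tok = draft(ctx)[:, -1, :].argmax(dim=-1, keepdim=True)
        guesses.append(next_tok)
        ctx = torch.cat([ctx, next_tok], dim=-1)

    # 2) Target verifies all k guesses in a single parallel forward pass.
    logits = target(ctx)                                           # [1, len(prefix) + k, vocab]
    preds = logits[:, prefix.shape[-1] - 1:-1, :].argmax(dim=-1)   # what the target would have produced

    # 3) Accept the longest matching prefix of the guesses...
    accepted = []
    for i, g in enumerate(guesses):
        if g.item() != preds[0, i].item():
            break
        accepted.append(g)

    # ...then take the target's next token after the accepted prefix as a "bonus" token.
    bonus = logits[:, prefix.shape[-1] - 1 + len(accepted), :].argmax(dim=-1, keepdim=True)
    return torch.cat(accepted + [bonus], dim=-1)
```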
But SD still has a pipeline bubble: the draft step and the verify step alternate. The target can’t verify until the draft has produced guesses. The draft can’t start the next round until verification finishes (because it needs to know how many tokens were accepted, and what “bonus” token came next).
Can we eliminate that back-and-forth dependency?
What SSD Changes
SSD treats the verifier’s result like a “branch” you can speculate on, similar to speculative execution in CPUs. While the target model is busy verifying the previous guess, the draft model doesn’t wait. Instead it:
predicts the most likely verification outcomes, and
pre-computes draft continuations for each likely outcome, storing them in a cache.
When the target finishes verifying, it returns the actual outcome. If that outcome is in the cache, the next draft tokens are available immediately, so the next round can begin without draft latency. If the prediction is wrong, SSD falls back to a standard (synchronous) speculation path, keeping correctness intact (“lossless”).
In standard SD (baseline):
Draft model generates K candidate tokens.
Target model verifies them (fast in parallel) and decides how many to accept, then chooses the next token.
Repeat.
In SSD:
Round i: target verifies the draft tokens (same as SD).
At the same time, the draft model starts working on Round i+1, but it doesn’t know the exact prefix yet because it doesn’t know the verification outcome from Round i.
So the draft model creates a set of plausible prefixes (different “what if the verifier accepted j tokens and then picked token X next?” outcomes) and precomputes a continuation for each one.
When the verifier finishes Round i, it sends back the actual outcome and the draft process does a cache lookup:
cache hit: instantly return precomputed tokens/logits for the next verification
cache miss: run a fallback drafter and proceed (still exact, just less speedup that round)
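Conceptually, the draft-side speculation cache works roughly like the sketch below. All the method names (generate, verify, likely_outcomes) are hypothetical placeholders for the draft/target interfaces, not the authors’ code, and in the real system the two models run on separate GPUs rather than in Python threads.

```python
# Hedged sketch of one SSD round with a draft-side speculation cache.
# `draft` / `target` and their methods are hypothetical placeholders, not the authors' API.
from concurrent.futures import ThreadPoolExecutor

def ssd_round(draft, target, prefix: list[int], guesses: list[int], k: int = 4, branches: int = 3):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Round i: the target verifies the current draft tokens (ideally on its own GPUs).
        verify_future = pool.submit(target.verify, prefix, guesses)

        # Meanwhile, the draft precomputes continuations for the most likely
        # verification outcomes (accepted length + bonus token) instead of idling.
        cache = {}
        for acc_len, bonus in draft.likely_outcomes(prefix, guesses, branches):
            branch_prefix = prefix + guesses[:acc_len] + [bonus]
            cache[(acc_len, bonus)] = draft.generate(branch_prefix, k)

        acc_len, bonus = verify_future.result()  # the actual verification outcome

    new_prefix = prefix + guesses[:acc_len] + [bonus]
    next_guesses = cache.get((acc_len, bonus))        # cache hit: next draft tokens are free
    if next_guesses is None:
        next_guesses = draft.generate(new_prefix, k)  # cache miss: synchronous fallback (still exact)
    return new_prefix, next_guesses
```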
A key point: the target model does not verify a tree (which would increase target compute). The “branching” happens on the draft side only. The verifier still processes the usual single sequence verification step.
Results
Qwen-3 32B target, Qwen-3 0.6B draft:
Average: normal AR 88.8 tok/s, SD 136.8 tok/s, SSD 203.8 tok/s
1.49x vs SD and 2.29x vs AR.
The authors also say that SSD improves the throughput–latency Pareto frontier (not just “best case latency”), especially at smaller batch sizes.
The Hard Part: Implementation
As with KV cache quantization research, you can find countless improved speculative decoding algorithms that are, in theory, faster than the standard one. They are usually quickly forgotten because they are too hard to implement efficiently.
The SSD authors probably spent most of their time optimizing their implementation so that it could beat highly optimized inference engines like SGLang and vLLM.
They implemented it in a custom PyTorch inference engine, using common serving optimizations (PagedAttention, continuous batching, tensor parallelism, BF16 mixed precision, torch.compile, CUDA Graphs), which is very impressive for what is “just” a research project.
Hardware layout:
Target model split across 4 GPUs
Draft model on a separate GPU (so drafting can run truly in parallel with verification)
Communication:
Draft and target communicate once per speculation round via NCCL, sending compact metadata (like accepted-prefix length and which token followed), and returning cache-hit signals plus the next speculative tokens/logits. They explicitly note no KV cache is transferred between draft and target devices.
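Based on that description, the per-round payload could look something like the sketch below. The field names are my own guesses at what “compact metadata” means here, not the authors’ actual protocol.

```python
# Hedged sketch of the per-round draft<->target messages; field names are assumptions
# based on the description above, not the authors' protocol. No KV cache is transferred.
from dataclasses import dataclass
import torch

@dataclass
class VerifyResult:              # target -> draft, once per round
    accepted_len: int            # how many draft tokens the target accepted
    bonus_token: int             # the token the target produced after the accepted prefix

@dataclass
class DraftResponse:             # draft -> target, once per round
    cache_hit: bool              # whether this outcome had been precomputed
    next_tokens: torch.Tensor    # next speculative tokens to verify
    next_logits: torch.Tensor    # corresponding draft logits (needed for lossless acceptance)
```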
Draft-side branching:
To precompute multiple “possible next states,” the draft decodes many branches in parallel using a custom sparse attention mask. They use FlashAttention when possible and fall back to FlashInfer for cases that require custom masks.
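The general pattern behind such a mask is easy to illustrate: every branch attends to the shared prefix and causally to its own tokens, but never to other branches, so all branches can be decoded in one batched forward pass. The sketch below is my own illustration of that pattern, not the authors’ kernel.

```python
# Hedged sketch of a branch attention mask for decoding several draft branches in parallel.
# True = attention allowed. This illustrates the general pattern, not the authors' implementation.
import torch

def branch_attention_mask(prefix_len: int, branch_lens: list[int]) -> torch.Tensor:
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Shared prefix: ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len, dtype=torch.bool))

    start = prefix_len
    for blen in branch_lens:
        rows = slice(start, start + blen)
        mask[rows, :prefix_len] = True  # branch tokens see the whole shared prefix...
        mask[rows, start:start + blen] = torch.tril(torch.ones(blen, blen, dtype=torch.bool))  # ...and their own branch, causally
        start += blen
    return mask

# Example: a 5-token prefix with three 4-token branches decoded in one batch.
print(branch_attention_mask(5, [4, 4, 4]).int())
```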
Overheads/tradeoffs:
The method spends more draft compute (often “wasted” on branches that aren’t used) and maintains a speculation cache that can be hundreds of MB in practice, but this is on the small draft GPU and is designed to buy latency reductions.
Now, we want it in vLLM and SGLang.
Also, their implementation only supports Llama 3.1 and Qwen3 models for now. It’s probably easy to add more models with a similar architecture. However, for models with non-standard modules (MLA, Mamba, GDN, etc.), it’s probably not worth it, since they didn’t implement the related optimizations: SSD could end up slower than normal SD with vLLM/SGLang for these models. I’ll ask Codex to add Qwen3.5 support. We’ll see how it goes.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we review:
⭐Reinforcement-aware Knowledge Distillation for LLM Reasoning
Recursive Think-Answer Process for LLMs and VLMs
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!