The Kaitchup – AI on a Budget
Archive
Summary of Qwen3.5 GGUF Evaluations + My Evaluation Method
Including evaluations with KV cache quantized
Mar 10 • Benjamin Marie
More Qwen3.5 GGUF Evals and Speculative Speculative Decoding (SSD)
The Weekly Kaitchup #133
Mar 6 • Benjamin Marie
Qwen3.5 9B, 4B, 2B & 0.8B: GPU Requirements, VRAM Usage & KV Cache Breakdown (262K Context)
How much memory do Qwen3.5 small models really need?
Mar 3 • Benjamin Marie
Disable “Thinking,” Still Get Thousands of Tokens: What Instruct LLMs Are Doing
Token caps, not labels, explain many benchmark gaps.
Mar 2 • Benjamin Marie
February 2026
Lessons from GGUF Evaluations: Ternary Qwen3.5, Bricked Minimax
The Weekly Kaitchup #132
Feb 27 • Benjamin Marie
Qwen3.5 Medium Models: Dense vs. MoE
75% linear attention layers, tiny KV cache, strong results.
Feb 25 • Benjamin Marie
How to Deploy Your LLM in the Cloud
A simple recipe for choosing your GPU and anticipating costs
Feb 23 • Benjamin Marie
Taalas HC1: Absurdly Fast, Per-User Inference at 17,000 tokens/second
The Weekly Kaitchup #131
Feb 20 • Benjamin Marie
Qwen3.5: Scaling Hybrid Attention to 397B Parameters
With Qwen3.5's memory requirements and GGUF recommendations
Feb 19 • Benjamin Marie
GLM-5 Memory Requirements Explained: MLA + DeepSeek Sparse Attention (DSA)
How GLM-5 fits 200K context without terabytes of KV cache, and what GPUs you need.
Feb 16 • Benjamin Marie
Nanbeige4.1: Only 3B Parameters, but as Good as Qwen3 32B?
The Weekly Kaitchup #130
Feb 14 • Benjamin Marie
Run GLM-4.7 Flash on One GPU: VRAM Math, Quantization Options, and Benchmark Results
And how good is it with "thinking" disabled?
Feb 10 • Benjamin Marie