The Kaitchup – AI on a Budget
Archive
March 2026
Nemotron 3 Super: 1M Tokens, Small KV Cache
The Weekly Kaitchup #134
Mar 13 · Benjamin Marie
Qwen3.5 Quantization: Similar Accuracy, More Thinking — Best Models and Recipes
INT4, NVFP4, and FP8 evaluations — Thinking off and on
Mar 12 · Benjamin Marie
Summary of Qwen3.5 GGUF Evaluations + My Evaluation Method
Including evaluations with KV cache quantized
Mar 10 · Benjamin Marie
More Qwen3.5 GGUF Evals and Self-Speculative Decoding (SSD)
The Weekly Kaitchup #133
Mar 6 · Benjamin Marie
Qwen3.5 9B, 4B, 2B & 0.8B: GPU Requirements, VRAM Usage & KV Cache Breakdown (262K Context)
How much memory do Qwen3.5 small models really need?
Mar 3 · Benjamin Marie
Disable “Thinking,” Still Get Thousands of Tokens: What Instruct LLMs Are Doing
Token caps, not labels, explain many benchmark gaps.
Mar 2 · Benjamin Marie
February 2026
Lessons from GGUF Evaluations: Ternary Qwen3.5, Bricked Minimax
The Weekly Kaitchup #132
Feb 27 · Benjamin Marie
Qwen3.5 Medium Models: Dense vs. MoE
75% linear attention layers, tiny KV cache, strong results.
Feb 25 · Benjamin Marie
How to Deploy Your LLM in the Cloud
The simple recipe to choose your GPU and anticipate costs
Feb 23 · Benjamin Marie
Taalas HC1: Absurdly Fast, Per-User Inference at 17,000 tokens/second
The Weekly Kaitchup #131
Feb 20 · Benjamin Marie
Qwen3.5: Scaling Hybrid Attention to 397B Parameters
With Qwen3.5's memory requirements and GGUF recommendations
Feb 19 · Benjamin Marie
GLM-5 Memory Requirements Explained: MLA + DeepSeek Sparse Attention (DSA)
How GLM-5 fits 200K context without terabytes of KV cache, and what GPUs you need.
Feb 16 · Benjamin Marie