The Kaitchup – AI on a Budget
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
The Weekly Kaitchup #145
Jun 5 • Benjamin Marie
DFlash vs MTP: Qwen3.6 Speculative Decoding Benchmarks with vLLM and llama.cpp
Up to 4x faster inference: benchmarking speed on coding, chat, and math tasks with optimal hyperparameters
Jun 2 • Benjamin Marie
May 2026
Qwen3.5 9B MoQ: Inside a Strong 3.6-bit GGUF
The Weekly Kaitchup #144
May 29 • Benjamin Marie
Reasoning Budgets vs. Structured CoT: Controlling LLM Thinking Tokens
Evaluations of BNF grammars and reasoning budgets with Qwen3.6 27B
May 25 • Benjamin Marie
Gated DeltaNet-2: Better Memory Editing for Linear Attention
The Weekly Kaitchup #143
May 22 • Benjamin Marie
Train and Run DFlash Speculative Decoding
A simple method to make your local model much faster
May 18 • Benjamin Marie
SlimQwen Compression, Elastic Models, and Aurora Optimization
The Weekly Kaitchup #142
May 15 • Benjamin Marie
Qwen3.6 27B Quantization: FP8 vs INT4 vs NVFP4
Testing accuracy, latency, memory usage, and MTP efficiency after quantization.
May 12 • Benjamin Marie
MTP Layers for Gemma 4 and My Projects in Progress
The Weekly Kaitchup #141
May 8 • Benjamin Marie
Qwen3.6 27B vs Qwen3.5 27B vs Gemma 4 31B: Accuracy, Latency, Memory, and Token Efficiency Tested
Qwen3.6 improves on Qwen3.5, but Gemma 4 remains surprisingly competitive.
May 5 • Benjamin Marie
Nemotron 3 Omni Explained: Architecture, Training, and How to Run It
The Weekly Kaitchup #140
May 1 • Benjamin Marie
April 2026
How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting
Is thinking disabled + multiple retries better and still more efficient than thinking enabled?
Apr 27 • Benjamin Marie