The Kaitchup – AI on a Budget
Tutorials
Train and Run DFlash Speculative Decoding
A simple method to make your local model much faster
May 18 • Benjamin Marie
How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting
Is thinking disabled + multiple retries better and still more efficient than thinking enabled?
Apr 27 • Benjamin Marie
The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared
A memory-first look at four efficient open LLM architectures.
Mar 18 • Benjamin Marie
Qwen3.5 Quantization: Similar Accuracy, More Thinking — Best Models and Recipes
INT4, NVFP4, and FP8 evaluations — Thinking off and on
Mar 12 • Benjamin Marie
How to Deploy Your LLM in the Cloud
The simple recipe to choose your GPU and anticipate costs
Feb 23 • Benjamin Marie
GLM-5 Memory Requirements Explained: MLA + DeepSeek Sparse Attention (DSA)
How GLM-5 fits 200K context without terabytes of KV cache, and what GPUs you need.
Feb 16 • Benjamin Marie
Serving ExLlamaV3 Models with tabbyAPI: Accuracy, Speed, and Recommendations
With comparisons against AutoRound and GGUF models served with vLLM
Jan 19 • Benjamin Marie
4-bit GLM-4.7 (358B) on a Single NVIDIA B300 with vLLM: AWQ vs NVFP4 vs INT4
Just give it enough tokens to think
Jan 12 • Benjamin Marie
Eagle 3 Speculators: When To Use Them?
Easier and faster speculative decoding, if you are in the right settings
Dec 9, 2025 • Benjamin Marie
Accelerate Models with Quantization: Recipes for NVFP4, GPTQ, AWQ, SmoothQuant, AutoRound, and FP8
Focus on 4-bit and 8-bit quantization + vLLM benchmarking with accuracy and inference throughput
Nov 24, 2025 • Benjamin Marie
Unsloth's Quantization-Aware Training (QAT) vs Post-Training Quantization (PTQ) for Small Models
Can a tiny LLM stay accurate under quantization thanks to QAT?
Nov 10, 2025 • Benjamin Marie
Advanced LoRA Fine-Tuning: How to Pick LoRA, QLoRA, DoRA, PiSSA, OLoRA, EVA, and LoftQ for LLMs
A practical guide to parameter-efficient LLM adaptation on 16-bit and 4-bit models
Nov 3, 2025 • Benjamin Marie