The Kaitchup – AI on a Budget | Benjamin Marie | Substack

Stop Paying for FP16 KV Cache

Near-zero quality drop, big wins for long sequences and high concurrency

The Kaitchup – AI on a Budget

Weekly tutorials and news on adapting large language models (LLMs) to your tasks and hardware using the most recent techniques and models. The Kaitchup offers a collection of 170+ regularly updated AI notebooks.

Recent posts

This Week: GLM 4.7 Flash's Huge KV Cache and LFM2.5 Thinking

The Weekly Kaitchup #127

Jan 23 • Benjamin Marie

Serving ExLlamaV3 Models with tabbyAPI: Accuracy, Speed, and Recommendations

With comparisons against AutoRound and GGUF models served with vLLM

Jan 19 • Benjamin Marie

MMLU-Pro Has an Answer Leak (and It’s Just Whitespace)

The Weekly Kaitchup #126

Jan 16 • Benjamin Marie

4-bit GLM4.7 with a Single B300: High Speed and 100% Accuracy on AIME24

Just give it enough tokens to think

Jan 12 • Benjamin Marie

LFM2.5 and Falcon H1R-7B: New Hybrid Models with Strong Benchmark Scores

The Weekly Kaitchup #125

Jan 9 • Benjamin Marie

Nemotron 3 Nano: A Very Fast Model That Doesn't Think Too Much

At last: “thinking” that’s truly optional.

Jan 5 • Benjamin Marie

Top posts

vLLM vs Ollama: Which LLM Inference Tool Should You Use?

Jul 7, 2025 • Benjamin Marie

RAG with Qwen3 Embedding and Qwen3 Reranker

Jun 19, 2025 • Benjamin Marie

RTX Pro 6000 vs H100 vs A100: Best Single-GPU Choice for Fast, Low-Cost LLM Fine-Tuning

Jun 16, 2025 • Benjamin Marie

LoRA Adapters: When a Naive Merge Leads to Poor Performance

Sep 7, 2023 • Benjamin Marie

Multimodal RAG with ColPali and Qwen2-VL on Your Computer

Sep 16, 2024 • Benjamin Marie

Recommendations

Trelis Research

Trelis Research

The Salt - Curated AI

Benjamin Marie

💎DiamantAI

Nir Diamant

Generative AI Publication

Jim Clyde Monge

Artificial Ignorance

Charlie Guo

Tutorials

Serving ExLlamaV3 Models with tabbyAPI: Accuracy, Speed, and Recommendations

With comparisons against AutoRound and GGUF models served with vLLM

Jan 19 • Benjamin Marie

4-bit GLM4.7 with a Single B300: High Speed and 100% Accuracy on AIME24

Just give it enough tokens to think

Jan 12 • Benjamin Marie

Eagle 3 Speculators: When To Use Them?

Easier and faster speculative decoding, if you are in the right settings

Dec 9, 2025 • Benjamin Marie

Accelerate Models with Quantization: Recipes for NVFP4, GPTQ, AWQ, SmoothQuant, AutoRound, and FP8

Focus on 4-bit and 8-bit quantization + vLLM benchmarking with accuracy and inference throughput

Nov 24, 2025 • Benjamin Marie

Unsloth's Quantization-Aware Training (QAT) vs Post-Training Quantization (PTQ) for Small Models

Can a tiny LLM stay accurate under quantization thanks to QAT?

Nov 10, 2025 • Benjamin Marie
