Sitemap - 2024 - The Kaitchup – AI on a Budget
Deploy Your Fine-Tuned LoRA Adapters with Ollama
Fast and Memory-Efficient Text-to-SQL with Qwen2.5 Coder 32B Instruct on Your GPU
Phi-4: What's New and How to Fine-Tune It on Your Computer (+ quantized version)
Schedule-Free Optimizer: Does It Work for LLMs?
Fine-Tuning Llama 3.3 70B with a Single GPU
Quantize and Run Llama 3.3 70B Instruct on Your GPU
Multi-GPU DPO Training with FSDP: Full Training, LoRA, and QLoRA
LLM Alignment: Searching for Optimal ORPO Hyperparameters
Fine-Tune Llama 3.2 Vision, Pixtral, and Qwen2-VL on Your Computer with Unsloth
The Recipe for Extremely Accurate and Cheap Quantization of 70B+ LLMs
Find the Best LLM for Your GPU
DPO Full Training vs. LoRA: How Good is LoRA for DPO Training?
Fast Inference with GGUF LoRA Adapters on Your CPU
Torch Compile: 2x Faster Llama 3.2 with Low Effort
LLM as a Judge: Evaluate Your LLMs with Another LLM
The Kaitchup Pro's Access Token for AI Toolboxes
Llama 3.2 Embeddings: Training and Evaluation with LLM2Vec
The Impact of the Calibration Dataset for AutoRound and AWQ Quantization
bitnet.cpp: Efficient Inference with 1-Bit LLMs on Your CPU
Generate Videos on Your Computer with Pyramid Flow
Fixing Faulty Gradient Accumulation: Understanding the Issue and Its Resolution
Train and Serve an AI Chatbot Based on Llama 3.2
Fast Speculative Decoding with Llama 3.2 and vLLM
Generate Synthetic Data from Personas to Train AI Chatbots
Fine-tuning LLMs with 32-bit, 8-bit, and Paged AdamW Optimizers
The Unreasonable Impact of Gradient Checkpointing for Fine-tuning LLMs
Fine-Tuning Meta's Llama 3.2 1B & 3B Models on Budget GPUs
[Early Access] LLMs on a Budget, Chapter 1: Parameter-Efficient Fine-Tuning
How to Set Up a PEFT LoraConfig
transformers.js: Run Phi-3.5 & Llama 3.2 in Your Browser
Qwen2.5 QLoRA, LoRA, and Full Fine-tuning on Your Computer
Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM
Multimodal RAG with ColPali and Qwen2-VL on Your Computer
Introducing Minivoc: Faster and Memory-Efficient LLMs Through Vocabulary Reduction [WIP]
GuideLLM: Is Your Server Ready for LLM Deployment?
GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU
Falcon Mamba, Jamba, RWKV... Can You Use Them on Your Computer?
The Kaitchup's Book: LLMs on a Budget
Run Qwen2-VL on Your Computer with Text, Images, and Video, Step by Step
Run Llama 3.1 70B Instruct on Your GPU with ExLlamaV2 (2.2, 2.5, 3.0, and 4.0-bit)
Mistral-NeMo: 4.1x Smaller with Quantized Minitron
Fine-tuning Phi-3.5 MoE and Mini on Your Computer
QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU
Fine-tuning Base LLMs vs. Fine-tuning Their Instruct Version
The Best Quantization Methods to Run Llama 3.1 on Your GPU
SmolLM: Full Fine-tuning and Aligning Tiny LLMs on Your Computer
Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA
Serve Multiple LoRA Adapters with vLLM
Llama 3.1: Fine-tuning on Consumer Hardware — LoRA vs. QLoRA
Llama 3 405B: Can You Fine-tune It?
Function Calling: Fine-tuning Llama 3 and Qwen2 on xLAM
GPU Benchmarking: What Is the Best GPU for LoRA, QLoRA, and Inference?
Fine-tune Gemma 2 on Your Computer with LoRA and QLoRA
Train Better Llama 3 Embeddings with Simple Contrastive Learning
Fine-tune a Multimodal Chat Model with Florence-2 on Your Computer
rsQLoRA: Fine-tune Llama 3 with Higher Ranks and QLoRA
Florence-2: Run Multitask Vision-language Models on Your Computer
Intel AutoRound: Accurate Low-bit Quantization for LLMs
Simple QLoRA Fine-tuning with Axolotl
Continue Pre-training Llama 3 and Other LLMs on Your Computer
KV Cache Quantization for Memory-Efficient Inference with LLMs
Qwen2 vs. Llama 3: QLoRA Learning Curves and Quantization Performance
My LLM Can't Stop Generating: How to Fix It?
Fine-tune Tiny Adapters for Llama 3 with VeRA
Fine-tune Phi-3 Medium on Your Computer
1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+
Fine-tune the Token Embeddings and the Language Modeling Head of Llama 3
From Llama 3 70B to 120B: How to Self-Augment an LLM?
Fine-tuning LLMs with a Chat Template
Avoid Quantizing Llama 3 8B with GPTQ and Use BitsandBytes Instead
Fine-tune Llama 3 70B on Your GPU with AQLM 2-bit
Fine-tune Tiny Chat Models with Apple OpenELM and ORPO
Run Llama 3 70B on Your GPU with ExLlamaV2
Phi-3 mini: Fine-tuning and Quantization on Your Computer
Turn Llama 3 into an Embedding Model with LLM2Vec
Estimate the Memory Consumption of LLMs for Inference and Fine-tuning
Fine-tune Llama 3 on Your Computer
Training, Loading, and Merging QDoRA, QLoRA, and LoftQ Adapters
Neural Speed: Fast Inference on CPU for 4-bit Large Language Models
LoftQ: Better Initialization for a Quantization-Aware LoRA
ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step
GaLore: Full Fine-tuning on Your GPU
A Guide on Hyperparameters and Training Arguments for Fine-tuning LLMs
Marlin: Nearly Ideal Inference Speed for 4-bit Models with vLLM (1k+ tokens/sec)
RAG for Mistral 7B Instruct with LlamaIndex and Transformers
Yi: Fine-tune and Run One of the Best Bilingual LLMs on Your Computer
Fine-tune a Better Google Gemma with Unsloth and Distilled DPO
Fine-tune Mixtral-8x7B Quantized with AQLM (2-bit) on Your GPU
DoRA vs. LoRA: Better and Faster than LoRA?
Speculative Decoding for Faster Inference with Mixtral-8x7B and Gemma
GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU
Google's Gemma: Fine-tuning, Quantization, and Inference on Your Computer
Run a 7.7x Smaller Mixtral-8x7B on Your GPU with AQLM 2-bit Quantization
Fine-tuning and Quantization of Qwen1.5 LLMs on Your Computer
vLLM: Serve Fast Mistral 7B and Llama 2 Models from Your Computer
SqueezeLLM: Better 3-bit and 4-bit Quantization for Large Language Models
TinyLlama: Pre-training a Small Llama 2 from Scratch
From 16-bit to 2-bit: Finding the Best Trade-off Between Memory-Efficiency and Accuracy
The Mayonnaise: Rank First on the Open LLM Leaderboard with TIES-Merging
Fine-tune a Mixture of Experts on Your Computer
Maixtchup: Make Your Own Mixture of Experts with Mergekit
Optimum-Benchmark: How Fast and Memory-Efficient Is Your LLM?
Run Mixtral-8x7B on Consumer Hardware with Expert Offloading