Table of Contents
This page organizes the articles of The Kaitchup into categories. It supplements Substack's search engine (top-left corner of the website), which is also helpful if you are looking for an article on a specific AI technique or model.
Fine-tuning
LoRA and QLoRA
Fine-tune Mixtral-8x7B Quantized with AQLM (2-bit) on Your GPU
QA-LoRA: Quantization-Aware Fine-tuning for Large Language Models
LQ-LoRA: Jointly Fine-tune and Quantize Large Language Models
Training, Loading, and Merging QDoRA, QLoRA, and LoftQ Adapters
Fine-tune the Token Embeddings and the Language Modeling Head of Llama 3
Fine-tuning Base LLMs vs. Fine-tuning Their Instruct Version
Preference Optimization: DPO, IPO, ORPO, …
Fine-tune Better Chat Models with Distilled Identity Preference Optimization (IPO)
Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)
ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step
Reinforcement Learning with Human Feedback (RLHF)
Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning
Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #2: Training a Reward Model
Optimization
Quantization
AutoRound
Marlin
AQLM
Run a 7.7x Smaller Mixtral-8x7B on Your GPU with AQLM 2-bit Quantization
Fine-tune Mixtral-8x7B Quantized with AQLM (2-bit) on Your GPU
GPTQ
From 16-bit to 2-bit: Finding the Best Trade-off Between Memory-Efficiency and Accuracy
Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer
GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2
Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL
AWQ
Fast and Small Llama 2 with Activation-Aware Quantization (AWQ)
Simple, Fast, and Memory-Efficient Inference for Mistral 7B with Activation-Aware Quantization (AWQ)
bitsandbytes NF4
ExLlama
SqueezeLLM
HQQ and HQQ+
Efficient Loading and Inference
KV Cache Quantization for Memory-Efficient Inference with LLMs
Neural Speed: Fast Inference on CPU for 4-bit Large Language Models
GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU
vLLM: Serve Fast Mistral 7B and Llama 2 Models from Your Computer
Speculative Decoding for Faster Inference with Mixtral-8x7B and Gemma
Device Map: Avoid Out-of-Memory Errors When Running Large Language Models
Safe, Fast, and Memory Efficient Loading of LLMs with Safetensors
Serve Large Language Models from Your Computer with Text Generation Inference
RAG
RAG for Mistral 7B Instruct with LlamaIndex and Transformers
Train Better Llama 3 Embeddings with Simple Contrastive Learning
Pre-training
Merging and Mixture of Experts
The Mayonnaise: Rank First on the Open LLM Leaderboard with TIES-Merging
Run Mixtral-8x7B on Consumer Hardware with Expert Offloading
Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts by Mistral AI
Benchmarking
GPU Benchmarking: What Is the Best GPU for LoRA, QLoRA, and Inference?
Optimum-Benchmark: How Fast and Memory-Efficient Is Your LLM?
Estimate the Memory Consumption of LLMs for Inference and Fine-tuning
LLM Focus
Llama 3
Avoid Quantizing Llama 3 8B with GPTQ and Use BitsandBytes Instead
Llama 3.1: Fine-tuning on Consumer Hardware — LoRA vs. QLoRA
Llama 2
OpenELM
Falcon
Mistral 7B
Microsoft phi-1.5, phi-2, and phi-3
Google Gemma
Google's Gemma: Fine-tuning, Quantization, and Inference on Your Computer
Fine-tune a Better Google Gemma with Unsloth and Distilled DPO
Qwen
Fine-tuning and Quantization of Qwen1.5 LLMs on Your Computer
Qwen2 vs. Llama 3: QLoRA Learning Curves and Quantization Performance
Yi
SmolLM
VLMs
Florence-2: Run Multitask Vision-language Models on Your Computer
Fine-tune a Multimodal Chat Model with Florence-2 on Your Computer
Machine Translation
Fine-tuning
GPT
Evaluation
Traditional Versus Neural Metrics for Machine Translation Evaluation
Scientific Credibility in Machine Translation Research: Pitfalls and Promising Trends