Many online leaderboards rank the best large language models (LLMs) on public benchmarks. For example, Chatbot Arena and the OpenLLM Leaderboard are two of the most popular.
These leaderboards are useful, but they don’t help you figure out which model will run best on your specific GPU. Right now, you mostly have to guess, based on the number of parameters, whether a model will run well on your hardware. This makes it hard to know whether you’ll be able to run or fine-tune a model with your GPU(s).
Things get even trickier when models are quantized (compressed). The number of parameters no longer tells the whole story: a larger quantized model can outperform a smaller one while using less memory. For example, a 70B parameter model quantized to 4-bit may still perform better than a 40B model while requiring less GPU memory.
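To see why, a back-of-envelope calculation of the weight memory alone is enough. The sketch below ignores the KV cache and activation overhead, and it assumes the 40B model is kept in 16-bit precision; these are assumptions for illustration, not values from the leaderboards:

# Approximate weight memory: number of parameters × bytes per parameter.
# This ignores the KV cache and activation overhead.
def weight_memory_gb(n_params_billion, bits_per_param):
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(70, 4))   # 70B at 4-bit: ~35 GB
print(weight_memory_gb(40, 16))  # 40B at float16: ~80 GB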
This article introduces The Kaitchup’s Leaderboards. These leaderboards don’t just show how models perform on benchmarks. They also tell you:
How much memory each model needs during inference
How much memory each model will need for full fine-tuning and LoRA (coming soon!)
These memory requirements are not estimated but measured. With this extra information, you can better decide which model will work best on your hardware.
You can find the leaderboards, regularly updated with new models, on this page:
Using these data points, you will be able to pick the best model for your configuration.
Note: This is still a work in progress. I’ll add more models, especially instruct models, and benchmark scores as I unlock new computational resources.
You can find the commands to replicate this LLM benchmarking (accuracy and memory consumption) in this notebook:
How to Use the Leaderboards
There are four leaderboards that group models based on required GPU sizes: 8 GB, 12 GB, 16 GB, and 24 GB. These sizes represent the recommended GPU memory needed to run each model.
I’ve only tested models that can run on consumer-grade hardware, but I might add larger models later.
How do you decide if a model can run on a GPU?
This was one of the toughest questions in building the leaderboard. GPU memory usage depends on several factors, including the inference framework and specific settings like the size of the key-value (KV) cache, sequence length, and batch size.
To make things consistent, I use:
Inference framework: Hugging Face Transformers
Batch size: 1
Sequence length: 1,024
Hugging Face Transformers is not particularly optimized for memory consumption, which I think makes it a good baseline: if a model runs with Transformers at a batch size of 1 and a sequence length of 1,024, it will also run with a larger batch size or longer sequences under a better-optimized framework.
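For illustration, here is a minimal sketch of such a check with Transformers. The model ID is only a placeholder, and the peak it reports will differ from the Optimum-Benchmark numbers shown in the leaderboards:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID; replace it with the model you want to test
model_id = "meta-llama/Meta-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")

# Dummy batch: batch size 1, sequence length 1,024 (the leaderboard settings)
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 1024), device="cuda:0")
attention_mask = torch.ones_like(input_ids)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=64)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")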
Zero-shot Evaluation of LLMs
To begin, I chose the following public benchmarks (tasks as named in the Evaluation Harness):
leaderboard_gpqa (GPQA)
leaderboard_musr (MuSR)
arc_challenge (ARC_C)
leaderboard_mmlu_pro (MMLU-PRO)
mmlu (MMLU)
These benchmarks are widely used for evaluating LLMs. ARC Challenge and MMLU were included in the first version of the Hugging Face OpenLLM Leaderboard, while the newer benchmarks were added in the second version.
I believe these benchmarks give a solid overview of an LLM’s performance and are among the most affordable to run. Some benchmarks in the OpenLLM Leaderboard, like IFEval, are more expensive to compute, so I haven’t included them yet due to the costs involved.
All the models are evaluated with the exact same hyperparameters and hardware.
Why I Used Zero-Shot Evaluation
I used zero-shot evaluation, meaning that I didn’t include any example questions or answers in the prompts. I’m confident that modern LLMs are advanced enough to recognize the need to generate an answer without seeing examples first. Zero-shot evaluation has two key advantages:
Lower Costs: Without examples in the prompt, the context is shorter, which reduces memory use and speeds up the encoding process.
Better Reproducibility: Zero-shot testing removes the need to select and justify specific examples in the prompt, making the results easier to reproduce and interpret.
I’ve written an article for The Salt that compares the costs of zero-shot and few-shot evaluations in more detail:
How the Scores Are Calculated
The leaderboards show an average score for each model. This score is a simple, unweighted average of all benchmark results.
To calculate the scores, I used the Evaluation Harness. For each model included in this release, I ran:
for m in models:
    m_name = m.split('/')[1]  # model name without the organization prefix
    # Evaluate the model as is (float16)
    !lm_eval --model hf --model_args pretrained={m},dtype=float16 --tasks leaderboard_gpqa,leaderboard_musr,arc_challenge,leaderboard_mmlu_pro,mmlu --device cuda:0 --num_fewshot 0 --batch_size 1 --output_path ./eval/
    # Evaluate the model quantized on the fly with bitsandbytes 4-bit,
    # unless it is already a GPTQ model
    if "GPTQ" not in m and "gptq" not in m:
        !lm_eval --model hf --model_args pretrained={m},load_in_4bit=True --tasks leaderboard_gpqa,leaderboard_musr,arc_challenge,leaderboard_mmlu_pro,mmlu --device cuda:0 --num_fewshot 0 --batch_size 1 --output_path ./eval/eval-bnb4bit/
The condition if "GPTQ" not in m and "gptq" not in m skips the bitsandbytes run for GPTQ models, since they are already quantized.
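Once the harness has finished, the per-task scores can be averaged into the single leaderboard score. The sketch below assumes the results JSON layout of recent lm-evaluation-harness versions (a top-level "results" dictionary keyed by task name); the exact metric keys and the presence of group vs. subtask entries depend on the version, so adjust as needed:

import json
import glob

def average_score(results_file):
    # Unweighted average of the per-task accuracies in one lm_eval results file.
    # Assumes the "results" layout of recent lm-evaluation-harness releases;
    # metric keys ("acc,none", "acc_norm,none") vary by task and version.
    with open(results_file) as f:
        results = json.load(f)["results"]
    scores = []
    for task, metrics in results.items():
        acc = metrics.get("acc_norm,none", metrics.get("acc,none"))
        if acc is not None:
            scores.append(acc)
    # Groups (e.g., leaderboard_gpqa) and their subtasks may both appear; keep
    # only the five top-level tasks if you want to match the leaderboard average.
    return sum(scores) / len(scores)

for f in glob.glob("./eval/**/results*.json", recursive=True):
    print(f, average_score(f))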
Inference Memory Consumption
To benchmark the inference memory consumption of the models, I used Optimum-Benchmark.
Memory consumption during inference depends on several hyperparameters, with the key factors being:
Batch Size
Sequence Length
I chose a sequence length of 1,024 tokens, as it is generally sufficient for most applications. If you plan to use longer sequences, memory usage will increase accordingly.
For batch size, I used a value of 1. Larger batch sizes will further increase memory usage.
The memory values in the tables represent the peak memory allocation during the decode phase.
Here’s the code I used for benchmarking:
from optimum_benchmark import Benchmark, BenchmarkConfig, TorchrunConfig, InferenceConfig, TrainingConfig, PyTorchConfig
from optimum_benchmark.logging_utils import setup_logging
from transformers import set_seed
import gc

set_seed(1234)

def inference_bench(model_id, bs=1, seqlen=1024, quant=False):
    model_name = model_id.split('/')[1]
    launcher_config = TorchrunConfig(nproc_per_node=1)
    # Inference scenario: measure latency and memory for the given input shapes
    input_shapes = {"batch_size": bs, "num_choices": 1, "sequence_length": seqlen}
    scenario_config = InferenceConfig(latency=True, memory=True, input_shapes=input_shapes)
    if quant:
        # Benchmark the model quantized on the fly with bitsandbytes NF4
        name = "benchmark_inference_report_quant"+model_name+"_bs"+str(bs)+"_seqlen"+str(seqlen)+".csv"
        quantization_scheme = 'bnb'
        quantization_config = {
            "bnb_4bit_compute_dtype": "float16",
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
            "llm_int8_enable_fp32_cpu_offload": False,
            "llm_int8_has_fp16_weight": False,
            "llm_int8_threshold": 6.0,
            "load_in_4bit": True,
            "load_in_8bit": False,
        }
        backend_config = PyTorchConfig(model=model_id, quantization_scheme=quantization_scheme, torch_dtype="float16", quantization_config=quantization_config, device="cuda", device_ids="0", no_weights=False)
    else:
        # Benchmark the model as is (float16)
        name = "benchmark/benchmark_inference_report_"+model_name+"_bs"+str(bs)+"_seqlen"+str(seqlen)+".csv"
        backend_config = PyTorchConfig(model=model_id, device="cuda", torch_dtype="float16", device_ids="0", no_weights=False)
    benchmark_config = BenchmarkConfig(
        name="pytorch_"+model_name,
        scenario=scenario_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark_report.log()
    benchmark_config.to_dict()
    benchmark_report.save_csv(name)

for m in models:
    inference_bench(m, 1)  # Inference, not quantized (float16)
    if "GPTQ" not in m and "gptq" not in m:
        inference_bench(m, 1, 1024, True)  # Inference, quantized (bnb 4-bit)
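To compare models side by side, the per-model CSV reports can then be gathered into a single table. Here is a minimal sketch with pandas; the glob patterns follow the file names built in the function above:

import glob
import pandas as pd

# Gather all the reports written by benchmark_report.save_csv() above:
# float16 runs go to benchmark/, bnb 4-bit runs to the current directory.
files = glob.glob("benchmark/benchmark_inference_report_*.csv") + glob.glob("benchmark_inference_report_quant*.csv")
reports = []
for f in files:
    df = pd.read_csv(f)
    df["source_file"] = f  # keep track of which model/configuration the row comes from
    reports.append(df)

all_reports = pd.concat(reports, ignore_index=True)
print(all_reports.head())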
Fine-Tuning Memory Consumption
This is an ongoing task.
For consistency, I’m also measuring it with Optimum-Benchmark, for both LoRA and full fine-tuning.
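Until those measurements are published, here is a rough rule-of-thumb estimator, explicitly not the measured values that will appear in the leaderboards. It only counts weights, gradients, and AdamW optimizer states; activation memory, which grows with sequence length and batch size, is ignored, and the per-parameter byte counts are standard assumptions that vary with the training setup:

# Back-of-envelope only: weights + gradients + AdamW states, ignoring activations.
# Full fine-tuning in 16-bit with AdamW (fp32 moments): ~2 + 2 + 8 bytes per parameter.
# LoRA: frozen 16-bit weights (2 bytes/param) plus the same ~12 bytes/param,
# but only for the small fraction of trainable adapter parameters.
def full_finetuning_gb(n_params_billion):
    return n_params_billion * (2 + 2 + 8)

def lora_gb(n_params_billion, trainable_fraction=0.01):
    return n_params_billion * 2 + n_params_billion * trainable_fraction * (2 + 2 + 8)

print(full_finetuning_gb(8))  # an 8B model: ~96 GB before activations
print(lora_gb(8))             # an 8B model: ~17 GB before activations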