LLM Leaderboards

Note: This page needs to be updated. I'm currently creating my own private benchmarks. Stay tuned!

The goal of these leaderboards is to help you identify the most suitable LLM for your GPU.

To achieve this, the leaderboards are categorized into four groups based on the GPU memory requirements: 8 GB, 12 GB, 16 GB, and 24 GB. Each category reflects the recommended GPU memory needed for optimal performance with the respective models.

I’ve only tested models that can run on consumer-grade hardware, but I might add larger models later.

Note: All tables are interactive, allowing you to rank models based on specific tasks and navigate through multiple pages. Throughput and latency metrics for all the tables were calculated using the same RTX 3090 GPU for consistent comparisons.

How to Use the Leaderboards

There are four leaderboards that group models based on required GPU sizes: 8 GB, 12 GB, 16 GB, and 24 GB. These sizes represent the recommended GPU memory needed to run each model.
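As a rough illustration of how a measured peak memory maps onto these four groups, here is a minimal sketch. The exact assignment rule is not spelled out on this page, so the `headroom_gb` margin and the `smallest_tier` helper are assumptions made for the example.

```python
# Leaderboard tiers named in the text, in GB.
GPU_TIERS_GB = (8, 12, 16, 24)


def smallest_tier(peak_memory_gb, headroom_gb=1.0):
    """Return the smallest tier (in GB) that fits the model, or None.

    headroom_gb is a hypothetical safety margin, not a value from the article.
    """
    needed = peak_memory_gb + headroom_gb
    for tier in GPU_TIERS_GB:
        if needed <= tier:
            return tier
    return None  # too large for any of the four consumer tiers


print(smallest_tier(6.2))   # 8
print(smallest_tier(15.5))  # 24
print(smallest_tier(30.0))  # None
```

A model whose peak memory sits just under a tier boundary falls into the next tier up once the margin is added, which is why 15.5 GB lands in the 24 GB group here.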

How do I decide whether a model can run on a GPU?

This was one of the toughest questions in building the leaderboard. GPU memory usage depends on several factors, including the inference framework and specific settings like the size of the key-value (KV) cache, sequence length, and batch size.

To make things consistent, I use:

  • Inference framework: Hugging Face Transformers

  • Batch size: 1

  • Sequence length: 1,024

Hugging Face Transformers is not particularly well optimized for memory consumption, so I think it’s a good baseline: if a model runs with Transformers at a batch size of 1 and a sequence length of 1,024, it should also run, and probably handle a larger batch size or longer sequences, with a better-optimized framework.
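Before running anything, a back-of-the-envelope estimate of the memory taken by the weights alone gives a first idea of which GPU size to look at. The sketch below is only that: it ignores activations, the KV cache, and framework overhead, and the 7-billion-parameter figure is an arbitrary example rather than a model from the leaderboards.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical 7B-parameter model, for illustration only.
params_7b = 7e9
fp16 = weight_memory_gib(params_7b, 16)  # float16: 2 bytes per parameter
four_bit = weight_memory_gib(params_7b, 4)  # 4-bit: 0.5 byte per parameter

print(f"float16: {fp16:.1f} GiB")   # float16: 13.0 GiB
print(f"4-bit:   {four_bit:.1f} GiB")  # 4-bit:   3.3 GiB
```

Real 4-bit checkpoints use somewhat more than this lower bound, since quantization constants are stored alongside the weights and some layers are typically left unquantized.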

Zero-shot Evaluation of LLMs

To begin, I chose the following public benchmarks (tasks as named in the Evaluation Harness):

  • leaderboard_gpqa (GPQA)

  • leaderboard_musr (MuSR)

  • arc_challenge (ARC_C)

  • leaderboard_mmlu_pro (MMLU-PRO)

  • mmlu (MMLU)

These benchmarks are widely used for evaluating LLMs. ARC Challenge and MMLU were included in the first version of the Hugging Face OpenLLM Leaderboard, while the newer benchmarks (GPQA, MuSR, and MMLU-PRO) were added in the second version.

I believe these benchmarks give a solid overview of an LLM’s performance and are among the most affordable to run. Some benchmarks in the OpenLLM Leaderboard, like IFEval, are more expensive to compute, so I haven’t included them yet due to the costs involved.

All the models are evaluated with the exact same hyperparameters and hardware.

Why I Used Zero-Shot Evaluation

I used zero-shot evaluation, meaning that I didn’t include any example questions or answers in the prompts. I’m confident that modern LLMs are advanced enough to recognize the need to generate an answer without seeing examples first. Zero-shot evaluation has two key advantages:

  1. Lower Costs: Without examples in the prompt, the context is shorter, which reduces memory use and speeds up the encoding process.

  2. Better Reproducibility: Zero-shot testing removes the need to select and justify specific examples in the prompt, making the results easier to reproduce and interpret.
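To make the cost difference concrete, here is a small sketch that builds a zero-shot and a few-shot prompt for the same question and compares their lengths. The prompt template and the example questions are made up for illustration; they are not the templates the Evaluation Harness uses for any of the tasks above.

```python
def build_prompt(question, examples=()):
    """Prepend zero or more (question, answer) examples to the question."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical in-context examples, for illustration only.
examples = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]
question = "Which planet is closest to the Sun?"

zero_shot = build_prompt(question)
few_shot = build_prompt(question, examples)

print(len(zero_shot.split()), "words in the zero-shot prompt")
print(len(few_shot.split()), "words in the few-shot prompt")
```

Every added example lengthens the context that has to be encoded for each benchmark question, and the choice of examples becomes one more setting that has to be reported to reproduce a score.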

I’ve written an article for The Salt that compares the costs of zero-shot and few-shot evaluations in more detail.

How the Scores Are Calculated

The leaderboards show an average score for each model. This score is a simple, unweighted average of all benchmark results.
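In code, the unweighted average amounts to the following sketch. The scores in `example_scores` are made-up numbers used only to show the arithmetic; they are not results from the leaderboards.

```python
def average_score(scores):
    """Simple unweighted mean over all benchmark scores."""
    if not scores:
        raise ValueError("no benchmark scores given")
    return sum(scores.values()) / len(scores)


# Made-up scores, for illustration only.
example_scores = {
    "GPQA": 30.0,
    "MuSR": 40.0,
    "ARC_C": 60.0,
    "MMLU-PRO": 35.0,
    "MMLU": 65.0,
}

print(average_score(example_scores))  # 46.0
```

Because the mean is unweighted, each of the five benchmarks counts equally, regardless of how many questions it contains.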

To calculate the scores, I used the Evaluation Harness. For each model included in this release, I ran:

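# Run in a Jupyter/IPython notebook: "!" runs a shell command and {m} is replaced by the value of m.
# models is the list of Hugging Face model IDs to evaluate.
for m in models:
    # Evaluate the model loaded in float16.
    !lm_eval --model hf --model_args pretrained={m},dtype=float16 --tasks leaderboard_gpqa,leaderboard_musr,arc_challenge,leaderboard_mmlu_pro,mmlu --device cuda:0 --num_fewshot 0 --batch_size 1 --output_path ./eval/
    # Evaluate it again with bitsandbytes 4-bit quantization, unless it is already a GPTQ model.
    if "GPTQ" not in m and "gptq" not in m:
        !lm_eval --model hf --model_args pretrained={m},load_in_4bit=True --tasks leaderboard_gpqa,leaderboard_musr,arc_challenge,leaderboard_mmlu_pro,mmlu --device cuda:0 --num_fewshot 0 --batch_size 1 --output_path ./eval/eval-bnb4bit/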

The condition if "GPTQ" not in m and "gptq" not in m is used to avoid scoring GPTQ models with bitsandbytes quantization, since the models are already quantized.
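Outside a notebook, the same two runs can be launched from a plain Python script. The sketch below only reuses the flags shown above; the helper names (`is_gptq`, `build_lm_eval_command`, `run_all`) and the model ID in the usage line are hypothetical. Note that `is_gptq` lowercases the ID, so it is slightly broader than the original check.

```python
import subprocess

TASKS = "leaderboard_gpqa,leaderboard_musr,arc_challenge,leaderboard_mmlu_pro,mmlu"


def is_gptq(model_id):
    """True if the model ID suggests the weights are already GPTQ-quantized."""
    return "gptq" in model_id.lower()


def build_lm_eval_command(model_id, bnb_4bit=False):
    """Build the lm_eval command line as a list of arguments."""
    if bnb_4bit:
        model_args = f"pretrained={model_id},load_in_4bit=True"
        output_path = "./eval/eval-bnb4bit/"
    else:
        model_args = f"pretrained={model_id},dtype=float16"
        output_path = "./eval/"
    return [
        "lm_eval", "--model", "hf",
        "--model_args", model_args,
        "--tasks", TASKS,
        "--device", "cuda:0",
        "--num_fewshot", "0",
        "--batch_size", "1",
        "--output_path", output_path,
    ]


def run_all(models):
    """Run the float16 evaluation, then the 4-bit one for non-GPTQ models."""
    for model_id in models:
        subprocess.run(build_lm_eval_command(model_id), check=True)
        if not is_gptq(model_id):
            subprocess.run(build_lm_eval_command(model_id, bnb_4bit=True), check=True)


# Hypothetical model ID, shown only to print the resulting command.
print(" ".join(build_lm_eval_command("some-org/some-model", bnb_4bit=True)))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the loop as soon as one evaluation fails.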

Inference Memory Consumption

To benchmark the inference memory consumption of the models, I used Optimum-Benchmark.

Memory consumption during inference depends on several hyperparameters, with the key factors being:

  • Batch Size

  • Sequence Length

I chose a sequence length of 1,024 tokens, as it is generally sufficient for most applications. If you plan to use longer sequences, memory usage will increase accordingly.

For batch size, I used a value of 1. Larger batch sizes will further increase memory usage.
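Much of that growth comes from the KV cache, which scales linearly with both batch size and sequence length. The sketch below uses the standard size formula for a decoder-only transformer; the layer count, number of KV heads, and head dimension are illustrative values, not the configuration of any model in the tables.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Size of the KV cache in bytes (bytes_per_value=2 for float16)."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_value)


# Illustrative architecture, chosen only for the example.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

base = kv_cache_bytes(batch_size=1, seq_len=1024, **shape)
longer = kv_cache_bytes(batch_size=1, seq_len=4096, **shape)
batched = kv_cache_bytes(batch_size=8, seq_len=1024, **shape)

print(base / 1024**2, "MiB")     # 128.0 MiB
print(longer / 1024**2, "MiB")   # 512.0 MiB
print(batched / 1024**2, "MiB")  # 1024.0 MiB
```

With these example values, the cache is small next to the weights at the settings used here, but it becomes a significant share of the budget once sequences or batches grow.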

The memory values in the tables represent the peak memory allocation during the decode phase.

Here’s the code I used for benchmarking:

from optimum_benchmark import Benchmark, BenchmarkConfig, TorchrunConfig, InferenceConfig, PyTorchConfig
from transformers import set_seed

# Fix the random seed so that runs are comparable.
set_seed(1234)


def inference_bench(model_id, bs=1, seqlen=1024, quant=False):
    # Keep only the model name, dropping the organization prefix if there is one.
    model_name = model_id.split('/')[-1]
    # Launch the benchmark with torchrun, using a single process.
    launcher_config = TorchrunConfig(nproc_per_node=1)
    input_shapes = {"batch_size": bs, "num_choices": 1, "sequence_length": seqlen}

    # Measure both latency and memory.
    scenario_config = InferenceConfig(latency=True, memory=True, input_shapes=input_shapes)
    if quant:
        # Quantize on the fly with bitsandbytes: 4-bit NF4 with double quantization.
        name = f"benchmark_inference_report_quant{model_name}_bs{bs}_seqlen{seqlen}.csv"
        quantization_config = {
            "bnb_4bit_compute_dtype": "float16",
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
            "llm_int8_enable_fp32_cpu_offload": False,
            "llm_int8_has_fp16_weight": False,
            "llm_int8_threshold": 6.0,
            "load_in_4bit": True,
            "load_in_8bit": False,
        }
        backend_config = PyTorchConfig(model=model_id, quantization_scheme="bnb", torch_dtype="float16", quantization_config=quantization_config, device="cuda", device_ids="0", no_weights=False)
    else:
        # Load the model as is, in float16.
        name = f"benchmark/benchmark_inference_report_{model_name}_bs{bs}_seqlen{seqlen}.csv"
        backend_config = PyTorchConfig(model=model_id, device="cuda", torch_dtype="float16", device_ids="0", no_weights=False)
    benchmark_config = BenchmarkConfig(
        name="pytorch_"+model_name,
        scenario=scenario_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)

    # Print the report, then save it as a CSV file.
    benchmark_report.log()
    benchmark_report.save_csv(name)

for m in models:
    inference_bench(m,1) #Inference, not quantized (float16)

    if "GPTQ" not in m and "gptq" not in m:
      inference_bench(m,1,1024,True) #Inference, quantized (bnb4bit)