vLLM vs Ollama: How They Differ and When To Use Them
With Examples of Offline and Online Inference
This article takes a close look at two popular open-source tools for LLM inference: vLLM and Ollama. Both are widely used but optimized for very different use cases.
vLLM is built to maximize GPU throughput in server environments, while Ollama focuses on ease-of-use and local model execution, often on CPU. While they might seem like alternatives at first glance, they serve distinct roles in the LLM ecosystem.
We'll explore how vLLM achieves high performance through low-level memory optimizations like PagedAttention, and how it excels in multi-user or long-context scenarios. Then we'll look at Ollama’s lightweight design, its integration with quantized GGUF models, and its focus on simplicity.
The goal of this article is to help you understand the differences between vLLM and Ollama so you can choose the right tool for your use case. You don’t need any prior experience with LLM inference or deployment: the article is meant to be beginner-friendly, with the exception of the short PagedAttention section just ahead.
I prepared a simple notebook containing the main commands to set up and try vLLM and Ollama with Qwen3:
Note: What about SGLang? I’m much less familiar with SGLang, but I think it is as good as vLLM, maybe with fewer features. SGLang can be used for the same use cases as vLLM.
vLLM vs Ollama: One for the GPU, the Other for the CPU?
vLLM: Leveraging the GPU at Almost Full Capacity
vLLM is a high-performance open-source library for LLM inference and serving, originally developed at UC Berkeley.
Originally, its main innovation was PagedAttention, a custom memory-management algorithm for attention that treats the GPU memory like virtual memory pages. Instead of allocating one big contiguous chunk for the attention key-value cache (which can lead to 60–80% memory waste due to fragmentation), PagedAttention breaks the cache into fixed-size blocks or “pages”. These pages can be flexibly assigned and reused, dramatically reducing memory overhead.
Now, PagedAttention is implemented in most inference frameworks. vLLM implements many more optimizations on top of it to be even more efficient.
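To make the paging idea concrete, here is a toy Python sketch (purely illustrative, not vLLM’s actual implementation): each sequence keeps a small block table that maps its logical KV-cache blocks to physical blocks drawn from a shared free pool, so memory is allocated in fixed-size pages only as the sequence grows and is returned to the pool as soon as the sequence finishes.
BLOCK_SIZE = 16  # tokens stored per KV-cache page (illustrative value)
class ToyPagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # shared pool of pages
        self.block_tables = {}  # sequence id -> list of physical block ids
    def append_token(self, seq_id, position):
        # Allocate a new physical page only when the sequence crosses a page boundary.
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())
        # The token's key/value vectors would be written into the page table[-1].
    def release(self, seq_id):
        # A finished sequence returns all of its pages to the pool for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
cache = ToyPagedKVCache(num_physical_blocks=1024)
for pos in range(40):  # a 40-token sequence occupies only 3 pages of 16 tokens
    cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])  # three page ids, not one big contiguous allocation
The real engine does this inside custom attention kernels that read keys and values through the block table, but the bookkeeping above is the core idea.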
By better exploiting GPU memory, vLLM can batch more concurrent requests and generate multiple tokens in parallel without running out of memory. The result is state-of-the-art throughput.
In short, vLLM is designed for speed and efficiency in serving LLMs, especially when handling many requests or long context lengths. It also provides an easy Python API and an OpenAI-compatible server mode, making integration with applications straightforward.
Ollama: Simple Local Inference
Ollama, on the other hand, is an open-source tool that focuses on simplicity and local model management.
GitHub: https://github.com/ollama/ollama
Ollama stands out for its simplicity and user-friendly design. Unlike vLLM, it doesn’t require technical expertise to get started, making it far more accessible to a wider audience. This ease of use has been a major driver of its popularity: on GitHub, Ollama has earned over 100,000 more stars than vLLM.
It allows you to run LLMs on your local machine (Linux, macOS, or Windows) with minimal setup. Ollama packages model weights, tokenizer, and configuration together into a single bundle (defined by a “Modelfile”). It’s based on llama.cpp, which is optimized for CPU inference. So it easily supports all the models that are supported by llama.cpp and that can be converted to GGUF.
Using a Docker analogy, if a model were a container image, Ollama would be your docker pull and docker run for LLMs.
All you need is a one-line command to install Ollama and another to download a model. Under the hood, Ollama handles downloading the model (often in a quantized format optimized for local inference) and spins up a local inference server.
Ollama provides a REST API (including an OpenAI-compatible /v1/chat/completions endpoint) out-of-the-box. This means you can chat with a model interactively in your terminal or programmatically via HTTP requests, all without relying on external cloud APIs.
When to Use vLLM
Choose vLLM if you are deploying LLMs in a production or research setting where throughput and latency are critical. For example, if you want to serve a Qwen3 model to multiple users (or handle many parallel requests), vLLM’s continuous batching and optimized GPU utilization will give you the best performance. For The Kaitchup, when I need results for articles, I use vLLM.
The engine was literally designed to keep a GPU fed with as many tokens as possible at all times, which is ideal for an online service. Moreover, if your use case involves long prompts or outputs (tens of thousands of tokens), vLLM is well-suited.
Another reason to use vLLM is when you need tight integration with a Python ecosystem or custom logic. Because vLLM is a Python library, you can load a model and generate text in a script or notebook, mixing it with other Python code (for example, pre- or post-processing the prompts/responses). This is great for pipeline workflows or research experiments.
If you plan to experiment with multiple models or with very new models not yet packaged in tools like Ollama, vLLM (with Hugging Face Transformers as backend) gives you the freedom to do so by simply pointing to the model path.
When to Use Ollama
Ollama is the go-to choice when ease of use and quick setup are top priorities. If you want to get a model running right now, with minimal fuss, Ollama is a good option.
For example, if you’re a developer or enthusiast who just wants to chat with an LLM locally or integrate it into a personal project, Ollama lets you do that with a few terminal commands (as we’ll see below). It abstracts away all the Python environment setup, GPU device configurations, and model conversion issues.
This makes Ollama friendly to a broader audience, including those who may not be ML experts but want to experiment with an advanced model. Even for experts, sometimes you just need a quick local endpoint for a model without digging into library internals, and Ollama provides that.
Another scenario for Ollama is when you are running on more limited hardware or want to conserve resources. Ollama’s ability to fetch quantized models means you can run larger models than you otherwise could on a given machine. If you don’t have a strong GPU, or any GPU at all, Ollama can still run the model on CPU (with a performance hit, of course). In fact, the developers note that while a GPU is recommended (NVIDIA or AMD for acceleration), many people run models on CPU for smaller workloads. So if you’re, say, on a laptop with 16GB of RAM and no CUDA, you could still load Qwen3-1.7B or 4B in Ollama and get it to work. vLLM in CPU-only mode is possible too, but it’s less optimized for that scenario than the llama.cpp runtime Ollama uses under the hood.
Ollama is also a great fit when you want a persistent local AI service that you can use from various interfaces. Because it runs a background server by default, you can connect GUI front-ends to it or use it in your own tools via its REST API.
Interestingly, if you want to use GGUF models, vLLM is still not able to run them efficiently. I recommend using Ollama or llama.cpp for GGUF.
How to Use vLLM
Now that we’ve covered the concepts, let’s get hands-on. In this section, we’ll walk through setting up vLLM on a Linux system and demonstrate both offline (batched) inference and online serving for Qwen3 models.
Set Up vLLM
1. Install vLLM. vLLM is available as a Python package on PyPI, so you can install it with pip. It’s recommended to do this in a virtual environment (conda or python3 -m venv) to avoid dependency conflicts (vLLM may install a new version of PyTorch).
pip install vllm
This command will fetch the latest stable vLLM (e.g., 0.9.x as of June 2025). Under the hood, vLLM will also bring in PyTorch and other dependencies. On a Linux machine with NVIDIA GPUs, vLLM should detect your CUDA version and install the appropriate build of PyTorch. If you run into issues (for instance, a mismatched CUDA version), you might need to install torch manually first or specify a --torch-backend as noted in the vLLM docs. In most cases, though, pip install vllm “just works”.
2. Verify the installation. After installation, you should be able to run the vllm command-line tool or import the library in Python. For example, try:
vllm --help
If this prints out usage information (options for the vLLM CLI), then you have vLLM installed successfully. You can also check the version:
python -c "import vllm; print(vllm.__version__)"
Running Online and Offline Inference with vLLM
vLLM can be used in two primary modes:
Offline inference (library mode): You load a model and feed it prompts directly in code, getting back the completions.
Online serving (server mode): You launch a service that listens for requests (OpenAI-style), useful for interactive or multi-client scenarios.
We’ll demonstrate both using a Qwen3 model.
Offline Batched Inference (Python usage)
The offline mode is great for scripting and testing. You can generate text from one or many prompts in a batch with high efficiency. Here’s a minimal example of using vLLM’s Python API to generate text from Qwen3:
from vllm import LLM, SamplingParams
# Load the Qwen3 model (8B variant in this example).
# vLLM will automatically download weights from Hugging Face on first run.
llm = LLM(model="Qwen/Qwen3-8B", tokenizer="Qwen/Qwen3-8B")
# Prepare a prompt. You can also prepare a list of prompts for batch inference.
prompt = [{"role":"user", "content":"Qwen3 is a large language model. Give a one-line description of Qwen3."}]
# You can optionally set sampling parameters (temperature, top_p, etc.).
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
# Generate text for the prompt
outputs = llm.chat([prompt], sampling_params=params)
# The result is a list of RequestOutput objects (one per input prompt).
output_text = outputs[0].outputs[0].text  # first prompt, first output variation
print(output_text)
We initialize LLM with the model name. Here, we used the Hugging Face repo name Qwen/Qwen3-8B. vLLM will load the model weights (and tokenizer) the first time. This may take a while for an 8B model, as it downloads ~16GB of data (in FP16).
Once loaded, the llm.chat method is given a list of prompts (we passed a single prompt in a list) and returns the outputs. We specified some sampling parameters: a moderate temperature and top-p for nucleus sampling.
If you don’t specify SamplingParams, vLLM will try to use the model’s recommended generation config if one is available in its Hugging Face repo (which often yields good results). The output we get is the generated completion text. For a simple descriptive prompt as above, Qwen3 might output something like: “Qwen3 is an advanced open-source multilingual LLM with a dual thinking mode for reasoning.” (The actual text will vary since generation is random unless you set temperature=0.)
Because vLLM batches internally, you can pass multiple conversations to llm.chat (or raw prompts to llm.generate) and it will handle them together for efficiency. This offline mode does not start a persistent server; it runs in your Python process and returns the results when done, which is perfect for one-off jobs or batch processing of a dataset.
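For example, here is a minimal sketch, reusing the llm and params objects created above, that batches three conversations in a single llm.chat call:
conversations = [
    [{"role": "user", "content": "Give a one-line description of Qwen3."}],
    [{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    [{"role": "user", "content": "List three use cases for a local LLM."}],
]
# vLLM schedules all three conversations in the same batch on the GPU.
outputs = llm.chat(conversations, sampling_params=params)
for out in outputs:
    print(out.outputs[0].text)  # one completion per conversation, in input order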
Online Serving (OpenAI-compatible server)
vLLM makes it easy to launch a local server that implements the OpenAI API (both completion and chat completion endpoints). This is extremely useful if you want to use Qwen3 through existing tools that expect an OpenAI API, or if you want to serve it to a front-end interface.
To start a vLLM server with a Qwen3 model, simply run in your terminal:
vllm serve Qwen/Qwen3-8B --host 0.0.0.0 --port 8000
vllm serve Qwen/Qwen3-8B will load the Qwen3-8B model and start serving requests on the default address (localhost:8000). We set --host 0.0.0.0 in this example to make the server accessible from other machines on the network; you can omit it if you only need local access.
Once this command is running, you have an API endpoint listening. It implements routes like /v1/chat/completions and /v1/completions (and even /v1/models to list the loaded model) in the same format as OpenAI’s official API. You could test it with a quick curl command from another terminal:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen/Qwen3-8B", "messages": [ {"role": "user", "content": "Hello, how would you describe yourself?"} ] }'
This should return a JSON response with Qwen3’s answer. You can also use the OpenAI Python SDK by pointing it to this URL (set openai.api_base = "http://localhost:8000/v1" and use a dummy api_key). In server mode, vLLM will continuously batch and optimize incoming requests. So if you have two or three users hitting it at once, it can group their generation steps to reduce total compute. This is where vLLM really outperforms a naive approach.
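Note that openai.api_base belongs to the older 0.x versions of the openai package; with the current 1.x client, you pass a base_url when creating the client instead. Here is a minimal sketch (the api_key just needs to be a non-empty string, since we did not configure an API key on the server):
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello, how would you describe yourself?"}],
)
print(response.choices[0].message.content)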
To stop the server, just Ctrl+C the process in the terminal. Note that each time you start it up, it will reload the model weights from disk, which can take some time (you’ll see it allocating memory and so on). In scenarios where you need the server always on, you might run it inside a tmux session or as a systemd service.
How to Use Ollama
One of the beautiful things about Ollama is that it encapsulates a lot of complexity into simple commands. We’ll cover both interactive offline use (the command-line chat) and how to leverage its online API features.
Set Up Ollama
1. Install Ollama. The Ollama team provides an easy installation script that works on Linux. Simply open a terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
This shell script will detect your platform and install the ollama CLI. Within half a minute or so, you should have Ollama installed (it might prompt for your password if it needs to place the binary in a system directory). After installation, the Ollama service usually starts automatically in the background. You can verify it’s running by visiting http://localhost:11434 in a browser; it should show a message like “Ollama is running”. This means the Ollama daemon is active on port 11434.
Note: Ollama works on CPU-only systems, but you will get much better performance with a proper GPU. If you only have an integrated GPU (or none), Ollama will default to CPU, which can be very slow for large models.
2. Download your model. Once Ollama is installed and running, you need to pull the model weights. Ollama has a registry of supported model names. For instance, for the Qwen3 series, the base name is "qwen3" with different sizes as tags. If you run ollama pull qwen3, by default it might pull a certain variant (often the 7B/8B model). To be specific, we can include the parameter count. Let’s say we want Qwen3-8B:
ollama pull qwen3:8b
This command will connect to Ollama’s model hub and download the Qwen3 8B model package. You’ll see a progress bar as it downloads. Once complete, the model is “installed” in Ollama’s library (usually under /usr/share/ollama/.ollama/models on Linux). You can list all the models you have by running ollama list. For example, after pulling, you should see qwen3:8b in the list (along with any others you pulled).
Running Online and Offline Inference with Ollama
Ollama’s usage can be split into the interactive CLI and the programmatic API, but these are just two ways to access the same local service. The “offline” inference in Ollama’s context typically means you are running it locally without requiring Internet (after the model is downloaded) and possibly using it in a one-off manner (like a terminal query). The “online” inference would refer to keeping the service running and hitting it via API from other programs or devices. Let’s explore both.
Interactive (offline) usage/terminal chat
The simplest way to use a model with Ollama is via the ollama run command. This starts an interactive session where you type prompts and the model responds. For example:
ollama run qwen3:8b
After running this, you’ll see no explicit confirmation message (the prompt may just blink). You can now type a question and hit Enter, and Qwen3 will generate an answer, streaming it word by word to your terminal. You can have a multi-turn conversation this way. To exit, press Ctrl+D or type /bye and hit Enter.
If you just want a single-shot response without entering an interactive session, you can pipe a prompt into ollama run. For example:
echo "What is the capital of France?" | ollama run qwen3:8b
This will output something like “Paris.” and then the command will finish. This is useful if you want to integrate Ollama calls into shell scripts or quickly test prompts without entering a REPL.
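If you prefer Python to shell scripting for this kind of one-shot call, a minimal subprocess sketch does the same thing (assuming ollama is on your PATH and qwen3:8b has already been pulled):
import subprocess

# Pipe a single prompt into `ollama run` and capture the model's reply.
result = subprocess.run(
    ["ollama", "run", "qwen3:8b"],
    input="What is the capital of France?",
    capture_output=True,
    text=True,
)
print(result.stdout.strip())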
Programmatic (online) usage/APIs
Because Ollama runs a persistent server, you can also query it through HTTP calls. There are a couple of ways to do this:
Ollama’s native REST API: For lower-level access, Ollama provides endpoints like http://localhost:11434/api/generate. You can POST a JSON payload specifying the model and prompt, and it will return a JSON with the generated text (streaming or all-at-once). For instance, using curl you might do:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{ "model": "qwen3:8b", "prompt": "What is water made of?" }'
This will return a JSON response with the completion (by default it streams chunks; you can add "stream": false in the JSON to get one response object). This API is specific to Ollama, but it’s quite straightforward. You could wrap it in any HTTP client or use it from languages like Python (using requests) or JavaScript (fetch).
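For instance, here is a minimal requests sketch that sets "stream": false so a single JSON object comes back, with the generated text in its response field:
import requests

# Non-streaming call to Ollama's native generate endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3:8b", "prompt": "What is water made of?", "stream": False},
)
print(resp.json()["response"])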
OpenAI-compatible API: In February 2024, Ollama introduced an OpenAI Chat API compatibility layer. This means you can take any code that uses openai.ChatCompletion.create and point it to your Ollama server. For example, with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "qwen3:8b", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello, Qwen!"} ] }'
This mirrors the OpenAI API format (with roles and messages). Ollama will process it and return a completion in the same JSON structure as OpenAI (with choices, etc.).
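The openai Python client shown earlier for vLLM works here as well; the only differences are the base_url (Ollama’s port 11434) and the model name. A minimal sketch:
from openai import OpenAI

# Ollama ignores the API key, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Qwen!"},
    ],
)
print(response.choices[0].message.content)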
Regardless of which API method you choose, the result is that Ollama lets you treat the local model almost as if it were an OpenAI model. You can build applications on top of it, have multiple clients send requests, or connect a UI.
Here is a quick example using curl with the OpenAI format, to illustrate how you might get a reasoning vs. non-reasoning answer from Qwen3 via the API:
# Ask Qwen3 without reasoning
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen3:8b", "messages": [{"role": "user", "content": "What is 8^999? /no_think"}] }'
# Ask Qwen3 with reasoning
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen3:8b", "messages": [{"role": "user", "content": "What is 8^999? /think"}] }'
The first query should return a brief answer (and likely an unhelpful one: because 8^999 is astronomically large, the model might just say it’s a huge number). The second query, with /think, will trigger Qwen3 to output its thought process. You would see some reasoning in the streamed content (it might break down the problem or mention modular arithmetic) before arriving at the final answer. The reasoning text is included in choices[0].message.content in the JSON. Ollama doesn’t strip it out; it leaves it to you or the UI to decide what to do with it (unlike vLLM, which can strip it when --enable-reasoning is used). So, when integrating into an app, if you use thinking mode, you might want to parse the <think> tags out for a cleaner user display.
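For instance, here is a minimal sketch of stripping the thinking block before showing the answer to a user, assuming the reasoning is wrapped in <think>...</think> tags as in Qwen3’s chat template:
import re

def strip_think(text):
    # Remove any <think>...</think> block so only the final answer remains.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>8^999 is far too large to compute digit by digit...</think>It is an astronomically large number."
print(strip_think(raw))  # -> "It is an astronomically large number."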
Conclusion
vLLM and Ollama represent two very different approaches to the same problem: serving a model and getting useful output. One emphasizes performance and control in production-grade systems, and the other prioritizes accessibility and fast local setup. Neither is strictly better; each reflects trade-offs based on infrastructure, goals, and the kinds of tasks you're running.
Choosing between them is less about comparing speeds or feature sets and more about understanding where you're operating. If you're building an application that needs to scale with users or requests, vLLM makes sense. If you're prototyping or just want a model running with zero hassle, Ollama will get you there faster.