RAG with Qwen3 Embedding and Qwen3 Reranker
How to use embedding and reranker models to efficiently retrieve only the most relevant chunks or documents given a user query
Retrieval-Augmented Generation (RAG) is a powerful paradigm that enhances a large language model (LLM) with a retrieval mechanism, enabling it to access relevant background information, such as documents or passages, before generating a response.
At the core of a RAG pipeline, we usually find two components: the embedding model and the reranker.
The embedding model transforms text into dense numerical vectors (embeddings), placing semantically similar texts close together in vector space. This facilitates efficient retrieval of candidate documents through similarity search.
The reranker model then takes these candidate documents and evaluates the relevance of each query–document pair, reordering them so that the most relevant documents rise to the top.
In other words, high-quality embeddings capture the semantic relationships between pieces of text, while a strong reranker ensures that the retrieved results are contextually the most relevant.
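To make the similarity-search idea concrete, here is a minimal cosine-similarity sketch (the helper function and toy vectors are our own illustration, not part of the Qwen3 tooling):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: close to 1.0
    for semantically similar texts, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real embedding models produce vectors
# with hundreds or thousands of dimensions.
query_vec = np.array([0.9, 0.1, 0.2])
doc_vec = np.array([0.8, 0.2, 0.3])
print(cosine_similarity(query_vec, doc_vec))  # high score -> likely relevant
```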
To support high-performance RAG workflows, the Qwen team has open-sourced both embedding and reranker models based on Qwen3.
In this article, we’ll walk through how to use and combine Qwen3 Embedding and Qwen3 Reranker to retrieve relevant documents and provide your LLM with meaningful context for a given user query. We’ll first take a closer look at how the embedding and reranking models work, individually and in combination, and then see how to use them with sentence-transformers and vLLM through an example.
The notebook below demonstrates a simple retrieval pipeline using the Qwen3 Embedding and Qwen3 Reranker models.
Qwen3 Embedding Model: Specialized Text Embeddings
Qwen3 Embedding is a series of models built on the Qwen3 LLMs and fine-tuned specifically for embedding tasks. They operate as dual encoders, meaning they encode a single text input, such as a document or a query, into an embedding vector in a high-dimensional semantic space.
Each input text is processed with an end-of-sequence token ([EOS]), and the hidden state of the final [EOS] token is taken as the text’s embedding vector. This vector captures the semantic content of the text in a format suitable for similarity search.
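As a minimal sketch of this in practice, assuming the smallest published checkpoint (Qwen/Qwen3-Embedding-0.6B) and the sentence-transformers usage pattern from its model card:

```python
from sentence_transformers import SentenceTransformer

# Load the smallest Qwen3 embedding model (0.6B parameters).
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

documents = [
    "The capital of China is Beijing.",
    "Gravity is the force that attracts objects toward one another.",
]
# Queries use a dedicated prompt; documents are encoded as-is.
query_embeddings = model.encode(["What is the capital of China?"], prompt_name="query")
document_embeddings = model.encode(documents)

# Cosine similarities between the query and each document.
print(model.similarity(query_embeddings, document_embeddings))
```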
Under the hood, Qwen3 Embedding models leverage a multi-stage training pipeline to achieve robust performance. Training begins with a large-scale unsupervised contrastive pre-training stage, where the model learns to bring semantically similar text pairs closer and push unrelated pairs apart using massive amounts of weakly supervised data.
In Qwen3’s case, an innovative approach was used to generate this training data: the Qwen3 LLM itself was employed to synthesize diverse query–document pairs across many domains and languages via a multi-task prompt system, greatly expanding the training corpus beyond what is readily available from public data. This yielded a broad base of weakly labeled data covering everything from web passages to code snippets. Next, a supervised fine-tuning stage on a smaller set of high-quality, human-annotated (or high-fidelity) relevance data further sharpens the model’s ability to produce task-specific embeddings.
Finally, the Qwen team applied a model merging strategy, combining multiple trained checkpoints into a single model to integrate their strengths and improve generalization. This multi-stage process results in embeddings that work well across various downstream scenarios.
Overall, the pipeline looks very similar to the LLM2Vec approach, except that the Qwen models’ architecture is preserved: they remain LLMs, fine-tuned to produce accurate semantic representations through the EOS token.
Qwen3 Reranker Model: Optimizing Relevance Ranking
Qwen3-Reranker focuses on the second, optional, stage of retrieval: evaluating and scoring how well a given document matches a query.
The reranker is a cross-encoder model, fine-tuned to output a relevance score given a pair of texts (typically a user query and a candidate passage, i.e., the same types of input as for the embedding model). Instead of embedding each text independently, a cross-encoder takes the query and document together as input (often concatenated with a special format or instruction) and processes them with full self-attention over the combined text.
This allows the reranker to consider the query–document interaction directly, capturing subtle nuances like whether a document actually answers the query. The Qwen3 Reranker then produces a relevance score (for example, by outputting a probability or a scalar ranking score) indicating how likely that document is to be a good match for the query.
The goal here is to score text pairs to enhance search relevance, refining the ranking of results that the embedding model retrieves.
Like the embedding models, the Qwen3-Rerankers are available in 0.6B, 4B, and 8B parameter variants. They share the Qwen3 base architecture but are fine-tuned for the ranking task.
Training the reranker models was somewhat simpler than the multi-stage embedding training: the Qwen team found that a single-stage supervised fine-tuning on high-quality labeled data was sufficient to achieve excellent results for reranking. In other words, they curated datasets of queries and relevant vs. non-relevant documents (from real-world search logs or annotated QA pairs) and trained the Qwen3 Reranker to predict high scores for relevant pairs and low scores for irrelevant pairs.
This focused training, without a massive unsupervised stage, sped up development and yielded highly effective rerankers. (The Qwen3 model’s strong language understanding likely made large-scale unsupervised pre-training less necessary for the rerank task.)
The reranker models also support instruction prompts similar to the embedding models, meaning you can prefix an instruction to the input pair (e.g., “You are a search engine ranking results for a programming question.”) to nudge the scoring behavior for specialized situations. Additionally, because they inherit the multilingual capabilities of Qwen3, the rerankers can evaluate relevance for non-English or code queries just as well – a critical feature if your RAG system needs to handle global or multi-format content.
The query and document texts (plus an instruction, if used) are fed together into the model in a single sequence. For example, the input might be structured as:
"Instruct: Given the query, determine if the document is relevant.\nQuery: ...\nDocument: ... <|endoftext|>"
The model processes this concatenated sequence and typically outputs a classification label or score (e.g., "relevant" vs "not relevant" or a numeric score).
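As a rough sketch of this scoring scheme with Hugging Face transformers, assuming the Qwen/Qwen3-Reranker-0.6B checkpoint: the score() helper and the simplified single-sequence prompt are our own illustration (the official model card wraps the input in a chat template with a system message), but the yes/no-token scoring reflects how the published rerankers work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-Reranker-0.6B"  # 4B and 8B variants also exist
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# The reranker is trained to answer "yes" or "no"; the relevance score is
# the probability it assigns to "yes" at the final position.
yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")

def score(query: str, document: str, instruction: str) -> float:
    # Simplified single-sequence prompt following the format above; the
    # official usage additionally wraps this in a chat template.
    text = f"Instruct: {instruction}\nQuery: {query}\nDocument: {document}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, [no_id, yes_id]]
    return torch.softmax(logits, dim=-1)[1].item()  # P("yes")
```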
The Qwen3 Reranker is effectively a pointwise reranker that scores each (query, doc) pair independently. This is more computationally expensive per pair than the embedding model’s scoring (especially since the latter can pre-compute document embeddings offline), but it yields a much more precise relevance judgment by examining the full context of query and document together.
Performance-wise, the Qwen3-Rerankers have shown strong gains when used in combination with Qwen3-Embedding retrieval.
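Putting the two stages together, here is a minimal end-to-end sketch (toy corpus, an illustrative top-k of 2, and the score() helper from the reranker sketch above):

```python
from sentence_transformers import SentenceTransformer

# Toy corpus; in practice these would be your chunked documents.
docs = [
    "Use the built-in sorted() function to get a new sorted list in Python.",
    "Beijing is the capital of China.",
    "list.sort() sorts a Python list in place.",
]
query = "How do I sort a list in Python?"

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Stage 1: dense retrieval. Document embeddings can be precomputed offline
# and stored in a vector index; here we compute them on the fly.
doc_emb = embedder.encode(docs)
q_emb = embedder.encode([query], prompt_name="query")
sims = embedder.similarity(q_emb, doc_emb)[0]
candidates = sims.argsort(descending=True)[:2].tolist()

# Stage 2: cross-encoder reranking of the retrieved candidates, using the
# illustrative score() helper defined in the reranker sketch.
instruction = "Given the query, determine if the document is relevant."
reranked = sorted(candidates, key=lambda i: score(query, docs[i], instruction),
                  reverse=True)

# The top reranked documents become the context passed to the LLM.
print([docs[i] for i in reranked])
```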