Does Re-ranking Improve the Performance of RAG?

Yes, re-ranking is a crucial step that can significantly improve the performance of a Retrieval-Augmented Generation (RAG) system. While the initial retrieval stage fetches a broad set of documents, the re-ranking stage refines this set, ensuring that only the most relevant and highest-quality information is passed to the LLM.

ColBERT pipeline illustration
Figure 1: End-to-end pipeline of the implemented RAG system.

Let's look at the process in more detail.


1. Classical RAG

In a traditional RAG pipeline, the process is as follows:

ColBERT pipeline illustration
Figure 1: End-to-end pipeline of the implemented RAG system.

Problem:


2. With Re-ranking Models

Re-ranking is a fine-grained retrieval step that analyzes the relationship between a query and documents more deeply. After the initial retrieval, the top document chunks are passed through a re-ranking model, which re-sorts them in descending order based on their importance to the input query.

Types of Re-ranking Models:

  1. Cross-Encoder based models: ColBERT, ColBERTv2
  2. LLM-based re-ranking models: RankGPT
  3. API-based models: CohereAPI

Cross-Encoder based Models: ColBERT

ColBERT uses a basic BERT model to encode tokens into embeddings, which is a good choice because:

ColBERT pipeline illustration
Figure 1: End-to-end pipeline of the implemented RAG system.

The core concept behind ColBERT is late interaction. Unlike other models that create a single embedding for a document, ColBERT creates an embedding for every token in the query and every token in the document using a pre-trained BERT-based encoder.

During the re-ranking process, ColBERT calculates a similarity score for every token in the query against all tokens in the documents. It then takes the maximum similarity score for each query token and sums them to determine the document's final relevance score.

The retrieval process in ColBERT is split into two stages:

Approximate Filtering: For each query embedding, a separate vector similarity search is performed on the FAISS index. The search finds the top-k' most similar document embeddings from the entire collection. The IVFPQ index speeds up this search by searching only a few nearest partitions (p nearest cells). The results of these Nq searches (one for each query embedding) are combined, and the unique document IDs are collected into a set of K candidate documents. These K documents are likely to contain embeddings highly similar to the query embeddings.

Re-ranking and Refinement: Once the smaller set of K candidate documents is identified, a precise re-ranking finds the final top-k results. The full ColBERT scoring process (MaxSim) is applied to these K documents. The relevance score for each candidate is calculated by summing the maximum similarity between each query embedding and all document embeddings. The documents are then sorted by score, and the top-k are returned as the final result.

ColBERT re-ranking result
Figure 2: Re-ranking and scoring in the ColBERT system.

ColBERTv2

ColBERTv2 is an improvement over ColBERT on two key features: the training strategy and compressed indexing. An initial ColBERT model is used to index the training passages. For each training query, this model retrieves the top-k passages, which often include a mix of relevant documents and challenging, but irrelevant, documents (hard negatives) that are semantically similar to the query.

1. Knowledge Distillation
For training, knowledge distillation of a powerful cross-encoder model like ms-macro-MiniLM-L-6-V2 is used as a teacher model. The scores from the cross-encoder are used to train the ColBERT student model. ColBERTv2 uses a KL-Divergence loss to distill the teacher's scores into its architecture, helping the student model mimic the teacher's ranking distribution. To further improve training, the process also uses in-batch negatives, where a cross-entropy loss is applied to the positive score of each query against all passages corresponding to other queries in the same batch.

2. Indexing
MaxSim is applied on a reduced inverted list in a two-stage retrieval process to improve efficiency. This two-step process allows for fast candidate generation using approximate scores, followed by a more accurate re-ranking of the top passages.

Candidate Generation with the Reduced Inverted List
This first stage is for speed and efficiency. The goal is to quickly identify a small set of promising documents without decompressing all the vectors.

Decompression and Full MaxSim Calculation
For each of the ncandidate passages, the system loads and fully decompresses all of its token embeddings by combining the centroid and the residual vector. The true MaxSim operation is then performed. This involves calculating the cosine similarity between each query token vector and all of the uncompressed passage token vectors, and then summing up the maximum similarities.

Alternative text for the image
Figure 1: End-to-end pipeline of the implemented RAG system.

LLM-based Re-ranking Models: RankGPT

RankGPT uses permutation knowledge distillation instead of traditional supervised fine-tuning. The approach has two main components: sliding window data generation and permutation distillation with RankNet loss.

Sliding Window:

Given a query q and a candidate set of passages $\mathcal{P} = \{p_1, p_2, \dots, p_M\}$, the teacher LLM produces a high-quality ranked list:

$$R = \{r_1, r_2, \dots, r_M\}, \quad r_i = \text{rank}(p_i)$$

Since LLMs cannot process all $M$ passages at once, passages are split into overlapping windows of size $w$:

$$\mathcal{W}_k = \{p_k, p_{k+1}, \dots, p_{k+w-1}\}, \quad k=1, \dots, M-w+1$$

The teacher re-ranks passages inside each window. At step $k$, the relative order of the top-$t$ passages is locked in while the window slides backward through the list.

RankGPT data generation pipeline
Figure 1: Data generation using sliding windows.

Permutation Distillation:

The student model $f_\theta(q, p_i) = s_i$ assigns a score $s_i$ to each passage $p_i$. Instead of absolute scores, RankNet uses pairwise ordering:

$$L_{\text{RankNet}} = \sum_{i=1}^{M}\sum_{j=1}^{M} \mathbf{1}_{r_i < r_j} \, \log\!\big(1 + \exp(s_j - s_i)\big)$$

Where:

Intuitively:

Thus, the student (e.g., a BERT cross-encoder) learns to approximate the teacher’s ranking by minimizing pairwise inversions rather than predicting exact permutations.

References

  • Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR.
  • Santhanam, K., et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv:2112.01488.
  • Sun, S., et al. (2023). RankGPT: Instruction Tuning for Generative Ranking. arXiv:2304.09569.
  • Cohere AI. (2023). Cohere Rerank API Documentation. Retrieved from https://docs.cohere.com.