RAG Reranking To Elevate Retrieval Results
NBD Lite #47: How Reranking Enhances Precision and Relevance
In the previous article, we learned how to evaluate a RAG system with an LLM-as-a-Judge (LLM evaluator).
By the end of that article, we found that many of the retrieved documents were contextually irrelevant to the query we passed in. In other words, many of our questions could not be answered from the top retrieved documents alone.
The key phrase here is "top retrieved documents." Since we pass only the top retrieved documents to the generative model, irrelevant documents at the top will only produce an unrelated answer, even when a relevant document sits somewhere further down the list.
That is the principle behind Reranking: we reorder the retrieved results so that the most relevant documents end up at the top.
In this article, we will take a closer look at what RAG Reranking is and how to build it.
The diagram below provides an overview of what we’ll build today. Don’t forget all the code is stored in the RAG-To-Know repository.
Without further ado, let’s get into it.
Sponsor Section
Data Science Roadmap by aigents
Feeling lost in the data science jungle? 🌴 Don’t worry—we’ve got you covered!
Check out this AI-powered Data Science Roadmap 🗺️—your ultimate step-by-step guide to mastering data science! From stats 📊 to machine learning 🤖, and Python tools 🐍 to Tableau dashboards 📈, it’s all here.
✨ Why it’s awesome:
AI-powered explanations & Q&A 🤓
Free learning resources 🆓
Perfect for beginners & skill-builders 🚀
👉 Start your journey here: Data Science Roadmap
Need extra help? Try the AI-tutor for personalized guidance: AI-tutor
Let’s make data science simple and fun! 🎉
Introduction to RAG Reranking
As mentioned above, RAG comes with challenges, particularly when balancing retrieval and LLM performance.
We already know that the most common way RAG works is by transforming text into vectors and computing semantic similarity between the query and each document.
But this process isn’t perfect. Compressing text into vectors inevitably loses some information, which can cause relevant documents to be ranked lower than they should be.
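To make that concrete, here is a minimal sketch of the bi-encoder retrieval step, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (the actual RAG-To-Know pipeline may use a different embedding model and vector store):

from sentence_transformers import SentenceTransformer, util

# Bi-encoder: the query and every document are embedded independently
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the insurance for car?"
docs = [
    "Car insurance covers damage to your vehicle and liability to others.",
    "Health insurance reimburses medical expenses.",
    "Travel insurance protects you against trip cancellations.",
]

query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)

# Each document was compressed into one vector before the query was seen,
# which is exactly where some information can get lost
similarities = util.cos_sim(query_emb, doc_embs)[0]

# Rank documents by cosine similarity (higher = more semantically similar)
for score, doc in sorted(zip(similarities.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")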
This raises a critical question: What happens when the most helpful information sits below the top few results?
The most straightforward approach might be increasing the number of retrieved documents to improve retrieval recall. After all, recall measures how many relevant documents are retrieved, regardless of their position.
However, LLMs have inherent constraints, such as the context window, which limits the amount of text (input and output) they can process. Even models with large context windows can see their performance degrade when they are overloaded with irrelevant data.
Simply retrieving and feeding many documents into an LLM is not a viable solution.
The solution to the problem above is therefore twofold:
Retrieve as many documents as possible with the retriever.
Minimize the number of documents that we pass to the LLM.
This is where Reranking comes in. By ordering the retrieved documents from most to least relevant, we can select only the most relevant ones to pass on.
How Does Reranking Work?
In general, a reranker uses a model that assigns a numerical relevance score to each retrieved document, and we sort the documents using that score.
The image below shows the basic diagram of how Reranking works.
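In code, that score-and-sort idea is just a small wrapper around whatever scoring model we choose. Here is a minimal, generic sketch; rerank, score_fn, top_n, and keyword_overlap are illustrative names, not part of the RAG-To-Know code:

import re

def rerank(query, documents, score_fn, top_n=5):
    """Score each retrieved document against the query, sort by score,
    and keep only the top_n most relevant ones for the LLM."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Any reranker fits this pattern: an LLM evaluator, a cross-encoder,
# or even a simple heuristic such as keyword overlap
def keyword_overlap(query, doc):
    tokens = lambda text: set(re.findall(r"\w+", text.lower()))
    return len(tokens(query) & tokens(doc))

top_docs = rerank(
    "What is the insurance for car?",
    ["Car insurance covers vehicle damage.", "Health insurance covers medical bills."],
    score_fn=keyword_overlap,
    top_n=1,
)
print(top_docs)  # ['Car insurance covers vehicle damage.']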
The reranker could be anything, but we will focus on two implementations in this tutorial:
LLM Reranker
Cross-Encoder Reranker
For the LLM Reranker, the process is similar to the LLM-as-a-Judge approach we discussed in the previous article. We turn the LLM into an evaluator model with an evaluator prompt, and the model produces a score for each document. The prompt can be designed to evaluate how relevant each retrieved document is to the query.
In contrast, a Cross-Encoder Reranker is a neural architecture designed for tasks that require understanding the relationship between two texts. In RAG systems, retrieval is usually done with Bi-Encoders instead, because they balance accuracy and computational efficiency well.
The overall structural differences can be seen in the image below.
Cross-encoders encode the query and document together, making them highly effective for reranking tasks. The goal of Reranking is to refine the relevance of retrieved documents based on a query. However, cross-encoders tend to be slower compared to other architectures. This is why Retrieval is typically performed using bi-encoders, which are faster but may result in some information loss.
Despite their slower speed, cross-encoders avoid the information loss associated with bi-encoders. This makes cross-encoders the preferred choice for reranking tasks, where accuracy and relevance are critical.
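To illustrate the structural difference, here is a small side-by-side sketch using sentence-transformers; the model names are examples, and any compatible bi-encoder and cross-encoder checkpoints would work:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What is the insurance for car?"
doc = "Car insurance protects you financially if your vehicle is damaged or stolen."

# Bi-encoder: encode query and document separately, then compare the two vectors
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
bi_score = util.cos_sim(bi_encoder.encode(query, convert_to_tensor=True),
                        bi_encoder.encode(doc, convert_to_tensor=True)).item()

# Cross-encoder: encode the (query, document) pair together and output a relevance score
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"Bi-encoder cosine similarity: {bi_score:.3f}")
print(f"Cross-encoder relevance score: {cross_score:.3f}")

In practice, the cheap bi-encoder handles the large corpus, and the slower cross-encoder is applied only to the shortlist the retriever returns.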
That’s the basic theory, so let’s jump into the technical parts.
RAG Improvement with Reranking
We won't build the RAG system from scratch, since we already did that in the previous article. If you haven't read it yet, I suggest you do so.
We will continue from the point where we retrieve the document chunks, which we do with the code below.
query = "What is the insurance for car?"
results = semantic_search(query, top_k = 20)
We will try to rerank the top 20 documents retrieved by our RAG system. First, let's use an LLM to evaluate each retrieved document's relevance to the query.
We will set up the prompt as we did previously in the LLM-as-a-Judge article, evaluate each retrieved document, and rerank the documents.
# Imports used throughout the reranking code
# (semantic_search and GEMINI_API_KEY come from the setup in the previous article)
import time
import json

import pandas as pd
from tqdm import tqdm
from litellm import completion
from sentence_transformers import CrossEncoder

# Define evaluation criteria
chunk_validation_prompt_template = """
{task}
{evaluation_criteria}
Follow these steps to generate your evaluation:
{evaluation_steps}
Please respond using the following JSON schema:
Answer = {json_format}
You MUST provide values for 'Evaluation' and 'Score' in your answer.
{question}
{context}
Answer: """
rating_json_format = """
{
"Evaluation": "your rationale for the rating, as a text",
"Score": "your rating, as a number between 1 and 5"
}
"""
question_template = """Now here is the question (delimited by triple backticks)
Question: ```{question}```
"""
context_template = """Here is the context (delimited by triple quotes).
Context: \"\"\"{context}\"\"\"\n
"""
# Tasks, evaluation criteria, and steps
context_task = """You will be given a context and a question.
Your task is to evaluate the question based on the given context and provide a score between 1 and 5 according to the following criteria:"""
context_eval = """- Score 1: The context does not provide sufficient information to answer the question in any way.
- Score 2 or 3: The context provides some relevant information, but the question remains partially answerable, or is unclear/ambiguous.
- Score 4: The context offers sufficient information to answer the question, but some minor details are missing or unclear.
- Score 5: The context provides all necessary information to answer the question clearly and without ambiguity."""
context_steps = """- Read the context and question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflects your evaluation."""
# Define the request and token limits
REQUEST_LIMIT = 15 # Maximum requests per minute
TOKEN_LIMIT = 1_000_000 # Maximum tokens per minute
# Initialize counters for rate limiting
start_time = time.time()
requests_made = 0
tokens_used = 0
# Define the function to enforce rate limits
def enforce_rate_limit(start_time, requests_made, tokens_used):
    elapsed_time = time.time() - start_time
    if requests_made >= REQUEST_LIMIT or tokens_used >= TOKEN_LIMIT:
        sleep_time = max(0, 60 - elapsed_time)  # Wait until 1 minute has passed
        time.sleep(sleep_time)
        return time.time(), 0, 0  # Reset counters
    return start_time, requests_made, tokens_used
# Function to generate response using LiteLLM with Gemini
def generate_response(query, context):
    """
    Generate a response using the Gemini model via LiteLLM.

    Args:
        query (str): The query string.
        context (str): The context string.

    Returns:
        str: The generated response from the Gemini model.
    """
    # Combine the query and context for the prompt
    prompt = f"Query: {query}\nContext: {context}\nAnswer:"

    # Call the Gemini model via LiteLLM
    response = completion(
        model="gemini/gemini-1.5-flash",  # Use the Gemini model
        messages=[{"content": prompt, "role": "user"}],
        api_key=GEMINI_API_KEY
    )

    # Extract and return the generated text
    return response['choices'][0]['message']['content']
# Define the function to evaluate relevance
def evaluate_context(question, context):
    global start_time, requests_made, tokens_used

    # Enforce rate limits
    start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)

    # Prepare the prompt for relevance evaluation
    prompt = chunk_validation_prompt_template.format(
        task=context_task,
        evaluation_criteria=context_eval,
        evaluation_steps=context_steps,
        json_format=rating_json_format,
        question=question_template.format(question=question),
        context=context_template.format(context=context)
    )

    # Generate the response using Gemini
    response = generate_response(prompt, "")

    # Update the counters
    requests_made += 1
    tokens_used += len(prompt.split())  # Approximate token count

    # Parse the response
    try:
        # Extract JSON part from the response (if any)
        json_start = response.find("{")
        json_end = response.rfind("}") + 1
        json_response = response[json_start:json_end]
        evaluation = json.loads(json_response)
        return evaluation
    except json.JSONDecodeError:
        return {"Evaluation": "Invalid JSON response", "Score": 0}
# Function to rerank documents using LLM
def rerank_with_llm(query, documents):
    scores = []
    evaluations = []

    # Evaluate each document with a progress bar
    for doc in tqdm(documents, desc="Reranking documents with LLM"):
        evaluation = evaluate_context(query, doc)
        # The model may return the score as a string, so coerce it to a number for sorting
        scores.append(float(evaluation["Score"]))
        evaluations.append(evaluation["Evaluation"])

    # Sort documents based on scores (descending)
    reranked_docs = [doc for _, doc in sorted(zip(scores, documents),
                                              key=lambda x: x[0], reverse=True)]
    return reranked_docs, scores, evaluations
# Perform LLM reranking
llm_reranked_docs, llm_scores, llm_evaluations = rerank_with_llm(query, results['documents'][0])

# Display LLM reranked documents
# Note: llm_scores and llm_evaluations are in the original retrieval order,
# so pair them with their documents before sorting for display
print("\nLLM Reranked Documents:")
for i, (score, evaluation, doc) in enumerate(
        sorted(zip(llm_scores, llm_evaluations, results['documents'][0]),
               key=lambda x: x[0], reverse=True)):
    print(f"{i+1}. Score: {score}")
    print(f"Evaluation: {evaluation}")
    print(f"Document: {doc[:100]}...\n")
Then, we will use the Cross-Encoder model to evaluate and rerank our retrieved documents.
# Load the Cross-Encoder model once, so it is not reloaded for every document
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Function to evaluate relevance using Cross-Encoder
def evaluate_context_cross_encoder(query, context):
    score = cross_encoder.predict([(query, context)])[0]
    return score

# Function to rerank documents using Cross-Encoder
def rerank_with_cross_encoder(query, documents):
    scores = []
    for doc in tqdm(documents, desc="Reranking documents with Cross-Encoder"):
        score = evaluate_context_cross_encoder(query, doc)
        scores.append(score)

    # Sort documents based on scores (descending)
    reranked_docs = [doc for _, doc in sorted(zip(scores, documents),
                                              key=lambda x: x[0], reverse=True)]
    return reranked_docs, scores

# Perform Cross-Encoder reranking
cross_encoder_reranked_docs, cross_encoder_scores = rerank_with_cross_encoder(query, results['documents'][0])

# Display Cross-Encoder reranked documents
# Pair the scores (in original retrieval order) with their documents before sorting for display
print("\nCross-Encoder Reranked Documents:")
for i, (score, doc) in enumerate(sorted(zip(cross_encoder_scores, results['documents'][0]),
                                        key=lambda x: x[0], reverse=True)):
    print(f"{i+1}. Score: {score:.2f}, Document: {doc[:100]}...")
Both techniques generate different scores that can be converted into rankings. In this case, we can transform these scores into a ranked list. If there is a tie between scores, the tied items will be assigned the same rank.
def scores_to_rankings(scores):
    """
    Convert scores to rankings (competition rank).

    The highest score gets rank 1, the second highest gets rank 2, etc.
    Ties (same scores) receive the same rank, and the next distinct score
    jumps in rank accordingly (e.g., 1, 2, 2, 4).
    """
    score_index_pairs = [(score, i) for i, score in enumerate(scores)]

    # Sort by descending score
    sorted_pairs = sorted(score_index_pairs, key=lambda x: -x[0])

    rankings = [0] * len(scores)
    prev_score = None

    # 'idx' is the position in the sorted list, used for "competition" style ranking
    for idx, (score, original_idx) in enumerate(sorted_pairs):
        if score != prev_score:
            # rank = index in sorted list + 1
            current_rank = idx + 1
        # Assign to original index position
        rankings[original_idx] = current_rank
        prev_score = score

    return rankings
As an additional example, we will try to combine both rankings into one. We will call it Parallel Ranking as we use the sum of both rankings as a new rank for the reranker.
def parallel_reranking_with_existing_scores(llm_scores, cross_encoder_scores, documents):
    """
    Perform parallel reranking using existing LLM and Cross-Encoder scores.

    We convert each set of scores into rankings, then sum those ranks to form
    a combined rank. Ties in the combined rank receive the same "competition" rank.
    """
    # Convert scores to rankings
    llm_rankings = scores_to_rankings(llm_scores)
    cross_encoder_rankings = scores_to_rankings(cross_encoder_scores)

    # Sum LLM and Cross-Encoder ranks to get combined rank
    combined_rankings = [llm_rank + ce_rank
                         for llm_rank, ce_rank in zip(llm_rankings, cross_encoder_rankings)]

    # Sort documents by ascending combined rank
    sorted_data = sorted(zip(combined_rankings, documents), key=lambda x: x[0])
    reranked_docs = [doc for _, doc in sorted_data]

    # Assign final ranks with tie handling
    final_rankings = []
    prev_combined = None
    for idx, (comb_rank, _) in enumerate(sorted_data):
        if comb_rank != prev_combined:
            current_rank = idx + 1
        final_rankings.append(current_rank)
        prev_combined = comb_rank

    return reranked_docs, final_rankings

# Perform parallel reranking using existing scores
parallel_reranked_docs, combined_rankings = parallel_reranking_with_existing_scores(
    llm_scores, cross_encoder_scores, results['documents'][0]
)

print("\nParallel Reranked Documents:")
for i, (doc, rank) in enumerate(zip(parallel_reranked_docs, combined_rankings)):
    print(f"{i+1}. Combined Rank: {rank}, Document: {doc[:100]}...")
Lastly, we will create a DataFrame object containing all the reranking processes we have done.
def create_rankings_df(
    initial_docs,
    llm_scores,
    llm_reranked_docs,
    llm_rankings,                  # competition-style ranks from scores_to_rankings
    cross_encoder_scores,
    cross_encoder_reranked_docs,
    cross_encoder_rankings,        # competition-style ranks from scores_to_rankings
    parallel_reranked_docs,
    parallel_rankings,             # final parallel competition ranks
):
    """
    Create a DataFrame to compare:
      - Initial documents & their indices
      - LLM scores & competition ranks
      - Cross-Encoder scores & competition ranks
      - Parallel combined rank
    All in the *original document order* (i.e., initial_docs).
    """
    # Create a mapping from document to its parallel rank
    doc_to_parallel_rank = {doc: rank for doc, rank in zip(parallel_reranked_docs, parallel_rankings)}

    # Create the DataFrame in the initial document order
    df = pd.DataFrame({
        "Initial Rank": range(1, len(initial_docs) + 1),
        "Initial Document": initial_docs,
    })

    # LLM columns (score + competition rank)
    df["LLM Score"] = llm_scores
    df["LLM Competition Rank"] = llm_rankings

    # Cross-Encoder columns (score + competition rank)
    df["Cross-Encoder Score"] = cross_encoder_scores
    df["Cross-Encoder Competition Rank"] = cross_encoder_rankings

    # Map parallel ranks back to the initial document order
    df["Parallel Rank"] = df["Initial Document"].map(doc_to_parallel_rank)

    # Optional: Positional ranks (if needed)
    df["LLM Positional Rank"] = df["Initial Document"].apply(
        lambda doc: llm_reranked_docs.index(doc) + 1 if doc in llm_reranked_docs else None
    )
    df["Cross-Encoder Positional Rank"] = df["Initial Document"].apply(
        lambda doc: cross_encoder_reranked_docs.index(doc) + 1 if doc in cross_encoder_reranked_docs else None
    )

    return df

# Convert both sets of scores into competition-style rankings
llm_rankings = scores_to_rankings(llm_scores)
cross_encoder_rankings = scores_to_rankings(cross_encoder_scores)

# Rebuild the reranked document lists from the scores (which are in the original retrieval order)
llm_sorted_data = sorted(zip(llm_scores, results['documents'][0]),
                         key=lambda x: x[0], reverse=True)
llm_reranked_docs = [doc for score, doc in llm_sorted_data]

ce_sorted_data = sorted(zip(cross_encoder_scores, results['documents'][0]),
                        key=lambda x: x[0], reverse=True)
cross_encoder_reranked_docs = [doc for score, doc in ce_sorted_data]

parallel_reranked_docs, combined_rankings = parallel_reranking_with_existing_scores(
    llm_scores, cross_encoder_scores, results['documents'][0]
)

rankings_df = create_rankings_df(
    initial_docs=results['documents'][0],
    llm_scores=llm_scores,
    llm_reranked_docs=llm_reranked_docs,
    llm_rankings=llm_rankings,
    cross_encoder_scores=cross_encoder_scores,
    cross_encoder_reranked_docs=cross_encoder_reranked_docs,
    cross_encoder_rankings=cross_encoder_rankings,
    parallel_reranked_docs=parallel_reranked_docs,
    parallel_rankings=combined_rankings
)
You can see that we end up with several different rankings we can use to improve the generation result. I have put all the ranks side by side so you can compare them.
Notice that the LLM evaluation rank often differs from the Cross-Encoder rank. We can use each of them as a stand-alone reranking signal or combine them into the parallel ranking.
To know which method works best, we can only experiment and evaluate the results. We can use LLM-as-a-Judge again, but human evaluation could also be employed here.
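To close the loop, here is a minimal sketch of the final step: passing only the top reranked documents, rather than all 20 retrieved chunks, to the generator. It reuses the generate_response helper defined above; keeping the top 5 is just an example cutoff:

# Keep only the most relevant documents after reranking
top_docs = llm_reranked_docs[:5]  # or cross_encoder_reranked_docs / parallel_reranked_docs

# Concatenate the selected chunks into a single context string
context = "\n\n".join(top_docs)

# Generate the final answer with a much smaller, more relevant context
answer = generate_response(query, context)
print(answer)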
That’s all for now! Next time, we will discuss further techniques to improve RAG results!
Is there anything else you’d like to discuss? Let’s dive into it together!
👇👇👇