RAG Reranking To Elevate Retrieval Results
NBD Lite #47: How Reranking Enhances Precision and Relevance
In the previous article, we learned how to evaluate a RAG system with an LLM-as-a-Judge (LLM evaluator).
By the end of that article, we found that many of the retrieved documents were contextually irrelevant to the query we passed in. In other words, many of our questions could not be answered from the top retrieved documents alone.
The key phrase here is "top retrieved documents." Since we pass only the top retrieved documents to the generative model, irrelevant documents at the top will only produce an unrelated answer, even when a relevant document sits somewhere further down the list.
That is the principle behind Reranking: we reorder the retrieved results so that the most relevant documents end up at the top.
In this article, we will take a closer look at what RAG Reranking is and how to build it.
The diagram below provides an overview of what we’ll build today. Don’t forget all the code is stored in the RAG-To-Know repository.
Without further ado, let’s get into it.
Sponsor Section
Data Science Roadmap by aigents
Feeling lost in the data science jungle? 🌴 Don’t worry—we’ve got you covered!
Check out this AI-powered Data Science Roadmap 🗺️—your ultimate step-by-step guide to mastering data science! From stats 📊 to machine learning 🤖, and Python tools 🐍 to Tableau dashboards 📈, it’s all here.
✨ Why it’s awesome:
AI-powered explanations & Q&A 🤓
Free learning resources 🆓
Perfect for beginners & skill-builders 🚀
👉 Start your journey here: Data Science Roadmap
Need extra help? Try the AI-tutor for personalized guidance: AI-tutor
Let’s make data science simple and fun! 🎉
Introduction to RAG Reranking
As mentioned above, RAG comes with challenges, particularly when balancing retrieval and LLM performance.
We already know that the most common way RAG works is by transforming text into vectors and computing semantic similarity between the query and each document.
But this process isn’t perfect. Compressing text into vectors inevitably loses some information, which can cause relevant documents to be ranked lower than they should be.
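To make that concrete, here is a minimal sketch of the bi-encoder retrieval step, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (the actual RAG-To-Know pipeline may use a different embedding model and vector store):

from sentence_transformers import SentenceTransformer, util

# Bi-encoder: the query and every document are embedded independently
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the insurance for car?"
docs = [
    "Car insurance covers damage to your vehicle and liability to others.",
    "Health insurance reimburses medical expenses.",
    "Travel insurance protects you against trip cancellations.",
]

query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)

# Each document was compressed into one vector before the query was seen,
# which is exactly where some information can get lost
similarities = util.cos_sim(query_emb, doc_embs)[0]

# Rank documents by cosine similarity (higher = more semantically similar)
for score, doc in sorted(zip(similarities.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")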
This raises a critical question: What happens when the most helpful information sits below the top few results?
The most straightforward approach might be increasing the number of retrieved documents to improve retrieval recall. After all, recall measures how many relevant documents are retrieved, regardless of their position.
However, LLMs have inherent constraints, such as the context window, which limits the amount of text (input and output) they can process. Even models with large context windows can see their performance degrade when they are overloaded with irrelevant data.
Simply retrieving and feeding many documents into an LLM is not a viable solution.
The solution to the problem above is therefore twofold:
Retrieve as many documents as possible with the retriever.
Minimize the number of documents that we pass to the LLM.
This is where Reranking comes in. By ordering the retrieved documents from most to least relevant, we can select only the most relevant ones to pass on.
How Does Reranking Work?
In general, a reranker uses a model that assigns a numerical relevance score to each retrieved document, and we sort the documents using that score.
The image below shows the basic diagram of how Reranking works.
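In code, that score-and-sort idea is just a small wrapper around whatever scoring model we choose. Here is a minimal, generic sketch; rerank, score_fn, top_n, and keyword_overlap are illustrative names, not part of the RAG-To-Know code:

import re

def rerank(query, documents, score_fn, top_n=5):
    """Score each retrieved document against the query, sort by score,
    and keep only the top_n most relevant ones for the LLM."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Any reranker fits this pattern: an LLM evaluator, a cross-encoder,
# or even a simple heuristic such as keyword overlap
def keyword_overlap(query, doc):
    tokens = lambda text: set(re.findall(r"\w+", text.lower()))
    return len(tokens(query) & tokens(doc))

top_docs = rerank(
    "What is the insurance for car?",
    ["Car insurance covers vehicle damage.", "Health insurance covers medical bills."],
    score_fn=keyword_overlap,
    top_n=1,
)
print(top_docs)  # ['Car insurance covers vehicle damage.']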
The reranker could be anything, but we will focus on two implementations in this tutorial:
LLM Reranker
Cross-Encoder Reranker
For the LLM Reranker, the process is similar to the LLM-as-a-Judge approach we discussed in the previous article. We turn the LLM into an evaluator model with an evaluator prompt, and the model produces a score for each document. The prompt can be designed to evaluate how relevant each retrieved document is to the query.
In contrast, a Cross-Encoder Reranker is a neural architecture designed for tasks that require understanding the relationship between two texts. In RAG systems, retrieval is usually done with Bi-Encoders instead, because they balance accuracy and computational efficiency well.
The overall structural differences can be seen in the image below.
Cross-encoders encode the query and document together, making them highly effective for reranking tasks. The goal of Reranking is to refine the relevance of retrieved documents based on a query. However, cross-encoders tend to be slower compared to other architectures. This is why Retrieval is typically performed using bi-encoders, which are faster but may result in some information loss.
Despite their slower speed, cross-encoders avoid the information loss associated with bi-encoders. This makes cross-encoders the preferred choice for reranking tasks, where accuracy and relevance are critical.
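To illustrate the structural difference, here is a small side-by-side sketch using sentence-transformers; the model names are examples, and any compatible bi-encoder and cross-encoder checkpoints would work:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What is the insurance for car?"
doc = "Car insurance protects you financially if your vehicle is damaged or stolen."

# Bi-encoder: encode query and document separately, then compare the two vectors
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
bi_score = util.cos_sim(bi_encoder.encode(query, convert_to_tensor=True),
                        bi_encoder.encode(doc, convert_to_tensor=True)).item()

# Cross-encoder: encode the (query, document) pair together and output a relevance score
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"Bi-encoder cosine similarity: {bi_score:.3f}")
print(f"Cross-encoder relevance score: {cross_score:.3f}")

In practice, the cheap bi-encoder handles the large corpus, and the slower cross-encoder is applied only to the shortlist the retriever returns.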
That’s the basic theory, so let’s jump into the technical parts.
RAG Improvement with Reranking
We won't build the RAG system from scratch, since we already did that in the previous article. If you haven't read it yet, I suggest you do so.
We will continue from the point where we retrieve the document chunks, which we do with the code below.
query = "What is the insurance for car?"
results = semantic_search(query, top_k = 20)
We will try to rerank the top 20 documents retrieved by our RAG system. First, let's use an LLM to evaluate each retrieved document's relevance to the query.
We will set up the prompt as we did previously in the LLM-as-a-Judge article, evaluate each retrieved document, and rerank the documents.
# Imports used throughout the reranking code
# (semantic_search and GEMINI_API_KEY come from the setup in the previous article)
import time
import json

import pandas as pd
from tqdm import tqdm
from litellm import completion
from sentence_transformers import CrossEncoder

# Define evaluation criteria
chunk_validation_prompt_template = """
{task}
{evaluation_criteria}
Follow these steps to generate your evaluation:
{evaluation_steps}
Please respond using the following JSON schema:
Answer = {json_format}
You MUST provide values for 'Evaluation' and 'Score' in your answer.
{question}
{context}
Answer: """
rating_json_format = """
{
"Evaluation": "your rationale for the rating, as a text",
"Score": "your rating, as a number between 1 and 5"
}
"""
question_template = """Now here is the question (delimited by triple backticks)
Question: ```{question}```
"""
context_template = """Here is the context (delimited by triple quotes).
Context: \"\"\"{context}\"\"\"\n
"""
# Tasks, evaluation criteria, and steps
context_task = """You will be given a context and a question.
Your task is to evaluate the question based on the given context and provide a score between 1 and 5 according to the following criteria:"""
context_eval = """- Score 1: The context does not provide sufficient information to answer the question in any way.
- Score 2 or 3: The context provides some relevant information, but the question remains partially answerable, or is unclear/ambiguous.
- Score 4: The context offers sufficient information to answer the question, but some minor details are missing or unclear.
- Score 5: The context provides all necessary information to answer the question clearly and without ambiguity."""
context_steps = """- Read the context and question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflects your evaluation."""
# Define the request and token limits
REQUEST_LIMIT = 15 # Maximum requests per minute
TOKEN_LIMIT = 1_000_000 # Maximum tokens per minute
# Initialize counters for rate limiting
start_time = time.time()
requests_made = 0
tokens_used = 0
# Define the function to enforce rate limits
def enforce_rate_limit(start_time, requests_made, tokens_used):
    elapsed_time = time.time() - start_time
    if requests_made >= REQUEST_LIMIT or tokens_used >= TOKEN_LIMIT:
        sleep_time = max(0, 60 - elapsed_time)  # Wait until 1 minute has passed
        time.sleep(sleep_time)
        return time.time(), 0, 0  # Reset counters
    return start_time, requests_made, tokens_used
# Function to generate response using LiteLLM with Gemini
def generate_response(query, context):
    """
    Generate a response using the Gemini model via LiteLLM.

    Args:
        query (str): The query string.
        context (str): The context string.

    Returns:
        str: The generated response from the Gemini model.
    """
    # Combine the query and context for the prompt
    prompt = f"Query: {query}\nContext: {context}\nAnswer:"

    # Call the Gemini model via LiteLLM
    response = completion(
        model="gemini/gemini-1.5-flash",  # Use the Gemini model
        messages=[{"content": prompt, "role": "user"}],
        api_key=GEMINI_API_KEY
    )

    # Extract and return the generated text
    return response['choices'][0]['message']['content']
# Define the function to evaluate relevance
def evaluate_context(question, context):
    global start_time, requests_made, tokens_used

    # Enforce rate limits
    start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)

    # Prepare the prompt for relevance evaluation
    prompt = chunk_validation_prompt_template.format(
        task=context_task,
        evaluation_criteria=context_eval,
        evaluation_steps=context_steps,
        json_format=rating_json_format,
        question=question_template.format(question=question),
        context=context_template.format(context=context)
    )

    # Generate the response using Gemini
    response = generate_response(prompt, "")

    # Update the counters
    requests_made += 1
    tokens_used += len(prompt.split())  # Approximate token count

    # Parse the response
    try:
        # Extract JSON part from the response (if any)
        json_start = response.find("{")
        json_end = response.rfind("}") + 1
        json_response = response[json_start:json_end]
        evaluation = json.loads(json_response)
        return evaluation
    except json.JSONDecodeError:
        return {"Evaluation": "Invalid JSON response", "Score": 0}
# Function to rerank documents using LLM
def rerank_with_llm(query, documents):
    scores = []
    evaluations = []

    # Evaluate each document with a progress bar
    for doc in tqdm(documents, desc="Reranking documents with LLM"):
        evaluation = evaluate_context(query, doc)
        # The model may return the score as a string, so coerce it to a number for sorting
        scores.append(float(evaluation["Score"]))
        evaluations.append(evaluation["Evaluation"])

    # Sort documents based on scores (descending)
    reranked_docs = [doc for _, doc in sorted(zip(scores, documents),
                                              key=lambda x: x[0], reverse=True)]
    return reranked_docs, scores, evaluations
# Perform LLM reranking
llm_reranked_docs, llm_scores, llm_evaluations = rerank_with_llm(query, results['documents'][0])

# Display LLM reranked documents
# Note: llm_scores and llm_evaluations are in the original retrieval order,
# so pair them with their documents before sorting for display
print("\nLLM Reranked Documents:")
for i, (score, evaluation, doc) in enumerate(
        sorted(zip(llm_scores, llm_evaluations, results['documents'][0]),
               key=lambda x: x[0], reverse=True)):
    print(f"{i+1}. Score: {score}")
    print(f"Evaluation: {evaluation}")
    print(f"Document: {doc[:100]}...\n")
Then, we will use the Cross-Encoder model to evaluate and rerank our retrieved documents.
# Load the Cross-Encoder model once, so it is not reloaded for every document
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Function to evaluate relevance using Cross-Encoder
def evaluate_context_cross_encoder(query, context):
    score = cross_encoder.predict([(query, context)])[0]
    return score

# Function to rerank documents using Cross-Encoder
def rerank_with_cross_encoder(query, documents):
    scores = []
    for doc in tqdm(documents, desc="Reranking documents with Cross-Encoder"):
        score = evaluate_context_cross_encoder(query, doc)
        scores.append(score)

    # Sort documents based on scores (descending)
    reranked_docs = [doc for _, doc in sorted(zip(scores, documents),
                                              key=lambda x: x[0], reverse=True)]
    return reranked_docs, scores

# Perform Cross-Encoder reranking
cross_encoder_reranked_docs, cross_encoder_scores = rerank_with_cross_encoder(query, results['documents'][0])

# Display Cross-Encoder reranked documents
# Pair the scores (in original retrieval order) with their documents before sorting for display
print("\nCross-Encoder Reranked Documents:")
for i, (score, doc) in enumerate(sorted(zip(cross_encoder_scores, results['documents'][0]),
                                        key=lambda x: x[0], reverse=True)):
    print(f"{i+1}. Score: {score:.2f}, Document: {doc[:100]}...")
Both techniques generate different scores that can be converted into rankings. In this case, we can transform these scores into a ranked list. If there is a tie between scores, the tied items will be assigned the same rank.
def scores_to_rankings(scores):
    """
    Convert scores to rankings (competition rank).

    The highest score gets rank 1, the second highest gets rank 2, etc.
    Ties (same scores) receive the same rank, and the next distinct score
    jumps in rank accordingly (e.g., 1, 2, 2, 4).
    """
    score_index_pairs = [(score, i) for i, score in enumerate(scores)]

    # Sort by descending score
    sorted_pairs = sorted(score_index_pairs, key=lambda x: -x[0])

    rankings = [0] * len(scores)
    prev_score = None

    # 'idx' is the position in the sorted list, used for "competition" style ranking
    for idx, (score, original_idx) in enumerate(sorted_pairs):
        if score != prev_score:
            # rank = index in sorted list + 1
            current_rank = idx + 1
        # Assign to original index position
        rankings[original_idx] = current_rank
        prev_score = score

    return rankings
As an additional example, we will try to combine both rankings into one. We will call it Parallel Ranking as we use the sum of both rankings as a new rank for the reranker.
def parallel_reranking_with_existing_scores(llm_scores, cross_encoder_scores, documents):
    """
    Perform parallel reranking using existing LLM and Cross-Encoder scores.

    We convert each set of scores into rankings, then sum those ranks to form
    a combined rank. Ties in the combined rank receive the same "competition" rank.
    """
    # Convert scores to rankings
    llm_rankings = scores_to_rankings(llm_scores)
    cross_encoder_rankings = scores_to_rankings(cross_encoder_scores)

    # Sum LLM and Cross-Encoder ranks to get combined rank
    combined_rankings = [llm_rank + ce_rank
                         for llm_rank, ce_rank in zip(llm_rankings, cross_encoder_rankings)]

    # Sort documents by ascending combined rank
    sorted_data = sorted(zip(combined_rankings, documents), key=lambda x: x[0])
    reranked_docs = [doc for _, doc in sorted_data]

    # Assign final ranks with tie handling
    final_rankings = []
    prev_combined = None
    for idx, (comb_rank, _) in enumerate(sorted_data):
        if comb_rank != prev_combined:
            current_rank = idx + 1
        final_rankings.append(current_rank)
        prev_combined = comb_rank

    return reranked_docs, final_rankings

# Perform parallel reranking using existing scores
parallel_reranked_docs, combined_rankings = parallel_reranking_with_existing_scores(
    llm_scores, cross_encoder_scores, results['documents'][0]
)

print("\nParallel Reranked Documents:")
for i, (doc, rank) in enumerate(zip(parallel_reranked_docs, combined_rankings)):
    print(f"{i+1}. Combined Rank: {rank}, Document: {doc[:100]}...")
Lastly, we will create a DataFrame object containing all the reranking processes we have done.
def create_rankings_df(
    initial_docs,
    llm_scores,
    llm_reranked_docs,
    llm_rankings,                  # competition-style ranks from scores_to_rankings
    cross_encoder_scores,
    cross_encoder_reranked_docs,
    cross_encoder_rankings,        # competition-style ranks from scores_to_rankings
    parallel_reranked_docs,
    parallel_rankings,             # final parallel competition ranks
):
    """
    Create a DataFrame to compare:
      - Initial documents & their indices
      - LLM scores & competition ranks
      - Cross-Encoder scores & competition ranks
      - Parallel combined rank
    All in the *original document order* (i.e., initial_docs).
    """
    # Create a mapping from document to its parallel rank
    doc_to_parallel_rank = {doc: rank for doc, rank in zip(parallel_reranked_docs, parallel_rankings)}

    # Create the DataFrame in the initial document order
    df = pd.DataFrame({
        "Initial Rank": range(1, len(initial_docs) + 1),
        "Initial Document": initial_docs,
    })

    # LLM columns (score + competition rank)
    df["LLM Score"] = llm_scores
    df["LLM Competition Rank"] = llm_rankings

    # Cross-Encoder columns (score + competition rank)
    df["Cross-Encoder Score"] = cross_encoder_scores
    df["Cross-Encoder Competition Rank"] = cross_encoder_rankings

    # Map parallel ranks back to the initial document order
    df["Parallel Rank"] = df["Initial Document"].map(doc_to_parallel_rank)

    # Optional: Positional ranks (if needed)
    df["LLM Positional Rank"] = df["Initial Document"].apply(
        lambda doc: llm_reranked_docs.index(doc) + 1 if doc in llm_reranked_docs else None
    )
    df["Cross-Encoder Positional Rank"] = df["Initial Document"].apply(
        lambda doc: cross_encoder_reranked_docs.index(doc) + 1 if doc in cross_encoder_reranked_docs else None
    )

    return df

# Convert both sets of scores into competition-style rankings
llm_rankings = scores_to_rankings(llm_scores)
cross_encoder_rankings = scores_to_rankings(cross_encoder_scores)

# Rebuild the reranked document lists from the scores (which are in the original retrieval order)
llm_sorted_data = sorted(zip(llm_scores, results['documents'][0]),
                         key=lambda x: x[0], reverse=True)
llm_reranked_docs = [doc for score, doc in llm_sorted_data]

ce_sorted_data = sorted(zip(cross_encoder_scores, results['documents'][0]),
                        key=lambda x: x[0], reverse=True)
cross_encoder_reranked_docs = [doc for score, doc in ce_sorted_data]

parallel_reranked_docs, combined_rankings = parallel_reranking_with_existing_scores(
    llm_scores, cross_encoder_scores, results['documents'][0]
)

rankings_df = create_rankings_df(
    initial_docs=results['documents'][0],
    llm_scores=llm_scores,
    llm_reranked_docs=llm_reranked_docs,
    llm_rankings=llm_rankings,
    cross_encoder_scores=cross_encoder_scores,
    cross_encoder_reranked_docs=cross_encoder_reranked_docs,
    cross_encoder_rankings=cross_encoder_rankings,
    parallel_reranked_docs=parallel_reranked_docs,
    parallel_rankings=combined_rankings
)
You can see that we end up with several different rankings we can use to improve the generation result. I have put all the ranks side by side so you can compare them.
Notice that the LLM evaluation rank often differs from the Cross-Encoder rank. We can use each of them as a stand-alone reranking signal or combine them into the parallel ranking.
To know which method works best, we can only experiment and evaluate the results. We can use LLM-as-a-Judge again, but human evaluation could also be employed here.
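To close the loop, here is a minimal sketch of the final step: passing only the top reranked documents, rather than all 20 retrieved chunks, to the generator. It reuses the generate_response helper defined above; keeping the top 5 is just an example cutoff:

# Keep only the most relevant documents after reranking
top_docs = llm_reranked_docs[:5]  # or cross_encoder_reranked_docs / parallel_reranked_docs

# Concatenate the selected chunks into a single context string
context = "\n\n".join(top_docs)

# Generate the final answer with a much smaller, more relevant context
answer = generate_response(query, context)
print(answer)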
That’s all for now! Next time, we will discuss further techniques to improve RAG results!
Is there anything else you’d like to discuss? Let’s dive into it together!
👇👇👇