Evaluating RAG with LLM-as-a-Judge: A Guide to Production Monitoring

NBD Lite #46 - Bringing AI Judgments to RAG Monitoring in Live Environments

Jan 22, 2025

When I started writing about the Simple RAG (Retrieval-Augmented Generation) Implementation tutorial, I realized it’s crucial to first address how to evaluate the results.

While exploring advanced techniques to improve RAG performance is tempting, evaluation is essential to determining whether our RAG system solves the intended problems. Without proper evaluation, any improvements might be based on assumptions rather than actual outcomes.

Anyway, if you are curious about how to develop a simple RAG, you can visit it below.

Non-Brand Data

Simple RAG Implementation With Contextual Semantic Search

Hi everyone! Cornellius here, back with another Lite series. This time, we’ll explore the advanced techniques and production methods of Retrieval-Augmented Generation (RAG)—tools that will be helpful for your use cases. I will make it into a long series, so stay tuned…

6 months ago · 7 likes · 2 comments · Cornellius Yudha Wijaya

Returning to the evaluation phase, several ways exist to assess the output of a RAG (Retrieval-Augmented Generation) system. However, the LLM-as-a-Judge method is the most practical approach for projects constrained by time and resources.

This tutorial will explore the fundamentals of evaluating an RAG system, focusing on the LLM-as-a-Judge method and how to implement it in a production environment.

The diagram below provides an overview of what we’ll build today. Don’t forget, all the code is stored in the RAG-To-Know repository.

Evaluating RAG with LLM-as-a-Judge: A Guide to Production Monitoring

Without further ado, let’s get into it!

Sponsor Section

Data Science Roadmap by aigents

Feeling lost in the data science jungle? 🌴 Don’t worry—we’ve got you covered!

Check out this AI-powered Data Science Roadmap 🗺️—your ultimate step-by-step guide to mastering data science! From stats 📊 to machine learning 🤖, and Python tools 🐍 to Tableau dashboards 📈, it’s all here.

✨ Why it’s awesome:

AI-powered explanations & Q&A 🤓
Free learning resources 🆓
Perfect for beginners & skill-builders 🚀

👉 Start your journey here: Data Science Roadmap

Need extra help? Try the AI-tutor for personalized guidance: AI-tutor

Let’s make data science simple and fun! 🎉

Introduction to RAG Evaluation

RAG systems are transforming how LLMs generate responses by integrating the retrieval of relevant, real-time data into the generative process.

In our previous article, we have discussed that the core of every RAG system is two key components:

Retriever: This component uses a similarity search to identify the most relevant information from a vector database.
Generation: Once the retriever gathers the relevant documents, the LLM synthesizes them with the user query to create and generate the response.

For an RAG system to perform effectively, the retriever and the generator must function seamlessly.

Seamlessly means that the retriever must consistently deliver accurate and relevant data, while the generator must transform this input into factually accurate and helpful responses.

That’s why the RAG systems from both components must be evaluated to ensure reliability.

These evaluations are critical to optimizing RAG systems for production environments where high performance and reliability are essential.

To comprehensively evaluate RAG systems, the TRIAD framework introduced by Trulens offers a structured approach focusing on three major components:

Context Relevance: This ensures the retrieved context aligns with the user's query. Traditionally, we assess this retrieval process using metrics such as precision, recall, Mean Reciprocal Rank (MRR), and Mean Average Precision (MAP).
Faithfulness (Groundedness): Assesses the generated response's factual accuracy by verifying its grounding in the retrieved documents. Techniques include human evaluation, automated fact-checking, and consistency checks.
Answer Relevance: Measures how well the response addresses the user’s query, often using metrics like BLEU, ROUGE, METEOR, and embedding-based evaluations.

The cycle can be summarize in the graph below.

Leveraging LLM-as-a-Judge for RAG Evaluation

As you can see from the framework above, a significant amount of data collection and ground truth is required to validate the evaluation results. However, with companies' fast growth and limited resources, this collection is challenging.

LLM-as-a-Judge has gained popularity for the reasons mentioned above. It offers a faster and more cost-effective alternative to human evaluation.

This approach works by having the LLM assess generated outputs based on predefined guidelines, much like a human evaluator, while utilizing an evaluation prompt to guide the process.

The guideline can be anything you want to measure, such as politeness, bias, sentiment, or hallucination. You can imagine the LLM as a person who judges the input and asks questions such as “Is it Bias?” that person will answer Yes or No.

Unlike traditional evaluation metrics, the output can also be defined as LLM-as-a-Judge, which is not a deterministic measure. It acts as a use-case-specific proxy metric, relying on our definition of evaluation criteria.

But, Why Does LLM-as-a-Judge Work?

The basis is that LLM is having an easier time assessing the text output than generating it. In the real world, it’s easier to critique than generate content as it is inherently less complex.

The approach can either use a different LLM or the same LLM with different prompts that activate the model’s classification capabilities. The separation of roles will allow the LLM-as-a-Judge or the evaluator to detect any error from the original system.

The LLM evaluators can be applied in several ways, including:

Pairwise Comparison: Present two responses to the LLM and ask it to choose the better one. This method is ideal for offline model comparison.
Reference-Free Evaluation: Ask the LLM to assess responses based on predefined criteria such as tone, bias, or correctness.
Reference-Based Evaluation: Provide a reference document or context and ask the LLM to judge the response against it.

These evaluation strategies can be applied offline (e.g., during development) and online (e.g., for continuous production monitoring).

When implemented effectively, LLM-as-a-Judge will be able to evaluate our system better, even against the human evaluator.

While not without challenges—such as prompt design and task complexity—this approach provides a robust framework for evaluating and improving the quality of LLM-powered products.

How does LLM-as-a-Judge work in the RAG Evaluation?

In the RAG Triad framework case, we can instruct the LLM to assess the three components we measured previously. By using the LLM-as-a-Judge, we can effectively evaluate our RAG system.

For example, we can evaluate the context relevance by asking if the retrieved document is relevant to the query we pass.

The output is not necessarily required to be a yes or no answer. It can be a score or the reasoning itself, depending on our requirements.

LLM evaluators are good to use, but they need a careful design to work on our scenario. We also need to take time to construct the judges and evaluate them. Just like evaluating traditional machine learning models, we need to iterate and evaluate.

We will discuss creating a good LLM-as-a-Judge at other times, but now I want us to focus more on building the evaluator and applying it to our project.

Building LLM-as-a-Judge for RAG Evaluation

Let’s start with the exciting part, which is building the LLM Evaluator system that you can use continuously in the future.

For preparation, there are a few things that we will do before moving on to the evaluation.

Set up MLFlow Dashboard Monitoring

As we want to store the evaluation experiment somewhere, we need to set them initially. In this case, we will rely on three separate tools that work intertwine:

MLFlow: Open-Source MLOps Platform
Docker Desktop: Containerization Software
PostgreSQL Database via Neon: Data Storage

What we will do here is host the MLFlow dashboard for experiment tracking in our Docker container with the backend storage PostgreSQL for persistence.

Download and install Docker Desktop by following the recommended steps. For data storage, sign up for a free storage account in Neon and copy the connection string. It should look something like the example below.

postgresql://cornelliusdb_owner:*******@ep-sparkling-bush-a1n2vkp9.ap-southeast-1.aws.neon.tech/cornelliusdb?sslmode=require

Once you have the connection string, we will create a Dockerfile to set up the MLFlow dashboard. The Dockerfile and the requirements.txt file is already stored in the Evaluation folder so you can clone them to your folder.

In the Dockerfile, replace the ENV POSTGRES_URI=YOUR_URI with your connection string like above and we are ready to go.

Build the Docker Image and set up the Container with the following command.

docker build -t mlflow-neon .       
docker run -d -p 5000:5000 --name mlflow-server mlflow-neon

If they are run correctly, you can visit the http://localhost:5000/ and you will able to access the MLFlow Dashboard.

We are now ready to store the evaluation result in our MLFlow Dashboard. Let’s move on to the next LLM-as-a-Judge evaluation section, which is to generate a Test Dataset.

Generate Evaluation Dataset

Before we evaluate our RAG System, we need a test dataset that we can use for the evaluation material.

In this section, we will extract the PDF file we used in the previous article, split it up into chunks, and generate a Questions-Answer pair using LLM.

For this tutorial, we will only use the Question’s output results as we only want to evaluate the context relevance. Feel free if you want to use the Answer part for further analysis.

Let’s set up the library used using the following code.

pip install transformers docling google-generativeai

Also, prepare the Gemini API Key as we will rely on the Gemini model as our LLM. You can read the previous article on how to acquire the key as well.

import os

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

Next, I will use docling for data extraction. This time, I am using docling as the library designed especially for generative AI data extraction use cases.

from docling.document_converter import DocumentConverter

source = "dataset/Insurance_Handbook_20103.pdf"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)

Then, we will split up the documents into chunks where I will only take text that longer than 500 characters. We are splitting them with HierarchicalChunker to gain more semantically meaningful chunks as well.

from langchain.schema import Document
from docling_core.transforms.chunker import HierarchicalChunker

chunks = list(HierarchicalChunker().chunk(result.document))

chunks = [chunk for chunk in chunks if len(chunk.text) > 500]

LC_docs = [
    Document(page_content=chunk.text, metadata={"source": "Insurance_Handbook_20103"})
    for chunk in chunks
]

After that, we will set up the Gemini model which acts as our LLM.

import google.generativeai as genai

# Configure the generative AI model
genai.configure(api_key=GEMINI_API_KEY )

# Instantiate the generative AI regular model
model = genai.GenerativeModel('gemini-1.5-flash')

# Instantiate the generative AI QA model with the response mime type set to JSON
QA_model = genai.GenerativeModel('gemini-1.5-flash',
                                 generation_config = {"response_mime_type":"application/json"})

def call_model(model: genai, prompt:str):
    response = model.generate_content(prompt)
    return response.text

With the model ready, the next step is to set up a prompt that can effectively generate QA pairs from the chunks. An example prompt for this purpose is shown in the code below.

from langchain.prompts import PromptTemplate

Rich_context_prompt = """
You are tasked with evaluating if a given context contains sufficient rich context to generate a fact-based question (factoid) and its answer.

The evaluation should satisfy the rules below:
{rules}

Follow these steps to evaluate the context:
{guidelines}

Here are some examples (delimited by triple backticks):
```
{examples}
```
Now here is the context (delimited by triple quotes):

Context: \"\"\"{context}\"\"\" \n
Please use the JSON schema below for your output:

If the context contains sufficient rich context, generate the output in the following JSON format:
Output = {{
  "question": "insert the generated question here",
  "answer": "insert the corresponding answer here"
}}

If the context lacks sufficient rich context, generate this JSON format:
Output = {{
  "reasoning": "Explain why the context lacks sufficient richness for a question and answer pair.",
  "evaluation": "No"
}}

Return the output in the required JSON format only.
"""

rules = """- The context must present a clear subject or main idea.
- The context must include specific details, facts, or examples.
- The context must contain claims, arguments, or explanations that could be questioned.
- The context must have sufficient depth or complexity to allow meaningful questions to be generated."""

guidelines = """1. Read the context thoroughly to understand its depth and scope.
2. Identify whether the context includes specific details or claims.
3. Assess if a meaningful question can be generated from the information provided.
4. Conclude if the context has "enough rich context" or "lacks sufficient context."
5. Summarize the answer in less than 300 characters."""

examples = """
# Example 1:
## Context: The Earth revolves around the Sun in an elliptical orbit, completing one revolution approximately every 365.25 days.
## Output: {
##   "question": "What is the shape of the Earth's orbit around the Sun, and how long does one revolution take?",
##   "answer": "The Earth's orbit is elliptical, and it takes approximately 365.25 days to complete one revolution."
## }

# Example 2:
## Context: Apples are a type of fruit.
## Output: {
##   "reasoning": "The context is too general and lacks specific details or claims to generate a meaningful question.",
##   "evaluation": "No"
## }
"""

# Create the PromptTemplate
rich_context_template = PromptTemplate(
    template=Rich_context_prompt,
    input_variables=["rules", "guidelines", "examples", "context"]
)

If you see my prompt above, we set up the prompt in a detailed way like setting up the rules, guidelines, and examples. In this way, the LLM can produce the QA pair as we intended.

With the prompt ready, we can proceed to generate the QA Pair dataset. Since we’re using the free tier of Gemini LLM, I’ve also included code to enforce a rate limit to ensure the generation process runs smoothly without interruptions.

import time
from tqdm import tqdm
import pandas as pd

# Define Gemini API limits
REQUEST_LIMIT = 15  # Max requests per minute
TOKEN_LIMIT = 1_000_000  # Max tokens per minute

# Function to enforce rate limits
def enforce_rate_limit(start_time, requests_made, tokens_used):
    elapsed_time = time.time() - start_time
    if requests_made >= REQUEST_LIMIT or tokens_used >= TOKEN_LIMIT:
        sleep_time = max(0, 60 - elapsed_time)  # Wait until 1 minute has passed
        time.sleep(sleep_time)
        return time.time(), 0, 0  # Reset counters
    return start_time, requests_made, tokens_used

# Initialize counters
start_time = time.time()
requests_made = 0
tokens_used = 0

N_GENERATIONS = 100  # Number of QA pairs to generate
outputs = []

print(f"Generating {N_GENERATIONS} QA couples...")

for sampled_context in tqdm(random.sample(LC_docs, N_GENERATIONS)):
    # Generate QA pair using the updated prompt
    prompt = rich_context_template.format(
        rules=rules,
        guidelines=guidelines,
        examples=examples,
        context=sampled_context.page_content,
        format=json_format,
    )

    # Enforce rate limit before making the request
    start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)

    try:
        # Call the model
        output_QA_couple = call_model(QA_model, prompt)

        # Simulate token usage (adjust based on model specifics)
        token_count = len(prompt.split()) + 200  # Estimate ~200 tokens for the response
        tokens_used += token_count

        # Increment request counter
        requests_made += 1

        # Convert the model's output to a dictionary
        output_QA_couple = eval(output_QA_couple)

        # Check for 'question' and 'Answer' if evaluation is positive
        if "question" in output_QA_couple and "Answer" in output_QA_couple:
            question = output_QA_couple['question']
            answer = output_QA_couple['Answer']

            # Ensure the answer is not too long
            assert len(answer) < 400, "Answer is too long"

            # Append the QA pair with context and metadata
            outputs.append({
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata.get("source", "unknown"),
            })
        else:
            # Log reasoning if evaluation is "No"
            print(f"Skipped due to insufficient context: {output_QA_couple.get('reasoning', 'No reasoning provided')}")
    except Exception as e:
        print(f"Skipped context due to error: {e}")

# Convert outputs to a Pandas DataFrame
qa_dataframe = pd.DataFrame(outputs)

After the generation process, we certainly want to know if the QA pair generated is relevant or not with the chunk we pass. Of course, the most accurate way to do this is by using human evaluation to go through each question.

However, we will use LLM-as-a-Judge (or an LLM evaluator) for this task. While it might seem counterintuitive to use the same LLM to evaluate the generated QA pairs, as explained earlier, separating roles activates different model capabilities. This can be effectively utilized here for evaluation purposes.

Let’s set up the evaluation prompt and the metrics criteria. In this example, we are using the following metrics:

Groundedness: How well the question is grounded in the provided context.
Relevance: How relevant the question is to the domain (Insurance in this case).
Standalone: How self-contained and understandable the question is without additional context.

from tqdm import tqdm
import pandas as pd

# Define the domain/topic
topic = "Insurance"

# Define evaluation criteria
question_validation_prompt_template = """
{task}
{evaluation_criteria}

Follow these steps to generate your evaluation:
{evaluation_steps}

Please respond using the following JSON schema:

Answer = {json_format}

You MUST provide values for 'Evaluation:' and 'Score' in your answer.

{question}
{context}

Answer: """

rating_json_format = """
{
    "Evaluation": "your rationale for the rating, as a text",
    "Score": "your rating, as a number between 1 and 5"
}
"""

question_template = """Now here is the question (delimited by triple backticks)
Question: ```{question}```
"""

context_template = """ here are the context (delimited by triple quotes).
Context: \"\"\"{context}\"\"\"\n
"""

# Tasks, evaluation criteria, and steps
groundedness_task = """You will be given a context and a question.
Your task is to evaluate the question based on the given context and provide a score between 1 and 5 according to the following criteria:"""

groundedness_eval = """- Score 1: The context does not provide sufficient information to answer the question in any way.
- Score 2 or 3: The context provides some relevant information, but the question remains partially answerable or is unclear/ambiguous.
- Score 4: The context offers sufficient information to answer the question, but some minor details are missing or unclear.
- Score 5: The context provides all necessary information to answer the question clearly and without ambiguity."""

groundedness_steps = """- Read the context and question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflects your evaluation."""

relevance_task = """You will be provided with a question that may or may not relate to the {domain} domain.
Your task is to evaluate its usefulness to users seeking information in the {domain} domain and assign a score between 1 and 5 based on the following criteria:"""

relevance_eval = """- Score 1: The question is unrelated to the {domain} domain.
- Score 2 or 3: The question touches on {domain} but leans more towards another domain and is not particularly useful or relevant for {domain}-specific needs.
- Score 4: The question is related to the {domain} domain but lacks direct usefulness or relevance for users looking for valuable information in this domain.
- Score 5: The question is clearly related to the {domain} domain, makes sense, and is likely to be useful to users seeking information within this domain."""

relevance_steps = """- Read the question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflects your evaluation."""

standalone_task = """You will be given a question.
Your task is to evaluate how context-independent this question is. You need to assess how self-contained and understandable a question is without relying on external context.
The score reflects whether the question makes sense on its own. Questions referring to a specific, unstated context, such as "in the document" or "in the context," should receive a lower score.
Technical terms or acronyms related to {domain} can still qualify for a high score if they are clear to someone with standard domain knowledge and documentation access.
Please provide a score between 1 and 5 based on the following criteria:"""

standalone_eval = """- Score 1: The question is highly dependent on external context and cannot be understood without additional information.
- Score 2: The question clarifies but requires significant additional context to make sense.
- Score 3: The question can mostly be understood but may depend slightly on an external context for complete clarity.
- Score 4: The question is nearly self-contained, with only minor reliance on external context.
- Score 5: The question is entirely self-contained and makes complete sense on its own, without any reliance on external context."""

standalone_steps = """- Read the question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflects your evaluation."""

Let’s execute the judging process with the following code. We will only focus on evaluating the Questions part, so please tweak the code if you want to see the result for the Answers part as well.

import json 

# Define the request and token limits
REQUEST_LIMIT = 15  # Maximum requests per minute
TOKEN_LIMIT = 1_000_000  # Maximum tokens per minute

# Initialize counters
start_time = time.time()
requests_made = 0
tokens_used = 0

# Ensure outputs is a list of dictionaries
if isinstance(outputs, pd.DataFrame):
    outputs = outputs.to_dict("records")  # Convert DataFrame to list of dictionaries

# Processing loop with rate limiting
for output in tqdm(outputs):
    evaluations = {}
    for criterion, prompt in [
        ("groundedness", question_validation_prompt_template.format(
            task=groundedness_task,
            evaluation_criteria=groundedness_eval,
            evaluation_steps=groundedness_steps,
            json_format=rating_json_format,
            question=question_template.format(question=output["question"]),
            context=context_template.format(context=output["context"]),
        )),
        ("relevance", question_validation_prompt_template.format(
            task=relevance_task.format(domain=topic),
            evaluation_criteria=relevance_eval.format(domain=topic),
            evaluation_steps=relevance_steps,
            json_format=rating_json_format,
            question=question_template.format(question=output["question"]),
            context=context_template.format(context=output["context"]),
        )),
        ("standalone", question_validation_prompt_template.format(
            task=standalone_task.format(domain=topic),
            evaluation_criteria=standalone_eval.format(domain=topic),
            evaluation_steps=standalone_steps,
            json_format=rating_json_format,
            question=question_template.format(question=output["question"]),
            context="",  # No context needed for standalone evaluation
        )),
    ]:
        # Enforce rate limit before each request
        start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)

        try:
            # Call the model
            evaluation = call_model(QA_model, prompt)

            # Simulate token usage (adjust based on model and prompt specifics)
            token_count = len(prompt.split()) + 200  # Assume ~200 tokens in response
            tokens_used += token_count

            # Increment request counter
            requests_made += 1

            # Parse the response and store evaluations
            evaluation = json.loads(evaluation)  # Safely parse JSON response
            evaluations[criterion] = {
                "score": int(evaluation["Score"]),
                "eval": evaluation["Evaluation"]
            }

        except Exception as e:
            print(f"Error processing {criterion} evaluation: {e}")
            evaluations[criterion] = {"score": None, "eval": str(e)}

    # Update the output with evaluations
    output.update({
        "groundedness_score": evaluations["groundedness"]["score"],
        "groundedness_eval": evaluations["groundedness"]["eval"],
        "relevance_score": evaluations["relevance"]["score"],
        "relevance_eval": evaluations["relevance"]["eval"],
        "standalone_score": evaluations["standalone"]["score"],
        "standalone_eval": evaluations["standalone"]["eval"],
    })

# Convert outputs to a DataFrame
qa_evaluation_df = pd.DataFrame(outputs)
qa_evaluation_df.to_csv('evaluation_results_question_insurance.csv', index = False)

The result is shown in the table above. Looks pretty good as the LLM provides a score and explanation for each score. Let’s log this result into our MLFlow Dashboard using the code below.

import mlflow
# Log results to MLflow

mlflow.set_tracking_uri("http://localhost:5000")

#Only set up once
EXPERIMENT_NAME = "Question-Evaluation-Insurance"
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_name", "gemini-1.5-flash")
    mlflow.log_param("topic", topic)

    # Log aggregated metrics
    groundedness_scores = qa_evaluation_df["groundedness_score"].dropna()
    relevance_scores = qa_evaluation_df["relevance_score"].dropna()
    standalone_scores = qa_evaluation_df["standalone_score"].dropna()

    mlflow.log_metric("average_groundedness_score", groundedness_scores.mean())
    mlflow.log_metric("average_relevance_score", relevance_scores.mean())
    mlflow.log_metric("average_standalone_score", standalone_scores.mean())


    print("Results logged to MLflow.")

In the code above, we only store the model and topic parameters with average metrics scores. You can always add additional metadata as you need to the dashboard.

The Question looks relevant and ready to use as a test dataset. Then, let’s move on to evaluate our RAG Context Relevance with our dataset.

Thanks for reading Non-Brand Data! This post is public so feel free to share it.

RAG Context Relevance Evaluation with LLM-as-a-Judge

We will continue working with the RAG System we set up in the previous article. I’ve also updated the notebook to reflect today’s evaluation, so no need to worry!

Using the test dataset we have saved previously, we will load them in the Simple RAG Implementation notebook.

import pandas as pd

df = pd.read_csv('evaluation_results_question_insurance.csv')

Next, we will set up the Context Relevance Evaluation prompt for the LLM evaluator to follow. The prompt is kinda similar to the groundedness we set up previously with a little tweak.

from tqdm import tqdm

# Define evaluation criteria
chunk_validation_prompt_template = """
{task}
{evaluation_criteria}

Follow these steps to generate your evaluation:
{evaluation_steps}

Please respond using the following JSON schema:

Answer = {json_format}

You MUST provide values for 'Evaluation:' and 'Score' in your answer.

{question}
{context}

Answer: """

rating_json_format = """
{
    "Evaluation": "your rationale for the rating, as a text",
    "Score": "your rating, as a number between 1 and 5"
}
"""

question_template = """Now here are the question (delimited by triple backticks)
Question: ```{question}```
"""

context_template = """here are the context (delimited by triple quotes).
Context: \"\"\"{context}\"\"\"\n
"""

# Tasks, evaluation criteria, and steps
context_evaluation_task = """You will be given a context and a question.
Your task is to evaluate the question based on the given context and provide a score between 1 and 5 according to the following criteria:"""

context_evaluation_task = """- Score 1: The context does not provide sufficient information to answer the question in any way.
- Score 2 or 3: The context provides some relevant information, but the question remains partially answerable, or is unclear/ambiguous.
- Score 4: The context offers sufficient information to answer the question, but some minor details are missing or unclear.
- Score 5: The context provides all necessary information to answer the question clearly and without ambiguity."""

context_evaluation_task = """- Read the context and question carefully.
- Analyse and evaluate the question based on the provided evaluation criteria.
- Provide a scaled score between 1 and 5 that reflect your evaluation."""

Similar to the previous section, we will set up the evaluation looping process with the following code.

import time
import json

# Define the request and token limits
REQUEST_LIMIT = 15  # Maximum requests per minute
TOKEN_LIMIT = 1_000_000  # Maximum tokens per minute

# Initialize counters for rate limiting
start_time = time.time()
requests_made = 0
tokens_used = 0

# Define the function to enforce rate limits
def enforce_rate_limit(start_time, requests_made, tokens_used):
    elapsed_time = time.time() - start_time
    if requests_made >= REQUEST_LIMIT or tokens_used >= TOKEN_LIMIT:
        sleep_time = max(0, 60 - elapsed_time)  # Wait until 1 minute has passed
        time.sleep(sleep_time)
        return time.time(), 0, 0  # Reset counters
    return start_time, requests_made, tokens_used

# Define the function to evaluate groundness
def evaluate_groundness(question, context):
    global start_time, requests_made, tokens_used
    
    # Enforce rate limits
    start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)
    
    # Prepare the prompt for groundness evaluation
    prompt = f"""
    {groundedness_task}

    {groundedness_eval}

    {groundedness_steps}

    Please respond using the following JSON schema:

    {rating_json_format}

    Now here is the question (delimited by triple backticks):
    Question: ```{question}```

    Here is the context (delimited by triple quotes):
    Context: \"\"\"{context}\"\"\"
    """
    
    # Generate the response using Gemini
    response = generate_response(prompt, "")
    
    # Update the counters
    requests_made += 1
    tokens_used += len(prompt.split())  # Approximate token count
    
    # Parse the response
    try:
        # Extract JSON part from the response (if any)
        json_start = response.find("{")
        json_end = response.rfind("}") + 1
        json_response = response[json_start:json_end]
        evaluation = json.loads(json_response)
        return evaluation
    except json.JSONDecodeError:
        return {"Evaluation": "Invalid JSON response", "Score": 0}

results = []
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    question = row['question']
    
    # Perform semantic search
    search_results = semantic_search(question, top_k=2)
    
    # Evaluate each chunk
    for i, context in enumerate(search_results['documents'][0]):
        evaluation = evaluate_groundness(question, context)
        results.append({
            "question": question,
            "context": context,
            "evaluation": evaluation
        })
    
    # Enforce rate limits after each question
    start_time, requests_made, tokens_used = enforce_rate_limit(start_time, requests_made, tokens_used)

With that, you already evaluate with LLM-as-a-Judge to our RAG retrieval process. Let’s save the file and log the result into the MLFlow dashboard.

# Convert the results list to a DataFrame
results_df = pd.DataFrame(results)

# Flatten the 'evaluation' column into separate columns
results_df = pd.concat(
    [results_df.drop(columns=["evaluation"]), results_df["evaluation"].apply(pd.Series)],
    axis=1
)

results_df.rename(columns={"Evaluation": "Evaluation_Rationale", "Score": "Evaluation_Score"}, inplace=True)


# Optionally, save the DataFrame to a CSV file
results_df.to_csv("RAG_evaluation_results.csv", index=False)

You can use another experiment name for the MLFlow monitoring as we already evaluate different things.

import mlflow

# Set the tracking URI and experiment name
mlflow.set_tracking_uri("http://localhost:5000")
EXPERIMENT_NAME = "RAG-Question-Evaluation-Insurance"
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_name", "gemini-1.5-flash")

    groundedness_scores = rag_result['Evaluation_Score'].dropna()

    mlflow.log_metric("average_groundedness_score", groundedness_scores.mean())

    print("Results logged to MLflow.")

I have used the groundness_score name previously but you can change it into context_relevance or another name. I am accidentally using that metric name but we know what it means.

The score shows that the context relevance average is only around 2.3, which means many of the retrieval chunks are only slightly relevant.

This is something we will discuss in the next edition as we want to improve the retrieval result using advanced techniques. This time, I promise we will start with the Reranking!

Is there anything else you’d like to discuss? Let’s dive into it together!

👇👇👇