May 30, 2024

Case Study: Using Synthetic Data to Benchmark RAG Systems


In the last article, we explained the core concepts of using Synthetic Data to test LLM applications at a high level. In this article, we will walk through a case study of the end-to-end workflow.

There are two ways to generate synthetic data on our platform:

  1. Continuous-eval (our open-source repository): generates testing datasets for simple RAG evaluation (today’s article)
  2. Relari API: generates custom datasets for any kind of LLM application (including RAG), tailored to specific use cases with more control

The goal of this case study is to illustrate how simple it is to systematically benchmark design choices and make informed decisions using synthetic datasets.

We are going to walk through 3 steps:

  • Step 0: Build 3 Simple RAG Systems
  • Step 1: Generate Synthetic Dataset
  • Step 2: Evaluate the RAG Systems with Synthetic Datasets

You can run all the code in this case study through this notebook.

Step 0: Build RAG Systems

Build a vector store with RAG knowledge corpus

We first define a simple RAG use case: digesting a long-form blog post. We use Langchain to load and chunk the content, and then Milvus Lite, a lightweight vector database, via the langchain-milvus package.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings

# Load and split documents from a long article
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Load chunks into a vector database
vectorstore = Milvus(
    embedding_function=OpenAIEmbeddings(),
    connection_args={
        "uri": "milvus_vectorstore.db",
    },
    auto_id=True,
    drop_old=True,
)

vectorstore.add_documents(docs)

Build RAG systems with 3 different retrieval strategies

Now let’s build simple RAG systems with 3 different retrieval strategies that we will evaluate later with synthetic data.

  • Vectorstore-based RAG: Simple embedding-based retrieval
  • Keyword-search based RAG: Uses BM25 algorithm for keyword-based retrieval
  • Hybrid-and-reranker-based RAG: Merges vectorstore & keyword retriever results and reranks the chunks using a Reranker model (provided by VoyageAI and Milvus)

A simple answer_generator function then uses gpt-3.5-turbo to generate an answer.

from langchain_community.retrievers import BM25Retriever
from langchain_voyageai import VoyageAIRerank
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
# Note: 'rag_utils' is a local helper module provided by Milvus with a variety of advanced RAG functions (check the notebook for more details).
from rag_utils.hybrid_and_rerank import RerankerRunnable

# 1. Vectorstore-based retriever
vectorstore_retriever = vectorstore.as_retriever()

# 2. Keyword-search-based retriever
keyword_retriever = BM25Retriever.from_documents(docs)

# 3. Hybrid (Keyword + Vectorstore) => Reranker retriever
reranker = RerankerRunnable(
    compressor=VoyageAIRerank(model="rerank-lite-1"),
    top_k=4,
)
hybrid_and_rerank_retriever = {
    "milvus_retrieved_doc": vectorstore_retriever,
    "bm25_retrieved_doc": keyword_retriever,
    "query": RunnablePassthrough(),
} | reranker

# RAG LLM answer generator
def answer_generator(query, retrieved_docs):
    llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0.5)
    system_prompt = (
        "You are a helpful assistant. Answer the question using the context provided.\n\n"
    )
    user_prompt = f"Question: {query}\n\n"
    user_prompt += "Contexts:\n" + "\n".join(
        [doc.page_content for doc in retrieved_docs]
    )
    return llm.invoke(system_prompt + user_prompt).content
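
Before evaluating, you can sanity-check that each variant produces an answer end to end. Below is a minimal usage sketch; the example question is illustrative and not part of the original notebook:

# Quick sanity check: ask the same question to all three RAG variants.
# The example question is illustrative; any question about the source article works.
query = "What is task decomposition in LLM-powered agents?"

for name, retriever in [
    ("vectorstore", vectorstore_retriever),
    ("keyword", keyword_retriever),
    ("hybrid_and_rerank", hybrid_and_rerank_retriever),
]:
    retrieved_docs = retriever.invoke(query)          # all three retrievers are Runnables
    answer = answer_generator(query, retrieved_docs)  # defined above
    print(f"[{name}] {answer[:200]}")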

Step 1: Generate Synthetic Dataset

Below is a refresher from our last article on what’s needed for the Generator:

Image by author from Generate Synthetic Data to Test LLM Application.

Now let’s generate a synthetic testing dataset specific to the Knowledge Corpus that the RAG systems are built for. We use continuous-eval to generate a dataset for RAG (the Application Logic) with 20 questions from the Milvus vectorstore (the Environment Data) created earlier.

The Relari API supports more complex types of application logic, and can ingest Seed Example Data and additional Environment Data to create more realistic and higher-quality synthetic data.

Here’s how to use the SimpleDatasetGenerator class in continuous-eval:

import json
from continuous_eval.generators import SimpleDatasetGenerator
from continuous_eval.llm_factory import LLMFactory

generator = SimpleDatasetGenerator(
    vector_store_index = vectorstore,
    generator_llm = LLMFactory("gpt-4o"),
)

synthetic_dataset = generator.generate(
    embedding_vector_size = 1536,
    num_questions = 20,
)

In a few minutes, a small dataset will be created. Let’s take a look at an example data point.

{
    "question": "How does CoH address overfitting and missing information in its model?",
    "answer": "CoH addresses overfitting by adding a regularization term to maximize the log-likelihood of the pre-training dataset. To handle missing information in its model, the agent learns to call external APIs for extra information that is missing from the model weights, including current information, code execution capability, access to proprietary information sources, and more.",
    "contexts": [
        "To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens",
        "The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more."
    ],
    "metadata": [
        {
            "source": "https://lilianweng.github.io/posts/2023-06-23-agent/",
            "title": "LLM Powered Autonomous Agents | Lil'Log",
            "description": "Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:",
            "language": "en",
            "pk": 450063656769030980
        },
        {
            "source": "https://lilianweng.github.io/posts/2023-06-23-agent/",
            "title": "LLM Powered Autonomous Agents | Lil'Log",
            "description": "Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:",
            "language": "en",
            "pk": 450063656769030951
        }
    ],
    "question_type": "Multi Hop Fact Seeking",
    "uid": 19
}

Each Synthetic Datum contains the synthetic inputs and reference outputs that can be used for evaluation.

  • Question (synthetic input): synthetic question a user could ask based on the documents
  • Contexts (intermediate output): ground truth contexts that the RAG system is supposed to retrieve to answer the question
  • Answer (final output): ground truth answer based on the ground truth contexts

Other variables such as metadata and question_type can be used for further analysis.
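
Before moving on, persist the generated dataset to disk so it can be reloaded in the evaluation step. The sketch below assumes generator.generate() returns a list of dictionaries shaped like the example above (check the notebook for the exact return type); the filename matches the one loaded in Step 2.

import json

# Assumption: synthetic_dataset is a list of dicts like the example datum above.
# Save it as JSON Lines so Step 2 can reload it with Dataset("synthetic_dataset.jsonl").
with open("synthetic_dataset.jsonl", "w") as f:
    for datum in synthetic_dataset:
        f.write(json.dumps(datum) + "\n")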

Step 2: Evaluate the RAG systems with the Synthetic Dataset

Here’s a refresher from our last article on how to use the synthetic dataset for evaluation:

Image by author from Generate Synthetic Data to Test LLM Application.

First, select metrics for each RAG module

Here we select the following metrics to evaluate the RAG systems:

Retriever:

  • Precision, Recall, and F1 to evaluate how much of the ground truth context is retrieved and how much noise is included.
  • Rank-aware metrics (e.g. Average Precision, Reciprocal Rank, NDCG) to evaluate how well the relevant chunks are ranked.

LLM Answer Generator:

  • Correctness to evaluate the overall quality of the final answer w.r.t. the ground truth answers.
  • Relevance to evaluate if the answer directly addresses the question.
  • Faithfulness to evaluate if the answer is grounded in the retrieved context.

from typing import Dict, List
from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedFaithfulness

# Load the synthetic dataset generated and saved earlier
synthetic_dataset = Dataset("synthetic_dataset.jsonl")

Documents = List[Dict[str, str]]
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

retriever = Module(
    name="retriever",
    input=synthetic_dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=synthetic_dataset.contexts,
        ),
        RankedRetrievalMetrics().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=synthetic_dataset.contexts,
        ),
    ],
)

llm = Module(
    name="llm",
    input=retriever,
    output=str,
    eval=[
        LLMBasedAnswerCorrectness().use(
            question=synthetic_dataset.question,
            answer=ModuleOutput(), ground_truth_answers=synthetic_dataset.answer
        ),
        LLMBasedAnswerRelevance().use(
            question=synthetic_dataset.question,
            answer=ModuleOutput()
        ),
        LLMBasedFaithfulness().use(
            question=synthetic_dataset.question,
            retrieved_context=ModuleOutput(DocumentsContent, module=retriever),
            answer=ModuleOutput(), 
        )
    ],
)

pipeline = Pipeline([retriever, llm], dataset=synthetic_dataset)

Now let’s run the evaluation and view the results

from continuous_eval.eval.logger import PipelineLogger
from tqdm import tqdm

def run_app(retriever_function, retriever_name):
    
    pipelog = PipelineLogger(pipeline=pipeline)

    for datum in tqdm(pipeline.dataset.data, total=len(pipeline.dataset.data), desc=f"Running {retriever_name}-based RAG"):
        q = datum["question"]
        # Run retrieval and log retrieved docs
        retrieved_docs = retriever_function.invoke(q)
        pipelog.log(
            uid=datum["uid"],
            module="retriever",
            value=[doc.__dict__ for doc in retrieved_docs],
        )
        # Run generation and log llm answer
        response = answer_generator(q, retrieved_docs)
        pipelog.log(uid=datum["uid"], module="llm", value=response)
    
    pipelog.save(f"{retriever_name}_RAG_outputs.jsonl")

run_app(vectorstore_retriever, "vectorstore")
run_app(keyword_retriever, "keyword")
run_app(hybrid_and_rerank_retriever, "hybrid_and_rerank")

Finally, we can benchmark the RAG Systems
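
To turn the logged outputs into metric scores, we run the evaluation over each saved log. The sketch below assumes continuous-eval’s EvaluationRunner API; the exact method names may differ between versions, so refer to the notebook for the version used here.

from continuous_eval.eval.runner import EvaluationRunner

# Assumed API: PipelineLogger can reload a saved log, and EvaluationRunner
# computes the pipeline's metrics over it; check the notebook/docs for the
# exact calls in your continuous-eval version.
for retriever_name in ["vectorstore", "keyword", "hybrid_and_rerank"]:
    pipelog = PipelineLogger(pipeline=pipeline)
    pipelog.load(f"{retriever_name}_RAG_outputs.jsonl")

    runner = EvaluationRunner(pipeline)
    results = runner.evaluate(pipelog)
    print(retriever_name, results.aggregate())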

Image by author. Generated in notebook.

Evaluation over this dataset shows that the Vectorstore-based RAG system and the Hybrid and Rerank RAG both reach close to 90% recall, meaning roughly 90% of the relevant context was retrieved. The Hybrid and Rerank RAG is slightly better on recall and the rank-aware metrics, but trades off precision by adding more noise to the retrieved context. The Keyword-based RAG system performed much worse in comparison.

If we look at the generation metrics, the three retrieval strategies yield similar correctness and relevance. But faithfulness shows that many of the Keyword-based RAG’s answers are hallucinated. This is a dangerous situation: the LLM is generating answers from the knowledge it was trained on rather than from the source material the RAG system is supposed to provide.

You can further analyze the distribution of results at a finer granularity, for example by inspecting performance per question type or context type, as sketched below. Lastly, it is also worth increasing the size of the dataset to make the comparisons more statistically significant.
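
For instance, collecting the per-question metric scores together with each datum’s question_type into a pandas DataFrame lets you break results down per category. This is a hypothetical sketch with dummy values; the column names and numbers are purely illustrative, not continuous-eval’s output format.

import pandas as pd

# Dummy values purely for illustration; in practice, build this table by
# joining the per-question metric results with the dataset on "uid".
per_question = pd.DataFrame(
    {
        "question_type": ["Multi Hop Fact Seeking", "Single Hop Fact Seeking",
                          "Multi Hop Fact Seeking", "Single Hop Fact Seeking"],
        "context_recall": [0.80, 0.95, 0.70, 1.00],
        "answer_correctness": [0.75, 0.90, 0.80, 0.95],
    }
)

# Average each metric per question type to spot weak categories.
print(per_question.groupby("question_type")[["context_recall", "answer_correctness"]].mean())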

Written by Yi Zhang