September 17, 2024

Qdrant Case Study: End-to-End RAG Optimization

Case study

Retrieval-Augmented Generation (RAG) has quickly become a go-to strategy for enhancing Large Language Models (LLMs) with up-to-date and context-specific information. The allure of RAG lies in its promise: plug in a retrieval component, and suddenly your model can access an external knowledge base, seemingly extending its capabilities far beyond its training data. But as straightforward as this sounds, building a RAG system that truly performs well in production is far from trivial.

Image by Qdrant (What is RAG in AI)

While tools like Langchain, LlamaIndex, Haystack, or DSPy make it seem like you can set up a RAG pipeline with just a few lines of code, real-world applications demand far more nuanced solutions. 

Despite the endless variety of RAG techniques—ranging from simple vector searches to complex multi-agent retrieval systems backed by knowledge graphs—there’s no universal setup that fits every scenario. Different types of data, user queries, precision requirements, and user experiences necessitate tailored strategies.

Images from Advanced RAG Techniques by Singh, Apr 2024

Why is there no one-size-fits-all approach?

We attribute this to a few sources of variance inherent to different use cases:

  • Data Diversity: The way information is stored and represented significantly influences how it should be retrieved. For example, building a RAG system for semi-conversational legal court transcripts is different from creating one for well-structured SEC filings.
  • Query Diversity: User queries can vary significantly, even within the same application. For instance, a query like “Summarize the key takeaways from this month’s customer discovery calls” versus “When was the last time a customer mentioned feature X?” will require different amounts and types of information to respond effectively. 
  • Requirement Diversity: The precision required varies by application. For example, financial reporting needs highly accurate data retrieval, while a brainstorming tool can afford a broader range of relevance. The level of accuracy needed shapes the design of the retrieval system. 
  • UX Diversity: The user experience influences the RAG design choices as well. Automated responses for predefined questions allow for a high-latency, narrow but accurate retrieval process, whereas a dynamic chatbot interface needs a more adaptable approach to handle diverse, real-time queries.

Given these complexities, finding the right RAG strategy for your application requires rigorous experimentation with data that mirrors your real-world scenario. Only then can you determine the most effective approach or combination of strategies for your specific needs.

In the following case study, we'll demonstrate how to efficiently evaluate and improve RAG systems using a data-driven approach, helping you iteratively refine and identify the optimal RAG strategy for your application.

Case study: Data-driven way to optimize RAG for a specific use case

In this case study, we will use a simple RAG system built on GitLab Legal Policies as an example. We will use Qdrant and Langchain to set up different RAG systems and then leverage Relari’s data-driven toolkit to evaluate and optimize them. Access the notebook here to follow along: Qdrant <> Relari Webinar Notebook

Step 0. Building the RAG with Qdrant and Langchain

Step 1. Define a Golden Dataset

Step 2. Run Experiments to Identify the Best Retrieval Strategy

Step 3. Use Auto Prompt Optimizer to Create a Robust RAG Prompt

Step 0: Building the RAG with Qdrant and Langchain

We begin by building RAG systems with Qdrant's vector database. Qdrant offers a user-friendly yet highly scalable solution, suitable for both beginners and those managing vast amounts of data in production environments. For this case study, we use the in-memory version of the Qdrant vector store to load the GitLab legal policy documents.

from langchain_community.document_loaders.directory import DirectoryLoader
from langchain_qdrant import Qdrant
from langchain_openai import OpenAIEmbeddings
from relari.core.types import DatasetDatum

# Load the documents and split them into chunks
loader = DirectoryLoader("gitlab_legal_policies/")
documents = loader.load_and_split()

# Load the chunks into a Qdrant vector store
db = Qdrant.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
    location=":memory:",
    collection_name="gitlab_legal_policies",
)

Once the vector database is set up, we can use it as a retriever with various parameters and architectures.
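
For example, the vector store can be wrapped as a plain similarity-search retriever. In the sketch below, the k value is an arbitrary placeholder (Step 2 tunes it systematically) and the sample query is made up:

# Wrap the vector store as a similarity-search retriever; k=5 is a placeholder here
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# Example call with a made-up question about the indexed policies
docs = retriever.invoke("Who is responsible for approving contract exceptions?")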

Step 1: Define a Golden Dataset

Datasets are crucial in Relari's approach to evaluating and improving LLM applications. The data serves as a ground truth (or reference) to test an LLM application pipeline. You can think of a dataset as an exam rubric that includes questions, correct answers, and the steps to derive those answers (e.g., which page of a book provides the answer). We then evaluate how well the AI application's generated answers match the expected ones. Learn more about our data-driven approach: Why Use Datasets?.
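
Concretely, a single record in such a golden dataset boils down to a question, a reference answer, and the reference context. The record below is purely illustrative (made-up question, placeholder values), not Relari's exact schema:

# Illustrative golden dataset record (made-up example, not Relari's exact schema)
golden_record = {
    "question": "Who needs to approve an exception to a GitLab legal policy?",
    "ground_truth_answer": "...",            # the reference answer a good RAG response should match
    "ground_truth_context": ["...", "..."],  # chunks the retriever is expected to surface
}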

In Relari, you can upload a pre-existing golden dataset (often human-labeled or collected from users), or you can easily generate a synthetic golden dataset:

from pathlib import Path

from relari import RelariClient

client = RelariClient()

data_dir = Path("gitlab_legal_policies")
task_id = client.synth.new(
    project_id=proj["id"],  # project created earlier in the notebook
    name="Gitlab Legal Policies",
    samples=30,
    files=list(data_dir.glob("*.txt")),
)

Within minutes, you can create a synthetic dataset based on the GitLab Legal Policies, containing questions, ground truth answers, and ground truth contexts (the chunks the RAG system should retrieve to answer a question). Curious to learn more about synthetic datasets? Check out this blog: Generate Synthetic Data to Test LLM Applications.

Image by Relari: Synthetic datasets generated from Gitlab Legal Policies

Step 2: Running Experiments to Identify the Best Retrieval Strategy

After defining the dataset, we can experiment with different retrieval methods, parameters, and chunking strategies to find the optimal approach. In this example, we conduct two experiments: optimizing the "Top K" parameter and exploring different retrieval architectures, such as hybrid search.

Experiment 1: Optimizing the 'Top K' Parameter

The "Top K" parameter determines how many chunks are retrieved from the vector store by the semantic retriever. While retrieving more chunks increases the likelihood of capturing relevant documents, it also introduces more irrelevant chunks (noise), potentially confusing the LLM.

We experiment with different "Top K" values (ranging from 3 to 9 chunks) and evaluate them with two retrieval metric groups from the Relari Metric Library:

- PrecisionRecallF1: Recall measures the percentage of relevant chunks that are retrieved; Precision measures the percentage of retrieved chunks that are relevant (see the short worked example after this list). Read more: Context Precision Recall.

- RankedRetrievalMetrics: MAP (Mean Average Precision), MRR (Mean Reciprocal Rank), and NDCG (Normalized Discounted Cumulative Gain). These metrics take the ordering of the retrieved chunks into account. Read more: Rank-Aware Metrics.
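
As a quick illustration of how precision and recall trade off, consider a single question for which the golden dataset marks 3 chunks as relevant, the retriever returns 7 chunks, and 2 of them are relevant (the numbers are made up):

# Toy context precision/recall calculation for one question (made-up numbers)
relevant, retrieved, hits = 3, 7, 2
recall = hits / relevant      # 2/3 ≈ 0.67: 67% of the relevant chunks were retrieved
precision = hits / retrieved  # 2/7 ≈ 0.29: 29% of the retrieved chunks are relevant
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.40

The loop below runs this kind of evaluation across the different values of k: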

from relari import Metric

k_values = [3, 5, 7, 9]  # Different values of top k to experiment with

semantic_retrievers = {}
semantic_logs = {}

for k in k_values:
    # Run the retriever on the golden dataset and log the retrieved chunks
    # (log_retriever_results is a helper defined in the accompanying notebook)
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    log = log_retriever_results(retriever, dataset)

    semantic_retrievers[f"k_{k}"] = retriever
    semantic_logs[f"k_{k}"] = log
    print(f"Results on {dataset.name} by Semantic Retriever with k={k} saved!")

    # Submit an evaluation for each retriever
    eval_name = f"Semantic Retriever Evaluation k={k}"
    eval_data = log  # Use the logged data directly from the loop

    eval_info = client.evaluations.submit(
        project_id=proj["id"],
        dataset=dataset_info["id"],
        name=eval_name,
        pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
        data=eval_data,
    )
    print(f"{eval_name} submitted!")

Results: The experiment shows that a "Top K" of at least 7 is required to retrieve over 85% of relevant documents (Context Recall). However, with a context precision of about 20%, roughly 80% of the retrieved documents are irrelevant. Whether this is acceptable depends on your application’s design requirements. Further evaluations, including metrics like Faithfulness, can help understand the impact of retrieval performance on LLM hallucinations and answer quality.

Image by Relari: Top-K experiments for RAG

Image by Relari. Tradeoff of Precision and Recall shown in second chart.

Experiment 2: Evaluating Keyword Search and Hybrid Search

In this experiment, we compare traditional keyword search techniques, such as BM25, with a vector store-based approach and explore a hybrid search that combines both.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword (BM25) retriever over the same document chunks
keyword_retriever = BM25Retriever.from_documents(documents)

# Build a hybrid retriever that combines keyword and semantic results with equal weighting
hybrid_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, semantic_retrievers["k_7"]], weights=[0.5, 0.5]
)
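
Both new retrievers can be evaluated with the same pattern as in Experiment 1. A sketch, reusing the log_retriever_results helper and the evaluation submission call from above:

# Evaluate the keyword and hybrid retrievers with the same pipeline as Experiment 1
for name, retriever in {"Keyword": keyword_retriever, "Hybrid": hybrid_retriever}.items():
    log = log_retriever_results(retriever, dataset)
    eval_info = client.evaluations.submit(
        project_id=proj["id"],
        dataset=dataset_info["id"],
        name=f"{name} Retriever Evaluation",
        pipeline=[Metric.PrecisionRecallF1, Metric.RankedRetrievalMetrics],
        data=log,
    )
    print(f"{name} Retriever evaluation submitted!")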

Results: The Keyword Retriever shows similar context recall (87%) to the Semantic Retriever but with much higher precision (32% vs. 18%). The Hybrid search, which merges the results of the Semantic Retriever (k=7) and the Keyword Retriever, improves recall to 92%. This suggests that semantic and keyword methods complement each other to improve the overall RAG system.

Image by Relari: Keyword / Hybrid search experiment

Image by Relari. Hybrid Retriever improves performance compared to baseline Semantic Retrievers.
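
For reference, Langchain's EnsembleRetriever merges the two ranked result lists with weighted Reciprocal Rank Fusion (RRF). A minimal sketch of the idea, using made-up document IDs and the conventional smoothing constant of 60:

# Weighted Reciprocal Rank Fusion: each retriever contributes weight / (rank + c)
# for every document it returns; documents ranking well in both lists float to the top.
def rrf(rankings, weights, c=60):
    scores = {}
    for ranked_docs, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with made-up document IDs
semantic_hits = ["doc_3", "doc_1", "doc_7"]
keyword_hits = ["doc_1", "doc_9", "doc_3"]
print(rrf([semantic_hits, keyword_hits], weights=[0.5, 0.5]))
# doc_1 and doc_3, which appear in both lists, rank highest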

Step 3: Auto-Optimizing the RAG Prompt

The generation component of RAG systems—comprising the prompt and the LLM—also plays a critical role. While a basic prompt might be “Answer the following Question using the Context provided,” this often isn't sufficient to meet unique requirements.

Relari offers an Auto Prompt Optimizer that automatically refines the prompt given a Golden Dataset and a Target Metric.

Image by Relari: Data-Driven Auto Prompt Optimizer architecture

We optimized the prompt for our GitLab Legal Policy RAG as follows.

from relari.core.types import Prompt, UserPrompt

base_prompt = Prompt(
    system="You are a gitlab legal policy Q&A bot. Answer the following question given the context.",
    user=UserPrompt(
        prompt="Question: $question\n\nContext:\n$ground_truth_context",
        description="Question and context to answer the question.",
    ),
)

task_id = client.prompts.optimize(
    name="Gitlab Legal Policy RAG Prompt",
    project_id=proj["id"],
    dataset_id=dataset_info["id"],
    prompt=base_prompt,
    llm="gpt-4o-mini",
    task_description="Answer the question using the provided context.",
    metric=client.prompts.Metrics.CORRECTNESS,
)
print(f"Optimization task submitted with ID: {task_id}")

Results: The Optimizer conducted six training steps, significantly improving performance (as measured by correctness) compared to the base prompt. The new prompt incorporates an updated System Prompt and auto-selected few-shot examples to better align the generation with our specific needs.

Image by Relari: Optimized RAG prompt for this dataset
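
Once the optimizer finishes, the refined prompt can be dropped back into the generation step. A minimal sketch using Langchain, where the system and user strings are placeholders for the optimizer's output (copied in from Relari) and the retrieval reuses the best-performing hybrid retriever from Step 2:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Placeholders for the optimized prompt produced in Step 3;
# substitute the actual system prompt and few-shot examples from Relari.
optimized_system = "..."
optimized_user = "Question: {question}\n\nContext:\n{context}"

prompt = ChatPromptTemplate.from_messages(
    [("system", optimized_system), ("user", optimized_user)]
)
llm = ChatOpenAI(model="gpt-4o-mini")

def answer(question: str) -> str:
    # Retrieve with the hybrid retriever from Step 2 and stuff the chunks into the prompt
    docs = hybrid_retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    return llm.invoke(prompt.format_messages(question=question, context=context)).content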

Conclusion

We showcased how a data-driven approach can significantly enhance the performance of Retrieval-Augmented Generation (RAG) systems. By experimenting with various retrieval strategies and optimizing prompts using Relari's tools, we demonstrated that there is no universal solution—each application demands a tailored strategy.

We encourage you to explore these techniques with your own data. Try the Qdrant <> Relari Webinar Notebook today and discover how a data-driven approach can optimize your RAG systems for better accuracy and performance.

Written by
Yi Zhang