In the last article, we explained the core concepts of using Synthetic Data to test LLM applications at a high level. In this article, we will walk through a case study of the end-to-end workflow.
There are two ways to generate synthetic data on our platform: through the open-source continuous-eval library or through the Relari API. This case study uses continuous-eval.
The goal of this case study is to illustrate how simple it is to systematically benchmark design choices and make informed decisions using synthetic datasets.
We are going to walk through 3 steps: (1) build a simple RAG application with three different retrieval strategies, (2) generate a synthetic testing dataset from the application's knowledge corpus, and (3) evaluate and compare the three RAG variants on that dataset.
You can run all the code in this case study through this notebook.
We first define a simple RAG use case: digesting a long-form blog post. We use LangChain to load and chunk the content, then store the chunks in Milvus Lite, a lightweight vector database, via the langchain-milvus package.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings
# Load and split documents from a long article
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Load chunks into a vector database
vectorstore = Milvus(
    embedding_function=OpenAIEmbeddings(),
    connection_args={
        "uri": "milvus_vectorstore.db",
    },
    auto_id=True,
    drop_old=True,
)
vectorstore.add_documents(docs)
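As a quick sanity check (not part of the original walkthrough), you can run a similarity search against the populated store; the query string below is just an illustrative example:

# Optional sanity check: fetch the top chunks for a sample query
sample_hits = vectorstore.similarity_search("How do LLM agents decompose tasks?", k=2)
for hit in sample_hits:
    print(hit.page_content[:120], "...")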
Now let’s build simple RAG systems with 3 different retrieval strategies that we will evaluate later with synthetic data.
A simple answer_generator function then uses gpt-3.5-turbo to generate an answer.
from langchain_community.retrievers import BM25Retriever
from langchain_voyageai import VoyageAIRerank
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
# Note: 'rag_utils' is a locally defined module from Milvus with a variety of advanced RAG functions (check the notebook for more details).
from rag_utils.hybrid_and_rerank import RerankerRunnable
# 1. Vectorstore-based retriever
vectorstore_retriever = vectorstore.as_retriever()
# 2. Keyword-search-based retriever
keyword_retriever = BM25Retriever.from_documents(docs)
# 3. Hybrid (Keyword + Vectorstore) => Reranker retriever
reranker = RerankerRunnable(
    compressor=VoyageAIRerank(model="rerank-lite-1"),
    top_k=4,
)
hybrid_and_rerank_retriever = {
    "milvus_retrieved_doc": vectorstore_retriever,
    "bm25_retrieved_doc": keyword_retriever,
    "query": RunnablePassthrough(),
} | reranker
# RAG LLM answer generator
def answer_generator(query, retrieved_docs):
    llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0.5)
    system_prompt = (
        "You are a helpful assistant. Answer the question using the context provided."
    )
    user_prompt = f"Question: {query}\n\n"
    user_prompt += "Contexts:\n" + "\n".join(
        [doc.page_content for doc in retrieved_docs]
    )
    return llm.invoke(system_prompt + user_prompt).content
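Before wiring these into an evaluation, it helps to smoke-test each retriever end to end. The snippet below is an illustrative example (the sample query is made up); all three retrievers expose the same invoke interface, which is what the evaluation harness relies on later.

# Illustrative smoke test: run one query through each retriever and the answer generator
sample_query = "How does the agent use external tools?"
retrievers = {
    "vectorstore": vectorstore_retriever,
    "keyword": keyword_retriever,
    "hybrid_and_rerank": hybrid_and_rerank_retriever,
}
for name, retriever in retrievers.items():
    retrieved_docs = retriever.invoke(sample_query)
    print(f"{name}: {answer_generator(sample_query, retrieved_docs)[:120]}")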
Below is a refresher from our last article on what's needed for the Generator:
Now let’s generate a synthetic testing dataset specific to the Knowledge Corpus that the RAG systems are built on. We use the continuous-eval library to generate a dataset for RAG (Application Logic) with 20 questions from the Milvus Vectorstore (Environment Data) created earlier.
The Relari API supports more complex types of application logic, as well as the ability to ingest Seed Example Data and additional Environment Data to create more realistic and higher-quality synthetic data.
Here’s how to use the SimpleDatasetGenerator class in continuous-eval:
import json
from continuous_eval.generators import SimpleDatasetGenerator
from continuous_eval.llm_factory import LLMFactory
generator = SimpleDatasetGenerator(
    vector_store_index=vectorstore,
    generator_llm=LLMFactory("gpt-4o"),
)
synthetic_dataset = generator.generate(
    embedding_vector_size=1536,
    num_questions=20,
)
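To reuse the generated data points in the evaluation step later, save them to a JSONL file. This is a minimal sketch that assumes generate returns a list of dict-like data points, as in the example shown next (the notebook has the exact save step):

# Persist the synthetic data points so they can be reloaded as a Dataset later
# (assumes each datum is a plain dict, as in the example below)
with open("synthetic_dataset.jsonl", "w") as f:
    for datum in synthetic_dataset:
        f.write(json.dumps(datum) + "\n")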
In a few minutes, a small dataset will be created. Let’s take a look at an example data point.
{
  "question": "How does CoH address overfitting and missing information in its model?",
  "answer": "CoH addresses overfitting by adding a regularization term to maximize the log-likelihood of the pre-training dataset. To handle missing information in its model, the agent learns to call external APIs for extra information that is missing from the model weights, including current information, code execution capability, access to proprietary information sources, and more.",
  "contexts": [
    "To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens",
    "The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more."
  ],
  "metadata": [
    {
      "source": "https://lilianweng.github.io/posts/2023-06-23-agent/",
      "title": "LLM Powered Autonomous Agents | Lil'Log",
      "description": "Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:",
      "language": "en",
      "pk": 450063656769030980
    },
    {
      "source": "https://lilianweng.github.io/posts/2023-06-23-agent/",
      "title": "LLM Powered Autonomous Agents | Lil'Log",
      "description": "Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent\u2019s brain, complemented by several key components:",
      "language": "en",
      "pk": 450063656769030951
    }
  ],
  "question_type": "Multi Hop Fact Seeking",
  "uid": 19
}
Each Synthetic Datum contains the synthetic inputs and reference outputs that can be used for evaluation. Other variables such as metadata and question_type can be used for further analysis.
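For example, here is a minimal sketch (using only the fields shown in the example datum above) of how to check how the question types are distributed across the 20 generated questions:

import json
from collections import Counter

# Count how many questions of each type the generator produced
with open("synthetic_dataset.jsonl") as f:
    data = [json.loads(line) for line in f]
print(Counter(d["question_type"] for d in data))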
Here’s a refresher from our last article on how to use the synthetic dataset for evaluation:
Here we select the following metrics to evaluate the RAG systems:
Retriever: context Precision, Recall, and F1, along with rank-aware retrieval metrics
LLM Answer Generator: LLM-based Answer Correctness, Answer Relevance, and Faithfulness
from typing import Dict, List
from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedFaithfulness
# Use existing dataset
synthetic_dataset = Dataset("synthetic_dataset.jsonl")
Documents = List[Dict[str, str]]
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])
retriever = Module(
    name="retriever",
    input=synthetic_dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=synthetic_dataset.contexts,
        ),
        RankedRetrievalMetrics().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=synthetic_dataset.contexts,
        ),
    ],
)

llm = Module(
    name="llm",
    input=retriever,
    output=str,
    eval=[
        LLMBasedAnswerCorrectness().use(
            question=synthetic_dataset.question,
            answer=ModuleOutput(),
            ground_truth_answers=synthetic_dataset.answer,
        ),
        LLMBasedAnswerRelevance().use(
            question=synthetic_dataset.question,
            answer=ModuleOutput(),
        ),
        LLMBasedFaithfulness().use(
            question=synthetic_dataset.question,
            retrieved_context=ModuleOutput(DocumentsContent, module=retriever),
            answer=ModuleOutput(),
        ),
    ],
)
pipeline = Pipeline([retriever, llm], dataset=synthetic_dataset)
from continuous_eval.eval.logger import PipelineLogger
from tqdm import tqdm
def run_app(retriever_function, retriever_name):
    pipelog = PipelineLogger(pipeline=pipeline)
    for datum in tqdm(pipeline.dataset.data, total=len(pipeline.dataset.data), desc=f"Running {retriever_name}-based RAG"):
        q = datum["question"]
        # Run retrieval and log retrieved docs
        retrieved_docs = retriever_function.invoke(q)
        pipelog.log(
            uid=datum["uid"],
            module="retriever",
            value=[doc.__dict__ for doc in retrieved_docs],
        )
        # Run generation and log llm answer
        response = answer_generator(q, retrieved_docs)
        pipelog.log(uid=datum["uid"], module="llm", value=response)
    pipelog.save(f"{retriever_name}_RAG_outputs.jsonl")
run_app(vectorstore_retriever, "vectorstore")
run_app(keyword_retriever, "keyword")
run_app(hybrid_and_rerank_retriever, "hybrid_and_rerank")
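To turn the logged outputs into aggregate metric scores, continuous-eval provides an evaluation runner. The exact invocation may differ by version (see the notebook for the exact calls); the sketch below assumes an EvaluationRunner that scores a saved PipelineLogger against the pipeline's metrics:

from continuous_eval.eval.runner import EvaluationRunner

# Assumed API (check the notebook / continuous-eval docs for the exact calls):
# reload one retriever's logged outputs and compute the pipeline metrics over them
pipelog = PipelineLogger(pipeline=pipeline)
pipelog.load("vectorstore_RAG_outputs.jsonl")
evaluator = EvaluationRunner(pipeline)
results = evaluator.evaluate(pipelog)
print(results.aggregate())  # aggregate per-module metric scores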
Evaluation over this dataset shows that the Vectorstore-based RAG system and the Hybrid and Rerank RAG system both have close to 90% recall, meaning ~90% of the relevant context was retrieved. The Hybrid and Rerank RAG is slightly better in recall and rank-aware metrics, but trades off precision by adding more noise to the retrieved context. The Keyword-based RAG system performed much worse in comparison.
If we look at the generation metrics, the three retrieval strategies actually yielded similar correctness and relevance. But we can see from faithfulness that much of the Keyword-based RAG's output is hallucinated. This is a dangerous situation where the LLM generates answers based on the knowledge it was trained on rather than the source material the RAG system is supposed to provide.
You can further analyze the distribution of results at a finer granularity, for example by inspecting performance per question type, per context type, and so on. Lastly, it is also important to increase the size of the dataset to obtain more statistically significant comparisons.