In our prior articles about RAG / LLM pipeline evaluation, we analyzed retrieval and generation as the two components of the pipeline. That, however, is an abstraction we used to simplify what are often much larger production RAG / LLM systems. Real-world pipelines typically contain many more “modules” and “steps” between a user query and the final system output than this simple two-step process.
Below is an example GenAI pipeline from one of our customers. As you can see, more than 10 modules are used to process the data from user input to the final module output (the image is blurred for confidentiality). LLM classifiers detect user intent and send queries down the right processing paths. Multiple types of retrievers produce the optimal context-retrieval results. LLM rerankers filter and compress the context before feeding it to the LLM. Agents call specific functions when necessary…
As you can probably imagine, just evaluating the final responses, or the two high-level components, is not enough to tell you what’s working and what’s not. If you want to understand what caused a poor system output, you need to trace back through multiple steps to inspect the intermediate outputs and judge their quality. You might go through that trouble a few times to spot-check anecdotally, but it is almost impossible to analyze these large pipelines at scale.
In this article we will show how to get more granular insights and how to tailor your auto-evaluators to measure and test each pipeline component. We will also show a complex RAG example.
But before we get there, why have AI application pipelines become so complicated? Aren’t LLMs supposed to be know-it-alls that you just need to prompt correctly to produce the right output?
It turns out that to use the full potential of an LLM’s reasoning and processing capabilities, you need to provide it with access to the right context, tools to interact with the environment, and perhaps another LLM to reason over intermediate outputs. You can think of an LLM as a worker with a decent level of general intelligence who still needs to be given the right knowledge, instructions, and tools to produce good work.
As a result, many production systems need a robust retrieval system, a variety of functions or tools to call, and multiple filtering / processing modules to provide helpful answers or actions.
To truly understand the capability of your GenAI application, it takes much more than evaluating the LLM itself: you need to make sure all the surrounding modules, and the way they interface with the LLM, are functioning correctly. Just as unit tests are written to test the components of software, you want tests at different levels of granularity to gain both detailed and high-level views of what’s working and what’s not.
An ideal evaluation framework needs to have the following:
In the latest version of continuous-eval (v0.3), we designed the framework to have these exact capabilities. Let’s walk through a case study below to see how it works.
In the RAG pipeline below, a query is first sent to a classifier to determine the intent of the question and is then passed through three separate retrievers. The Base Retriever runs a vector search in the vector database, the BM25 Retriever uses keyword search over the documents, and a HyDE generator creates hypothetical context documents and then retrieves semantically similar chunks. A reranker, which uses Cohere, reorders and compresses the retrieved chunks based on relevance, and the result is finally fed into the LLM to produce an output.
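To make the data flow concrete, below is a hypothetical sketch of the application side of this pipeline. Every helper in it is a placeholder for your own components (it is not continuous-eval code); the point is simply that each module’s intermediate output is kept around, because that is exactly what we will log and evaluate later.

# Placeholder components: swap these for your real classifier, retrievers, reranker and LLM.
classify_intent = lambda q: "factual_question"            # LLM classifier
vector_search = lambda text: []                           # vector-DB retriever
bm25_search = lambda q: []                                # BM25 keyword retriever
generate_hypothetical_doc = lambda q: "hypothetical doc"  # HyDE generator
rerank = lambda q, docs: docs                             # Cohere reranker
generate_answer = lambda q, docs, intent: "answer"        # final LLM call

def run_rag_pipeline(question: str) -> dict:
    intent = classify_intent(question)
    base_docs = vector_search(question)
    bm25_docs = bm25_search(question)
    hyde_doc = generate_hypothetical_doc(question)
    hyde_docs = vector_search(hyde_doc)
    reranked = rerank(question, base_docs + bm25_docs + hyde_docs)
    answer = generate_answer(question, reranked, intent)
    # Return every intermediate output, keyed by module name, so it can be logged for evaluation.
    return {
        "query_classifier": intent,
        "base_retriever": base_docs,
        "bm25_retriever": bm25_docs,
        "HyDE_generator": hyde_doc,
        "HyDE_retriever": hyde_docs,
        "cohere_reranker": reranked,
        "answer_generator": answer,
    }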
To build an evaluation Pipeline tailored to this complex RAG application, you can define the Modules as follows using continuous-eval.
from typing import Dict, List

from continuous_eval.eval import Module, Pipeline, Dataset, ModuleOutput

# Type alias for retriever outputs: a list of document chunks (dicts of string fields)
Documents = List[Dict[str, str]]

dataset = Dataset("data/eval_golden_dataset")

classifier = Module(
    name="query_classifier",
    input=dataset.question,
    output=str,
)

base_retriever = Module(
    name="base_retriever",
    input=dataset.question,
    output=Documents,
)

bm25_retriever = Module(
    name="bm25_retriever",
    input=dataset.question,
    output=Documents,
)

hyde_generator = Module(
    name="HyDE_generator",
    input=dataset.question,
    output=str,
)

hyde_retriever = Module(
    name="HyDE_retriever",
    input=hyde_generator,
    output=Documents,
)

reranker = Module(
    name="cohere_reranker",
    input=(base_retriever, hyde_retriever, bm25_retriever),
    output=Documents,
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
)

pipeline = Pipeline(
    [classifier, base_retriever, hyde_generator, hyde_retriever, bm25_retriever, reranker, llm],
    dataset=dataset,
)
To select the appropriate metrics and tests for a module, you can use the eval and tests fields. Let’s use the answer_generator module as an example and add three metrics and two tests to it.
from continuous_eval.metrics.generation.text import (
    FleschKincaidReadability,
    DebertaAnswerScores,
    LLMBasedFaithfulness,
)
from continuous_eval.eval.tests import GreaterOrEqualThan

# Selector that extracts the text content ("page_content") of each retrieved document
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DebertaAnswerScores().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
        LLMBasedFaithfulness().use(
            answer=ModuleOutput(),
            retrieved_context=ModuleOutput(DocumentsContent, module=reranker),
            question=dataset.question,
        ),
    ],
    tests=[
        GreaterOrEqualThan(
            test_name="Readability", metric_name="flesch_reading_ease", min_value=20.0
        ),
        GreaterOrEqualThan(
            test_name="Deberta Entailment",
            metric_name="deberta_answer_entailment",
            min_value=0.8,
        ),
    ],
)
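The same pattern works for every other module. For example, to diagnose retriever quality (which will matter in the results below), you could attach retrieval metrics to the base_retriever module. The sketch below uses PrecisionRecallF1 and RankedRetrievalMetrics from continuous_eval.metrics.retrieval; the ground_truth_context field is an assumption about what your golden dataset contains.

from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics

base_retriever = Module(
    name="base_retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        # Chunk-level precision / recall / F1 against the golden contexts
        PrecisionRecallF1().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_context,  # assumed dataset field
        ),
        # Rank-aware retrieval metrics (e.g. MAP, MRR, NDCG) over the same data
        RankedRetrievalMetrics().use(
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_context,  # assumed dataset field
        ),
    ],
)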
Once you have the pipeline set up, you can use an eval_manager to log all the intermediate steps.
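Before the loop below, the eval_manager needs to be pointed at the pipeline defined earlier. A minimal setup sketch, assuming the singleton manager exposed by the library (check the docs of your continuous-eval version for the exact import path):

from continuous_eval.eval.manager import eval_manager  # assumed import path

eval_manager.set_pipeline(pipeline)  # register the evaluation pipeline defined above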
eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]  # get the question or any other field
    # run your pipeline ...
    eval_manager.log("reranker", reranked_docs)  # log each module's intermediate output by name
    # ...
    eval_manager.next_sample()
Finally, you can run the evaluation and tests.
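A sketch of this final step, assuming the eval_manager exposes run_metrics and run_tests helpers (treat the exact method names as assumptions and check the library docs):

eval_manager.run_metrics()  # compute every metric attached to each module (assumed API)
eval_manager.run_tests()    # evaluate the pass/fail tests defined on top of those metrics (assumed API)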
Below is an example run on the pipeline. In this visualized output, you can see that the final answer is generally faithful, relevant, and stylistically consistent. However, it is only correct 70% of the time. In this case you can trace back the performance at each module (it looks like one of the retrievers is suffering in recall, which is worth investigating).
To check out more complete examples, we have created four examples with complete code at github.com/relari-ai/examples. The applications themselves are built using LlamaIndex and LangChain and have continuous-eval evaluators built in.
Here’s the link to the open-source continuous-eval: github.com/relari-ai/continuous-eval