In the earlier blog post "How important is a Golden Dataset for LLM pipeline evaluation?", we discussed the pros and cons of Reference-free evaluation versus Reference-based evaluation (using a Synthetic Dataset or a Golden Dataset).
To illustrate the differences between the two, we will walk through a concrete Enterprise Q&A example in a classic RAG pipeline.
You will see how Reference-based evaluation, used on top of the Reference-free metrics, is what gives you a holistic view of your pipeline's performance.
Question: What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?
Retrieved Context: Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly.
LLM-Generated Answer: As of early 2023, Windsor Corp’s initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm.
from continuous_eval.metrics import (
    LLMBasedAnswerRelevance,
    LLMBasedContextPrecision,
    LLMBasedFaithfulness,
    DeterministicFaithfulness,
    FleschKincaidReadability,
    BertAnswerRelevance,
)
datum = {
    "question": "What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?",
    "retrieved_contexts": [
        "Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly."
    ],
    "answer": "As of early 2023, Windsor Corp's initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm."
}
Reference_Free_Metrics = [
    LLMBasedAnswerRelevance(),
    BertAnswerRelevance(),
    LLMBasedContextPrecision(),
    LLMBasedFaithfulness(),
    DeterministicFaithfulness(),
    FleschKincaidReadability()
]
results = {}
# Run every metric on the same datum and merge the scores into one dict
for m in Reference_Free_Metrics:
    results.update(m.calculate(**datum))
{
    'LLM_based_answer_relevance': 1.0,
    'LLM_based_answer_relevance_reasoning': "The answer is relevant to the question and provides specific examples of Windsor Corp's latest sustainability initiatives aimed at reducing carbon emissions, such as transitioning to electric vehicles and partnering for an offshore wind farm.",
    'bert_answer_relevance': 0.7595956325531006,
    'LLM_based_context_precision': 1.0,
    'LLM_based_faithfulness': True,
    'LLM_based_faithfulness_reasoning': 'The statement is supported by the context, which mentions that Windsor Corp announced the shift to electric vehicles for their company fleet and a partnership with North Carolina to create an offshore wind farm.',
    'rouge_faithfulness': 1.0,
    'token_overlap_faithfulness': 1.0,
    'flesch_reading_ease': 17.968260869565228,
    'flesch_kincaid_grade_level': 16.46695652173913
}
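As a quick sanity check on the two readability numbers above, the sketch below plugs counts into the standard Flesch reading ease and Flesch-Kincaid grade level formulas. The counts (1 sentence, 23 words, 45 syllables) are assumptions back-solved from the reported scores, not values taken from the library; continuous-eval's own tokenization and syllable counting may differ.

```python
def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch reading ease formula; higher means easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade_level(words, sentences, syllables):
    # Standard Flesch-Kincaid grade level formula (approximate US school grade).
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Assumed counts for the generated answer (back-solved, not from the library).
words, sentences, syllables = 23, 1, 45

ease = flesch_reading_ease(words, sentences, syllables)
grade = flesch_kincaid_grade_level(words, sentences, syllables)
print(round(ease, 2), round(grade, 2))
```

With these counts the formulas give roughly 17.97 and 16.47, matching the reported scores: a single long sentence packed with multi-syllable words reads as difficult, college-level text.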
Notably, we can draw a few conclusions from these metrics:
Unfortunately, the reference-free metrics alone are not enough to assess the true performance of the RAG pipeline.
In this case, the retrieval pipeline failed to retrieve another key piece of information: “Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030.” As a result, the LLM’s answer was incomplete.
Now, let’s take a look at how leveraging ground-truth data in the Golden Dataset can help us uncover this issue in the pipeline.
Question: What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?
Ground Truth Contexts:
In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm.
Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030.
Ground Truth Answer:
The latest sustainability initiatives of Windsor Corp in reducing carbon emissions include:
1. The shift to electric vehicles for their company fleet, announced in early 2023.
2. A partnership with North Carolina to create an offshore wind farm, also announced in early 2023.
3. The acquisition of Treeplanter, Inc. for $545 million in stock in July 2023. This acquisition is part of a strategic initiative to plant over 1 million trees by 2030.
from continuous_eval.metrics import (
    PrecisionRecallF1,
    RougeSentenceMatch,
    DebertaAnswerScores,
    LLMBasedAnswerCorrectness,
    LLMBasedContextCoverage,
    LLMBasedStyleConsistency,
)
datum_with_ground_truth = {
    "question": "What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?",
    "retrieved_contexts": [
        "Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly."
    ],
    "answer": "As of early 2023, Windsor Corp's initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm.",
    "ground_truth_contexts": [
        "In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm.",
        "Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030."
    ],
    "ground_truths": [
        "The latest sustainability initiatives of Windsor Corp in reducing carbon emissions include: "
        "1. The shift to electric vehicles for their company fleet, announced in early 2023."
        "2. A partnership with North Carolina to create an offshore wind farm, also announced in early 2023."
        "3. The acquisition of Treeplanter, Inc. for $545 million in stock in July 2023. This acquisition is part of a strategic initiative to plant over 1 million trees by 2030.",
    ]
}
Reference_Based_Metrics = [
    PrecisionRecallF1(RougeSentenceMatch()),
    DebertaAnswerScores(),
    LLMBasedAnswerCorrectness(),
    LLMBasedStyleConsistency()
]
results = {}
# Run every metric on the datum with ground truth and merge the scores into one dict
for m in Reference_Based_Metrics:
    results.update(m.calculate(**datum_with_ground_truth))
{
    'context_precision': 0.3333333333333333,
    'context_recall': 0.5,
    'context_f1': 0.4,
    'deberta_answer_entailment': 0.013443393632769585,
    'deberta_answer_contradiction': 3.2031916816777084e-06,
    'LLM_based_answer_correctness': 0.75,
    'LLM_based_answer_correctness_reasoning': 'The answer is relevant to the question and correct but is not complete as it omits the acquisition of Treeplanter, Inc. and the strategic initiative to plant over 1 million trees by 2030.',
    'LLM_based_style_consistency': 0.3333333333333333,
    'LLM_based_style_consistency_reasoning': 'The generated answer is less detailed and lacks the structured, itemized format of the reference answer.'
}
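To see where retrieval numbers like these could come from, here is a minimal sketch that recomputes precision, recall, and F1 for this example. It assumes the retrieved context is split into sentences and that a sentence counts as a match only when it appears verbatim in a ground-truth context. This exact-match rule is a simplified stand-in for `RougeSentenceMatch`, not continuous-eval's actual implementation, and the helper names `split_sentences` and `precision_recall_f1` are hypothetical.

```python
import re

retrieved_context = (
    "Windsor Corp, a rising player in renewable energy solutions, has been "
    "actively involved in reducing carbon emissions. In early 2023, they "
    "announced several initiatives including the shift to electric vehicles "
    "for their company fleet, and a partnership with North Carolina to create "
    "a offshore wind farm. These efforts have reportedly reduced their carbon "
    "footprint significantly."
)
ground_truth_contexts = [
    "In early 2023, they announced several initiatives including the shift to "
    "electric vehicles for their company fleet, and a partnership with North "
    "Carolina to create a offshore wind farm.",
    "Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July "
    "2023 and announced a strategic initiative of planting over 1 million "
    "trees by 2030.",
]


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows a period.
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def precision_recall_f1(retrieved, ground_truths):
    sentences = split_sentences(retrieved)
    # A retrieved sentence is relevant if some ground-truth context contains it.
    relevant = [s for s in sentences if any(s in gt for gt in ground_truths)]
    # A ground-truth context is covered if the retrieved text contains it.
    covered = [gt for gt in ground_truths if gt in retrieved]
    precision = len(relevant) / len(sentences) if sentences else 0.0
    recall = len(covered) / len(ground_truths) if ground_truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(retrieved_context, ground_truth_contexts)
print(precision, recall, f1)
```

Under these assumptions, one of the three retrieved sentences is relevant (precision 1/3) and one of the two ground-truth contexts is retrieved (recall 1/2), so F1 = 2 · (1/3) · (1/2) / (1/3 + 1/2) = 0.4, in line with the reported values. The low recall is exactly the signal that the Treeplanter context never reached the LLM.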
Side note: we don’t explain each metric in detail in this post, but feel free to take a look at the continuous-eval docs, which detail how each metric is defined and what it reveals about your LLM pipeline.