January 29, 2024

Case Study: Reference-free vs Reference-based evaluation of a RAG pipeline


In the earlier blog post How important is a Golden Dataset for LLM pipeline evaluation?, we discussed the pros and cons of Reference-free evaluation versus Reference-based evaluation (using a Synthetic Dataset or a Golden Dataset).

To best illustrate the differences between the two, we will walk through a specific Enterprise Q&A example in a classic RAG pipeline.

You will see how only Reference-based evaluation, layered on top of Reference-free metrics, can give you a holistic view of your pipeline's performance.

How do you get a holistic view of your LLM pipeline's performance? In the story of the blind men and an elephant, each blind man touched a part of the elephant and arrived at a different conclusion about what an elephant is like, missing the complete picture. (Illustration by Leremy Stick Figures)

Example Datum

Question: What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?

Retrieved Context: Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create an offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly.

LLM-Generated Answer: As of early 2023, Windsor Corp’s initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm.

Code snippet to run select Reference-FREE metrics

from continuous_eval.metrics import (
    LLMBasedAnswerRelevance,
    BertAnswerRelevance,
    LLMBasedContextPrecision,
    LLMBasedFaithfulness,
    DeterministicFaithfulness,
    FleschKincaidReadability,
)

datum = {
    "question": "What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?",
    "retrieved_contexts": [
        "Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly."
    ],
    "answer": "As of early 2023, Windsor Corp's initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm."
}

Reference_Free_Metrics = [
    LLMBasedAnswerRelevance(),
    BertAnswerRelevance(),
    LLMBasedContextPrecision(),
    LLMBasedFaithfulness(),
    DeterministicFaithfulness(),
    FleschKincaidReadability()
]

# Run each Reference-free metric on the datum and merge the outputs into one dict
results = {}
for m in Reference_Free_Metrics:
    results.update(m.calculate(**datum))

Reference-FREE evaluation results:

{
  'LLM_based_answer_relevance': 1.0, 
  'LLM_based_answer_relevance_reasoning': "The answer is relevant to the question and provides specific examples of Windsor Corp's latest sustainability initiatives aimed at reducing carbon emissions, such as transitioning to electric vehicles and partnering for an offshore wind farm.", 
  'bert_answer_relevance': 0.7595956325531006, 
  'LLM_based_context_precision': 1.0, 
  'LLM_based_faithfulness': True, 
  'LLM_based_faithfulness_reasoning': 'The statement is supported by the context, which mentions that Windsor Corp announced the shift to electric vehicles for their company fleet and a partnership with North Carolina to create an offshore wind farm.', 
  'rouge_faithfulness': 1.0, 
  'token_overlap_faithfulness': 1.0, 
  'flesch_reading_ease': 17.968260869565228, 
  'flesch_kincaid_grade_level': 16.46695652173913
}

Notably, we can draw a few conclusions from these metrics:

  • The retrieved context seems fully relevant to the question according to LLM_based_context_precision
  • The answer is relevant to the question based on LLM_based_answer_relevance and bert_answer_relevance
  • There is no hallucination according to multiple faithfulness metrics
  • The answer demands a relatively high reading level according to flesch_reading_ease (see the sketch of the Flesch formulas below)
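
For intuition on the readability numbers, here is a minimal sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter below is a crude vowel-group heuristic of our own, not the library's implementation, so it will not exactly reproduce the scores above:

import re

def flesch_scores(text: str) -> tuple[float, float]:
    # Count sentence-ending punctuation runs as sentence boundaries.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    # Crude heuristic (assumption): count each vowel group as one syllable.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

print(flesch_scores(datum["answer"]))  # low ease / high grade -> hard to read

One long, clause-heavy sentence full of multi-syllable words is enough to push the grade level into the college range, which is what the reported scores indicate.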

Overall, the pipeline seems to be doing well, right?

Unfortunately, the reference-free metrics alone are not enough to assess the true performance of the RAG pipeline.

In this case, the retrieval pipeline failed to retrieve another key piece of information: "Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030." As a result, the LLM did not answer the question completely.

Now, let's take a look at how leveraging the ground-truth data in the Golden Dataset can help us uncover this issue in the pipeline.

Example Datum with Ground Truth

Question: What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?

Ground Truth Contexts:
1. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create an offshore wind farm.
2. Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030.

Ground Truth Answer:
The latest sustainability initiatives of Windsor Corp in reducing carbon emissions include:
1. The shift to electric vehicles for their company fleet, announced in early 2023.
2. A partnership with North Carolina to create an offshore wind farm, also announced in early 2023.
3. The acquisition of Treeplanter, Inc. for $545 million in stock in July 2023. This acquisition is part of a strategic initiative to plant over 1 million trees by 2030.

Code snippet to run select Reference-BASED metrics

from continuous_eval.metrics import (
    PrecisionRecallF1,
    RougeSentenceMatch,
    DebertaAnswerScores,
    LLMBasedAnswerCorrectness,
    LLMBasedContextCoverage,
    LLMBasedStyleConsistency,
)

datum_with_ground_truth = {
    "question": "What are the latest sustainability initiatives of Windsor Corp in reducing carbon emissions?",
    "retrieved_contexts": [
        "Windsor Corp, a rising player in renewable energy solutions, has been actively involved in reducing carbon emissions. In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm. These efforts have reportedly reduced their carbon footprint significantly."
    ],
    "answer": "As of early 2023, Windsor Corp's initiatives include transitioning to electric vehicles, and partnering with North Carolina for an offshore wind farm.",
    "ground_truth_contexts": [
        "In early 2023, they announced several initiatives including the shift to electric vehicles for their company fleet, and a partnership with North Carolina to create a offshore wind farm.",
        "Windsor Corp acquired Treeplanter, Inc. for $545 million in stock in July 2023 and announced a strategic initiative of planting over 1 million trees by 2030."
    ],
    "ground_truths": [
        "The latest sustainability initiatives of Windsor Corp in reducing carbon emissions include: \
            1. The shift to electric vehicles for their company fleet, announced in early 2023.\
            2. A partnership with North Carolina to create an offshore wind farm, also announced in early 2023.\
            3. The acquisition of Treeplanter, Inc. for $545 million in stock in July 2023. This acquisition is part of a strategic initiative to plant over 1 million trees by 2030.",
    ]
}

Reference_Based_Metrics = [
    PrecisionRecallF1(RougeSentenceMatch()),
    DebertaAnswerScores(), 
    LLMBasedAnswerCorrectness(),
    LLMBasedStyleConsistency()
]

# Run each Reference-based metric on the datum and merge the outputs into one dict
results = {}
for m in Reference_Based_Metrics:
    results.update(m.calculate(**datum_with_ground_truth))

Reference-BASED evaluation results

{
  'context_precision': 0.3333333333333333, 
  'context_recall': 0.5, 
  'context_f1': 0.4, 
  'deberta_answer_entailment': 0.013443393632769585, 
  'deberta_answer_contradiction': 3.2031916816777084e-06, 
  'LLM_based_answer_correctness': 0.75, 
  'LLM_based_answer_correctness_reasoning': 'The answer is relevant to the question and correct but is not complete as it omits the acquisition of Treeplanter, Inc. and the strategic initiative to plant over 1 million trees by 2030.', 
  'LLM_based_style_consistency': 0.3333333333333333, 
  'LLM_based_style_consistency_reasoning': 'The generated answer is less detailed and lacks the structured, itemized format of the reference answer.'
}

These ground-truth metrics now clearly reveal the issues in the pipeline and offer much richer insight into its performance.

  • context_recall is only 50%, indicating that only half of the information required to answer the question was retrieved
  • context_precision reveals the true signal-to-noise ratio of the retrieved chunk (the sketch below walks through the arithmetic behind these three retrieval numbers)
  • LLM_based_answer_correctness points out that the answer is incomplete, missing the key acquisition
  • deberta_answer_entailment is low, indicating that the answer is logically insufficient to imply the ground-truth answer
  • LLM_based_style_consistency points out that the LLM's answer is poorly aligned with the structure and format of the desired output

Side note: we don't explain each metric in detail in this blog post, but feel free to take a look at the continuous-eval docs, which detail how each metric is defined and what information it reveals about your LLM pipeline.
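
For intuition on the three retrieval numbers, here is a minimal sketch of the arithmetic behind sentence-level precision/recall/F1, simplifying the ROUGE-based matching down to a plain count of matched sentences. The retrieved chunk splits into three sentences, exactly one of which matches one of the two ground-truth sentences:

matched = 1             # retrieved sentences that match a ground-truth sentence
retrieved_total = 3     # sentences in the retrieved chunk
ground_truth_total = 2  # sentences across the ground-truth contexts

context_precision = matched / retrieved_total   # 1/3 ≈ 0.333
context_recall = matched / ground_truth_total   # 1/2 = 0.5
context_f1 = 2 * context_precision * context_recall / (
    context_precision + context_recall)         # 0.4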

Takeaways from this example:

  • Reference-FREE metrics offer important but incomplete insights into the RAG pipeline
  • Certain key issues can only be revealed with Reference-BASED metrics
  • Use a combination of Reference-FREE and Reference-BASED metrics to get a complete check on your pipeline's health in offline evaluation (a minimal sketch follows below)
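
To make the last takeaway concrete, here is a minimal sketch of such a combined offline check. It reuses the two metric lists defined above; golden_dataset is a hypothetical list of records shaped like datum_with_ground_truth, and we assume each metric's calculate() simply ignores fields it doesn't need, as the snippets above suggest:

# Combine both metric suites into one offline evaluation pass
all_metrics = Reference_Free_Metrics + Reference_Based_Metrics

pipeline_results = []
for row in golden_dataset:  # hypothetical golden dataset (list of dicts)
    row_results = {}
    for m in all_metrics:
        row_results.update(m.calculate(**row))  # assumes extra fields are ignored
    pipeline_results.append(row_results)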

Written by
Yi Zhang