January 29, 2024

How Important is a Golden Dataset for LLM Evaluation?

Technical guide

Thank you to all of those who read our prior posts on the Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval, Part 2: Generation).

In this article we want to address a key topic many of you reached out to discuss: whether you need a reference dataset for evaluation. The short answer is YES, and we will walk through the pros and cons of different LLM evaluation methodologies. We will also briefly touch on how to curate a golden dataset (which can start with synthetic generation) and what makes a golden dataset golden.

Three ways to evaluate your LLM pipelines

1) Reference-Free

Easiest to get started with (5 minutes to first results), but it only offers partial and often inconsistent insights into the pipeline.

2) Synthetic-Dataset-Based

A shortcut to consistent and complete metrics. The dataset can be biased and unrepresentative at first, but it can be improved over time.

3) Golden-Dataset-Based

Best approach! Takes time to curate initially but offers the most reliable and comprehensive insights into your LLM pipeline.

Our Recommendation

  • Start getting fast, complete, and directional insights using reference-free and synthetic-dataset-based metrics.
  • Curate and improve the Golden Dataset over time (using the Synthetic Dataset as a baseline) to get the most reliable evaluation pipeline.

Reference-Free Evaluation

Reference-free evaluation means that you are not evaluating against any benchmark or expected output. For example, you can directly compare the question and the answer and assess whether the answer directly addresses the question (answer_relevance).
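To make this concrete, here is a minimal sketch of a reference-free answer_relevance check that uses an LLM as a judge. The call_llm helper is a placeholder for whichever client you use to reach your LLM provider (it is not part of any specific library), and the 1-to-5 rubric is just one reasonable choice.

```python
# A minimal sketch of a reference-free answer_relevance check using an LLM as a judge.
# `call_llm` is a placeholder for your own LLM client, not part of any specific library.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your LLM provider (OpenAI, Anthropic, ...) here")

JUDGE_PROMPT = """You are grading whether an answer addresses a question.
Question: {question}
Answer: {answer}
Reply with a single digit from 1 (irrelevant) to 5 (fully addresses the question)."""

def answer_relevance(question: str, answer: str) -> float:
    """Return a 0-1 score for how directly the answer addresses the question."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip()[0])   # naive parsing; harden this for production use
    return (score - 1) / 4        # normalize the 1-5 rubric to 0-1
```

Note that nothing here requires a reference answer or reference contexts, which is exactly what makes this class of metrics so easy to start with.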

Pros:

  • Quick to start. It takes less than 5 minutes to get the first evaluation results.
  • Production monitoring. Reference-free metrics can be used to monitor production data and track trends. Metrics such as faithfulness don't require any references and are very helpful to watch over production traffic; a minimal sketch of such a check follows this list.
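Below is a rough, purely deterministic faithfulness proxy of the kind you could run cheaply over production traffic: the share of answer sentences whose tokens are mostly covered by the retrieved context. It is an illustration only; libraries such as continuous-eval provide more principled faithfulness metrics.

```python
# A rough, deterministic faithfulness proxy suitable for cheap production monitoring:
# the share of answer sentences whose tokens are mostly covered by the retrieved context.
# Purely illustrative; dedicated evaluation libraries offer more principled implementations.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def faithfulness(answer: str, retrieved_contexts: list[str], threshold: float = 0.7) -> float:
    """Fraction of answer sentences supported by the retrieved chunks (token overlap)."""
    ctx_tokens = _tokens(" ".join(retrieved_contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sent_tokens = _tokens(sentence)
        if not sent_tokens or len(sent_tokens & ctx_tokens) / len(sent_tokens) >= threshold:
            supported += 1
    return supported / len(sentences)
```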

Cons:

  • Partial insights. It only offers a partial look at the pipeline, painting an incomplete picture of its performance. There's no way to identify common issues such as the retrieval pipeline not returning all the necessary chunks (context_recall).
  • Lack of consistency. We see many evaluation pipelines apply reference-free metrics to different datasets across runs. This happens frequently when people try to use ever-changing production data for offline evaluation. The problem with this approach is that you cannot objectively compare results or draw conclusions about pipeline performance. As a result, optimizing over reference-free metrics on inconsistent data is like shooting at a moving target.

Synthetic-Dataset-Based Evaluation

As we get into the realm of Reference-Dataset-Based Evaluation, many benefits come with this approach. A Synthetic Dataset, often generated by LLMs with creative prompting techniques, is essentially a shortcut to a Reference Dataset without spending much effort.

Pros:

  • Consistent measurement. This solves the consistency problem mentioned above. With a consistent dataset to measure against, you can truly compare different design choices, such as chunking strategy, retrieval algorithm, or prompting, and understand their impact.
  • Relatively easy to get started. The only input required to generate a Synthetic Dataset is a sample knowledge source (in the form of source documents or vector embeddings). A proper synthetic generation pipeline layers generation, transformation, filtering, and auditing to create Question-Context-Answer tuples of the expected output. The LLM evaluation library continuous-eval offers a simple_dataset_generator (link) as one example strategy for generating a diverse synthetic dataset; a simplified sketch of the generate-and-filter idea follows this list.
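The sketch below shows only the gist of a generate-and-filter pipeline. It is not continuous-eval's simple_dataset_generator; the prompt, the filters, and the call_llm helper are all illustrative placeholders.

```python
# The gist of a generate-and-filter synthetic data pipeline. This is NOT the
# continuous-eval simple_dataset_generator, just an illustration of the idea.
# `call_llm` is the same hypothetical provider wrapper used earlier in this post.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your LLM provider here")

GEN_PROMPT = """Given the passage below, write one question a user could realistically ask
that the passage answers, plus an answer grounded only in the passage.
Return JSON with keys "question" and "answer".
Passage:
{chunk}"""

def generate_synthetic_examples(chunks: list[str]) -> list[dict]:
    examples = []
    for chunk in chunks:
        raw = call_llm(GEN_PROMPT.format(chunk=chunk))
        try:
            qa = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # filter: drop malformed generations
        if len(qa.get("question", "")) < 10:
            continue                      # filter: drop trivial or empty questions
        examples.append({
            "question": qa["question"],
            "ground_truth_contexts": [chunk],
            "ground_truth_answers": [qa["answer"]],
        })
    return examples
```

Real pipelines add more layers (question transformation, deduplication, difficulty balancing, human auditing), which is exactly where the cons below come from.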

Cons:

  • Quality varies. Depending on the specific use case, LLMs may or may not be able to generate representative queries that your users would ask. LLMs are inherently probabilistic, so they are biased toward certain examples and may not generate examples with sufficient variety.
  • Still requires human-in-the-loop audits. Human audits are almost always required to sense-check that the generated examples are directionally aligned with reality and user expectations.
  • No standard methodology. There are many creative ways to generate synthetic RAG data, but there is no one-size-fits-all pipeline yet. This is an open research problem, and we will share more about strategies beyond the simple_dataset_generator (link) in a separate article.

Golden-Dataset-Based Evaluation

This is the best possible way to get a comprehensive assessment of a pipeline. In the context of RAG, a Golden Dataset includes human-verified Question-Context-Answer examples: the user question, the reference contexts that should be retrieved, and the expected answer(s). A sketch of what one record might look like is shown below.
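Here is one possible shape for a single Golden Dataset record, together with a reference-based metric (context_recall) that such references unlock. The field names and the example content are illustrative, not a required schema.

```python
# A sketch of what a single Golden Dataset record for RAG might look like, and one
# reference-based metric (context_recall) it unlocks. Field names and the example
# content are illustrative, not a required schema.

golden_example = {
    "question": "What is the refund window for annual plans?",
    "ground_truth_contexts": [
        "Annual plans can be refunded in full within 30 days of purchase.",
    ],
    "ground_truth_answers": ["Annual plans are fully refundable within 30 days."],
}

def context_recall(retrieved_contexts: list[str], ground_truth_contexts: list[str]) -> float:
    """Fraction of ground-truth chunks that appear among the retrieved chunks (exact match)."""
    if not ground_truth_contexts:
        return 1.0
    hits = sum(gt in retrieved_contexts for gt in ground_truth_contexts)
    return hits / len(ground_truth_contexts)
```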

Pros:

  • Most consistent and reliable measures. All metrics that reveal various aspects of pipeline health can be confidently applied.
  • Most complete insights. Not only can you evaluate against the most complete suite of metrics, but Golden Datasets also reveal the subtleties of what makes a good output good. It is often difficult to enumerate why users prefer certain answers over others; by comparing the LLM response against the gold standard, you can capture most of these nuances with creative metrics.

Cons:

  • Takes more time and effort. Usually this is a joint effort across Engineers, PMs, Domain Experts, and potentially Users. What makes a Golden Dataset golden is that you have confidence that good performance on the Golden Dataset in offline evaluation highly correlates with user satisfaction in production. To get there, it requires careful design and engineering.
  • Requires maintenance. A well-designed Golden Dataset needs to be a living entity, continually evolving to capture a diverse array of scenarios and user preferences, ensuring that the pipeline’s optimization aligns closely with genuine user requirements.

Our Recommendation

Start Today! A combination of Reference-Free and Synthetic-Dataset-Based Evaluation is a great starting point. You can get a consistent and complete view over your pipeline in a relatively short amount of time.

Build and improve the Golden Dataset Over Time. There's simply no substitute for a high-quality, human-verified Golden Dataset. Use the Synthetic Dataset as a baseline and leverage user data and human feedback to build a robust, diverse test set that you can use to optimize your pipeline. A minimal sketch of one curation workflow follows below.
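One simple way to operationalize this is to track where each example came from and whether a human has verified it, promoting only verified examples into the Golden Dataset. The dataclass and fields below are illustrative assumptions, not a prescribed format.

```python
# A minimal sketch of one way to grow a Golden Dataset from a Synthetic baseline.
# The dataclass and fields are illustrative assumptions, not a prescribed format.
from dataclasses import dataclass

@dataclass
class EvalExample:
    question: str
    ground_truth_contexts: list[str]
    ground_truth_answers: list[str]
    source: str = "synthetic"      # e.g. "synthetic", "production", "expert"
    human_verified: bool = False   # set to True after Engineer/PM/domain-expert review

def golden_subset(examples: list[EvalExample]) -> list[EvalExample]:
    """Only human-verified examples are promoted into the Golden Dataset."""
    return [ex for ex in examples if ex.human_verified]
```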

Want to see a real example?

In this separate post (link), we walk through a real-life case study comparing Reference-Free and Reference-Dataset-Based Evaluation, including code examples, evaluation results, and the insights each metric reveals.

Subscribe and stay tuned!

Written by
Yi Zhang