In this blog, we walk through the popular topic of “Synthetic Data” in the context of LLM testing and evaluation: how application-specific synthetic data is generated, how it is used to test LLM pipelines and agents, and how to keep its quality high over time.
Note that synthetic data also plays an increasingly important role in model training / fine-tuning processes — a fascinating topic which we will explore in separate articles.
Data-driven evaluation is critical for getting a high-quality, consistent, and comprehensive assessment of an AI system’s performance. We covered this topic in one of our previous posts (How important is a golden dataset for LLM pipeline evaluation).
There are several options for collecting an evaluation dataset: use a publicly available dataset, manually collect and label data, or generate synthetic data. The challenge with public datasets is that they are not specific enough (and have probably been used to train your model), while tailored human-labeled datasets take a lot of time and effort to create. Synthetic data is a great alternative, combining speed with quality. It can sometimes even cover more granular and complex cases than humans can.
In practice, to enable high-quality synthetic data, at least a small sample of human-labeled data is required: it serves as the seed for the synthetic data generation pipeline to build upon and enables the volume and diversity we need for good evaluation (as we will see later).
A synthetic dataset essentially replaces human annotation in the process of creating large-scale reference datasets. The data is used as ground truth (in other words, as a reference) to test an LLM application pipeline.
An analogy to understand how synthetic data is used for testing is the following: imagine that we are (synthetically) generating the exam rubric we want our AI to pass. The exam rubric would include questions, correct answers, and the correct intermediate steps to get to each answer (for example, from which page of the book the answer can be found). We then give the same exam questions to our AI application and see how well the generated answers match the expected answers. Using these questions, we evaluate results to measure the capability of the AI application and find the wrong answers and steps to improve upon.
It’s evident that we have two steps in this workflow: 1) generate the application-specific synthetic data, and 2) test the system capabilities. Let’s look at each step in more detail:
The first step to generating high-fidelity, application-specific synthetic data is to define what the data should look like. This includes specifying three aspects:
Let’s walk through an example RAG application for Wikipedia articles. The application is a chatbot that 1) takes natural language questions from users, 2) runs a retrieval step on the Wikipedia corpus to find relevant documents, and lastly 3) uses an LLM to generate an answer based on the retrieved context. In this case the inputs are:
The Relari synthetic data generation pipeline then takes all this information and generates synthetic data samples such as the one below. Each sample contains:
- Question
- Source URL
- Source Context
- Reference Answer
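To make the shape of such a sample concrete, here is a minimal sketch of how a single record could be represented in Python. The field names mirror the list above; the dataclass, the example values, and the specific Wikipedia page are illustrative rather than Relari’s actual schema.

```python
from dataclasses import dataclass


@dataclass
class RAGTestSample:
    """One synthetic test record for the Wikipedia RAG chatbot (illustrative schema)."""
    question: str               # natural-language question a user might ask
    source_urls: list[str]      # Wikipedia page(s) the answer should come from
    source_contexts: list[str]  # the exact passages that support the answer
    reference_answer: str       # the ground-truth answer to compare against


sample = RAGTestSample(
    question="In which year did Marconi claim the first transatlantic radio transmission?",
    source_urls=["https://en.wikipedia.org/wiki/Guglielmo_Marconi"],
    source_contexts=["In December 1901, Marconi announced that he had received the first transatlantic radio signal..."],
    reference_answer="Marconi claimed the first transatlantic radio transmission in December 1901.",
)
```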
It is important to note that synthetic data generators (like the one we developed at Relari) should leverage a combination of deterministic methods, classic machine learning models, and LLMs to create highly customized data with sufficient fidelity and diversity.
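As a rough illustration of that idea (a sketch, not Relari’s actual pipeline), the snippet below pairs a deterministic step (chunking and sampling source passages) with an LLM call that turns each passage into a question/answer pair; a classic-ML step such as an embedding-based deduplication filter would typically follow. The `llm` client and its return format are hypothetical.

```python
import random


def chunk(text: str, size: int = 800) -> list[str]:
    # Deterministic step: split a source document into fixed-size passages.
    return [text[i:i + size] for i in range(0, len(text), size)]


def generate_qa(passage: str, llm) -> dict:
    # LLM step: ask the model for a question answerable solely from the passage, plus the answer.
    prompt = (
        "Write one question that can be answered only from the passage below, "
        "followed by the correct answer.\n\nPassage:\n" + passage
    )
    question, answer = llm(prompt)  # hypothetical client returning a (question, answer) pair
    return {"question": question, "reference_answer": answer, "source_context": passage}


def build_dataset(documents: list[str], llm, samples_per_doc: int = 3) -> list[dict]:
    dataset = []
    for doc in documents:
        passages = chunk(doc)
        for passage in random.sample(passages, k=min(samples_per_doc, len(passages))):
            dataset.append(generate_qa(passage, llm))
    # A classic-ML filter (e.g. embedding-based near-duplicate removal) could prune the result here.
    return dataset
```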
After the synthetic dataset is generated, we can run tests with it. The specific workflow is:
1) Run each Question from the synthetic dataset through the application pipeline.
2) Evaluate the retrieval step (by comparing the retrieved documents to the Source URL and Source Context).
3) Evaluate the generated Answer (by comparing it to the Reference Answer).
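A minimal version of that workflow could look like the loop below, reusing the `RAGTestSample` sketch from earlier. `rag_app` is a stand-in for your application (returning the retrieved URLs and the generated answer), and the two metrics shown, retrieval hit rate and a token-overlap answer score, are simple placeholders for whatever retrieval and answer-correctness metrics you actually use.

```python
def evaluate(dataset: list[RAGTestSample], rag_app) -> dict:
    """Run every synthetic Question through the app and score retrieval and answer quality."""
    retrieval_hits, answer_scores = 0, []
    for sample in dataset:
        retrieved_urls, answer = rag_app(sample.question)  # hypothetical application interface
        # Retrieval check: did the app fetch at least one of the reference source pages?
        if any(url in retrieved_urls for url in sample.source_urls):
            retrieval_hits += 1
        # Answer check: crude token overlap; swap in an LLM judge or semantic similarity metric.
        ref = set(sample.reference_answer.lower().split())
        gen = set(answer.lower().split())
        answer_scores.append(len(ref & gen) / max(len(ref), 1))
    return {
        "retrieval_hit_rate": retrieval_hits / len(dataset),
        "avg_answer_overlap": sum(answer_scores) / len(dataset),
    }
```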
In the same way that just a few questions would not be enough to assess the knowledge and capabilities of a student, we leverage a large number (hence large-scale) of questions to assess whether a system is able to achieve the performance we expect. The combination of multiple metrics and the synthetic data enables developers to pinpoint specific shortcomings of the system.
A third step, after the initial version of the synthetic test data is generated, is continuous iteration. Although an initial set of seed examples is fed into the system, it may not stay representative over time. For many AI applications, the way users interact with them shifts over time (think about your first 10 questions to ChatGPT vs. what you use it for on a daily basis now).
Therefore, it is important to monitor production data to make sure drift in the real-world data distribution is reflected in the test set. We talked about how to leverage user feedback to improve evaluation metrics in this article (How to make the most out of LLM production data: Simulated User Feedback). We will talk more about how to use Production Monitoring to continuously improve your synthetic test data pipeline in another article.
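One lightweight way to watch for that drift (a generic sketch, not a Relari feature) is to compare the embedding distribution of recent production questions against the synthetic test set and re-seed the generator when they diverge. The model name and the centroid-comparison approach below are assumptions for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")


def drift_score(synthetic_questions: list[str], production_questions: list[str]) -> float:
    """Cosine distance between the centroids of the two question sets (a crude drift proxy)."""
    syn = encoder.encode(synthetic_questions, normalize_embeddings=True).mean(axis=0)
    prod = encoder.encode(production_questions, normalize_embeddings=True).mean(axis=0)
    return 1.0 - float(np.dot(syn, prod) / (np.linalg.norm(syn) * np.linalg.norm(prod)))

# If the score exceeds a threshold you trust, add recent production questions to the seed set
# and regenerate part of the synthetic dataset.
```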
Agent workflows are notoriously difficult to test and improve. Synthetic data can be leveraged to create granular functional tests and end-to-end tests to evaluate the capabilities of agents.
For example, below is a sample of synthetic data generated to test a coding agent’s accuracy in making specific changes to a given repository based on natural language instructions.
The inputs to the Generator include:
The above synthetic data contains the following:
- Instruction
- Source Repo
- Reference Diff
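For concreteness, a single record of that kind could be represented as below. The schema and the whitespace-normalized diff comparison are illustrative only; in practice, evaluating a coding agent usually means applying the generated patch and running the repository’s tests rather than requiring an exact diff match.

```python
from dataclasses import dataclass


@dataclass
class CodingAgentSample:
    """One synthetic test case for a coding agent (illustrative fields)."""
    instruction: str     # natural-language change request
    source_repo: str     # repository (and commit) the agent starts from
    reference_diff: str  # the diff an ideal agent would produce


def diff_matches(generated_diff: str, reference_diff: str) -> bool:
    # Naive check: strip trailing whitespace and compare line by line.
    def normalize(diff: str) -> list[str]:
        return [line.rstrip() for line in diff.strip().splitlines()]
    return normalize(generated_diff) == normalize(reference_diff)
```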
Synthetic data for agents can help developers identify where performance is an issue, and it also helps ensure that performance does not degrade as the system continues to evolve.
You can also generate synthetic data that tests an LLM agent’s ability to complete an end-to-end task comprised of multiple sub-tasks.
For example, the following synthetic data is generated to test an SEC analyzer agent’s ability to fetch the right SEC documents, find the right data sources, make the correct calculations, and output the right qualitative analysis.
The inputs to the Generator include:
The above synthetic data contains the following:
- Question
- Question Type
- SEC Filing URL(s)
- Source Data
- Calculations
- Reference Answer
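As with the coding agent, each end-to-end record can carry the intermediate artifacts alongside the final answer, so every sub-task can be scored separately. The sketch below is an illustrative schema, not the exact format produced by the generator; in particular, the representation of the source data and calculations is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SECAnalysisSample:
    """One synthetic end-to-end test case for the SEC analyzer agent (illustrative fields)."""
    question: str                # e.g. a question about a company's revenue growth
    question_type: str           # category such as "quantitative" or "qualitative"
    sec_filing_urls: list[str]   # filings the agent is expected to fetch
    source_data: dict[str, str]  # figures or facts that must be extracted from those filings
    calculations: str            # the arithmetic the agent should perform on the extracted figures
    reference_answer: str        # the expected final analysis
```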
With this granular data, you can now evaluate the agent’s ability to execute an end-to-end task with all the required intermediate steps.
How do you ensure synthetic data is realistic, diverse, and accurate enough? If the quality of your synthetic data is not high enough, you are just optimizing for the wrong things.
In the beginning, it is always recommended to manually spot-check the quality of the synthetic data, prune out bad samples, and iteratively feed more good examples for the pipeline to learn to generate higher-quality results.
There are also systematic ways to evaluate the quality of the synthetic data generated and measure the distance between the synthetic data and the real data in production. A good synthetic data pipeline should be able to use that information to continue to improve its quality.
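One simple automated check of that kind (purely illustrative) is to prune generated samples whose reference answer is not actually grounded in their source context; measuring the distance to real production data can follow the same pattern as the drift sketch shown earlier.

```python
def is_grounded(sample: RAGTestSample, min_overlap: float = 0.6) -> bool:
    """Crude groundedness filter: most answer tokens should appear somewhere in the source context."""
    answer_tokens = set(sample.reference_answer.lower().split())
    context_tokens = set(" ".join(sample.source_contexts).lower().split())
    if not answer_tokens:
        return False
    return len(answer_tokens & context_tokens) / len(answer_tokens) >= min_overlap


# dataset = [s for s in dataset if is_grounded(s)]  # prune samples that fail the check
```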
At the end of the day, Synthetic Data is not a total replacement for human-labeled datasets, but rather a powerful complement. If you can label 100 examples manually, you can 10x that easily with synthetic data to cover more diverse test cases at higher volume.
Check out this Case Study, which includes a Python notebook where you can try the open-source version of the synthetic data generation for yourself!
If you are interested in setting up your custom synthetic data generation pipeline through Relari.ai, please reach out!