The world of LLM app development is always on the move — new tricks, models, and apps pop up every week. As tech gets better, what users expect keeps ramping up. Staying ahead in this game is key to making sure it’s the one users keep coming back to.
The problem now becomes: how do you measure performance improvements? When you’re fiddling with prompts, tweaking the temperature, or switching up models, do you ever pause and think, “Will my users actually like this more?”
In this post, we’ll walk through how in-app user feedback from earlier deployments (or internal human evaluation) can be instrumental in quickly shaping future versions of a product. We’ll discuss the limitations of traditional feedback mechanisms and introduce a new technique that allows AI developers to use feedback data directly in offline testing and iterations (before a new deployment), making the development cycle more adaptable and responsive to user preferences.
When developing LLM-based applications, we are often faced with a particular problem we want to address, e.g., a specific type of question has a low accuracy. As we experiment with tweaks in prompts, parameters, architecture, etc., we want to evaluate performance of the new pipeline, in particular whether users will like the new version(s) of your AI application. The most straightforward way is to A/B test each change with the end users and collect their feedback data such as thumbs up / down, score rating, or written comments, but practically it is challenging for a few reasons:
In the next section, we’ll introduce a novel approach that we’ve deployed to multiple customers to help them make the most out of their user feedback data in offline development.
In response to these challenges in collecting user feedback, we have developed a novel approach to simulate user feedback using a small sample of user (or internally labeled) feedback data. Specifically, we use metric ensembling and conformal prediction to learn user preferences and use them offline during the development phase. At its core, we learn how users weigh different criteria (e.g., tone, conciseness, etc) and leverage conformal prediction to provide predictions to quantify confidence. This method drastically accelerates LLM app development by providing a way to anticipate how users might react to new features or changes before they are fully implemented.
To evaluate its effectiveness, we compared this approach with the more conventional one of using a single LLM call that assesses different aspects of the response to make a judgment. To compare the two alternatives (the proposed approach vs. the single LLM call), we conducted an experiment using the Unified-Feedback dataset. We used Kendall’s tau, a measure of rank correlation, to compare the rankings produced by our user feedback simulation and the single LLM call approach against the ground truth established by human evaluations. This analysis allows us to assess not only the degree of agreement, but also the order of preference that each method predicts compared to the human rankings.
Our experiment revealed that the user feedback simulation has a correlation of 93% that significantly exceeded that of the single LLM call approach, which attains roughly 70% correlation. This indicates that, in terms of ranking , the simulated user feedback simulation provides a closer approximation to human judgment.
The reason why the simulated user feedback performs better is twofold:
While single LLM calls can identify major improvements in the pipeline, they fall short of detecting the more frequent, minor enhancements critical in mature pipelines. Simulated user feedback, however, exhibits a high correlation with human judgment, enabling the detection of these incremental advances.
As a side note, while we could have used the data to fine-tune an LLM, this has the typical drawback of requiring more data and not being as interpretable.
In the next section, we will walk through an example on how to create your simulated user feedback.
In this section we will show how we can use the open-source library continuous-eval to create simulated user feedback.
Consider a Q&A chatbot application. After deployment, users begin rating responses with thumbs up or down, indicating a need for performance enhancement. For this example we will use the example named correctness
in continuous-eval:
dataset = Dataset(example_data_downloader("correctness"))
# Samples are annotated with "correct", "incorrect" or "refuse-to-answer"
# We remove the samples where the LLL refused to answer (i.e., said "I don't know")
dataset.filter(lambda x: x["annotation"] != "refuse-to-answer")
dataset.sample(300) # Only for this example: randomly sample 300 examples
For the tone:
tone = LLMBasedCustomMetric(
name="Tone",
definition="The Tone/Content Issues metric evaluates the appropriateness and accuracy of the tone and content in responses to specific questions. It focuses on ensuring that the tone is professional and suitable for the context, and that the content accurately addresses the question without unnecessary deviations or inaccuracies. This metric is crucial for maintaining a professional image and ensuring clear, direct communication.",
scoring_rubric="""Use the following rubric to assign a score to the answer based on its tone:
- Score 1: The response is inappropriate or inaccurate, with a tone that is either too informal, overly strong, or not suited to the professional context. The content may be irrelevant, incorrect, or fail to directly address the question posed.
- Score 2: The response is mostly appropriate and accurate but may contain minor tone or content issues. The tone is generally professional but may slip into informality or unnecessary strength in places. The content addresses the question but may include minor inaccuracies or unnecessary details.
- Score 3: The response is appropriate and accurate, with a tone that is professional and suited to the context. The content directly and correctly addresses the question without unnecessary deviations or inaccuracies.""",
scoring_function=ScoringFunctions.Numeric(min_val=1, max_val=3),
model_parameters={"temperature": 0},
)
while for conciseness:
conciseness = LLMBasedCustomMetric(
name="Conciseness",
definition="Conciseness in communication refers to the expression of ideas in a clear and straightforward manner, using the fewest possible words without sacrificing clarity or completeness of information. It involves eliminating redundancy, verbosity, and unnecessary details, focusing instead on delivering the essential message efficiently. ",
scoring_rubric="""Use the following rubric to assign a score to the answer based on its conciseness:
- Score 1: The answer is overly verbose, containing a significant amount of unnecessary information, repetition, or redundant expressions that do not contribute to the understanding of the topic.
- Score 2: The answer includes some unnecessary details or slightly repetitive information, but the excess does not severely hinder understanding.
- Score 3: The answer is clear, direct, and to the point, with no unnecessary words, details, or repetition.""",
scoring_function=ScoringFunctions.Numeric(min_val=1, max_val=3),
model_parameters={"temperature": 0},
)
We use Tone and Conciseness together with more standard metrics, in particular we will consider the
DeterministicAnswerCorrectens
and LLMBasedAnswerCorrectness)
LLMBasedAnswerRelevance
)LLMBasedStyleConsistency
)FleschKincaidReadability
)The next step is to put all the metrics together and specify what field of the dataset should be used to compute the metrics. To do that we can use the SingleModulePipeline
pipeline = SingleModulePipeline(
dataset=dataset,
eval=[
DeterministicAnswerCorrectness().use(
answer=dataset.answer,
ground_truth_answers=dataset.ground_truths,
),
LLMBasedAnswerCorrectness().use(
question=dataset.question,
answer=dataset.answer,
ground_truth_answers=dataset.ground_truths,
),
LLMBasedAnswerRelevance().use(
question=dataset.question, answer=dataset.answer
),
LLMBasedStyleConsistency().use(
answer=dataset.answer, ground_truth_answers=dataset.ground_truths
),
FleschKincaidReadability().use(answer=dataset.answer),
tone.use(
question=dataset.question,
answer=dataset.answer,
ground_truth_answers=dataset.ground_truths,
),
conciseness.use(
question=dataset.question,
answer=dataset.answer,
ground_truth_answers=dataset.ground_truths,
),
],
)
and run all the metrics using the EvaluationManager
eval_manager = EvaluationManager(pipeline)
# The dataset already contains the model output so we just set the evaluation results
eval_manager.evaluation.results = dataset.data
eval_manager.run_metrics() # Note: there is no progress bar, it might take a few minutes
The next step is to train simulated user feedback predictor
datasplit = DataSplit(
X=eval_manager.metrics.to_pandas(),
y=map(lambda x: 1 if x == "correct" else 0, dataset["annotation"]),
split_ratios=SplitRatios(train=0.6, test=0.2, calibration=0.2),
)
# We use the train and calibration sets to train the classifier
predictor = EnsembleMetric(training=datasplit.train, calibration=datasplit.calibration)
This simulated user feedback predictor is able to correctly predict the human feedback in the test split 96.67% of the time.
We can leverage the proposed approach to better understand what is important to the user. Below is the learned importance of every metric by the simulated user feedback predictor.
Looking at the plot, we see that Correctness (including token overlap, which is another measure for correctness) and Relevance to the question are the most important predictors of user preference. But the user also weighs tone and style consistency into the decision. At the same time, we can see that conciseness and readability are not as important. Reviewing this graph provides valuable insight into user preferences, giving a clear indication of what elements are essential and what can be adjusted if compromises need to be made.
Collecting user feedback is challenging, yet it is the most important information for developers of large language models (LLMs). By simulating user feedback during offline testing, we significantly reduces the time it takes for feedback to travel from the field back to developers, while maintaining positive user relationships.
In practice, our approach has proven to closely mirror actual human responses, outperforming traditional methods that rely on isolated LLM responses. This strategy allows for the incremental improvement of generative AI applications, fostering continuous refinement and greater congruence with what users expect.