Closing the gap between evaluations and system improvements

We all know that ongoing evaluation is essential for success with AI-driven products. When done well, evaluations provide you with a clear picture of overall performance, identify weaknesses, and reveal new things about how the product itself is being used.

Let's suppose you are conducting high-quality evals. You've done everything right in terms of sampling data points, collecting needed annotations, choosing appropriate metrics, and analyzing the results. You now have a rich picture of your system's performance. So far so good.

What's next? How exactly do you capitalize on what you have learned? To realize the full value of (often very hard-won) evals, we need a specific and actionable answer to this question. All too often, though, the question is not even posed, let alone answered fully.

In earlier eras of AI, the answer was often not very satisfying: maybe tweak how you process examples to try to address the failure cases, retrain the relevant components of your system on the newly annotated data, and then hope for the best.

Happily, with modern GenAI systems, we are much better positioned to fully realize the value of our evals. Here are two easy and productive steps to consider:

  1. Ask a GenAI tool to analyze your evaluation data, looking for latent requirements (see the first sketch after this list). For example, you might not have noticed that a specific chatbot behavior ticks your users off, but an LLM is likely to surface that pattern. This works especially well if your examples have free-form comments from users or evaluators attached to them. Review the resulting requirements to make sure they are sensible, and then add the good ones to the relevant prompt in your system.
  2. Ask a GenAI tool to synthesize few-shot examples from your evaluation data, and then include these examples in your prompt (see the second sketch below). Few-shot examples are powerful tools for shaping system behavior, and all of the major LLM providers have made sure their models are good at synthesizing them. Nudge the model to cover weird edge cases, and don't be afraid to include a lot of these examples in your prompt, even if they are long.
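
Here is a minimal sketch of step 1, using the OpenAI Python SDK purely as an example client. The model name, the shape of the eval records, and the prompt wording are all assumptions to adapt to your own stack.

```python
# Minimal sketch of step 1: mining latent requirements from annotated eval data.
# The model name, record schema, and prompt wording are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical eval records: system output plus a free-form evaluator comment.
eval_records = [
    {"input": "Cancel my subscription.",
     "output": "Sure! Anything else I can help with?",
     "comment": "The bot never confirmed that the cancellation actually went through."},
    {"input": "What's your refund policy?",
     "output": "Please see our website.",
     "comment": "Deflecting to the website annoys users; the answer should be inline."},
]

prompt = (
    "Below are evaluation records for a support chatbot, each with a free-form "
    "comment from a human evaluator. Identify latent requirements: recurring "
    "expectations the system fails to meet that are not yet written into its "
    "instructions. Return a short bulleted list of imperative requirement "
    "statements.\n\n" + json.dumps(eval_records, indent=2)
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model will do
    messages=[{"role": "user", "content": prompt}],
)

# Review these by hand before adding the good ones to your system prompt.
print(response.choices[0].message.content)
```

The important part is the final review step: the model proposes candidate requirements, but you decide which ones actually belong in your prompt.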

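Here is a similarly hedged sketch of step 2: synthesizing few-shot demonstrations from the same kind of annotated records and splicing them into the deployed prompt. Again, the client, model, and record schema are placeholders.

```python
# Minimal sketch of step 2: synthesizing few-shot demonstrations from eval data
# and splicing them into the deployed prompt. Same caveats as above: the client,
# model, and record format are placeholders.
import json
from openai import OpenAI

client = OpenAI()

eval_records = [
    {"input": "Cancel my subscription.",
     "output": "Sure! Anything else I can help with?",
     "comment": "The bot never confirmed that the cancellation actually went through."},
]

synthesis_prompt = (
    "From the evaluation records below, write five few-shot examples (a user "
    "message plus an ideal assistant reply) that demonstrate the desired "
    "behavior. Favor the weird edge cases the evaluators flagged. Respond with "
    'a JSON object of the form {"examples": [{"user": ..., "assistant": ...}]}.\n\n'
    + json.dumps(eval_records, indent=2)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": synthesis_prompt}],
    response_format={"type": "json_object"},  # keeps the reply parseable
)
few_shots = json.loads(response.choices[0].message.content)["examples"]

# Deployed prompt: system instructions, then the synthesized demonstrations as
# alternating user/assistant turns, then the live query.
messages = [{"role": "system", "content": "You are a support assistant. ..."}]
for shot in few_shots:
    messages.append({"role": "user", "content": shot["user"]})
    messages.append({"role": "assistant", "content": shot["assistant"]})
messages.append({"role": "user", "content": "<live user message goes here>"})
```
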
I should note that the state-of-the-art prompt optimizer MIPROv2 is, in essence, a sophisticated method for doing both of the above steps. If you have enough labeled examples and the technical setup required to run it, we recommend it.
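
For orientation, here is a rough outline of what a MIPROv2 run looks like in DSPy. Class and argument names have shifted across DSPy releases, so check this against the current documentation rather than treating it as copy-paste code; the model, signature, metric, and data below are all illustrative.

```python
# Rough outline of a MIPROv2 run in DSPy. Treat exact class and argument names
# as assumptions to verify against the current DSPy docs.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model string

# A toy one-module program; real systems may chain several modules.
program = dspy.ChainOfThought("question -> answer")

# Labeled examples drawn from your evaluation data (the more, the better;
# MIPROv2 typically wants dozens or more).
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

def metric(example, prediction, trace=None):
    # Whatever your evals already measure; loose string match here for simplicity.
    return example.answer.lower() in prediction.answer.lower()

optimizer = dspy.MIPROv2(metric=metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```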

In addition, if you are in a position to fine-tune the LM components using your labeled data, that is also likely to help. With our BetterTogether method, we showed that fine-tuning complements the work you do to improve your prompts. If I had to choose only one for task-specific work, though, I would choose prompt improvements: more impact at a lower cost in data labeling and compute.

If you are working in Bigspin, you will find that steps 1 and 2 are happening continuously behind the scenes for you. You can simply ask for few-shots and latent requirements in chat, or you can use the dedicated tools. The more examples you and your team annotate with comments and labels, the richer both of these steps will be.

Chris Potts
Co-Founder & Chief Scientist