How to Build and Improve AI Agents Faster Using Real-World Feedback

Discover how teams accelerate AI agent development by turning production failures into structured evaluation and improvement loops. Using Weights & Biases Weave and Models, this demo shows how failures become datasets, datasets become evaluators, and evaluators improve agent performance over time.

1
00:00:02,050 --> 00:00:13,759
Hey everyone, hi, Nico here. I'm going to demo the Mailbox Research Agent use case. Imagine thousands of people using a mailbox research agent that you developed.

2
00:00:14,140 --> 00:00:21,919
You select your company email and can ask any question about it.

3
00:00:21,920 --> 00:00:39,209
You ask a question; the agent searches the emails, reads them, and returns an answer. Many users might be doing this at scale.

4
00:00:39,700 --> 00:00:53,940
Requests go through multiple agent steps. But when many people use it, things can go wrong, and it’s often hard to know what happened.

5
00:00:54,420 --> 00:00:57,720
Let’s look at this scenario with alerts enabled.

6
00:00:57,760 --> 00:01:09,959
We received alerts: a faithfulness alert and a short response alert.

7
00:01:09,970 --> 00:01:23,760
This means the answer wasn’t grounded in the emails and was unusually short, which is often a sign of hallucination or failure.

8
00:01:23,960 --> 00:01:42,320
After multiple turns, the agent stopped searching emails and instead followed the conversation, producing hallucinated responses.

9
00:01:42,320 --> 00:01:55,339
This is hard to detect at scale when you have millions of traces.

10
00:01:55,530 --> 00:02:07,930
Typically, you'd start with dashboards showing traces, latency, tokens, and cost.

11
00:02:07,930 --> 00:02:17,549
But those metrics don’t explain agent behavior.

12
00:02:17,550 --> 00:02:21,180
System metrics alone are not enough.

13
00:02:21,180 --> 00:02:24,329
The next option is to inspect traces.

14
00:02:24,570 --> 00:02:26,620
Inside Weave, you can view traces.

15
00:02:26,840 --> 00:02:50,740
You can explore user questions and see step-by-step actions like searching inboxes and reading emails.

16
00:02:50,890 --> 00:02:59,170
This is useful but doesn’t scale across millions of traces.

17
00:02:59,270 --> 00:03:05,120
That’s where monitors and alerts come in.

18
00:03:05,390 --> 00:03:10,439
You can define monitors that score all incoming traces.

19
00:03:10,440 --> 00:03:26,849
For example, a monitor can check whether answers are supported by email evidence, running on CoreWeave compute.

20
00:03:26,910 --> 00:03:46,060
Signals automatically score traces using custom metrics to evaluate agent behavior.

21
00:03:46,060 --> 00:03:51,799
You define a monitor, and all traces are scored in production.
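
To make the idea of scoring every trace concrete, here is a minimal Python sketch of the two checks mentioned in the demo: whether an answer is supported by email evidence, and whether it is unusually short. It does not use the Weave API. The trace shape (`answer`, `emails`), the word-overlap proxy for groundedness, and the `MIN_ANSWER_WORDS` threshold are all assumptions made for illustration; a production monitor would more likely use an LLM judge.

```python
import re

# Assumed threshold for what counts as an "unusually short" answer.
MIN_ANSWER_WORDS = 8


def _words(text):
    """Lowercase alphanumeric tokens in `text`, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, emails):
    """Fraction of distinct answer words that also appear in the emails.

    A crude stand-in for a faithfulness check, not a real judge.
    """
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    evidence = set()
    for email in emails:
        evidence |= _words(email)
    return len(answer_words & evidence) / len(answer_words)


def score_trace(trace):
    """Score one trace dict with hypothetical keys 'answer' and 'emails'."""
    answer = trace["answer"]
    return {
        "groundedness": groundedness(answer, trace["emails"]),
        "short_response": len(answer.split()) < MIN_ANSWER_WORDS,
    }
```

Calling `score_trace` on each incoming trace gives one record per trace that a monitor could store and chart over time.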

22
00:03:52,120 --> 00:04:00,100
Next, you can trigger alerts when something goes wrong.

23
00:04:00,150 --> 00:04:11,479
Alerts show trends like how grounded your agent responses are over time.

24
00:04:11,480 --> 00:04:25,900
You can see performance degrade and identify when alerts were triggered.
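
One way to turn a score trend into alert events is a rolling average with a threshold. The sketch below assumes groundedness scores between 0 and 1 arriving in time order; the window size and threshold are hypothetical values, not settings taken from the demo.

```python
from collections import deque


def find_alerts(scores, window=3, threshold=0.7):
    """Return the indices at which the rolling mean of the last
    `window` scores is below `threshold`."""
    recent = deque(maxlen=window)
    alerts = []
    for i, score in enumerate(scores):
        recent.append(score)
        # Only evaluate once the window is full, to avoid noisy early alerts.
        if len(recent) == window and sum(recent) / window < threshold:
            alerts.append(i)
    return alerts
```

Averaging over a window rather than alerting on a single low score trades a little detection delay for fewer false alarms.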

25
00:04:25,900 --> 00:04:33,600
Slack alerts are helpful, but we can go further.

26
00:04:33,660 --> 00:04:44,170
Instead of sending a Slack alert, you can trigger a webhook.

27
00:04:44,230 --> 00:05:07,749
These webhooks send failures to a reinforcement learning service for automated improvement.
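
As a rough sketch of what such a webhook call could look like, the function below builds an HTTP POST request carrying one failed trace. The URL, the payload fields (`trace_id`, `scores`), and the function name are hypothetical; the demo does not show the actual payload format or the service's interface.

```python
import json
import urllib.request


def build_failure_request(url, trace_id, scores):
    """Build (but do not send) a JSON POST request describing a failed trace."""
    payload = {"trace_id": trace_id, "scores": scores}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; building and sending are kept separate here so the payload can be inspected first.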

28
00:05:08,280 --> 00:05:21,160
Failures are collected, rollouts are run, and the model is updated.

29
00:05:21,180 --> 00:05:36,960
This improves the model over time based on real production failures.

30
00:05:37,230 --> 00:06:00,129
This represents an ideal loop: automatic scoring, failure detection, and model improvement.

31
00:06:00,130 --> 00:06:18,259
More commonly, teams manually analyze traces and failures.

32
00:06:18,450 --> 00:06:33,260
You can filter traces, identify failures, and decide next steps.

33
00:06:33,260 --> 00:06:41,960
For example, you can add failures to datasets or annotation queues.
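
The manual path can be sketched in plain Python as well: filter scored traces by a failure criterion and collect them as dataset rows awaiting annotation. The trace fields, the `label` placeholder, and the 0.7 cutoff are assumptions for illustration, not the Weave dataset or annotation queue API.

```python
def collect_failures(traces, min_groundedness=0.7):
    """Return dataset rows for traces scoring below `min_groundedness`.

    Each row keeps the question and answer and leaves `label` empty
    for a domain expert to fill in during review.
    """
    rows = []
    for trace in traces:
        if trace["groundedness"] < min_groundedness:
            rows.append(
                {
                    "question": trace["question"],
                    "answer": trace["answer"],
                    "label": None,  # to be set by a human annotator
                }
            )
    return rows
```

The resulting rows can then be reviewed, labeled, and reused as an evaluation set.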

34
00:06:42,310 --> 00:06:48,369
Domain experts can review and annotate failures.

35
00:06:48,370 --> 00:07:05,520
You can then run evaluations or re-score data.

36
00:07:05,760 --> 00:07:18,369
Evaluation dashboards show performance improvements over time.

37
00:07:18,700 --> 00:07:33,430
You can compare runs side-by-side to optimize your agent.

38
00:07:33,430 --> 00:07:49,869
This creates a loop from production failures to research improvements.

39
00:07:49,870 --> 00:07:56,590
Ultimately, this leads to better models and fewer failures.

40
00:07:56,950 --> 00:07:58,890
That’s the overview.

41
00:07:59,290 --> 00:08:05,800
If you have questions, feel free to reach out. Happy tracing!

42
00:08:08,430 --> 00:08:09,590
Alrighty.