Tracking failures
Failures are tracked as daily counts, broken down by scorer type, so you can spot trends at a glance. Patterns such as a sudden spike in hallucination failures after a prompt change, or a gradual increase in tool-usage failures as your agent handles more edge cases, show up immediately. Each failure entry includes the scorer’s reasoning for why it failed, so you don’t have to re-run the evaluation to understand the issue.
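To make the idea of a per-day, per-scorer breakdown concrete, here is a minimal sketch in Python. It is not the product's implementation or API: the record fields (`day`, `scorer`, `reasoning`), the helper names `daily_counts_by_scorer` and `find_spikes`, and the sample data are all hypothetical, chosen only to illustrate how such counts could be aggregated and how a spike might be flagged.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical failure records; field names and values are illustrative only.
failures = [
    {"day": date(2024, 5, 1), "scorer": "hallucination",
     "reasoning": "Answer cites a source that is not in the context."},
    {"day": date(2024, 5, 1), "scorer": "tool_usage",
     "reasoning": "Agent called the search tool with an empty query."},
    {"day": date(2024, 5, 2), "scorer": "hallucination",
     "reasoning": "Answer states a release date absent from the context."},
    {"day": date(2024, 5, 2), "scorer": "hallucination",
     "reasoning": "Answer invents a configuration flag."},
    {"day": date(2024, 5, 2), "scorer": "hallucination",
     "reasoning": "Answer attributes a quote to the wrong document."},
]


def daily_counts_by_scorer(records):
    """Group failure records into {day: Counter({scorer: count})}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["day"]][record["scorer"]] += 1
    return counts


def find_spikes(counts, scorer, factor=2.0):
    """Return days whose count for `scorer` is at least `factor` times
    the count on the previous recorded day (assumed threshold rule)."""
    days = sorted(counts)
    flagged = []
    for previous, current in zip(days, days[1:]):
        before = counts[previous][scorer]
        after = counts[current][scorer]
        if before > 0 and after >= factor * before:
            flagged.append(current)
    return flagged


counts = daily_counts_by_scorer(failures)
spike_days = find_spikes(counts, "hallucination")
```

With this sample data, `counts` holds one hallucination and one tool-usage failure on the first day and three hallucination failures on the second, so `spike_days` contains only the second day. Because each record keeps its `reasoning` string, the explanation for any counted failure can be read directly from the record without re-running the evaluation.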