1. Add evals to init_tracing
Pass scorer names to init_tracing() and every trace is evaluated automatically. No extra code is required.
Use evals to run scorers on each trace, and session_evals to run scorers when a session ends (via end_session()).
2. create_scorer
Create custom scorers programmatically. Full signature and options:

| Value | Use case |
|---|---|
| trace | Evaluate the whole trace (one run per trace) |
| span | Evaluate each span (LLM or tool) individually |
| trajectory | Evaluate the full session trajectory (multi-turn) |
| both | Trace and span scopes |
- numeric: Use min_score and max_score. Judge returns a number in that range.
- categorical: Use choices. Judge picks one label; each label has a score.
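The two output types above can be modelled in a few lines. This is a minimal sketch of the semantics only, not the SDK's implementation: the class names `NumericOutput` and `CategoricalOutput`, the dict shape used for choices, and the example labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NumericOutput:
    """Numeric output: the judge must return a number in [min_score, max_score]."""
    min_score: float
    max_score: float

    def score(self, judge_value: float) -> float:
        # Assumption: out-of-range values are rejected rather than clamped.
        if not (self.min_score <= judge_value <= self.max_score):
            raise ValueError(
                f"{judge_value} is outside [{self.min_score}, {self.max_score}]"
            )
        return float(judge_value)


@dataclass
class CategoricalOutput:
    """Categorical output: the judge picks one label, and each label has a score."""
    choices: dict[str, float]  # hypothetical shape: label -> score

    def score(self, judge_label: str) -> float:
        if judge_label not in self.choices:
            raise ValueError(f"unknown label: {judge_label!r}")
        return self.choices[judge_label]


# Hypothetical scorer configurations.
relevance = NumericOutput(min_score=1, max_score=5)
verdict = CategoricalOutput(choices={"pass": 1.0, "partial": 0.5, "fail": 0.0})

print(relevance.score(4))        # 4.0
print(verdict.score("partial"))  # 0.5
```

A numeric scorer suits graded qualities such as relevance, while a categorical scorer suits discrete verdicts that still need a number for aggregation.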
| Scope | Common mappings |
|---|---|
| Trace | {"trajectory": "trajectory"} — full execution trajectory |
| LLM spans | {"prompt": "llm.prompt", "completion": "llm.completion"} |
| Tool spans | {"tool_name": "tool.name", "tool_input": "tool.input", "tool_output": "tool.output"} |
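To make the mappings in the table concrete, the sketch below resolves one against a span's attributes and fills a judge prompt. It assumes the left-hand name is a prompt template variable and the right-hand name is an attribute key on the span; the helper `resolve_variables`, the flat attribute dict, and the `{name}` template syntax are illustrative assumptions, not part of the documented API.

```python
def resolve_variables(mapping: dict[str, str], attributes: dict[str, object]) -> dict[str, object]:
    """Look up each mapped attribute key and return values keyed by variable name."""
    missing = [key for key in mapping.values() if key not in attributes]
    if missing:
        raise KeyError(f"span is missing attributes: {missing}")
    return {variable: attributes[key] for variable, key in mapping.items()}


# Hypothetical attributes recorded on an LLM span.
llm_span = {
    "llm.prompt": "What is 2 + 2?",
    "llm.completion": "4",
    "llm.model": "example-model",
}

# The LLM-span mapping from the table above.
llm_mapping = {"prompt": "llm.prompt", "completion": "llm.completion"}

variables = resolve_variables(llm_mapping, llm_span)

# Hypothetical judge prompt template using the mapped variable names.
template = "Question: {prompt}\nAnswer: {completion}\nIs the answer correct?"
judge_prompt = template.format(**variables)
print(judge_prompt)
```

Only the attributes named in the mapping reach the judge, so unrelated span data such as the model name stays out of the prompt.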
3. evaluate
Score a specific trace after it completes:

| Parameter | Type | Default | Description |
|---|---|---|---|
| run | span or string | required | Span from trace_run() or trace ID string |
| scorers | list[str] | None | Scorer names, built-in keys, or UUIDs. If None, uses init_tracing(evals=[...]) |
| wait | float | 3 | Seconds to wait for trace ingestion before evaluating |
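The fallback described for the scorers parameter can be sketched as follows. The helper `resolve_scorers` and the scorer names are hypothetical and exist only to illustrate the rule; treating an explicit empty list as "run nothing" rather than as a trigger for the fallback is also an assumption.

```python
from typing import Optional


def resolve_scorers(scorers: Optional[list[str]], configured_evals: list[str]) -> list[str]:
    """Choose the scorers for one evaluation call.

    None means "use whatever was passed as evals at init time";
    any explicit list, including an empty one, is used as given.
    """
    if scorers is None:
        return list(configured_evals)
    return list(scorers)


# Hypothetical scorer names configured via init_tracing(evals=[...]).
configured = ["helpfulness", "toxicity"]

print(resolve_scorers(None, configured))            # ['helpfulness', 'toxicity']
print(resolve_scorers(["groundedness"], configured))  # ['groundedness']
```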
4. evaluate_spans
Run span-level scorers on matching spans within a trace. Use span_type to target LLM or tool spans:
| Parameter | Description |
|---|---|
| trace_id | Trace containing the spans |
| scorer_ids | List of scorer UUIDs (span-level scorers) |
| span_type | "llm" or "tool" (filters spans by type) |
| span_name_pattern | Optional substring filter for span names |
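The two filters in the table combine as shown in the sketch below, which models span selection on plain Python objects. The `Span` class and `select_spans` helper are illustrative stand-ins rather than SDK types, and case-sensitive substring matching is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span record."""
    name: str
    span_type: str  # "llm" or "tool"


def select_spans(
    spans: list[Span],
    span_type: str,
    span_name_pattern: Optional[str] = None,
) -> list[Span]:
    """Keep spans of the given type whose name contains the optional substring."""
    if span_type not in ("llm", "tool"):
        raise ValueError('span_type must be "llm" or "tool"')
    return [
        span
        for span in spans
        if span.span_type == span_type
        and (span_name_pattern is None or span_name_pattern in span.name)
    ]


# Hypothetical spans from one trace.
trace_spans = [
    Span(name="plan_step", span_type="llm"),
    Span(name="search_web", span_type="tool"),
    Span(name="search_docs", span_type="tool"),
    Span(name="final_answer", span_type="llm"),
]

tool_searches = select_spans(trace_spans, span_type="tool", span_name_pattern="search")
print([span.name for span in tool_searches])  # ['search_web', 'search_docs']
```

Filtering by type first keeps LLM-oriented scorers from running on tool spans that lack prompt and completion attributes.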