Cascade provides a built-in evaluation system to score your agent’s performance. The Evaluator SDK covers the core evaluation flow: configuring evals, creating scorers, and running evaluations. The Python SDK (separate section) covers deeper APIs like failure reports, tasks, and trace access.

1. Add evals to init_tracing

Pass scorer names to init_tracing() and every trace is automatically evaluated. No extra code required.
from cascade import init_tracing

init_tracing(
    project="my_agent",
    evals=["helpfulness", "hallucination"],
)
# Every trace is now auto-evaluated.
Rubrics from the Platform: Create rubrics (scorers) in the Cascade dashboard (Evaluations → Rubrics). Use the rubric name in evals:
init_tracing(
    project="my_agent",
    evals=["helpfulness", "My Custom Rubric", "Proceeding Despite Explicit User Refusal"],
)
Session-level evals: For multi-turn conversations, use session_evals to run scorers when a session ends (via end_session()):
init_tracing(
    project="chatbot",
    evals=["helpfulness"],
    session_evals=["Proceeding Despite Explicit User Refusal"],
)
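For orientation, here is a minimal sketch of how this configuration plays out over a multi-turn conversation. It assumes end_session() is importable from cascade and is called with no arguments, and that turns traced between init_tracing() and end_session() belong to the same session; check your SDK version for the exact session API.
from cascade import init_tracing, trace_run, end_session  # end_session import path assumed

init_tracing(
    project="chatbot",
    evals=["helpfulness"],                                       # scored per trace
    session_evals=["Proceeding Despite Explicit User Refusal"],  # scored once the session ends
)

for user_message in ["Book me a flight.", "Actually, cancel that."]:
    with trace_run("ChatTurn") as run:
        ...  # handle one conversation turn here

end_session()  # triggers the session_evals scorers for the whole conversation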

2. create_scorer

Create custom scorers programmatically. Full signature and options:
from cascade import CascadeEval

evals = CascadeEval()

scorer = evals.create_scorer(
    name="My Scorer",
    scorer_type="llm_judge",   # or "code"
    scope="trace",             # "trace" | "span" | "trajectory" | "both"
    threshold=0.5,
    description="Optional description",
    llm_provider="anthropic",  # or "openai"
    llm_model="claude-3-5-haiku-20241022",
    llm_template="...",       # Use {{variable}} placeholders
    variable_mappings={...},  # Map placeholders to span attributes
    output_type="numeric",     # or "categorical"
    min_score=0,              # for numeric
    max_score=1,               # for numeric
    choices=[...],            # for categorical
    tags=["tag1", "tag2"],
)
scope
Value        Use case
trace        Evaluate the whole trace (one run per trace)
span         Evaluate each span (LLM or tool) individually
trajectory   Evaluate the full session trajectory (multi-turn)
both         Trace and span scopes
output_type and scoring
  • numeric — Use min_score and max_score. Judge returns a number in that range.
  • categorical — Use choices. Judge picks one label; each label has a score.
# Numeric (0–1 scale)
output_type="numeric",
min_score=0,
max_score=1,

# Categorical (Y/N style)
output_type="categorical",
choices=[
    {"label": "Y", "score": 1.0, "description": "Appropriate"},
    {"label": "N", "score": 0.0, "description": "Inappropriate"},
],
variable_mappings
Maps template placeholders to span attributes. Use the placeholder name as the key and the attribute path as the value.
Scope        Common mappings
Trace        {"trajectory": "trajectory"} — full execution trajectory
LLM spans    {"prompt": "llm.prompt", "completion": "llm.completion"}
Tool spans   {"tool_name": "tool.name", "tool_input": "tool.input", "tool_output": "tool.output"}
# Trace-level scorer
variable_mappings={"trajectory": "trajectory"}

# LLM span scorer
variable_mappings={"prompt": "llm.prompt", "completion": "llm.completion"}

# Tool span scorer
variable_mappings={
    "tool_name": "tool.name",
    "tool_input": "tool.input",
    "tool_output": "tool.output",
}
Example: trace-level numeric scorer
evals.create_scorer(
    name="Diagnosis Thoroughness",
    scorer_type="llm_judge",
    scope="trace",
    threshold=0.6,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="numeric",
    min_score=0,
    max_score=1,
    llm_template=(
        "Rate the thoroughness of this agent's diagnosis on a scale of 0.0 to 1.0.\n\n"
        "Full execution trajectory:\n{{trajectory}}\n\n"
        "Score (0.0-1.0):"
    ),
    variable_mappings={"trajectory": "trajectory"},
)
Example: span-level categorical scorer (tool spans)
evals.create_scorer(
    name="Tool Usage Quality",
    scorer_type="llm_judge",
    scope="span",
    threshold=0.5,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="categorical",
    choices=[
        {"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
        {"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
    ],
    llm_template=(
        "Evaluate this tool call.\n\n"
        "Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
        "Was this tool call appropriate? Answer Y or N."
    ),
    variable_mappings={
        "tool_name": "tool.name",
        "tool_input": "tool.input",
        "tool_output": "tool.output",
    },
)

3. evaluate

Score a specific trace after it completes:
from cascade import init_tracing, trace_run, evaluate

init_tracing(project="my_agent")

with trace_run("MyAgent") as run:
    ...  # agent code runs here

results = evaluate(run, ["helpfulness", "hallucination"])
for r in results:
    print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")
Parameters
Parameter   Type             Default    Description
run         span or string   required   Span from trace_run() or trace ID string
scorers     list[str]        None       Scorer names, built-in keys, or UUIDs. If None, uses init_tracing(evals=[...])
wait        float            3          Seconds to wait for trace ingestion before evaluating
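Putting the parameters together: run can be a trace ID string instead of a span, scorers can be omitted to fall back to the scorers configured in init_tracing(evals=[...]), and wait can be raised if ingestion is slow. A short sketch (the trace ID is a placeholder):
from cascade import evaluate

# Evaluate a finished trace by ID, using the scorers configured in init_tracing(evals=[...])
results = evaluate("abc123...", wait=10)

# Or name scorers explicitly for this one call
results = evaluate("abc123...", ["helpfulness"], wait=10)

for r in results:
    print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")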

4. evaluate_spans

Run span-level scorers on matching spans within a trace. Use span_type to target LLM or tool spans:
from cascade import CascadeEval

evals = CascadeEval()

# Evaluate all LLM spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[llm_scorer_id],
    span_type="llm",
)

# Evaluate all tool spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[tool_scorer_id],
    span_type="tool",
)

# Optional: filter by span name pattern
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[scorer_id],
    span_type="llm",
    span_name_pattern="claude-3-5-haiku",
)
Parameters
Parameter           Description
trace_id            Trace containing the spans
scorer_ids          List of scorer UUIDs (span-level scorers)
span_type           "llm" or "tool" — filter spans by type
span_name_pattern   Optional substring filter for span names
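To connect sections 2 and 4: register a span-level scorer with create_scorer, then pass its UUID to evaluate_spans. The sketch below assumes the object returned by create_scorer exposes its UUID as an id attribute; depending on your SDK version it may be a dict field instead.
from cascade import CascadeEval

evals = CascadeEval()

# Span-level categorical scorer (same shape as the "Tool Usage Quality" example above)
scorer = evals.create_scorer(
    name="Tool Usage Quality",
    scorer_type="llm_judge",
    scope="span",
    threshold=0.5,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="categorical",
    choices=[
        {"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
        {"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
    ],
    llm_template=(
        "Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
        "Was this tool call appropriate? Answer Y or N."
    ),
    variable_mappings={
        "tool_name": "tool.name",
        "tool_input": "tool.input",
        "tool_output": "tool.output",
    },
)

# Run it against every tool span in a completed trace
result = evals.evaluate_spans(
    trace_id="abc123...",       # placeholder trace ID
    scorer_ids=[scorer.id],     # assumption: the scorer's UUID is exposed as .id
    span_type="tool",
)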