Cascade provides a built-in evaluation system to score your agent’s performance. The Evaluator SDK covers the core evaluation flow: configuring evals, creating scorers, and running evaluations. The Python SDK (separate section) covers deeper APIs like failure reports, tasks, and trace access.

1. Add evals to init_tracing

Pass scorer names to init_tracing() and every trace is automatically evaluated. No extra code required.
from cascade import init_tracing

init_tracing(
    project="my_agent",
    evals=["helpfulness", "hallucination"],
)
# Every trace is now auto-evaluated.
Rubrics from the Platform: Create rubrics (scorers) in the Cascade dashboard (Evaluations → Rubrics). Use the rubric name in evals:
init_tracing(
    project="my_agent",
    evals=["helpfulness", "My Custom Rubric", "Proceeding Despite Explicit User Refusal"],
)
Session-level evals: For multi-turn conversations, use session_evals to run scorers when a session ends (via end_session()):
init_tracing(
    project="chatbot",
    evals=["helpfulness"],
    session_evals=["Proceeding Despite Explicit User Refusal"],
)
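For orientation, here is a minimal sketch of how this configuration plays out over a multi-turn conversation. It assumes end_session() is importable from cascade and is called with no arguments, and that turns traced between init_tracing() and end_session() belong to the same session; check your SDK version for the exact session API.
from cascade import init_tracing, trace_run, end_session  # end_session import path assumed

init_tracing(
    project="chatbot",
    evals=["helpfulness"],                                       # scored per trace
    session_evals=["Proceeding Despite Explicit User Refusal"],  # scored once the session ends
)

for user_message in ["Book me a flight.", "Actually, cancel that."]:
    with trace_run("ChatTurn") as run:
        ...  # handle one conversation turn here

end_session()  # triggers the session_evals scorers for the whole conversation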

2. create_scorer

Create custom scorers programmatically. Full signature and options:
from cascade import CascadeEval

evals = CascadeEval()

scorer = evals.create_scorer(
    name="My Scorer",
    scorer_type="llm_judge",   # or "code"
    scope="trace",             # "trace" | "span" | "trajectory" | "both"
    threshold=0.5,
    description="Optional description",
    llm_provider="anthropic",  # or "openai"
    llm_model="claude-3-5-haiku-20241022",
    llm_template="...",       # Use {{variable}} placeholders
    variable_mappings={...},  # Map placeholders to span attributes
    output_type="numeric",     # or "categorical"
    min_score=0,              # for numeric
    max_score=1,               # for numeric
    choices=[...],            # for categorical
    tags=["tag1", "tag2"],
)
scope
Value        Use case
trace        Evaluate the whole trace (one run per trace)
span         Evaluate each span (LLM or tool) individually
trajectory   Evaluate the full session trajectory (multi-turn)
both         Trace and span scopes
output_type and scoring
  • numeric — Use min_score and max_score. Judge returns a number in that range.
  • categorical — Use choices. Judge picks one label; each label has a score.
# Numeric (0–1 scale)
output_type="numeric",
min_score=0,
max_score=1,

# Categorical (Y/N style)
output_type="categorical",
choices=[
    {"label": "Y", "score": 1.0, "description": "Appropriate"},
    {"label": "N", "score": 0.0, "description": "Inappropriate"},
],
variable_mappings
Maps template placeholders to span attributes. Use the placeholder name as the key and the attribute path as the value.
Scope        Common mappings
Trace        {"trajectory": "trajectory"} — full execution trajectory
LLM spans    {"prompt": "llm.prompt", "completion": "llm.completion"}
Tool spans   {"tool_name": "tool.name", "tool_input": "tool.input", "tool_output": "tool.output"}
# Trace-level scorer
variable_mappings={"trajectory": "trajectory"}

# LLM span scorer
variable_mappings={"prompt": "llm.prompt", "completion": "llm.completion"}

# Tool span scorer
variable_mappings={
    "tool_name": "tool.name",
    "tool_input": "tool.input",
    "tool_output": "tool.output",
}
Example: trace-level numeric scorer
evals.create_scorer(
    name="Diagnosis Thoroughness",
    scorer_type="llm_judge",
    scope="trace",
    threshold=0.6,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="numeric",
    min_score=0,
    max_score=1,
    llm_template=(
        "Rate the thoroughness of this agent's diagnosis on a scale of 0.0 to 1.0.\n\n"
        "Full execution trajectory:\n{{trajectory}}\n\n"
        "Score (0.0-1.0):"
    ),
    variable_mappings={"trajectory": "trajectory"},
)
Example: span-level categorical scorer (tool spans)
evals.create_scorer(
    name="Tool Usage Quality",
    scorer_type="llm_judge",
    scope="span",
    threshold=0.5,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="categorical",
    choices=[
        {"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
        {"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
    ],
    llm_template=(
        "Evaluate this tool call.\n\n"
        "Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
        "Was this tool call appropriate? Answer Y or N."
    ),
    variable_mappings={
        "tool_name": "tool.name",
        "tool_input": "tool.input",
        "tool_output": "tool.output",
    },
)

3. evaluate

Score a specific trace after it completes:
from cascade import init_tracing, trace_run, evaluate

init_tracing(project="my_agent")

with trace_run("MyAgent") as run:
    ...  # agent code runs here

results = evaluate(run, ["helpfulness", "hallucination"])
for r in results:
    print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")
Parameters
Parameter   Type             Default    Description
run         span or string   required   Span from trace_run() or trace ID string
scorers     list[str]        None       Scorer names, built-in keys, or UUIDs. If None, uses init_tracing(evals=[...])
wait        float            3          Seconds to wait for trace ingestion before evaluating
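Putting the parameters together: run can be a trace ID string instead of a span, scorers can be omitted to fall back to the scorers configured in init_tracing(evals=[...]), and wait can be raised if ingestion is slow. A short sketch (the trace ID is a placeholder):
from cascade import evaluate

# Evaluate a finished trace by ID, using the scorers configured in init_tracing(evals=[...])
results = evaluate("abc123...", wait=10)

# Or name scorers explicitly for this one call
results = evaluate("abc123...", ["helpfulness"], wait=10)

for r in results:
    print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")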

4. evaluate_spans

Run span-level scorers on matching spans within a trace. Use span_type to target LLM or tool spans:
from cascade import CascadeEval

evals = CascadeEval()

# Evaluate all LLM spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[llm_scorer_id],
    span_type="llm",
)

# Evaluate all tool spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[tool_scorer_id],
    span_type="tool",
)

# Optional: filter by span name pattern
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[scorer_id],
    span_type="llm",
    span_name_pattern="claude-3-5-haiku",
)
Parameters
Parameter           Description
trace_id            Trace containing the spans
scorer_ids          List of scorer UUIDs (span-level scorers)
span_type           "llm" or "tool" — filter spans by type
span_name_pattern   Optional substring filter for span names
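To connect sections 2 and 4: register a span-level scorer with create_scorer, then pass its UUID to evaluate_spans. The sketch below assumes the object returned by create_scorer exposes its UUID as an id attribute; depending on your SDK version it may be a dict field instead.
from cascade import CascadeEval

evals = CascadeEval()

# Span-level categorical scorer (same shape as the "Tool Usage Quality" example above)
scorer = evals.create_scorer(
    name="Tool Usage Quality",
    scorer_type="llm_judge",
    scope="span",
    threshold=0.5,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="categorical",
    choices=[
        {"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
        {"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
    ],
    llm_template=(
        "Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
        "Was this tool call appropriate? Answer Y or N."
    ),
    variable_mappings={
        "tool_name": "tool.name",
        "tool_input": "tool.input",
        "tool_output": "tool.output",
    },
)

# Run it against every tool span in a completed trace
result = evals.evaluate_spans(
    trace_id="abc123...",       # placeholder trace ID
    scorer_ids=[scorer.id],     # assumption: the scorer's UUID is exposed as .id
    span_type="tool",
)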