Documentation Index
Fetch the complete documentation index at: https://docs.runcascade.com/llms.txt
Use this file to discover all available pages before exploring further.
Cascade provides a built-in evaluation system to score your agent’s performance. The Evaluator SDK covers the core evaluation flow: configuring evals, creating scorers, and running evaluations. The Python SDK (separate section) covers deeper APIs like failure reports, tasks, and trace access.
1. Add evals to init_tracing
Pass scorer names to init_tracing() and every trace is automatically evaluated. No extra code required.
from cascade import init_tracing
init_tracing(
project="my_agent",
evals=["helpfulness", "hallucination"],
)
# Every trace is now auto-evaluated.
Rubrics from the Platform: Create rubrics (scorers) in the Cascade dashboard (Evaluations → Rubrics). Use the rubric name in evals:
init_tracing(
project="my_agent",
evals=["helpfulness", "My Custom Rubric", "Proceeding Despite Explicit User Refusal"],
)
Session-level evals: For multi-turn conversations, use session_evals to run scorers when a session ends (via end_session()):
init_tracing(
project="chatbot",
evals=["helpfulness"],
session_evals=["Proceeding Despite Explicit User Refusal"],
)
2. create_scorer
Create custom scorers programmatically. Full signature and options:
from cascade import CascadeEval
evals = CascadeEval()
scorer = evals.create_scorer(
name="My Scorer",
scorer_type="llm_judge", # or "code"
scope="trace", # "trace" | "span" | "trajectory" | "both"
threshold=0.5,
description="Optional description",
llm_provider="anthropic", # or "openai"
llm_model="claude-3-5-haiku-20241022",
llm_template="...", # Use {{variable}} placeholders
variable_mappings={...}, # Map placeholders to span attributes
output_type="numeric", # or "categorical"
min_score=0, # for numeric
max_score=1, # for numeric
choices=[...], # for categorical
tags=["tag1", "tag2"],
)
scope
| Value | Use case |
|---|
trace | Evaluate the whole trace (one run per trace) |
span | Evaluate each span (LLM or tool) individually |
trajectory | Evaluate the full session trajectory (multi-turn) |
both | Trace and span scopes |
output_type and scoring
- numeric — Use
min_score and max_score. Judge returns a number in that range.
- categorical — Use
choices. Judge picks one label; each label has a score.
# Numeric (0–1 scale)
output_type="numeric",
min_score=0,
max_score=1,
# Categorical (Y/N style)
output_type="categorical",
choices=[
{"label": "Y", "score": 1.0, "description": "Appropriate"},
{"label": "N", "score": 0.0, "description": "Inappropriate"},
],
variable_mappings
Maps template placeholders to span attributes. Use the placeholder name as the key and the attribute path as the value.
| Scope | Common mappings |
|---|
| Trace | {"trajectory": "trajectory"} — full execution trajectory |
| LLM spans | {"prompt": "llm.prompt", "completion": "llm.completion"} |
| Tool spans | {"tool_name": "tool.name", "tool_input": "tool.input", "tool_output": "tool.output"} |
# Trace-level scorer
variable_mappings={"trajectory": "trajectory"}
# LLM span scorer
variable_mappings={"prompt": "llm.prompt", "completion": "llm.completion"}
# Tool span scorer
variable_mappings={
"tool_name": "tool.name",
"tool_input": "tool.input",
"tool_output": "tool.output",
}
Example: trace-level numeric scorer
evals.create_scorer(
name="Diagnosis Thoroughness",
scorer_type="llm_judge",
scope="trace",
threshold=0.6,
llm_provider="anthropic",
llm_model="claude-3-5-haiku-20241022",
output_type="numeric",
min_score=0,
max_score=1,
llm_template=(
"Rate the thoroughness of this agent's diagnosis on a scale of 0.0 to 1.0.\n\n"
"Full execution trajectory:\n{{trajectory}}\n\n"
"Score (0.0-1.0):"
),
variable_mappings={"trajectory": "trajectory"},
)
Example: span-level categorical scorer (tool spans)
evals.create_scorer(
name="Tool Usage Quality",
scorer_type="llm_judge",
scope="span",
threshold=0.5,
llm_provider="anthropic",
llm_model="claude-3-5-haiku-20241022",
output_type="categorical",
choices=[
{"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
{"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
],
llm_template=(
"Evaluate this tool call.\n\n"
"Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
"Was this tool call appropriate? Answer Y or N."
),
variable_mappings={
"tool_name": "tool.name",
"tool_input": "tool.input",
"tool_output": "tool.output",
},
)
3. evaluate
Score a specific trace after it completes:
from cascade import init_tracing, trace_run, evaluate
init_tracing(project="my_agent")
with trace_run("MyAgent") as run:
# ... agent code ...
results = evaluate(run, ["helpfulness", "hallucination"])
for r in results:
print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")
Parameters
| Parameter | Type | Default | Description |
|---|
run | span or string | required | Span from trace_run() or trace ID string |
scorers | list[str] | None | Scorer names, built-in keys, or UUIDs. If None, uses init_tracing(evals=[...]) |
wait | float | 3 | Seconds to wait for trace ingestion before evaluating |
4. evaluate_spans
Run span-level scorers on matching spans within a trace. Use span_type to target LLM or tool spans:
from cascade import CascadeEval
evals = CascadeEval()
# Evaluate all LLM spans
result = evals.evaluate_spans(
trace_id="abc123...",
scorer_ids=[llm_scorer_id],
span_type="llm",
)
# Evaluate all tool spans
result = evals.evaluate_spans(
trace_id="abc123...",
scorer_ids=[tool_scorer_id],
span_type="tool",
)
# Optional: filter by span name pattern
result = evals.evaluate_spans(
trace_id="abc123...",
scorer_ids=[scorer_id],
span_type="llm",
span_name_pattern="claude-3-5-haiku",
)
Parameters
| Parameter | Description |
|---|
trace_id | Trace containing the spans |
scorer_ids | List of scorer UUIDs (span-level scorers) |
span_type | "llm" or "tool" — filter spans by type |
span_name_pattern | Optional substring filter for span names |