> ## Documentation Index
> Fetch the complete documentation index at: https://docs.runcascade.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluation SDK

> Programmatic evaluation of agent performance with Cascade SDK

Cascade provides a built-in evaluation system to score your agent's performance. The **Evaluator SDK** covers the core evaluation flow: configuring evals, creating scorers, and running evaluations. The **Python SDK** (separate section) covers deeper APIs like failure reports, tasks, and trace access.

### 1. Add evals to init\_tracing

Pass scorer names to `init_tracing()` and every trace is automatically evaluated. No extra code required.

```python theme={null}
from cascade import init_tracing

init_tracing(
    project="my_agent",
    evals=["helpfulness", "hallucination"],
)
# Every trace is now auto-evaluated.
```

**Rubrics from the Platform:** Create rubrics (scorers) in the Cascade dashboard (Evaluations → Rubrics). Use the rubric name in `evals`:

```python theme={null}
init_tracing(
    project="my_agent",
    evals=["helpfulness", "My Custom Rubric", "Proceeding Despite Explicit User Refusal"],
)
```

**Session-level evals:** For multi-turn conversations, use `session_evals` to run scorers when a session ends (via `end_session()`):

```python theme={null}
init_tracing(
    project="chatbot",
    evals=["helpfulness"],
    session_evals=["Proceeding Despite Explicit User Refusal"],
)
```

***

### 2. create\_scorer

Create custom scorers programmatically. Full signature and options:

```python theme={null}
from cascade import CascadeEval

evals = CascadeEval()

scorer = evals.create_scorer(
    name="My Scorer",
    scorer_type="llm_judge",   # or "code"
    scope="trace",             # "trace" | "span" | "trajectory" | "both"
    threshold=0.5,
    description="Optional description",
    llm_provider="anthropic",  # or "openai"
    llm_model="claude-3-5-haiku-20241022",
    llm_template="...",       # Use {{variable}} placeholders
    variable_mappings={...},  # Map placeholders to span attributes
    output_type="numeric",     # or "categorical"
    min_score=0,              # for numeric
    max_score=1,               # for numeric
    choices=[...],            # for categorical
    tags=["tag1", "tag2"],
)
```

**scope**

| Value        | Use case                                          |
| ------------ | ------------------------------------------------- |
| `trace`      | Evaluate the whole trace (one run per trace)      |
| `span`       | Evaluate each span (LLM or tool) individually     |
| `trajectory` | Evaluate the full session trajectory (multi-turn) |
| `both`       | Trace and span scopes                             |

**output\_type and scoring**

* **numeric** — Use `min_score` and `max_score`. Judge returns a number in that range.
* **categorical** — Use `choices`. Judge picks one label; each label has a score.

```python theme={null}
# Numeric (0–1 scale)
output_type="numeric",
min_score=0,
max_score=1,

# Categorical (Y/N style)
output_type="categorical",
choices=[
    {"label": "Y", "score": 1.0, "description": "Appropriate"},
    {"label": "N", "score": 0.0, "description": "Inappropriate"},
],
```

**variable\_mappings**

Maps template placeholders to span attributes. Use the placeholder name as the key and the attribute path as the value.

| Scope          | Common mappings                                                                        |
| -------------- | -------------------------------------------------------------------------------------- |
| **Trace**      | `{"trajectory": "trajectory"}` — full execution trajectory                             |
| **LLM spans**  | `{"prompt": "llm.prompt", "completion": "llm.completion"}`                             |
| **Tool spans** | `{"tool_name": "tool.name", "tool_input": "tool.input", "tool_output": "tool.output"}` |

```python theme={null}
# Trace-level scorer
variable_mappings={"trajectory": "trajectory"}

# LLM span scorer
variable_mappings={"prompt": "llm.prompt", "completion": "llm.completion"}

# Tool span scorer
variable_mappings={
    "tool_name": "tool.name",
    "tool_input": "tool.input",
    "tool_output": "tool.output",
}
```

**Example: trace-level numeric scorer**

```python theme={null}
evals.create_scorer(
    name="Diagnosis Thoroughness",
    scorer_type="llm_judge",
    scope="trace",
    threshold=0.6,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="numeric",
    min_score=0,
    max_score=1,
    llm_template=(
        "Rate the thoroughness of this agent's diagnosis on a scale of 0.0 to 1.0.\n\n"
        "Full execution trajectory:\n{{trajectory}}\n\n"
        "Score (0.0-1.0):"
    ),
    variable_mappings={"trajectory": "trajectory"},
)
```

**Example: span-level categorical scorer (tool spans)**

```python theme={null}
evals.create_scorer(
    name="Tool Usage Quality",
    scorer_type="llm_judge",
    scope="span",
    threshold=0.5,
    llm_provider="anthropic",
    llm_model="claude-3-5-haiku-20241022",
    output_type="categorical",
    choices=[
        {"label": "Y", "score": 1.0, "description": "Appropriate tool usage"},
        {"label": "N", "score": 0.0, "description": "Unnecessary or incorrect usage"},
    ],
    llm_template=(
        "Evaluate this tool call.\n\n"
        "Tool: {{tool_name}}\nInput: {{tool_input}}\nOutput: {{tool_output}}\n\n"
        "Was this tool call appropriate? Answer Y or N."
    ),
    variable_mappings={
        "tool_name": "tool.name",
        "tool_input": "tool.input",
        "tool_output": "tool.output",
    },
)
```

***

### 3. evaluate

Score a specific trace after it completes:

```python theme={null}
from cascade import init_tracing, trace_run, evaluate

init_tracing(project="my_agent")

with trace_run("MyAgent") as run:
    # ... agent code ...

results = evaluate(run, ["helpfulness", "hallucination"])
for r in results:
    print(f"{r['scorer_name']}: {'PASS' if r['passed'] else 'FAIL'} (score: {r['score']})")
```

**Parameters**

| Parameter | Type           | Default  | Description                                                                        |
| --------- | -------------- | -------- | ---------------------------------------------------------------------------------- |
| `run`     | span or string | required | Span from `trace_run()` or trace ID string                                         |
| `scorers` | list\[str]     | None     | Scorer names, built-in keys, or UUIDs. If `None`, uses `init_tracing(evals=[...])` |
| `wait`    | float          | 3        | Seconds to wait for trace ingestion before evaluating                              |

***

### 4. evaluate\_spans

Run span-level scorers on matching spans within a trace. Use `span_type` to target LLM or tool spans:

```python theme={null}
from cascade import CascadeEval

evals = CascadeEval()

# Evaluate all LLM spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[llm_scorer_id],
    span_type="llm",
)

# Evaluate all tool spans
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[tool_scorer_id],
    span_type="tool",
)

# Optional: filter by span name pattern
result = evals.evaluate_spans(
    trace_id="abc123...",
    scorer_ids=[scorer_id],
    span_type="llm",
    span_name_pattern="claude-3-5-haiku",
)
```

**Parameters**

| Parameter           | Description                                |
| ------------------- | ------------------------------------------ |
| `trace_id`          | Trace containing the spans                 |
| `scorer_ids`        | List of scorer UUIDs (span-level scorers)  |
| `span_type`         | `"llm"` or `"tool"` — filter spans by type |
| `span_name_pattern` | Optional substring filter for span names   |
