Overview
The Cascade MCP server lets you manage rubrics, run evaluations, inspect traces, and debug failures directly from your IDE, without switching to the Cascade dashboard. It works with Cursor, Claude Code, VS Code, Codex, and any MCP-compatible client.
Everything you do through the MCP server is reflected in the Cascade dashboard in real time. Create a rubric from your IDE and it shows up in the Rubrics page. Run an evaluation and the results appear in the trace detail view. Set up active evals and failures flow into the Failures page. The MCP server and the dashboard are two interfaces to the same system: use whichever fits your workflow, or both.
Setup
You need two things: your Cascade backend URL and your API key (found in Dashboard → Settings).
Cursor
Add to your project’s .cursor/mcp.json:
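A minimal sketch of the config, assuming a stdio-launched server. The package name, env variable names, and URL below are placeholders, not the actual values for your Cascade deployment; substitute your real backend URL and API key from Dashboard → Settings:

```json
{
  "mcpServers": {
    "cascade": {
      "command": "npx",
      "args": ["-y", "cascade-mcp-server"],
      "env": {
        "CASCADE_BACKEND_URL": "https://your-cascade-backend.example.com",
        "CASCADE_API_KEY": "YOUR_API_KEY"
      }
    }
  }
}
```

Restart Cursor (or reload the MCP servers panel) after editing the file so the server is picked up.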
Claude Code
Run /mcp in a Claude Code session to authenticate.
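Claude Code can also load a project-scoped .mcp.json from the repository root. As a hedged sketch (the transport type and header name depend on how your Cascade server is exposed; the URL is a placeholder):

```json
{
  "mcpServers": {
    "cascade": {
      "type": "http",
      "url": "https://your-cascade-backend.example.com/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_CASCADE_API_KEY"
      }
    }
  }
}
```

With this in place, /mcp lets you verify the connection and complete any authentication step.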
Example Workflows
You interact with Cascade through natural language. The AI assistant calls the right tools behind the scenes.
Browse your agent’s activity
“What projects do I have in Cascade?”
Returns your projects with trace counts and time ranges.
“Show me the last 5 traces for travel_agent”
Returns recent traces with status, duration, and span counts. Pick one to inspect further.
“Show me the execution tree for that trace”
Returns the hierarchical structure: which agents ran, what LLM calls were made, which tools were called, and how they’re nested.
“Show me the exact LLM prompt for that hallucination span”
Drills into a specific span to see the full prompt, completion, model, tokens, and cost.
Run evaluations on demand
“Run hallucination and helpfulness on my latest trace”
Evaluates the most recent trace against those rubrics and returns pass/fail with scores and reasoning for each.
“Check tool correctness on all LLM calls in that trace”
Runs span-level evaluation on every LLM span in the trace individually, showing which specific calls passed or failed.
Create rubrics
“Create a rubric that checks if my agent always confirms before booking”
Creates a new rubric on your project with an appropriate evaluation prompt, variables, and threshold. The more specific your prompt, the better the rubric. Compare:
Vague (will produce unreliable scores):
“Create a rubric that checks if my agent is good”
Specific (clear criteria, the AI knows exactly what to build):
“Create a trace-level rubric for travel_agent called ‘Booking Confirmation Check’. It should be categorical Y/N. Check whether the agent explicitly asks the user for confirmation before calling the book_flight or book_hotel tools. If the agent books without asking, it should fail.”
Span-level rubric + active eval task in one go:
“I want to create a new rubric for our travel_agent. Create a span-level eval that checks whether the tool output includes temperature in Fahrenheit and gives information about wind. Then create an active eval task that runs this rubric specifically on the get_weather_forecast tool.”
This creates the rubric scoped to tool spans and sets up a scheduled task filtered to span_type="tool" with span_name_pattern="get_weather_forecast" — all in one conversation.
“My agent keeps making redundant API calls — create a rubric to catch that”
Uses generate_scorer_from_comment — analyzes a recent trace and auto-generates a rubric tailored to detect the specific issue you described. Best for when you’ve observed a problem but don’t want to craft the evaluation prompt yourself.
“What pre-built rubrics are available?”
Lists Cascade’s built-in templates: helpfulness, hallucination, tool usage efficiency, toxicity, conciseness, and more. To activate one on your project:
“Activate the hallucination and helpfulness rubrics on my travel_agent project”
Set up monitoring
“Set up active evals on customer_support_chatbot with hallucination and helpfulness”
Creates a scheduled evaluation task that automatically evaluates every new trace as it completes. Failures show up in the Failures page in real time.
Scoped monitoring — target specific agents or tools:
“Set up active evals on travel_agent with my ‘LLM Response Quality’ rubric, but only on LLM spans from PlannerAgent”
This creates a task with agent_name="PlannerAgent" and span_type="llm", so only PlannerAgent’s LLM calls get evaluated.
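To make the scoping concrete, the create_task arguments for a request like that might look roughly as follows. Only agent_name, span_type, and span_name_pattern are confirmed by this document; the other field names here are illustrative guesses at the schema:

```json
{
  "project": "travel_agent",
  "scorer_names": ["LLM Response Quality"],
  "task_type": "active_eval",
  "agent_name": "PlannerAgent",
  "span_type": "llm"
}
```

You never write this payload yourself; the assistant constructs it from your natural-language request.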
“What’s actively monitoring my travel_agent?”
Shows which rubrics are running as active evals on that project, with their scopes and thresholds.
“Pause active evals on travel_agent, I’m doing a refactor”
Pauses auto-evaluation. Resume when ready.
Backtest historical traces
“Backtest my hallucination rubric on the last 2 weeks of traces for travel_agent”
Creates a batch task that runs the rubric against all historical traces in the date range. The task runs automatically after creation; check the results when it completes.
“I just updated my hallucination rubric’s prompt. Backtest it on all traces from the last month to see if the new prompt catches more issues.”
Creates a batch task with date_from set to 30 days ago and runs it; once it finishes, ask for the results to compare against previous scores.
“Show me the backtest results”
Returns pass rates, average scores, and individual results per trace.
Evaluate sessions
“Run my ‘Multi-Turn Consistency’ rubric on the latest session for customer_support_chatbot”
Finds the most recent session, evaluates the combined trajectory across all turns, and returns whether the agent maintained context throughout the conversation.
“Create a session-level rubric that checks if the agent ever proceeds with an action after the user says no. Then run it on my last 3 sessions.”
Creates a session-scoped rubric using {{session_trajectory}}, then evaluates it against recent sessions.
End-to-end: rubric → monitoring → iterate
Here’s the full workflow from creating a rubric to having it run automatically on every new trace:
Step 1 — Create the rubric:
“Create a numeric rubric for travel_agent called ‘Response Accuracy’. Score 0 to 1 based on whether the agent’s final response accurately reflects the data returned by tool calls. If the agent states prices, dates, or availability that don’t match what the tools returned, score low. Threshold 0.7.”
This creates a trace-level rubric with output_type="numeric", min_score=0, max_score=1, and threshold=0.7.
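Sketched as a create_scorer call, the rubric above might look something like this. The output_type, min_score, max_score, and threshold values are the ones stated above; the remaining field names and the prompt wording are illustrative assumptions about the schema:

```json
{
  "project": "travel_agent",
  "name": "Response Accuracy",
  "scope": "trace",
  "output_type": "numeric",
  "min_score": 0,
  "max_score": 1,
  "threshold": 0.7,
  "prompt": "Score 0 to 1 how accurately the final response {{actual_output}} reflects the data in {{tool_calls}}. Score low if stated prices, dates, or availability contradict the tool outputs."
}
```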
Step 2 — Set up active eval:
“Create an active eval task on travel_agent that runs ‘Response Accuracy’ on every new trace”
Now every trace that completes on travel_agent is automatically evaluated. Failures appear in the Failures page.
Step 3 — Run your agent and check:
Run your agent from your terminal as usual. Then:
“How did my latest trace do?”
The active eval has already scored it. You’ll see the result immediately — no manual evaluation needed. If it failed, ask:
“What caused the failure?”
And you’ll get the reasoning, trace tree, and exact span that caused the issue.
Debug failures
“What’s my failure breakdown this week?”
Returns aggregate stats: total failures, daily trend, and which rubrics fail most.
“Show me the hallucination failures”
Lists recent hallucination failures with the scorer’s reasoning explaining why each one failed.
“Show me the trace tree for that failure”
Opens the execution tree so you can see exactly where in the agent’s flow the issue occurred.
Iteration loop
- First time: Run your agent — this will also create a project.
- Once the project is created: Create your rubrics, set up auto evals, and so on.
- See how well the rubrics do, and iterate.
Available Tools
Projects & Traces
| Tool | Description |
|---|---|
| list_projects | List all projects in your organization |
| list_traces | List recent traces for a project with filtering |
| get_trace | Get all spans for a trace with full data |
| get_trace_tree | Get a trace as a lightweight hierarchical tree |
| get_span | Get full details for a specific span |
Sessions
| Tool | Description |
|---|---|
| list_sessions | List multi-turn sessions for a project |
| get_session | Get a session with all its traces |
Rubrics
| Tool | Description |
|---|---|
| list_scorers | List rubrics for a project |
| get_scorer | Get full rubric configuration |
| list_builtin_scorers | List pre-built rubric templates |
| create_scorer | Create a new rubric |
| update_scorer | Update an existing rubric |
| delete_scorer | Delete a rubric (results preserved) |
| generate_scorer_from_comment | Auto-generate a rubric from a description |
Evaluations
| Tool | Description |
|---|---|
| evaluate_trace | Run rubrics on a trace |
| evaluate_spans | Run span-level rubrics on matching spans |
| evaluate_session | Run session-level rubrics on a session |
Results
| Tool | Description |
|---|---|
| list_results | List past evaluation results with filters |
Tasks
| Tool | Description |
|---|---|
| create_task | Create a backtest or active eval task |
| list_tasks | List evaluation tasks |
| get_task | Get task details and progress |
| run_task | Run a batch task |
| pause_task | Pause an active eval |
| resume_task | Resume a paused active eval |
| delete_task | Delete a task |
| get_task_results | Get results from a task run |
Failures
| Tool | Description |
|---|---|
| list_failures | List evaluation failures with reasoning |
| get_failure_stats | Get aggregate failure statistics and trends |
Convenience
| Tool | Description |
|---|---|
| get_active_evals | Get rubrics actively monitoring a project |
Template Variables
When creating rubrics, use {{variable}} placeholders in the evaluation prompt. They’re automatically mapped to the right data based on the rubric’s scope.
Trace-level (scope="trace")
| Variable | Description |
|---|---|
| {{input}} | Initial input to the agent |
| {{actual_output}} | Final output of the trace |
| {{trajectory}} | Full execution trajectory (all LLM and tool calls grouped by agent) |
| {{tool_calls}} | All tool calls with name, input, and output |
| {{llm_calls}} | All LLM calls with model, prompt, and completion |
| {{context}} | Retrieval context (if available) |
| {{duration}} | Total trace duration |
| {{has_error}} | Whether the trace has errors |
| {{span_count}} | Total number of spans |
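As an illustration of how these placeholders are used, a trace-level evaluation prompt might read as follows (the wording is a sketch, not a built-in template; only the variable names come from the tables here):

```text
Evaluate whether the agent's final answer is grounded in its tool results.

User request: {{input}}
Agent's final response: {{actual_output}}
Tool calls and outputs: {{tool_calls}}

Score 1 if every factual claim in the response is supported by a tool
output, and 0 if the response contradicts or invents tool data.
```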
Span-level for LLM spans (scope="span")
| Variable | Description |
|---|---|
| {{prompt}} | LLM input prompt/messages |
| {{completion}} | LLM completion text |
| {{model}} | Model name |
Span-level for tool spans (scope="span")
| Variable | Description |
|---|---|
| {{tool_name}} | Tool function name |
| {{tool_input}} | Tool input parameters |
| {{tool_output}} | Tool return value |
Session-level (scope="session")
| Variable | Description |
|---|---|
| {{session_trajectory}} | Full trajectory across all traces in the session |
Key Concepts
Project — A named agent or service (e.g. travel_agent, customer_support_chatbot). All data is scoped to a project.
Trace — One complete agent execution. Contains a hierarchy of spans showing every step the agent took.
Span — A single step within a trace: an LLM call, tool call, sub-agent delegation, or function. Spans are nested as parent-child.
Session — A group of traces from the same multi-turn conversation. Used for chatbots and conversational agents.
Scorer / Rubric — An evaluation criterion that scores agent behavior. Can be trace-level (evaluates the whole execution), span-level (evaluates individual steps), or session-level (evaluates across turns). Outputs a numeric score (0-1) or a categorical label (e.g. Y/N) with a pass/fail threshold.
Task — An evaluation job. Batch tasks (backtests) run rubrics on historical traces. Scheduled tasks (active evals) automatically evaluate every new trace as it completes.
FAQ
Do MCP actions show up in the dashboard?
Yes. Everything is the same data. Rubrics created via MCP appear in the Rubrics page, evaluation results appear in trace detail views, and failures from MCP-triggered evals show up in the Failures page.
Can I use scorer names instead of UUIDs?
Yes. Tools like evaluate_trace, evaluate_spans, evaluate_session, and create_task all accept human-readable scorer names. They’re resolved to UUIDs automatically.
What happens when I pause active evals?
New traces stop being auto-evaluated until you resume. Existing results are preserved.
Do I need the dashboard open to use the MCP?
No. The MCP talks directly to the Cascade backend. The dashboard is optional — use it when you want visual charts, trend analysis, or trace tree visualization.
Can I scope evaluations to a specific agent or tool?
Yes. When creating tasks, you can set agent_name to scope to a specific agent, and span_type / span_name_pattern to target specific span types or tool names.