Overview

Safety Signals are the runtime indicators Cascade uses to evaluate agent behavior against defined safety policies and historical baselines. As agents execute, Cascade continuously analyzes tool calls, reasoning traces, and outputs to identify policy violations, behavioral drift, and anomalous patterns. These signals provide both immediate feedback for enforcement and longer-term insight into how agent behavior evolves over time. Safety Signals are designed to support enforcement, debugging, and iteration without relying on static checks or offline evaluation.

Policy Violation Signals

Policy violation signals are generated when agent behavior does not comply with the active safety policies. Each tool call and reasoning trace is evaluated against every policy selected for the agent or workflow. When a violation is identified, Cascade records the signal and applies an action based on the configured enforcement mode. Signals are evaluated in real time and always within the full execution trace, so each violation is interpreted in context rather than in isolation.
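
The flow described above can be sketched roughly as follows. This is an illustration only, not Cascade's API: the `Event`, `Policy`, and `ViolationSignal` types, the `evaluate_trace` function, the `unconfirmed-delete` policy, and the `monitor` / `enforce` mode names are all hypothetical. The sketch shows each event being checked against every selected policy, with the earlier part of the trace passed along as context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    kind: str      # "tool_call" or "reasoning"
    content: str


@dataclass
class Policy:
    name: str
    # Receives the current event plus every earlier event in the trace.
    violates: Callable[[Event, List[Event]], bool]


@dataclass
class ViolationSignal:
    policy: str
    event_index: int
    action: str


# Hypothetical enforcement modes and the action each one applies.
ACTIONS = {"monitor": "logged", "enforce": "blocked"}


def evaluate_trace(trace, policies, enforcement_mode):
    """Check every event against every policy, using earlier events as context."""
    action = ACTIONS[enforcement_mode]
    signals = []
    for index, event in enumerate(trace):
        earlier = trace[:index]
        for policy in policies:
            if policy.violates(event, earlier):
                signals.append(ViolationSignal(policy.name, index, action))
    return signals


def unconfirmed_delete(event, earlier):
    """Hypothetical policy: a delete is a violation unless a confirmation came first."""
    if event.kind != "tool_call" or not event.content.startswith("delete_file"):
        return False
    return not any(e.content == "confirm_delete" for e in earlier)


trace = [
    Event("reasoning", "plan cleanup"),
    Event("tool_call", "delete_file a.txt"),
    Event("tool_call", "confirm_delete"),
    Event("tool_call", "delete_file b.txt"),
]
signals = evaluate_trace(
    trace, [Policy("unconfirmed-delete", unconfirmed_delete)], "enforce"
)
for signal in signals:
    print(signal)
```

In this sketch the two delete calls are identical in form, but only the first is flagged, because the second is preceded by a confirmation in the trace. That is the practical meaning of evaluating in context rather than in isolation.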

Drift Signals

Drift signals surface longer-term changes in agent behavior across executions. In addition to evaluating individual events, Cascade maintains a rolling baseline derived from recently completed agent runs. At the end of each run, the system compares that run’s behavior against the baseline and surfaces as drift signals only those deviations that exceed the significance thresholds.

Key Characteristics

  • Baselines are built from recent agent executions
  • Comparisons occur after run completion
  • Only statistically significant deviations are surfaced
  • Drift signals do not trigger alerts or enforcement actions
Drift signals are intended to help teams understand how changes in prompts, tools, models, or infrastructure affect agent behavior over time.
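
As a rough illustration of comparing a run against a rolling baseline, the sketch below tracks a single numeric metric per run and flags a run whose z-score against recent runs exceeds a threshold. The statistical test Cascade actually uses is not specified here, so the z-score, the `RollingBaseline` class, the window size, the threshold, and the tool-calls-per-run metric are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Hypothetical rolling baseline over one per-run metric."""

    def __init__(self, window=20, z_threshold=3.0, min_runs=5):
        self.values = deque(maxlen=window)  # oldest runs fall out automatically
        self.z_threshold = z_threshold
        self.min_runs = min_runs

    def check_and_update(self, value):
        """Compare a completed run to the baseline, then add it to the baseline.

        Returns the z-score when the deviation exceeds the threshold,
        otherwise None.
        """
        drift = None
        if len(self.values) >= self.min_runs:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) > self.z_threshold:
                    drift = z
        self.values.append(value)
        return drift


baseline = RollingBaseline(window=10, z_threshold=3.0, min_runs=5)

# Assumed metric: number of tool calls made in each completed run.
tool_calls_per_run = [4, 5, 6, 5, 4, 6, 5, 20]
results = [baseline.check_and_update(v) for v in tool_calls_per_run]

for value, drift in zip(tool_calls_per_run, results):
    status = "drift" if drift is not None else "ok"
    print(value, status)
```

Only the final run is reported here; the ordinary run-to-run variation in the earlier values stays below the threshold. Note that the sketch adds every completed run to the baseline, including a drifting one, which is a design choice of the example rather than documented behavior.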

Classification Signals

Classification signals label agent behavior using structured semantic categories. Cascade uses internal categorization models to classify each model event into one of 12 categories:
  • Informational
  • Out of Scope
  • Unsafe
  • Harmful
  • Hateful
  • Sexual Content
  • Violence
  • Self Harm
  • Deceptive
  • Privacy Risk
  • Criminal
  • None
Each classification is associated with a confidence score. These confidence scores are used during categorization policy evaluation to determine whether behavior is considered out of policy. Classification signals do not enforce behavior directly. They provide structured inputs that policies and enforcement modes act upon.
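
One way to picture how a confidence score feeds policy evaluation is a per-category threshold check, sketched below. The `Classification` type, the `out_of_policy` function, and the threshold values are hypothetical and chosen only for illustration; the category names are the ones listed above.

```python
from dataclasses import dataclass

CATEGORIES = (
    "Informational", "Out of Scope", "Unsafe", "Harmful", "Hateful",
    "Sexual Content", "Violence", "Self Harm", "Deceptive",
    "Privacy Risk", "Criminal", "None",
)


@dataclass(frozen=True)
class Classification:
    category: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def out_of_policy(classification, thresholds):
    """Return True when the category is restricted and confidence meets its threshold.

    `thresholds` maps a category to the minimum confidence at which it counts
    as out of policy. Categories missing from the mapping are always allowed.
    """
    if classification.category not in CATEGORIES:
        raise ValueError(f"unknown category: {classification.category}")
    limit = thresholds.get(classification.category)
    return limit is not None and classification.confidence >= limit


# Hypothetical policy configuration.
thresholds = {"Harmful": 0.8, "Privacy Risk": 0.6}

print(out_of_policy(Classification("Harmful", 0.9), thresholds))
print(out_of_policy(Classification("Harmful", 0.5), thresholds))
print(out_of_policy(Classification("Informational", 0.99), thresholds))
```

The function only reports whether behavior is out of policy. What happens next is left to the enforcement mode, which mirrors the point that classification signals do not enforce behavior directly.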

Safety Metrics

Safety Signals are aggregated into metrics that provide visibility into agent behavior and policy effectiveness. Collected metrics include:
  • Policy violations by type and severity
  • Actions taken in response to violations
  • Distribution of classification categories
  • Statistically significant drift signals
  • Trends over time by agent or workflow
These metrics are surfaced in the Cascade dashboard and are used to monitor safety posture, tune policies, and identify high-risk agents or workflows.
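
To make the aggregation concrete, the sketch below rolls a flat list of signals up into counts that correspond to the first four metrics listed above. The `Signal` record, its field names, the agent names, and the policy labels are assumptions for the example and do not describe Cascade's data model. Timestamps are omitted, so trends over time are not covered.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    agent: str
    kind: str           # "violation", "classification", or "drift"
    label: str          # policy type, category name, or drifting metric
    severity: str = ""  # violations only
    action: str = ""    # violations only


def aggregate(signals):
    """Count signals along the dimensions a dashboard might display."""
    metrics = {
        "violations": Counter(),      # keyed by (policy type, severity)
        "actions": Counter(),         # keyed by action taken
        "categories": Counter(),      # keyed by classification category
        "drift_by_agent": Counter(),  # keyed by agent
    }
    for s in signals:
        if s.kind == "violation":
            metrics["violations"][(s.label, s.severity)] += 1
            metrics["actions"][s.action] += 1
        elif s.kind == "classification":
            metrics["categories"][s.label] += 1
        elif s.kind == "drift":
            metrics["drift_by_agent"][s.agent] += 1
    return metrics


signals = [
    Signal("billing-agent", "violation", "pii-leak", "high", "blocked"),
    Signal("billing-agent", "violation", "pii-leak", "high", "blocked"),
    Signal("search-agent", "violation", "scope", "low", "logged"),
    Signal("search-agent", "classification", "Informational"),
    Signal("search-agent", "classification", "Out of Scope"),
    Signal("billing-agent", "drift", "tool_calls_per_run"),
]
metrics = aggregate(signals)

for name, counts in metrics.items():
    print(name, dict(counts))
```

Grouping the same counters by agent or workflow would give a simple way to spot the high-risk agents mentioned above.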