Cluster list
Each row is a failure cluster: a group of traces that share the same normalised error signature (error type, message skeleton, and top contributing span). Columns:
- Signature — human-readable cluster label (e.g. `timeout in llm:openai.chat.completions`).
- Count — traces in the cluster within the selected window.
- Trend — sparkline of cluster volume, last 24 h.
- First / last seen — helpful for spotting regressions tied to a deploy.
- Impact — unique sessions and unique users affected.
- Status — `new`, `acknowledged`, or `resolved`.
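To make the idea of a normalised error signature concrete, here is a minimal sketch of how traces might be grouped. The field names (`error_type`, `message`, `top_span`, `trace_id`) and the skeleton rules (lower-casing, replacing ids and numbers with placeholders) are assumptions for illustration, not the product's actual normalisation logic.

```python
import re
from collections import defaultdict


def message_skeleton(message: str) -> str:
    """Reduce a message to its stable shape by masking volatile tokens."""
    # Assumption: long hex-like tokens are ids, and bare numbers are volatile.
    skeleton = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message.lower())
    skeleton = re.sub(r"\d+(\.\d+)?", "<n>", skeleton)
    return re.sub(r"\s+", " ", skeleton).strip()


def signature(trace: dict) -> tuple:
    """Build the (error type, message skeleton, top span) grouping key."""
    return (trace["error_type"], message_skeleton(trace["message"]), trace["top_span"])


def cluster_traces(traces: list) -> dict:
    """Group trace ids by their normalised signature."""
    clusters = defaultdict(list)
    for trace in traces:
        clusters[signature(trace)].append(trace["trace_id"])
    return dict(clusters)


# Hypothetical sample traces.
traces = [
    {"trace_id": "t1", "error_type": "Timeout",
     "message": "Request timed out after 30s",
     "top_span": "llm:openai.chat.completions"},
    {"trace_id": "t2", "error_type": "Timeout",
     "message": "Request timed out after 45s",
     "top_span": "llm:openai.chat.completions"},
    {"trace_id": "t3", "error_type": "ValueError",
     "message": "Invalid JSON at position 12",
     "top_span": "parser:json"},
]

clusters = cluster_traces(traces)
```

In this sketch the two timeout traces differ only in the timeout value, so they land in one cluster, while the JSON error forms its own.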
Cluster detail
Click any cluster to open the root-cause view.
- Top contributing spans — the 3–5 span names most frequently marked as the failure origin across traces in the cluster, with counts and average latency.
- Representative error messages — de-duplicated error strings with per-variant counts. Click to pivot to a matching trace.
- Linked traces — paginated list of every trace in the cluster; click through to the trace detail and span waterfall.
- Feedback overlay — any negative user feedback attached to cluster traces shows up here for context.
Filter by `status = resolved` to see resolved clusters.
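The aggregates in the root-cause view can be approximated offline if you export a cluster's traces. The sketch below assumes each exported trace is a dict with hypothetical `origin_span`, `latency_ms`, and `message` keys; it illustrates the kind of counting involved and is not the product's implementation.

```python
from collections import Counter, defaultdict


def summarise_cluster(traces: list, top_n: int = 5) -> dict:
    """Compute top contributing spans and per-variant message counts."""
    span_counts = Counter()
    span_latency_ms = defaultdict(float)
    message_counts = Counter()

    for trace in traces:
        span_counts[trace["origin_span"]] += 1
        span_latency_ms[trace["origin_span"]] += trace["latency_ms"]
        message_counts[trace["message"]] += 1

    top_spans = [
        {"span": name, "count": count,
         "avg_latency_ms": span_latency_ms[name] / count}
        for name, count in span_counts.most_common(top_n)
    ]
    # most_common() with no argument returns every variant, highest count first.
    return {"top_spans": top_spans, "messages": message_counts.most_common()}


# Hypothetical exported traces for one cluster.
cluster_traces_sample = [
    {"origin_span": "llm:openai.chat.completions", "latency_ms": 30000,
     "message": "Request timed out"},
    {"origin_span": "llm:openai.chat.completions", "latency_ms": 31000,
     "message": "Request timed out"},
    {"origin_span": "retriever:search", "latency_ms": 1200,
     "message": "Connection reset"},
]

summary = summarise_cluster(cluster_traces_sample)
```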
Alert rules
From Failures → Alert rules, create rules that fire on cluster or failure-rate conditions. Rule fields:
- Name — shown on the alert payload.
- Trigger — one of:
  - `failure_rate > threshold` over a rolling window (e.g. `> 2% over 5 minutes`)
  - `cluster_count > threshold` for a new or existing cluster
  - `cluster.first_seen` — fires the first time a signature appears
- Scope — project, environment, model, or metadata filter.
- Channel — webhook URL (JSON payload) or email recipients.
- Cooldown — suppress repeated fires within the interval (default 15 minutes).
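To illustrate how a failure-rate trigger and a cooldown interact, here is a small sketch of a rule evaluator. The `FailureRateRule` class, its fields, and the event format are hypothetical; only the 15-minute default cooldown and the "2% over 5 minutes" example come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FailureRateRule:
    name: str
    threshold: float                          # e.g. 0.02 for 2%
    window: timedelta                         # rolling window length
    cooldown: timedelta = timedelta(minutes=15)
    last_fired: Optional[datetime] = None

    def evaluate(self, events: list, now: datetime) -> bool:
        """Return True if the rule fires at `now`.

        `events` is a list of (timestamp, failed) pairs.
        """
        recent = [failed for ts, failed in events
                  if now - self.window <= ts <= now]
        if not recent:
            return False
        failure_rate = sum(recent) / len(recent)
        if failure_rate <= self.threshold:
            return False
        # Suppress repeated fires that fall inside the cooldown interval.
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False
        self.last_fired = now
        return True


rule = FailureRateRule(name="high failure rate", threshold=0.02,
                       window=timedelta(minutes=5))

now = datetime(2024, 1, 1, 12, 0)
# 100 hypothetical traces, one per second, of which 3 failed (3% failure rate).
events = [(now - timedelta(seconds=i), i < 3) for i in range(100)]

fired_first = rule.evaluate(events, now)
fired_again = rule.evaluate(events, now + timedelta(minutes=1))
```

Under these assumptions the first evaluation fires because 3% exceeds the 2% threshold, and the second is suppressed because it falls inside the cooldown.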
Common workflows
- New deploy monitoring. Filter to `first_seen > deploy_time` to see clusters introduced by the latest release.
- Triage the weekly on-call. Sort clusters by Impact, descending, and work top-down.
- Close the loop with ownership. Add a metadata filter (`metadata.team = "payments"`) to alert rules so only the right team gets paged.
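The first two workflows can also be reproduced against exported cluster data. The sketch below assumes each cluster is a dict with hypothetical `first_seen`, `unique_users`, and `unique_sessions` keys, and it ranks impact by users first and sessions second, which is an assumed ordering rather than the documented one.

```python
from datetime import datetime


def introduced_since(clusters: list, deploy_time: datetime) -> list:
    """Keep clusters whose first_seen is later than the deploy time."""
    return [c for c in clusters if c["first_seen"] > deploy_time]


def by_impact(clusters: list) -> list:
    """Sort clusters by impact, highest first (users, then sessions)."""
    return sorted(clusters,
                  key=lambda c: (c["unique_users"], c["unique_sessions"]),
                  reverse=True)


# Hypothetical exported clusters.
clusters = [
    {"signature": "timeout in llm:openai.chat.completions",
     "first_seen": datetime(2024, 5, 1, 9, 0),
     "unique_users": 40, "unique_sessions": 55},
    {"signature": "parse error in tool:json",
     "first_seen": datetime(2024, 5, 2, 15, 0),
     "unique_users": 8, "unique_sessions": 9},
    {"signature": "rate limit in llm:openai.chat.completions",
     "first_seen": datetime(2024, 5, 2, 14, 30),
     "unique_users": 120, "unique_sessions": 150},
]

deploy_time = datetime(2024, 5, 2, 12, 0)
new_clusters = introduced_since(clusters, deploy_time)
triage_order = [c["signature"] for c in by_impact(new_clusters)]
```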