The Ingestion health page is a per-project operational dashboard for the path your spans take from SDK to storage. When traces aren’t showing up, or something looks off, this is the first place to look.
Why this exists
You shouldn’t need to email support to know whether your data is flowing. Every project gets a live view of ingest success rate, latency, dead-letter depth, top error categories, and redaction activity. If the dashboard is green, the problem is almost always in your app or network. If it’s red, it points you at the exact failure mode.
How to get there
Settings → Projects → select a project → Health tab. The same data is also summarised on the Traces page in a compact tile in the top-right — click through from there to open the full dashboard.
Stat cards
The top of the page shows six cards. All values respect the window selector.
Success rate
Percentage of ingest requests that landed a span in storage. Anything above 99.5% is normal background noise. A sustained drop below 99% usually means a bad deploy, a credential rotation, or a schema mismatch.
Error rate
Inverse of the success rate, broken down by category (see Top errors below). Shown as both a percentage and an absolute count so you can distinguish “small sample, one failure” from “sustained outage”.
Ingest lag (p50 / p95 / p99)
End-to-end latency from SDK flush() to span visible in the dashboard, in milliseconds (see the sketch after this list for a way to sample flush timing on the client). Healthy ranges:
- p50 — under 500 ms
- p95 — under 2,000 ms
- p99 — under 5,000 ms
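If you want a client-side baseline to compare against these numbers, a minimal sketch like the following times each flush() call and computes the same percentiles locally. The client object is a hypothetical stand-in for whatever SDK handle your app already has, and this only measures the client-side call, not full end-to-end visibility — a large gap between these samples and the dashboard's lag points at server-side or queueing time rather than your network.

```python
import time
import statistics

flush_ms = []  # client-side flush() durations, in milliseconds

def timed_flush(client):
    """Call the SDK's flush() (hypothetical handle) and record how long the call took."""
    start = time.perf_counter()
    client.flush()  # assumption: your SDK client exposes flush(), as described above
    flush_ms.append((time.perf_counter() - start) * 1000)

def report():
    """Print p50/p95/p99 of the sampled flush durations for comparison with the dashboard."""
    if len(flush_ms) < 2:
        return
    cuts = statistics.quantiles(flush_ms, n=100)  # 99 cut points
    print(f"p50={cuts[49]:.0f}ms p95={cuts[94]:.0f}ms p99={cuts[98]:.0f}ms")
```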
DLQ depth
Number of spans parked in the dead-letter queue for this project. Spans land in the DLQ when ingest fails in a way that isn’t worth retrying inline — usually schema validation errors or over-size payloads. DLQ depth should be zero. If it isn’t, see Troubleshooting below.
Last successful span
Timestamp of the most recently accepted span. If this is more than a few minutes stale while your app is running, ingest is stuck for this project even if overall success rate looks okay — you may be hitting a per-project rate limit or a project-scoped auth problem.
Redaction matches
Count of fields your redaction rules matched and scrubbed in this window. Useful as a sanity check — if you added a new rule and this counter stays at zero, your rule probably isn’t matching.
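If your rules are regex-based — an assumption; check your actual rule format — you can sanity-check a pattern against a representative payload locally before trusting the counter. The pattern and field names below are illustrative only:

```python
import re

# Hypothetical rule: this pattern is an example, not a shipped default.
EMAIL_RULE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

sample_payload = {
    "prompt": "Contact jane.doe@example.com about the invoice",
    "tool_result": "no sensitive data here",
}

# One count per matched field value, mirroring how the card describes its counter.
matches = sum(bool(EMAIL_RULE.search(str(v))) for v in sample_payload.values())
print(f"rule matched {matches} field(s)")  # 0 here means it won't match in production either
```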
Window selector
A segmented control at the top-right of the page controls the time range for every card and table.
- 1h — use during an active incident or right after a deploy. Tightest signal, noisiest numbers.
- 24h — default for daily ops. Good balance of signal and stability.
- 7d — use for trend analysis (“did error rate creep up this week?”). Avoid it for incident response; rolling windows smear short outages.
Top errors
A table below the stat cards breaks errors down by category. Columns:
- Category — one of the values below
- Count — occurrences in the window
- Last seen — timestamp of the most recent occurrence
- Example — redacted error message from a recent instance
Categories
- auth — rejected API key or expired token. Usually a rotated key that didn’t make it into your deploy env.
- schema — span payload didn’t match the expected shape. Almost always an SDK version mismatch or a hand-rolled HTTP call.
- rate_limit — you’re over the per-project ingest quota. Check your plan or batch more aggressively.
- payload_too_large — a single span exceeded the size cap (1 MB). Usually a prompt or tool-call result that needs trimming or redacting before it hits trace() (see the sketch after this list).
- downstream — our side. If you see this, check status.trulayer.ai; we’re already paged.
- unknown — anything we couldn’t classify. If this is non-trivial, send the trace IDs to support.
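For payload_too_large specifically, the usual fix is to cap long free-text fields before the span is built. A minimal sketch, assuming a trace() entry point as mentioned above — the client handle and field names are illustrative, not the SDK's actual signature:

```python
MAX_FIELD_BYTES = 64 * 1024  # leave plenty of headroom under the 1 MB span cap

def clamp(text: str, limit: int = MAX_FIELD_BYTES) -> str:
    """Truncate a field so one giant prompt or tool result can't blow the span size cap."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    return encoded[:limit].decode("utf-8", errors="ignore") + "…[truncated]"

# Hypothetical usage — adapt to however your code actually calls trace():
# client.trace(name="tool_call", input=clamp(prompt), output=clamp(tool_result))
```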
Roles and permissions
Any member of the project can view this page — it’s read-only operational data, not billing or credentials. DLQ replay (re-ingesting parked spans after you fix the root cause) requires the owner role. The replay control isn’t shipped yet; for now, contact support if you need spans reprocessed.
Troubleshooting
DLQ depth > 0 — what to do
- Open Top errors and find the category driving the count — almost always schema or payload_too_large.
- Click through to the example — it’ll show the offending span’s ID and the validation failure.
- Fix the root cause in your app (update SDK, trim payload, correct field type).
- Redeploy and confirm new spans land successfully (success rate back to ~100%, no new DLQ additions).
- Ask support to replay the DLQ once you’ve confirmed the fix is live — otherwise the replayed spans will just fail again.
High error rate — common causes
In rough order of frequency:
- Wrong or rotated API key. Check TRULAYER_API_KEY in your deploy env matches the key in Settings → API keys. Rotated keys are the #1 cause of sudden auth spikes.
- SDK version skew. An old SDK sending a deprecated field, or a new SDK sending a field the server hasn’t shipped yet. Pin your SDK version and upgrade deliberately.
- Rate limit. Sudden rate_limit spikes usually mean a batch job or retry storm. Add jitter and batch with flush() less frequently (see the sketch after this list).
- Oversize payloads. Long prompts or tool-call results will trip payload_too_large. Redact or truncate before tracing.
- Schema mismatch from hand-rolled HTTP. If you’re not using the SDK, you’re on your own for schema drift. Use the SDK.
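Two of these are cheap to guard against in code. The sketch below fails fast at startup if TRULAYER_API_KEY is missing, and wraps flush() with jittered backoff so a retry storm doesn't amplify a rate_limit spike. The client handle and the idea that flush() raises on failure are assumptions — adapt to your SDK's actual error behaviour.

```python
import os
import random
import time

# Fail fast if the key never made it into the deploy environment.
if not os.environ.get("TRULAYER_API_KEY"):
    raise RuntimeError("TRULAYER_API_KEY is not set; spans will be rejected with auth errors")

def flush_with_backoff(client, attempts: int = 3) -> None:
    """Retry flush() with jittered exponential backoff instead of hammering the ingest quota."""
    for attempt in range(attempts):
        try:
            client.flush()  # assumption: flush() raises on failure in your SDK
            return
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter so concurrent workers don't retry in lockstep.
            time.sleep((2 ** attempt) * 0.5 + random.uniform(0, 0.5))
```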
Ingest lag high but errors low
Your spans are landing, just slowly. Check:
- Are you calling flush() synchronously on a hot path? Move it off the request path (see the sketch after this list).
- Is your network egress congested? p95 is usually dominated by client-side network, not our ingest.
- Is it a specific region? If so, ping support — we may have a regional hot spot.
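A minimal way to get flush() off the request path — assuming your SDK buffers spans and flush() is safe to call from another thread, which you should confirm first — is a periodic background flusher:

```python
import threading

def start_background_flusher(client, interval_s: float = 2.0) -> threading.Event:
    """Flush buffered spans every few seconds from a daemon thread instead of per-request."""
    stop = threading.Event()

    def loop():
        while not stop.wait(interval_s):
            try:
                client.flush()  # assumption: thread-safe in your SDK; verify before relying on this
            except Exception:
                pass  # don't let tracing failures take down the worker

    threading.Thread(target=loop, daemon=True, name="trulayer-flush").start()
    return stop  # call stop.set() and flush once more on shutdown
```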
Last successful span is stale
Even if overall counts look fine, a stale Last successful span timestamp for this project means nothing recent has landed. Likely causes:
- Project-scoped API key revoked — check Settings → API keys.
- App stopped running (container crashed, cron paused). Check your own deploy logs.
- SDK buffer stuck — if your app is running but not flushing, look for errors in the SDK’s own logs (see the logging sketch below).
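If the buffer looks stuck, turning up logging is the fastest way to see why flushes aren't happening. The logger name below is a guess — check your SDK's docs for the name it actually logs under:

```python
import logging

# Show warnings/errors from your own app and, at DEBUG, from the tracing SDK's internals.
logging.basicConfig(level=logging.INFO)
logging.getLogger("trulayer").setLevel(logging.DEBUG)  # assumption: the SDK logs under "trulayer"
```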