The Evals page is where you configure evaluators, run them against traces or datasets, and track regression over time.

Sub-sections

  • Results — every eval result your app has triggered, newest first.
  • Datasets — curated trace collections used as fixed test inputs for regression. Open a dataset to see its runs and trigger new ones.
  • Evaluators — built-in + custom evaluators you can trigger.

Results

One row per eval result. Columns: trace ID, evaluator, metric, score, label, latency, timestamp. Click any row for the reasoning (LLM evals) or the rule output.

Filters

A filter bar above the results table narrows the list. All filter state lives in the URL, so any filtered view is shareable — bookmark it or paste the URL into a PR comment.
  • Project — scope results to a single project in the current organization.
  • Metric — restrict to one evaluator’s metric (options come from the public eval catalog at GET /v1/eval-catalog).
  • Score range: min and max inputs accept a float between 0.0 and 1.0. Values outside that range are ignored, and if min > max the conflicting value is dropped.
  • Date range — last 1h / 24h / 7d / 30d presets that map to from / to timestamps.
  • Clear — resets every active filter and removes each param from the URL.
These map 1:1 to the query parameters on GET /v1/eval: project_id, metric, score_min, score_max, from, to.
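As a rough illustration of that mapping, the sketch below builds a GET /v1/eval URL from filter-style inputs. The host is borrowed from the curl example further down, and several details are assumptions rather than documented behavior: the helper name build_eval_query, the choice to drop score_max when min > max, and sending from / to as ISO 8601 UTC strings.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Preset labels from the filter bar mapped to a look-back window.
PRESETS = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}

# Assumption: GET /v1/eval lives on the same host as the curl example.
BASE_URL = "https://api.trulayer.ai/v1/eval"


def _valid_score(value):
    """Return the value if it lies in [0.0, 1.0]; otherwise None (ignored)."""
    if value is None:
        return None
    return value if 0.0 <= value <= 1.0 else None


def build_eval_query(project_id=None, metric=None, score_min=None,
                     score_max=None, preset=None, now=None):
    """Build a GET /v1/eval URL from filter-bar style inputs."""
    score_min = _valid_score(score_min)
    score_max = _valid_score(score_max)
    # Assumption: when min > max, this sketch drops score_max.
    if score_min is not None and score_max is not None and score_min > score_max:
        score_max = None

    params = {}
    if project_id:
        params["project_id"] = project_id
    if metric:
        params["metric"] = metric
    if score_min is not None:
        params["score_min"] = score_min
    if score_max is not None:
        params["score_max"] = score_max
    if preset is not None:
        now = now or datetime.now(timezone.utc)
        # Assumption: timestamps are sent as ISO 8601 strings in UTC.
        params["from"] = (now - PRESETS[preset]).isoformat()
        params["to"] = now.isoformat()

    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL
```

Calling build_eval_query(project_id="p1", metric="correctness", preset="7d") yields a URL carrying project_id, metric, from, and to. Omitted filters produce no parameter at all, which mirrors how Clear removes each param from the URL.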

Export

The Export button at the top-right of the results table downloads the current filtered list as CSV or JSONL. Clicking it opens a small menu with:
  • Include reasoning — when checked, the LLM-judge rationale column is included (truncated to 500 characters per row). Off by default because rationales can be long and multiline, which makes CSVs unwieldy.
  • Download as CSV: saves evals-YYYY-MM-DD.csv.
  • Download as JSONL: saves evals-YYYY-MM-DD.jsonl, one JSON object per line.
Row caps are per plan: 100 rows for Starter, 5,000 rows for Pro/Team. If an export hits the cap, a toast reports how many rows were returned and suggests either tightening the filters or upgrading the plan.
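A downloaded JSONL export can be read back with a few lines of Python. The sketch below is illustrative only: the helper names and the lowercase plan keys are hypothetical, and the field names in the sample rows are assumptions, since the export schema is not spelled out here.

```python
import json

# Row caps per plan, as stated above. The lowercase keys are an assumption.
ROW_CAPS = {"starter": 100, "pro": 5000, "team": 5000}


def parse_jsonl(text):
    """Parse a JSONL export: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def hit_row_cap(rows, plan):
    """True when the row count reaches the plan's cap, so rows may be missing."""
    return len(rows) >= ROW_CAPS[plan]


# Sample rows with assumed field names, for illustration only.
sample = (
    '{"trace_id": "t1", "metric": "correctness", "score": 0.9}\n'
    '{"trace_id": "t2", "metric": "correctness", "score": 0.4}\n'
)
rows = parse_jsonl(sample)
```

If hit_row_cap returns True for a real export, the file may not contain every matching result, so narrowing the filters and exporting again is the safer path.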

Datasets

A dataset is a named set of trace IDs. Create one by:
  • Selecting traces from the Traces page and choosing Add to dataset
  • Filtering in the Feedback page and pushing highly-rated or highly-disputed traces into a dataset
  • Uploading a JSONL file via the dashboard or POST /v1/datasets
Every dataset has a stable ID — reference it from CI to run regression on every PR.

Runs

Runs live inside each dataset’s detail page — open Datasets, pick a dataset, and the runs panel lists every batch executed against it. Click Run evaluators on a dataset to trigger an evaluator over every trace in it. The resulting run is a row in that panel showing:
  • Dataset + evaluator + metric
  • Completion status and progress
  • Aggregate score (mean, median, histogram)
  • Pass/fail ratio if the metric is categorical
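To make the aggregate fields concrete, here is a minimal sketch of how mean, median, a histogram, and a pass ratio could be computed from per-trace results. It assumes scores on a 0.0 to 1.0 scale and a "pass" label; the bin count and function names are hypothetical, not the dashboard's actual implementation.

```python
from statistics import mean, median


def summarize(scores, bins=5):
    """Mean, median, and a fixed-width histogram for scores in [0.0, 1.0]."""
    counts = [0] * bins
    for s in scores:
        # Clamp the index so a score of exactly 1.0 lands in the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
    return {"mean": mean(scores), "median": median(scores), "histogram": counts}


def pass_ratio(labels, passing="pass"):
    """Share of categorical labels equal to the passing label."""
    return sum(1 for label in labels if label == passing) / len(labels)
```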
Runs can be compared pairwise — pick two runs over the same dataset and the dashboard diffs them by trace, highlighting regressions.
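The pairwise comparison can be approximated offline, for example on two exported runs. The sketch below assumes each run has been reduced to a {trace_id: score} mapping where higher scores are better; diff_runs and its tolerance parameter are hypothetical names.

```python
def diff_runs(baseline, candidate, tolerance=0.0):
    """Compare two runs given as {trace_id: score}; return per-trace deltas."""
    rows = []
    # Only traces present in both runs can be compared.
    for trace_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[trace_id] - baseline[trace_id]
        rows.append({
            "trace_id": trace_id,
            "baseline": baseline[trace_id],
            "candidate": candidate[trace_id],
            "delta": delta,
            # A regression is a drop larger than the tolerance.
            "regression": delta < -tolerance,
        })
    return rows
```

Filtering the result on the regression flag gives the set of traces that got worse between the two runs.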

Export a run

The run detail page has an Export button in the header that downloads the full run — metadata, per-item scores, and any LLM-judge reasoning — as a single JSON file named eval-run-<id>.json. Useful for archiving, sharing with reviewers, or diffing across runs outside the dashboard.

Evaluators

Built-in evaluators (always available):
Evaluator       Type    Measures
correctness     llm     Does the output match the ground-truth answer?
hallucination   llm     Does the output contain claims not grounded in retrieved context?
relevance       llm     Does the output address what was asked?
toxicity        llm     Is the output safe and non-toxic?
json_schema     rule    Does the output match a provided JSON Schema?
latency_p95     rule    Is trace latency under a threshold?
has_citation    rule    Does the output include a citation pattern?
Custom evaluators can be created from the Evaluators tab — provide a rubric (LLM) or a Python function (rule).
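A rule evaluator written as a Python function might look like the sketch below. The signature and return shape (a dict with score, label, and detail) are assumptions for illustration, so check the Evaluators tab for the contract custom rules must actually follow. The citation pattern is likewise only an example.

```python
import re

# Assumption: citations look like [1], [2], ...; adjust to your own pattern.
CITATION = re.compile(r"\[\d+\]")


def has_numeric_citation(output: str) -> dict:
    """Rule evaluator sketch: scores 1.0 if the output cites at least one source."""
    found = CITATION.findall(output)
    return {
        "score": 1.0 if found else 0.0,
        "label": "pass" if found else "fail",
        "detail": f"{len(found)} citation(s) found",
    }
```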

Trigger programmatically

curl https://api.trulayer.ai/v1/eval \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"trace_id":"01j...","evaluator_type":"llm","metric_name":"correctness"}'
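The same request can be assembled from Python using only the standard library. This sketch mirrors the curl call above (same URL, headers, and body fields) and only builds the request object; build_eval_request is a hypothetical helper name, and the response format is not covered here.

```python
import json
import urllib.request


def build_eval_request(trace_id, evaluator_type, metric_name, api_key):
    """Build (but do not send) the POST /v1/eval request from the curl example."""
    body = json.dumps({
        "trace_id": trace_id,
        "evaluator_type": evaluator_type,
        "metric_name": metric_name,
    }).encode("utf-8")
    return urllib.request.Request(
        "https://api.trulayer.ai/v1/eval",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen with a timeout, reading the API key from the TRULAYER_API_KEY environment variable as the curl example does.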
Or configure evaluators to run automatically on every ingested trace matching a filter (see Evaluators → Triggers).

Trends

The Trends tab on a dataset or evaluator plots the aggregate score (mean / pass rate) over time, one line per run. Use it to spot regressions introduced by a prompt tweak, a model swap, or a framework upgrade. Click a point to jump to the underlying run.

Regression tests

Pin a dataset as a regression dataset under a project’s settings. When a deployment publishes a new model id or prompt version (via the deployment.created webhook or the /v1/deployments API), TruLayer automatically runs every pinned dataset against the new configuration and posts the diff back to the triggering PR.
  • New failures (regressions) block the deployment when enforce is on.
  • Score deltas greater than the configured threshold (default 5%) are flagged.
  • A full comparison run shows up under the deployment’s detail page in the Control dashboard.
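The gating rules above can be sketched as a small decision function. It assumes each run is reduced to {trace_id: (score, passed)}, and that the 5% threshold means an absolute change of 0.05 on a 0.0 to 1.0 score; neither is confirmed here, and evaluate_gate is a hypothetical name.

```python
def evaluate_gate(baseline, candidate, threshold=0.05, enforce=True):
    """Compare {trace_id: (score, passed)} maps from two runs of a pinned dataset."""
    new_failures, flagged = [], []
    for trace_id in sorted(baseline.keys() & candidate.keys()):
        old_score, old_passed = baseline[trace_id]
        new_score, new_passed = candidate[trace_id]
        # A new failure is a trace that passed before and fails now.
        if old_passed and not new_passed:
            new_failures.append(trace_id)
        # Assumption: the threshold is an absolute delta on a 0.0-1.0 score.
        if abs(new_score - old_score) > threshold:
            flagged.append(trace_id)
    return {
        "new_failures": new_failures,
        "flagged": flagged,
        # Only new failures block, and only when enforce is on.
        "blocked": enforce and bool(new_failures),
    }
```

With enforce off, the same inputs still report new failures and flagged deltas, but blocked stays False.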