Evals

The Evals page is where you configure evaluators, run them against traces or datasets, and track regression over time.

Sub-sections

Results — every eval result your app has triggered, newest first.
Datasets — curated trace collections used as fixed test inputs for regression. Open a dataset to see its runs and trigger new ones.
Evaluators — built-in + custom evaluators you can trigger.

Results

One row per eval result. Columns: trace ID, evaluator, metric, score, label, latency, timestamp. Click any row for the reasoning (LLM evals) or the rule output.

Filters

A filter bar above the results table narrows the list. All filter state lives in the URL, so any filtered view is shareable — bookmark it or paste the URL into a PR comment.

Project — scope results to a single project in the current organization.
Metric — restrict to one evaluator’s metric (options come from the public eval catalog at GET /v1/eval-catalog).
Score range — min and max inputs accept a float between 0.0 and 1.0. Values outside that range are ignored, and if min > max the conflicting value is dropped.
Date range — last 1h / 24h / 7d / 30d presets that map to from / to timestamps.
Clear — resets every active filter and removes each param from the URL.

These map 1:1 to the query parameters on GET /v1/eval: project_id, metric, score_min, score_max, from, to.
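
As a rough sketch of that mapping, the helper below assembles a GET /v1/eval URL from filter values. The host comes from the curl example on this page; the helper name build_results_url and the ISO 8601 timestamp format are assumptions, not documented behavior.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "https://api.trulayer.ai/v1/eval"  # host taken from the curl example on this page


def build_results_url(project_id=None, metric=None, score_min=None,
                      score_max=None, start=None, end=None):
    """Build a GET /v1/eval URL from optional filter values (hypothetical helper)."""
    params = {
        "project_id": project_id,
        "metric": metric,
        "score_min": score_min,
        "score_max": score_max,
        # Assumption: timestamps are sent as ISO 8601 strings.
        "from": start.isoformat() if start else None,
        "to": end.isoformat() if end else None,
    }
    # Unset filters are left out of the query string entirely.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL


# Example: correctness results scoring 0.0 to 0.5 over a 7-day window.
end = datetime(2025, 1, 8, tzinfo=timezone.utc)
url = build_results_url(metric="correctness", score_min=0.0, score_max=0.5,
                        start=end - timedelta(days=7), end=end)
print(url)
```

A score_min of 0.0 is kept because the helper filters on None rather than on truthiness.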

Export

The Export button at the top-right of the results table downloads the current filtered list as CSV or JSONL. Clicking it opens a small menu with:

Include reasoning — when checked, the LLM-judge rationale column is included (truncated to 500 characters per row). Off by default because rationales can be long and multiline, which makes CSVs unwieldy.
Download as CSV — evals-YYYY-MM-DD.csv.
Download as JSONL — evals-YYYY-MM-DD.jsonl, one JSON object per line.

Row caps are per plan: 100 for Starter, 5,000 for Pro/Team. If the export hits the cap, a toast reports how many rows were returned and suggests either tighter filters or a plan upgrade.
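
A JSONL export can be read back one line at a time. The sketch below is illustrative only: the field names (trace_id, metric, score) and the sample rows are assumptions, since the export schema is not listed here.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an opened evals-YYYY-MM-DD.jsonl file; rows and field names are made up.
sample = io.StringIO(
    '{"trace_id": "t1", "metric": "correctness", "score": 0.9}\n'
    '{"trace_id": "t2", "metric": "correctness", "score": 0.4}\n'
)
rows = list(read_jsonl(sample))

# Collect the traces that scored below 0.5.
low = [r["trace_id"] for r in rows if r["score"] < 0.5]
print(len(rows), low)
```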

Datasets

A dataset is a named set of trace IDs. Create one by:

Selecting traces from the Traces page and choosing Add to dataset
Filtering on the Feedback page and pushing highly rated or highly disputed traces into a dataset
Uploading a JSONL file via the dashboard or POST /v1/datasets

Every dataset has a stable ID — reference it from CI to run regression on every PR.
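
For the JSONL upload path, a file can be generated from a list of trace IDs. This is a minimal sketch: the per-line shape (an object with a trace_id key) is an assumption, and the IDs are placeholders. Check the POST /v1/datasets reference for the actual schema.

```python
import json


def to_jsonl(trace_ids):
    """Serialize trace IDs as JSONL, one object per line."""
    # Assumption: each line is an object with a "trace_id" key.
    return "".join(json.dumps({"trace_id": t}) + "\n" for t in trace_ids)


body = to_jsonl(["trace-a", "trace-b"])  # placeholder IDs
print(body, end="")
```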

Runs

Runs live inside each dataset’s detail page — open Datasets, pick a dataset, and the runs panel lists every batch executed against it. Click Run evaluators on a dataset to trigger an evaluator over every trace in it. The resulting run is a row in that panel showing:

Dataset + evaluator + metric
Completion status and progress
Aggregate score (mean, median, histogram)
Pass/fail ratio if the metric is categorical
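
To make the aggregate columns concrete, the sketch below computes the same kinds of summaries locally. The five equal-width bins and the "pass" label are assumptions; the dashboard may bin and label differently.

```python
from collections import Counter
from statistics import mean, median


def summarize(scores, bins=5):
    """Return mean, median, and an equal-width histogram for scores in [0, 1]."""
    # A score of exactly 1.0 is clamped into the last bin.
    hist = Counter(min(int(s * bins), bins - 1) for s in scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "histogram": [hist.get(i, 0) for i in range(bins)],
    }


def pass_ratio(labels, passing="pass"):
    """Fraction of categorical labels equal to the passing label."""
    return sum(1 for label in labels if label == passing) / len(labels)


summary = summarize([0.1, 0.5, 0.5, 0.9, 1.0])
ratio = pass_ratio(["pass", "fail", "pass", "pass"])
print(summary, ratio)
```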

Runs can be compared pairwise — pick two runs over the same dataset and the dashboard diffs them by trace, highlighting regressions.
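
The same kind of per-trace diff can be approximated outside the dashboard. The sketch below assumes each run is reduced to a mapping of trace ID to score and that a higher score is better; both the data shape and the function name diff_runs are hypothetical.

```python
def diff_runs(baseline, candidate, tolerance=0.0):
    """Compare two {trace_id: score} mappings over the same dataset."""
    rows = []
    # Only traces present in both runs can be compared.
    for trace_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[trace_id] - baseline[trace_id]
        rows.append({
            "trace_id": trace_id,
            "delta": delta,
            # Assumption: a regression is any drop larger than the tolerance.
            "regressed": delta < -tolerance,
        })
    return rows


baseline = {"t1": 0.9, "t2": 0.7, "t3": 0.5}
candidate = {"t1": 0.9, "t2": 0.4, "t3": 0.8}
regressions = [r["trace_id"] for r in diff_runs(baseline, candidate) if r["regressed"]]
print(regressions)
```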

Export a run

The run detail page has an Export button in the header that downloads the full run — metadata, per-item scores, and any LLM-judge reasoning — as a single JSON file named eval-run-<id>.json. Useful for archiving, sharing with reviewers, or diffing across runs outside the dashboard.

Evaluators

Built-in evaluators (always available):

| Evaluator | Type | Measures |
| --- | --- | --- |
| `correctness` | `llm` | Does the output match the ground-truth answer? |
| `hallucination` | `llm` | Does the output contain claims not grounded in retrieved context? |
| `relevance` | `llm` | Does the output address what was asked? |
| `toxicity` | `llm` | Is the output safe and non-toxic? |
| `json_schema` | `rule` | Does the output match a provided JSON Schema? |
| `latency_p95` | `rule` | Is trace latency under a threshold? |
| `has_citation` | `rule` | Does the output include a citation pattern? |

Custom evaluators can be created from the Evaluators tab — provide a rubric (LLM) or a Python function (rule).
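
A rule evaluator written as a Python function might look roughly like the sketch below. The function signature, the returned keys (score, label, detail), and the bracketed-number citation pattern are all assumptions made for illustration; the Evaluators tab defines the actual contract.

```python
import re

# Assumption: a "citation" is a bracketed number such as [1] or [12].
CITATION = re.compile(r"\[\d+\]")


def has_bracket_citation(output: str) -> dict:
    """Hypothetical rule evaluator: pass if the output cites at least one source."""
    found = CITATION.findall(output)
    return {
        "score": 1.0 if found else 0.0,
        "label": "pass" if found else "fail",
        "detail": f"{len(found)} citation(s) found",
    }


result = has_bracket_citation("Paris is the capital of France [1].")
print(result["label"])
```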

Trigger programmatically

curl https://api.trulayer.ai/v1/eval \
  -H "Authorization: Bearer $TRULAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"trace_id":"01j...","evaluator_type":"llm","metric_name":"correctness"}'
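
The same request can be assembled with the Python standard library. This sketch only builds the request and does not send it; the helper name build_eval_request is hypothetical, and the response shape is not covered here.

```python
import json
import os
import urllib.request


def build_eval_request(trace_id, evaluator_type, metric_name, api_key):
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "trace_id": trace_id,
        "evaluator_type": evaluator_type,
        "metric_name": metric_name,
    }
    return urllib.request.Request(
        "https://api.trulayer.ai/v1/eval",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_eval_request(
    trace_id="01j...",  # elided in the curl example; substitute a real trace ID
    evaluator_type="llm",
    metric_name="correctness",
    api_key=os.environ.get("TRULAYER_API_KEY", ""),
)
# To send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```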

Or configure evaluators to run automatically on every ingested trace matching a filter — see Evaluators → Triggers.

Trends

The Trends tab on a dataset or evaluator plots the aggregate score (mean / pass rate) over time, one point per run. Use it to spot regressions introduced by a prompt tweak, a model swap, or a framework upgrade. Click a point to jump to the underlying run.

Regression tests

Pin a dataset as a regression dataset under a project’s settings. When a deployment publishes a new model ID or prompt version (via the deployment.created webhook or the /v1/deployments API), TruLayer automatically runs every pinned dataset against the new configuration and posts the diff back to the triggering PR.

New failures (regressions) block the deployment when enforce is on.
Score deltas greater than the configured threshold (default 5%) are flagged.
A full comparison run shows up under the deployment’s detail page in the Control dashboard.
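
The flagging and blocking rules above can be sketched as a small check. Several things here are assumptions: the 5% threshold is treated as an absolute difference of 0.05 on a 0 to 1 scale, a "new failure" is treated as a flagged metric whose score dropped, and the function name check_regression is hypothetical.

```python
def check_regression(baseline, candidate, threshold=0.05, enforce=False):
    """Flag metrics whose score moved by more than the threshold; decide whether to block."""
    flagged = {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline.keys() & candidate.keys()
        if abs(candidate[metric] - baseline[metric]) > threshold
    }
    # Assumption: only flagged drops count as new failures.
    new_failures = sorted(m for m, delta in flagged.items() if delta < 0)
    return {"flagged": flagged, "blocked": enforce and bool(new_failures)}


# Made-up aggregate scores for a baseline and a candidate deployment.
baseline = {"correctness": 0.82, "relevance": 0.90}
candidate = {"correctness": 0.74, "relevance": 0.91}
result = check_regression(baseline, candidate, enforce=True)
print(result)
```

With enforce left off, the same drop is still flagged but does not block.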
