Test Runs
View execution results when you run a test set against an endpoint. Each test run contains all individual test results with detailed metrics, conversation history, and performance data.
What are Test Runs?
A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.

Clicking on a test run opens an overview of its results. From here you can:
- Review: Manually validate or override the automated test evaluations
- Compare: Compare against a baseline test run
- Re-run: Execute the test set again with the same configuration
- Download: Export test run data as CSV
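
The Download action exports the run as a CSV file that you can analyze offline. Below is a minimal sketch of one way to inspect such an export with pandas; the file name and the column names ("status", "metric") are assumptions for illustration, not the actual export schema.

```python
import pandas as pd

# Hypothetical analysis of a downloaded test run export.
# The file name and the "status"/"metric" columns are assumptions,
# not the documented export schema.
df = pd.read_csv("test_run_export.csv")

# Overall pass rate across all test results in the run
pass_rate = (df["status"] == "Pass").mean()
print(f"Pass rate: {pass_rate:.1%}")

# Failures grouped by metric, to see which criteria need attention
print(df[df["status"] == "Fail"].groupby("metric").size())
```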

Review Test Runs
Reviews allow human evaluators to validate or override automated test evaluations.
- Select a Test Run from the Test Runs overview page
- Click on the Reviews tab, then Add Your Review
- Select a Review Status:
  - Pass: The test passes the metric
  - Fail: The test fails the metric
- Add a comment explaining your review decision
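
Conceptually, a review is a small record attached to a test result: a status, a comment, and who made the call. The sketch below models that shape; the field names are illustrative assumptions, not the actual Rhesis data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative shape of a human review record.
# Field names are assumptions, not the actual Rhesis data model.
@dataclass
class Review:
    test_result_id: str
    status: str          # "Pass" or "Fail"
    comment: str
    reviewer: str
    created_at: datetime

review = Review(
    test_result_id="example-test-result-id",
    status="Fail",
    comment="Response ignored the refund policy stated in the prompt.",
    reviewer="reviewer@example.com",
    created_at=datetime.now(timezone.utc),
)
```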
Compare Test Runs
Compare test runs to identify regressions and improvements between executions.
How to Compare:
- Click the Compare button (top right, next to Download)
- Select a baseline test run to compare against
- View the test-by-test comparison of the test set
Comparison Filters:
Use filters to focus on specific changes:
- All Tests: Show all tests from both runs
- Improved: Tests that now pass but failed in the baseline
- Regressed: Tests that now fail but passed in the baseline
- Unchanged: Tests with the same pass/fail status
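
In other words, the filters partition tests by their pass/fail status in the baseline run versus the current run. The sketch below shows that classification on toy data; the data shapes are assumptions for demonstration, not the actual comparison implementation.

```python
# Toy pass/fail results keyed by test ID (illustrative data only).
baseline = {"test-1": "Pass", "test-2": "Fail", "test-3": "Pass"}
current = {"test-1": "Pass", "test-2": "Pass", "test-3": "Fail"}

def classify(test_id: str) -> str:
    before, after = baseline[test_id], current[test_id]
    if before == "Fail" and after == "Pass":
        return "Improved"
    if before == "Pass" and after == "Fail":
        return "Regressed"
    return "Unchanged"

for test_id in sorted(baseline.keys() & current.keys()):
    print(test_id, classify(test_id))
# test-1 Unchanged, test-2 Improved, test-3 Regressed
```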
Test Run Metrics
When a test run executes, Rhesis evaluates each test against a set of metrics. The metrics used follow a priority hierarchy that allows flexibility at different levels.
Metrics Resolution Hierarchy
Rhesis resolves which metrics to use based on the following priority order:
| Priority | Source | Description |
|---|---|---|
| 1 (Highest) | Execution-time metrics | Metrics specified when triggering the test run. These completely override all other metric configurations. |
| 2 | Test set metrics | Metrics configured on the test set entity. Override behavior-level defaults. |
| 3 (Lowest) | Behavior metrics | Default metrics defined at the behavior level for each test. |
Execution-time metrics completely override other levels. There is no merging of metrics between levels.
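
The rule amounts to picking the highest-priority level that defines any metrics and ignoring the rest. A minimal sketch of that resolution logic, with illustrative names, is shown below.

```python
# Sketch of the resolution rule: the highest-priority level that defines
# any metrics wins outright; levels are never merged.
def resolve_metrics(execution_metrics, test_set_metrics, behavior_metrics):
    if execution_metrics:      # 1 (highest): specified when triggering the run
        return execution_metrics
    if test_set_metrics:       # 2: configured on the test set entity
        return test_set_metrics
    return behavior_metrics    # 3 (lowest): behavior-level defaults

# Test set metrics override behavior metrics entirely; nothing is merged.
print(resolve_metrics([], ["Answer Relevancy"], ["Faithfulness", "Toxicity"]))
# ['Answer Relevancy']
```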
Specifying Execution-time Metrics
When executing a test set from the UI:
- Open the Execute Test Set drawer from a test set detail page
- In the Test Run Metrics section, select “Define metrics for this execution”
- Click Add Metric to select which metrics to use
- Only metrics applicable to your test set type (Single-Turn or Multi-Turn) will be shown
Execution-time metrics only apply to that specific test run. They are not saved to the test set configuration.
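
Conceptually, triggering a run with execution-time metrics means attaching a metric selection to that single execution. The sketch below illustrates the idea as an HTTP request; the URL, endpoint path, and payload fields are assumptions for illustration, not the documented Rhesis API, so consult the API reference for the actual contract.

```python
import requests

# Hypothetical request for triggering a test run with execution-time metrics.
# The URL and payload fields are assumptions, not the documented API.
payload = {
    "test_set_id": "example-test-set-id",
    "endpoint_id": "example-endpoint-id",
    # Metrics listed here apply to this run only and override
    # test set and behavior-level metric configuration.
    "metrics": ["Answer Relevancy", "Faithfulness"],
}

response = requests.post(
    "https://api.example.com/test-runs",
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```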
When to Use Each Level
- Behavior metrics: Set default evaluation criteria for all tests with a specific behavior. This is the most common configuration.
- Test set metrics: Override defaults for specialized test sets, such as Garak security tests that use detector metrics.
- Execution-time metrics: Quick validation with specific metrics without modifying test set configuration. Useful for one-off experiments or A/B testing different evaluation criteria.
Next Steps
- Review failed tests to understand issues
- Compare against baseline runs to detect regressions
- Add human reviews to test results
- Export results for reporting or analysis