Test Runs
View execution results when you run a test set against an endpoint. Each test run contains all individual test results with detailed metrics, conversation history, and performance data.
What are Test Runs?
A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.

Clicking on a test run opens an overview of its results. From here you can:
- Review: Manually validate or override the automated test evaluations
- Compare: Compare against a baseline test run
- Re-run: Execute the test set again with the same configuration
- Download: Export test run data as CSV
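
The Download action exports the run as a CSV file that you can analyze offline. Below is a minimal sketch of one way to inspect such an export with pandas; the file name and the column names ("status", "metric") are assumptions for illustration, not the actual export schema.

```python
import pandas as pd

# Hypothetical analysis of a downloaded test run export.
# The file name and the "status"/"metric" columns are assumptions,
# not the documented export schema.
df = pd.read_csv("test_run_export.csv")

# Overall pass rate across all test results in the run
pass_rate = (df["status"] == "Pass").mean()
print(f"Pass rate: {pass_rate:.1%}")

# Failures grouped by metric, to see which criteria need attention
print(df[df["status"] == "Fail"].groupby("metric").size())
```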

Review Test Runs
Reviews allow human evaluators to validate or override automated test evaluations.
- Select a Test Run from the Test Runs overview page
- Click on the Reviews tab, then Add Your Review
- Select a Review Status:
  - Pass: The test passes the metric
  - Fail: The test fails the metric
- Add a comment explaining your review decision
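
Conceptually, a review is a small record attached to a test result: a status, a comment, and who made the call. The sketch below models that shape; the field names are illustrative assumptions, not the actual Rhesis data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative shape of a human review record.
# Field names are assumptions, not the actual Rhesis data model.
@dataclass
class Review:
    test_result_id: str
    status: str          # "Pass" or "Fail"
    comment: str
    reviewer: str
    created_at: datetime

review = Review(
    test_result_id="example-test-result-id",
    status="Fail",
    comment="Response ignored the refund policy stated in the prompt.",
    reviewer="reviewer@example.com",
    created_at=datetime.now(timezone.utc),
)
```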
Compare Test Runs
Compare test runs to identify regressions and improvements between executions.
How to Compare:
- Click the Compare button (top right, next to Download)
- Select a baseline test run to compare against
- View the test-by-test comparison of the test set
Comparison Filters:
Use filters to focus on specific changes:
- All Tests: Show all tests from both runs
- Improved: Tests that now pass but failed in the baseline
- Regressed: Tests that now fail but passed in the baseline
- Unchanged: Tests with the same pass/fail status
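
In other words, the filters partition tests by their pass/fail status in the baseline run versus the current run. The sketch below shows that classification on toy data; the data shapes are assumptions for demonstration, not the actual comparison implementation.

```python
# Toy pass/fail results keyed by test ID (illustrative data only).
baseline = {"test-1": "Pass", "test-2": "Fail", "test-3": "Pass"}
current = {"test-1": "Pass", "test-2": "Pass", "test-3": "Fail"}

def classify(test_id: str) -> str:
    before, after = baseline[test_id], current[test_id]
    if before == "Fail" and after == "Pass":
        return "Improved"
    if before == "Pass" and after == "Fail":
        return "Regressed"
    return "Unchanged"

for test_id in sorted(baseline.keys() & current.keys()):
    print(test_id, classify(test_id))
# test-1 Unchanged, test-2 Improved, test-3 Regressed
```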
Test Run Metrics
When a test run executes, Rhesis evaluates each test against a set of metrics. The metrics used follow a priority hierarchy that allows flexibility at different levels.
Metrics Resolution Hierarchy
Rhesis resolves which metrics to use based on the following priority order:
| Priority | Source | Description |
|---|---|---|
| 1 (Highest) | Execution-time metrics | Metrics specified when triggering the test run. These completely override all other metric configurations. |
| 2 | Test set metrics | Metrics configured on the test set entity. Override behavior-level defaults. |
| 3 (Lowest) | Behavior metrics | Default metrics defined at the behavior level for each test. |
Execution-time metrics completely override other levels. There is no merging of metrics between levels.
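
The rule amounts to picking the highest-priority level that defines any metrics and ignoring the rest. A minimal sketch of that resolution logic, with illustrative names, is shown below.

```python
# Sketch of the resolution rule: the highest-priority level that defines
# any metrics wins outright; levels are never merged.
def resolve_metrics(execution_metrics, test_set_metrics, behavior_metrics):
    if execution_metrics:      # 1 (highest): specified when triggering the run
        return execution_metrics
    if test_set_metrics:       # 2: configured on the test set entity
        return test_set_metrics
    return behavior_metrics    # 3 (lowest): behavior-level defaults

# Test set metrics override behavior metrics entirely; nothing is merged.
print(resolve_metrics([], ["Answer Relevancy"], ["Faithfulness", "Toxicity"]))
# ['Answer Relevancy']
```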
Specifying Execution-time Metrics
When executing a test set from the UI:
- Open the Execute Test Set drawer from a test set detail page
- In the Test Run Metrics section, select “Define metrics for this execution”
- Click Add Metric to select which metrics to use
- Only metrics applicable to your test set type (Single-Turn or Multi-Turn) will be shown
Execution-time metrics only apply to that specific test run. They are not saved to the test set configuration.
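
Conceptually, triggering a run with execution-time metrics means attaching a metric selection to that single execution. The sketch below illustrates the idea as an HTTP request; the URL, endpoint path, and payload fields are assumptions for illustration, not the documented Rhesis API, so consult the API reference for the actual contract.

```python
import requests

# Hypothetical request for triggering a test run with execution-time metrics.
# The URL and payload fields are assumptions, not the documented API.
payload = {
    "test_set_id": "example-test-set-id",
    "endpoint_id": "example-endpoint-id",
    # Metrics listed here apply to this run only and override
    # test set and behavior-level metric configuration.
    "metrics": ["Answer Relevancy", "Faithfulness"],
}

response = requests.post(
    "https://api.example.com/test-runs",
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```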
When to Use Each Level
- Behavior metrics: Set default evaluation criteria for all tests with a specific behavior. This is the most common configuration.
- Test set metrics: Override defaults for specialized test sets, such as Garak security tests that use detector metrics.
- Execution-time metrics: Quick validation with specific metrics without modifying test set configuration. Useful for one-off experiments or A/B testing different evaluation criteria.
Next Steps
- Review failed tests to understand issues
- Compare against baseline runs to detect regressions
- Add human reviews to test results
- Export results for reporting or analysis