
Test Runs

View execution results when you run a test set against an endpoint. Each test run contains all individual test results with detailed metrics, conversation history, and performance data.

What are Test Runs? A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.
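
As a mental model, a test run can be pictured as a container of per-test results plus run-level metadata. The sketch below is a minimal illustration of that shape, assuming hypothetical class and field names; it is not the platform's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative only: class and field names are assumptions, not the actual Rhesis schema.
@dataclass
class TestResult:
    test_id: str
    passed: bool
    metric_scores: dict[str, float]      # e.g. {"Answer Relevance": 0.92}
    conversation: list[dict[str, str]]   # message history captured for the test

@dataclass
class TestRun:
    test_set_id: str
    endpoint: str
    started_at: datetime
    results: list[TestResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        """Fraction of tests in this run that passed."""
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)
```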

Test Runs Page

Clicking on a test run opens an overview of how it went. From here you can:

  • Review: Manually validate or override the automated test evaluation
  • Compare: Compare against a baseline test run
  • Re-run: Execute the test set again with the same configuration
  • Download: Export test run data as CSV (a loading sketch follows this list)
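
For reporting or further analysis, the exported CSV can be loaded with pandas. This is a minimal sketch; the column names (`status`, `behavior`) are assumptions, so check the header of the actual export before reusing it.

```python
import pandas as pd

# Column names are assumptions; inspect the actual CSV header for the real ones.
results = pd.read_csv("test_run_export.csv")

# Overall pass rate and a breakdown of failures by behavior (hypothetical columns).
pass_rate = (results["status"] == "Pass").mean()
failures_by_behavior = (
    results[results["status"] == "Fail"]
    .groupby("behavior")
    .size()
    .sort_values(ascending=False)
)
print(f"Pass rate: {pass_rate:.1%}")
print(failures_by_behavior)
```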

Test Run Detail Page

Review Test Runs

Reviews allow human evaluators to validate or override automated test evaluations; a sketch of this override logic follows the steps below.

  1. Select a Test Run from the Test Runs overview page
  2. Click on the Reviews tab, then Add Your Review
  3. Select a Review Status:
    • Pass: The test passes the metric
    • Fail: The test fails the metric
  4. Add a comment explaining your review decision
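
Conceptually, a human review takes precedence over the automated evaluation when both exist. The function below is a sketch of that override rule using assumed names; it is not the platform's API.

```python
from typing import Optional

def effective_status(automated_status: str, review_status: Optional[str]) -> str:
    """Return the status that counts for reporting.

    A human review ("Pass" or "Fail"), when present, overrides the automated
    evaluation. Names and semantics here are illustrative assumptions, not
    the Rhesis implementation.
    """
    return review_status if review_status is not None else automated_status

# The automated evaluation said Fail, but a reviewer marked the test as Pass.
print(effective_status("Fail", "Pass"))  # -> "Pass"
print(effective_status("Fail", None))    # -> "Fail"
```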

Compare Test Runs

Compare test runs to identify regressions and improvements between executions.

How to Compare:

  1. Click the Compare button (top right, next to Download)
  2. Select a baseline test run to compare against
  3. View the test-by-test comparison of the test set

Comparison Filters:

Use filters to focus on specific changes (a sketch showing how these categories can be derived follows this list):

  • All Tests: Show all tests from both runs
  • Improved: Tests that now pass but failed in baseline
  • Regressed: Tests that now fail but passed in baseline
  • Unchanged: Tests with the same pass/fail status
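
These categories amount to simple set logic over each test's pass/fail status in the current run versus the baseline. The sketch below illustrates one way to derive them; it is not the platform's internal implementation.

```python
def categorize(current: dict[str, bool], baseline: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket tests into improved / regressed / unchanged.

    `current` and `baseline` map test IDs to pass (True) / fail (False).
    Tests present in only one run are ignored here for simplicity.
    Illustrative only; the platform's own comparison may differ.
    """
    buckets = {"improved": [], "regressed": [], "unchanged": []}
    for test_id in current.keys() & baseline.keys():
        now, before = current[test_id], baseline[test_id]
        if now and not before:
            buckets["improved"].append(test_id)
        elif before and not now:
            buckets["regressed"].append(test_id)
        else:
            buckets["unchanged"].append(test_id)
    return buckets

# Example: test-2 regressed, test-3 improved, test-1 is unchanged.
print(categorize(
    current={"test-1": True, "test-2": False, "test-3": True},
    baseline={"test-1": True, "test-2": True, "test-3": False},
))
```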

Test Run Metrics

When a test run executes, Rhesis evaluates each test against a set of metrics. Which metrics are applied is resolved through a priority hierarchy, allowing configuration at several levels.

Metrics Resolution Hierarchy

Rhesis resolves which metrics to use based on the following priority order:

  • Priority 1 (Highest): Execution-time metrics. Metrics specified when triggering the test run; these completely override all other metric configurations.
  • Priority 2: Test set metrics. Metrics configured on the test set entity; these override behavior-level defaults.
  • Priority 3 (Lowest): Behavior metrics. Default metrics defined at the behavior level for each test.

Execution-time metrics completely override other levels. There is no merging of metrics between levels.
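
In code, this resolution rule amounts to taking the first non-empty level in priority order, with no merging. The sketch below mirrors that rule; the function, parameter, and metric names are assumptions for illustration.

```python
from typing import Optional

def resolve_metrics(
    execution_metrics: Optional[list[str]],
    test_set_metrics: Optional[list[str]],
    behavior_metrics: Optional[list[str]],
) -> list[str]:
    """Pick metrics by priority: execution-time > test set > behavior.

    The first non-empty level wins outright; levels are never merged.
    Illustrative sketch only, not the Rhesis implementation.
    """
    for level in (execution_metrics, test_set_metrics, behavior_metrics):
        if level:
            return level
    return []

# Execution-time metrics override everything else.
print(resolve_metrics(["Answer Relevance"], ["Garak Detector"], ["Faithfulness"]))
# -> ['Answer Relevance']

# With no execution-time or test set metrics, behavior defaults apply.
print(resolve_metrics(None, None, ["Faithfulness"]))
# -> ['Faithfulness']
```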

Specifying Execution-time Metrics

When executing a test set from the UI:

  1. Open the Execute Test Set drawer from a test set detail page
  2. In the Test Run Metrics section, select “Define metrics for this execution”
  3. Click Add Metric to select which metrics to use
  4. Only metrics applicable to your test set type (Single-Turn or Multi-Turn) will be shown

Execution-time metrics only apply to that specific test run. They are not saved to the test set configuration.
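
If a run were triggered programmatically rather than through the drawer, execution-time metrics would travel with the trigger request itself rather than being stored on the test set. The payload below is purely hypothetical (this page only documents the UI flow), and every field name is an assumption.

```python
import json

# Hypothetical trigger payload; field names are assumptions, not a documented Rhesis API.
# The point is that execution-time metrics accompany the request and are not persisted
# on the test set configuration.
payload = {
    "test_set_id": "ts_123",
    "endpoint_id": "ep_456",
    "metrics": ["Answer Relevance", "Faithfulness"],  # execution-time metrics for this run only
}
print(json.dumps(payload, indent=2))
```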

When to Use Each Level

  • Behavior metrics: Set default evaluation criteria for all tests with a specific behavior. This is the most common configuration.
  • Test set metrics: Override defaults for specialized test sets, such as Garak security tests that use detector metrics.
  • Execution-time metrics: Quick validation with specific metrics without modifying test set configuration. Useful for one-off experiments or A/B testing different evaluation criteria.

Next Steps

  • Review failed tests to understand issues
  • Compare against baseline runs to detect regressions
  • Add human reviews to test results
  • Export results for reporting or analysis