Running Evaluations

Once you’ve created metrics, it’s time to run evaluations. This guide covers all the ways to execute evaluations in TurnWise.

Evaluation Options

You can run evaluations at different granularities:

  • Single Cell: Evaluate one metric for one conversation/message/step
  • Row: Evaluate all metrics for one conversation/message/step
  • Column: Evaluate one metric for all conversations/messages/steps
  • All: Evaluate all metrics for all entities
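
If you think of these in terms of the API (documented under Running via API below), each granularity corresponds to how tightly you scope a run request. A minimal sketch; treating a null pipeline_node_id as "all metrics" is an assumption here, by analogy with the documented null entity_id:

// Sketch only: one possible mapping from granularity to request shape.
// A null entity_id meaning "all entities" is documented below; a null
// pipeline_node_id meaning "all metrics" is an assumption for illustration.
const singleCell = { dataset_id: 1, pipeline_node_id: 5,    entity_type: 'conversation', entity_id: 123  };
const row        = { dataset_id: 1, pipeline_node_id: null, entity_type: 'conversation', entity_id: 123  }; // all metrics, one entity (assumed)
const column     = { dataset_id: 1, pipeline_node_id: 5,    entity_type: 'conversation', entity_id: null }; // one metric, all entities
const runAll     = { dataset_id: 1, pipeline_node_id: null, entity_type: 'conversation', entity_id: null }; // everything (assumed)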

Running via UI

Single Cell Evaluation

Evaluate one metric for one entity:
  1. Navigate to Dataset
    • Open your dataset
    • Find the conversation/message/step row
  2. Click Cell
    • Click the cell you want to evaluate
    • Or right-click and select “Run Evaluation”
  3. Wait for Results
    • Evaluation starts immediately
    • Progress indicator shows status
    • Results stream in real-time
  4. View Results
    • Results appear in the cell
    • Click cell to view details
    • See full evaluation output

Row Evaluation

Evaluate all metrics for one entity:
  1. Select Row
    • Click the row header (conversation/message/step)
  2. Click “Run All”
    • Button appears in row header
    • Or right-click row → “Run All Metrics”
  3. Monitor Progress
    • Each metric cell shows progress
    • Results appear as they complete
  4. Review Results
    • All metrics evaluated for this entity
    • Compare results across metrics

Column Evaluation

Evaluate one metric for all entities:
  1. Select Column
    • Click the column header (metric name)
  2. Click “Run All”
    • Button appears in column header
    • Or right-click column → “Run All”
  3. Monitor Progress
    • Progress bars show for each row
    • Results stream in real-time
  4. Review Results
    • All entities evaluated with this metric
    • Compare results across conversations

Run All Evaluations

Evaluate everything:
  1. Click “Run All”
    • Button in dataset header
    • Or use the keyboard shortcut
  2. Confirm
    • Dialog shows number of evaluations
    • Click “Run” to confirm
  3. Monitor Progress
    • Progress bar shows overall progress
    • Individual cells update as they complete
  4. Review Results
    • All evaluations complete
    • Export or analyze results

Understanding Execution Modes

TurnWise supports two execution modes:

Sync Mode (Default)

  • Evaluations run sequentially
  • One at a time
  • Slower but more predictable
  • Better for debugging

Async Mode

  • Evaluations run concurrently
  • Multiple at once
  • Faster execution
  • Better for batch processing

Tip: Use async mode for large batches; it is significantly faster when evaluating many conversations.
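
How a mode is selected isn't shown above; as a hedged sketch, assuming a hypothetical execution_mode field on the run request (verify against your TurnWise version):

// Sketch: requesting async execution for a batch run.
// "execution_mode" is a hypothetical field, not confirmed by the docs above.
await fetch('/evaluation-pipeline-executions/run', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    dataset_id: 1,
    pipeline_node_id: 5,
    entity_type: 'conversation',
    entity_id: null,          // all conversations
    execution_mode: 'async',  // hypothetical; sync is the documented default
  }),
});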

Streaming Results

TurnWise streams evaluation results in real-time:

Progress Indicators

During evaluation, you’ll see:
  • Pending: Not started yet
  • Processing: Currently evaluating (progress bar)
  • Complete: Evaluation finished (result shown)
  • Error: Evaluation failed (error message)
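
If you consume these states programmatically (for example from the SSE stream shown under Running via API), a minimal dispatcher might look like the sketch below; the payload fields (status, result, error) and the render* helpers are assumptions to verify against your own events:

// Sketch: react to the four evaluation states from a streamed event.
// Payload shape and render* helpers are placeholders, not a documented API.
function handleEvaluationEvent(data) {
  switch (data.status) {
    case 'pending':    break;                            // not started; nothing to render
    case 'processing': renderProgress(data);      break; // update the progress bar
    case 'complete':   renderResult(data.result); break;
    case 'error':      renderError(data.error);   break;
  }
}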

Result Display

Results are displayed based on output type:

Text Results

"Response is helpful and addresses the user's question directly."

Number Results

8.5

Checkbox Results

✓ Yes
✗ No

Progress Results

[████████░░] 85%

JSON Results

{
  "score": 0.85,
  "reasoning": "Response is helpful",
  "accuracy": 0.9
}

Click the cell to view the full JSON.

Running via API

Single Evaluation

POST /evaluation-pipeline-executions/run
{
  "dataset_id": 1,
  "pipeline_node_id": 5,
  "entity_type": "conversation",
  "entity_id": 123
}
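
The same request from JavaScript, as a minimal sketch (adjust the base URL and add whatever auth headers your deployment requires):

// Sketch: run one metric (pipeline node 5) against one conversation (123).
const response = await fetch('/evaluation-pipeline-executions/run', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    dataset_id: 1,
    pipeline_node_id: 5,
    entity_type: 'conversation',
    entity_id: 123,
  }),
});
const execution = await response.json(); // response shape is not documented above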

Batch Evaluation

POST /evaluation-pipeline-executions/run
{
  "dataset_id": 1,
  "pipeline_node_id": 5,
  "entity_type": "conversation",
  "entity_id": null  # null = all conversations
}

Streaming Results

Use Server-Sent Events (SSE) for streaming. Note that EventSource only makes GET requests, so parameters go in the query string:

// Open a stream of evaluation progress events.
const eventSource = new EventSource(
  '/evaluation-pipeline-executions/run?dataset_id=1&pipeline_node_id=5'
);

// Each message is one progress update.
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Progress:', data);
};

// Close the stream once the server ends it, or EventSource will auto-reconnect.
eventSource.onerror = () => eventSource.close();

Evaluation Status

Pending

Evaluation hasn’t started yet.

Processing

Evaluation is running:
  • Progress indicator shows
  • Estimated time remaining
  • Current step

Complete

Evaluation finished successfully:
  • Result displayed
  • Can be re-run
  • Can be exported

Error

Evaluation failed:
  • Error message shown
  • Can retry
  • Check logs for details

Retrying Failed Evaluations

Via UI

  1. Click Failed Cell
  2. Click “Retry”
  3. Wait for Completion

Via API

POST /evaluation-pipeline-executions/run
{
  "dataset_id": 1,
  "pipeline_node_id": 5,
  "entity_type": "conversation",
  "entity_id": 123,
  "retry": true
}
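
The same retry from JavaScript, sketched as a small helper (error handling kept minimal):

// Sketch: re-run a failed evaluation for one conversation.
async function retryEvaluation(entityId) {
  const res = await fetch('/evaluation-pipeline-executions/run', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      dataset_id: 1,
      pipeline_node_id: 5,
      entity_type: 'conversation',
      entity_id: entityId,
      retry: true,
    }),
  });
  if (!res.ok) throw new Error(`Retry failed with status ${res.status}`);
  return res.json();
}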

Canceling Evaluations

Via UI

  1. Click “Cancel” button
  2. Confirm Cancellation
  3. Partial results may be saved

Via API

DELETE /evaluation-pipeline-executions/{execution_id}
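
For example, assuming execution_id was returned when the run was started:

// Sketch: cancel a running execution by id.
const executionId = 42; // hypothetical id from an earlier run response
await fetch(`/evaluation-pipeline-executions/${executionId}`, { method: 'DELETE' });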

Performance Tips

Use Async Mode

Enable async mode for faster batch evaluations

Run in Batches

Split large datasets into batches; see the sketch after these tips

Monitor Progress

Watch for errors and adjust as needed

Export Results

Export results periodically for backup
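
"Run in Batches" can be as simple as chunking entity IDs and issuing one documented run request per entity; in this sketch, the chunk size and the choice to await each chunk before starting the next are arbitrary:

// Sketch: evaluate conversations in fixed-size batches to limit load.
async function runInBatches(entityIds, batchSize = 25) {
  for (let i = 0; i < entityIds.length; i += batchSize) {
    const batch = entityIds.slice(i, i + batchSize);
    // Run one chunk concurrently, then wait before starting the next.
    await Promise.all(batch.map((id) =>
      fetch('/evaluation-pipeline-executions/run', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          dataset_id: 1,
          pipeline_node_id: 5,
          entity_type: 'conversation',
          entity_id: id,
        }),
      })
    ));
  }
}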

Common Issues

Evaluation Stuck

Symptom: Progress bar doesn't move
Solutions:
  • Refresh page
  • Check API status
  • Retry evaluation

Slow Evaluations

Symptom: Evaluations take too long
Solutions:
  • Use a faster model (e.g., gpt-5-nano instead of gpt-4)
  • Enable async mode
  • Reduce prompt complexity

Memory Errors

Symptom: “Out of memory” errors
Solutions:
  • Reduce batch size
  • Use rolling summaries (automatic)
  • Simplify prompts

Next Steps