
Welcome to TurnWise

TurnWise is a comprehensive platform for evaluating AI agent conversations. Upload your conversation data, create custom evaluation metrics, and gain insights into your AI agents' performance.

What is TurnWise?

TurnWise helps you:
  • Evaluate AI Conversations: Create custom metrics to measure quality, safety, helpfulness, and more
  • Track Performance: Monitor how your AI agents perform across different scenarios
  • Identify Issues: Spot problematic conversations and understand failure patterns
  • Improve Agents: Use insights to make data-driven improvements to your AI systems

Key Concepts

  • Datasets: Collections of conversations you want to evaluate
  • Conversations: Individual chat threads between users and AI agents
  • Metrics: Custom evaluation criteria you define
  • Pipelines: Workflows that run metrics across your data

Getting Started

You can use TurnWise in two ways:

Web Interface

1. Create a Dataset: Start by creating a new dataset to hold your conversations
2. Import Conversations: Upload your conversation data in TurnWise JSON format (see the Data Format section below)
3. Add Evaluation Metrics: Create metrics to evaluate your conversations
4. Run Evaluations: Execute your evaluation pipeline and review results

Python SDK

Use the TurnWise Python SDK to run evaluations programmatically with your own API keys:
import asyncio

from turnwise import TurnWiseClient, Metric, EvaluationLevel, OutputType

# The client takes both a TurnWise key and an OpenRouter key, since
# evaluations run against your own LLM credentials.
client = TurnWiseClient(
    turnwise_api_key="tw_xxx",
    openrouter_api_key="sk-or-xxx",
)

# A custom metric evaluated once per message; the prompt references the
# current message's output via the @CURRENT_MESSAGE.output placeholder.
metric = Metric(
    name="Helpfulness",
    prompt="Evaluate: @CURRENT_MESSAGE.output",
    evaluation_level=EvaluationLevel.MESSAGE,
    output_type=OutputType.PROGRESS,
)

# evaluate() is a coroutine, so run it inside an event loop.
async def main():
    return await client.evaluate(dataset_id=1, metric=metric)

results = asyncio.run(main())

See the Python SDK documentation for more details.

Data Format

TurnWise uses a hierarchical JSON format:
Dataset
└── Conversations
    ├── Agents (optional)
    └── Messages
        └── Steps (optional)

See the Data Format Overview for complete documentation.
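
To make the hierarchy concrete, the sketch below builds a minimal dataset as a Python dict and serializes it to JSON. Every field name in it (name, conversations, agents, messages, role, content, steps) is an assumption made for illustration, not the authoritative schema; defer to the Data Format Overview for the actual field names.

import json

# Illustrative only: all key names below are assumptions, not the
# documented TurnWise schema.
dataset = {
    "name": "support-bot-evals",
    "conversations": [
        {
            "agents": [{"name": "support-bot"}],  # optional
            "messages": [
                {"role": "user", "content": "How do I reset my password?"},
                {
                    "role": "assistant",
                    "content": "Go to Settings and choose Reset Password.",
                    "steps": [{"name": "lookup_help_article"}],  # optional
                },
            ],
        }
    ],
}

print(json.dumps(dataset, indent=2))  # the JSON document you would upload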