Creating Metrics

Metrics are the heart of TurnWise: they define what you want to evaluate in your conversations. This guide shows you how to create metrics using the UI or API.

What Are Metrics?

Metrics are evaluation criteria that measure specific aspects of conversations:
  • Helpfulness: Is the response helpful?
  • Accuracy: Is the information correct?
  • Politeness: Is the tone appropriate?
  • Tool Usage: Was the right tool selected?
  • Goal Achievement: Did the conversation achieve its goal?
Each metric consists of:
  • Name: Descriptive name
  • Evaluation Level: conversation, message, or step
  • Prompt: Instructions for the evaluator LLM
  • Output Type: text, number, checkbox, progress, or JSON
  • Model: LLM to use for evaluation
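
Put together, a metric is a single configuration. The sketch below is illustrative only; the field names are placeholders, not TurnWise’s exact configuration format:

{
  "name": "Response Helpfulness",
  "level": "message",
  "prompt": "Is @CURRENT_MESSAGE.output helpful? Answer yes or no.",
  "output_type": "checkbox",
  "model": "openai/gpt-5-nano"
}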

Creating Metrics via UI

Step 1: Open Your Dataset

Navigate to your dataset and click “Add Column” in the table header.

Step 2: Choose Creation Method

You have two options:

AI-Powered Generation

Describe what you want to evaluate in natural language

Manual Creation

Create the metric configuration yourself

Method 1: AI-Powered Generation

  1. Click “Generate with AI”
  2. Describe Your Metric
    • Example: “Check if the assistant is being polite and professional”
    • Example: “Rate the helpfulness of each response on a scale of 1-10”
    • Example: “Evaluate if the correct tool was selected”
  3. Review Generated Configuration
    • TurnWise generates:
      • Metric name
      • Evaluation level
      • Prompt
      • Output type
      • JSON schema (if JSON output)
  4. Edit if Needed
    • Modify the generated configuration
    • Adjust prompt wording
    • Change output type
  5. Save
    • Click “Create” to add the metric
AI generation is context-aware: it analyzes your description and chooses an appropriate evaluation level, output type, and prompt structure.
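
For instance, describing “Check if the assistant is being polite and professional” might produce a configuration along these lines (illustrative only; the exact result will vary):

Name: Politeness Check
Level: Message
Prompt: Is @CURRENT_MESSAGE.output polite and professional in tone? Answer yes or no.
Output Type: Checkbox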

Method 2: Manual Creation

  1. Fill in Basic Info
    • Name: Descriptive name (e.g., “Response Helpfulness”)
    • Description: Optional explanation
  2. Choose Evaluation Level
    • Conversation: overall quality, goal achievement (e.g., “Did the conversation solve the user’s problem?”)
    • Message: individual responses (e.g., “Is this response helpful?”)
    • Step: reasoning steps, tool usage (e.g., “Was the correct tool selected?”)
  3. Write Your Prompt
    Basic prompt (no template variables):
    Is this response helpful? Answer yes or no.
    
    Advanced prompt (with template variables):
    Evaluate @CURRENT_MESSAGE.output for helpfulness given the user's question: @PREVIOUS_USER_MSG
    
    See Advanced Metrics for template variables.
  4. Select Output Type
    • Text: free-form text response (explanations, reasoning)
    • Number: numeric value (scores, ratings)
    • Checkbox: yes/no, pass/fail (binary evaluations)
    • Progress: 0-1 normalized score (quality scores, percentages)
    • JSON: structured output (multi-dimensional analysis)
  5. Configure JSON Schema (if JSON output)
    {
      "type": "object",
      "properties": {
        "score": {
          "type": "number",
          "description": "Helpfulness score from 0-1"
        },
        "reasoning": {
          "type": "string",
          "description": "Explanation of the score"
        }
      },
      "required": ["score", "reasoning"]
    }
    
  6. Choose Model
    Select the LLM to use:
    • Default: openai/gpt-5-nano (cost-effective)
    • For complex evaluations: openai/gpt-4 or anthropic/claude-sonnet
  7. Save
    • Click “Create” to add the metric

Understanding Evaluation Levels

Conversation Level

Evaluates the entire conversation. Use it when:
  • Measuring overall goal achievement
  • Assessing conversation quality holistically
  • Evaluating conversation patterns
Example Prompt:
Did this conversation successfully help the user achieve their goal? 
Consider: @GOAL and @HISTORY
Available Variables:
  • @HISTORY - Full conversation history
  • @GOAL - User’s goal
  • @LIST_AGENT - Available agents and tools
  • @MESSAGES - All messages
  • @USER_MESSAGES - User messages only
  • @ASSISTANT_MESSAGES - Assistant messages only
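For example, a conversation-level prompt that uses the message filters (illustrative only):
Considering @GOAL, did the responses in @ASSISTANT_MESSAGES stay on topic throughout the conversation? Answer yes or no.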

Message Level

Evaluates individual assistant responses. Use it when:
  • Measuring response quality
  • Checking tone and style
  • Verifying accuracy per message
Example Prompt:
Is @CURRENT_MESSAGE.output helpful given @PREVIOUS_USER_MSG?
Rate on a scale of 0-1.
Available Variables:
  • All conversation-level variables
  • @PREVIOUS_USER_MSG - Previous user message
  • @PREVIOUS_ASSISTANT_MSG - Previous assistant message
  • @CURRENT_MESSAGE.output - Current message content
  • @CURRENT_MESSAGE.role - Current message role
  • @CURRENT_STEPS - Steps in current message

Step Level

Evaluates individual reasoning steps. Use it when:
  • Evaluating tool selection
  • Checking reasoning quality
  • Verifying parameter accuracy
Example Prompt:
Was @CURRENT_STEP.tool_call the correct tool to use given @PREVIOUS_STEP.tool_result?
Answer yes or no.
Available Variables:
  • All message-level variables
  • @PREVIOUS_STEP.* - Previous step details
  • @CURRENT_STEP.* - Current step details
  • @STEP_NUMBER - Step position

Output Types Explained

Text

Free-form text responses:
Prompt: "Explain why this response is helpful or not."
Output: "The response directly addresses the user's question about order status and provides clear next steps."
Use for: Explanations, detailed reasoning, qualitative feedback

Number

Numeric values:
Prompt: "Rate helpfulness from 1-10."
Output: 8.5
Use for: Scores, ratings, counts

Checkbox

Binary yes/no:
Prompt: "Is this response helpful? Answer yes or no."
Output: true
Use for: Pass/fail, yes/no evaluations

Progress

Normalized 0-1 score (displayed as 0-100%):
Prompt: "Rate helpfulness from 0-1."
Output: 0.85
Display: 85%
Use for: Quality scores, percentages, normalized ratings

JSON

Structured multi-field output:
Prompt: "Evaluate response quality with score and reasoning."
Output: {
  "score": 0.85,
  "reasoning": "Response is helpful and accurate",
  "accuracy": 0.9,
  "completeness": 0.8
}
Use for: Multi-dimensional analysis, structured evaluations
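
To get output like the example above, you could configure a schema along these lines (a sketch in the same format as Step 5; the exact fields and descriptions are up to you):

{
  "type": "object",
  "properties": {
    "score": { "type": "number", "description": "Overall quality from 0-1" },
    "reasoning": { "type": "string", "description": "Explanation of the score" },
    "accuracy": { "type": "number", "description": "Accuracy from 0-1" },
    "completeness": { "type": "number", "description": "Completeness from 0-1" }
  },
  "required": ["score", "reasoning", "accuracy", "completeness"]
}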

Example Metrics

Example 1: Simple Helpfulness Check

Name: Response Helpfulness
Level: Message
Prompt: Is @CURRENT_MESSAGE.output helpful? Answer yes or no.
Output Type: Checkbox

Example 2: Detailed Quality Score

Name: Response Quality Score
Level: Message
Prompt: Rate @CURRENT_MESSAGE.output for quality considering:
- Accuracy
- Completeness
- Helpfulness
Provide a score from 0-1.
Output Type: Progress

Example 3: Tool Selection Evaluation

Name: Correct Tool Selection
Level: Step
Prompt: Given @PREVIOUS_STEP.tool_result, was @CURRENT_STEP.tool_call the correct next tool?
Consider the available tools: @LIST_AGENT
Answer yes or no.
Output Type: Checkbox

Example 4: Multi-Dimensional Analysis

Name: Comprehensive Quality Analysis
Level: Message
Prompt: Evaluate @CURRENT_MESSAGE.output across multiple dimensions:
- Helpfulness
- Accuracy
- Tone
- Completeness
Output Type: JSON
Schema: {
  "type": "object",
  "properties": {
    "helpfulness": { "type": "number", "description": "Helpfulness from 0-1" },
    "accuracy": { "type": "number", "description": "Accuracy from 0-1" },
    "tone": { "type": "string", "enum": ["polite", "neutral", "rude"] },
    "completeness": { "type": "number", "description": "Completeness from 0-1" },
    "reasoning": { "type": "string", "description": "Explanation of the scores" }
  },
  "required": ["helpfulness", "accuracy", "tone", "completeness", "reasoning"]
}

Best Practices

Be Specific

Write clear, specific prompts with evaluation criteria

Use Template Variables

Leverage @HISTORY, @GOAL, etc. for context-aware evaluation

Choose the Right Level

Match evaluation level to what you’re measuring

Test First

Test metrics on a few conversations before running them on the full dataset

Next Steps