Creating Metrics
Metrics are the heart of TurnWise: they define what you want to evaluate in your conversations. This guide shows you how to create metrics using the UI or API.
What Are Metrics?
Metrics are evaluation criteria that measure specific aspects of conversations:
- Helpfulness: Is the response helpful?
- Accuracy: Is the information correct?
- Politeness: Is the tone appropriate?
- Tool Usage: Was the right tool selected?
- Goal Achievement: Did the conversation achieve its goal?
Each metric is defined by a few core fields (a configuration sketch follows this list):
- Name: Descriptive name
- Evaluation Level: conversation, message, or step
- Prompt: Instructions for the evaluator LLM
- Output Type: text, number, checkbox, progress, or JSON
- Model: LLM to use for evaluation
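As a rough sketch, these fields combine into a configuration like the one below. The key names are illustrative assumptions, not TurnWise's exact field names; the UI collects the same information through its form.

```python
# Illustrative metric configuration; key names are assumptions, not TurnWise's exact fields.
politeness_metric = {
    "name": "Politeness Check",
    "evaluation_level": "message",   # conversation, message, or step
    "prompt": "Check if the assistant is being polite and professional.",
    "output_type": "checkbox",       # text, number, checkbox, progress, or JSON
    "model": "openai/gpt-5-nano",    # the default model mentioned later in this guide
}
```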
Creating Metrics via UI
Step 1: Open Your Dataset
Navigate to your dataset and click “Add Column” in the table header.
Step 2: Choose Creation Method
You have two options:
AI-Powered Generation
Describe what you want to evaluate in natural language
Manual Creation
Create the metric configuration yourself
Method 1: AI-Powered Generation
1. Click “Generate with AI”
2. Describe Your Metric
   - Example: “Check if the assistant is being polite and professional”
   - Example: “Rate the helpfulness of each response on a scale of 1-10”
   - Example: “Evaluate if the correct tool was selected”
3. Review Generated Configuration
   TurnWise generates:
   - Metric name
   - Evaluation level
   - Prompt
   - Output type
   - JSON schema (if JSON output)
4. Edit if Needed
   - Modify the generated configuration
   - Adjust prompt wording
   - Change output type
5. Save
   - Click “Create” to add the metric
Method 2: Manual Creation
1. Fill in Basic Info
   - Name: Descriptive name (e.g., “Response Helpfulness”)
   - Description: Optional explanation
2. Choose Evaluation Level

   | Level | When to Use | Example |
   | --- | --- | --- |
   | Conversation | Overall quality, goal achievement | “Did the conversation solve the user’s problem?” |
   | Message | Individual responses | “Is this response helpful?” |
   | Step | Reasoning steps, tool usage | “Was the correct tool selected?” |

3. Write Your Prompt
   - Basic prompt (no template variables): a plain instruction, such as the helpfulness checks sketched under Example Metrics below
   - Advanced prompt (with template variables): see Advanced Metrics for template variables
4. Select Output Type

   | Type | Description | Example Use |
   | --- | --- | --- |
   | Text | Free-form text response | Explanations, reasoning |
   | Number | Numeric value | Scores, ratings |
   | Checkbox | Yes/No, Pass/Fail | Binary evaluations |
   | Progress | 0-1 normalized score | Quality scores, percentages |
   | JSON | Structured output | Multi-dimensional analysis |

5. Configure JSON Schema (if JSON output)
   - Define the fields the evaluator should return (a schema sketch follows this list)
6. Choose Model
   Select the LLM to use:
   - Default: openai/gpt-5-nano (cost-effective)
   - For complex evaluations: openai/gpt-4o or anthropic/claude-sonnet
7. Save
   - Click “Create” to add the metric
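For step 5, a JSON-output metric needs a schema describing the fields the evaluator should return. The sketch below is illustrative only: it assumes the schema editor accepts standard JSON Schema, and the field names are not taken from TurnWise itself.

```python
# Illustrative JSON schema for a JSON-output metric (assumes standard JSON Schema is accepted).
quality_schema = {
    "type": "object",
    "properties": {
        "accuracy": {"type": "number", "description": "Factual correctness, 0-1"},
        "tone": {"type": "string", "description": "Brief assessment of tone"},
        "passes": {"type": "boolean", "description": "Overall pass/fail"},
    },
    "required": ["accuracy", "tone", "passes"],
}
```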
Understanding Evaluation Levels
Conversation Level
Evaluates the entire conversation. Use when:
- Measuring overall goal achievement
- Assessing conversation quality holistically
- Evaluating conversation patterns
Available template variables:
- @HISTORY - Full conversation history
- @GOAL - User’s goal
- @LIST_AGENT - Available agents and tools
- @MESSAGES - All messages
- @USER_MESSAGES - User messages only
- @ASSISTANT_MESSAGES - Assistant messages only
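For example, a conversation-level prompt might combine @GOAL and @HISTORY like this (illustrative wording only):

```python
# Illustrative conversation-level prompt using @GOAL and @HISTORY.
goal_prompt = (
    "The user's goal was: @GOAL\n\n"
    "Full conversation: @HISTORY\n\n"
    "Did the conversation fully achieve the user's goal? Answer yes or no."
)
```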
Message Level
Evaluates individual assistant responses. Use when:
- Measuring response quality
- Checking tone and style
- Verifying accuracy per message
Available template variables:
- All conversation-level variables
- @PREVIOUS_USER_MSG - Previous user message
- @PREVIOUS_ASSISTANT_MSG - Previous assistant message
- @CURRENT_MESSAGE.output - Current message content
- @CURRENT_MESSAGE.role - Current message role
- @CURRENT_STEPS - Steps in current message
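A message-level prompt can anchor on the current exchange, for example (illustrative wording only):

```python
# Illustrative message-level prompt using message variables.
response_prompt = (
    "User asked: @PREVIOUS_USER_MSG\n\n"
    "Assistant replied: @CURRENT_MESSAGE.output\n\n"
    "Score how well the reply addresses the request, from 0 (not at all) to 1 (fully)."
)
```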
Step Level
Evaluates individual reasoning steps. Use when:
- Evaluating tool selection
- Checking reasoning quality
- Verifying parameter accuracy
Available template variables:
- All message-level variables
- @PREVIOUS_STEP.* - Previous step details
- @CURRENT_STEP.* - Current step details
- @STEP_NUMBER - Step position
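A step-level prompt can reference the current step, for example (illustrative wording; @CURRENT_STEP.* stands in for whichever step detail fields you actually need, see Advanced Metrics):

```python
# Illustrative step-level prompt; @CURRENT_STEP.* represents the step detail fields
# documented in Advanced Metrics.
step_prompt = (
    "Step @STEP_NUMBER details: @CURRENT_STEP.*\n\n"
    "Was the correct tool selected for this step? Answer yes or no."
)
```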
Output Types Explained
Text
Free-form text responses.
Number
Numeric values.
Checkbox
Binary yes/no.
Progress
Normalized 0-1 score (displayed as 0-100%).
JSON
Structured multi-field output. Sample values for each output type are sketched below.
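As a rough illustration, evaluation results for each output type might look like the following sample values (invented for illustration, not real results):

```python
# Illustrative sample values for each output type (not real evaluation results).
sample_outputs = {
    "text": "The response answers the question but omits the error-handling caveat.",
    "number": 8,
    "checkbox": True,
    "progress": 0.85,
    "json": {"accuracy": 0.9, "tone": "professional", "passes": True},
}
```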
Example Metrics
Example 1: Simple Helpfulness Check
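A minimal sketch of a metric matching this example, with assumed field names:

```python
# Sketch: simple yes/no helpfulness check at the message level (field names assumed).
simple_helpfulness = {
    "name": "Helpfulness Check",
    "evaluation_level": "message",
    "output_type": "checkbox",
    "prompt": "Is this response helpful to the user? Answer yes or no.",
    "model": "openai/gpt-5-nano",
}
```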
Example 2: Detailed Quality Score
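One way such a metric could look, again sketched with assumed field names:

```python
# Sketch: 0-1 quality score for each assistant response (field names assumed).
detailed_quality = {
    "name": "Response Quality",
    "evaluation_level": "message",
    "output_type": "progress",
    "prompt": (
        "User asked: @PREVIOUS_USER_MSG\n"
        "Assistant replied: @CURRENT_MESSAGE.output\n\n"
        "Rate the response for accuracy, completeness, and clarity as a single 0-1 score."
    ),
}
```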
Example 3: Tool Selection Evaluation
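A possible sketch using the step-level variables listed above (field names assumed):

```python
# Sketch: step-level check that the right tool was called (field names assumed).
tool_selection = {
    "name": "Tool Selection",
    "evaluation_level": "step",
    "output_type": "checkbox",
    "prompt": (
        "Available agents and tools: @LIST_AGENT\n"
        "Current step details: @CURRENT_STEP.*\n\n"
        "Was the correct tool selected for this step? Answer yes or no."
    ),
}
```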
Example 4: Multi-Dimensional Analysis
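A sketch of a JSON-output metric covering several dimensions at once (field names and schema are assumptions):

```python
# Sketch: JSON-output metric returning several dimensions at once (schema keys assumed).
multi_dimensional = {
    "name": "Multi-Dimensional Analysis",
    "evaluation_level": "message",
    "output_type": "json",
    "prompt": "Evaluate this response for accuracy, tone, and completeness.",
    "json_schema": {
        "type": "object",
        "properties": {
            "accuracy": {"type": "number"},
            "tone": {"type": "string"},
            "completeness": {"type": "number"},
        },
        "required": ["accuracy", "tone", "completeness"],
    },
}
```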
Best Practices
Be Specific
Write clear, specific prompts with evaluation criteria
Use Template Variables
Leverage @HISTORY, @GOAL, etc. for context-aware evaluation
Choose Right Level
Match evaluation level to what you’re measuring
Test First
Test metrics on a few conversations before running them on all conversations