Creating Metrics

Metrics are the heart of TurnWise: they define what you want to evaluate in your conversations. This guide shows you how to create metrics using the UI or API.

What Are Metrics?

Metrics are evaluation criteria that measure specific aspects of conversations:
  • Helpfulness: Is the response helpful?
  • Accuracy: Is the information correct?
  • Politeness: Is the tone appropriate?
  • Tool Usage: Was the right tool selected?
  • Goal Achievement: Did the conversation achieve its goal?
Each metric consists of:
  • Name: Descriptive name
  • Evaluation Level: conversation, message, or step
  • Prompt: Instructions for the evaluator LLM
  • Output Type: text, number, checkbox, progress, or JSON
  • Model: LLM to use for evaluation
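
Put together, a metric is a single configuration. The sketch below is illustrative only; the field names are placeholders, not TurnWise’s exact configuration format:

{
  "name": "Response Helpfulness",
  "level": "message",
  "prompt": "Is @CURRENT_MESSAGE.output helpful? Answer yes or no.",
  "output_type": "checkbox",
  "model": "openai/gpt-5-nano"
}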

Creating Metrics via UI

Step 1: Open Your Dataset

Navigate to your dataset and click “Add Column” in the table header.

Step 2: Choose Creation Method

You have two options:

AI-Powered Generation

Describe what you want to evaluate in natural language

Manual Creation

Create the metric configuration yourself

Method 1: AI-Powered Generation

  1. Click “Generate with AI”
  2. Describe Your Metric
    • Example: “Check if the assistant is being polite and professional”
    • Example: “Rate the helpfulness of each response on a scale of 1-10”
    • Example: “Evaluate if the correct tool was selected”
  3. Review Generated Configuration
    • TurnWise generates:
      • Metric name
      • Evaluation level
      • Prompt
      • Output type
      • JSON schema (if JSON output)
  4. Edit if Needed
    • Modify the generated configuration
    • Adjust prompt wording
    • Change output type
  5. Save
    • Click “Create” to add the metric
AI generation is context-aware: it analyzes your description and chooses an appropriate evaluation level, output type, and prompt structure.
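
For instance, describing “Check if the assistant is being polite and professional” might produce a configuration along these lines (illustrative only; the exact result will vary):

Name: Politeness Check
Level: Message
Prompt: Is @CURRENT_MESSAGE.output polite and professional in tone? Answer yes or no.
Output Type: Checkbox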

Method 2: Manual Creation

  1. Fill in Basic Info
    • Name: Descriptive name (e.g., “Response Helpfulness”)
    • Description: Optional explanation
  2. Choose Evaluation Level
    • Conversation: overall quality, goal achievement (e.g., “Did the conversation solve the user’s problem?”)
    • Message: individual responses (e.g., “Is this response helpful?”)
    • Step: reasoning steps, tool usage (e.g., “Was the correct tool selected?”)
  3. Write Your Prompt
    Basic prompt (no template variables):
    Is this response helpful? Answer yes or no.
    
    Advanced prompt (with template variables):
    Evaluate @CURRENT_MESSAGE.output for helpfulness given the user's question: @PREVIOUS_USER_MSG
    
    See Advanced Metrics for template variables.
  4. Select Output Type
    • Text: free-form text response (explanations, reasoning)
    • Number: numeric value (scores, ratings)
    • Checkbox: yes/no, pass/fail (binary evaluations)
    • Progress: 0-1 normalized score (quality scores, percentages)
    • JSON: structured output (multi-dimensional analysis)
  5. Configure JSON Schema (if JSON output)
    {
      "type": "object",
      "properties": {
        "score": {
          "type": "number",
          "description": "Helpfulness score from 0-1"
        },
        "reasoning": {
          "type": "string",
          "description": "Explanation of the score"
        }
      },
      "required": ["score", "reasoning"]
    }
    
  6. Choose Model
    Select the LLM to use:
    • Default: openai/gpt-5-nano (cost-effective)
    • For complex evaluations: openai/gpt-4 or anthropic/claude-sonnet
  7. Save
    • Click “Create” to add the metric

Understanding Evaluation Levels

Conversation Level

Evaluates the entire conversation. Use it when:
  • Measuring overall goal achievement
  • Assessing conversation quality holistically
  • Evaluating conversation patterns
Example Prompt:
Did this conversation successfully help the user achieve their goal? 
Consider: @GOAL and @HISTORY
Available Variables:
  • @HISTORY - Full conversation history
  • @GOAL - User’s goal
  • @LIST_AGENT - Available agents and tools
  • @MESSAGES - All messages
  • @USER_MESSAGES - User messages only
  • @ASSISTANT_MESSAGES - Assistant messages only
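For example, a conversation-level prompt that uses the message filters (illustrative only):
Considering @GOAL, did the responses in @ASSISTANT_MESSAGES stay on topic throughout the conversation? Answer yes or no.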

Message Level

Evaluates individual assistant responses. Use it when:
  • Measuring response quality
  • Checking tone and style
  • Verifying accuracy per message
Example Prompt:
Is @CURRENT_MESSAGE.output helpful given @PREVIOUS_USER_MSG?
Rate on a scale of 0-1.
Available Variables:
  • All conversation-level variables
  • @PREVIOUS_USER_MSG - Previous user message
  • @PREVIOUS_ASSISTANT_MSG - Previous assistant message
  • @CURRENT_MESSAGE.output - Current message content
  • @CURRENT_MESSAGE.role - Current message role
  • @CURRENT_STEPS - Steps in current message

Step Level

Evaluates individual reasoning steps. Use it when:
  • Evaluating tool selection
  • Checking reasoning quality
  • Verifying parameter accuracy
Example Prompt:
Was @CURRENT_STEP.tool_call the correct tool to use given @PREVIOUS_STEP.tool_result?
Answer yes or no.
Available Variables:
  • All message-level variables
  • @PREVIOUS_STEP.* - Previous step details
  • @CURRENT_STEP.* - Current step details
  • @STEP_NUMBER - Step position

Output Types Explained

Text

Free-form text responses:
Prompt: "Explain why this response is helpful or not."
Output: "The response directly addresses the user's question about order status and provides clear next steps."
Use for: Explanations, detailed reasoning, qualitative feedback

Number

Numeric values:
Prompt: "Rate helpfulness from 1-10."
Output: 8.5
Use for: Scores, ratings, counts

Checkbox

Binary yes/no:
Prompt: "Is this response helpful? Answer yes or no."
Output: true
Use for: Pass/fail, yes/no evaluations

Progress

Normalized 0-1 score (displayed as 0-100%):
Prompt: "Rate helpfulness from 0-1."
Output: 0.85
Display: 85%
Use for: Quality scores, percentages, normalized ratings

JSON

Structured multi-field output:
Prompt: "Evaluate response quality with score and reasoning."
Output: {
  "score": 0.85,
  "reasoning": "Response is helpful and accurate",
  "accuracy": 0.9,
  "completeness": 0.8
}
Use for: Multi-dimensional analysis, structured evaluations
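
To get output like the example above, you could configure a schema along these lines (a sketch in the same format as Step 5; the exact fields and descriptions are up to you):

{
  "type": "object",
  "properties": {
    "score": { "type": "number", "description": "Overall quality from 0-1" },
    "reasoning": { "type": "string", "description": "Explanation of the score" },
    "accuracy": { "type": "number", "description": "Accuracy from 0-1" },
    "completeness": { "type": "number", "description": "Completeness from 0-1" }
  },
  "required": ["score", "reasoning", "accuracy", "completeness"]
}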

Example Metrics

Example 1: Simple Helpfulness Check

Name: Response Helpfulness
Level: Message
Prompt: Is @CURRENT_MESSAGE.output helpful? Answer yes or no.
Output Type: Checkbox

Example 2: Detailed Quality Score

Name: Response Quality Score
Level: Message
Prompt: Rate @CURRENT_MESSAGE.output for quality considering:
- Accuracy
- Completeness
- Helpfulness
Provide a score from 0-1.
Output Type: Progress

Example 3: Tool Selection Evaluation

Name: Correct Tool Selection
Level: Step
Prompt: Given @PREVIOUS_STEP.tool_result, was @CURRENT_STEP.tool_call the correct next tool?
Consider the available tools: @LIST_AGENT
Answer yes or no.
Output Type: Checkbox

Example 4: Multi-Dimensional Analysis

Name: Comprehensive Quality Analysis
Level: Message
Prompt: Evaluate @CURRENT_MESSAGE.output across multiple dimensions:
- Helpfulness
- Accuracy
- Tone
- Completeness
Output Type: JSON
Schema: {
  "type": "object",
  "properties": {
    "helpfulness": { "type": "number", "description": "Helpfulness from 0-1" },
    "accuracy": { "type": "number", "description": "Accuracy from 0-1" },
    "tone": { "type": "string", "enum": ["polite", "neutral", "rude"] },
    "completeness": { "type": "number", "description": "Completeness from 0-1" },
    "reasoning": { "type": "string", "description": "Explanation of the scores" }
  },
  "required": ["helpfulness", "accuracy", "tone", "completeness", "reasoning"]
}

Best Practices

Be Specific

Write clear, specific prompts with evaluation criteria

Use Template Variables

Leverage @HISTORY, @GOAL, etc. for context-aware evaluation

Choose the Right Level

Match evaluation level to what you’re measuring

Test First

Test metrics on a few conversations before running them on the full dataset

Next Steps