Steps

Steps capture the internal processing of an AI agent, including reasoning (thinking), tool calls, and outputs. They’re optional, but they let an evaluation look at how the agent reached its output, not only at the output itself.

Structure

{
  "steps": [
    {
      "model_name": "gpt-4",
      "agent_name": "order_assistant",
      "thinking": "User wants order status. I should look it up.",
      "tool_call": {
        "name": "lookup_order",
        "arguments": { "order_id": "ORD-123" }
      }
    },
    {
      "model_name": "gpt-4",
      "agent_name": "order_assistant",
      "tool_result": {
        "status": "shipped",
        "tracking": "1Z999..."
      }
    },
    {
      "model_name": "gpt-4",
      "agent_name": "order_assistant",
      "thinking": "Order found. Let me tell the customer.",
      "output_content": "Your order has shipped! Tracking: 1Z999..."
    }
  ]
}
Note: The order of steps is automatically inferred from their position in the steps array (0-indexed). You don’t need to specify step_order.
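As a rough illustration of what position-based ordering means, the sketch below derives a 0-indexed step_order from each step's place in the array. The helper name with_step_order is hypothetical and is not part of any API described here; it only mirrors the rule stated in the note.

```python
from typing import Any


def with_step_order(steps: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the steps annotated with their 0-indexed array position.

    Hypothetical helper: it only illustrates that order follows position.
    """
    return [{**step, "step_order": index} for index, step in enumerate(steps)]


steps = [
    {"thinking": "User wants order status. I should look it up."},
    {"tool_result": {"status": "shipped"}},
    {"output_content": "Your order has shipped!"},
]

ordered = with_step_order(steps)
```

In practice this means the only way to change the order of steps is to change their position in the array before sending them.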

Fields

model_name (optional)

The model used for this step. Defaults to "unknown".
"model_name": "gpt-4-turbo"

agent_name (optional)

The name of the agent that executed this step. In multi-agent systems, it identifies which agent ran the step and which agent's tools to reference.
"agent_name": "order_assistant"
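To show how agent_name can help in a multi-agent trace, here is a minimal sketch that groups tool call names by the agent that issued them. The function tools_by_agent, the second agent name billing_assistant, and the "(unnamed)" placeholder for steps without an agent_name are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any


def tools_by_agent(steps: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each agent_name to the tool names it called, in step order."""
    used: dict[str, list[str]] = defaultdict(list)
    for step in steps:
        call = step.get("tool_call")
        if call is None:
            continue  # this step made no tool call
        # "(unnamed)" is a local placeholder, not a documented default.
        agent = step.get("agent_name", "(unnamed)")
        used[agent].append(call["name"])
    return dict(used)


steps = [
    {
        "agent_name": "order_assistant",
        "tool_call": {"name": "lookup_order", "arguments": {"order_id": "ORD-123"}},
    },
    {
        "agent_name": "billing_assistant",  # hypothetical second agent
        "tool_call": {"name": "process_refund", "arguments": {"order_id": "ORD-123"}},
    },
    {"agent_name": "order_assistant", "output_content": "Your refund is on its way."},
]

summary = tools_by_agent(steps)
```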

thinking (optional)

The model’s internal reasoning or chain-of-thought.
"thinking": "The user is asking about their order. I should use the lookup_order tool to find the current status before responding."

tool_call (optional)

Information about a tool invocation.
"tool_call": {
  "name": "search_products",
  "arguments": {
    "query": "wireless mouse",
    "category": "electronics"
  }
}

tool_result (optional)

The result returned from a tool execution. Can be an object or a string.
"tool_result": {
  "products": [
    { "id": "MOUSE-001", "name": "Pro Wireless Mouse", "price": 79.99 }
  ]
}

output_structured (optional)

Structured output data (JSON object).

output_content (optional)

The final text output from this step.
"output_content": "I found a great option - the Pro Wireless Mouse for $79.99!"

Content Requirement

Each step must have at least one content field: thinking, tool_call, tool_result, output_structured, or output_content.
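A client-side check for this rule could look like the sketch below. The function name missing_content is hypothetical, and treating a field set to null as absent is an assumption; the text above does not say how empty or null values are handled.

```python
from typing import Any

# The five content fields listed in the requirement above.
CONTENT_FIELDS = (
    "thinking",
    "tool_call",
    "tool_result",
    "output_structured",
    "output_content",
)


def missing_content(steps: list[dict[str, Any]]) -> list[int]:
    """Return the 0-indexed positions of steps with no content field.

    Assumption: a field whose value is None counts as absent.
    """
    return [
        index
        for index, step in enumerate(steps)
        if not any(step.get(field) is not None for field in CONTENT_FIELDS)
    ]


steps = [
    {"model_name": "gpt-4", "thinking": "User is greeting me."},
    {"model_name": "gpt-4", "agent_name": "order_assistant"},  # metadata only
]

invalid = missing_content(steps)
```

Running a check like this before submitting makes it easy to see which positions in the array need a content field added.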

Common Patterns

Simple Response (No Tools)

{
  "steps": [
    {
      "model_name": "gpt-4",
      "thinking": "User is greeting me. I should respond warmly.",
      "output_content": "Hello! How can I help you today?"
    }
  ]
}

Tool Usage

{
  "steps": [
    {
      "model_name": "gpt-4",
      "thinking": "I need to look up the order status.",
      "tool_call": {
        "name": "lookup_order",
        "arguments": { "order_id": "ORD-123" }
      },
      "tool_result": {
        "status": "delivered",
        "delivered_at": "2024-01-19"
      }
    },
    {
      "model_name": "gpt-4",
      "thinking": "Order was delivered. I'll let the customer know.",
      "output_content": "Great news! Your order was delivered on January 19th."
    }
  ]
}

Multiple Tool Calls

{
  "steps": [
    {
      "thinking": "Need to check order first",
      "tool_call": { "name": "lookup_order", "arguments": { "order_id": "ORD-123" } },
      "tool_result": { "status": "pending", "amount": 49.99 }
    },
    {
      "thinking": "Order qualifies for refund, processing now",
      "tool_call": { "name": "process_refund", "arguments": { "order_id": "ORD-123", "amount": 49.99 } },
      "tool_result": { "refund_id": "REF-789", "status": "processed" }
    },
    {
      "thinking": "Refund processed successfully",
      "output_content": "I've processed your refund of $49.99. Reference: REF-789"
    }
  ]
}
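When a trace is assembled in code, a small helper can keep each step limited to the fields that are actually set. This is a sketch only: make_step is a hypothetical helper, not a function from any SDK, and it assumes that omitting unused fields is acceptable because every field is documented as optional.

```python
import json
from typing import Any, Optional


def make_step(
    thinking: str,
    tool_call: Optional[dict[str, Any]] = None,
    tool_result: Optional[Any] = None,
    output_content: Optional[str] = None,
) -> dict[str, Any]:
    """Build one step, leaving out any optional field that was not provided."""
    step: dict[str, Any] = {"thinking": thinking}
    if tool_call is not None:
        step["tool_call"] = tool_call
    if tool_result is not None:
        step["tool_result"] = tool_result
    if output_content is not None:
        step["output_content"] = output_content
    return step


payload = {
    "steps": [
        make_step(
            "Need to check order first",
            tool_call={"name": "lookup_order", "arguments": {"order_id": "ORD-123"}},
            tool_result={"status": "pending", "amount": 49.99},
        ),
        make_step(
            "Order qualifies for refund, processing now",
            tool_call={
                "name": "process_refund",
                "arguments": {"order_id": "ORD-123", "amount": 49.99},
            },
            tool_result={"refund_id": "REF-789", "status": "processed"},
        ),
        make_step(
            "Refund processed successfully",
            output_content="I've processed your refund of $49.99. Reference: REF-789",
        ),
    ]
}

body = json.dumps(payload, indent=2)  # JSON text in the same shape as the example above
```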

Why Use Steps?

Steps enable more detailed evaluation:
  • Reasoning Quality: Evaluate whether the agent’s thinking is sound
  • Tool Selection: Check whether the right tools were called with the right arguments
  • Error Handling: See how the agent responds to tool failures
  • Efficiency: Spot unnecessary steps or redundant tool calls
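As one example of an efficiency check, the sketch below counts tool calls that repeat the same name and arguments within a trace. It is an illustration rather than a built-in metric: redundant_tool_calls is a hypothetical function, and treating "same name and same arguments" as redundant is an assumption that will not suit every agent (a retry after a failure, for instance, may be legitimate).

```python
import json
from collections import Counter
from typing import Any


def redundant_tool_calls(steps: list[dict[str, Any]]) -> dict[tuple[str, str], int]:
    """Count tool calls that appear more than once with identical arguments.

    Arguments are serialized with sorted keys so key order does not matter.
    """
    counts: Counter = Counter()
    for step in steps:
        call = step.get("tool_call")
        if call is None:
            continue
        key = (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


steps = [
    {"tool_call": {"name": "lookup_order", "arguments": {"order_id": "ORD-123"}}},
    {"tool_call": {"name": "lookup_order", "arguments": {"order_id": "ORD-123"}}},
    {"tool_call": {"name": "process_refund", "arguments": {"order_id": "ORD-123"}}},
    {"output_content": "I've processed your refund."},
]

repeats = redundant_tool_calls(steps)
```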