## Description

**Feature Request: More Flexible Tool Trajectory Evaluation Options**

### Summary

The current `TrajectoryEvaluator` in ADK requires an exact match of both tool names and all argument values, which is too strict for practical agent development and testing. As a result, the evaluation score drops to 0.0 even when the agent correctly calls the intended tools with only slightly different argument values.
### Current Behavior

The `TrajectoryEvaluator._are_tool_calls_equal()` method uses a strict comparison:

```python
def _are_tool_calls_equal(self, actual_tool_calls, expected_tool_calls):
    # ...
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False  # Returns 0.0 score for ANY difference
    return True
```
This means that even minor differences in arguments result in complete evaluation failure.
### Problem Examples

#### Example 1: Slight Argument Variations

```json
// Expected
{
  "tool_name": "dynamic_send_agent_plan",
  "tool_input": {
    "selected_agent_type": "llm_agent",
    "plan_summary": "Generate response to user greeting",
    "plan_steps": [{"step_type": "text_generation", "order": 1}]
  }
}

// Actual (from LLM)
{
  "name": "dynamic_send_agent_plan",
  "args": {
    "selected_agent_type": "llm_agent",
    "plan_summary": "Generate a response to user greeting",  // Slight wording difference
    "plan_steps": [{"step_type": "text_generation", "order": 1}]
  }
}
```

**Result:** 0.0 score despite correct tool usage and agent type.
#### Example 2: Additional Fields from LLM

```json
// Expected
{
  "tool_name": "dynamic_send_agent_plan",
  "tool_input": {
    "selected_agent_type": "gpt_image_agent"
  }
}

// Actual (from LLM)
{
  "name": "dynamic_send_agent_plan",
  "args": {
    "selected_agent_type": "gpt_image_agent",
    "plan_summary": "Generate image as requested",              // Extra field added by LLM
    "execution_context": {"timestamp": "2024-01-01T00:00:00Z"}  // Extra field
  }
}
```

**Result:** 0.0 score despite the correct tool and agent type.
### Proposed Solutions

#### Option 1: Partial Argument Matching

Add support for evaluating only the specified arguments, ignoring unspecified ones:

```python
class TrajectoryEvaluator:

    def __init__(self, threshold: float, partial_matching: bool = False):
        self._threshold = threshold
        self._partial_matching = partial_matching

    def _are_tool_calls_equal(self, actual_tool_calls, expected_tool_calls):
        # When partial_matching=True, only check the arguments specified
        # in the expected tool call.
        if self._partial_matching:
            return self._partial_args_match(actual_tool_calls, expected_tool_calls)
        # ... existing exact matching logic
```
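A minimal sketch of what the `_partial_args_match` helper referenced above could look like, assuming each tool call exposes a `name` attribute and a dict-like `args` attribute (an illustrative shape, not necessarily ADK's internal data model):

```python
def _partial_args_match(self, actual_tool_calls, expected_tool_calls):
    """Match tool calls, comparing only the args listed in the expected calls."""
    if len(actual_tool_calls) != len(expected_tool_calls):
        return False
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name:
            return False
        # Only keys present in the expected args must match;
        # extra keys produced by the LLM are ignored.
        for key, expected_value in expected.args.items():
            if key not in actual.args or actual.args[key] != expected_value:
                return False
    return True
```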
#### Option 2: Evaluation Modes

Introduce different evaluation modes:

```python
from enum import Enum

class ToolEvaluationMode(Enum):
    EXACT = "exact"                    # Current behavior
    TOOL_NAME_ONLY = "tool_name_only"  # Only check tool names
    PARTIAL_ARGS = "partial_args"      # Check specified args only
    FUZZY_MATCHING = "fuzzy_matching"  # Allow minor string differences

class TrajectoryEvaluator:

    def __init__(
        self,
        threshold: float,
        evaluation_mode: ToolEvaluationMode = ToolEvaluationMode.EXACT,
    ):
        ...
```
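As one possible (purely illustrative) way to dispatch on the mode, the comparison could branch per mode. Here `difflib.SequenceMatcher` is used only as an example of tolerating minor string differences, and the 0.9 similarity threshold is an arbitrary placeholder:

```python
import difflib

def _values_roughly_equal(expected, actual, min_ratio: float = 0.9) -> bool:
    """Treat strings as equal when nearly identical; other types must match exactly."""
    if isinstance(expected, str) and isinstance(actual, str):
        return difflib.SequenceMatcher(None, expected, actual).ratio() >= min_ratio
    return expected == actual

def _tool_call_matches(mode: ToolEvaluationMode, expected, actual) -> bool:
    if actual.name != expected.name:
        return False
    if mode is ToolEvaluationMode.TOOL_NAME_ONLY:
        return True
    if mode is ToolEvaluationMode.PARTIAL_ARGS:
        # Only the args specified in the expected call must match exactly.
        return all(
            key in actual.args and actual.args[key] == value
            for key, value in expected.args.items()
        )
    if mode is ToolEvaluationMode.FUZZY_MATCHING:
        # Same keys required, but string values may differ slightly.
        return expected.args.keys() == actual.args.keys() and all(
            _values_roughly_equal(expected.args[k], actual.args[k])
            for k in expected.args
        )
    return actual.args == expected.args  # EXACT
```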
#### Option 3: Configurable Argument Filters

Allow users to specify which arguments are critical:

```python
# Test configuration
test_config = {
    "tool_name": "dynamic_send_agent_plan",
    "required_args": ["selected_agent_type"],         # Only these args matter
    "optional_args": ["plan_summary", "plan_steps"],  # These can vary
}
```
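A rough sketch of how such a configuration might be applied to a single tool call, reusing `test_config` from above and assuming the call is a plain dict with `name` and `args` keys (an illustrative shape rather than ADK's actual data structures):

```python
def call_matches_config(tool_call: dict, config: dict, expected_args: dict) -> bool:
    """Return True if the call uses the right tool and its required args match."""
    if tool_call["name"] != config["tool_name"]:
        return False
    # Only arguments declared as required are compared; optional args may vary or be absent.
    for arg in config["required_args"]:
        if tool_call["args"].get(arg) != expected_args.get(arg):
            return False
    return True

# Example 2 above would now pass, since only selected_agent_type is required.
actual_call = {
    "name": "dynamic_send_agent_plan",
    "args": {
        "selected_agent_type": "gpt_image_agent",
        "plan_summary": "Generate image as requested",
    },
}
assert call_matches_config(actual_call, test_config, {"selected_agent_type": "gpt_image_agent"})
```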
### Real-World Use Cases

- **Agent Type Validation**: Often we only care that the correct agent type was selected, not the exact wording of the plan summary
- **Step Type Verification**: We want to verify the correct sequence of step types without worrying about specific descriptions
- **Tool Usage Patterns**: Testing that tools are called in the right order with the right general parameters
- **LLM Output Variations**: Different LLM runs may produce slightly different but functionally equivalent arguments
### Benefits

- **More Practical Testing**: Allows focusing on the important aspects of tool usage
- **Reduced Test Brittleness**: Tests won't break due to minor LLM output variations
- **Better Developer Experience**: More realistic evaluation scores that reflect actual agent performance
- **Flexible Test Design**: Different evaluation criteria for different testing scenarios
### Backward Compatibility

This can be implemented as an opt-in feature, maintaining exact matching as the default behavior to ensure no breaking changes.
### Example Implementation

Here's a potential API design:

```python
# Current usage (unchanged)
evaluator = TrajectoryEvaluator(threshold=0.8)

# New flexible options
evaluator = TrajectoryEvaluator(
    threshold=0.8,
    evaluation_mode=ToolEvaluationMode.PARTIAL_ARGS,
)

# Or with argument specification
evaluator = TrajectoryEvaluator(
    threshold=0.8,
    required_args_only=True,
    fuzzy_string_matching=True,
)
```
This enhancement would make tool trajectory evaluation much more practical for real-world agent development and testing scenarios.