## Description

**Feature Request: More Flexible Tool Trajectory Evaluation Options**

### Summary

The current `TrajectoryEvaluator` in ADK requires an exact match of both tool names and all argument values, which is too strict for practical agent development and testing. As a result, the evaluation score drops to 0.0 even when the agent correctly calls the intended tools with only slightly different argument values.
### Current Behavior

The `TrajectoryEvaluator._are_tool_calls_equal()` method uses a strict comparison:

```python
def _are_tool_calls_equal(self, actual_tool_calls, expected_tool_calls):
    # ...
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name or actual.args != expected.args:
            return False  # Returns 0.0 score for ANY difference
    return True
```
This means that even minor differences in arguments result in complete evaluation failure.
### Problem Examples

#### Example 1: Slight Argument Variations

```json
// Expected
{
  "tool_name": "dynamic_send_agent_plan",
  "tool_input": {
    "selected_agent_type": "llm_agent",
    "plan_summary": "Generate response to user greeting",
    "plan_steps": [{"step_type": "text_generation", "order": 1}]
  }
}

// Actual (from LLM)
{
  "name": "dynamic_send_agent_plan",
  "args": {
    "selected_agent_type": "llm_agent",
    "plan_summary": "Generate a response to user greeting",  // Slight wording difference
    "plan_steps": [{"step_type": "text_generation", "order": 1}]
  }
}
```

**Result:** 0.0 score despite correct tool usage and agent type.
#### Example 2: Additional Fields from LLM

```json
// Expected
{
  "tool_name": "dynamic_send_agent_plan",
  "tool_input": {
    "selected_agent_type": "gpt_image_agent"
  }
}

// Actual (from LLM)
{
  "name": "dynamic_send_agent_plan",
  "args": {
    "selected_agent_type": "gpt_image_agent",
    "plan_summary": "Generate image as requested",              // Extra field added by LLM
    "execution_context": {"timestamp": "2024-01-01T00:00:00Z"}  // Extra field
  }
}
```

**Result:** 0.0 score despite the correct tool and agent type.
### Proposed Solutions

#### Option 1: Partial Argument Matching

Add support for evaluating only the specified arguments, ignoring unspecified ones:

```python
class TrajectoryEvaluator:

    def __init__(self, threshold: float, partial_matching: bool = False):
        self._threshold = threshold
        self._partial_matching = partial_matching

    def _are_tool_calls_equal(self, actual_tool_calls, expected_tool_calls):
        # When partial_matching=True, only check the arguments specified
        # in the expected tool call.
        if self._partial_matching:
            return self._partial_args_match(actual_tool_calls, expected_tool_calls)
        # ... existing exact matching logic
```
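A minimal sketch of what the `_partial_args_match` helper referenced above could look like, assuming each tool call exposes a `name` attribute and a dict-like `args` attribute (an illustrative shape, not necessarily ADK's internal data model):

```python
def _partial_args_match(self, actual_tool_calls, expected_tool_calls):
    """Match tool calls, comparing only the args listed in the expected calls."""
    if len(actual_tool_calls) != len(expected_tool_calls):
        return False
    for actual, expected in zip(actual_tool_calls, expected_tool_calls):
        if actual.name != expected.name:
            return False
        # Only keys present in the expected args must match;
        # extra keys produced by the LLM are ignored.
        for key, expected_value in expected.args.items():
            if key not in actual.args or actual.args[key] != expected_value:
                return False
    return True
```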
#### Option 2: Evaluation Modes

Introduce different evaluation modes:

```python
from enum import Enum

class ToolEvaluationMode(Enum):
    EXACT = "exact"                    # Current behavior
    TOOL_NAME_ONLY = "tool_name_only"  # Only check tool names
    PARTIAL_ARGS = "partial_args"      # Check specified args only
    FUZZY_MATCHING = "fuzzy_matching"  # Allow minor string differences

class TrajectoryEvaluator:

    def __init__(
        self,
        threshold: float,
        evaluation_mode: ToolEvaluationMode = ToolEvaluationMode.EXACT,
    ):
        ...
```
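As one possible (purely illustrative) way to dispatch on the mode, the comparison could branch per mode. Here `difflib.SequenceMatcher` is used only as an example of tolerating minor string differences, and the 0.9 similarity threshold is an arbitrary placeholder:

```python
import difflib

def _values_roughly_equal(expected, actual, min_ratio: float = 0.9) -> bool:
    """Treat strings as equal when nearly identical; other types must match exactly."""
    if isinstance(expected, str) and isinstance(actual, str):
        return difflib.SequenceMatcher(None, expected, actual).ratio() >= min_ratio
    return expected == actual

def _tool_call_matches(mode: ToolEvaluationMode, expected, actual) -> bool:
    if actual.name != expected.name:
        return False
    if mode is ToolEvaluationMode.TOOL_NAME_ONLY:
        return True
    if mode is ToolEvaluationMode.PARTIAL_ARGS:
        # Only the args specified in the expected call must match exactly.
        return all(
            key in actual.args and actual.args[key] == value
            for key, value in expected.args.items()
        )
    if mode is ToolEvaluationMode.FUZZY_MATCHING:
        # Same keys required, but string values may differ slightly.
        return expected.args.keys() == actual.args.keys() and all(
            _values_roughly_equal(expected.args[k], actual.args[k])
            for k in expected.args
        )
    return actual.args == expected.args  # EXACT
```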
#### Option 3: Configurable Argument Filters

Allow users to specify which arguments are critical:

```python
# Test configuration
test_config = {
    "tool_name": "dynamic_send_agent_plan",
    "required_args": ["selected_agent_type"],         # Only these args matter
    "optional_args": ["plan_summary", "plan_steps"],  # These can vary
}
```
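A rough sketch of how such a configuration might be applied to a single tool call, reusing `test_config` from above and assuming the call is a plain dict with `name` and `args` keys (an illustrative shape rather than ADK's actual data structures):

```python
def call_matches_config(tool_call: dict, config: dict, expected_args: dict) -> bool:
    """Return True if the call uses the right tool and its required args match."""
    if tool_call["name"] != config["tool_name"]:
        return False
    # Only arguments declared as required are compared; optional args may vary or be absent.
    for arg in config["required_args"]:
        if tool_call["args"].get(arg) != expected_args.get(arg):
            return False
    return True

# Example 2 above would now pass, since only selected_agent_type is required.
actual_call = {
    "name": "dynamic_send_agent_plan",
    "args": {
        "selected_agent_type": "gpt_image_agent",
        "plan_summary": "Generate image as requested",
    },
}
assert call_matches_config(actual_call, test_config, {"selected_agent_type": "gpt_image_agent"})
```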
### Real-World Use Cases

- **Agent Type Validation**: Often we only care that the correct agent type was selected, not the exact wording of the plan summary
- **Step Type Verification**: We want to verify the correct sequence of step types without worrying about specific descriptions
- **Tool Usage Patterns**: Testing that tools are called in the right order with the right general parameters
- **LLM Output Variations**: Different LLM runs may produce slightly different but functionally equivalent arguments
### Benefits

- **More Practical Testing**: Allows focusing on the important aspects of tool usage
- **Reduced Test Brittleness**: Tests won't break due to minor LLM output variations
- **Better Developer Experience**: More realistic evaluation scores that reflect actual agent performance
- **Flexible Test Design**: Different evaluation criteria for different testing scenarios
### Backward Compatibility

This can be implemented as an opt-in feature, maintaining exact matching as the default behavior to ensure no breaking changes.
### Example Implementation

Here's a potential API design:

```python
# Current usage (unchanged)
evaluator = TrajectoryEvaluator(threshold=0.8)

# New flexible options
evaluator = TrajectoryEvaluator(
    threshold=0.8,
    evaluation_mode=ToolEvaluationMode.PARTIAL_ARGS,
)

# Or with argument specification
evaluator = TrajectoryEvaluator(
    threshold=0.8,
    required_args_only=True,
    fuzzy_string_matching=True,
)
```
This enhancement would make tool trajectory evaluation much more practical for real-world agent development and testing scenarios.