Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conver… #47462

Open
mmkawale wants to merge 1 commit into main from mk/enable-tool-evals

Conversation

@mmkawale

Contributor

…sations and add [STATUS] pass-through for ToolCallSuccess

Three evaluators in azure-ai-evaluation previously rejected any conversation containing a built-in restricted tool (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding). Two of those evaluators -- ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator -- only judge the agent's tool selection and input arguments and do not need the (redacted) tool output body, so the rejection was overly conservative. This change enables both on restricted-tool conversations. _ToolCallSuccessEvaluator continues to reject them because its rubric inspects the tool output body, but it gains a new mechanism -- [STATUS] pass-through -- so the LLM judge can correctly recognize runtime-reported failures on conversations that do reach it.

Changes

ToolCallAccuracy / ToolInputAccuracy:

  • Set check_for_unsupported_tools=False on the input validator in _tool_call_accuracy.py and _tool_input_accuracy.py. The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.
  • Export _ToolInputAccuracyEvaluator from the azure.ai.evaluation top-level namespace, matching its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator). Consumers (notably the Foundry evaluations service catalog) can now import it directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.
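
The effect of the flag can be pictured with a small stand-alone sketch. This is not the SDK's validator code: the class name `SketchToolCallsValidator`, the `validate` method, the `ValueError`, and the shape of the tool-call dictionaries are assumptions made for illustration. Only the `check_for_unsupported_tools` parameter name and the five restricted tool names come from the description above.

```python
from typing import Any, Dict, List

# Restricted built-in tools named in the PR description.
RESTRICTED_TOOLS = frozenset(
    {
        "bing_grounding",
        "bing_custom_search",
        "azure_ai_search",
        "azure_fabric",
        "sharepoint_grounding",
    }
)


class SketchToolCallsValidator:
    """Hypothetical stand-in for the SDK validators; not the real class."""

    def __init__(self, check_for_unsupported_tools: bool = True) -> None:
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate(self, tool_calls: List[Dict[str, Any]]) -> bool:
        # Reject restricted tools only when the check is enabled.
        if self.check_for_unsupported_tools:
            for call in tool_calls:
                if call.get("name") in RESTRICTED_TOOLS:
                    raise ValueError(f"Unsupported tool: {call.get('name')}")
        return True


calls = [{"name": "bing_grounding", "arguments": {"query": "weather"}}]
lenient = SketchToolCallsValidator(check_for_unsupported_tools=False)
strict = SketchToolCallsValidator()  # default keeps the rejection
```

In this sketch `lenient.validate(calls)` accepts the conversation while `strict.validate(calls)` raises, which mirrors the split described above: the two accuracy evaluators opt out of the check, and evaluators that need the tool output body keep it.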

ToolCallSuccess -- [STATUS] pass-through:

  • Added _format_status_suffix helper and wired it into _get_tool_calls_results so every [TOOL_CALL] / [TOOL_RESULT] line carries a [STATUS] <value> suffix when the source content block has a status field. Back-compat preserved: empty/None/non-string status emits the empty string, so output is byte-identical to the prior format when status is absent.
  • Prompty: added an ERROR-CASES bullet that names [STATUS] failed and [STATUS] incomplete as authoritative failure signals that override bland payload appearance, with two illustrative examples (bland-payload+failed-status and completed-status+error-payload). The bullet matches the Responses-API tool-call status enum (in_progress | completed | incomplete | failed) -- only 'failed' and 'incomplete' are listed as primary values because no current emitter (Responses API, Threads/v1 Agents, ACA trace converter, tool-server gRPC) produces error/cancelled/canceled on a tool_call block. The _format_status_suffix helper remains permissive (any non-empty string) for forward-compat; only the rubric wording is narrowed.
  • Prompty: added an explicit clause that [STATUS] is optional and that [STATUS] completed does not by itself imply success -- payload-based rules still apply.
  • Prompty: fixed invalid trailing commas in every few-shot EXAMPLE OUTPUT. Each example had a trailing comma after the only failed_tools field of properties, producing invalid JSON. Under gpt-4o + response_format=json_object this caused the model to disambiguate the trailing comma by nesting score/status inside properties (a syntactically-valid alternative), which broke the SDK's top-level score extractor and silently flipped passing evaluations to fail. Validated end-to-end on a SharePoint-grounded transcript: with the commas stripped, gpt-4o reliably emits the canonical shape with score/status as siblings of properties, and pass/fail rows are classified correctly.
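
The pass-through rules above can be sketched as follows. This is a minimal re-creation from the description, not the code in the PR: the function names drop the leading underscore to mark them as stand-ins, and the leading space before `[STATUS]`, the `[TOOL_CALL]` line layout, and the content-block keys are assumptions.

```python
from typing import Any, Dict


def format_status_suffix(status: Any) -> str:
    """Return ' [STATUS] <value>' for a non-empty string status, else ''."""
    # Empty, None, and non-string values all map to the empty string so the
    # formatted line is unchanged when no usable status is present.
    if not isinstance(status, str) or not status:
        return ""
    return f" [STATUS] {status}"


def format_tool_call_line(block: Dict[str, Any]) -> str:
    # Hypothetical line layout; the real formatter's prefix may differ.
    base = f"[TOOL_CALL] {block['name']}({block.get('arguments', {})})"
    return base + format_status_suffix(block.get("status"))


without_status = format_tool_call_line({"name": "search", "arguments": {"q": "x"}})
with_status = format_tool_call_line(
    {"name": "search", "arguments": {"q": "x"}, "status": "failed"}
)
```

Keeping the helper permissive (any non-empty string passes through) while narrowing only the rubric wording means a future status value would reach the judge without a formatter change.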

Tests:

  • New test_unsupported_tools_validation.py (26 tests): 15 parametrized cases (3 evaluators x 5 restricted tools) asserting validate_eval_input returns True for response= payloads, 1 mixed-tools case, 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True.
  • Replaced test_tool_call_success_evaluator.py with status-passthrough coverage (12 tests on _format_status_suffix and _get_tool_calls_results topologies).
  • One test was flipped from test_tool_call_success_accepts_restricted_tool to test_tool_call_success_still_rejects_restricted_tool in test_unsupported_tools_validation.py, with the module docstring scope narrowed to TCA/TIA only.

Versioning:

  • Bumped _version.py 1.17.0 -> 1.17.1.
  • Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added covering TCA/TIA enablement on restricted-tool conversations and TCS [STATUS] pass-through.

All 38 impacted unit tests pass.

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings June 11, 2026 17:43
mmkawale requested a review from a team as a code owner June 11, 2026 17:43
github-actions Bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Jun 11, 2026

Copilot AI left a comment

Pull request overview

This PR updates azure-ai-evaluation tool evaluators to (1) allow ToolCallAccuracy and ToolInputAccuracy to run on conversations that include restricted built-in tools (since they don’t require tool output bodies), and (2) improve ToolCallSuccess grading by passing runtime tool-call status through into the rubric via [STATUS] ... annotations. It also exposes _ToolInputAccuracyEvaluator from the top-level package namespace, adds/updates unit tests, and bumps the package version.

Changes:

  • Lifted restricted-tool validation for ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator by disabling unsupported-tool checks in their validators.
  • Added [STATUS] <value> suffix pass-through for ToolCallSuccess’s formatted [TOOL_CALL] / [TOOL_RESULT] lines and updated the prompty rubric/examples accordingly.
  • Exported _ToolInputAccuracyEvaluator from azure.ai.evaluation, added targeted unit tests, and bumped version/changelog.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py` | New regression tests covering restricted-tool acceptance for TCA/TIA and continued rejection for TCS, plus validator-level regression. |
| `sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_success_evaluator.py` | New unit tests covering _format_status_suffix and [STATUS] emission topology in _get_tool_calls_results. |
| `sdk/evaluation/azure-ai-evaluation/CHANGELOG.md` | Adds 1.17.1 (Unreleased) entry documenting restricted-tool enablement, status pass-through, and export change. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py` | Version bump to 1.17.1. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py` | Disables unsupported-tool validation for ToolInputAccuracy evaluator inputs. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/tool_call_success.prompty` | Updates rubric to account for [STATUS] and fixes JSON example formatting (trailing commas). |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py` | Implements _format_status_suffix and appends status suffix to formatted tool call/result lines. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py` | Disables unsupported-tool validation for ToolCallAccuracy evaluator inputs. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py` | Exports _ToolInputAccuracyEvaluator and adds it to `__all__`. |
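
The trailing-comma fix to tool_call_success.prompty listed in the file summary can be illustrated with a short sketch. The field values and the `extract_top_level_score` helper are hypothetical; only the field names (`properties`, `failed_tools`, `score`, `status`) and the two output shapes come from the PR description. The sketch shows why a trailing comma makes an example invalid JSON, and why an extractor that reads only the top level finds no score when the model nests it inside `properties`.

```python
import json
from typing import Any, Dict, Optional

# Illustrative shape of a few-shot example before the fix: a trailing comma
# after the only field of "properties". Values are made up.
broken_example = '{"properties": {"failed_tools": "",}, "score": 1, "status": "pass"}'


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def extract_top_level_score(payload: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical extractor that reads 'score' only at the top level."""
    return payload.get("score")


# Canonical shape: score/status are siblings of properties.
canonical = json.loads(
    '{"properties": {"failed_tools": ""}, "score": 1, "status": "pass"}'
)
# Alternative shape described in the PR: score/status nested in properties.
nested = json.loads(
    '{"properties": {"failed_tools": "", "score": 1, "status": "pass"}}'
)
```

Here `is_valid_json(broken_example)` is false, `extract_top_level_score(canonical)` returns the score, and `extract_top_level_score(nested)` returns `None`, which is the kind of miss that would be read as a failing evaluation.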

Comment on lines +76 to +80
@pytest.mark.usefixtures("mock_model_config")
@pytest.mark.unittest
class TestRestrictedToolValidationLifted:
    """Validator should no longer reject restricted tools for these three evaluators."""

Comment on lines +30 to +32
from azure.ai.evaluation import ToolCallAccuracyEvaluator
from azure.ai.evaluation._evaluators._tool_call_success import _ToolCallSuccessEvaluator
from azure.ai.evaluation._evaluators._tool_input_accuracy import _ToolInputAccuracyEvaluator

Comment on lines +8 to +11
content block carries a ``status`` field. The prompty rubric is taught to treat
these annotations as a strong (authoritative) failure signal when the status is
in {failed, error, incomplete, cancelled, canceled}, and to fall back to
payload-only judgment when ``status`` is absent.
