Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conver…#47462
Open
mmkawale wants to merge 1 commit into
…sations and add [STATUS] pass-through for ToolCallSuccess

Three evaluators in azure-ai-evaluation previously rejected any conversation containing a built-in restricted tool (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding). Two of those evaluators -- ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator -- only judge the agent's tool selection and input arguments and do not need the (redacted) tool output body, so the rejection was overly conservative. This change enables both on restricted-tool conversations. _ToolCallSuccessEvaluator continues to reject them because its rubric inspects the tool output body, but it gains a new mechanism -- [STATUS] pass-through -- so the LLM judge can correctly recognize runtime-reported failures on conversations that *do* reach it.

Changes
-------

ToolCallAccuracy / ToolInputAccuracy:
- Set check_for_unsupported_tools=False on the input validator in _tool_call_accuracy.py and _tool_input_accuracy.py. The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.
- Export _ToolInputAccuracyEvaluator from the azure.ai.evaluation top-level namespace, matching its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator). Consumers (notably the Foundry evaluations service catalog) can now import it directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.

ToolCallSuccess -- [STATUS] pass-through:
- Added _format_status_suffix helper and wired it into _get_tool_calls_results so every [TOOL_CALL] / [TOOL_RESULT] line carries a [STATUS] <value> suffix when the source content block has a status field. Back-compat preserved: empty/None/non-string status emits the empty string, so output is byte-identical to the prior format when status is absent.
- Prompty: added an ERROR-CASES bullet that names [STATUS] failed and [STATUS] incomplete as authoritative failure signals that override bland payload appearance, with two illustrative examples (bland-payload+failed-status and completed-status+error-payload). The bullet matches the Responses-API tool-call status enum (in_progress | completed | incomplete | failed) -- only 'failed' and 'incomplete' are listed as primary values because no current emitter (Responses API, Threads/v1 Agents, ACA trace converter, tool-server gRPC) produces error/cancelled/canceled on a tool_call block. The _format_status_suffix helper remains permissive (any non-empty string) for forward-compat; only the rubric wording is narrowed.
- Prompty: added an explicit clause that [STATUS] is optional and that [STATUS] completed does not by itself imply success -- payload-based rules still apply.
- Prompty: fixed invalid trailing commas in every few-shot EXAMPLE OUTPUT. Each example had a trailing comma after the only failed_tools field of properties, producing invalid JSON. Under gpt-4o + response_format=json_object this caused the model to disambiguate the trailing comma by nesting score/status inside properties (a syntactically-valid alternative), which broke the SDK's top-level score extractor and silently flipped passing evaluations to fail. Validated end-to-end on a SharePoint-grounded transcript: with the commas stripped, gpt-4o reliably emits the canonical shape with score/status as siblings of properties, and pass/fail rows are classified correctly.

Tests:
- New test_unsupported_tools_validation.py (26 tests): 15 parametrized cases (3 evaluators x 5 restricted tools) asserting validate_eval_input returns True for response= payloads, 1 mixed-tools case, 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True.
- Replaced test_tool_call_success_evaluator.py with status-passthrough coverage (12 tests on _format_status_suffix and _get_tool_calls_results topologies).
- One test was flipped from test_tool_call_success_accepts_restricted_tool to test_tool_call_success_still_rejects_restricted_tool in test_unsupported_tools_validation.py, with the module docstring scope narrowed to TCA/TIA only.

Versioning:
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added covering TCA/TIA enablement on restricted-tool conversations and TCS [STATUS] pass-through.

All 38 impacted unit tests pass.
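The back-compat rule for the status suffix can be sketched as follows. This is a minimal illustration, not the SDK's code: the function names, the leading space before `[STATUS]`, and the shape of the content block (`name`, `arguments`, `status` keys) are assumptions made for the example; only the rule "empty/None/non-string status emits the empty string" comes from the description above.

```python
from typing import Any, Mapping


def format_status_suffix(status: Any) -> str:
    """Return a ' [STATUS] <value>' suffix for a non-empty string, else ''.

    Hypothetical stand-in for the helper described in the PR; the exact
    spacing of the real suffix is an assumption.
    """
    if not isinstance(status, str) or not status:
        return ""
    return f" [STATUS] {status}"


def format_tool_call_line(block: Mapping[str, Any]) -> str:
    """Render one (hypothetical) tool-call content block as a [TOOL_CALL] line."""
    name = block.get("name", "unknown_tool")
    arguments = block.get("arguments", {})
    suffix = format_status_suffix(block.get("status"))
    return f"[TOOL_CALL] {name}({arguments}){suffix}"
```

Because the suffix is empty whenever `status` is missing or not a string, a block without a status renders exactly as it would without the helper, which is the byte-identical property the description relies on.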
Pull request overview
This PR updates azure-ai-evaluation tool evaluators to (1) allow ToolCallAccuracy and ToolInputAccuracy to run on conversations that include restricted built-in tools (since they don’t require tool output bodies), and (2) improve ToolCallSuccess grading by passing runtime tool-call status through into the rubric via [STATUS] ... annotations. It also exposes _ToolInputAccuracyEvaluator from the top-level package namespace, adds/updates unit tests, and bumps the package version.
Changes:
- Lifted restricted-tool validation for `ToolCallAccuracyEvaluator` and `_ToolInputAccuracyEvaluator` by disabling unsupported-tool checks in their validators.
- Added `[STATUS] <value>` suffix pass-through for ToolCallSuccess’s formatted `[TOOL_CALL]` / `[TOOL_RESULT]` lines and updated the prompty rubric/examples accordingly.
- Exported `_ToolInputAccuracyEvaluator` from `azure.ai.evaluation`, added targeted unit tests, and bumped version/changelog.
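To make the first bullet concrete, here is a rough sketch of how a flag such as `check_for_unsupported_tools` might gate restricted-tool rejection. The class `SketchToolValidator`, its `validate` method, and the `type` key on each tool call are hypothetical and do not reflect the real `ToolCallsValidator` API; only the flag name and the list of restricted tools are taken from the PR description.

```python
from typing import Any, Dict, List

# Restricted built-in tool types named in the PR description.
RESTRICTED_TOOL_TYPES = frozenset(
    {
        "bing_grounding",
        "bing_custom_search",
        "azure_ai_search",
        "azure_fabric",
        "sharepoint_grounding",
    }
)


class SketchToolValidator:
    """Hypothetical stand-in for illustration; not the SDK's validator."""

    def __init__(self, check_for_unsupported_tools: bool = True) -> None:
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate(self, tool_calls: List[Dict[str, Any]]) -> bool:
        # Only reject restricted tools when the check is switched on.
        if self.check_for_unsupported_tools:
            for call in tool_calls:
                if call.get("type") in RESTRICTED_TOOL_TYPES:
                    raise ValueError(f"Unsupported restricted tool: {call['type']}")
        return True
```

Under this sketch, an evaluator that only inspects tool selection and arguments would construct the validator with the flag set to `False`, while one that needs the tool output body would keep the default.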
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py | New regression tests covering restricted-tool acceptance for TCA/TIA and continued rejection for TCS, plus validator-level regression. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_success_evaluator.py | New unit tests covering _format_status_suffix and [STATUS] emission topology in _get_tool_calls_results. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Adds 1.17.1 (Unreleased) entry documenting restricted-tool enablement, status pass-through, and export change. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Version bump to 1.17.1. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Disables unsupported-tool validation for ToolInputAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/tool_call_success.prompty | Updates rubric to account for [STATUS] and fixes JSON example formatting (trailing commas). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py | Implements _format_status_suffix and appends status suffix to formatted tool call/result lines. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Disables unsupported-tool validation for ToolCallAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py | Exports _ToolInputAccuracyEvaluator and adds it to `__all__`. |
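The trailing-comma fix in tool_call_success.prompty is easier to follow with a small reproduction. The sketch below uses illustrative payloads, not the actual few-shot examples: the field values and the `top_level_score` helper are assumptions, chosen only to show why a trailing comma is invalid JSON and why a nested `score` is invisible to an extractor that reads the top level.

```python
import json
from typing import Any, Dict, Optional


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON under Python's strict parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def top_level_score(payload: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical extractor that only looks for 'score' at the top level."""
    return payload.get("score")


# Illustrative example with a trailing comma after the only field of properties.
with_trailing_comma = '{"properties": {"failed_tools": "search",}}'

# Canonical shape: score/status are siblings of properties.
canonical = json.loads(
    '{"score": 1, "status": "pass", "properties": {"failed_tools": ""}}'
)

# Nested shape: score/status end up inside properties.
nested = json.loads(
    '{"properties": {"failed_tools": "", "score": 1, "status": "pass"}}'
)
```

Here `is_valid_json(with_trailing_comma)` is false, `top_level_score(canonical)` finds the score, and `top_level_score(nested)` returns `None`, which is the kind of miss that would cause a passing evaluation to be recorded as a failure.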
Comment on lines +76 to +80

    @pytest.mark.usefixtures("mock_model_config")
    @pytest.mark.unittest
    class TestRestrictedToolValidationLifted:
        """Validator should no longer reject restricted tools for these three evaluators."""

Comment on lines +30 to +32

    from azure.ai.evaluation import ToolCallAccuracyEvaluator
    from azure.ai.evaluation._evaluators._tool_call_success import _ToolCallSuccessEvaluator
    from azure.ai.evaluation._evaluators._tool_input_accuracy import _ToolInputAccuracyEvaluator

Comment on lines +8 to +11

    content block carries a ``status`` field. The prompty rubric is taught to treat
    these annotations as a strong (authoritative) failure signal when the status is
    in {failed, error, incomplete, cancelled, canceled}, and to fall back to
    payload-only judgment when ``status`` is absent.

posaninagendra approved these changes Jun 11, 2026