update sdk with new adds #467

Open
luke-e-schaefer wants to merge 5 commits into master from update-nuc-sdk-for-new-eval-stuff-pt1

luke-e-schaefer (Contributor) commented Jun 25, 2026

Added

  • Evaluations V2 slice scoping and exclusion rules. create_evaluation_v2() accepts slice_id (restrict the evaluation to a slice's items) and exclusion_rules (drop items/annotations before metrics are computed) via the new MetadataExclusionRule, LabelExclusionRule, and BoxAreaExclusionRule types (or equivalent dicts). The EvaluationV2 resource exposes slice_id, exclusion_rules, and exclusion_stats. EvaluationV2FilterArgs gains gt_area_range (filter by ground-truth box area, e.g. COCO small/medium/large bands) and slice_ids, applied by both charts() and examples().
  • Evaluation V2 presets. Save and reuse evaluation configurations (name + allowed_label_matches + exclusion_rules) via NucleusClient.list_evaluation_v2_presets(), create_evaluation_v2_preset(), update_evaluation_v2_preset(), and delete_evaluation_v2_preset(), plus the new EvaluationV2Preset resource (with update() / delete()). Apply a preset directly when creating an evaluation: create_evaluation_v2(model_run_id, preset=preset) seeds the matches and rules (explicit arguments override the preset).
  • create_evaluation_v2() accepts only_items_with_predictions to restrict the evaluation to items that have at least one prediction.
  • Batch create. create_evaluations_v2_batch() creates one evaluation per (model_run_id, slice_id) pair with a shared configuration, running concurrently and returning a BatchEvaluationResult per job (capturing the created evaluation or the per-job error).
  • Cancel & retry. EvaluationV2.cancel() stops a running evaluation; EvaluationV2.retry() re-runs a failed one, reusing its slice/matches/exclusion rules.
  • Dataset.evaluation_label_schema() returns the dataset's ground-truth and prediction label vocabularies (gt_labels / prediction_labels) for building label matches and label exclusion rules.
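
For reviewers, the batch behavior described above (one job per pair, run concurrently, with each job's error captured instead of raised) can be sketched roughly as follows. This is not the SDK implementation: `BatchResult`, `create_batch`, and `fake_create` are hypothetical stand-ins, and expanding the pairs with a cross-product is an assumption based on the T-Rex log's wording.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass
class BatchResult:
    """Hypothetical stand-in for BatchEvaluationResult (not the SDK class)."""

    model_run_id: str
    slice_id: str
    evaluation: Optional[Any] = None
    error: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.error is None


def create_batch(
    create_fn: Callable[[str, str], Any],
    model_run_ids: Sequence[str],
    slice_ids: Sequence[str],
    max_workers: int = 4,
) -> List[BatchResult]:
    """Call create_fn once per (model_run_id, slice_id) pair, concurrently."""

    def run_one(pair: Tuple[str, str]) -> BatchResult:
        model_run_id, slice_id = pair
        try:
            created = create_fn(model_run_id, slice_id)
            return BatchResult(model_run_id, slice_id, evaluation=created)
        except Exception as exc:
            # Record the failure on this job only; the other jobs keep running.
            return BatchResult(model_run_id, slice_id, error=str(exc))

    pairs = list(product(model_run_ids, slice_ids))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input pairs.
        return list(pool.map(run_one, pairs))


def fake_create(model_run_id: str, slice_id: str) -> dict:
    """Hypothetical creator that fails for one slice, to show error capture."""
    if slice_id == "slc_bad":
        raise ValueError("slice not found")
    return {"model_run_id": model_run_id, "slice_id": slice_id}


results = create_batch(fake_create, ["run_a", "run_b"], ["slc_ok", "slc_bad"])
for r in results:
    print(r.model_run_id, r.slice_id, "ok" if r.succeeded else f"error: {r.error}")
```

Capturing the exception per job means one bad slice does not discard the evaluations that were created successfully.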

Changed

  • EvaluationV2.examples() now treats match_type as optional — omit it to return examples of all match types.
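
A minimal sketch of what "optional" could mean on the request side: the key is left out of the body entirely when no match type is given. The function name, the `limit` field, and the `"false_positive"` value are hypothetical and are not taken from the SDK.

```python
from typing import Any, Dict, Optional


def build_examples_body(match_type: Optional[str] = None, limit: int = 50) -> Dict[str, Any]:
    """Build a request body, omitting match_type when the caller did not set it."""
    body: Dict[str, Any] = {"limit": limit}  # hypothetical field
    if match_type is not None:
        body["match_type"] = match_type
    return body


all_types_body = build_examples_body()
one_type_body = build_examples_body("false_positive")  # hypothetical value
```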

Fixed

  • EvaluationV2.charts() issues a POST (matching the backend route) instead of a GET with a query string, which did not reach the server.
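
To illustrate the difference between the two request shapes, here is a sketch using only the standard library. It builds the requests without sending them. The base URL, the evaluation id, and the filter values are placeholders; the `evaluationsV2/{id}/charts` path is the one quoted in the T-Rex log, and the way `gt_area_range` is encoded here is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.invalid/"  # placeholder host, never contacted
evaluation_id = "eval_123"  # hypothetical id
filters = {"slice_ids": ["slc_1"], "gt_area_range": [0, 1024]}  # assumed encoding

# Old shape: filters flattened into the query string of a GET.
get_req = Request(
    f"{BASE_URL}evaluationsV2/{evaluation_id}/charts?{urlencode(filters, doseq=True)}"
)

# New shape: filters sent as a JSON body on a POST.
post_req = Request(
    f"{BASE_URL}evaluationsV2/{evaluation_id}/charts",
    data=json.dumps(filters).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_req.get_method(), post_req.get_method())
```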

Greptile Summary

This PR expands Evaluation V2 support in the Python SDK. The main changes are:

  • Adds slice scoping, exclusion rules, and prediction-only evaluation creation options.
  • Adds Evaluation V2 preset CRUD helpers and preset-based evaluation creation.
  • Adds batch Evaluation V2 creation, cancel, retry, and label schema helpers.
  • Updates Evaluation V2 charts to use POST and examples to allow all match types.
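
On preset-based creation: the PR description says a preset seeds the matches and rules and that explicit arguments override it. One way to read that is sketched below. `Preset` and `resolve_config` are hypothetical, the dict shapes for matches and rules are invented for illustration, and whether the SDK overrides per field (as here) or in some other way is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Preset:
    """Hypothetical stand-in for EvaluationV2Preset (not the SDK class)."""

    name: str
    allowed_label_matches: List[Dict[str, Any]] = field(default_factory=list)
    exclusion_rules: List[Dict[str, Any]] = field(default_factory=list)


def resolve_config(
    preset: Optional[Preset] = None,
    allowed_label_matches: Optional[List[Dict[str, Any]]] = None,
    exclusion_rules: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """Start from the preset, then let each explicit argument replace its field."""
    config: Dict[str, List[Dict[str, Any]]] = {
        "allowed_label_matches": [],
        "exclusion_rules": [],
    }
    if preset is not None:
        config["allowed_label_matches"] = list(preset.allowed_label_matches)
        config["exclusion_rules"] = list(preset.exclusion_rules)
    if allowed_label_matches is not None:
        config["allowed_label_matches"] = list(allowed_label_matches)
    if exclusion_rules is not None:
        config["exclusion_rules"] = list(exclusion_rules)
    return config


preset = Preset(
    name="default",
    allowed_label_matches=[{"gt": "car", "pred": "vehicle"}],  # invented shape
    exclusion_rules=[{"type": "label", "labels": ["ignore"]}],  # invented shape
)
seeded = resolve_config(preset)
overridden = resolve_config(preset, exclusion_rules=[])
```

Using `None` as the "not provided" sentinel lets a caller pass an empty list to clear the preset's rules on purpose.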

Confidence Score: 5/5

The SDK changes are merge-safe based on the reviewed API surface and tests.

The implementation is covered by targeted Evaluation V2 and preset tests, and no blocking correctness issues were identified.

T-Rex Logs

What T-Rex did

  • Baseline state showed charts using GET query strings and missing newer preset/batch/dataset surfaces.
  • Head state after the change shows POST-based evaluationsV2/{id}/charts, examples POST bodies, POST cancel/retry routes, evaluationV2Presets GET/POST/PATCH/DELETE routes, dataset/{id}/labelSchema GET-equivalent call, and batch cross-product result capture.
  • Baseline state lacked the five new imports and nucleus.__all__ omitted them.
  • Head state now has import_ok: true, all requested names exported in __all__, serialized exclusion rules, succeeded states with with_evaluation: true and with_error: false, and parsed presets for both camelCase and snake_case payloads; the test script used for both runs is saved as an artifact.
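
The log mentions presets being parsed from both camelCase and snake_case payloads. A common way to tolerate both is to normalize keys before reading them, roughly as below. This is a sketch and not the SDK's parser: the helper names are hypothetical, and only top-level keys are converted.

```python
import re
from typing import Any, Dict


def camel_to_snake(name: str) -> str:
    """Convert camelCase to snake_case; snake_case input passes through unchanged."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_keys(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of payload with top-level keys in snake_case."""
    return {camel_to_snake(key): value for key, value in payload.items()}


camel_payload = {"name": "default", "allowedLabelMatches": [], "exclusionRules": []}
snake_payload = {"name": "default", "allowed_label_matches": [], "exclusion_rules": []}

print(normalize_keys(camel_payload) == normalize_keys(snake_payload))
```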

Ran code and verified through T-Rex

Reviews (4): Last reviewed commit: "greptile"

luke-e-schaefer requested a review from edwinpav June 25, 2026 21:41
luke-e-schaefer self-assigned this Jun 25, 2026
luke-e-schaefer requested a review from vinay553 June 25, 2026 21:42
Comment thread nucleus/evaluation_v2_preset.py Outdated
Comment thread nucleus/__init__.py