[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) by varunursekar · Pull Request #4 · scaleapi/vero

varunursekar · 2026-06-24T18:12:23Z

Draft · Stack 2 of 4 — targets harbor-1-core (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.

EvaluationSidecar (server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.
Verifier (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
Admin token (auth.py): per-trial, written root:600 — unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
FastAPI (app.py): /eval /submit /status (agent; metered, redacted) and /finalize (verifier; token-gated). vero harbor serve assembles engine+sidecar+verifier from a ServeConfig.
CLI clients (cli.py) + HarborConfig/partition helpers so the package imports cleanly (build/run light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the Harbor eval sidecar and verifier path. The main changes are:

New FastAPI endpoints for eval, submit, status, health, and finalize.
Admin bearer-token helpers for verifier-only finalize access.
Sidecar commit transfer from the agent repo and tier-based result routing.
Verifier commit selection and hidden-split reward scoring.
Harbor config, dataset partition helpers, CLI clients, and coverage tests.

Confidence Score: 4/5

Several correctness issues in result isolation, commit selection, configuration validation, and startup behavior should be addressed before merging.

The changed code is covered by focused tests and the review identified concrete runtime-impacting paths with reproducible behavior, while the remaining surface is relatively contained to the Harbor sidecar and verifier components.

vero/src/vero/harbor/server.py, vero/src/vero/harbor/verifier.py, and vero/src/vero/harbor/serve.py

T-Rex Logs

What T-Rex did

- Reproduced stale Harbor sidecar results behavior by running the harness twice on the same split/commit, verifying the second run reused summary.json with n_samples=1 and left stale per-sample files from the first run.
- Reproduced the cross-dataset selection path with the focused verifier harness, observing target_ds 0.51 vs other_ds 0.99 and auto_best choosing other_ds commit.
- Reproduced reward_mode handling in the repro, demonstrating that reward_mode='submitted' is accepted, submission.json is used for selection, and no candidate results were found when experiments are missing.
- Reproduced Guard Mode B startup path and observed failure due to ModuleNotFoundError: No module named 'vero.harbor.runner'.
- Reproduced sidecar trust routing pre/post captures, including a base state with ModuleNotFoundError and an after state with real git fetch and expected routing outputs.
- Inspected token-based access and endpoint behavior under trust routing, verifying token mode configuration, token round-trip, and unauthenticated vs. authenticated endpoint results.
- Reproduced verifier selection harness results, showing before lack of vero.harbor import and after submission results with NoCandidateError paths using a fake DB and evaluator.
- Reproduced mode-a serve assembly flow, with before showing missing harbor import and after demonstrating successful imports, JSON parsing, dataset construction, and serve app creation.

_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/harbor/server.py:164-165
**Avoid stale results**

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.

### Issue 2 of 3
vero/src/vero/harbor/verifier.py:103-113
**Filter selection dataset**

`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

### Issue 3 of 3
vero/src/vero/harbor/serve.py:63
**Validate reward mode**

`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.

_{Reviews (1): Last reviewed commit: "Harbor eval sidecar: verifier, token aut..." | Re-trigger Greptile}

Greptile also left 3 inline comments on this PR.

The evaluation engine, run in a sidecar container, with the trust boundary that makes an optimization run leaderboard-gradeable: - `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing of results across the agent-readable and admin volumes. - `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring. - Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged) cannot read it; only the verifier (root, shared mode) can. - FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted); /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the engine + sidecar + verifier from a ServeConfig and runs it under uvicorn. - `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py, dataset.py) are included so the harbor package imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-24T18:28:11Z

+        commit = experiment.run.candidate.commit
+        dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"


Avoid stale results

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write 1.json and 2.json; a later sample_ids=[0] eval overwrites summary.json with n_samples=1 but leaves the old per-sample files in place. The returned result_path can then expose stale files that were not part of the current metered run.

Artifacts

Repro: generated Harbor sidecar stale results harness

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness execution output showing stale per-sample files after second eval

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 164-165 Comment: **Avoid stale results** This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-24T18:28:12Z

+        split_df = df[df["dataset_subset_split"] == self.selection_split]
+        if self.base_commit is not None:
+            split_df = split_df[split_df["candidate_commit"] != self.base_commit]
+        if len(split_df) == 0:
+            raise NoCandidateError(
+                f"auto_best mode but no candidate experiments on split "
+                f"'{self.selection_split}'."
+            )
+        best = split_df.sort_values(
+            by=["mean_score", "candidate_created_at"], ascending=[False, False]
+        ).iloc[0]


Filter selection dataset

auto_best filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from other_ds/validation can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

Artifacts

Repro: focused verifier harness that seeds two dataset validation rows and asserts cross-dataset commit selection

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: command output showing seeded DB rows, selected other_ds commit, and target_ds hidden scoring call

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/verifier.py Line: 103-113 Comment: **Filter selection dataset** `auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-24T18:28:12Z

+    harbor: dict | None = None  # HarborConfig kwargs
+
+    # selection / reward
+    reward_mode: str = "auto_best"


Validate reward mode

reward_mode accepts any string here, and Verifier._select_commit() treats every value except exactly submit as auto_best. A typo such as submitted on a submit-based task will silently ignore submission.json and either select from prior experiments or fail with no auto-best candidates.

Artifacts

Repro: focused script exercising ServeConfig and Verifier reward_mode selection

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing invalid reward_mode accepted and treated as auto_best

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/serve.py Line: 63 Comment: **Validate reward mode** `reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates. How can I resolve this? If you propose a fix, please make it concise.

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:17

varunursekar marked this pull request as ready for review June 24, 2026 18:21

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
varunursekar wants to merge 1 commit into
harbor-1-corefrom
harbor-2-sidecar

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		commit = experiment.run.candidate.commit
		dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

T-Rex Logs

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading