Skip to content

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4

Open
varunursekar wants to merge 1 commit into
harbor-1-corefrom
harbor-2-sidecar
Open

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
varunursekar wants to merge 1 commit into
harbor-1-corefrom
harbor-2-sidecar

Conversation

@varunursekar

@varunursekar varunursekar commented Jun 24, 2026

Copy link
Copy Markdown

Draft · Stack 2 of 4 — targets harbor-1-core (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.

  • EvaluationSidecar (server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.
  • Verifier (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
  • Admin token (auth.py): per-trial, written root:600 — unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
  • FastAPI (app.py): /eval /submit /status (agent; metered, redacted) and /finalize (verifier; token-gated). vero harbor serve assembles engine+sidecar+verifier from a ServeConfig.
  • CLI clients (cli.py) + HarborConfig/partition helpers so the package imports cleanly (build/run light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the Harbor eval sidecar and verifier path. The main changes are:

  • New FastAPI endpoints for eval, submit, status, health, and finalize.
  • Admin bearer-token helpers for verifier-only finalize access.
  • Sidecar commit transfer from the agent repo and tier-based result routing.
  • Verifier commit selection and hidden-split reward scoring.
  • Harbor config, dataset partition helpers, CLI clients, and coverage tests.

Confidence Score: 4/5

Several correctness issues in result isolation, commit selection, configuration validation, and startup behavior should be addressed before merging.

The changed code is covered by focused tests and the review identified concrete runtime-impacting paths with reproducible behavior, while the remaining surface is relatively contained to the Harbor sidecar and verifier components.

vero/src/vero/harbor/server.py, vero/src/vero/harbor/verifier.py, and vero/src/vero/harbor/serve.py

T-Rex T-Rex Logs

What T-Rex did

    • Reproduced stale Harbor sidecar results behavior by running the harness twice on the same split/commit, verifying the second run reused summary.json with n_samples=1 and left stale per-sample files from the first run.
    • Reproduced the cross-dataset selection path with the focused verifier harness, observing target_ds 0.51 vs other_ds 0.99 and auto_best choosing other_ds commit.
    • Reproduced reward_mode handling in the repro, demonstrating that reward_mode='submitted' is accepted, submission.json is used for selection, and no candidate results were found when experiments are missing.
    • Reproduced Guard Mode B startup path and observed failure due to ModuleNotFoundError: No module named 'vero.harbor.runner'.
    • Reproduced sidecar trust routing pre/post captures, including a base state with ModuleNotFoundError and an after state with real git fetch and expected routing outputs.
    • Inspected token-based access and endpoint behavior under trust routing, verifying token mode configuration, token round-trip, and unauthenticated vs. authenticated endpoint results.
    • Reproduced verifier selection harness results, showing before lack of vero.harbor import and after submission results with NoCandidateError paths using a fake DB and evaluator.
    • Reproduced mode-a serve assembly flow, with before showing missing harbor import and after demonstrating successful imports, JSON parsing, dataset construction, and serve app creation.

View all artifacts

T-Rex Ran code and verified through T-Rex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/harbor/server.py:164-165
**Avoid stale results**

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.

### Issue 2 of 3
vero/src/vero/harbor/verifier.py:103-113
**Filter selection dataset**

`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

### Issue 3 of 3
vero/src/vero/harbor/serve.py:63
**Validate reward mode**

`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.

Reviews (1): Last reviewed commit: "Harbor eval sidecar: verifier, token aut..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

The evaluation engine, run in a sidecar container, with the trust boundary that
makes an optimization run leaderboard-gradeable:

- `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the
  untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing
  of results across the agent-readable and admin volumes.
- `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
- Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged)
  cannot read it; only the verifier (root, shared mode) can.
- FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted);
  /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the
  engine + sidecar + verifier from a ServeConfig and runs it under uvicorn.
- `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run
  land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py,
  dataset.py) are included so the harbor package imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:17
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:21
Comment on lines +164 to +165
commit = experiment.run.candidate.commit
dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Avoid stale results

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write 1.json and 2.json; a later sample_ids=[0] eval overwrites summary.json with n_samples=1 but leaves the old per-sample files in place. The returned result_path can then expose stale files that were not part of the current metered run.

Artifacts

Repro: generated Harbor sidecar stale results harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness execution output showing stale per-sample files after second eval

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 164-165

Comment:
**Avoid stale results**

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +103 to +113
split_df = df[df["dataset_subset_split"] == self.selection_split]
if self.base_commit is not None:
split_df = split_df[split_df["candidate_commit"] != self.base_commit]
if len(split_df) == 0:
raise NoCandidateError(
f"auto_best mode but no candidate experiments on split "
f"'{self.selection_split}'."
)
best = split_df.sort_values(
by=["mean_score", "candidate_created_at"], ascending=[False, False]
).iloc[0]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Filter selection dataset

auto_best filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from other_ds/validation can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

Artifacts

Repro: focused verifier harness that seeds two dataset validation rows and asserts cross-dataset commit selection

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: command output showing seeded DB rows, selected other_ds commit, and target_ds hidden scoring call

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 103-113

Comment:
**Filter selection dataset**

`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

harbor: dict | None = None # HarborConfig kwargs

# selection / reward
reward_mode: str = "auto_best"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Validate reward mode

reward_mode accepts any string here, and Verifier._select_commit() treats every value except exactly submit as auto_best. A typo such as submitted on a submit-based task will silently ignore submission.json and either select from prior experiments or fail with no auto-best candidates.

Artifacts

Repro: focused script exercising ServeConfig and Verifier reward_mode selection

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing invalid reward_mode accepted and treated as auto_best

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/serve.py
Line: 63

Comment:
**Validate reward mode**

`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant