[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) #4
varunursekar wants to merge 1 commit into
The evaluation engine, run in a sidecar container, with the trust boundary that makes an optimization run leaderboard-gradeable:

- `EvaluationSidecar` (server.py): agent-facing handlers for commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing of results across the agent-readable and admin volumes.
- `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
- Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged) cannot read it; only the verifier (root, shared mode) can.
- FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted); /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the engine + sidecar + verifier from a ServeConfig and runs it under uvicorn.
- `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run land with the compiler).

HarborConfig + the Mode-B dataset partition helpers (config.py, dataset.py) are included so the harbor package imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

    commit = experiment.run.candidate.commit
    dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.
Artifacts
Repro: generated Harbor sidecar stale results harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: harness execution output showing stale per-sample files after second eval
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Path: vero/src/vero/harbor/server.py
Line: 164-165
**Avoid stale results**
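One possible shape for a fix, sketched under assumptions: the directory name also encodes the sample set, and the directory is replaced wholesale on every write. The helper names `result_dir` and `write_results`, and the `per_sample`/`summary` payload shapes, are hypothetical and not taken from `server.py`.

```python
import hashlib
import json
import shutil
from pathlib import Path


def result_dir(agent_volume: Path, split: str, commit: str, sample_ids=None) -> Path:
    """Hypothetical helper: directory unique to (split, commit, sample set)."""
    if sample_ids is None:
        tag = "all"
    else:
        # Hash the sorted ids so the same subset always maps to the same name.
        digest = hashlib.sha256(json.dumps(sorted(sample_ids)).encode()).hexdigest()
        tag = f"ids-{digest[:8]}"
    return agent_volume / "results" / f"{split}__{commit[:12]}__{tag}"


def write_results(dest: Path, per_sample: dict, summary: dict) -> Path:
    """Hypothetical helper: replace dest so no file from an earlier run survives."""
    if dest.exists():
        shutil.rmtree(dest)
    dest.mkdir(parents=True)
    for sample_id, payload in per_sample.items():
        (dest / f"{sample_id}.json").write_text(json.dumps(payload))
    (dest / "summary.json").write_text(json.dumps(summary))
    return dest
```

Either measure alone would likely address the stale-file case; clearing the directory is the smaller change if the existing path layout needs to stay stable.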

    split_df = df[df["dataset_subset_split"] == self.selection_split]
    if self.base_commit is not None:
        split_df = split_df[split_df["candidate_commit"] != self.base_commit]
    if len(split_df) == 0:
        raise NoCandidateError(
            f"auto_best mode but no candidate experiments on split "
            f"'{self.selection_split}'."
        )
    best = split_df.sort_values(
        by=["mean_score", "candidate_created_at"], ascending=[False, False]
    ).iloc[0]

`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.
Artifacts
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Path: vero/src/vero/harbor/verifier.py
Line: 103-113
**Filter selection dataset**
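A minimal sketch of the suggested filter, assuming the experiments frame carries a column that identifies the dataset. The column name `dataset_name` and the function `pick_auto_best` are hypothetical; `NoCandidateError` here is a local stand-in for the verifier's own exception. The other column names are the ones in the quoted diff.

```python
import pandas as pd


class NoCandidateError(LookupError):
    """Local stand-in for the verifier's exception type."""


def pick_auto_best(df: pd.DataFrame, dataset: str, split: str, base_commit=None) -> str:
    # Restrict to the intended selection dataset AND split before ranking.
    # "dataset_name" is an assumed column; the real schema may differ.
    mask = (df["dataset_name"] == dataset) & (df["dataset_subset_split"] == split)
    if base_commit is not None:
        mask = mask & (df["candidate_commit"] != base_commit)
    candidates = df[mask]
    if candidates.empty:
        raise NoCandidateError(
            f"auto_best mode but no candidate experiments on '{dataset}/{split}'."
        )
    # Highest score first; ties go to the most recently created candidate.
    best = candidates.sort_values(
        by=["mean_score", "candidate_created_at"], ascending=[False, False]
    ).iloc[0]
    return best["candidate_commit"]
```

With this filter, a high-scoring run recorded under `other_ds/validation` no longer competes with candidates for the target dataset.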

    harbor: dict | None = None  # HarborConfig kwargs

    # selection / reward
    reward_mode: str = "auto_best"

`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.
Artifacts
Repro: focused script exercising ServeConfig and Verifier reward_mode selection
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: script output showing invalid reward_mode accepted and treated as auto_best
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Path: vero/src/vero/harbor/serve.py
Line: 63
**Validate reward mode**
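A concise way to reject unknown values at construction time, sketched on the assumption that `ServeConfig` is a plain dataclass (the diff does not show which base it uses). `ServeConfigSketch` and `VALID_REWARD_MODES` are illustrative names, and only the one field under discussion is included.

```python
from dataclasses import dataclass

# The two modes named in the PR description.
VALID_REWARD_MODES = ("submit", "auto_best")


@dataclass
class ServeConfigSketch:
    """Illustrative stand-in for ServeConfig; only reward_mode is modelled."""

    reward_mode: str = "auto_best"

    def __post_init__(self) -> None:
        # Fail fast on typos instead of silently falling back to auto_best.
        if self.reward_mode not in VALID_REWARD_MODES:
            raise ValueError(
                f"reward_mode must be one of {VALID_REWARD_MODES}, "
                f"got {self.reward_mode!r}"
            )
```

If the real config is a pydantic model instead, annotating the field as `Literal["submit", "auto_best"]` would give the same effect through the model's own validation.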
Draft · Stack 2 of 4: targets `harbor-1-core` (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR, so a focused security review is worthwhile here.

- `EvaluationSidecar` (`server.py`): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) plus tier-gated result write-routing across the agent-readable and admin volumes.
- `Verifier` (`verifier.py`): commit selection (`submit` | `auto_best`) plus hidden-split scoring.
- Per-trial admin token (`auth.py`), written `root:600`: unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
- FastAPI surface (`app.py`): `/eval`, `/submit`, `/status` (agent; metered, redacted) and `/finalize` (verifier; token-gated).
- `vero harbor serve` assembles engine + sidecar + verifier from a `ServeConfig`.
- `HarborConfig` and the partition helpers are included so the package imports cleanly (`build`/`run` light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.
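To make the token part of the trust boundary concrete, here is a rough sketch of the pattern the description names: an owner-only (mode 600) token file plus a constant-time check. It is not the code in `auth.py`; `write_admin_token` and `token_matches` are hypothetical names, and root ownership comes from running the writer as root inside the sidecar, which this snippet does not enforce.

```python
import hmac
import os
import secrets
from pathlib import Path


def write_admin_token(path: Path) -> str:
    """Create a fresh token file readable and writable by its owner only."""
    token = secrets.token_urlsafe(32)
    # O_EXCL refuses to reuse an existing file; 0o600 is applied at creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        os.fchmod(fh.fileno(), 0o600)  # enforce the mode regardless of umask (POSIX only)
        fh.write(token)
    return token


def token_matches(path: Path, presented: str) -> bool:
    """Compare in constant time so timing does not leak token contents."""
    expected = path.read_text().strip()
    return hmac.compare_digest(expected.encode(), presented.encode())
```

A token-gated handler such as `/finalize` would call something like `token_matches` on the presented credential before doing any work; a de-privileged process cannot read the file to obtain the token in the first place.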
🤖 Generated with Claude Code
Greptile Summary
This PR adds the Harbor eval sidecar and verifier path.
Confidence Score: 4/5
Several correctness issues in result isolation, commit selection, configuration validation, and startup behavior should be addressed before merging.
The changed code is covered by focused tests and the review identified concrete runtime-impacting paths with reproducible behavior, while the remaining surface is relatively contained to the Harbor sidecar and verifier components.
vero/src/vero/harbor/server.py, vero/src/vero/harbor/verifier.py, and vero/src/vero/harbor/serve.py
Reviews (1): Last reviewed commit: "Harbor eval sidecar: verifier, token aut..."