[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5
[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5varunursekar wants to merge 1 commit into
Conversation
- Mode B (runner.py): `HarborRunner`, an `EvalStrategy` that — for each candidate — runs a *nested* `harbor run` of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero `SampleResult`s. One Harbor task = one sample; inference is fully delegated, scoring comes from Harbor's verifier. - The compiler (build/): `vero harbor build` renders a `BuildConfig` into a runnable Harbor task directory — a Docker Compose environment (optimizer workbench `main` + the eval sidecar + three volumes), two Dockerfiles, instruction.md, tests/test.sh, and the seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Supports Mode A (local dataset/scorer) and Mode B (a registry or local Harbor benchmark, passed through to the HarborConfig). - `.gitignore`: un-ignore src/vero/harbor/build/ (the repo's `build/` rule was hiding the compiler package). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
||
| sessions = vero_home / "sessions" | ||
| datasets = vero_home / "datasets" | ||
| (sessions / SESSION_ID).mkdir(parents=True, exist_ok=True) | ||
| datasets.mkdir(parents=True, exist_ok=True) | ||
| if not isinstance(dataset, str): # a DatasetDict -> save_to_disk first | ||
| path = tmp / "ds" | ||
| dataset.save_to_disk(str(path)) | ||
| dataset = str(path) | ||
| return resolve_and_save_dataset(dataset, sessions, datasets, SESSION_ID) | ||
|
|
There was a problem hiding this comment.
git archive exit code is silently discarded
archive.wait() is called but its return value is not checked. If git archive fails (e.g., the rel subpath doesn't exist at HEAD, or the ref is invalid), tar receives an empty or truncated stream. Depending on the tar implementation, it may exit 0 with partial content, causing _prepare_baseline_repo to return a commit SHA from a near-empty directory and bake that corrupt baseline into both containers. Additionally, if subprocess.run(tar, ..., check=True) raises before reaching archive.wait(), the git archive subprocess is never reaped until GC collects the Popen object.
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/build/compiler.py
Line: 120-130
Comment:
**`git archive` exit code is silently discarded**
`archive.wait()` is called but its return value is not checked. If `git archive` fails (e.g., the `rel` subpath doesn't exist at `HEAD`, or the ref is invalid), `tar` receives an empty or truncated stream. Depending on the tar implementation, it may exit 0 with partial content, causing `_prepare_baseline_repo` to return a commit SHA from a near-empty directory and bake that corrupt baseline into both containers. Additionally, if `subprocess.run(tar, ..., check=True)` raises before reaching `archive.wait()`, the `git archive` subprocess is never reaped until GC collects the `Popen` object.
How can I resolve this? If you propose a fix, please make it concise.| name=config.name, | ||
| description=config.description, | ||
| mode=config.mode, |
There was a problem hiding this comment.
Temp directory created by
mkdtemp() is never cleaned up
The directory created by tempfile.mkdtemp() on line 239 is used only to stage the dataset before registration, but is never deleted — not on success and not on exception. Over repeated builds this accumulates large scratch directories (the dataset can be gigabytes). Wrapping tmp creation with tempfile.TemporaryDirectory() as a context manager would clean it up automatically.
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/build/compiler.py
Line: 237-239
Comment:
**Temp directory created by `mkdtemp()` is never cleaned up**
The directory created by `tempfile.mkdtemp()` on line 239 is used only to stage the dataset before registration, but is never deleted — not on success and not on exception. Over repeated builds this accumulates large scratch directories (the dataset can be gigabytes). Wrapping `tmp` creation with `tempfile.TemporaryDirectory()` as a context manager would clean it up automatically.
How can I resolve this? If you propose a fix, please make it concise.
Draft · Stack 3 of 4 — targets
harbor-2-sidecar.HarborRunner, anEvalStrategythat — per candidate — runs a nestedharbor runof the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into veroSampleResults. One Harbor task = one sample; inference is delegated, scoring comes from Harbor.vero harbor buildrenders aBuildConfiginto a runnable Harbor task — a Docker Compose env (optimizer workbench + eval sidecar + 3 volumes), two Dockerfiles,instruction.md,tests/test.sh, seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar'sServeConfig. Mode A and Mode B.Review focus: the nested-run + collation, and the trust-boundary plumbing baked into the generated compose (volumes,
root:600token, secrets → sidecar only). Also un-ignoressrc/vero/harbor/build/(the repo'sbuild/rule was hiding the package).Stack: [1/4] core → [2/4] sidecar → this → [4/4] docs.
🤖 Generated with Claude Code
Greptile Summary
This PR adds Mode B nested-harbor-run evaluation (
HarborRunner) and thevero harbor buildcompiler that renders aBuildConfiginto a self-contained Harbor task directory (Docker Compose workbench + eval sidecar, Dockerfiles, scripts, and bakedServeConfig). The trust-boundary design is sound — secrets are sidecar-only, the admin token isroot:600, and the optimizer runs as a de-privilegedagentuser.runner.py:HarborRunner.produce_sample_resultsshells out to a nestedharbor run, then collates per-trialresult.jsonfiles intoSampleResults; resume logic skips already-persisted samples.compiler.py:compile_taskmaterialises the baseline repo viagit archive, registers the dataset into a bakedVERO_HOME, emits aServeConfig, and Jinja2-renders all environment files in one pass.config.py:BuildConfigcleanly separates Mode A (dataset + scorer baked in) from Mode B (partition + inner Harbor task baked sidecar-only).Confidence Score: 3/5
The trust-boundary plumbing (secrets sidecar-only, admin token root:600, agent de-privileged) is correctly wired. Three bugs need fixing before this should run in production: two in the runner and one in the compiler.
The runner silently ignores n_attempts/max_retries from HarborConfig, so retry behaviour is always the harbor default regardless of what the caller configures — directly affecting budget usage on live runs. _load_trials overwrites earlier results for the same task_name in filesystem-iteration order, meaning a successful retry can be replaced by a failed one, producing incorrect scores. In the compiler, git archive exit code is discarded: a failed archive populates an empty baseline directory whose SHA is recorded in the ServeConfig and baked into the image without any error signal.
Both runner.py (retry flag forwarding and collation deduplication) and compiler.py (git archive error handling and mkdtemp cleanup) need attention before this ships.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant V as vero (HarborRunner) participant H as harbor run (nested) participant S as eval-sidecar (vero harbor serve) participant FS as jobs_dir (filesystem) V->>V: "_task_names_for(params) -> [(sample_id, task_name)]" V->>V: filter pending (skip existing SampleResults) V->>H: uv run --project worktree harbor run -i task0 -i task1 --jobs-dir ... H->>S: POST /eval (agent commit) S-->>H: reward JSON H->>FS: write jobs/ts/trial/result.json H-->>V: exit code (non-zero tolerated) V->>FS: "rglob result.json -> task_name to result_dict" V->>V: _sample_result() per (sample_id, task_name) V->>V: "save_sample_result() -> VERO_HOME/sessions/"%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant V as vero (HarborRunner) participant H as harbor run (nested) participant S as eval-sidecar (vero harbor serve) participant FS as jobs_dir (filesystem) V->>V: "_task_names_for(params) -> [(sample_id, task_name)]" V->>V: filter pending (skip existing SampleResults) V->>H: uv run --project worktree harbor run -i task0 -i task1 --jobs-dir ... H->>S: POST /eval (agent commit) S-->>H: reward JSON H->>FS: write jobs/ts/trial/result.json H-->>V: exit code (non-zero tolerated) V->>FS: "rglob result.json -> task_name to result_dict" V->>V: _sample_result() per (sample_id, task_name) V->>V: "save_sample_result() -> VERO_HOME/sessions/"Comments Outside Diff (2)
vero/src/vero/harbor/runner.py, line 739-752 (link)n_attemptsandmax_retriessilently dropped from harbor runHarborConfigexposesn_attemptsandmax_retriesas typed fields (notextra_argspass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in_build_command, so everyharbor runinvocation uses whatever the harbor defaults are regardless of what the caller configured. A user who setsmax_retries=0to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.Prompt To Fix With AI
vero/src/vero/harbor/runner.py, line 799-813 (link)task_namewhen harbor retries a taskrglob("result.json")visits all result files across all timestamp directories underjobs_dir. If harbor'sn_attemptsormax_retriescauses the same task to produce multiple trial entries with the sametask_name,trials[task_name] = datasimply overwrites with whichever filerglobhappens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "Harbor: Mode B (nested harbor run) + the..." | Re-trigger Greptile