[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor by varunursekar · Pull Request #3 · scaleapi/vero

varunursekar · 2026-06-24T18:12:06Z

Draft · Stack 1 of 4 — review/merge in order (1→4). No Harbor concepts here; foundations the rest builds on (and useful on their own).

3-tier split visibility: SplitAccessLevel gains no_access (hidden) alongside visible / non_viewable (aggregate-only).
BudgetLedger (core/budget.py): metered, optionally-durable per-split eval budget.
Staged eval: TaskOutput is a model; inference and scoring run as separate resumable disk-backed stages, with label-field scrubbing before inference.
Evaluation refactor: consolidate under vero/evaluation/ as a shared EvaluationEngine + Evaluator, port ExperimentRunnerTool onto it, and add the injectable EvalStrategy seam (Mode B's extension point).

Review focus: the visibility enum, BudgetLedger, the EvaluationEngine API, and the strategy seam. Much of evaluation/ is relocated code (skim).

Part of the Harbor integration (full squashed view: #2). Stack: this → [2/4] sidecar → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR lays the foundational plumbing for the Harbor integration: a three-tier SplitAccessLevel enum (viewable / non_viewable / no_access), a metered BudgetLedger with optional durable JSON persistence, and a refactored evaluation stack that splits inference and scoring into two resumable disk-backed stages with label-field scrubbing.

BudgetLedger (core/budget.py): in-memory by default; persist_path enables a crash-safe durable variant for the Harbor sidecar, with reserve() providing an atomic check+decrement under an asyncio.Lock. One issue: _flush() does synchronous file I/O while the lock is held, which blocks the event loop in the durable path.
Staged evaluation (core/task/task.py): run_inference_stage and run_scoring_stage each persist per-sample results incrementally and skip already-completed samples on resume; label_fields are scrubbed from inference inputs but passed intact to scoring.
EvaluationEngine + EvalStrategy seam (evaluation/): shared engine consolidates budget metering and sample resolution; a Protocol-based EvalStrategy seam allows Harbor Mode B to inject a HarborRunner without coupling vero.evaluation to harbor-specific code.

Confidence Score: 3/5

Safe to merge for the in-process ExperimentRunnerTool path; the durable BudgetLedger path (Harbor sidecar) has a blocking I/O call under an asyncio lock that will degrade the event loop under concurrency.

The evaluation refactor, visibility enum, and EvalStrategy seam are clean and well-tested. The concern is in BudgetLedger.reserve(): it holds an asyncio.Lock and then calls _flush(), which does synchronous Path.write_text() on the main thread. For the in-memory path (persist_path=None) _flush() is a no-op and this is harmless. But the durable path — the variant the Harbor sidecar is documented to use — will block the event loop on every metered eval request, causing latency spikes and throttling throughput under concurrent load.

vero/src/vero/core/budget.py — the _flush() / reserve() interaction; vero/src/vero/core/task/task.py — double _load_and_prepare_data() call across the two stages.

Important Files Changed

Filename	Overview
vero/src/vero/core/budget.py	New BudgetLedger with durable JSON persistence; _flush() does blocking I/O inside the asyncio lock in reserve(), which will stall the event loop in the Harbor sidecar path.
vero/src/vero/core/task/task.py	Refactored into two resumable stages (inference + scoring) with per-sample disk persistence and label-field scrubbing; both stages redundantly call _load_and_prepare_data().
vero/src/vero/evaluation/engine.py	New EvaluationEngine shared core; clean budget metering with admin bypass; no_access gating via ledger allowlist is well-designed.
vero/src/vero/evaluation/evaluator.py	Evaluator moved from vero/evaluator.py; gains eval_strategy injection seam for Mode B; logic is unchanged and correct.
vero/src/vero/evaluation/strategy.py	Clean runtime-checkable Protocol defining the EvalStrategy seam; keeps vero.evaluation harbor-agnostic.
vero/src/vero/core/dataset/base.py	Adds no_access tier to SplitAccessLevel and updates get_non_viewable_splits to include it; default test split promoted from non_viewable to no_access.
vero/src/vero/core/db/result.py	TaskOutput converted from dataclass to Pydantic BaseModel with Exception-coercion validator; adds is_scored() to SampleResult; clean and correct.
vero/src/vero/evaluator.py	Reduced to a back-compat re-export shim; unnecessarily exports private _resolve_vero_dependency.
vero/src/vero/tools/experiment_runner.py	ExperimentRunnerTool ported onto EvaluationEngine; SplitBudget moved to vero.core.budget and re-exported for back-compat; delegation is clean.
vero/src/vero/policy.py	Adds harbor field and lazy HarborRunner strategy injection; import path updates are correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[VeroTask.run] --> B[run_inference_stage]
    B --> B1{sample persisted?}
    B1 -- yes --> B2[skip]
    B1 -- no --> B3[_scrub_inputs]
    B3 --> B4[inference fn]
    B4 --> B5[_save_inference to disk]

    A --> C[run_scoring_stage]
    C --> C1{error or already scored?}
    C1 -- yes --> C2[skip]
    C1 -- no --> C3[load TaskOutput from disk]
    C3 --> C4[scoring fn with full row]
    C4 --> C5[_save_score to disk]

    A --> D[compute_metrics from disk]

    E[EvaluationEngine.evaluate] --> F{admin?}
    F -- no --> G[BudgetLedger.reserve]
    G --> G1[asyncio.Lock]
    G1 --> G2[check]
    G2 --> G3[record + _flush]
    G3 -. blocking I/O .-> G4[blocks event loop]
    F -- yes --> H[bypass budget]
    G --> I[Evaluator.evaluate]
    H --> I
    I --> J{eval_strategy set?}
    J -- Mode B --> K[EvalStrategy.produce_sample_results]
    J -- Mode A --> L[_run_task_in_subprocess]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[VeroTask.run] --> B[run_inference_stage]
    B --> B1{sample persisted?}
    B1 -- yes --> B2[skip]
    B1 -- no --> B3[_scrub_inputs]
    B3 --> B4[inference fn]
    B4 --> B5[_save_inference to disk]

    A --> C[run_scoring_stage]
    C --> C1{error or already scored?}
    C1 -- yes --> C2[skip]
    C1 -- no --> C3[load TaskOutput from disk]
    C3 --> C4[scoring fn with full row]
    C4 --> C5[_save_score to disk]

    A --> D[compute_metrics from disk]

    E[EvaluationEngine.evaluate] --> F{admin?}
    F -- no --> G[BudgetLedger.reserve]
    G --> G1[asyncio.Lock]
    G1 --> G2[check]
    G2 --> G3[record + _flush]
    G3 -. blocking I/O .-> G4[blocks event loop]
    F -- yes --> H[bypass budget]
    G --> I[Evaluator.evaluate]
    H --> I
    I --> J{eval_strategy set?}
    J -- Mode B --> K[EvalStrategy.produce_sample_results]
    J -- Mode A --> L[_run_task_in_subprocess]

Comments Outside Diff (1)

vero/src/vero/core/budget.py, line 175-193 (link)

Blocking I/O inside asyncio.Lock in reserve()

_flush() calls Path.write_text() (synchronous) while reserve() holds self._lock — an asyncio.Lock. Any concurrent coroutine that awaits reserve() will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in asyncio.to_thread() (or use aiofiles) and await them inside reserve(), keeping the lock protecting only the in-memory check+decrement step.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/core/budget.py
Line: 175-193

Comment:
**Blocking I/O inside `asyncio.Lock` in `reserve()`**

`_flush()` calls `Path.write_text()` (synchronous) while `reserve()` holds `self._lock` — an `asyncio.Lock`. Any concurrent coroutine that awaits `reserve()` will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in `asyncio.to_thread()` (or use `aiofiles`) and `await` them inside `reserve()`, keeping the lock protecting only the in-memory check+decrement step.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/core/budget.py:175-193
**Blocking I/O inside `asyncio.Lock` in `reserve()`**

`_flush()` calls `Path.write_text()` (synchronous) while `reserve()` holds `self._lock` — an `asyncio.Lock`. Any concurrent coroutine that awaits `reserve()` will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in `asyncio.to_thread()` (or use `aiofiles`) and `await` them inside `reserve()`, keeping the lock protecting only the in-memory check+decrement step.

### Issue 2 of 3
vero/src/vero/core/task/task.py:745-748
**Dataset loaded twice per evaluation**

`run_scoring_stage` calls `_load_and_prepare_data(params)` even though `run_inference_stage` already called it earlier in `run()`. On large datasets loaded from disk, this doubles the I/O and deserialization cost. Consider passing the already-loaded `(tasks, task_data)` tuple into both stage methods, or loading once at the `run()` level and threading it through.

### Issue 3 of 3
vero/src/vero/evaluator.py:8-13
The back-compat shim re-exports `_resolve_vero_dependency`, a private implementation detail that should not be part of any public surface. Callers importing via `vero.evaluator` have no reason to use it, and exporting it risks unintended coupling. Remove it from the shim — it's already accessible via `vero.evaluation.evaluator._resolve_vero_dependency` for internal use.

```suggestion
from vero.evaluation.evaluator import (  # noqa: F401
    Evaluator,
    isolate_project,
    run_evaluation,
)
```

_{Reviews (1): Last reviewed commit: "vero: 3-tier split visibility, budget le..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

…ne refactor Foundational changes that the Harbor integration builds on, useful on their own: - 3-tier split visibility: `SplitAccessLevel` gains `no_access` (hidden) alongside `visible` and `non_viewable` (aggregate-only). - `BudgetLedger` (core/budget.py): a metered, optionally durable per-split eval budget extracted from the experiment runner. - `TaskOutput` is now a model; the task pipeline runs inference and scoring as separate, resumable, disk-backed stages, with label-field scrubbing before inference. - Evaluation refactor: consolidate the eval path under `vero/evaluation/` as a shared `EvaluationEngine` + `Evaluator`, port `ExperimentRunnerTool` onto it, and add an injectable `EvalStrategy` seam (the extension point Harbor Mode B uses). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-24T18:32:38Z

+        tasks, _ = self._load_and_prepare_data(params)
+        sample_ids = params.run.dataset_subset.sample_ids
+        if sample_ids is None:
+            sample_ids = list(range(len(tasks)))


Dataset loaded twice per evaluation

run_scoring_stage calls _load_and_prepare_data(params) even though run_inference_stage already called it earlier in run(). On large datasets loaded from disk, this doubles the I/O and deserialization cost. Consider passing the already-loaded (tasks, task_data) tuple into both stage methods, or loading once at the run() level and threading it through.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/core/task/task.py Line: 745-748 Comment: **Dataset loaded twice per evaluation** `run_scoring_stage` calls `_load_and_prepare_data(params)` even though `run_inference_stage` already called it earlier in `run()`. On large datasets loaded from disk, this doubles the I/O and deserialization cost. Consider passing the already-loaded `(tasks, task_data)` tuple into both stage methods, or loading once at the `run()` level and threading it through. How can I resolve this? If you propose a fix, please make it concise.

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:17

varunursekar marked this pull request as ready for review June 24, 2026 18:21

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3
varunursekar wants to merge 1 commit into
mainfrom
harbor-1-core

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading