refactor: make sandbox readiness gateway-owned across compute drivers #1951

@elezar

Description

Make sandbox readiness semantics uniform across compute drivers. Drivers should report backend/runtime state, and the gateway should compose that with supervisor-session state to decide the public SandboxPhase.

Context

The current behavior is inconsistent across drivers:

  • Docker uses an in-process SupervisorReadiness callback to avoid reporting Ready=True until the gateway has a registered supervisor session.
  • VM deliberately lets the gateway promote a sandbox to Ready when the supervisor session connects.
  • Podman can report Ready=True when the container is running, without checking gateway supervisor-session state.
  • Kubernetes forwards Agent Sandbox CRD conditions, so the gateway trusts the controller-reported Ready condition.

The gateway also has generic, driver-independent promotion and demotion based on supervisor-session state. However, a later driver snapshot with Ready=True can promote a sandbox back to public Ready even if no supervisor session is registered. As a result, public SandboxPhase::Ready means either backend-ready or supervisor-connected, depending on the driver path.

Proposed Design

Define the driver contract around backend readiness only:

  • Drivers report whether the backend resource exists, is starting, is backend-ready, is deleting, or has hit a terminal failure.
  • The gateway owns public sandbox readiness.
  • Public Ready requires both backend readiness and a registered supervisor session.
  • Backend terminal failure still maps to public Error.
  • Backend deleting still maps to public Deleting.
  • Backend-ready without a supervisor session remains public Provisioning with a clear supervisor-not-connected condition.
  • open_relay should keep its existing wait as race protection for reconnects and short readiness gaps.
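
The mapping in the list above could be expressed as a single pure function on the gateway side. The following is a minimal sketch, not the actual implementation: `BackendState`, `PublicStatus`, `compose_phase`, and the reason strings are hypothetical names introduced for illustration, and the `SandboxPhase` enum here only mirrors the phases named in this issue. Treating a missing or starting backend as Provisioning is an assumption.

```rust
/// Hypothetical backend/runtime state as a driver would report it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendState {
    Missing,
    Starting,
    Ready,
    Deleting,
    Failed,
}

/// Stand-in for the public phase; only the phases named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxPhase {
    Provisioning,
    Ready,
    Deleting,
    Error,
}

/// Public phase plus an optional machine-readable reason (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PublicStatus {
    pub phase: SandboxPhase,
    pub reason: Option<&'static str>,
}

/// Compose driver-reported backend state with gateway-owned session state.
/// Terminal failure and deletion take precedence over session state.
pub fn compose_phase(backend: BackendState, supervisor_connected: bool) -> PublicStatus {
    let (phase, reason) = match (backend, supervisor_connected) {
        (BackendState::Failed, _) => (SandboxPhase::Error, Some("BackendFailed")),
        (BackendState::Deleting, _) => (SandboxPhase::Deleting, None),
        (BackendState::Ready, true) => (SandboxPhase::Ready, None),
        (BackendState::Ready, false) => {
            (SandboxPhase::Provisioning, Some("SupervisorNotConnected"))
        }
        // Assumption: a missing or starting backend is still provisioning.
        (BackendState::Missing | BackendState::Starting, _) => {
            (SandboxPhase::Provisioning, Some("BackendNotReady"))
        }
    };
    PublicStatus { phase, reason }
}
```

Keeping the function pure means every driver path goes through the same table, and drivers never need access to the supervisor registry.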

Once the gateway composition is uniform, remove Docker-specific access to the gateway supervisor registry instead of spreading that pattern to other drivers.

HA Consideration

Supervisor sessions are currently process-local. The implementation should explicitly decide how readiness composition behaves in multi-gateway deployments. At minimum, driver snapshots from a gateway that does not own the live supervisor session must not incorrectly demote or re-promote public readiness. A more complete solution may require a persisted or leased supervisor-presence record.
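
One possible shape for a leased supervisor-presence record is sketched below. This is only an illustration of the idea, not a proposal for the storage layer: `SupervisorLease`, `Presence`, `presence`, and `counts_as_connected` are hypothetical names, the gateway IDs are made up, and the sketch assumes all gateways share a clock source that is good enough for the chosen TTL.

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease written by the gateway that holds the live session.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SupervisorLease {
    pub owner_gateway: String,
    pub expires_at: Instant,
}

impl SupervisorLease {
    pub fn new(owner_gateway: &str, now: Instant, ttl: Duration) -> Self {
        Self {
            owner_gateway: owner_gateway.to_string(),
            expires_at: now + ttl,
        }
    }

    /// The owning gateway extends the lease while the session stays up.
    pub fn renew(&mut self, now: Instant, ttl: Duration) {
        self.expires_at = now + ttl;
    }
}

/// How a given gateway sees supervisor presence for a sandbox.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Presence {
    /// This gateway holds the live session.
    Local,
    /// Another gateway holds an unexpired lease.
    Remote,
    /// No lease, or the lease has expired.
    Absent,
}

pub fn presence(lease: Option<&SupervisorLease>, this_gateway: &str, now: Instant) -> Presence {
    match lease {
        Some(l) if now < l.expires_at => {
            if l.owner_gateway == this_gateway {
                Presence::Local
            } else {
                Presence::Remote
            }
        }
        _ => Presence::Absent,
    }
}

/// For readiness composition, a live lease on any gateway counts as
/// connected, so a non-owning gateway does not demote the sandbox.
pub fn counts_as_connected(p: Presence) -> bool {
    matches!(p, Presence::Local | Presence::Remote)
}
```

With a record like this, a gateway that processes a driver snapshot without a local session would see `Presence::Remote` rather than "no session", which avoids the incorrect demotion described above.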

Definition of Done

  • Introduce a clear gateway-side composition path for backend readiness plus supervisor-session state.
  • Ensure SandboxPhase::Ready consistently means the sandbox is usable through the gateway.
  • Remove Docker SupervisorReadiness once readiness no longer depends on driver access to gateway-local session state.
  • Update Docker, Podman, Kubernetes, and VM behavior or tests to match the contract.
  • Add compute-layer tests for backend-ready without supervisor, backend-ready with supervisor, backend-not-ready with supervisor, terminal failure precedence, disconnect demotion, reconnect promotion without a fresh driver snapshot, and prevention of re-promotion from a later backend-ready snapshot without a supervisor session.
  • Document the readiness contract in the relevant architecture or driver documentation.
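
The event-ordering cases in the test list above (disconnect demotion, reconnect promotion without a fresh snapshot, no re-promotion from a later backend-ready snapshot) depend on the gateway remembering the last backend state and the session state separately, then recomputing the public phase on every event. A minimal sketch of that shape follows; `SandboxRecord`, `Backend`, `Phase`, and the method names are hypothetical and do not reflect the existing compute-layer types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Starting,
    Ready,
    Deleting,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Provisioning,
    Ready,
    Deleting,
    Error,
}

/// Hypothetical per-sandbox record: both inputs are stored, and the
/// public phase is always derived from them, never stored directly.
#[derive(Debug)]
pub struct SandboxRecord {
    backend: Backend,
    supervisor_connected: bool,
}

impl SandboxRecord {
    pub fn new() -> Self {
        Self {
            backend: Backend::Starting,
            supervisor_connected: false,
        }
    }

    /// A driver snapshot only updates backend state.
    pub fn on_driver_snapshot(&mut self, backend: Backend) -> Phase {
        self.backend = backend;
        self.phase()
    }

    /// Session events only update session state.
    pub fn on_supervisor_connected(&mut self) -> Phase {
        self.supervisor_connected = true;
        self.phase()
    }

    pub fn on_supervisor_disconnected(&mut self) -> Phase {
        self.supervisor_connected = false;
        self.phase()
    }

    /// Derive the public phase from the two stored inputs.
    pub fn phase(&self) -> Phase {
        match (self.backend, self.supervisor_connected) {
            (Backend::Failed, _) => Phase::Error,
            (Backend::Deleting, _) => Phase::Deleting,
            (Backend::Ready, true) => Phase::Ready,
            _ => Phase::Provisioning,
        }
    }
}
```

Because a snapshot never writes the public phase directly, a backend-ready snapshot that arrives after a disconnect leaves the sandbox in Provisioning, and a reconnect promotes it using the last stored backend state.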
