RFC: build system — large images, persistent cache, and the build API surface#288

Draft
rgarcia wants to merge 1 commit into main from raf/rfc-builds-large-images-and-cache

rgarcia commented Jun 12, 2026

Summary

Design doc only — no code changes. Written after an empirical attempt to build the kernel-images chromium-headful production browser image with hypeman build on a dev host (main @ 1b153f8).

Findings (all measured, timeline in the doc's appendix):

  • BuildKit's entire data root lives on a hardcoded 3GB RAM-backed tmpfs in the builder VM, making buildable image size ≈ builder RAM, with no knob — and the memory caps that would compensate are stacked at 16GB (MaxBuildMemoryMB) and 32GB (max_memory_per_instance). With those temporarily lifted, the chromium-headful image builds in 2m42s and boots with all artifacts verified.
  • Build caching is opt-in and ephemeral: an identical rebuild produced 0 CACHED steps; --mount=type=cache contents never survive the VM.
  • is_admin_build is an ungated boolean — any build:write token can push to the global cache (fine while the control plane is the only token-holder; must be fixed before that changes).
  • Four bugs: the SSE build stream dies on long builds (so the CLI reports failure for builds that succeed), build cancel returns 404 for builds that are still in the building state, an image_name re-tag triggers a second full rootfs conversion of the same digest, and cloud-hypervisor hits an RSDP boot panic at --memory 4096.

Proposals (ordered, separable):

  • P1: disk-backed BuildKit root on a dedicated ext4 volume (removes the size ceiling at stock memory limits)
  • P2/P2a/P2b: persistent per-scope cache volumes; thin tenancy via a scope claim in the JWT (no user modeling needed); whole-build dedup by host-computed input hash so a bit-identical second build (templates, same public repo) completes in seconds with no VM
  • P3: cache on by default; --no-cache/--pull
  • P4: API fixes — build_args is dead wiring (one multipart field short of working), target, network mode exposure, config-derived caps
  • P5: the four bugs above
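
To make P2a concrete, here is a minimal sketch of how a single scope claim could select the cache volume and gate is_admin_build. It is not the doc's implementation. It assumes the JWT signature has already been verified upstream and that the claims arrive as a dict. The names `resolve_cache_volume`, `ADMIN_SCOPES`, `GLOBAL_CACHE`, the `control-plane` scope value, and the volume naming scheme are all hypothetical.

```python
import hashlib

# Hypothetical configuration: scopes allowed to write the global cache.
ADMIN_SCOPES = frozenset({"control-plane"})
# Hypothetical name for the shared, admin-only cache volume.
GLOBAL_CACHE = "cache-global"


def resolve_cache_volume(claims: dict, is_admin_build: bool) -> str:
    """Map verified token claims to a cache volume name.

    Assumes `claims` comes from a JWT whose signature was already checked,
    so the scope value cannot be forged by the caller.
    """
    scope = claims.get("scope")
    if not isinstance(scope, str) or not scope:
        raise PermissionError("token carries no scope claim")

    if is_admin_build:
        # The request flag alone is never trusted; the scope must allow it.
        if scope not in ADMIN_SCOPES:
            raise PermissionError("scope may not write the global cache")
        return GLOBAL_CACHE

    # Hash the scope so arbitrary scope strings yield safe volume names.
    digest = hashlib.sha256(scope.encode("utf-8")).hexdigest()[:16]
    return f"cache-{digest}"
```

In this sketch a token scoped to `tenant-a` can only ever reach its own volume, and setting is_admin_build on the request has no effect unless the token's scope is in the admin set. Quotas would be enforced per volume and are not shown.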

Review notes

  • The tenancy section (P2a) is the part most worth adversarial review: the claim is that a safe multi-tenant cache needs exactly one unforgeable scope claim plus quotas, not tenant/user modeling.
  • The whole-build dedup safety argument (P2b) is the other: identical inputs ⇒ the first builder had no influence the victim didn't choose themselves. Exclusions (secrets, TTL, digest-resolved bases) are listed.
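
The P2b argument can be illustrated with a small sketch of a host-computed input hash. It is written under assumptions and is not the design doc's algorithm. The function names, the choice of hashed fields, the `@sha256:` check as a reading of "digest-resolved bases", and the in-memory index are all hypothetical. Builds that use secrets are treated as ineligible, and entries older than a TTL are ignored.

```python
import hashlib
import json


def build_input_hash(dockerfile, context_files, build_args, base_images, has_secrets):
    """Return a dedup key for a build, or None if the build is ineligible.

    context_files maps relative path -> file bytes; base_images is a list
    of image references that should already be pinned to a digest.
    """
    if has_secrets:
        return None  # secret-bearing builds are never deduplicated
    if any("@sha256:" not in ref for ref in base_images):
        return None  # a mutable tag could resolve differently next time

    h = hashlib.sha256()

    def feed(label, data):
        # Length-prefix every field so adjacent fields cannot run together.
        h.update(label.encode("utf-8"))
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)

    feed("dockerfile", dockerfile.encode("utf-8"))
    feed("build_args", json.dumps(build_args, sort_keys=True).encode("utf-8"))
    feed("base_images", json.dumps(sorted(base_images)).encode("utf-8"))
    for path in sorted(context_files):  # order-independent over the context
        feed("path", path.encode("utf-8"))
        feed("content", context_files[path])
    return h.hexdigest()


def lookup_prior_build(index, key, now, ttl_seconds):
    """Return a previously built image digest if the entry is still fresh."""
    if key is None:
        return None
    entry = index.get(key)
    if entry is None:
        return None
    image_digest, built_at = entry
    if now - built_at > ttl_seconds:
        return None  # expired entries force a real build
    return image_digest
```

Because the key covers every input the requester supplied, a hit means the earlier builder submitted exactly the same bytes, which is the basis of the safety claim above. Whether the key should also be mixed with the scope is a design question this sketch leaves open.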

🤖 Generated with Claude Code

Written after an empirical attempt to build the kernel-images
chromium-headful production browser image with `hypeman build`. Documents
the 3GB tmpfs BuildKit-root ceiling and the stacked memory caps behind it,
the opt-in/ephemeral state of build caching (verified: 0 CACHED steps on
an identical rebuild), gaps vs docker build/buildx, and four bugs found
along the way (SSE stream death, cancel 404, re-tag double conversion,
a CH RSDP boot panic).

Proposes: disk-backed BuildKit root on a dedicated volume, persistent
per-scope cache volumes, whole-build dedup by host-computed input hash,
scope-in-token thin tenancy (incl. gating is_admin_build, which is
currently an ungated boolean), and an ordered list of API fixes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>