docs(native): host-provisioning contract + provisioning script #94
Open
ephpm-claude[bot] wants to merge 10 commits into
…in, static concurrency
…filesystem write isolation
Run GHA jobs directly on the macOS host instead of per-job VMs, enabling 4+ concurrent jobs (vs Apple's 2-VM cap) with zero boot overhead. Configured per-repo under [runner.macos] with "org/repo" keys, "org/*" wildcards, and a separate nativeMacSem concurrency gate. The VM path is untouched.
Jobs never run as root: a hidden _ephemerd service user is created lazily (per-job ephemeral users were abandoned: macOS user deletion requires Full Disk Access and wedges opendirectoryd). Each job gets its own HOME/TMPDIR/work dir, keychain, Homebrew prefix, and a sandbox-exec profile denying localhost outbound and port binding.
Also fixes uncovered along the way:
- runner extraction is OS-suffixed (runners/<ver>-<goos>) so the macOS host and Linux VM no longer corrupt each other's runner on the shared data dir (Linux dispatch exit 127)
- isOfficialRunnerImage prefixes had a trailing dash that never matched the runner-ci-linux tag, breaking custom-image dispatch
- DEVELOPER_DIR resolved via xcode-select -p instead of hardcoded Xcode.app path (broke git on CLT-only hosts)
- macOS VM runner monitor logs pgrep results at debug level
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security follow-ups from review of the native runner. Native jobs run directly on the host with no VM boundary, so the sandbox profile and unix permissions are the entire isolation story. Two concrete holes are closed here, plus one documented as needing live-macOS work.
1. Sibling-job + daemon-state isolation. Every native job runs as the same _ephemerd uid and all workspaces live under <dataDir>/native/, so a job could read a concurrent job's checkout token or source. The profile now denies read AND write of the whole <dataDir>/native subtree and re-allows only the job's own dir (sandbox-exec applies the last matching rule). config.toml, ephemerd.sock, and the vm dir gain write denies to match their existing read denies.
2. .ssh write hole. .ssh was read-denied but writable, leaving an authorized_keys append vector on any host where the runner uid can reach the target home. Now denied for write too.
3. Dedicated primary group instead of staff (gid 20). staff is the default group for every normal macOS account, so the runner process inherited group access to the many staff-group-owned files on a typical Mac. The service user now gets a dedicated _ephemerd group. Provisioning is best-effort: any failure falls back to staff (the previously-tested behavior), so a group hiccup never blocks jobs.
Not done here (documented in a code comment as a follow-up): flipping the profile from allow-by-default to deny-by-default. That is the stronger posture for native execution but requires enumerating every path the GHA runner + toolchains touch and live-testing on macOS so jobs don't break, which can't be verified blind from a non-macOS host.
The LAN-egress gap (sandbox-exec has no CIDR support; pf rules still a follow-up) is unchanged and remains the reason native mode should stay restricted to trusted first-party repos.
The hardened sandbox blocked the GHA runner from starting. Three distinct macOS sandbox-exec behaviors, each found via local repro:
1. deny file-read* on the native subtree blocked file-read-metadata, which realpath() needs to traverse through native/ to the job dir. The .NET host died with "Failed to resolve full path of the current executable" (exit 133). Fixed: deny only file-read-data.
2. getcwd() and bash walk UP from the job's runner dir and must readdir(native/) to learn the job-id component name; the read-data deny on the native subtree blocked that, giving "getcwd: cannot access parent directories" and "run.sh: Operation not permitted" (exit 126). Fixed: allow file-read-data on the native dir node (literal); this leaks only the non-secret list of concurrent job ids.
3. macOS sandbox resolves a specific-operation deny (file-read-data) over a later wildcard allow (file-read*), so the per-job re-allow must name file-read-data explicitly to win. Added an explicit file-read-data re-allow on the job subtree alongside file-read*.
Job-to-job isolation is preserved: a sibling job's directory listing and file contents stay denied (verified). Smoke-test jobs now run end-to-end as _ephemerd with all steps green.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub deprecated runner v2.333.1: its broker now returns
403 Forbidden ("Runner version v2.333.1 is deprecated and cannot
receive messages") to that version. Because ephemerd embeds and pins
the runner and runs it with disableUpdate=true, every job on every
platform (macOS native, Linux/Windows VM dispatch) connected, got the
403, and exited cleanly in ~6s with the job left queued — no jobs
could be processed.
Bump to 2.335.1 (latest, released 2026-06-09). Verified live: ephpm
macos-aarch64 jobs go queued -> in_progress with a runner assigned and
the backlog drains; runners stay alive running real job steps instead
of the 6s deprecation exit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sandbox profile denied all socket binds:
(deny network-bind (local ip "*:*"))
This made every CI test that opens a listening socket fail with EPERM
("Operation not permitted") on bind — e.g. ephpm's `cargo nextest`
macos-arm64 suite died on the first socket test and cancelled the
remaining ~609. Reproduced directly: bind 127.0.0.1:0 fails under the
profile, succeeds without it.
The loopback denies (network-bind + localhost network-outbound) also
provided no real protection: sandbox-exec cannot express CIDR rules, so
the LAN/RFC1918 egress blocking the design intended was never actually
enforced here (still a pf-firewall follow-up). They only broke tests.
Job-to-job data isolation is provided by the filesystem rules (a sibling
job's dir is unreadable), which are unchanged.
Replace the loopback denies with (allow network-bind) + (allow
network-outbound). Verified: a bind-and-connect-to-self roundtrip
succeeds while sibling-job filesystem reads stay denied. Added a
regression guard so the bind-deny can't silently return.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
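The bind-and-connect-to-self check described above could look roughly like this in Go. It is a sketch under assumptions: `loopbackRoundtrip` is a hypothetical name, and the real verification was presumably run under `sandbox-exec` with the job profile, which this snippet does not do by itself.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// loopbackRoundtrip binds 127.0.0.1:0, connects to itself, and echoes a
// short message. Under a profile that denies network-bind, the Listen call
// is expected to fail with "operation not permitted".
func loopbackRoundtrip() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("bind: %w", err)
	}
	defer ln.Close()

	// Echo server: accept one connection and copy its input back to it.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		io.Copy(conn, conn)
	}()

	conn, err := net.DialTimeout("tcp", ln.Addr().String(), 2*time.Second)
	if err != nil {
		return fmt.Errorf("connect: %w", err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(2 * time.Second))

	msg := []byte("ping")
	if _, err := conn.Write(msg); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	reply := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, reply); err != nil {
		return fmt.Errorf("read: %w", err)
	}
	if string(reply) != string(msg) {
		return fmt.Errorf("echo mismatch: %q", reply)
	}
	return nil
}

func main() {
	if err := loopbackRoundtrip(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: bind and connect-to-self succeeded")
}
```

Running the compiled binary once with and once without the profile would reproduce the difference the commit describes, assuming the profile is passed to `sandbox-exec` in the usual way.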
…script
Native mode runs jobs on the bare host with no per-job VM or container image, so every build dependency a workflow assumes present must exist on the host. In production this surfaced as ephpm macos-aarch64 `Build` jobs failing with "Unable to find libclang": bindgen's LIBCLANG_PATH=$(brew --prefix llvm@17)/lib pointed at a formula the native host didn't have (it was baked into the old macOS VM base image).
Investigation finding: the runner-ci-macos-deps OCI image only carries ephemerd's Go tooling as a VM overlay; the language/build toolchains lived in the Tart-provisioned VM base disk, for which there is no portable manifest. So there is no image to extract onto a native host; the dependency surface is the host itself.
- scripts/provision-native-macos.sh: idempotent brew-based provisioner (seeded with llvm@17; --check mode) to keep native runner hosts in sync; extend the formula list as workflows need more.
- docs/arch/native-macos-runner.md: new "Host provisioning contract" section explaining the bare-host dependency model and the two ways to satisfy it (provision the host vs. workflows install their own deps).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
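The idempotent check-then-install behaviour described above can be sketched in Go. The actual provisioner is a shell script and is not reproduced here; the helper names `missing` and `brewInstalled` are hypothetical, and the sketch assumes `brew list --versions <formula>` exits non-zero when a formula is not installed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// formulae is the list to keep in sync; the commit seeds it with llvm@17.
var formulae = []string{"llvm@17"}

// brewInstalled reports whether Homebrew has the formula installed.
// Assumption: `brew list --versions` fails for a missing formula.
func brewInstalled(formula string) bool {
	return exec.Command("brew", "list", "--versions", formula).Run() == nil
}

// missing returns the formulae that the installed predicate rejects.
// Taking the predicate as a parameter keeps the logic testable without brew.
func missing(want []string, installed func(string) bool) []string {
	var out []string
	for _, f := range want {
		if !installed(f) {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	if _, err := exec.LookPath("brew"); err != nil {
		fmt.Println("brew not found on PATH; nothing to do on this host")
		return
	}
	checkOnly := len(os.Args) > 1 && os.Args[1] == "--check"

	todo := missing(formulae, brewInstalled)
	if len(todo) == 0 {
		fmt.Println("all formulae present")
		return
	}
	if checkOnly {
		// --check reports drift without changing the host.
		fmt.Println("missing:", todo)
		os.Exit(1)
	}
	for _, f := range todo {
		cmd := exec.Command("brew", "install", f)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintln(os.Stderr, "install failed:", f, err)
			os.Exit(1)
		}
	}
}
```

Because only missing formulae are installed, running it a second time on a provisioned host is a no-op, which is the property that makes the provisioner safe to re-run.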
Summary
Documents the host-provisioning contract for native macOS mode and adds an idempotent provisioner. Prompted by ephpm `macos-aarch64` `Build` jobs failing with `Unable to find libclang` after the switch to native runners.

Root cause
Native mode runs jobs on the bare host (no per-job VM, no container image), so any build dep a workflow assumes present must exist on the host. bindgen wanted `LIBCLANG_PATH=$(brew --prefix llvm@17)/lib`; the native host had `llvm`/`llvm@22` but not `llvm@17` (it was baked into the old macOS VM base disk image).
Investigation also settled the "can't we just unzip the image?" question: the `runner-ci-macos-deps` OCI image carries only ephemerd's Go tooling as a VM overlay; the language/build toolchains lived in the Tart-provisioned VM base disk, which has no portable manifest. There is no image to extract; the dependency surface is the host.

Changes
- `scripts/provision-native-macos.sh`: idempotent `brew` provisioner (seeded with `llvm@17`, `--check` mode) to keep native hosts in sync; extend the formula list as workflows need more.
- `docs/arch/native-macos-runner.md`: "Host provisioning contract" section covering the bare-host dependency model and the two ways to satisfy it (provision the host, or have workflows `brew install` their own deps, which is the reproducible option the per-job Homebrew overlay was designed for).

Immediate unblock (already applied on the runner host)
`brew install llvm@17`. Verified: libclang `dlopen`s under the job sandbox and `clang 17.0.6` runs. The next macOS Build job will find libclang.

🤖 Generated with Claude Code