chore: Get a VM running#31
Draft
markovejnovic wants to merge 34 commits into
Draft
Conversation
…oup v2, suidhelper)
Mix spawns subprocesses in their own session with no controlling terminal (erl_child_setup calls setsid), so the nested `sudo install` could never prompt for a password and always failed with "a terminal is required". Split the task: `cargo xtask stamp` builds + stamps unprivileged, then the privileged copy runs via sudo only when it is already non-interactive (`sudo -n` succeeds); otherwise it prints the exact command to run by hand.
The unprivileged node used to pass `--bin <path>` to the setuid helper for
every losetup/dmsetup/blockdev op. Letting the caller name the binary the
helper escalates to run is a needless trust hole, even with SafeBin checks.
Move the paths into the helper-owned config (/etc/hyper/config.toml) with
sane defaults (/usr/sbin/{losetup,dmsetup,blockdev}); the helper validates
each as a SafeBin (absolute, root-owned, non-writable, exact basename) at
dispatch, as the real uid, before acquiring root. The `--bin` argument is
gone from both sides.
Also:
- Add a `dmsetup targets` op so the dm-target readiness probe runs through
the helper (it opens /dev/mapper/control, which needs root) instead of
shelling dmsetup directly as the BEAM user.
- An absent config file now falls back to the built-in defaults (trusted,
compiled into the root-owned binary); a present-but-untrusted file stays
fatal. Drop the Elixir-side *_path config and per-tool presence checks.
- Add :mix to the dialyzer PLT so the Mix tasks resolve Mix.raise/shell.
Document the Docker Postgres setup matching config.exs defaults, the now config-sourced (and optional) device-binary paths, and the sudo/tty caveat for mix suidhelper.install.
Spell out modprobe + dmsetup targets verify, the linux-modules-extra fallback for stripped cloud kernels, and persisting via modules-load.d - the missing_dm_targets boot failure points here.
mkdir under the cgroup-v2 hierarchy, enable cpu/memory in subtree_control (root-level fallback), and persist via systemd-tmpfiles since cgroupfs is memory-backed. The :missing_parent_cgroup boot failure points here.
Budget.Advertiser advertises on init, which calls Hyper.Node.Layer.active/0 and selects on Hyper.Node.Layer.Registry. That registry is owned by Hyper.Node.Layer, which was ordered after Budget.Supervisor - so boot crashed with `unknown registry: Hyper.Node.Layer.Registry`. Order Layer first.
ThinPool traps exits (for terminate/2 device teardown), so the normal close
of each System.cmd port arrived as an unhandled {:EXIT, port, :normal} and
logged as an error. Add a handle_info clause to ignore normal exits.
Tracing defaulted to Honeycombs endpoint but only sent the auth header when
HONEYCOMB_API_KEY was set, so keyless runs 401d on every batch. Export only
when a key or a custom OTEL endpoint is configured; otherwise disable the
exporter.
An unclean shutdown (SIGKILL, so terminate/2 never ran) leaves hyper-thinpool behind; the next boot failed with "device-mapper: create ioctl ... Device or resource busy". Best-effort remove any pre-existing pool before rebuilding, ahead of zero_metadata so a still-live pool is never corrupted.
Hyper.Img.OciLoader.test_system/0 (checks skopeo/umoci/mke2fs) existed but was never called from Hyper.Node.test_system, so a missing skopeo only surfaced at image-load time. Wire it in after Umoci.ensure_installed so the node refuses to boot when an OCI loader tool is absent, like every other host requirement.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Layer.Server, Img.Mutable and Img.Server all trap exits (for terminate/2
device teardown) and shell privileged commands through System.cmd, which links
a transient port to the process. The port's normal close arrived as
{:EXIT, port, :normal} with no matching handle_info clause - Layer.Server
crash-looped on it while mounting a layer, which surfaced to create_vm as
:no_capacity. Add the same ignore-clause already in ThinPool to each.
Test Results305 tests +55 304 ✅ +55 5s ⏱️ ±0s Results for commit f2dc209. ± Comparison against base commit 3603c45. This pull request removes 1 and adds 56 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
DynamicSupervisor.start_child rejected the FireVMM child spec with
{:invalid_child_spec, ...} because the map used a :vm_id key where a supervisor
child spec requires :id - so no VM could ever boot. Use :id.
Add @SPEC child_spec(Opts.t()) :: Supervisor.child_spec(). child_spec/1 is a
plain def (not a typed @callback), so without a spec dialyzer never compared
the returned map against the child-spec contract and the typo passed the gate.
With the spec, the bad key fails dialyzer as invalid_contract.
place/3 reduced every candidate failure into a blanket {:error, :no_capacity},
so a real boot error on the only candidate was indistinguishable from genuine
lack of capacity. Log the actual refusal reason at each candidate.
An unclean shutdown (SIGKILL / :erlang.halt, where terminate/2 never runs) leaves hyper dm devices and loop devices behind. The next boot then crashed in ThinPool.init with "device-mapper: create ioctl on hyper-thinpool: Device or resource busy", and a leaked thin volume held the pool open so the previous remove-the-pool-only reclaim could not clear it. Add read-only `dmsetup ls` and `losetup --list` ops to the suidhelper, and a Hyper.Node.Reclaim pass that runs once before any device GenServer starts: it removes every hyper-prefixed dm device leaf-first (retrying leftovers until a pass clears nothing new, so stacked snapshots and the pool-under-volume case resolve) then detaches loop devices backing files under the data dirs. Replaces ThinPool.init's pool-only reclaim. Best-effort; logs and continues.
Extends the suidhelper config with four new fields needed before the jailer exec path can be wired up: - firecracker / jailer: Option<PathBuf> — no built-in default; accessors return BinError::Unconfigured when absent so callers get a clear "operator must configure this" signal rather than a path-validation error. - parent_cgroup: String — defaults to "hyper", matching Elixir's @parent_cgroup (operators narrowing must keep both in sync). - uid_gid_range: Option<UidGidRange> — total accessor returning (900_000, 999_999) when absent; a present range with min==0 or min>max is fatal at safe_load time, consistent with the "present-but-untrusted is fatal" trust model. Adds BinError (Unconfigured | Bin) so firecracker/jailer accessors have a richer error type than the existing tool accessors (which can never be Unconfigured). Adds validate_uid_gid_range as a public pure function so the refusal contract can be property-tested without touching the file system. New test target config_uid_gid_range exercises: absent → default, valid range round-trips via TOML, min==0 always rejected, min>max always rejected. Extends e2e/config with: Unconfigured on absent keys, basename mismatch rejected, Ok on root-owned correct-name binaries (root-guarded), bad uid_gid_range binary exit 2 (root-guarded).
The BEAM no longer names a privileged binary path. `Jailer.command/1` now launches `hyper-suidhelper jailer` with only the id, uid/gid, cgroup flags, and api-sock; the helper reads firecracker/jailer binary paths, chroot base, parent cgroup, and cgroup version from its trusted /etc/hyper/config.toml, re-acquires root, and execve's the jailer (same pid, so MuonTrap owns the lifetime). Changes: - config.ex: add config_toml/0 (file → persistent_term cache); rewrite work_dir/0 over it; add firecracker_bin/0 + jailer_bin/0 via fetch_bin!/1 (raises when key absent); remove firecracker_install_dir/0. - provider.ex: deleted. - jailer.ex: rewire command/1 and exec_name/0; drop Provider alias. - daemon.ex: note that supervised process is the helper (execve → jailer). - node.ex: replace Provider.ensure_installed() with check_firecracker_bins/0. - hyper.ex: prefix gen_vm_id/0 with "v" so ids never start with "-". - jailer_test.exs: new pure test suite; load-bearing assertion that args contain no --exec-file / --chroot-base-dir / -- flags.
…r/jailer Downloads the pinned Firecracker v1.16.0 release via Redist.Targz.install/3 (download → SHA-256 verify → extract), copies the version-stamped binaries to bare-basename paths (<prefix>/firecracker and <prefix>/jailer) required by the suidhelper's SafeBin validator, and prints the config snippets the operator needs to paste.
…nstall Remove firecracker/jailer from the auto-redistributed list; operators now install them via `mix firecracker.install [--prefix <dir>]`. Update /etc/hyper/config.toml example to show the required firecracker/jailer keys (no default, root-owned + non-world-writable, bare basenames validated by the helper) and the optional [uid_gid_range] table. Update `config :hyper` snippet to drop the auto-download TODOs and point at the install-produced paths. Document that uid_gid_range in config :hyper and [uid_gid_range] in config.toml must be kept in sync.
The Elixir node was fetching TOML keys "firecracker_bin"/"jailer_bin" while
the setuid helper (and the TOML example in docs) uses "firecracker"/"jailer".
A correctly-installed host therefore crashed on launch with an unset-key
error even when the file was present.
- config.ex: fix fetch_bin! keys to "firecracker"/"jailer"; add non-raising
firecracker_bin_configured/0 and jailer_bin_configured/0 returning
{:ok, path} | :error for use by pre-launch checks.
- node.ex: rewrite check_firecracker_bins/0 to use the non-raising accessors
so a missing key returns {:error, :firecracker_not_configured} instead of
raising, honouring the @SPEC contract.
- firecracker.install.ex: drop the dead "config :hyper, firecracker_bin:"
snippet from print_config/1 — nothing reads Application env for these paths.
- docs/cookbook/intro.md: remove firecracker_bin/jailer_bin from config :hyper
block; clarify paths live only in /etc/hyper/config.toml.
- jailer_test.exs: align stub map keys to "firecracker"/"jailer" (the bug
that masked this); add positive --cgroup assertion for :micro type.
- Cargo.toml: remove redundant toml dev-dependency (already a normal dep).
The task runs unprivileged, so the binaries land owned by the invoking user. SafeBin in the suidhelper refuses any jailer/firecracker not owned by root, so every launch fails closed until they are chowned. Print the exact chown/chmod commands (with the real installed paths) instead of a vague 'ensure root-owned' note.
The supervised process is MuonTrap.Daemon, so route the jailed
process's stdout+stderr (guest serial console included) to the Logger
via log_output, and map the exit status into {:firecracker_exited,
status} so a crash names the real exit code instead of the opaque
:error_exit_status. Add explicit launch/launch-failure log lines keyed
by vm id, so a boot loop is visible without reading console scrollback.
The :awaiting_api probe swallowed the describe_instance error and
stopped with a bare :daemon_unready, hiding why a healthy-looking
firecracker is unreachable. Log the last probe error on deadline and
carry it in the stop reason ({:daemon_unready, reason}), so a host->jail
socket permission/path problem is diagnosable instead of looking like an
unexplained 5s restart loop.
… the node user The jailer drops firecracker to a per-VM uid/gid and chroots it, so the API socket it creates at <jail>/root/api.socket is owned by that per-VM id, mode 0755. Connecting a unix socket needs write permission, so the unprivileged node controller gets EACCES on connect(). grant-api confines the socket path under JAIL_BASE (SafePath + O_NOFOLLOW walk, fd-relative on the pinned root dir), verifies the leaf is a real socket via fstatat(AT_SYMLINK_NOFOLLOW) (a planted file/symlink is refused, never touched), then chowns it to the helper's caller (getuid/getgid, the real=caller ids inside the privileged scope) and chmods 0660. A not-yet-created socket is reported Pending, not an error. Adds SafeDir::stat and SafeDir::chmod fd-relative primitives.
… can connect AwaitingApi now hands the jailed API socket to the node user (via the new chroot-jail grant-api op) before each readiness probe. firecracker creates the socket owned by the per-VM uid, so the controller gets EACCES until the helper chowns it; granting first lets the probe (and every later API call) connect. The grant runs once (tracked by State.api_granted); :socket_pending and transient grant errors keep the controller waiting until the existing boot deadline rather than crashing, via a shared deadline-aware keep_probing path.
gen_vm_id used Base.url_encode64, which emits - and _. firecracker rejects _ in an instance id (InvalidInstanceId / "Invalid char (_)"), so any id containing one crash-looped the jailer at boot. Switch to lowercase base32 ([a-z2-7], alphanumeric only) - the intersection of the firecracker, dm/jailer, and registry-key constraints. Strengthen the property from "no leading -" to "strictly alphanumeric".
Starting a process with a {:via, Horde.Registry, _} name makes OTP run
gen:get_proc_name right after start: it calls whereis_name immediately
after the synchronous register. Horde materialises the name into local
ETS only asynchronously (DeltaCRDT diff loop), so under registry churn
the read loses the race and startup aborts with
{:process_not_registered_via, Horde.Registry}. The crash-storm from the
bad vm_id flooded the CRDT and tipped this over, killing FireVMM.State.
Add Routing.register_self/1 and have the supervisor, client, and state
machine register from their own init (started unnamed). Names stay
cluster-resolvable via via/1 once the diff propagates - callers already
tolerate that lag. Core starts unnamed entirely (resolved by nobody).
grant-api chowned the API socket to the caller, but the jailer leaves <id>/root as 0700 owned by the per-VM uid. connect() needs search (+x) on every ancestor, so the unprivileged node still got EACCES traversing into root - the socket's owner was irrelevant. The op now also opens that one directory to the caller's group: owner stays the per-VM uid (firecracker needs it), chgrp to the caller's gid, chmod 0710 (owner rwx, group --x traverse-not-list, other none). Unrelated users stay locked out; only the socket and its parent's group/mode move. Add SafeDir::chmod_self/chgrp_self (fchmod/fchown on the pinned root fd, not by name - TOCTOU-safe). Extend the grant test to assert root is chgrp'd to the caller and chmod'd 0710.
c9ebaaf to
4196153
Compare
The rootfs device handed to chroot-jail staging is /dev/mapper/hyper-rw-<id>, a symlink into the dm farm. SafeFile<IsBlockDevice> opens it O_NOFOLLOW (it must never follow an attacker-supplied symlink), so the dm symlink itself was rejected with "file is not of the required type" and the guest never booted. Resolve the device to its real /dev/dm-N node via canonicalize before the open. Safe because the BlockDev lexical guard already proved the name is a hyper-owned dm device and /dev/mapper is a root-owned 0755 dir an unprivileged caller cannot redirect; loop nodes canonicalize to themselves. The re-opened target is still verified IsBlockDevice.
Both staging and config-apply failures end the boot via a :one_for_all supervisor restart, so without an explicit log the reason vanished into the restart cycle and the VM just appeared to relaunch for no visible cause. Log the reason at each failure path before stopping.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.