Skip to content

chore: Get a VM running#31

Draft
markovejnovic wants to merge 34 commits into
mainfrom
chore/get-a-vm-running
Draft

chore: Get a VM running#31
markovejnovic wants to merge 34 commits into
mainfrom
chore/get-a-vm-running

Conversation

@markovejnovic

Copy link
Copy Markdown
Contributor

No description provided.

Mix spawns subprocesses in their own session with no controlling terminal
(erl_child_setup calls setsid), so the nested `sudo install` could never
prompt for a password and always failed with "a terminal is required".

Split the task: `cargo xtask stamp` builds + stamps unprivileged, then the
privileged copy runs via sudo only when it is already non-interactive
(`sudo -n` succeeds); otherwise it prints the exact command to run by hand.
The unprivileged node used to pass `--bin <path>` to the setuid helper for
every losetup/dmsetup/blockdev op. Letting the caller name the binary the
helper escalates to run is a needless trust hole, even with SafeBin checks.

Move the paths into the helper-owned config (/etc/hyper/config.toml) with
sane defaults (/usr/sbin/{losetup,dmsetup,blockdev}); the helper validates
each as a SafeBin (absolute, root-owned, non-writable, exact basename) at
dispatch, as the real uid, before acquiring root. The `--bin` argument is
gone from both sides.

Also:
- Add a `dmsetup targets` op so the dm-target readiness probe runs through
  the helper (it opens /dev/mapper/control, which needs root) instead of
  shelling dmsetup directly as the BEAM user.
- An absent config file now falls back to the built-in defaults (trusted,
  compiled into the root-owned binary); a present-but-untrusted file stays
  fatal. Drop the Elixir-side *_path config and per-tool presence checks.
- Add :mix to the dialyzer PLT so the Mix tasks resolve Mix.raise/shell.
Document the Docker Postgres setup matching config.exs defaults, the now
config-sourced (and optional) device-binary paths, and the sudo/tty caveat
for mix suidhelper.install.
Spell out modprobe + dmsetup targets verify, the linux-modules-extra
fallback for stripped cloud kernels, and persisting via modules-load.d -
the missing_dm_targets boot failure points here.
mkdir under the cgroup-v2 hierarchy, enable cpu/memory in subtree_control
(root-level fallback), and persist via systemd-tmpfiles since cgroupfs is
memory-backed. The :missing_parent_cgroup boot failure points here.
Budget.Advertiser advertises on init, which calls Hyper.Node.Layer.active/0
and selects on Hyper.Node.Layer.Registry. That registry is owned by
Hyper.Node.Layer, which was ordered after Budget.Supervisor - so boot crashed
with `unknown registry: Hyper.Node.Layer.Registry`. Order Layer first.
ThinPool traps exits (for terminate/2 device teardown), so the normal close
of each System.cmd port arrived as an unhandled {:EXIT, port, :normal} and
logged as an error. Add a handle_info clause to ignore normal exits.

Tracing defaulted to Honeycombs endpoint but only sent the auth header when
HONEYCOMB_API_KEY was set, so keyless runs 401d on every batch. Export only
when a key or a custom OTEL endpoint is configured; otherwise disable the
exporter.
An unclean shutdown (SIGKILL, so terminate/2 never ran) leaves hyper-thinpool
behind; the next boot failed with "device-mapper: create ioctl ... Device or
resource busy". Best-effort remove any pre-existing pool before rebuilding,
ahead of zero_metadata so a still-live pool is never corrupted.
Hyper.Img.OciLoader.test_system/0 (checks skopeo/umoci/mke2fs) existed but was
never called from Hyper.Node.test_system, so a missing skopeo only surfaced at
image-load time. Wire it in after Umoci.ensure_installed so the node refuses to
boot when an OCI loader tool is absent, like every other host requirement.
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Layer.Server, Img.Mutable and Img.Server all trap exits (for terminate/2
device teardown) and shell privileged commands through System.cmd, which links
a transient port to the process. The port's normal close arrived as
{:EXIT, port, :normal} with no matching handle_info clause - Layer.Server
crash-looped on it while mounting a layer, which surfaced to create_vm as
:no_capacity. Add the same ignore-clause already in ThinPool to each.
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Test Results

305 tests  +55   304 ✅ +55   5s ⏱️ ±0s
 55 suites + 6     1 💤 ± 0 
  2 files   ± 0     0 ❌ ± 0 

Results for commit f2dc209. ± Comparison against base commit 3603c45.

This pull request removes 1 and adds 56 tests. Note that renamed tests count towards both.
hyper-suidhelper::e2e_config ‑ missing_config_exits_2
Elixir.Hyper.Node.FireVMM.JailerTest ‑ property gen_vm_id/0 produces only alphanumeric ids
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args contain --id, --uid, --gid with the opts values
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args do not contain privileged flags owned by the suidhelper
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args end with --api-sock /api.socket
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args include --cgroup cpu.max and memory.max for :micro type
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args start with the jailer subcommand
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test binary is the suid helper
Elixir.Hyper.SuidHelper.DmsetupPropertiesTest ‑ property parse_names recovers the device name from each `dmsetup ls` row
Elixir.Hyper.SuidHelper.DmsetupPropertiesTest ‑ test parse_names reads the empty-table sentinel as no devices
Elixir.Hyper.SuidHelper.LosetupTest ‑ test an empty listing yields no pairs
…

♻️ This comment has been updated with latest results.

DynamicSupervisor.start_child rejected the FireVMM child spec with
{:invalid_child_spec, ...} because the map used a :vm_id key where a supervisor
child spec requires :id - so no VM could ever boot. Use :id.

Add @SPEC child_spec(Opts.t()) :: Supervisor.child_spec(). child_spec/1 is a
plain def (not a typed @callback), so without a spec dialyzer never compared
the returned map against the child-spec contract and the typo passed the gate.
With the spec, the bad key fails dialyzer as invalid_contract.
place/3 reduced every candidate failure into a blanket {:error, :no_capacity},
so a real boot error on the only candidate was indistinguishable from genuine
lack of capacity. Log the actual refusal reason at each candidate.
An unclean shutdown (SIGKILL / :erlang.halt, where terminate/2 never runs)
leaves hyper dm devices and loop devices behind. The next boot then crashed in
ThinPool.init with "device-mapper: create ioctl on hyper-thinpool: Device or
resource busy", and a leaked thin volume held the pool open so the previous
remove-the-pool-only reclaim could not clear it.

Add read-only `dmsetup ls` and `losetup --list` ops to the suidhelper, and a
Hyper.Node.Reclaim pass that runs once before any device GenServer starts: it
removes every hyper-prefixed dm device leaf-first (retrying leftovers until a
pass clears nothing new, so stacked snapshots and the pool-under-volume case
resolve) then detaches loop devices backing files under the data dirs. Replaces
ThinPool.init's pool-only reclaim. Best-effort; logs and continues.
Extends the suidhelper config with four new fields needed before the
jailer exec path can be wired up:

- firecracker / jailer: Option<PathBuf> — no built-in default; accessors
  return BinError::Unconfigured when absent so callers get a clear
  "operator must configure this" signal rather than a path-validation error.
- parent_cgroup: String — defaults to "hyper", matching Elixir's
  @parent_cgroup (operators narrowing must keep both in sync).
- uid_gid_range: Option<UidGidRange> — total accessor returning
  (900_000, 999_999) when absent; a present range with min==0 or min>max
  is fatal at safe_load time, consistent with the "present-but-untrusted
  is fatal" trust model.

Adds BinError (Unconfigured | Bin) so firecracker/jailer accessors have a
richer error type than the existing tool accessors (which can never be
Unconfigured). Adds validate_uid_gid_range as a public pure function so
the refusal contract can be property-tested without touching the file system.

New test target config_uid_gid_range exercises: absent → default,
valid range round-trips via TOML, min==0 always rejected, min>max always
rejected. Extends e2e/config with: Unconfigured on absent keys, basename
mismatch rejected, Ok on root-owned correct-name binaries (root-guarded),
bad uid_gid_range binary exit 2 (root-guarded).
The BEAM no longer names a privileged binary path.  `Jailer.command/1`
now launches `hyper-suidhelper jailer` with only the id, uid/gid, cgroup
flags, and api-sock; the helper reads firecracker/jailer binary paths,
chroot base, parent cgroup, and cgroup version from its trusted
/etc/hyper/config.toml, re-acquires root, and execve's the jailer (same
pid, so MuonTrap owns the lifetime).

Changes:
- config.ex: add config_toml/0 (file → persistent_term cache); rewrite
  work_dir/0 over it; add firecracker_bin/0 + jailer_bin/0 via
  fetch_bin!/1 (raises when key absent); remove firecracker_install_dir/0.
- provider.ex: deleted.
- jailer.ex: rewire command/1 and exec_name/0; drop Provider alias.
- daemon.ex: note that supervised process is the helper (execve → jailer).
- node.ex: replace Provider.ensure_installed() with check_firecracker_bins/0.
- hyper.ex: prefix gen_vm_id/0 with "v" so ids never start with "-".
- jailer_test.exs: new pure test suite; load-bearing assertion that args
  contain no --exec-file / --chroot-base-dir / -- flags.
…r/jailer

Downloads the pinned Firecracker v1.16.0 release via Redist.Targz.install/3
(download → SHA-256 verify → extract), copies the version-stamped binaries to
bare-basename paths (<prefix>/firecracker and <prefix>/jailer) required by the
suidhelper's SafeBin validator, and prints the config snippets the operator
needs to paste.
…nstall

Remove firecracker/jailer from the auto-redistributed list; operators now
install them via `mix firecracker.install [--prefix <dir>]`. Update
/etc/hyper/config.toml example to show the required firecracker/jailer keys
(no default, root-owned + non-world-writable, bare basenames validated by the
helper) and the optional [uid_gid_range] table. Update `config :hyper` snippet
to drop the auto-download TODOs and point at the install-produced paths.
Document that uid_gid_range in config :hyper and [uid_gid_range] in
config.toml must be kept in sync.
The Elixir node was fetching TOML keys "firecracker_bin"/"jailer_bin" while
the setuid helper (and the TOML example in docs) uses "firecracker"/"jailer".
A correctly-installed host therefore crashed on launch with an unset-key
error even when the file was present.

- config.ex: fix fetch_bin! keys to "firecracker"/"jailer"; add non-raising
  firecracker_bin_configured/0 and jailer_bin_configured/0 returning
  {:ok, path} | :error for use by pre-launch checks.
- node.ex: rewrite check_firecracker_bins/0 to use the non-raising accessors
  so a missing key returns {:error, :firecracker_not_configured} instead of
  raising, honouring the @SPEC contract.
- firecracker.install.ex: drop the dead "config :hyper, firecracker_bin:"
  snippet from print_config/1 — nothing reads Application env for these paths.
- docs/cookbook/intro.md: remove firecracker_bin/jailer_bin from config :hyper
  block; clarify paths live only in /etc/hyper/config.toml.
- jailer_test.exs: align stub map keys to "firecracker"/"jailer" (the bug
  that masked this); add positive --cgroup assertion for :micro type.
- Cargo.toml: remove redundant toml dev-dependency (already a normal dep).
The task runs unprivileged, so the binaries land owned by the invoking
user. SafeBin in the suidhelper refuses any jailer/firecracker not owned
by root, so every launch fails closed until they are chowned. Print the
exact chown/chmod commands (with the real installed paths) instead of a
vague 'ensure root-owned' note.
The supervised process is MuonTrap.Daemon, so route the jailed
process's stdout+stderr (guest serial console included) to the Logger
via log_output, and map the exit status into {:firecracker_exited,
status} so a crash names the real exit code instead of the opaque
:error_exit_status. Add explicit launch/launch-failure log lines keyed
by vm id, so a boot loop is visible without reading console scrollback.
The :awaiting_api probe swallowed the describe_instance error and
stopped with a bare :daemon_unready, hiding why a healthy-looking
firecracker is unreachable. Log the last probe error on deadline and
carry it in the stop reason ({:daemon_unready, reason}), so a host->jail
socket permission/path problem is diagnosable instead of looking like an
unexplained 5s restart loop.
… the node user

The jailer drops firecracker to a per-VM uid/gid and chroots it, so the
API socket it creates at <jail>/root/api.socket is owned by that per-VM
id, mode 0755. Connecting a unix socket needs write permission, so the
unprivileged node controller gets EACCES on connect().

grant-api confines the socket path under JAIL_BASE (SafePath + O_NOFOLLOW
walk, fd-relative on the pinned root dir), verifies the leaf is a real
socket via fstatat(AT_SYMLINK_NOFOLLOW) (a planted file/symlink is
refused, never touched), then chowns it to the helper's caller
(getuid/getgid, the real=caller ids inside the privileged scope) and
chmods 0660. A not-yet-created socket is reported Pending, not an error.

Adds SafeDir::stat and SafeDir::chmod fd-relative primitives.
… can connect

AwaitingApi now hands the jailed API socket to the node user (via the new
chroot-jail grant-api op) before each readiness probe. firecracker
creates the socket owned by the per-VM uid, so the controller gets EACCES
until the helper chowns it; granting first lets the probe (and every
later API call) connect.

The grant runs once (tracked by State.api_granted); :socket_pending and
transient grant errors keep the controller waiting until the existing
boot deadline rather than crashing, via a shared deadline-aware
keep_probing path.
gen_vm_id used Base.url_encode64, which emits - and _. firecracker
rejects _ in an instance id (InvalidInstanceId / "Invalid char (_)"),
so any id containing one crash-looped the jailer at boot. Switch to
lowercase base32 ([a-z2-7], alphanumeric only) - the intersection of the
firecracker, dm/jailer, and registry-key constraints. Strengthen the
property from "no leading -" to "strictly alphanumeric".
Starting a process with a {:via, Horde.Registry, _} name makes OTP run
gen:get_proc_name right after start: it calls whereis_name immediately
after the synchronous register. Horde materialises the name into local
ETS only asynchronously (DeltaCRDT diff loop), so under registry churn
the read loses the race and startup aborts with
{:process_not_registered_via, Horde.Registry}. The crash-storm from the
bad vm_id flooded the CRDT and tipped this over, killing FireVMM.State.

Add Routing.register_self/1 and have the supervisor, client, and state
machine register from their own init (started unnamed). Names stay
cluster-resolvable via via/1 once the diff propagates - callers already
tolerate that lag. Core starts unnamed entirely (resolved by nobody).
grant-api chowned the API socket to the caller, but the jailer leaves
<id>/root as 0700 owned by the per-VM uid. connect() needs search (+x)
on every ancestor, so the unprivileged node still got EACCES traversing
into root - the socket's owner was irrelevant. The op now also opens that
one directory to the caller's group: owner stays the per-VM uid
(firecracker needs it), chgrp to the caller's gid, chmod 0710 (owner rwx,
group --x traverse-not-list, other none). Unrelated users stay locked
out; only the socket and its parent's group/mode move.

Add SafeDir::chmod_self/chgrp_self (fchmod/fchown on the pinned root fd,
not by name - TOCTOU-safe). Extend the grant test to assert root is
chgrp'd to the caller and chmod'd 0710.
@markovejnovic markovejnovic force-pushed the chore/get-a-vm-running branch from c9ebaaf to 4196153 Compare June 26, 2026 02:45
The rootfs device handed to chroot-jail staging is /dev/mapper/hyper-rw-<id>,
a symlink into the dm farm. SafeFile<IsBlockDevice> opens it O_NOFOLLOW (it must
never follow an attacker-supplied symlink), so the dm symlink itself was rejected
with "file is not of the required type" and the guest never booted.

Resolve the device to its real /dev/dm-N node via canonicalize before the open.
Safe because the BlockDev lexical guard already proved the name is a hyper-owned
dm device and /dev/mapper is a root-owned 0755 dir an unprivileged caller cannot
redirect; loop nodes canonicalize to themselves. The re-opened target is still
verified IsBlockDevice.
Both staging and config-apply failures end the boot via a :one_for_all
supervisor restart, so without an explicit log the reason vanished into the
restart cycle and the VM just appeared to relaunch for no visible cause. Log
the reason at each failure path before stopping.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant