feat(migrate-prod): prod → multiplayer migration (ETL + plan + runbook)#79

Draft
cevian wants to merge 11 commits into main from prod-multiplayer-migration

@cevian cevian commented Jun 18, 2026

One-time migration of production from the old org/engine/role + RLS model
(deployed at server/v0.2.5) to the new auth/core/space model. Rebased on
current main and re-validated (the better-auth + memory-names work landed on
main since the first draft; see "Adapted to main" below).

Topology

Two source databases — DB_ACCOUNTS (identity) and DB_SHARD (memories, one
me_<slug> per engine) — ETL'd into a new database (auth + core +
per-space me_<slug>). Sources are read-only; rollback = repoint the app at them.

Approach

Fresh target DB (no collisions; slugs reused). Provision auth/core, migrate
identities + oauth; per engine create+provision the space, build roster/grants,
and stream-copy memories cross-DB (batched unnest, ~500/insert). Run the ETL
first, then point DATABASE_URL at the target (the app's idempotent boot
migration becomes a no-op). Run in-region to keep the copy fast.
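
As a rough illustration of the batching described above, here is a minimal sketch of splitting a stream of rows into fixed-size batches and estimating the number of insert round-trips. The helper names `chunk` and `minRoundTrips` are hypothetical and do not come from the ETL; the 62,111 figure is the memory count reported in the rehearsal, and 500 is the approximate batch size quoted above.

```typescript
// Hypothetical helper: split rows into batches of at most `size` rows,
// so that each batch can be sent as a single insert.
function chunk<T>(rows: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

// Lower bound on insert round-trips if every row went through one stream.
// Copying per engine can only add partial batches on top of this.
function minRoundTrips(totalRows: number, batchSize: number): number {
  return Math.ceil(totalRows / batchSize);
}

// 62,111 rows at 500 per insert: ceil(124.222) = 125 round-trips.
const trips = minRoundTrips(62111, 500);
```

Because the copy runs per engine, the real number of inserts is at least this bound, since each engine can end on a partially filled batch.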

Key outcomes

  • Identities → auth.users + core.principal (id preserved); oauth links copied.
  • Engines → spaces; org roster → space roster; org owner/admin → admin + owner@root;
    tree_owner/tree_grant → grants (lossy, over-permissive, documented); RBAC
    roles → groups; service users → an agent owned by the engine's org owner.
  • Everyone re-authenticates after cutover: sessions do NOT migrate
    (better-auth stores raw session tokens; we only have old sha256 hashes — humans
    re-login), and API keys do not migrate (argon2 — agents re-key).
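
To show why the old sessions cannot be carried over, here is a minimal sketch. It assumes the old scheme stored a hex sha256 digest of the session token, as the text above describes. `sha256Hex` and `matchesStoredHash` are hypothetical names, not functions from this PR or from better-auth.

```typescript
import { createHash } from "node:crypto";

// Assumption: the old store kept only sha256(token) as hex, never the token.
function sha256Hex(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// Old-style check: hash whatever the client presents and compare it with the
// stored digest. A production check would use a constant-time comparison.
function matchesStoredHash(presented: string, storedHash: string): boolean {
  return sha256Hex(presented) === storedHash;
}

// The digest is one-way: the raw token cannot be derived from storedHash,
// so a store keyed by the raw token has nothing to import.
const storedHash = sha256Hex("example-session-token");
const ok = matchesStoredHash("example-session-token", storedHash);
```

The old check works without ever knowing the raw token, which is exactly why there is nothing to write into a column that expects the raw value.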

Validated against live prod (read-only) + full rehearsal

  • Two distinct source clusters; no DDL drift on the columns the ETL reads.
  • Full rehearsal into a test target: 34 spaces, 62,111 memories, 0 skipped, 0
    warnings; verify.ts passes (per-space memory counts reconcile incl. the
    20,862-row engine; service-user→agent confirmed; ≥1 admin per space).
  • Timing: ~15 min from a remote host (I/O-bound, ~1.8 GB over the WAN); faster in-region.

Adapted to main

Rebasing surfaced two breaking schema changes on main, now handled:

  • better-auth dropped auth.sessions.token_hash for a raw token → session
    migration removed (the §2.3 premise is dead; users re-login).
  • the space memory table renamed embedding_version → content_version (and
    added a nullable name) → memory copy retargeted; name left null.

Tests

tsc + lint clean; 16 integration + 5 unit pass, full suite green. The
integration test uses one physical DB as a stand-in for all three connections and covers
the simple + complex scenarios (multi-member org, RBAC role→group,
service-user→agent, grants, dangling identity, invitations, deleted/orphan engines,
no-session-migration, engine-subset filter).

🤖 Generated with Claude Code

cevian force-pushed the prod-multiplayer-migration branch from 6a06e0c to 3bce9e1 on June 22, 2026 13:49

cevian and others added 11 commits June 25, 2026 15:12

One-time, in-place migration from the old org/engine/role + RLS model
(deployed at server/v0.2.5) to the new auth/core/space model (PR #71),
all within the single existing database.

PROD_MIGRATION_PLAN.md captures the full old→new mapping, decisions
(in-place, reuse slugs, rename-aside), the phased run procedure, and the
"verify against live prod" checklist (no DB access yet — drafted off code).

packages/migrate-prod implements it, reusing the new code's own
provisioning (migrateAuth/migrateCore/provisionSpace) and core SQL
functions rather than re-implementing DDL:
  - Phase A: provision auth+core beside the live accounts schema; migrate
    identities → auth.users + core.principal (id preserved), oauth links,
    and live sessions (token_hash copied verbatim — same sha256 scheme).
  - Phase B (per engine, one txn): rename old me_<slug> aside, provision a
    fresh one, build the roster + tree-access grants from org membership /
    superuser / tree_owner / tree_grant / role_membership, same-DB copy
    memories (carrying embeddings).
  - Phase C: explicit dropLegacy/dropAccounts teardown.

Not migrated (by constraint): api keys (argon2, unrecoverable — agents
re-issue), oauth tokens, device-flow rows. Grant {actions}→level mapping
is intentionally lossy/over-permissive (documented).

Tested end-to-end against a real Postgres (simple + complex scenarios:
multi-member org, RBAC role→group, explicit grants, dangling identity,
invitations, deleted/orphan engines) plus unit tests for the mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Operational step-by-step for executing the migration: pre-flight checklist
(privileges, backup, §9 verification), rollback plan, the maintenance-window
mode (recommended) vs per-engine zero-downtime mode, Phase-C teardown, and
reconciliation/verification SQL.

Grounds the steps in how prod actually deploys/migrates: the new server
auto-migrates idempotently on boot (so run the ETL first to avoid the
helm --wait --atomic crashloop), and connects via DATABASE_URL with a
temporary ENGINE_DATABASE_URL fallback (so the single-DB cutover needs no
chart connection change). Flags the one hard break (API-key re-issue) and the
cross-schema privilege requirement for the ETL connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prod actually runs two separate databases (DB_ACCOUNTS + DB_SHARD), not one,
so the migration now targets a brand-new third database instead of an in-place
schema swap.

- ETL takes three connections {accounts, shard, target}; sources are read-only.
- No rename-aside / collision handling (sources and target are different DBs);
  each engine slug is reused verbatim as its space slug.
- Memories are copied cross-database by streaming (cursor over DB_SHARD →
  batched insert into the target) instead of insert…select; meta re-sent via
  sql.json to dodge the postgres.js text-in-jsonb double-encoding footgun.
- Removed the dropLegacy/dropAccounts teardown helpers — sources are never
  modified, so rollback is just repointing the app at the old databases and
  decommissioning them is out of band.
- run.ts reads DB_ACCOUNTS / DB_SHARD / DATABASE_URL(target).
- Plan + runbook rewritten for the three-DB topology (fresh target, cross-DB
  copy, repoint-to-sources rollback, chart DB-secret repoint, cross-DB
  verification queries).
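
For illustration, a minimal sketch of collecting the three connection strings named above. `readConfig` and `MigrationConfig` are hypothetical names; the actual run.ts may be structured differently. Only the three variable names (DB_ACCOUNTS, DB_SHARD, DATABASE_URL) come from this commit.

```typescript
// Environment passed in explicitly so the function stays easy to test.
type Env = Record<string, string | undefined>;

// Hypothetical shape for the three connections {accounts, shard, target}.
interface MigrationConfig {
  accountsUrl: string; // DB_ACCOUNTS: identity source (read-only)
  shardUrl: string;    // DB_SHARD: memories source (read-only)
  targetUrl: string;   // DATABASE_URL: the fresh target database
}

function readConfig(env: Env): MigrationConfig {
  const need = (name: string): string => {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      throw new Error(`missing required env var ${name}`);
    }
    return value;
  };
  return {
    accountsUrl: need("DB_ACCOUNTS"),
    shardUrl: need("DB_SHARD"),
    targetUrl: need("DATABASE_URL"),
  };
}
```

In a Node entry point this would be called as `readConfig(process.env)`. The sketch deliberately does not reject identical URLs, since the integration test points all three connections at one physical database.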

Tests updated: the integration test stands in one physical DB for all three
connections (source schemas carry a distinct prefix so they don't collide with
the target). typecheck + lint clean; 13 integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Added survey.ts: a READ-ONLY §9 verification tool (DB_ACCOUNTS + DB_SHARD)
that checks DDL drift and surveys the data shape. Run against live prod:
  - two distinct physical clusters confirmed → cross-DB ETL is correct
  - no DDL drift; 32 identities, 34 active engines, 62,111 memories, 0 orphans
  - 33/34 orgs single-owner; 1 multi-member org; 1 RBAC role
  - 6 service users (login, no identity), all confirmed to be each owner's
    own coding agents (claude/codex/sidekick/…)

Service-user handling (decision): map each to a kind='a' agent owned by the
engine's org owner, joined to the space, with its grants re-created (clamped
under the owner's owner@root). Dangling identities (none in prod) still drop
with a warning. Memory copy kept per-row (cursor fetched in batches).

Fixture + test cover the new service-user→agent path; plan updated with the
§9 results and the decision (§4.1, §10). typecheck + lint clean; 14
integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the per-engine zero-downtime mode (Mode B) and the modes section; the
runbook is now a single linear maintenance-window cutover. Fold in the §9
survey facts: humans keep working (sessions migrate) so only agents (former
service users) need re-issued keys; expected reconciliation numbers (~32
users, 34 active engines, ~62k memories, 0 orphans, 0 skipped/warnings);
note the row-by-row copy runtime. Fix the plan's stale runbook cross-refs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The target needs only an empty database (+ creatable/installed extensions + a
schema-creating role) — not a pre-migrated one. The ETL runs migrateAuth/
migrateCore + provisionSpace itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add MigrateOptions.engineSlugs (run.ts: MIGRATE_ENGINES env) to restrict Phase B
to a subset of engines — Phase A still migrates all identities. Lets you smoke
test the ETL against a throwaway target with the real (read-only) prod sources:
a fast few-engine pass first, then a full rehearsal. Requested slugs that aren't
active engines are reported in skippedEngines.
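
A minimal sketch of how such a subset filter could behave. It assumes MIGRATE_ENGINES is a comma-separated list of slugs, a format this commit does not state. `parseEngineSlugs` and `selectEngines` are hypothetical names; only `engineSlugs` and `skippedEngines` appear in the commit.

```typescript
// Assumption: MIGRATE_ENGINES is a comma-separated list of engine slugs.
// Returns undefined when unset or blank, meaning "no filter, migrate all".
function parseEngineSlugs(raw: string | undefined): string[] | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const slugs = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return Array.from(new Set(slugs)); // drop duplicates, keep first-seen order
}

interface EngineSelection {
  selected: string[];       // engines Phase B would process
  skippedEngines: string[]; // requested slugs that are not active engines
}

function selectEngines(
  activeSlugs: readonly string[],
  requested: readonly string[] | undefined,
): EngineSelection {
  if (requested === undefined) {
    return { selected: [...activeSlugs], skippedEngines: [] };
  }
  const active = new Set(activeSlugs);
  return {
    selected: requested.filter((slug) => active.has(slug)),
    skippedEngines: requested.filter((slug) => !active.has(slug)),
  };
}
```

Reporting unknown slugs instead of failing keeps a smoke-test run going when one requested engine was misspelled or has since been deleted.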

Runbook §0 documents the rehearsal procedure (incl. target reset SQL for re-runs).
Tests cover the filter + the not-found-slug case. typecheck + lint clean; 16
integration + 5 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A subset-aware, read-only verifier across DB_ACCOUNTS/DB_SHARD/target: identity↔
user counts + the auth.users == core.principal invariant, oauth/session copy,
per-space memory counts (target vs source shard), ≥1 admin per space, every
member's effective build_tree_access non-empty, and the Tiger-Den access-parity
spot-check (owner→owner@root, member→group grants). Prints a ✓/✗ checklist,
exits non-zero on failure. Runbook §5 points at it.
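
A minimal sketch of the per-space count reconciliation and the ✓/✗ report, assuming the counts have already been queried into maps keyed by slug. `Check`, `checkMemoryCounts` and `report` are hypothetical names and not the verify.ts API; the real tool runs more checks than this.

```typescript
interface Check {
  name: string;
  ok: boolean;
  detail?: string;
}

// Compare per-space memory counts. Only slugs present in `source` are
// checked, so passing a subset of engines yields a subset-aware result.
function checkMemoryCounts(
  source: ReadonlyMap<string, number>,
  target: ReadonlyMap<string, number>,
): Check[] {
  const checks: Check[] = [];
  source.forEach((expected, slug) => {
    const actual = target.get(slug) ?? 0; // a missing space counts as 0 rows
    checks.push({
      name: `memory count ${slug}`,
      ok: actual === expected,
      detail: `source=${expected} target=${actual}`,
    });
  });
  return checks;
}

// Render the checklist and derive a process exit code (0 = all passed).
function report(checks: readonly Check[]): { lines: string[]; exitCode: number } {
  const lines = checks.map(
    (c) => `${c.ok ? "✓" : "✗"} ${c.name}${c.detail ? ` (${c.detail})` : ""}`,
  );
  return { lines, exitCode: checks.every((c) => c.ok) ? 0 : 1 };
}
```

A caller would print `lines` and exit with `exitCode`, so a single mismatched space fails the run.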

Verified the smoke-test target (Tiger Den + one small engine): all 12 checks
passed — 32 identities/users/principals, 18 sessions, memory counts match, and
the role→group access resolves for the collaborator.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the per-row cross-DB insert with one batched insert per cursor fetch:
read every column as text, then `insert … select … from unnest($1::text[], …)`
with scalar casts in the projection (meta::jsonb, tree::ltree,
temporal::tstzrange, embedding::halfvec). Validated the cast pattern locally
(jsonb objects, halfvec, ltree incl. root, tstzrange, nulls all preserved).

Cuts ~62k target round-trips to ~125 → the memory copy drops from tens of
minutes to a few. Behavior unchanged: the row-level enqueue trigger still fires
per inserted row (null-embedding rows enqueue), counts/embeddings/tree paths
identical. Docs (plan §5, runbook) updated; 16 integration + unit suite green.
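
The cast pattern above can be sketched as follows. This is an illustration under assumptions, not the ETL's code: the column list is reduced to five columns, the `id` column and its `uuid` type are assumed, and `toColumnArrays` and `buildInsertSql` are hypothetical helpers. Only the four casts (jsonb, ltree, tstzrange, halfvec) and the `unnest($1::text[], …)` shape come from this commit.

```typescript
// Every value travels as text (or null) and is cast on the target side.
type TextRow = {
  id: string;
  meta: string | null;
  tree: string | null;
  temporal: string | null;
  embedding: string | null;
};

const COLUMNS = ["id", "meta", "tree", "temporal", "embedding"] as const;
type Column = (typeof COLUMNS)[number];

// Scalar cast applied in the projection. The uuid type for id is an assumption.
const CASTS: Record<Column, string> = {
  id: "uuid",
  meta: "jsonb",
  tree: "ltree",
  temporal: "tstzrange",
  embedding: "halfvec",
};

// Transpose a batch of rows into one text array per column, in COLUMNS order,
// ready to bind as $1..$5.
function toColumnArrays(rows: readonly TextRow[]): (string | null)[][] {
  return COLUMNS.map((col) => rows.map((row) => row[col]));
}

// Build the single insert statement for a batch. `table` must be a trusted
// identifier because it is interpolated, not bound.
function buildInsertSql(table: string): string {
  const names = COLUMNS.join(", ");
  const params = COLUMNS.map((_, i) => `$${i + 1}::text[]`).join(", ");
  const projection = COLUMNS.map((c) => `t.${c}::${CASTS[c]}`).join(", ");
  return `insert into ${table} (${names}) select ${projection} from unnest(${params}) as t(${names})`;
}
```

Binding whole arrays keeps the parameter count fixed at one per column regardless of batch size, and casting in the projection lets nulls pass through unchanged.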

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Full rehearsal against a test target (real prod sources, read-only): 34 spaces,
62,111 memories, 0 skipped, 0 warnings; verify.ts 114/114 checks pass (counts
reconcile per space incl. the 20,862-row one; service-user→agent confirmed).
Wall-clock 14m36s but ~3% CPU — I/O-bound (~1.8 GB of halfvec over the WAN), so
run the real cutover in-region. Runbook timing notes corrected from "a few
minutes" to the measured number.
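
As a back-of-the-envelope check on the numbers above, the average effective transfer rate can be worked out as below. Two assumptions are made: "~1.8 GB" is read as decimal gigabytes, and the whole wall-clock time is attributed to the transfer, so the result is an average rather than a measured link speed.

```typescript
// Figures quoted in the rehearsal notes above.
const wallClockSeconds = 14 * 60 + 36; // 14m36s = 876 s
const bytesTransferred = 1.8e9;        // "~1.8 GB", assumed decimal gigabytes

// Average rate if the entire run is treated as transfer time (assumption).
const bytesPerSecond = bytesTransferred / wallClockSeconds;
const megabytesPerSecond = bytesPerSecond / 1e6;
const megabitsPerSecond = (bytesPerSecond * 8) / 1e6;

// Round to one decimal place for display.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// prints "2.1 MB/s, 16.4 Mbit/s"
console.log(`${round1(megabytesPerSecond)} MB/s, ${round1(megabitsPerSecond)} Mbit/s`);
```

An average in the region of 16 Mbit/s at ~3% CPU is consistent with the copy being bound by the WAN link rather than by the ETL itself.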

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on rename)

Rebased onto current main (76 commits ahead) and re-ran the import test, which
surfaced two breaking schema changes:

- better-auth retired auth.sessions.token_hash for a plaintext `token` (raw
  token round-tripped). Old sessions stored only a one-way sha256 hash, so they
  can no longer be migrated — DROP session migration; everyone re-authenticates
  after cutover (humans re-login, agents re-key).
- the space memory table renamed embedding_version → content_version (and added
  a nullable `name`). Retarget the memory copy to content_version; leave name null.

Updated the ETL, verify.ts (now asserts 0 sessions migrated), the fixture/test,
and the plan + runbook (§2.3 sessions, §5 column map, §9.1, decision #3, the
runbook re-auth callout). typecheck + lint clean; 16 integration + full suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cevian force-pushed the prod-multiplayer-migration branch from 4f21556 to 74f77dc on June 25, 2026 13:23