Prover performance: batch/endomorphism MSM, CIOS wasm, batching modes #185
Open
OBrezhniev wants to merge 62 commits into
…fers to worker threads instead of arrays (make it compatible with SharedArrayBuffer)
Fix nChunks calculation - drastically improve memory usage. Increase min chunk size to 1<<15 (32k) - speed improvement on smaller circuits. Serial chunk processing - better mem usage. Linter fixes
- remove chunking of chunks (removes unneeded copying of the same data to different worker jobs)
- make nChunks a multiple of tm.concurrency for optimal load balancing
- switch back to promises from awaits (allows parallel execution of chunks)
- roll back min chunk size
- transfer buffer ownership to worker threads (removes memory copying for large arrays!!!)
…ination
Replace parallel index arrays (workers[], initialized[], working[], etc.) with a WorkerSlot class that owns all per-worker state. Message handlers close over the slot reference, so stale messages from replaced workers are detected by a simple identity check (pool[i] !== slot) rather than generation counters.

2-phase termination protocol:
- Worker fires want_to_terminate when idle timer expires (200ms, down from 1s)
- Main thread nulls pool[i] immediately, sends TERMINATE ack, calls processWorks so a replacement worker can start filling the slot right away
- Worker's subsequent terminated message arrives stale and only removes event listeners to break the slot→worker→closure reference cycle for prompt GC
- Stale task results (want_to_terminate race with in-flight dispatch) are still resolved correctly so callers never hang

Additional fixes:
- scheduleTermination() moved inside init().then() so the 200ms idle timer never fires during async WASM compilation
- removeEventListener called on both stale and non-stale terminated paths so WASM memory held by old slots is released immediately, not GC-deferred
- processWorks start-new-workers loop no longer calls startWorker() on slots that are already occupied (working or initializing)
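The identity check described above can be sketched without real workers. This is a minimal illustration, not the threadman code: `WorkerSlot`, `pool`, `want_to_terminate` and `TERMINATE` come from the commit message, while `startWorker`'s shape, the plain-object stand-ins for workers and the `log` array are assumptions made for the example.

```javascript
// Hypothetical sketch: plain objects stand in for Worker instances.
class WorkerSlot {
    constructor(index, worker) {
        this.index = index;
        this.worker = worker;
        this.working = false;
    }
}

const pool = [null];
const log = [];

function startWorker(i, worker) {
    const slot = new WorkerSlot(i, worker);
    pool[i] = slot;
    // The handler closes over `slot`, so it can tell whether it is still current.
    worker.onmessage = (msg) => {
        if (pool[i] !== slot) {
            // Message from a worker whose slot was already replaced or cleared.
            log.push(`stale:${msg}`);
            return;
        }
        if (msg === "want_to_terminate") {
            pool[i] = null; // free the slot right away
            log.push("ack:TERMINATE");
            return;
        }
        log.push(`live:${msg}`);
    };
    return slot;
}

const oldWorker = {};
startWorker(0, oldWorker);
oldWorker.onmessage("result");
oldWorker.onmessage("want_to_terminate");

const newWorker = {};
startWorker(0, newWorker);          // replacement fills the freed slot
oldWorker.onmessage("terminated");  // arrives late, detected as stale
newWorker.onmessage("result");
```

No generation counter is needed because the closure's `slot` reference is itself the identity being compared.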
… to 1500ms
- engine_fft.js: remove console.log for FFT input size, point count, and reversePermutation name (fired on every FFT call)
- engine_multiexp.js: remove console.log for nChunks (fired on every multiExp call)
- threadman.js: remove "Worker N not initialized" log from processWorks
- threadman_thread.js: remove "INIT DONE" log; raise terminationTimeout 200ms → 1500ms so workers stay alive across the multiExp→IFFT/FFT gap (~0.8s), avoiding a 100ms WASM re-compile per worker each proof

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… __reversePermutation
The bit-reversal permutation before the FFT mix phase is just a permutation of fixed-size sIn-byte elements. It is now done in a worker by transferring the buffer in, reversing it in place with plain Uint32Array lane swaps, and transferring it back. Both transfers are pointer moves, so the whole step is zero-copy.

Versus the previous WASM __reversePermutation task this:
- never grows/retains the worker's WASM linear memory (the old ALLOCSET copied the full buffer into WASM memory, ~640MB across workers for a 2^21 Fr FFT, retained since WASM memory cannot shrink) and skips the GET copy-out;
- allocates nothing (Uint32Array lanes avoid the BigInt boxing a BigUint64Array would incur, and there is no per-swap slice as the old pure-JS buffReverseBits did; a single reused temp covers the byte-wise path for unaligned sizes);
- is ~2.5x faster than that old pure-JS buffReverseBits.

It also fixes a correctness bug: __reversePermutation swapped n8g-sized elements rather than sIn-sized ones, which was wrong whenever sIn != n8g (e.g. affine input G1/G2 FFTs). The "big FFT/IFFT in G1" test that failed on HEAD now passes; full suite 59/59.

A single worker is used because the swap is memory-bandwidth bound (splitting it across workers does not help and would oversubscribe the pool shared by the concurrent A/B/C transforms), so no SharedArrayBuffer is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
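As a rough illustration of the lane-swap approach, the following sketch permutes fixed-size elements in place through a `Uint32Array` view. It is not the worker's actual code: the function names are made up for the example, and it only covers element sizes that are a multiple of 4 bytes (the byte-wise path for unaligned sizes mentioned above is left out).

```javascript
// Reverse the low `bits` bits of `idx`.
function bitReverse(idx, bits) {
    let r = 0;
    for (let i = 0; i < bits; i++) {
        r = (r << 1) | ((idx >>> i) & 1);
    }
    return r >>> 0;
}

// In-place bit-reversal permutation of `elemSize`-byte elements.
// Assumes elemSize % 4 === 0 and a 4-byte-aligned view (sketch only).
function reversePermutationInPlace(bytes, elemSize) {
    if (elemSize % 4 !== 0 || bytes.byteOffset % 4 !== 0) {
        throw new Error("sketch only handles 4-byte-aligned element sizes");
    }
    const n = bytes.byteLength / elemSize;
    let bits = 0;
    while ((1 << bits) < n) bits++;
    if ((1 << bits) !== n) throw new Error("element count must be a power of two");

    const lanes = elemSize / 4;
    const u32 = new Uint32Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 4);
    for (let i = 0; i < n; i++) {
        const j = bitReverse(i, bits);
        if (i < j) {
            // Swap element i and element j one 32-bit lane at a time: no allocation.
            for (let k = 0; k < lanes; k++) {
                const tmp = u32[i * lanes + k];
                u32[i * lanes + k] = u32[j * lanes + k];
                u32[j * lanes + k] = tmp;
            }
        }
    }
}
```

Because the permutation is an involution, calling the function twice restores the original order, which makes a convenient sanity check.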
pairingEq transferred its per-equation g1Buff/g2Buff to the worker, but curve.G1.toJacobian()/G2.toJacobian() return their argument unchanged when the point is already in jacobian form. Caller-owned points such as curve.G1.g and curve.G2.g are stored jacobian, so the transfer detached their backing buffers on the main thread (byteLength -> 0). The next use of G1.g/G2.g then failed the size check in eq()/toRprLEM() with "invalid point size". This surfaced as 15 failing snarkjs "Full process" tests (powersoftau verify, groth16 setup, ...) that all cascaded from the first detached generator. The pairing inputs are single points; ALLOCSET already copies them into the worker's WASM memory, so transferring saved nothing and only created the aliasing hazard. Drop the transfer list and let them be structured-cloned. Other transfer sites (multiexp, fft, batchconvert) transfer freshly sliced buffers, so they are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
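The aliasing hazard can be reproduced in isolation with `structuredClone`, which follows the same transfer semantics as `postMessage`. The 96-byte array below is only a stand-in for a caller-owned point; nothing here is ffjavascript code.

```javascript
// Stand-in for a caller-owned point such as a stored generator (size is illustrative).
const generator = new Uint8Array(96).fill(7);

// Structured clone without a transfer list: the receiver gets a copy,
// the sender's view is untouched.
const cloned = structuredClone(generator);
const lengthAfterClone = generator.byteLength;

// With a transfer list the backing ArrayBuffer moves to the receiver and
// every view on the sender's side becomes detached.
const moved = structuredClone(generator, { transfer: [generator.buffer] });
const lengthAfterTransfer = generator.byteLength;

console.log(lengthAfterClone, lengthAfterTransfer, moved.byteLength);
```

Any later size check on `generator` sees a zero-length view, which matches the "invalid point size" failure described above.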
The prebuilt-wasm loader used two Node-only/too-new APIs that broke browser bundles:
- Buffer.from(gzipCode, "base64") -> "Buffer is not defined". Use atob (a global in browsers and Node >=16) to decode base64 into a Uint8Array.
- Response.bytes() -> "bytes is not a function" on engines that don't ship it yet (e.g. Chromium 129). Use the universally available arrayBuffer().

The surrounding Blob/DecompressionStream path is already browser-native, so the curve now builds in-browser. Verified via the snarkjs browser test suite (full setup/prove/verify in headless Chrome on the inlined build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…process.browser)
Makes ffjavascript loadable/usable in modern bundlers, browser extensions, and SES/Snap realms (in addition to Node/Bun/Deno):
- bn128/bls12381: cache the built curve in a module-local `let` instead of globalThis.curve_*. Assigning to a frozen globalThis (SES lockdown) threw at module load, so a Snap couldn't even import ffjavascript. The cache is module-private (not read elsewhere), so behavior is unchanged.
- random.getRandomBytes: drop `process.browser` (undefined under Vite/esbuild/SES -> ReferenceError). Prefer the Node crypto module (no per-call size limit), then Web Crypto chunked to its 65536-byte cap, then an insecure last resort.
- threadman: add a non-throwing `isNode` (process.versions.node); worker-source encoding uses Buffer on Node else Blob/btoa; single-thread auto-fallback keys off globalThis.Worker presence; concurrency uses navigator.hardwareConcurrency then os.cpus.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
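The Web Crypto branch has to respect the 65536-byte per-call limit of `getRandomValues`. A minimal sketch of that chunking follows; the function name and the injectable `webCrypto` parameter are assumptions for illustration, not the library's signature.

```javascript
// Per-call byte limit of crypto.getRandomValues().
const WEB_CRYPTO_CAP = 65536;

// Hypothetical helper: fill `n` random bytes using only calls that stay
// within the Web Crypto cap. `webCrypto` is injectable to keep the sketch
// usable where no global `crypto` exists.
function getRandomBytesChunked(n, webCrypto = globalThis.crypto) {
    const out = new Uint8Array(n);
    for (let off = 0; off < n; off += WEB_CRYPTO_CAP) {
        // subarray() is a view, so the random bytes land directly in `out`.
        webCrypto.getRandomValues(out.subarray(off, Math.min(off + WEB_CRYPTO_CAP, n)));
    }
    return out;
}
```

Requesting more than the cap in a single call throws a `QuotaExceededError` in conforming engines, which is why the loop is needed for large requests.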
…nd-stubbing The shipped build/browser.esm.js still imported `os` and `crypto` at the top (the browser rollup config only stripped web-worker), so a consumer bundling ffjavascript for the browser had to stub those builtins themselves. Add a package.json "browser" field mapping os/crypto to false; the browser rollup build (nodeResolve browser:true) now resolves them to empty, producing a clean browser.esm.js. The code already tolerates the empty stubs (`os && os.cpus`, `crypto && crypto.randomFillSync`). Node build/usage is unaffected (the browser field is ignored by Node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The default curve-load path used a dynamic import of wasmcurves' gzipped prebuilt and decompressed it with atob + DecompressionStream + Response -- a dynamic import (forbidden under SES) and web-stream APIs (absent in SES/Snap). bls12381 had no prebuilt at all: it recompiled the wasm via ModuleBuilder on every load.

Vendor the prebuilt wasm into ffjavascript and load it statically:
- src/wasm/bn128_wasm.js, src/wasm/bls12381_wasm.js: the UNCOMPRESSED prebuilt (base64 of the raw wasm + pointer offsets / moduli), generated from wasmcurves by the new dev script scripts/gen-wasm.js (npm run gen-wasm).
- src/wasm/base64.js: pure-JS base64 decoder (no atob/Buffer/DecompressionStream), so decoding works in Node, browsers, extensions and SES/Snap realms.
- bn128.js/bls12381.js default path: static import + manual decode. No dynamic import, no gzip. bls12381 no longer recompiles on every load.
- plugins path: kept, dynamic-imports wasmbuilder/wasmcurves, now moved to optionalDependencies (only needed when a caller passes `plugins`, or for gen-wasm). Runtime dependencies are now just web-worker.

Vendored bytes verified byte-identical to gunzip(gzip prebuilt) for bn128 and to the ModuleBuilder output (code + every pointer) for bls12381. Validated: ff 59/59, snarkjs 49/49, bls12381 pairing-bilinearity smoke, full snarkjs build, tutorial + browser e2e.

Tradeoff: uncompressed wasm is larger than the gzip variant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two changes so the default (single-threaded) curve-load path touches no SES/Snap-forbidden API at import or build time:
- threadman: compute the worker source lazily (getWorkerSource, memoized, called only when a worker is actually created) instead of at module load. The old module-top block touched Blob/btoa/URL.createObjectURL on import, which throws in a SES realm (no such globals, frozen). The existing `!isNode && !globalThis.Worker` guard already forces single-thread where no Worker exists (SES/Snap, limited browsers), so the worker path is never reached there.
- base64: prefer the native decoder (Buffer in Node, atob in browsers) and fall back to the pure-JS decoder only when neither is available (SES). The fallback lookup table is built lazily so the common path pays nothing extra.

Verified: all three base64 paths byte-identical; ff 59/59; SES-proxy (Blob/btoa/atob/Worker/DecompressionStream/Response/Buffer all blanked) builds both curves single-threaded and computes; full snarkjs build; snarkjs 49/49; tutorial + browser e2e (multi-thread paths) pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
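A decoder with that preference order might look like the sketch below. It is written from the description above, not copied from src/wasm/base64.js; the function names and the exact feature checks are assumptions.

```javascript
const B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
let b64Table = null; // built on first use only, so native paths pay nothing

function getB64Table() {
    if (b64Table === null) {
        b64Table = new Uint8Array(256).fill(255);
        for (let i = 0; i < B64_ALPHABET.length; i++) {
            b64Table[B64_ALPHABET.charCodeAt(i)] = i;
        }
    }
    return b64Table;
}

// Pure-JS fallback: uses no Buffer, atob or stream API.
function decodeBase64Pure(b64) {
    const table = getB64Table();
    let len = b64.length;
    while (len > 0 && b64[len - 1] === "=") len--; // ignore padding
    const out = new Uint8Array(Math.floor((len * 3) / 4));
    let acc = 0, bits = 0, o = 0;
    for (let i = 0; i < len; i++) {
        const c = b64.charCodeAt(i);
        const v = c < 256 ? table[c] : 255;
        if (v === 255) throw new Error("invalid base64 character");
        acc = ((acc << 6) | v) & 0xffff; // keep only the bits still needed
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[o++] = (acc >>> bits) & 0xff;
        }
    }
    return out;
}

// Prefer a native decoder when one exists, else fall back to pure JS.
function decodeBase64(b64) {
    if (typeof Buffer !== "undefined") {
        const b = Buffer.from(b64, "base64");
        return new Uint8Array(b.buffer, b.byteOffset, b.byteLength);
    }
    if (typeof atob === "function") {
        const bin = atob(b64);
        const out = new Uint8Array(bin.length);
        for (let i = 0; i < bin.length; i++) out[i] = bin.charCodeAt(i);
        return out;
    }
    return decodeBase64Pure(b64);
}
```

Comparing `decodeBase64` against `decodeBase64Pure` on the same input is a cheap way to check that all paths stay byte-identical.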
A real `ses` lockdown() test that builds both curves and runs pairings under a SES hardened profile, catching regressions plain unit tests can't (e.g. mutating globalThis at module load, or touching Blob/btoa at import).
- test/ses/lockdown.mjs: runs lockdown(), asserts intrinsics frozen, freezes globalThis, then dynamic-imports both curve modules INSIDE the hardened realm and builds each single-threaded, checking G1 generator validity and pairing bilinearity e(2P,Q) == e(P,2Q). Curve imports are wrapped in try/catch so a load-time globalThis mutation reports as a clean FAIL with a stack instead of an uncaught rejection.
- test/ses.test.js: mocha wrapper that runs the harness as its OWN child process via execFileSync. lockdown() is global and irreversible, so it must never run in the mocha process itself -- the child keeps it isolated while still gating CI on exit code. Placing the harness in test/ses/ (a subdirectory) keeps mocha's default non-recursive glob from auto-loading it.
- package.json: "test:ses" script + ses devDependency.

Also reword the existing SES comments in bn128/bls12381/base64/threadman to "SES hardened profile/realm" (drop MetaMask Snap naming).

Verified: npm run test:ses -> 6 ok, exit 0; npm test -> 60 passing (lockdown isolated, other suites unaffected); negative test (globalThis mutation injected at bn128.js load) -> clean FAIL, exit 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wasmcurves emits unoptimized, hand-assembled wasm. Run `wasm-opt -Oz` over it in gen-wasm.js before vendoring -- this is both a size and a speed win, since the input had no inlining / dead-local removal / instruction selection.
- scripts/gen-wasm.js: decode the wasmcurves base64, pipe through the binaryen `wasm-opt -Oz` binary (temp files, exact CLI semantics), re-encode. Only the `code` export changes; pointer offsets / moduli pass through untouched (wasm-opt preserves the data layout they reference).
- src/wasm/bn128_wasm.js: wasm 86601 -> 68635 bytes (-21%).
- src/wasm/bls12381_wasm.js: wasm 114939 -> 98160 bytes (-15%).
- binaryen added as a devDependency (gen-time only; not a runtime/optional dep).
- build/main.cjs, build/browser.esm.js: rebuilt to inline the optimized base64.

Correctness: only `code` differs from HEAD in both modules; the -Oz binary is byte-identical to the original on field mul/square/inverse, G1 timesFr/double/toAffine, MSM 2^16, and the full Fp12 pairing.

Performance (bn128, vs the original unoptimized wasm):
- microbench: frm_mul/f1m_mul -24%, pairing -24%, MSM 2^16 -26%, frm_square -12%
- end-to-end groth16 prove (authV3, 29MB zkey): ~1.29s -> ~1.15s (~10-11% faster)

Validated: ffjavascript 60, fastfile 17, snarkjs 49, SES lockdown harness, tutorial e2e (groth16/plonk/fflonk), and the puppeteer browser e2e (in-browser groth16 setup/contribute/beacon/verify/prove/verify) -- all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add G.multiExpAffineChunked(basesReader, totalBasesBytes, buffScalars, ...): a streaming affine multiexp where the bases are produced chunk-by-chunk by a reader (e.g. a direct sub-range file read) instead of being read whole and sliced. This removes the main-thread per-chunk slice copy and keeps only a few chunks resident (bounded in-flight reads with backpressure), so the full bases section never sits in RAM. Result is identical to multiExpAffine.

While here, collapse the duplication between the in-memory and streaming paths. Both now share:
- pointSize(inType), fnNameFor(inType), chunkSizeFor(nPoints, sScalar), geometry()
- _multiExpDispatch(getChunk, ..., maxInFlight, ...): the one chunk loop + sum.

In-memory multiExp passes a synchronous slice provider and maxInFlight=Infinity (dispatch-all -- behaviour identical to before); the streaming path passes the reader and maxInFlight=concurrency+2. _multiExpChunk is slimmed too (dropped the dead single-result doubling loop and the unused inType default/logger param). Net: 140 lines vs 166 originally, despite adding the whole streaming feature.

Backpressure cleanup uses op.finally so a slot is freed on BOTH fulfilment and rejection -- verified by injecting a read failure under maxInFlight=2: the failing chunk frees its slot, the error propagates, and the loop neither wedges nor leaks (clean exit under --unhandled-rejections=strict).

Tests (test/bn128.js): multiExpAffineChunked vs multiExpAffine equality for G1 (4 chunks) and G2 (2 chunks), plus the non-function-reader guard. ff 63 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
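The bounded-in-flight loop with `finally`-based slot release can be sketched generically. This is not `_multiExpDispatch` itself: `dispatchChunks`, its parameter order and the per-chunk callback are hypothetical, and the real code also sums the partial results.

```javascript
// Hypothetical sketch of a chunk loop with backpressure.
// getChunk(i) may be sync or async; processChunk(chunk, i) returns a promise or value.
async function dispatchChunks(nChunks, getChunk, processChunk, maxInFlight) {
    const inFlight = new Set();
    const all = [];
    for (let i = 0; i < nChunks; i++) {
        // Backpressure: do not start another read until a slot is free.
        while (inFlight.size >= maxInFlight) {
            await Promise.race(inFlight); // rethrows if an in-flight chunk failed
        }
        const op = Promise.resolve()
            .then(() => getChunk(i))
            .then((chunk) => processChunk(chunk, i));
        // finally() runs on fulfilment AND rejection, so a failed chunk frees its slot.
        const tracked = op.finally(() => inFlight.delete(tracked));
        // Mark as handled so a late failure is not reported as unhandled;
        // the error still surfaces through race() or Promise.all() below.
        tracked.catch(() => {});
        inFlight.add(tracked);
        all.push(tracked);
    }
    return Promise.all(all);
}
```

Passing `Infinity` as `maxInFlight` degenerates to dispatch-all, which mirrors how the in-memory path is described above.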
PAGE_SIZE gated on `Buffer.constants.MAX_LENGTH`, but those constants are on the `buffer` module, not the `Buffer` class, so the probe was always undefined and fell back to `1 << 30`. Drop the dead check and set 1 GiB explicitly: a deliberately conservative, fragmentation-friendly page -- NOT the engine's max single-buffer length (~8 GiB+ today), which would defeat paging and risk OOM on the multi-GiB G1/G2 buffers large circuits produce. No behaviour change (the value was already 1 GiB at runtime). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_fft copies its entire input up front (`buff.slice(0, byteLength)`) because the bit-reversal runs in place and the chunks are transferred -- so it must not touch the caller's buffer. When the caller is about to discard the input, that full- domain copy is pure overhead on the critical path (it blocks before any worker runs). Add a `consume` flag to fft/ifft (default false, unchanged): when set AND the input is a flat ArrayBuffer view, skip the copy and reverse/transfer the caller's buffer in place (its backing buffer is detached as a result). A BigBuffer input is still flattened (it has no single .buffer to transfer), so consume is only honoured for a Uint8Array. New test: Fr.fft consume == non-consume and detaches the input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
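The copy-versus-consume decision can be illustrated on its own. The helper below is hypothetical (the real `_fft` also flattens BigBuffer inputs, which is omitted here), and `structuredClone` with a transfer list stands in for handing the buffer to a worker.

```javascript
// Hypothetical helper: decide which buffer the FFT is allowed to mutate/transfer.
function prepareFftInput(buff, consume = false) {
    if (consume && buff instanceof Uint8Array) {
        return buff; // caller gives up ownership: no copy
    }
    return new Uint8Array(buff); // private copy with its own backing ArrayBuffer
}

// Stand-in for handing the working buffer to a worker thread.
function handToWorker(work) {
    return structuredClone(work, { transfer: [work.buffer] });
}

const keepInput = new Uint8Array(64).fill(1);
handToWorker(prepareFftInput(keepInput, false));
// keepInput is still usable: only the private copy was transferred.

const throwawayInput = new Uint8Array(64).fill(1);
handToWorker(prepareFftInput(throwawayInput, true));
// throwawayInput is now detached: its byteLength reads 0.
```

The default stays `false` so existing callers keep their buffers; only a caller that is about to discard the input should opt in.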
Regenerated from wasmcurves (feature/msm-signed-buckets): signed-bucket Pippenger that halves the bucket count per window. Bit-exact with the previous multiexp; ffjavascript curve tests pass and a groth16 authV3 prove+verify is OK.
Vendor wasmcurves' batch-affine MSM helper (src/wasm/msm_batch_wasm.js, ~3.6KB) and link it in every thread next to the main curve module: the worker INIT instantiates it against the main instance's field/group exports and the shared memory, once for G1 (f1m/g1m) and once for G2 (f2m/g2m). The same binary serves both curves and both groups (base-field size is a runtime parameter). engine_multiexp targets `<g>_multiexpAffineBatch`; the worker CALL dispatch resolves batch entry points first and falls back to the plain in-module `<g>_multiexpAffine` when the batch module is absent (same 5-arg signature). Set FF_NO_BATCH=1 to force the plain path (benchmark escape hatch). Measured (20-core box, bn128): single MSM 1.12x faster at 4k points, ~1.5x at 64k-105k (G1 and G2). Full groth16 prove: authV3 (2^16) ~11% faster (796 -> 703 ms median); sha256 (2^21) at parity -- under full worker concurrency the prove is memory-bandwidth-bound, which is also why the batch module's fill phase keeps ascending point order (near-sequential base reads). Proofs verify; ffjavascript and snarkjs suites pass, including SES (the hardened realm instantiates both modules in the single-thread path).
multiExpAffine and multiExpAffineChunked take an options object with `batch: "auto" | "enabled" | "disabled"` (booleans accepted as aliases; default "auto"; FF_NO_BATCH=1 still force-disables globally). "auto" routes a chunk to the batch-affine module only when its bases fit in ~2 MiB -- the measured regime where the batch fill's random-access set stays cache-resident under full worker concurrency and the fewer-multiplications advantage is real (+10% on a 2^16 prove). Larger chunks (e.g. 2^21 circuits, ~6 MiB bases per chunk) are bandwidth-bound: batch is parity at best there and costs extra per-worker scratch, so auto keeps them on the plain in-module multiexp (measured: PiA 0.3-0.4s auto/plain vs 1.4s forced-batch, and ~0.3 GB lower peak RSS).
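The three-state routing can be summarised in a few lines. The sketch below is an illustration only: the helper names are invented, the exact threshold constant is an assumption (the text says "~2 MiB"), and how the library treats unknown values is not stated, so throwing is a choice made for the example.

```javascript
// Assumed threshold: the description only says "~2 MiB".
const AUTO_BATCH_MAX_BASES_BYTES = 2 * 1024 * 1024;

// Booleans are accepted as aliases for "enabled" / "disabled".
function normalizeBatchMode(batch = "auto") {
    if (batch === true) return "enabled";
    if (batch === false) return "disabled";
    if (batch === "auto" || batch === "enabled" || batch === "disabled") return batch;
    throw new Error(`unknown batch mode: ${batch}`); // assumption: reject anything else
}

// Decide per chunk whether the batch-affine module should be used.
function shouldUseBatch(batch, chunkBasesBytes) {
    const mode = normalizeBatchMode(batch);
    if (mode === "auto") {
        // Small chunks stay cache-resident; larger ones are bandwidth-bound.
        return chunkBasesBytes <= AUTO_BATCH_MAX_BASES_BYTES;
    }
    return mode === "enabled";
}
```

Under these assumptions a ~6 MiB chunk from a 2^21 circuit resolves to the plain path in "auto" mode, while `{batch: "enabled"}` forces the batch module regardless of size.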
Pick up wasmcurves' CIOS Montgomery multiplication, and change the vendoring optimizer from -Oz to -O2: both -Oz and -O3 pessimize the hot mul by ~15% (61.5-61.8 ns vs 53.4 ns -- their aggressive local restructuring fights V8's register allocator), while -O2 is the fastest level measured and only ~11 KB larger. Net f1m_mul: 71.3 -> 54.7 ns (~23%). Full prove impact (all proofs verify, suites pass): authV3 2^16 median 703 -> 693 ms; sha256 2^21 median ~7.9 -> ~7.2 s (~8% -- the mul is compute even where the MSM fill is bandwidth-bound; Fr FFTs and buildABC benefit).
bn128 advertises `glv`; threadman threads it through both INIT paths and the worker binds the batch module's multiexpAffineGLV as the G1 batch entry point. G2 and bls12-381 keep the generic batch path. Behind the existing msmBatching gate, so auto/enabled/disabled semantics are unchanged. authV3 (2^16) full prove: 676 -> 595 ms median (min 554), proofs verify; sha256 unchanged (auto routes its large chunks to the plain path).
The batch instances now supply f_conj (f2m_conjugate for G2, a harmless copy for G1) and the G2 batch entry binds multiexpAffineGLS when the curve advertises glv; the wasm gates internally on chunk size. authV3 full prove: 595 -> 530 ms median (min 514); sha256 unchanged (its G2 chunks route to the plain path). Proofs verify; suites pass.
…irrors FF_NO_BATCH)
The GLS binding is decided at worker INIT, so the worker now registers both
G2 entry points ("g2m_multiexpAffineBatch" = GLS when the curve advertises
it, "...BatchNoGls" = generic batch) and the engine picks per call via
options.gls (default true; false disables the endomorphism path). The
env-var escape hatch is gone -- as an option this also works in browsers,
where process.env never existed. The plain-path fallback in the worker
dispatch strips either suffix.
…led" options.glv (G1) joins options.gls (G2): "auto" (default -- endomorphism path when the curve advertises it, wasm still gates internally on sizes) or "disabled" (generic batch accumulation); false accepted as an alias. The worker binds a NoGlv variant next to NoGls and the plain-path fallback strips either suffix. "auto" rather than "enabled" because the path is never unconditional -- curve support and chunk-size gates still apply.
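The naming scheme and the suffix-stripping fallback can be sketched as two small functions. Only `g2m_multiexpAffineBatch`, the `BatchNoGls` suffix and the plain `<g>_multiexpAffine` name appear in the commit messages; the `BatchNoGlv` spelling, both helper names and the `Set` of available exports are assumptions for illustration.

```javascript
// Pick the batch entry point for a group ("g1m" or "g2m") given the
// endomorphism option ("auto", "disabled", or false as an alias).
function batchEntryName(group, endomorphism = "auto") {
    const base = `${group}_multiexpAffineBatch`;
    if (endomorphism === "disabled" || endomorphism === false) {
        // Assumed suffixes: NoGls for G2 (from the source), NoGlv for G1.
        return base + (group === "g2m" ? "NoGls" : "NoGlv");
    }
    return base;
}

// Worker-side resolution: use the batch entry if it is registered,
// otherwise strip the Batch/BatchNoGls/BatchNoGlv suffix to reach the plain path.
function resolveEntry(available, name) {
    if (available.has(name)) return name;
    return name.replace(/Batch(NoGls|NoGlv)?$/, "");
}
```

With no batch module loaded, every variant collapses to the same plain function name, so callers do not need to know whether the batch module is present.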
The single-thread task manager instantiates the batch-affine MSM module next to the main curve module, so a hardened realm now performs a second WebAssembly.instantiate + cross-module import wiring -- exercised here for both curves, with the batch, endomorphism-disabled and plain multiexp paths checked for agreement, plus an Fr fft/ifft roundtrip. All under frozen intrinsics + frozen globalThis, no Worker.
Fully superseded by the per-call option ({batch: "disabled"}, exposed by
snarkjs as msmBatching). No runtime env vars remain in the library.
wasmbuilder and the wasmcurves generators are only reachable through the custom-plugins curve-build path, which is never taken when the prebuilt vendored wasm is used. inlineDynamicImports was folding the whole toolchain into build/browser.esm.js; marking the two packages external preserves the lazy import() so consumer bundlers split it into an async chunk that never loads unless plugins are passed. 885 KB -> 478 KB (-46%).
- rollup 3 -> 4 (+ latest @rollup plugins); both bundles rebuild cleanly
- eslint 10: migrate .eslintrc.cjs to flat eslint.config.mjs; fix the handful of real findings it surfaced (stale /* global */ comments that now count as redeclares, an unused buffReverseBits import left from the REVERSE-command refactor, dead sleep() helper, write-only wantToTerminate state, unused catch binding)
- chai 6 / mocha 11: 67 tests pass unchanged; SES lockdown harness passes
The committed manifest now resolves standalone (once the referenced branch is pushed). For local development keep an uncommitted file:../wasmcurves override in the working tree; the lockfile still records the local layout and gets regenerated after the branches are published.
The wasm-compile/memory-init/grow/terminate logs and the >25MB task dumps printed unconditionally into every consumer's output. 67 tests + SES pass.
git+ssh resolved URLs require SSH credentials; git+https installs anonymously (and exact-SHA GitHub deps fetch via codeload tarball).
PR stack (landing order):
Cross-repo deps are pinned by git+https commit refs; re-pin consumers if a branch gains commits before merge.
js-yaml/picomatch bumped in-range; serialize-javascript ^7.0.5 and diff ^8.0.3 overridden (mocha pins vulnerable ranges). npm audit clean.
Prover performance: batch/endomorphism MSM, CIOS wasm, batching modes
Summary
Companion to the `wasmcurves` MSM/CIOS PR (must land together: the vendored wasm here is regenerated from it). Adds the batch-affine MSM module wiring, a three-state MSM batching option, and re-vendors the rewritten field arithmetic. Also includes the earlier SES-hardening / vendored-wasm / streaming-multiexp / fft-consume work from `feature/sharedArrayBuffers`.

Full groth16 prove impact (snarkjs, interleaved A/B, all proofs verify):
Changes (MSM/CIOS era)
- … module next to the main curve module (shared memory, imports wired per group: f1m/g1m and f2m/g2m + `f_conj`). `glv` flag routes bn254 G1/G2 to the GLV/GLS entry points.
- `multiExpAffine(..., {batch: "auto"|"enabled"|"disabled"})`. `auto` uses the batch module only for chunks whose bases fit ~2 MiB (the measured cache-residency boundary); larger chunks are bandwidth-bound and stay on the plain path (faster AND lower memory there).
- `gen-wasm` switched from `wasm-opt -Oz` to `-O2`: both `-Oz` and `-O3` pessimize the hot CIOS mul by ~15% (V8 regalloc); `-O2` is the fastest level measured.
- `{batch: "auto"|"enabled"|"disabled"}`, `{glv: "auto"|"disabled"}`, `{gls: "auto"|"disabled"}` (exposed by snarkjs as `msmBatching`/`msmGlv`/`msmGls`). No runtime env vars.
- … toolchain, only reachable via the custom-`plugins` curve-build path) kept external so the lazy `import()` survives: consumer bundlers async-chunk it instead of inlining it. `build/browser.esm.js` 885 → 478 KB (−46%).
- … terminate logs printed unconditionally into every consumer's output).
- wasmcurves pinned by commit ref (git+https), local dev via uncommitted `file:` override.

Validation
- … (the hardened single-thread path instantiates both modules).
- … in-page.
Measured dead ends (documented, not merged)
For reviewer context, these were prototyped bit-exact and measured slower,
hence absent: wasm-SIMD Montgomery mul (0.76× vs scalar — no carries/widening
mul in SIMD128), four-step FFT (0.92–0.96× — the FFT is compute+copy bound
with baked root tables, not RAM-bandwidth bound), SharedArrayBuffer multiexp
marshalling (already overlapped behind compute).
🤖 Generated with Claude Code