MiniMax-M3 MXFP8 full sweep config for GB200 #1734
Add minimaxm3-fp8-gb200-dynamo-vllm to nvidia-master.yaml with 6
topologies covering the full concurrency range:
- TP4/TP8 (low latency, conc 4-64)
- TP4+EP4 agg + 1P+1D disagg (mid curve, conc 64-512)
- DEP4/DEP8 (high throughput, conc 256-2048)
All recipe YAMLs included under minimax-m3-gb200-fp8/{1k1k,8k1k}/.
Adopt the NVIDIA Dynamo vLLM runtime image (nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-minimax-m3-dev.1), the canonical M3 runtime from ai-dynamo/dynamo release/1.3.0-minimax-m3-dev.1.
Changes mirrored from that release's recipes/minimax-m3/vllm/disagg/MXFP8/deploy.yaml:
- dynamo.install: false. The runtime image bundles dynamo 1.3.0, so the prior 1.2.0 wheel install is dropped (srtctl defaults to install=true).
- attention-backend: FLASH_ATTN on every prefill/decode/agg engine
Benchmark-specific knobs kept over the reference's serving defaults: language-model-only (text-only), no-enable-prefix-caching (random data), scenario-trimmed max-model-len.
enroot's docker:// URI needs `#` to separate the registry host from
the image path; `nvcr.io/...` was parsed as a Docker Hub repo and 401'd
against registry-1.docker.io. Matches the existing nvcr.io# convention
in nvidia-master.yaml. Recipe container fields kept byte-identical to
the master image: field (srtslurm.yaml maps "${IMAGE}" -> squashfile).
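The separator rule described above can be illustrated with a small helper. This is a minimal sketch, not code from the PR: `to_enroot_uri` is a hypothetical name, and it assumes the first path component is a registry host only when it contains a dot or a colon.

```python
def to_enroot_uri(image: str) -> str:
    """Turn 'host/path:tag' into enroot's 'docker://host#path:tag' form.

    Hypothetical helper for illustration only. Assumption: a first path
    component containing '.' or ':' is a registry host (e.g. nvcr.io).
    """
    head, sep, rest = image.partition("/")
    if sep and ("." in head or ":" in head):
        # Explicit registry: enroot expects '#' between host and image path.
        return f"docker://{head}#{rest}"
    # No explicit registry host: leave the reference unchanged.
    return f"docker://{image}"
```

With this rule, `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-minimax-m3-dev.1` gains the `nvcr.io#` prefix, while `vllm/vllm-openai:minimax-m3` is left without a registry component.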
Replace the mostly-aggregated GB200 sweep (5 agg + 1 disagg) with a fully disaggregated sweep that splits prefill/decode over NixlConnector, mirroring the minimaxm2.5-fp8-gb200 reference. Every worker = one 4-GPU node since the 444 GB MXFP8 checkpoint can't fit in fewer.
Topologies (1k1k): 1P1D TP4 (low-lat), 1P1D TP4+EP4 (mid), 1P2D TP4+EP4 (decode-scaled), 2P1D TP4+EP4 (prefill-scaled), 1P1D DEP4 (max-tput), spanning conc 4-2048.
- add 4 disagg recipes; remove 8 orphaned agg recipes (1k1k + 8k1k)
- rewire nvidia-master.yaml search-space to the 5 disagg entries
- perf-changelog: describe disagg sweep; fix stale Image line (vllm/vllm-openai:minimax-m3 -> nvcr.io#.../vllm-runtime:1.3.0-minimax-m3-dev.1)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… transfer
Run 27478698552 failed: every disagg worker crashed at NixlConnector init with "NIXL is not available" (RuntimeError, vllm .../nixl/worker.py:248). The ai-dynamo vllm-runtime:1.3.0-minimax-m3-dev.1 image ships dynamo but NOT the nixl bindings (cupy missing too), so kv_connector=NixlConnector cannot initialize and the engine core never becomes healthy.
Revert to the pre-ed63c1e0 runtime path that pulls NIXL in via the dynamo wheel (same as the working minimaxm2.5-gb200 disagg recipes):
- image/container: vllm/vllm-openai:minimax-m3 (the m3_release build all other m3 entries already use)
- dynamo.install=true + wheel 1.2.0.dev20260526 (nixl is a dynamo dep)
- keep attention-backend FLASH_ATTN (added in the image-switch commit)
Also enable NVLink (MNNVL) KV transfer so NIXL doesn't fall back to TCP, mirroring the deepseek-v4 gb200 disagg recipes. On every prefill/decode env block:
  UCX_TLS=cuda_copy,cuda_ipc,tcp
  UCX_CUDA_IPC_ENABLE_MNNVL=y
  UCX_MEMTYPE_CACHE=n / UCX_MEMTYPE_REG_WHOLE=n
  NCCL_CUMEM_ENABLE=1 (cuMem-allocate buffers so they are IPC-exportable)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The narrow DEP8-max sweep showed no GB200 advantage over B200 because both cap at an 8-GPU NVLink island. Exploit NVL72's rack-scale NVLink with wide expert parallelism spanning multiple nodes, mirroring the deepseek-v4 "megamoe" ladder (DEP = data-parallel attention + expert-parallel):
- 1P1D TP4 (2n) low-latency, conc 4-64
- 1P1D DEP8 (4n) mid, EP8/16-experts-per-rank, conc 128-512
- 1P1D DEP8->DEP16 (6n) wide decode (EP16), conc 512-2048
- 2P1D DEP8->DEP16 (8n) prefill-scaled, conc 2048-4096
- 4P1D DEP8->DEP16 (12n) max throughput, conc 4096-8192
M3 has 128 routed experts (top-4), so EP8/EP16 shard cleanly. EP16 across 16 GPUs / 4 nodes is the regime B200 physically can't reach.
Attention: FLASH_ATTN -> FLASHINFER (trtllm-gen) on all GB200 recipes to exploit Blackwell. Requires the :minimax-m3 image rebuilt from m3_release HEAD 022448dd (vllm-project/vllm#45381), which gates trtllm-gen page>=128.
Also add GB200 perf/NVLink-KV knobs from the deepseek-v4 reference: numa-bind (Grace) and enable-sleep-mode (cuMem allocator so the KV cache is IPC-exportable over the MNNVL fabric), alongside the existing UCX MNNVL env.
Replaces the four narrow EP4 recipes; keeps 1P1D TP4 for low latency.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
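The node counts and expert sharding quoted in the ladder above can be checked with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, each worker is assumed to occupy whole 4-GPU nodes, and "DEP8->DEP16" is read as a DEP8 prefill worker feeding a DEP16 decode worker.

```python
GPUS_PER_NODE = 4      # GB200 node size stated in the PR
ROUTED_EXPERTS = 128   # M3 routed experts, per the commit message


def nodes(prefill_workers, prefill_gpus, decode_workers, decode_gpus):
    """Total nodes for an xPyD topology (assumes workers never share a node)."""
    total_gpus = prefill_workers * prefill_gpus + decode_workers * decode_gpus
    assert total_gpus % GPUS_PER_NODE == 0, "topology must fill whole nodes"
    return total_gpus // GPUS_PER_NODE


def experts_per_rank(ep_size):
    """Experts held by each rank when the routed experts shard evenly."""
    assert ROUTED_EXPERTS % ep_size == 0, "EP size must divide the expert count"
    return ROUTED_EXPERTS // ep_size


# (prefill workers, GPUs per prefill worker, decode workers, GPUs per decode worker)
ladder = {
    "1P1D TP4": nodes(1, 4, 1, 4),
    "1P1D DEP8": nodes(1, 8, 1, 8),
    "1P1D DEP8->DEP16": nodes(1, 8, 1, 16),
    "2P1D DEP8->DEP16": nodes(2, 8, 1, 16),
    "4P1D DEP8->DEP16": nodes(4, 8, 1, 16),
}
```

Under these assumptions the ladder evaluates to 2, 4, 6, 8, and 12 nodes, matching the `(2n)` through `(12n)` labels, and EP8 and EP16 give 16 and 8 experts per rank respectively.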
1k1k TP4 low-conc tuning: stream-interval 1 (was 128 decode / 32 prefill), cudagraph cap 128 (was 512), conc range extended to 1-64 (was 4-64) to match B200 coverage.
8k1k sweep: 5 disagg recipes mirroring the 1k1k megamoe ladder (TP4, DEP8, DEP8→DEP16, 2P1D, 4P1D) with max-model-len 9472 (74×128 blocks = ISL+OSL+256 headroom). Concurrencies shifted ~4x lower for 8x heavier prefill: TP4 1-16, DEP8 32-128, DEP8→DEP16 128-512, 2P1D 512-1024, 4P1D 1024-2048.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
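The max-model-len figure for 8k1k follows from simple arithmetic, sketched here. Assumptions: "8k1k" means ISL 8192 and OSL 1024, the block size is 128 tokens, and rounding up to a whole block is my own addition (the sum happens to land exactly on a block boundary). `max_model_len` is a hypothetical helper name.

```python
BLOCK_SIZE = 128  # tokens per KV-cache block, as in "74×128 blocks"


def max_model_len(isl, osl, headroom=256):
    """Return (length, blocks) with the length rounded up to whole blocks."""
    total = isl + osl + headroom
    blocks = -(-total // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE, blocks


length_8k1k, blocks_8k1k = max_model_len(8192, 1024)
```

For the assumed 8192 + 1024 + 256 tokens this gives 9472 tokens in 74 blocks, which agrees with the values in the commit message.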
Comment out all conc > 64 entries (1k1k DEP8/DEP16/2P1D/4P1D and all
8k1k high-conc) to focus sweep budget on low-concurrency tuning.
Add two new 1k1k experiments at conc 1-64 alongside the existing
1P1D TP4 baseline:
- 1P2D TP4 (3 nodes): 2 decode workers halve per-worker batch
- 1P1D TP4→TP8 (3 nodes): wider decode TP spreads forward pass
across 8 GPUs over NVL72
All three share the low-conc tuning (stream-interval 1, cudagraph
cap 128, FLASHINFER, block-size 128).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onc gap
B200 TEP8 (TP8+EP8) achieves 11.68ms TPOT at conc 1 vs GB200 TP8's 15.29ms; the gap is entirely from expert parallelism splitting 128 MoE experts across 8 ranks. Add enable-expert-parallel: true to the TP8 decode recipe and update nvidia-master.yaml decode ep: 1→8 so result JSON reflects TEP8.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… ISL
GB200 8k1k only had TP4 (2n) giving 18.50ms TPOT at conc 1 vs B200 TEP8's 11.57ms. Add 1P1D TP4→TEP8 (3n) 8k1k recipe mirroring the 1k1k TEP8 config that already closed the gap there (12.34ms vs 11.68ms).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
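To put the conc-1 TPOT figures quoted in these commits on one scale, the relative gaps can be computed directly. This is only a worked calculation on the numbers given in the commit messages; `relative_gap` is a hypothetical helper name.

```python
def relative_gap(tpot_ms, baseline_ms):
    """Fractional slowdown of tpot_ms relative to baseline_ms."""
    return tpot_ms / baseline_ms - 1.0


# All figures are conc-1 TPOT in milliseconds, taken from the commit messages.
gap_1k1k_tp8 = relative_gap(15.29, 11.68)   # GB200 TP8 vs B200 TEP8 (1k1k)
gap_1k1k_tep8 = relative_gap(12.34, 11.68)  # GB200 TEP8 vs B200 TEP8 (1k1k)
gap_8k1k_tp4 = relative_gap(18.50, 11.57)   # GB200 TP4 vs B200 TEP8 (8k1k)
```

On these inputs the 1k1k gap shrinks from roughly 31% with TP8 to roughly 6% with TEP8, while the 8k1k TP4 configuration starts about 60% behind.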
Drop 1P1D-TP4 (2n) and 1P2D-TP4 (3n) entries from both 1k1k and 8k1k. TEP8 dominates at every concurrency — TP4 baseline is 50% slower at conc 1 and 1P2D gave <2% TPOT improvement for 50% more GPUs. Active sweep is now TEP8-only: 1k1k conc 1-64, 8k1k conc 1-16. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable DEP8 (4n), DEP8→DEP16 (6n), 2P1D (8n), 4P1D (12n) for both 1k1k and 8k1k alongside the optimized TEP8 low-conc configs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test whether EP4 on 4 decode GPUs (2 nodes total) improves TPOT over pure TP4 on GB200's NVL72 NVLink. B200 showed TEP4 slightly worse than TP4 intra-node; NVL72 all-to-all may differ. All other entries commented out for this isolated test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The validator requires new entries appended at the file end (byte prefix must match origin/main exactly). The previous commit inserted mid-file, shifting entry indices and triggering the immutability check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
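The append-only rule described above amounts to a byte-prefix comparison. The validator's actual code is not shown in this conversation, so the following is a minimal sketch of the idea with a hypothetical function name.

```python
def is_append_only(base: bytes, candidate: bytes) -> bool:
    """True when candidate equals base plus zero or more appended bytes.

    Sketch of the immutability check: every byte of the base file
    (e.g. the origin/main version) must be an unchanged prefix.
    """
    return candidate.startswith(base)
```

An entry inserted mid-file changes bytes inside the prefix and fails the check, while the same entry appended at the end passes.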
…o feat/minimax-m3-gb200-sweep
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27789705560
…weep Multi-arch image (arm64+amd64) with upstream head_ratio fix baked in. Update all 14 GB200 disagg recipes (1k1k + 8k1k), nvidia-master.yaml, and changelog entry (no longer evals-only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27792745830
minimax-m3-0618 likely cherry-picks vLLM PR #45723 (gemm1_alpha for FP8 TRT-LLM MoE) but ships flashinfer ≤0.6.13 which lacks that kwarg (flashinfer PR #3504), causing TypeError at runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27794230681
Per PR #1809 pattern: Marlin MoE backend for TP-only configs (no EP, no DP-attention). Applied to 6 recipes affecting 9 worker sections (prefill and/or decode). EP/DP-attention workers stay on default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
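The selection rule stated above (Marlin only for workers with neither EP nor DP-attention) can be written as a one-line predicate. This is a sketch: the function name and the returned labels are hypothetical and are not the recipe keys used in the PR.

```python
def moe_backend(expert_parallel: bool, dp_attention: bool) -> str:
    """Pick the MoE backend label for one worker section.

    Hypothetical labels: 'marlin' for TP-only workers, 'default' otherwise.
    """
    return "default" if (expert_parallel or dp_attention) else "marlin"
```

A TP4 decode worker would map to `marlin`, while TEP8 or DEP8 workers would stay on `default`.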
…-sweep # Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27808018367
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit fc4af8b.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27813364923
Switch all prefill from DEP8 (TP1 DP8 EP, 2 nodes) to TEP4 (TP4+EP4, 1 node), halving per-worker node footprint. Decode configs follow B300 run 27630519240 optimal points (spec=none):
- conc 8-32: TP4+Marlin (no EP)
- conc 64-256: TEP4 (TP4+EP4)
- conc 512/1024: TEP8 (8k1k) or DEP8 (1k1k), 8 workers × 18n
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename 2p8d/18n recipes to 2p2d/6n: 2 prefill (2 nodes) + 2 decode (4 nodes) = 6 nodes total. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27821886221
Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB200 NVL72 (4 GPU/node):
- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
- req_rate: inf for all benchmarks
- FLASHINFER attention, GB200 UCX env vars preserved
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27879569932

Summary
- Add minimaxm3-fp8-gb200-dynamo-vllm to nvidia-master.yaml with 6 topologies: TP4, TP8, TP4+EP4, 1P+1D disagg, DEP4, DEP8
- … /mnt/lustre01/models/MiniMax-M3-MXFP8
- … extra_mount and HF_HOME from recipes
- … minimax-m3-gb200-fp8/{1k1k,8k1k}/
Test plan
Note
Low Risk
Changes are benchmark orchestration YAML and GB200 CI launcher only; no application runtime logic, with risk limited to misconfigured sweep jobs or cluster resource usage.
Overview
Adds minimaxm3-fp8-gb200-dynamo-vllm to nvidia-master.yaml as a multinode dynamo-vllm disagg sweep for MiniMax-M3 MXFP8 on GB200 (4 GPUs/node), mirroring the validated GB300/B300 layout: fixed DEP4 prefill workers and decode variants (TP4+Marlin, TEP8, DEP4, DEP8) at 1k/1k and 8k/1k, each pointing at new Slurm recipe YAMLs under minimax-m3-gb200-fp8/.
Introduces 15 matching srt-slurm-recipes configs (NixlConnector, FLASHINFER, FP8 KV, staged minimax-m3-mxfp8 model path, nightly-aarch64 image) and documents the GB200 sweep in perf-changelog.yaml (nightly image / GQA disagg head-ratio fix, Marlin on TP-only decode).
Updates launch_gb200-nv.sh so M3 FP8 uses /mnt/lustre01/models/MiniMax-M3-MXFP8, reuses the minimax watchtower shared-FS staging paths for minimaxm3, copies the new recipe tree into srt-slurm, and replaces bare enroot import with a locked, atomic import_squash helper to avoid concurrent matrix jobs corrupting shared squash images.
Reviewed by Cursor Bugbot for commit d8e17d4. Bugbot is set up for automated code reviews on this repo.
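The locked, atomic import pattern mentioned in the overview can be sketched as follows. The PR's import_squash helper lives in launch_gb200-nv.sh and its body is not shown here, so this Python version only illustrates the general idea. Assumptions: the shared filesystem honors `flock`, the `do_import` callback (a stand-in for the real import command) writes the image to the path it is given, and the temporary file sits on the same filesystem as the destination.

```python
import fcntl
import os
import tempfile


def import_squash(dest, do_import):
    """Create dest exactly once, even with concurrent callers.

    Returns True if this call produced the image, False if it already existed.
    """
    with open(dest + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent importers
        try:
            if os.path.exists(dest):
                return False  # another job already finished the import
            fd, tmp = tempfile.mkstemp(
                dir=os.path.dirname(dest) or ".", suffix=".tmp"
            )
            os.close(fd)
            try:
                do_import(tmp)         # write the image to a private temp path
                os.replace(tmp, dest)  # atomic rename: readers never see a partial file
            except BaseException:
                if os.path.exists(tmp):
                    os.unlink(tmp)
                raise
            return True
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

The lock prevents two matrix jobs from importing at the same time, and the rename ensures a job that skips the lock check still never reads a half-written image.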