[NV] Add MiniMax M3 B300 Dynamo vLLM recipes by Oseltamivir · Pull Request #1863 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-19T21:28:59Z

Summary

Recreate the latest changes from [NV] Add MiniMax M3 B300 Dynamo vLLM recipes #1787 on current main.
Add MiniMax-M3 MXFP8 B300 disaggregated Dynamo + vLLM benchmarks for 1k1k and 8k1k STP.
Add the matching srt-slurm recipes and B300 launcher integration.
Update the master configuration and all 16 recipes to vllm/vllm-openai:minimax-m3-0618-x86_64-cu130.

Validation

bash -n runners/launch_b300-nv.sh
Parsed all changed YAML files with PyYAML.
Generated 16 matching sweep entries with generate_sweep_configs.py.
Confirmed all recipe containers match the master configuration.
Confirmed the container manifest exists.
git diff --check
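The container-consistency step in the list above can be sketched as follows. This is a minimal illustration, not the check that was actually run: the helper name `find_container_mismatches` and the flat `{recipe: image}` mapping are hypothetical, and in practice the images would presumably be read from the parsed recipe YAMLs.

```python
# Hypothetical helper: the real validation script is not shown in this PR.
MASTER_IMAGE = "vllm/vllm-openai:minimax-m3-0618-x86_64-cu130"


def find_container_mismatches(master_image, recipe_images):
    """Return {recipe: image} for every recipe whose image differs from master."""
    return {
        name: image
        for name, image in sorted(recipe_images.items())
        if image != master_image
    }


# Illustrative input; two of the 16 recipe paths named in the review threads.
recipe_images = {
    "1k1k/1p1d-dep2-tep8-1k1k.yaml": MASTER_IMAGE,
    "8k1k/4p2d-dep2-dep8-8k1k.yaml": MASTER_IMAGE,
}
mismatches = find_container_mismatches(MASTER_IMAGE, recipe_images)
```

An empty result means every recipe is pinned to the same image as the master configuration.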

Note

Low Risk
Changes are benchmark orchestration, YAML recipes, and Slurm launcher wiring only; no production inference paths. Operational risk is limited to patched container behavior during benchmark jobs.

Overview
Adds MiniMax-M3 MXFP8 disaggregated Dynamo + vLLM coverage on B300 for fixed-sequence 1k/1k and 8k/1k sweeps, wired through minimaxm3-fp8-b300-dynamo-vllm in nvidia-master.yaml and 16 matching srt-slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b300-fp8 (prefill DEP2, decode variants TEP8, DEP8, DEP4, and TP4+Marlin with optional colocation and CUDA IPC).
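The recipe filenames that appear in this PR's review threads (for example `1p1d-dep2-tep8-1k1k.yaml`) seem to encode worker counts, prefill and decode parallelism, and the sequence shape. A rough parser under that assumption is sketched below; the pattern is inferred from three filenames only, `RecipeName` and `parse_recipe_name` are hypothetical names, and the TP4+Marlin or colocated recipes may well use names this pattern does not match.

```python
import re
from dataclasses import dataclass

# Assumed layout: <P>p<D>d-<prefill>-<decode>-<isl><osl>.yaml
RECIPE_NAME = re.compile(
    r"^(?P<p>\d+)p(?P<d>\d+)d-"
    r"(?P<prefill>[a-z]+\d+)-(?P<decode>[a-z]+\d+)-"
    r"(?P<seq>\d+k\d+k)\.yaml$"
)


@dataclass(frozen=True)
class RecipeName:
    prefill_workers: int
    decode_workers: int
    prefill_parallelism: str
    decode_parallelism: str
    sequence: str


def parse_recipe_name(filename):
    """Parse a recipe filename, or return None if it does not fit the pattern."""
    match = RECIPE_NAME.match(filename)
    if match is None:
        return None
    return RecipeName(
        prefill_workers=int(match["p"]),
        decode_workers=int(match["d"]),
        prefill_parallelism=match["prefill"],
        decode_parallelism=match["decode"],
        sequence=match["seq"],
    )


parsed = parse_recipe_name("4p2d-dep2-dep8-8k1k.yaml")
```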

Extends runners/launch_b300-nv.sh for minimaxm3-fp8 + dynamo-vllm. The launcher now sets a local model path, overlays the recipes into NVIDIA srt-slurm on sa-submission-q2-2026, and runs minimax-m3-vllm-fixes.sh via srtctl --setup-script; that script applies runtime patches to the pinned 0618 image for MSA prefill top-k contiguity and NIXL heterogeneous-TP KV length checks. It also backports srt-slurm#38 for node-IP discovery and injects Slurm exclude: b300-018 (overridable via MINIMAX_M3_SLURM_EXCLUDED_NODELIST), followed by a post-submit sanity check on the rendered sbatch script.
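The node exclusion and the post-submit check described above could look roughly like the sketch below. The launcher itself is a bash script and its code is not shown here, so this is only an illustration in Python: the function names are hypothetical, and whether an empty override falls back to the default is an assumption (this sketch only falls back when the variable is unset). The `#SBATCH --exclude=<nodelist>` directive is the standard sbatch form.

```python
import os
import re

DEFAULT_EXCLUDED_NODELIST = "b300-018"
OVERRIDE_ENV = "MINIMAX_M3_SLURM_EXCLUDED_NODELIST"

# Matches "#SBATCH --exclude=<nodelist>" or "#SBATCH --exclude <nodelist>".
EXCLUDE_DIRECTIVE = re.compile(
    r"^#SBATCH[ \t]+--exclude[= \t][ \t]*(\S+)[ \t]*$", re.MULTILINE
)


def resolve_excluded_nodelist(environ=None):
    """Return the nodelist to exclude, honoring the override variable if set."""
    environ = os.environ if environ is None else environ
    return environ.get(OVERRIDE_ENV, DEFAULT_EXCLUDED_NODELIST)


def sbatch_excludes(script_text, nodelist):
    """True if the rendered sbatch script excludes exactly this nodelist."""
    return any(
        m.group(1) == nodelist for m in EXCLUDE_DIRECTIVE.finditer(script_text)
    )


rendered = "#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --exclude=b300-018\nsrun true\n"
ok = sbatch_excludes(rendered, resolve_excluded_nodelist({}))
```

Checking the rendered script rather than the recipe guards against the exclusion being dropped somewhere between the recipe and the final sbatch submission.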

Documents the benchmark matrix and node exclusion in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit 03d27e7. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-19T21:29:07Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-19T21:43:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27849356286
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27849356286

github-actions · 2026-06-19T22:10:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27850405431
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27850405431

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 71ba2ea.

github-actions · 2026-06-19T22:46:58Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27850645617
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27850645617

github-actions · 2026-06-19T23:11:04Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27852040297
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27852040297

github-actions · 2026-06-20T02:08:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27852563848
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27852563848

Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB300 NVL72 (4 GPU/node):

- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
- kv-cache-dtype: fp8, req_rate: inf for all benchmarks
- GB300 MNNVL/NVLS env vars + sbatch mem=0 preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB200 NVL72 (4 GPU/node):

- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
- req_rate: inf for all benchmarks
- FLASHINFER attention, GB200 UCX env vars preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
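The GPU and node budgets quoted for these 4-GPU-per-node adaptations can be reproduced with a little arithmetic. The sketch below assumes the trailing digit of each parallelism label is the GPU count per worker (as in "DEP2 (TP1 DP2 EP, 2 GPU/worker)"), and it ignores colocation, so sub-node workers such as DEP2 are counted as one node each. The function names are hypothetical.

```python
import math

# Assumption: GPUs per worker equal the trailing digit of the label.
GPUS_PER_WORKER = {
    "dep2": 2,
    "tp4": 4, "tep4": 4, "dep4": 4,
    "tp8": 8, "tep8": 8, "dep8": 8,
}


def total_gpus(prefill_workers, prefill_kind, decode_workers, decode_kind):
    """Total GPUs used by a <P>p<D>d disaggregated topology."""
    return (
        prefill_workers * GPUS_PER_WORKER[prefill_kind]
        + decode_workers * GPUS_PER_WORKER[decode_kind]
    )


def nodes_without_colocation(
    prefill_workers, prefill_kind, decode_workers, decode_kind, gpus_per_node=4
):
    """Node count when every worker gets whole nodes to itself."""
    per_prefill = math.ceil(GPUS_PER_WORKER[prefill_kind] / gpus_per_node)
    per_decode = math.ceil(GPUS_PER_WORKER[decode_kind] / gpus_per_node)
    return prefill_workers * per_prefill + decode_workers * per_decode


# Figures quoted elsewhere in this page's commit history:
# 5P12D TP4 -> 17 nodes / 68 GPUs; 2P+7D DEP8 -> 18 nodes.
five_p_twelve_d = nodes_without_colocation(5, "tp4", 12, "tp4")
two_p_seven_d = nodes_without_colocation(2, "dep8", 7, "dep8")
```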

# Conflicts:
#	perf-changelog.yaml

github-actions · 2026-06-20T05:34:23Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27857055295
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27857055295

github-actions · 2026-06-20T05:48:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27861632221
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27861632221

github-actions · 2026-06-20T08:29:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27861979524
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27861979524

github-actions · 2026-06-20T11:06:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27865635871
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27865635871

github-actions · 2026-06-20T11:25:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27865635871
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27865635871

github-actions · 2026-06-20T13:47:14Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27872945965
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27872945965

github-actions · 2026-06-20T16:32:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27873218834
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27873218834

* feat: MiniMax-M3 MXFP8 full sweep config for GB300

  Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies covering the full concurrency range:
  - TP4/TP8 (low latency, conc 4-64)
  - TP4+EP4 agg + 1P+1D disagg 2-node + 1P+1D collocated (mid, conc 64-512)
  - DEP4/DEP8 (high throughput, conc 256-2048)

  All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/. GB300 recipes include srun_options mem=0 (CW DefMemPerCPU cgroup fix) and omit safetensors-load-strategy prefetch (host-memory limit).

* chore: update perf-changelog pr-link to #1735

* Update runner name in nvidia-master.yaml

* fix: add sbatch_directives mem=0 + cpus-per-task=72 to M3 GB300 recipes

  srun_options.mem=0 only grants a step the job's existing allocation; on gb300-cw (DefMemPerCPU=4096, no DefCpuPerGPU) the job itself was only allocated 4 GB/node and workers were cgroup-OOM-killed during engine init (run 27452273567: oom_kill in StepId=7409.7 on slurm-gb300-133-193, worker RLIMIT showed 4194304 KB). The canary passed because it landed on gb300-nv, which doesn't enforce the cap. Mirrors the sbatch_directives block of the DSV4 agentic recipes.

* fix: run M3 GB300 workers cache-only (HF_HUB_OFFLINE=1) to avoid fetch_model lock race

  With the mem fix in place, run 27452976271 cleared the OOM but hit a new failure: both nodes of the TP8-2n job called dynamo fetch_model within 200ms (191 @ :23.637, 193 @ :23.833), 191 took the per-blob .lock on the shared /mnt/vast/hf-home cache and held it verifying the 444 GB snapshot, 193 retried ~6.4s and died 'Lock acquisition failed' (dynamo's rust hub doesn't wait like Python hf_hub). The launcher already pre-stages and verifies the snapshot offline before submit, so the workers never need to fetch. Setting HF_HUB_OFFLINE=1 in every worker env block makes dynamo serve cache-only and skip the download lock entirely, so co-fetching workers no longer collide. Applied to all agg + disagg (prefill/decode) env blocks across the 11 recipes.
* fix: re-pin utils/aiperf to live cjq/agentx-v0.3 tip (ff2b646c)

  The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/force-pushed (now at ff2b646c) which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856).

  Re-pin to the current cjq/agentx-v0.3 tip (the branch .gitmodules already declares), which is live/fetchable and contains the prior aiperf history as an ancestor. This makes the pin and the declared branch consistent again.

* MiniMax-M3 GB300: disagg-only sweep + multi-node-NVLink KV transfer

  Replace the aggregated M3 GB300 topologies with disaggregated-only, and enable NixlConnector KV transfer over multi-node NVLink on every disagg recipe.

  On gb300-cw the cross-node prefill->decode KV handoff was silently falling back to RDMA/TCP (~268 MB/s, ~1400 tiny descriptors for M3 MSA cache), which was the disagg ceiling. Setting UCX_CUDA_IPC_ENABLE_MNNVL=y plus --enable-cumem-allocator (VMM-registers KV so NIXL uses cuda_ipc across the NVL fabric) lifts it to ~1.4-1.7 GB/s and gives +17% / +23% / +49% out tok/s/gpu at conc 64 / 128 / 256 (jobs 7490 base vs 7493 MNNVL, 1P1D TP4EP4). This is a GB300-only win: B300 8-GPU IB islands cannot move KV over multi-node NVLink.

  Sweep (1k1k), all MNNVL:
  - 1P1D TP4+EP4 collocated 1n (8 GPU), conc 8-256 - low/mid latency
  - 1P1D TP4+EP4 split 2n (8 GPU), conc 64-512 - mid throughput
  - 1P + DP16+EP wide decode 5n (20 GPU), conc 512-2048 - max throughput (decode keeps scaling on NVL where 1P1D saturates: ~1213 vs ~810 out tok/s/gpu @ conc 1024)

  Removes all agg-gb300 recipes (1k1k + 8k1k); applies MNNVL to the 8k1k disagg recipe too for consistency.
* M3 GB300: add 8k1k disagg sweep; drop unschedulable collocated-1n

  The collocated-1n topology (disagg-gb300-1p1d-tp4ep4-1n) declared gpus_per_node: 8, but gb300-cw nodes have 4 GPUs, so sbatch rejects it with "Requested node configuration is not available" even on a fully idle cluster (confirmed: fails standalone with 28 nodes free; the split-2n and wide-decode at gpus_per_node 4 schedule fine). It was an 8-GPU-node template artifact that never reached sbatch before. Remove it (1k1k + 8k1k) and let the split-2n cover the low-latency end (conc extended down to 8).

  Add the 8k1k (isl 8192) scenario mirroring 1k1k with the two valid disagg shapes (split-2n + wide DP16 decode), MNNVL KV transfer on both, seq params retuned for long context (max-model-len 9472) and lower concurrency.

* M3 GB300: add rack-saturating balanced-ratio TP-ep1 max-throughput disagg config

  Adds a 17-node (full-rack) disagg topology to the M3 GB300 sweep (1k1k + 8k1k) from on-cluster tuning (gb300-cw):
  - PREFILL is the binding bottleneck, not decode width or KV transfer: a single prefill worker left ~3967 reqs queued and starved 64 decode GPUs. Balancing to 5 prefill : 12 decode (TP4) cleared the backlog and lifted throughput +57% (535 -> 843 out tok/s/gpu @ conc 2048).
  - TP-only decode (ep1, no expert parallelism) per the Qwen3.5-397B-A17B recipes (closest M3 analog); M3 wide-EP/DP-attention all-to-all was slower and DP32 < DP16 per-GPU.
  - Kept the existing 1p1d (low/mid latency) and dep16dec (wide-decode) topologies so CI measures the full Pareto rather than replacing them.

  NixlConnector KV transfer stays on multi-node NVLink (MNNVL + cumem); note KV transfer was verified NOT to bottleneck throughput (doubling its bandwidth via num_threads changed end-to-end tok/s/gpu by ~0).

  recipe yamls line up 1:1 with the nvidia-master.yaml CONFIG_FILE references.
* M3 GB300: replace dep16dec with 1P4D TP4-ep1; add prefill-heavy 10P7D for 8k1k

  DSR1 GB300 patterns show wide-EP decode hurts M3's MoE all-to-all; independent TP4 decode workers are strictly better. Also, 8k1k is prefill-bound (616-req backlog at 5P:12D); rebalance to 10P:7D per DSR1/DSV4's prefill-heavy long-context ratios.

  Changes:
  - Replace dep16dec (EP16 single decode) with 1P+4D (4x TP4 ep1 decode) for both 1k1k and 8k1k, same 5 nodes
  - Add 10P+7D TP4 ep1 (17 nodes) for 8k1k max throughput
  - Tighten concurrency ranges: 1P1D [4-32], 1P4D [64-512], 5P12D/10P7D [1024+]

* [Klaud Cold]minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI300X recipe (#1749)

  * minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 MI300X recipe

    Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi300x-vllm, based on the MI300X non-MTP recipe + the MI355X MTP recipe. Keeps the MI300X serve shape (BF16 KV cache, since gfx942 lacks calibrated ROCm FP8 attention scales, plus --no-enable-prefix-caching, TRITON_ATTN, --enforce-eager, minimax_m3 parsers) and adds the Inferact/MiniMax-M3-EAGLE3 draft via --speculative-config (method eagle3, 3 spec tokens) + chat-template prompts.

    Carries the same in-place EAGLE3 patch as the MI355X MTP recipe: the shipped ROCm image's AMD MiniMax-M3 model lacks SupportsEagle3, so the recipe patches the installed amd/model.py before serving (functionstackx/vllm#1, upstream vllm-project/vllm#45546; validated green on MI355X). Idempotent; hard-fails on base drift.

    TP8-only search space (gfx942 192 GB is memory-tight, like H100), TP8 latency rows started at conc 1, matching the H100/MI355X MTP recipes. Also adds SPEC_SUFFIX to launch_mi300x-amds.sh so spec-decoding=mtp routes to the _mtp script (the launcher hardcoded _mi300x.sh).
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

  * perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp (#1749)

    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* [AMD] perf: enable MiniMax M3 CUDA graphs on MI300X (#1750)

  * feat: add MiniMax M3 MI300X day-zero benchmark
  * chore: link MiniMax M3 MI300X changelog
  * fix: mount ROCm devices on MI300X
  * fix: disable prefix caching for MI300X MiniMax M3
  * fix: use bf16 kv cache for MI300X MiniMax M3
  * perf: enable MI300X MiniMax M3 CUDA graphs
  * chore: link MI300X CUDA graph changelog

* [Klaud Cold] minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager, VLLM_USE_BREAKABLE_CUDAGRAPH=0) (#1756)

  * minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager)

    Remove --enforce-eager from the MI300X EAGLE3 MTP recipe and set VLLM_USE_BREAKABLE_CUDAGRAPH=0, matching the non-MTP MI300X recipe (#1750). Avoids the M3-decode breakable-cudagraph path that previously forced eager execution. Re-sweeps minimaxm3-fp8-mi300x-vllm-mtp.

    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

  * perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp cudagraphs

    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* M3 GB300: drop dominated configs, restore 1P1D full range

  Data from run 27489709722 showed:
  - 1P4D (20 GPU) strictly dominated by 1P1D (8 GPU): 320 vs 974 out/s/gpu @ conc 128 (1k1k). Single prefill can't feed 4 decode workers; the 1P:4D ratio is too decode-heavy.
  - 8k1k 5P12D (68 GPU) dominated by 10P7D: 567 vs 874 out/s/gpu @ conc 1024. Prefill-heavy ratio is correct for long context.
  Changes:
  - Remove 1P4D recipes (both 1k1k and 8k1k)
  - Remove 8k1k 5P12D recipe (dominated by 10P7D)
  - Restore 1P1D to full concurrency range [8-512] 1k1k, [8-256] 8k1k (was truncated to [4-32] to avoid 1P4D overlap)

  Final GB300 configs: 1P1D (latency-to-mid) + rack-saturating (max tput)
  1k1k: 1P1D [8-512] + 5P12D [2048-8192]
  8k1k: 1P1D [8-256] + 10P7D [1024-4096]

* M3 GB300 disagg: add DSV4-level decode optimizations

  Port decode optimizations from DSV4 GB300 disagg reference configs to all 4 M3 GB300 recipe files:
  - fp8 KV cache (2x decode slot capacity vs bf16)
  - max-num-seqs/max-num-batched-tokens 256→512
  - CUDA graph compilation (FULL_DECODE_ONLY mode)
  - NCCL MNNVL env vars (CUMEM_ENABLE, MNNVL_ENABLE, NVLS_ENABLE)
  - enable-ep-weight-filter + no-disable-hybrid-kv-cache-manager
  - stream-interval 32→50 on decode

* Switch GB300 M3 recipes to nightly-aarch64 + add Marlin MoE for TP-only workers

  - All 4 recipes: container vllm/vllm-openai:minimax-m3 → nightly-aarch64 (contains upstream head_ratio fix vllm#45879, avoids gemm1_alpha crash)
  - TP-only recipes (5p12d-tp4ep1, 10p7d-tp4ep1): add moe-backend: marlin for both prefill and decode workers per PR #1809 pattern
  - EP recipes (1p1d-tp4ep4): no Marlin (EP enabled)
  - nvidia-master.yaml: update image, comment out 1k1k (run 8k1k only)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: switch GB300 M3 runner from gb300-cw to gb300-nv

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add minimaxm3-fp8 to gb300-nv launcher + switch recipes to alias-based model path

  - Add minimaxm3 fp8 case to launch_gb300-nv.sh (MODEL_PATH, srt-slurm clone)
  - Switch recipe model.path from hf:MiniMaxAI/MiniMax-M3-MXFP8 to minimax-m3-mxfp8 (alias resolved via srtslurm.yaml model_paths, matching GB200 pattern)
  - Remove __M3_HF_HOME__ placeholder (extra_mount, HF_HOME, HF_HUB_OFFLINE)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: redesign GB300 M3 recipes: DEP8 prefill, TEP8/TP8/DEP8 decode

  All prefill workers switched to DEP8 (TP1 DP8 EP, 8 GPU, 2 nodes). Low conc (<128): two decode variants, TEP8 (TP8+EP8) and TP8+Marlin. High conc (128+): DEP8 decode, 2P+7D = 18 nodes. TP8 decode (not TP4) to avoid Marlin OOM seen on previous run.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: TEP4 prefill + B300-optimal decode for GB300 M3 disagg

  Switch all prefill from DEP8 (TP1 DP8 EP, 2 nodes) to TEP4 (TP4+EP4, 1 node), halving per-worker node footprint. Decode configs follow B300 run 27630519240 optimal points (spec=none):
  - conc 8-32: TP4+Marlin (no EP)
  - conc 64-256: TEP4 (TP4+EP4)
  - conc 512/1024: TEP8 (8k1k) or DEP8 (1k1k), max 2 workers × 6n

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: adapt NV B300 PR #1863 disagg configs for GB300 M3 sweep

  Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB300 NVL72 (4 GPU/node):
  - All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
  - Decode types: TP4+Marlin, TEP8, DEP8, DEP4
  - 4p3d (3 decode workers) skipped
  - 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
  - PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
  - Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
  - kv-cache-dtype: fp8, req_rate: inf for all benchmarks
  - GB300 MNNVL/NVLS env vars + sbatch mem=0 preserved

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reduce GB300 DEP CUDA graph capture sizes

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
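The "dominated" comparisons in the commit history above can be made concrete with a small filter. This is only one plausible reading, stated as an assumption: at a fixed concurrency, a config is treated as dominated when another uses no more GPUs and delivers no less output throughput per GPU, and is strictly better on at least one of the two. The commit messages do not spell out their exact criterion, and `Point`, `dominates`, and `non_dominated` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    name: str
    gpus: int
    out_tok_s_per_gpu: float


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.gpus <= b.gpus and a.out_tok_s_per_gpu >= b.out_tok_s_per_gpu
    better = a.gpus < b.gpus or a.out_tok_s_per_gpu > b.out_tok_s_per_gpu
    return no_worse and better


def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# 1k1k @ conc 128, figures quoted from run 27489709722 above.
p_1p4d = Point("1P4D", gpus=20, out_tok_s_per_gpu=320.0)
p_1p1d = Point("1P1D", gpus=8, out_tok_s_per_gpu=974.0)
kept = non_dominated([p_1p4d, p_1p1d])
```

A fuller comparison would presumably also weigh per-user latency, which the quoted figures do not include.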

github-actions · 2026-06-20T18:25:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27873218834
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27873218834

# Conflicts:
#	.github/configs/nvidia-master.yaml

github-actions · 2026-06-21T02:50:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27888496957
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27888496957

Oseltamivir added 2 commits June 20, 2026 05:25

[NV] Add MiniMax M3 B300 Dynamo vLLM recipes

b506cd4

chore: update MiniMax M3 B300 container

84a023a

Oseltamivir requested a review from a team June 19, 2026 21:29

Oseltamivir requested review from jgangani and kedarpotdar-nv as code owners June 19, 2026 21:29

github-project-automation Bot added this to InferenceMAX Board Jun 19, 2026

chore: update changelog PR link

b09bc78

Oseltamivir added the full-sweep-enabled label Jun 19, 2026

cursor Bot reviewed Jun 19, 2026

Comment thread benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b300-fp8/1k1k/1p1d-dep2-tep8-1k1k.yaml Outdated

Comment thread benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b300-fp8/8k1k/4p2d-dep2-dep8-8k1k.yaml

Oseltamivir added 2 commits June 19, 2026 14:31

Update perf-changelog.yaml

86da150

Update perf-changelog.yaml

f5727c2

SemiAnalysisAI deleted a comment from github-actions Bot Jun 19, 2026

fix(vllm): patch MiniMax M3 MSA contiguity

3b6dad4

cursor Bot reviewed Jun 19, 2026

Comment thread benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b300-fp8/8k1k/1p2d-dep2-dep8-8k1k.yaml

fix(recipes): align MiniMax M3 parallel settings

71ba2ea

cursor Bot reviewed Jun 19, 2026

Comment thread benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b300-fp8/8k1k/4p2d-dep2-dep8-8k1k.yaml

fix(vllm): backport MiniMax M3 eval fixes

b859a0b

ci(sweep): enable full MiniMax M3 validation

2d408e4

perf(vllm): right-size MiniMax M3 low concurrency

3956aee

Merge remote-tracking branch 'origin/main' into pr-1787-latest

33fe6a9

# Conflicts:
#	perf-changelog.yaml

Merge branch 'main' into pr-1787-latest

77c6391

perf(vllm): colocate MiniMax M3 TP4 workers

b99d3c9

fix(runner): exclude faulty B300 RDMA node

d2347aa

Oseltamivir added full-sweep-enabled and removed full-sweep-enabled labels Jun 20, 2026

fix(runner): verify B300 node exclusion

8ace2e9

fix(runner): check generated B300 sbatch script

884ff12

Oseltamivir added full-sweep-enabled and removed full-sweep-enabled labels Jun 20, 2026

ci(sweep): validate B300 node exclusion

3ae240b

cursor Bot mentioned this pull request Jun 20, 2026

MiniMax-M3 MXFP8 full sweep config for GB200 #1734

Open

2 tasks

Merge remote-tracking branch 'origin/main' into pr-1787-latest

9751d93

# Conflicts:
#	.github/configs/nvidia-master.yaml

Oseltamivir added full-sweep-enabled and removed full-sweep-enabled labels Jun 20, 2026

refactor(vllm): trim MiniMax M3 runtime patches

03d27e7
