Fix for https://github.com/sgl-project/sglang/issues/22072 by davzhuAMD · Pull Request #1806 · SemiAnalysisAI/InferenceX

davzhuAMD · 2026-06-16T15:00:16Z

Fix SGLang MORI disagg tests hanging on MI325X. The test matrix appeared as follows:

Test #	Config	Decode nodes	EP/DP	MoRI a2a	Status
1	P(tp4) D(tp8)	1	No	No	Works
2	P(tp8) D(tp8, 2 workers)	1 each	No	No	Works
3	P(tp4) D(tp8/ep8/dp, 1 node)	1	Yes	Yes	Works (used to hang)
4	P(tp4) D(tp8/ep8/dp, 1 node, MTP)	1	Yes	Yes	Works (used to hang)
5	P(tp8) D(tp16/ep16/dp16, 2 nodes)	2	Yes	Yes	Works (used to hang)

Investigation employed helper scripts from https://github.com/davzhuAMD/InferenceX/tree/sglang-pd-mi325x, derived from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x, to aid investigation and repro. These are the run_* and start_* scripts that essentially act as config instructions to set up the repro.

To validate, copy the run_* and start_* scripts from the sglang-pd-mi325 branch mentioned above into the main InferenceX folder, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run

start_test_1.sh
start_test_2.sh
start_test_3.sh
start_test_4.sh
start_test_5.sh

Thanks to Akash Dhaka who investigated this issue with me.

Note

Medium Risk
Changes multi-node RDMA/NCCL/MoRI defaults and privileged Docker launch paths; mis-tuned GID/QoS or IBDEVICES ordering could still break cross-node comms, but scope is benchmark scripts rather than core serving code.

Overview
Addresses SGLang MoRI disaggregated benchmark hangs on MI325X by tuning RoCE/NCCL/MoRI env defaults and fixing decode DP+EP sizing that was starving cross-node EP under load.

RDMA / NCCL: env.sh adds default NCCL_IB_GID_INDEX, NCCL_IB_TC, and NCCL_IB_SL for bnxt_re RoCEv2 multi-node collectives. When MORI_RDMA_TC is set, it now derives matching MORI_RDMA_SL and mirrors TC/SL to MORI_IO_* so bnxt_re stays on the lossless PFC queue. Decode MoRI dispatch cap rises from 512 → 4096 tokens.

Decode DP+EP: server_sglang.sh stops shrinking MORI_MAX_DISPATCH_TOKENS_DECODE to conc/dp_ranks; it scales max-running-requests as conc * TP (floored at dp_ranks) and keeps env.sh MoRI buffer sizes.

Launch path: New scripts/_disagg_ssh_remote_inner.sh sorts per-node IBDEVICES by RoCEv2 GID subnet before Docker; _disagg_container_entry.sh optionally runs rebuild_bnxt.sh and install_mori_in_container.sh then server.sh. Helpers added: detect_ibdevices_bnxt.sh, rebuild_bnxt.sh.

Model config: DeepSeek-R1-0528 prefill mem_fraction_static 0.8 → 0.9 in models.yaml.

^{Reviewed by Cursor Bugbot for commit eb43d03. Bugbot is set up for automated code reviews on this repo. Configure here.}

… mentioned in #22072.

1am9trash · 2026-06-17T10:40:08Z

+export REBUILD_LIBBNXT_IN_CONTAINER=1
+export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz
+
+export PREFILL_NODE="45.63.71.103"


The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.

Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.

1am9trash · 2026-06-17T10:51:14Z

+# IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1.
+#
+# Prerequisites:
+#   - Passwordless SSH from this machine to both nodes


Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.

…er concurrencies. - Increase the decode dispatch buffer size from 512 to 4096 and correctly pass it along for the DP and EP cases, avoiding a stall under load. - Correctly set MORI parameters when prioritizing traffic as lossless, which otherwise interferes with the data transfers that the tests generate.

…e final commit. They'll remain availabel in the sglang-pd-mi325x branch.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 3ecc400. Configure here.}

davzhuAMD · 2026-06-18T03:38:31Z

Guidance would be appreciated for how to handle a library file like this. During investigation and debug, we requested an update of our test nodes' NIC firmware and driver. However, the Docker image we used didn't have the updated driver itself, so our test scripts had to pass export REBUILD_LIBBNXT_IN_CONTAINER=1 to patch in this updated driver. In theory, using an updated image with a later compatible NIC driver should make this archive unnecessary to commit.

davzhuAMD · 2026-06-18T04:13:56Z

Regarding my comment on libbnxt_re-237.1.137.0.tar.gz https://github.com/SemiAnalysisAI/InferenceX/pull/1806/changes#r3432816594, if anyone validates this fix using a Docker image that comes with an updated NIC driver compatible with the host's firmware, then this rebuild_bnxt.sh script becomes unnecessary.

davzhuAMD requested a review from a team June 16, 2026 15:00

github-project-automation Bot added this to InferenceMAX Board Jun 16, 2026

cursor Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread utils/find_reusable_sweep_run.py

Comment thread benchmarks/multi_node/amd_utils/server.sh

Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh Outdated

Comment thread runners/launch_gb300-nv.sh

cursor Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread scripts/_disagg_ssh_remote_inner.sh

Patches and helper scripts for the 3 failing SGLang MORI disagg tests…

c8a0e03

… mentioned in #22072.

davzhuAMD force-pushed the main branch from 7821a3a to c8a0e03 Compare June 16, 2026 16:02

cursor Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread run_1p2d_sglang_mi300_mi325x.sh Outdated

Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh Outdated

1am9trash reviewed Jun 17, 2026

View reviewed changes

cursor Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread benchmarks/multi_node/amd_utils/env.sh

Remove and revert test-related files that don't need to make it to th…

3ecc400

…e final commit. They'll remain availabel in the sglang-pd-mi325x branch.

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread benchmarks/multi_node/amd_utils/rebuild_bnxt.sh

Restore .gitignore missing blank newline.

eb43d03

davzhuAMD commented Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for https://github.com/sgl-project/sglang/issues/22072#1806

Fix for https://github.com/sgl-project/sglang/issues/22072#1806
davzhuAMD wants to merge 4 commits into
SemiAnalysisAI:mainfrom
davzhuAMD:main

davzhuAMD commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

1am9trash Jun 17, 2026

Uh oh!

davzhuAMD Jun 17, 2026

Uh oh!

1am9trash Jun 17, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

davzhuAMD Jun 18, 2026

Uh oh!

davzhuAMD Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

davzhuAMD commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

1am9trash Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

davzhuAMD Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

1am9trash Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

davzhuAMD Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

davzhuAMD Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

davzhuAMD commented Jun 16, 2026 •

edited

Loading