Skip to content

Fix for https://github.com/sgl-project/sglang/issues/22072#1806

Open
davzhuAMD wants to merge 4 commits into
SemiAnalysisAI:mainfrom
davzhuAMD:main
Open

Fix for https://github.com/sgl-project/sglang/issues/22072#1806
davzhuAMD wants to merge 4 commits into
SemiAnalysisAI:mainfrom
davzhuAMD:main

Conversation

@davzhuAMD

@davzhuAMD davzhuAMD commented Jun 16, 2026

Copy link
Copy Markdown

Fix SGLang MORI disagg tests hanging on MI325X. The test matrix appeared as follows:

Test # Config Decode nodes EP/DP MoRI a2a Status
1 P(tp4) D(tp8) 1 No No Works
2 P(tp8) D(tp8, 2 workers) 1 each No No Works
3 P(tp4) D(tp8/ep8/dp, 1 node) 1 Yes Yes Works (used to hang)
4 P(tp4) D(tp8/ep8/dp, 1 node, MTP) 1 Yes Yes Works (used to hang)
5 P(tp8) D(tp16/ep16/dp16, 2 nodes) 2 Yes Yes Works (used to hang)

Investigation employed helper scripts from https://github.com/davzhuAMD/InferenceX/tree/sglang-pd-mi325x, derived from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x, to aid investigation and repro. These are the run_* and start_* scripts that essentially act as config instructions to set up the repro.

To validate, copy the run_* and start_* scripts from the sglang-pd-mi325 branch mentioned above into the main InferenceX folder, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run

  • start_test_1.sh
  • start_test_2.sh
  • start_test_3.sh
  • start_test_4.sh
  • start_test_5.sh

Thanks to Akash Dhaka who investigated this issue with me.


Note

Medium Risk
Changes multi-node RDMA/NCCL/MoRI defaults and privileged Docker launch paths; mis-tuned GID/QoS or IBDEVICES ordering could still break cross-node comms, but scope is benchmark scripts rather than core serving code.

Overview
Addresses SGLang MoRI disaggregated benchmark hangs on MI325X by tuning RoCE/NCCL/MoRI env defaults and fixing decode DP+EP sizing that was starving cross-node EP under load.

RDMA / NCCL: env.sh adds default NCCL_IB_GID_INDEX, NCCL_IB_TC, and NCCL_IB_SL for bnxt_re RoCEv2 multi-node collectives. When MORI_RDMA_TC is set, it now derives matching MORI_RDMA_SL and mirrors TC/SL to MORI_IO_* so bnxt_re stays on the lossless PFC queue. Decode MoRI dispatch cap rises from 512 → 4096 tokens.

Decode DP+EP: server_sglang.sh stops shrinking MORI_MAX_DISPATCH_TOKENS_DECODE to conc/dp_ranks; it scales max-running-requests as conc * TP (floored at dp_ranks) and keeps env.sh MoRI buffer sizes.

Launch path: New scripts/_disagg_ssh_remote_inner.sh sorts per-node IBDEVICES by RoCEv2 GID subnet before Docker; _disagg_container_entry.sh optionally runs rebuild_bnxt.sh and install_mori_in_container.sh then server.sh. Helpers added: detect_ibdevices_bnxt.sh, rebuild_bnxt.sh.

Model config: DeepSeek-R1-0528 prefill mem_fraction_static 0.8 → 0.9 in models.yaml.

Reviewed by Cursor Bugbot for commit eb43d03. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread utils/find_reusable_sweep_run.py
Comment thread benchmarks/multi_node/amd_utils/server.sh
Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh Outdated
Comment thread runners/launch_gb300-nv.sh
Comment thread scripts/_disagg_ssh_remote_inner.sh
Comment thread run_1p2d_sglang_mi300_mi325x.sh Outdated
Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh Outdated
Comment thread start_test_1.sh Outdated
export REBUILD_LIBBNXT_IN_CONTAINER=1
export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz

export PREFILL_NODE="45.63.71.103"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.

Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh Outdated
# IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1.
#
# Prerequisites:
# - Passwordless SSH from this machine to both nodes

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.

…er concurrencies.

- Increase the decode dispatch buffer size from 512 to 4096 and correctly pass it along for the DP and EP cases, avoiding a stall under load.
- Correctly set MORI parameters when prioritizing traffic as lossless, which otherwise interferes with the data transfers that the tests generate.
Comment thread benchmarks/multi_node/amd_utils/env.sh
…e final commit. They'll remain availabel in the sglang-pd-mi325x branch.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3ecc400. Configure here.

Comment thread benchmarks/multi_node/amd_utils/rebuild_bnxt.sh

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guidance would be appreciated for how to handle a library file like this. During investigation and debug, we requested an update of our test nodes' NIC firmware and driver. However, the Docker image we used didn't have the updated driver itself, so our test scripts had to pass export REBUILD_LIBBNXT_IN_CONTAINER=1 to patch in this updated driver. In theory, using an updated image with a later compatible NIC driver should make this archive unnecessary to commit.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding my comment on libbnxt_re-237.1.137.0.tar.gz https://github.com/SemiAnalysisAI/InferenceX/pull/1806/changes#r3432816594, if anyone validates this fix using a Docker image that comes with an updated NIC driver compatible with the host's firmware, then this rebuild_bnxt.sh script becomes unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants