Fix for https://github.com/sgl-project/sglang/issues/22072#1806
Fix for https://github.com/sgl-project/sglang/issues/22072#1806davzhuAMD wants to merge 4 commits into
Conversation
… mentioned in #22072.
| export REBUILD_LIBBNXT_IN_CONTAINER=1 | ||
| export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz | ||
|
|
||
| export PREFILL_NODE="45.63.71.103" |
There was a problem hiding this comment.
The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.
There was a problem hiding this comment.
Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.
| # IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1. | ||
| # | ||
| # Prerequisites: | ||
| # - Passwordless SSH from this machine to both nodes |
There was a problem hiding this comment.
Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.
…er concurrencies. - Increase the decode dispatch buffer size from 512 to 4096 and correctly pass it along for the DP and EP cases, avoiding a stall under load. - Correctly set MORI parameters when prioritizing traffic as lossless, which otherwise interferes with the data transfers that the tests generate.
…e final commit. They'll remain availabel in the sglang-pd-mi325x branch.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3ecc400. Configure here.
There was a problem hiding this comment.
Guidance would be appreciated for how to handle a library file like this. During investigation and debug, we requested an update of our test nodes' NIC firmware and driver. However, the Docker image we used didn't have the updated driver itself, so our test scripts had to pass export REBUILD_LIBBNXT_IN_CONTAINER=1 to patch in this updated driver. In theory, using an updated image with a later compatible NIC driver should make this archive unnecessary to commit.
There was a problem hiding this comment.
Regarding my comment on libbnxt_re-237.1.137.0.tar.gz https://github.com/SemiAnalysisAI/InferenceX/pull/1806/changes#r3432816594, if anyone validates this fix using a Docker image that comes with an updated NIC driver compatible with the host's firmware, then this rebuild_bnxt.sh script becomes unnecessary.

Fix SGLang MORI disagg tests hanging on MI325X. The test matrix appeared as follows:
Investigation employed helper scripts from https://github.com/davzhuAMD/InferenceX/tree/sglang-pd-mi325x, derived from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x, to aid investigation and repro. These are the run_* and start_* scripts that essentially act as config instructions to set up the repro.
To validate, copy the run_* and start_* scripts from the sglang-pd-mi325 branch mentioned above into the main InferenceX folder, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run
Thanks to Akash Dhaka who investigated this issue with me.
Note
Medium Risk
Changes multi-node RDMA/NCCL/MoRI defaults and privileged Docker launch paths; mis-tuned GID/QoS or IBDEVICES ordering could still break cross-node comms, but scope is benchmark scripts rather than core serving code.
Overview
Addresses SGLang MoRI disaggregated benchmark hangs on MI325X by tuning RoCE/NCCL/MoRI env defaults and fixing decode DP+EP sizing that was starving cross-node EP under load.
RDMA / NCCL:
env.shadds defaultNCCL_IB_GID_INDEX,NCCL_IB_TC, andNCCL_IB_SLfor bnxt_re RoCEv2 multi-node collectives. WhenMORI_RDMA_TCis set, it now derives matchingMORI_RDMA_SLand mirrors TC/SL toMORI_IO_*so bnxt_re stays on the lossless PFC queue. Decode MoRI dispatch cap rises from 512 → 4096 tokens.Decode DP+EP:
server_sglang.shstops shrinkingMORI_MAX_DISPATCH_TOKENS_DECODEtoconc/dp_ranks; it scalesmax-running-requestsasconc * TP(floored atdp_ranks) and keeps env.sh MoRI buffer sizes.Launch path: New
scripts/_disagg_ssh_remote_inner.shsorts per-nodeIBDEVICESby RoCEv2 GID subnet before Docker;_disagg_container_entry.shoptionally runsrebuild_bnxt.shandinstall_mori_in_container.shthenserver.sh. Helpers added:detect_ibdevices_bnxt.sh,rebuild_bnxt.sh.Model config:
DeepSeek-R1-0528prefillmem_fraction_static0.8 → 0.9 inmodels.yaml.Reviewed by Cursor Bugbot for commit eb43d03. Bugbot is set up for automated code reviews on this repo. Configure here.