Skip to content

[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK#3150

Open
phu0ngng wants to merge 5 commits into
NVIDIA:mainfrom
phu0ngng:update_nccl
Open

[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK#3150
phu0ngng wants to merge 5 commits into
NVIDIA:mainfrom
phu0ngng:update_nccl

Conversation

@phu0ngng

Copy link
Copy Markdown
Collaborator

Description

Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR bumps the 3rdparty/nccl submodule to pick up a downstream fix for MAX_SUPPORTED_TOKENS_PER_RANK and reorders the NCCL discovery probe in setup.py to prefer the library the runtime dynamic loader will actually resolve.

  • Submodule bump: pointer advanced from 808d2433 to a6b5de08; no TransformerEngine source changes.
  • _discover_nccl_home reorder: ldconfig is now consulted before well-known prefixes, and the nvidia-nccl-cu* pip-wheel fallback moves to last place, avoiding ABI mismatches where the build would link against the pip wheel but the loader would resolve the system NCCL at runtime.

Confidence Score: 5/5

Safe to merge; both changes are narrow in scope — a submodule pointer bump and a probe-order reorder in build tooling.

The submodule bump carries an upstream fix and touches no TransformerEngine source. The setup.py reorder is well-reasoned, well-commented, and only affects the build-time NCCL discovery path; it correctly defers the pip-wheel probe to last so the build links against the same library the runtime loader resolves.

No files require special attention.

Important Files Changed

Filename Overview
3rdparty/nccl NCCL submodule pointer bumped from 808d2433 to a6b5de08 to pick up the MAX_SUPPORTED_TOKENS_PER_RANK fix.
setup.py NCCL discovery order in _discover_nccl_home() changed: ldconfig is now consulted before well-known prefixes, and the pip wheel fallback is moved to last; docstring updated to match.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_discover_nccl_home] --> B{NCCL_HOME env var set?}
    B -- Yes, header found --> C[Return NCCL_HOME]
    B -- Yes, header missing --> D[Warn, continue]
    B -- No --> D
    D --> E{ldconfig -p finds libnccl.so?}
    E -- Yes, nccl.h found at parent --> F[Return ldconfig root]
    E -- No / not found --> G[Probe well-known prefixes]
    G --> H{/opt/nvidia/nccl, /usr/local/nccl, /usr?}
    H -- Found --> I[Return prefix]
    H -- Not found --> J[Probe pip wheel nvidia.nccl]
    J --> K{nvidia.nccl importable & nccl.h present?}
    K -- Yes --> L[Return pip_root]
    K -- No --> M[raise RuntimeError]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[_discover_nccl_home] --> B{NCCL_HOME env var set?}
    B -- Yes, header found --> C[Return NCCL_HOME]
    B -- Yes, header missing --> D[Warn, continue]
    B -- No --> D
    D --> E{ldconfig -p finds libnccl.so?}
    E -- Yes, nccl.h found at parent --> F[Return ldconfig root]
    E -- No / not found --> G[Probe well-known prefixes]
    G --> H{/opt/nvidia/nccl, /usr/local/nccl, /usr?}
    H -- Found --> I[Return prefix]
    H -- Not found --> J[Probe pip wheel nvidia.nccl]
    J --> K{nvidia.nccl importable & nccl.h present?}
    K -- Yes --> L[Return pip_root]
    K -- No --> M[raise RuntimeError]
Loading

Reviews (3): Last reviewed commit: "Merge branch 'main' into update_nccl" | Re-trigger Greptile

Comment thread 3rdparty/nccl Outdated
@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

@phu0ngng

Copy link
Copy Markdown
Collaborator Author

/te-ci L1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant