[Common] EP C API: version config structs and extend `nvte_ep_prepare` with `total_recv_tokens_per_rank` placeholder by phu0ngng · Pull Request #3154 · NVIDIA/TransformerEngine

phu0ngng · 2026-06-29T12:08:18Z

Description

Versions the EP config structs (NVTEEpGroupConfig, NVTEEpLayerConfig) with a leading struct_size field and passes them by pointer, so fields can be added without breaking ABI.
Adds a total_recv_tokens_per_rank placeholder output to nvte_ep_prepare for future use (accepted, may be null, ignored for now).
Renames the nvte_ep_prepare output token_counts to recv_tokens_per_expert for clarity.
Updates all call sites (JAX bindings, C++ distributed tests) and docs accordingly.
PENDING: Update PyT callers' side after PR [PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035 is merged.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-29T12:14:35Z

Greptile Summary

This PR versions the EP C API config structs (NVTEEpGroupConfig, NVTEEpLayerConfig) with a leading struct_size field and switches all pass-by-value arguments to const-pointer, enabling future field additions without ABI breakage. It also renames token_counts → recv_tokens_per_expert for clarity and adds a total_recv_tokens_per_rank placeholder output to nvte_ep_prepare (currently ignored, may be null).

Introduces normalize_ep_config<Cfg>() in ep_api.cpp to copy caller structs into the current layout, zero-filling fields the caller didn't provide and normalizing struct_size == 0 to the known base size.
Adds NVTE_EP_GROUP_CONFIG_INIT / NVTE_EP_LAYER_CONFIG_INIT macros for zero-safe construction, and updates all call sites (JAX bindings, C++ distributed tests) accordingly.
Removes the explicit max_token_dtype range check from EPBackend::validate_config, relying on typeToSize() for downstream validation.

Confidence Score: 4/5

Safe to merge for the stated scope; the ABI versioning design is sound and all updated call sites are consistent.

The struct-versioning design is well-thought-out and the normalize_ep_config helper correctly handles old-style zero-init, forward-compat (newer caller/older library), and normal cases. All seven changed files are consistently updated. The two open items — a dropped explicit dtype bounds check and a subtly redundant NVTE_CHECK condition — neither affect correct operation for well-formed inputs.

ep_api.cpp (normalize_ep_config NVTE_CHECK condition) and ep_backend.cpp (removed max_token_dtype bounds check) are the two files worth a second look.

Important Files Changed

Filename	Overview
transformer_engine/common/include/transformer_engine/ep.h	Adds `struct_size` as first field to both config structs, changes all function signatures to pointer-based, adds `total_recv_tokens_per_rank` placeholder, and introduces INIT macros. Clear, well-documented ABI versioning pattern.
transformer_engine/common/ep/ep_api.cpp	Adds `normalize_ep_config` template for struct versioning; logic is mostly correct but the NVTE_CHECK condition is redundant for the struct_size==0 path and could be made more explicit.
transformer_engine/common/ep/ep_backend.cpp	Renames `token_counts` to `recv_tokens_per_expert`, adds `total_recv_tokens_per_rank` placeholder (unused/commented). Removes the `max_token_dtype` explicit bounds check, relying on `typeToSize` for validation.
transformer_engine/common/ep/ep_backend.h	Updated `prepare` signature to match new API; straightforward rename and placeholder addition.
transformer_engine/jax/csrc/extensions/ep.cpp	Updated to use designated initializers with `struct_size`, pointer-based config passing, and passes `nullptr` for the placeholder `total_recv_tokens_per_rank`. All call sites correctly updated.
tests/cpp_distributed/test_ep.cu	All test call sites updated: renames `token_counts` buffers to `recv_tokens_per_expert`, stores `NVTEEpLayerConfig` as member initialized with INIT macro, passes `nullptr` for the new placeholder output.
tests/cpp_distributed/test_ep_common.h	Bootstrap and re-initialize helpers updated to use `NVTE_EP_GROUP_CONFIG_INIT` and pass config by pointer.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant ep_api as ep_api.cpp
    participant normalize as normalize_ep_config
    participant EPBackend

    Caller->>ep_api: "nvte_ep_initialize(comm, &group_config)"
    ep_api->>normalize: "normalize_ep_config(&group_config, kGroupConfigMinSize)"
    normalize-->>ep_api: "NVTEEpGroupConfig (normalized, struct_size=sizeof)"
    ep_api->>EPBackend: initialize(comm, cfg)

    Caller->>ep_api: "nvte_ep_handle_mem_size(&layer_cfg)"
    ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
    normalize-->>ep_api: NVTEEpLayerConfig (normalized)
    ep_api->>EPBackend: handle_mem_size(cfg)

    Caller->>ep_api: "nvte_ep_prepare(handle_mem, topk_idx, recv_tokens_per_expert, null, &layer_cfg, stream)"
    ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
    normalize-->>ep_api: NVTEEpLayerConfig (normalized)
    ep_api->>EPBackend: prepare(..., recv_tokens_per_expert, nullptr[placeholder], cfg, stream)
    Note over EPBackend: total_recv_tokens_per_rank ignored (reserved placeholder)

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant ep_api as ep_api.cpp
    participant normalize as normalize_ep_config
    participant EPBackend

    Caller->>ep_api: "nvte_ep_initialize(comm, &group_config)"
    ep_api->>normalize: "normalize_ep_config(&group_config, kGroupConfigMinSize)"
    normalize-->>ep_api: "NVTEEpGroupConfig (normalized, struct_size=sizeof)"
    ep_api->>EPBackend: initialize(comm, cfg)

    Caller->>ep_api: "nvte_ep_handle_mem_size(&layer_cfg)"
    ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
    normalize-->>ep_api: NVTEEpLayerConfig (normalized)
    ep_api->>EPBackend: handle_mem_size(cfg)

    Caller->>ep_api: "nvte_ep_prepare(handle_mem, topk_idx, recv_tokens_per_expert, null, &layer_cfg, stream)"
    ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
    normalize-->>ep_api: NVTEEpLayerConfig (normalized)
    ep_api->>EPBackend: prepare(..., recv_tokens_per_expert, nullptr[placeholder], cfg, stream)
    Note over EPBackend: total_recv_tokens_per_rank ignored (reserved placeholder)

_{Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile}

greptile-apps · 2026-06-29T12:14:38Z

  NVTE_CHECK(config.hidden_dim > 0, "hidden_dim must be positive, got ", config.hidden_dim);
-  NVTE_CHECK(config.max_token_dtype >= 0 && config.max_token_dtype < kNVTENumTypes,
-             "max_token_dtype out of range, got ", static_cast<int>(config.max_token_dtype));
  const size_t elem_bytes = typeToSize(static_cast<DType>(config.max_token_dtype));


Removed max_token_dtype range check degrades error messages

The explicit NVTE_CHECK(config.max_token_dtype >= 0 && config.max_token_dtype < kNVTENumTypes, ...) guard was removed, so typeToSize() is now the first line of defense. An out-of-range value (e.g. an accidentally unset field left as a garbage integer) will produce either a generic TRANSFORMER_ENGINE_TYPE_SWITCH_ALL error or — if typeToNumBits silently returns 0 for an unknown type — a confusing "row_bytes < 16" failure rather than a clear "max_token_dtype out of range" message. Given that the header comment now says the field is "Required (no default)", a fast, explicit bounds check here would give users a much better diagnostic.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Add it back!

greptile-apps · 2026-06-29T12:14:39Z

+  const size_t want = (user->struct_size == 0) ? min_size : user->struct_size;
+  NVTE_CHECK(want >= min_size, name, ".struct_size (", user->struct_size,


struct_size == 0 and truncated memcpy can silently copy a wrong number of bytes

When user->struct_size is between min_size and sizeof(Cfg) the memcpy copies exactly want bytes, which is correct. However, the NVTE_CHECK below only fires when want < min_size after the ternary substitution. If struct_size is a small non-zero garbage value (e.g. 1), the check accepts it as long as min_size == 1, and memcpy will copy just 1 byte into cfg leaving almost every field zeroed. Consider making the guard explicit: user->struct_size == 0 || want >= min_size.

Suggested change

const size_t want = (user->struct_size == 0) ? min_size : user->struct_size;

NVTE_CHECK(want >= min_size, name, ".struct_size (", user->struct_size,

const size_t want = (user->struct_size == 0) ? min_size : user->struct_size;

NVTE_CHECK(user->struct_size == 0 || want >= min_size, name, ".struct_size (", user->struct_size,

jberchtold-nvidia

LGTM pending CI. I think for now the struct_size field is sufficient for versioning. I feel like we have a lot of structs that have needed versioning, so in future would be great for someone to align the C API with similar versioning functionality, like some VERSIONED_STRUCT macro. But out of scope for this PR

phu0ngng added 3 commits June 29, 2026 10:28

versioning EP C configs

799badc

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Rename EP prepare token_counts to recv_tokens_per_expert

5b104ce

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Add total_recv_tokens_per_rank placeholder to nvte_ep_prepare

3d3abf2

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng requested review from jberchtold-nvidia and ptrendx as code owners June 29, 2026 12:08

[pre-commit.ci] auto fixes from pre-commit.com hooks

c7da769

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 29, 2026

View reviewed changes

jberchtold-nvidia approved these changes Jun 29, 2026

View reviewed changes

phu0ngng added the 2.17 label Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Common] EP C API: version config structs and extend `nvte_ep_prepare` with `total_recv_tokens_per_rank` placeholder#3154

[Common] EP C API: version config structs and extend `nvte_ep_prepare` with `total_recv_tokens_per_rank` placeholder#3154
phu0ngng wants to merge 4 commits into
NVIDIA:mainfrom
phu0ngng:ep-c-api

phu0ngng commented Jun 29, 2026

Uh oh!

greptile-apps Bot commented Jun 29, 2026

Uh oh!

greptile-apps Bot Jun 29, 2026

Uh oh!

phu0ngng Jun 29, 2026

Uh oh!

greptile-apps Bot Jun 29, 2026

Uh oh!

jberchtold-nvidia left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const size_t want = (user->struct_size == 0) ? min_size : user->struct_size;
		NVTE_CHECK(want >= min_size, name, ".struct_size (", user->struct_size,

Uh oh!

Conversation

phu0ngng commented Jun 29, 2026

Description

Type of change

Checklist:

Uh oh!

greptile-apps Bot commented Jun 29, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

phu0ngng Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

jberchtold-nvidia left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jberchtold-nvidia left a comment •

edited

Loading