[Common] EP C API: version config structs and extend nvte_ep_prepare with total_recv_tokens_per_rank placeholder#3154
[Common] EP C API: version config structs and extend nvte_ep_prepare with total_recv_tokens_per_rank placeholder#3154phu0ngng wants to merge 4 commits into
nvte_ep_prepare with total_recv_tokens_per_rank placeholder#3154Conversation
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Greptile SummaryThis PR versions the EP C API config structs (
Confidence Score: 4/5Safe to merge for the stated scope; the ABI versioning design is sound and all updated call sites are consistent. The struct-versioning design is well-thought-out and the normalize_ep_config helper correctly handles old-style zero-init, forward-compat (newer caller/older library), and normal cases. All seven changed files are consistently updated. The two open items — a dropped explicit dtype bounds check and a subtly redundant NVTE_CHECK condition — neither affect correct operation for well-formed inputs. ep_api.cpp (normalize_ep_config NVTE_CHECK condition) and ep_backend.cpp (removed max_token_dtype bounds check) are the two files worth a second look. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Caller
participant ep_api as ep_api.cpp
participant normalize as normalize_ep_config
participant EPBackend
Caller->>ep_api: "nvte_ep_initialize(comm, &group_config)"
ep_api->>normalize: "normalize_ep_config(&group_config, kGroupConfigMinSize)"
normalize-->>ep_api: "NVTEEpGroupConfig (normalized, struct_size=sizeof)"
ep_api->>EPBackend: initialize(comm, cfg)
Caller->>ep_api: "nvte_ep_handle_mem_size(&layer_cfg)"
ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
normalize-->>ep_api: NVTEEpLayerConfig (normalized)
ep_api->>EPBackend: handle_mem_size(cfg)
Caller->>ep_api: "nvte_ep_prepare(handle_mem, topk_idx, recv_tokens_per_expert, null, &layer_cfg, stream)"
ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
normalize-->>ep_api: NVTEEpLayerConfig (normalized)
ep_api->>EPBackend: prepare(..., recv_tokens_per_expert, nullptr[placeholder], cfg, stream)
Note over EPBackend: total_recv_tokens_per_rank ignored (reserved placeholder)
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant Caller
participant ep_api as ep_api.cpp
participant normalize as normalize_ep_config
participant EPBackend
Caller->>ep_api: "nvte_ep_initialize(comm, &group_config)"
ep_api->>normalize: "normalize_ep_config(&group_config, kGroupConfigMinSize)"
normalize-->>ep_api: "NVTEEpGroupConfig (normalized, struct_size=sizeof)"
ep_api->>EPBackend: initialize(comm, cfg)
Caller->>ep_api: "nvte_ep_handle_mem_size(&layer_cfg)"
ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
normalize-->>ep_api: NVTEEpLayerConfig (normalized)
ep_api->>EPBackend: handle_mem_size(cfg)
Caller->>ep_api: "nvte_ep_prepare(handle_mem, topk_idx, recv_tokens_per_expert, null, &layer_cfg, stream)"
ep_api->>normalize: "normalize_ep_config(&layer_cfg, kLayerConfigMinSize)"
normalize-->>ep_api: NVTEEpLayerConfig (normalized)
ep_api->>EPBackend: prepare(..., recv_tokens_per_expert, nullptr[placeholder], cfg, stream)
Note over EPBackend: total_recv_tokens_per_rank ignored (reserved placeholder)
Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile |
| NVTE_CHECK(config.hidden_dim > 0, "hidden_dim must be positive, got ", config.hidden_dim); | ||
| NVTE_CHECK(config.max_token_dtype >= 0 && config.max_token_dtype < kNVTENumTypes, | ||
| "max_token_dtype out of range, got ", static_cast<int>(config.max_token_dtype)); | ||
| const size_t elem_bytes = typeToSize(static_cast<DType>(config.max_token_dtype)); |
There was a problem hiding this comment.
Removed
max_token_dtype range check degrades error messages
The explicit NVTE_CHECK(config.max_token_dtype >= 0 && config.max_token_dtype < kNVTENumTypes, ...) guard was removed, so typeToSize() is now the first line of defense. An out-of-range value (e.g. an accidentally unset field left as a garbage integer) will produce either a generic TRANSFORMER_ENGINE_TYPE_SWITCH_ALL error or — if typeToNumBits silently returns 0 for an unknown type — a confusing "row_bytes < 16" failure rather than a clear "max_token_dtype out of range" message. Given that the header comment now says the field is "Required (no default)", a fast, explicit bounds check here would give users a much better diagnostic.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| const size_t want = (user->struct_size == 0) ? min_size : user->struct_size; | ||
| NVTE_CHECK(want >= min_size, name, ".struct_size (", user->struct_size, |
There was a problem hiding this comment.
struct_size == 0 and truncated memcpy can silently copy a wrong number of bytes
When user->struct_size is between min_size and sizeof(Cfg) the memcpy copies exactly want bytes, which is correct. However, the NVTE_CHECK below only fires when want < min_size after the ternary substitution. If struct_size is a small non-zero garbage value (e.g. 1), the check accepts it as long as min_size == 1, and memcpy will copy just 1 byte into cfg leaving almost every field zeroed. Consider making the guard explicit: user->struct_size == 0 || want >= min_size.
| const size_t want = (user->struct_size == 0) ? min_size : user->struct_size; | |
| NVTE_CHECK(want >= min_size, name, ".struct_size (", user->struct_size, | |
| const size_t want = (user->struct_size == 0) ? min_size : user->struct_size; | |
| NVTE_CHECK(user->struct_size == 0 || want >= min_size, name, ".struct_size (", user->struct_size, |
There was a problem hiding this comment.
LGTM pending CI. I think for now the struct_size field is sufficient for versioning. I feel like we have a lot of structs that have needed versioning, so in future would be great for someone to align the C API with similar versioning functionality, like some VERSIONED_STRUCT macro. But out of scope for this PR
Description
NVTEEpGroupConfig,NVTEEpLayerConfig) with a leadingstruct_sizefield and passes them by pointer, so fields can be added without breaking ABI.total_recv_tokens_per_rankplaceholder output tonvte_ep_preparefor future use (accepted, may be null, ignored for now).nvte_ep_prepareoutputtoken_countstorecv_tokens_per_expertfor clarity.Type of change
Checklist: