[PyTorch][torch.compile] Make quantizers opaque value objects#3152

Open
pggPL wants to merge 14 commits into
NVIDIA:mainfrom
pggPL:make_qunatizers_opaque

@pggPL pggPL commented Jun 29, 2026

Collaborator

Description

Tensorless quantizers in TE (MXFP8, FP8 blockwise, FP8 current-scaling, NVFP4)
are fully described by a handful of plain, reproducible scalars: they hold no
live tensors and no process groups. This PR turns them into opaque value
objects so torch.compile can treat them as baked-in constants: two
quantizers with the same configuration become interchangeable, hashable, and
reconstructible inside an FX graph.

Quantizers that hold live state (delayed-scaling Float8Quantizer, which keeps
scale/amax tensors) and any user-defined quantizer keep the default
identity semantics, so the change is opt-in and backward compatible. On older
PyTorch builds without the opaque-object API the registration is a graceful
no-op.

Along the way this also un-breaks the existing test_torch_compile.py suite:
that file lived on main but was never wired into CI, and its
test_autocast_nested_custom case (nested te.autocast with multiple
CustomRecipe instances) was failing because of the CustomRecipe state-caching
bug fixed here. The file is now run in CI and passes.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add opt-in value-object identity to the base Quantizer
    (_value_fields / _value_key / __eq__ / __hash__). Returning None
    from _value_fields() (the default) keeps identity semantics.
  • New module transformer_engine/pytorch/dynamo.py holding the
    torch.compile glue: __fx_repr__, value-key reconstruction and
    register_value_opaque_quantizer (gracefully no-op without PyTorch's
    opaque-object API).
  • Register MXFP8Quantizer, Float8BlockQuantizer,
    Float8CurrentScalingQuantizer and NVFP4Quantizer as value opaque types
    (the deprecated amax_reduction_group is never part of the value).
  • Fix CustomRecipe state caching in TransformerEngineBaseModule.set_meta_tensor:
    rebuild quantizers when the CustomRecipe instance changes (e.g. nested
    te.autocast regions) instead of reusing the first recipe's state, since
    every CustomRecipe shares the CustomRecipeState type but carries its own
    qfactory. This fixes the previously-failing test_autocast_nested_custom.
  • Enable tests/pytorch/test_torch_compile.py in the L0_pytorch_unittest QA
    suite (it existed on main but was never run in CI), and add the quantizer
    value-object tests to it. Bringing it into CI required fixing the existing
    CustomRecipe torch.compile path: the qfactory now dispatches on
    QuantizerRole.tensor_type supplied by ToyLinear.get_quantizer_roles.
  • Guard the value-object path against a stored amax reduction group: __fx_repr__
    already rejects any quantizer holding a process group, and __eq__ / __hash__
    now raise too. The group is excluded from the value key, so a stored group would
    otherwise compare/hash equal to a groupless quantizer and let torch.compile
    reuse a graph that skips the reduction. Pass the group per quantize call instead.
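
A minimal sketch of the opt-in value identity described in the list above. The class names `ToyMXFP8Quantizer` and `ToyStatefulQuantizer` are hypothetical stand-ins, and the method bodies are an assumption about how `_value_fields` / `_value_key` / `__eq__` / `__hash__` could fit together, not the code in this PR.

```python
class Quantizer:
    """Toy base class; a stand-in, not the Transformer Engine implementation."""

    def _value_fields(self):
        # Default: no value fields, so the object keeps identity semantics.
        return None

    def _value_key(self):
        fields = self._value_fields()
        if fields is None:
            return None
        pairs = tuple((name, getattr(self, name)) for name in fields)
        return (type(self).__qualname__, pairs)

    def __eq__(self, other):
        if self is other:
            return True
        if type(self) is not type(other):
            return NotImplemented
        key = self._value_key()
        if key is None:
            # Let Python fall back to identity comparison.
            return NotImplemented
        return key == other._value_key()

    def __hash__(self):
        key = self._value_key()
        return object.__hash__(self) if key is None else hash(key)


class ToyMXFP8Quantizer(Quantizer):
    """Hypothetical tensorless quantizer: opts in by naming its fields."""

    def __init__(self, dtype):
        self.dtype = dtype

    def _value_fields(self):
        return ("dtype",)


class ToyStatefulQuantizer(Quantizer):
    """Hypothetical quantizer with live state: keeps the default."""

    def __init__(self, dtype):
        self.dtype = dtype
```

With this shape, two `ToyMXFP8Quantizer("e4m3")` instances compare equal and collapse to one entry in a set, while two `ToyStatefulQuantizer` instances stay distinct.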

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL and others added 8 commits June 29, 2026 11:25
…ompile

Give tensorless quantizers (MXFP8, FP8 blockwise, FP8 current-scaling,
NVFP4) value-object semantics so torch.compile can treat them as baked-in
constants:

- Add opt-in value identity to the base Quantizer (_value_fields /
  _value_key / __eq__ / __hash__). Quantizers holding live tensors
  (delayed-scaling Float8Quantizer) and custom quantizers keep identity
  semantics.
- New transformer_engine/pytorch/dynamo.py houses the torch.compile glue:
  __fx_repr__, value-key reconstruction and register_value_opaque_quantizer
  (gracefully a no-op on PyTorch builds without the opaque-object API).
- Register the four tensorless quantizers as value opaque types.

Also fix CustomRecipe state caching in TransformerEngineBaseModule:
set_meta_tensor now rebuilds quantizers when the CustomRecipe instance
changes (e.g. nested te.autocast regions) instead of reusing the first
recipe's state, since every CustomRecipe shares the CustomRecipeState type
but carries its own qfactory.
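
The caching fix can be illustrated with a small sketch. Everything here is a toy stand-in (`ToyModule` and the string "quantizers" are hypothetical); it only shows why a cache check keyed on the state type is not enough when each recipe instance carries its own qfactory.

```python
class CustomRecipe:
    """Toy recipe: each instance carries its own quantizer factory."""

    def __init__(self, qfactory):
        self.qfactory = qfactory


class CustomRecipeState:
    """Toy state: one shared type for every CustomRecipe instance."""

    def __init__(self, recipe):
        self.recipe = recipe
        self.quantizers = [recipe.qfactory(role) for role in ("input", "weight")]


class ToyModule:
    """Hypothetical module that caches recipe state between calls."""

    def __init__(self):
        self._state = None

    def set_meta_tensor(self, recipe):
        # Checking only isinstance(self._state, CustomRecipeState) would
        # reuse the first recipe's quantizers for every later recipe.
        # Comparing the recipe instance rebuilds when it changes.
        if self._state is None or self._state.recipe is not recipe:
            self._state = CustomRecipeState(recipe)
        return self._state
```

Switching from an outer to an inner recipe, as nested te.autocast regions do, then yields quantizers built by the inner recipe's factory.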

Move the quantizer value-object tests into tests/pytorch/test_torch_compile.py
and add that file to the L0 pytorch unittest QA suite.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…globals

Follow-up to the value-opaque quantizer support:

- Remove the module-level _QUANTIZER_VALUE_REGISTRY (qualname -> class) and
  _quantizer_from_value_key. __fx_repr__ now captures the quantizer class
  directly in the FX globals and reconstructs via _rebuild_quantizer(cls, items),
  matching how PyTorch's own value opaque types (e.g. DTensor placements)
  reconstruct themselves. This removes global mutable state and the qualname
  collision risk.
- Consolidate the quantizer value-object tests in test_torch_compile.py down to
  two functions and exercise reconstruction through the public __fx_repr__ path
  instead of internal helpers.
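
As a rough sketch of the reconstruction path described above: the example assumes `__fx_repr__` returns a source string together with the globals needed to evaluate it, and that `_rebuild_quantizer(cls, items)` bypasses `__init__`. `ToyQuantizer` and its fields are hypothetical, and plain strings and ints stand in for real dtypes.

```python
def _rebuild_quantizer(cls, items):
    # Create the object without running __init__, then restore each field.
    obj = cls.__new__(cls)
    for name, value in items:
        object.__setattr__(obj, name, value)
    return obj


class ToyQuantizer:
    """Hypothetical value quantizer with two scalar fields."""

    def __init__(self, dtype, block_len):
        self.dtype = dtype
        self.block_len = block_len

    def _items(self):
        return (("dtype", self.dtype), ("block_len", self.block_len))

    def __fx_repr__(self):
        # The class object itself goes into the globals, so no
        # module-level name -> class registry is needed.
        cls = type(self)
        source = f"_rebuild_quantizer({cls.__name__}, {self._items()!r})"
        env = {"_rebuild_quantizer": _rebuild_quantizer, cls.__name__: cls}
        return source, env


quantizer = ToyQuantizer("float8_e4m3", 128)
source, env = quantizer.__fx_repr__()
rebuilt = eval(source, dict(env))  # evaluate the generated source text
```

Because the globals hold the class directly, two classes that happen to share a qualname cannot be confused during reconstruction.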

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Replace the single dynamo.py module with a dynamo/ package so the
torch.compile glue can grow with a clear responsibility split across the
stacked branches. This branch owns the value-opaque quantizer layer.

  * dynamo/quantizer_opaque.py -- register_value_opaque_quantizer and helpers
  * dynamo/__init__.py -- re-exports the public API so callers keep importing
    from transformer_engine.pytorch.dynamo unchanged

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
A value-opaque quantizer must not carry live distributed state. Scan the
quantizer attributes in __fx_repr__ and raise TypeError if any holds a
torch.distributed.ProcessGroup (e.g. a non-None deprecated amax_reduction_group),
so it cannot be silently baked into a torch.compile FX graph. Clarify the related
comments accordingly.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
NVFP4Quantizer is registered as a value-opaque quantizer but was missing
from the value-semantics / __fx_repr__ round-trip test. Add it to
_VALUE_QUANTIZERS (skipped without CUDA, which it needs to construct).

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…__/__hash__

The amax reduction group is excluded from the value key, so a value quantizer
that stored one would compare/hash equal to a groupless one and let torch.compile
reuse a graph that skips the reduction. __eq__/__hash__ now raise (mirroring
__fx_repr__, which already rejects any process-group-bearing quantizer). The
group should be passed per quantize call, not stored on the quantizer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Add is_value_opaque_quantizer() + the _te_compile_value_opaque flag stamped at
registration, so dynamo-traced code can detect registered quantizers (and fall
back to eager for unregistered ones).
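
A hedged sketch of the flag-based detection. The real `register_value_opaque_quantizer` does more than stamp a flag, and `choose_path` is a hypothetical helper added only to show the eager fallback. Note that in this toy version a subclass of a registered class inherits the flag.

```python
def register_value_opaque_quantizer(cls):
    """Toy registration: stamp a class-level flag and return the class."""
    cls._te_compile_value_opaque = True
    return cls


def is_value_opaque_quantizer(quantizer):
    # A plain class-attribute lookup; no global registry is consulted.
    return getattr(type(quantizer), "_te_compile_value_opaque", False)


@register_value_opaque_quantizer
class ToyRegisteredQuantizer:
    pass


class ToyCustomQuantizer:
    pass


def choose_path(quantizer):
    # Hypothetical dispatch: unregistered quantizers fall back to eager.
    return "compiled" if is_value_opaque_quantizer(quantizer) else "eager"
```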

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…fp4 value key

- Narrow register_opaque_type except to (RuntimeError, TypeError): the API is
  already imported above, so ImportError/AttributeError there only mask real errors.
- Add test_quantizer_value_object_fullgraph exercising torch.compile(fullgraph=True)
  end-to-end to verify opaque-type registration took effect.
- Restore missing NVFP4Quantizer._with_random_sign_mask assignment required by
  _value_fields()/_value_key().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL requested a review from ksivaman as a code owner June 29, 2026 09:36

pggPL commented Jun 29, 2026

Collaborator Author

/te-ci pytorch

greptile-apps Bot commented Jun 29, 2026

Contributor

Greptile Summary

This PR adds value-object semantics to tensorless quantizers in Transformer Engine, enabling torch.compile to treat them as baked-in constants rather than live objects that trigger graph breaks. Four quantizer classes (MXFP8Quantizer, Float8BlockQuantizer, Float8CurrentScalingQuantizer, NVFP4Quantizer) gain __eq__/__hash__/__fx_repr__ and are registered with PyTorch's opaque-object API. A CustomRecipe state-caching bug is also fixed, and test_torch_compile.py is wired into CI for the first time.

  • New dynamo/quantizer_opaque.py module implements the torch.compile glue: _rebuild_quantizer reconstructs value quantizers by bypassing __init__ and setting attributes from the stored value key; register_value_opaque_quantizer stamps a class as value-opaque and gracefully no-ops on older PyTorch builds without the opaque-object API.
  • Quantizer base class gains _BASE_VALUE_FIELDS, _value_fields(), _value_key(), __eq__, and __hash__; subclasses opt into value semantics by returning a non-None tuple from _value_fields(); process-group detection is centralized in _check_value_has_no_process_group to prevent live distributed state from being baked into FX graphs.
  • NVFP4Quantizer stores _with_random_sign_mask alongside the derived rht_matrix_random_sign_mask_t in the value key and restores the non-serializable rht_matrix tensor via a _rebuild_derived_state hook called by _rebuild_quantizer.

Confidence Score: 5/5

Safe to merge; all four tensorless quantizers are correctly reconstructed via _rebuild_quantizer, process-group detection is centralized and exhaustive, and the change is opt-in with identity fallback for Float8Quantizer and custom quantizers.

The value-key construction covers every field set by each quantizer's __init__; the _rebuild_derived_state hook correctly restores NVFP4's non-serializable rht_matrix tensor; registration is gracefully a no-op on older PyTorch builds; and the new test suite exercises both equality/hash semantics and bit-exact kernel round-trips. The only findings are comment inaccuracies with no runtime impact.

No files require special attention; nvfp4_tensor.py carries a mildly misleading comment about device-independence of rht_matrix_random_sign_mask_t but the logic is correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/dynamo/quantizer_opaque.py | New module implementing torch.compile glue; _rebuild_quantizer bypasses __init__ and restores fields; _quantizer_fx_repr generates eval-able code; register_value_opaque_quantizer gracefully handles missing PyTorch opaque-object API. |
| transformer_engine/pytorch/quantized_tensor.py | Adds _BASE_VALUE_FIELDS, _value_fields(), _value_key(), __eq__, __hash__ to Quantizer base, plus _contains_process_group helper. The other._value_fields() call in __eq__ on an object-typed parameter is a static-type concern (flagged in previous threads); functionally safe due to the prior type(self) is type(other) guard. |
| transformer_engine/pytorch/tensor/nvfp4_tensor.py | Adds _with_random_sign_mask storage, _rebuild_derived_state hook for rht_matrix, and _value_fields. Comment claims rht_matrix_random_sign_mask_t is device-independent but it is device-dependent when with_random_sign_mask=True; functionally correct but the comment is misleading. |
| transformer_engine/pytorch/tensor/float8_blockwise_tensor.py | Adds _value_fields() returning five scalar fields and calls register_value_opaque_quantizer. All fields set in __init__ are covered; no missing fields, no process-group attributes. |
| transformer_engine/pytorch/tensor/float8_tensor.py | Adds _value_fields() to Float8CurrentScalingQuantizer with four scalar fields, excluding amax_reduction_group intentionally. All __init__ fields are covered. |
| transformer_engine/pytorch/tensor/mxfp8_tensor.py | Adds _value_fields() returning just ("dtype",) and registers as value-opaque. Simple and complete; MXFP8Quantizer has no amax_reduction_group or process-group attrs. |
| tests/pytorch/test_torch_compile.py | Adds comprehensive value-object tests: identity/hash/equality checks, __fx_repr__ round-trip via eval, bit-exact quantize kernel comparison on real hardware, process-group rejection tests, and fullgraph compile tests with custom ops. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Quantizer subclass calls\nregister_value_opaque_quantizer] --> B{PyTorch opaque-object\nAPI available?}
    B -- No --> C[Attach __fx_repr__ only\neager value semantics work]
    B -- Yes --> D[register_opaque_type with typ=value]
    D --> E[torch.compile treats\nquantizer as FX constant]

    F[__eq__ / __hash__ called] --> G{_value_fields\nreturns None?}
    G -- Yes --> H[Identity semantics\nobject.__hash__]
    G -- No --> I[_value_key called\n_check_value_has_no_process_group]
    I --> J{ProcessGroup\nfound in vars?}
    J -- Yes --> K[raise TypeError]
    J -- No --> L[Return qualname + tuple\nof name/value pairs]

    M[FX codegen calls __fx_repr__] --> N[_quantizer_fx_repr generates\n_rebuild_quantizer call string]
    N --> O[eval rebuilds quantizer\nvia _rebuild_quantizer]
    O --> P[bypass __init__\nobject.__setattr__ for each field]
    P --> Q{_rebuild_derived_state\nexists?}
    Q -- Yes --> R[Rebuild derived tensors\ne.g. NVFP4 rht_matrix]
    Q -- No --> S[Done]
    R --> S

Reviews (5): Last reviewed commit: "Cover is_opaque_value_type with the impo..."

Comment thread transformer_engine/pytorch/tensor/nvfp4_tensor.py
Comment thread transformer_engine/pytorch/quantized_tensor.py
…trip

_rebuild_quantizer only restores value-key fields, so a reconstructed
NVFP4Quantizer was missing the derived rht_matrix tensor (not hashable, so not
in the value key) and failed at copy()/quantize time. Add a _rebuild_derived_state
hook (called by _rebuild_quantizer) that NVFP4Quantizer uses to rebuild rht_matrix
from _with_random_sign_mask (lru_cache -> cheap).

Extend test_quantizer_value_object to also quantize with the original and the
rebuilt quantizer and require bit-exact results (gated on HW support), so a
field the kernel needs but the value key omits can no longer slip through.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

@kshitij12345 kshitij12345 left a comment

Contributor

Overall LGTM, would be good to resolve the inline comments before merging.

Comment thread transformer_engine/pytorch/dynamo/quantizer_opaque.py
Comment thread transformer_engine/pytorch/dynamo/quantizer_opaque.py Outdated
pggPL and others added 2 commits June 29, 2026 14:46
Move the ProcessGroup guard out of the (overridable) __fx_repr__ into
Quantizer._value_key -- the single point every value-materialization path
(__eq__/__hash__/__fx_repr__) goes through -- so a custom __fx_repr__ can no
longer bypass it. Generalizes the old amax-only check to any field holding a
ProcessGroup. Add a test that a value quantizer carrying a live group raises.
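
A small sketch of why putting the guard in `_value_key` closes the bypass. `FakeProcessGroup` is a hypothetical stand-in for torch.distributed.ProcessGroup, `GuardedQuantizer` is a toy class, and the scan only looks at top-level attribute values, which may be narrower than the real check.

```python
class FakeProcessGroup:
    """Hypothetical stand-in for torch.distributed.ProcessGroup."""


class GuardedQuantizer:
    """Toy value quantizer whose value key excludes the group."""

    _fields = ("dtype",)

    def __init__(self, dtype, amax_reduction_group=None):
        self.dtype = dtype
        self.amax_reduction_group = amax_reduction_group

    def _value_key(self):
        # Every value-materialization path goes through here, so the
        # guard cannot be skipped by overriding __fx_repr__.
        for name, value in vars(self).items():
            if isinstance(value, FakeProcessGroup):
                raise TypeError(
                    f"{type(self).__name__}.{name} holds a process group; "
                    "pass the group per quantize call instead"
                )
        pairs = tuple((name, getattr(self, name)) for name in self._fields)
        return (type(self).__qualname__, pairs)

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        return self._value_key() == other._value_key()

    def __hash__(self):
        return hash(self._value_key())

    def __fx_repr__(self):
        # Even a customized repr builds on the guarded key.
        return repr(self._value_key())
```

A quantizer constructed with a `FakeProcessGroup` then raises TypeError from `hash()`, `==`, and `__fx_repr__()` alike, rather than silently comparing equal to a groupless one.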

Addresses review on NVIDIA#3152.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…assthrough

Replace the trivial pass-through fullgraph test with one that drives each
production quantizer through a minimal custom op (quantize + dequantize) under
torch.compile(fullgraph=True) and compares to eager -- so the opaque-type
registration is actually exercised inside the graph (a graph break would make
fullgraph=True raise). Op registration sits right before the test. Also drop
stale comments referencing the old __fx_repr__-side process-group guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Comment thread transformer_engine/pytorch/dynamo/quantizer_opaque.py Outdated
pggPL and others added 3 commits June 29, 2026 15:29
…paque flag

- rht_matrix_random_sign_mask_t is a device-independent int derived from
  _with_random_sign_mask (the device only places a throwaway tensor); fix the
  misleading comment.
- Explain why registration uses a class attribute, not a registry set:
  is_value_opaque_quantizer is traced inside the compile graph and dynamo can
  bake a getattr constant but cannot do 'type(q) in set' on the opaque class.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
is_opaque_value_type(cls) sat between the import guard and the
register_opaque_type guard, so on a partial/experimental opaque-object build it
could raise RuntimeError/TypeError and crash TE import. Move it inside the same
except so the 'registration never crashes import' promise holds for both calls.
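
The guarded registration can be sketched like this. The `load_api` parameter is a hypothetical injection point added so the example runs without PyTorch; it stands in for importing `is_opaque_value_type` and `register_opaque_type`, and the `typ="value"` keyword follows the flowchart earlier in this thread rather than a verified signature.

```python
def register_value_opaque_quantizer(cls, load_api):
    """Return True if cls ends up registered, False on a graceful no-op."""
    try:
        is_opaque_value_type, register_opaque_type = load_api()
    except ImportError:
        # Older build without the opaque-object API.
        return False
    try:
        # Both calls sit inside the same guard, so a partial or
        # experimental build cannot crash the import with either one.
        if not is_opaque_value_type(cls):
            register_opaque_type(cls, typ="value")
    except (RuntimeError, TypeError):
        return False
    return True


class ToyQuantizer:
    pass


def missing_api():
    raise ImportError("opaque-object API not available")


def partial_api():
    def is_opaque_value_type(cls):
        raise RuntimeError("experimental build")

    def register_opaque_type(cls, typ):
        raise AssertionError("should not be reached")

    return is_opaque_value_type, register_opaque_type


registered = set()


def working_api():
    def is_opaque_value_type(cls):
        return cls in registered

    def register_opaque_type(cls, typ):
        registered.add(cls)

    return is_opaque_value_type, register_opaque_type
```

Only the narrow exception types are caught, so a genuine bug such as an AttributeError in the registration code still surfaces.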

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL commented Jun 29, 2026

Collaborator Author

/te-ci pytorch

2 participants