cmd/compile/internal/arm64: fuse adjacent spill/reload STR/LDR into STP/LDP #79689

Open
gaul wants to merge 1 commit into golang:master from gaul:arm64/ldp-stp
gaul commented May 27, 2026

Contributor

The SSA pair pass (cmd/compile/internal/ssa/pair.go) runs before regalloc
and only sees source-level loads and stores. Spill/reload code that
regalloc inserts later for OpStoreReg/OpLoadReg becomes individual STR/LDR
instructions that never get a chance to be paired, even when two spills
target adjacent 8-byte stack slots.

Fuse those pairs as the final step of code generation, in the compiler
rather than the assembler. A new ssagen.ArchInfo.SSAGenFinish hook runs
after genssa has emitted all of a function's Progs, resolved its branch
and jump-table targets, and finalized the frame size in defframe (so the
register-argument spills defframe inserts participate too); on arm64 it
walks the Prog list and rewrites strictly-adjacent AMOVD spill pairs that
share a base register and have consecutive 8-byte offsets into a single
ASTP/ALDP. The second Prog is reduced to a 0-byte ANOP rather than
unlinked so that branch targets referencing it remain valid. The pass is
skipped under -N to keep unoptimized builds unoptimized. Doing this in
the compiler keeps the assembler a simple translator.
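
The rewrite described above can be sketched on a toy instruction list. This
is only an illustration under stated assumptions, not the code in this PR:
the `inst` and `opcode` types and the `fusePairs` function are hypothetical
stand-ins for obj.Prog and pairSpills, offsets are assumed to be already
resolved, and only ascending offsets are handled.

```go
package main

import "fmt"

// opcode is a toy opcode set; the real pass works on obj.Prog values.
type opcode int

const (
	opStore     opcode = iota // stands in for a MOVD spill (register to stack)
	opLoad                    // stands in for a MOVD reload (stack to register)
	opStorePair               // stands in for STP
	opLoadPair                // stands in for LDP
	opNop                     // zero-size placeholder left behind after fusion
)

// inst is a hypothetical, heavily simplified instruction record.
type inst struct {
	op       opcode
	base     int   // base register number
	off      int64 // resolved byte offset from base
	reg      int   // data register
	reg2     int   // second data register (pairs only)
	isTarget bool  // a branch or jump table points at this instruction
	next     *inst
}

// fusePairs rewrites strictly adjacent 8-byte spills or reloads into pairs
// and returns the number of fusions performed.
func fusePairs(head *inst) int {
	fused := 0
	for p := head; p != nil && p.next != nil; p = p.next {
		q := p.next
		if p.op != q.op || (p.op != opStore && p.op != opLoad) {
			continue
		}
		// Same base, consecutive slots, second instruction not a jump target.
		if p.base != q.base || q.off != p.off+8 || q.isTarget {
			continue
		}
		// The pair immediate must be encodable: [-512, 504], multiple of 8.
		if p.off < -512 || p.off > 504 || p.off%8 != 0 {
			continue
		}
		// Loads need distinct destinations, and the first must not clobber base.
		if p.op == opLoad && (p.reg == q.reg || p.reg == p.base) {
			continue
		}
		if p.op == opStore {
			p.op = opStorePair
		} else {
			p.op = opLoadPair
		}
		p.reg2 = q.reg
		q.op = opNop // keep the node linked so anything referencing it stays valid
		fused++
	}
	return fused
}

func main() {
	ld2 := &inst{op: opLoad, base: 31, off: 40, reg: 4, isTarget: true}
	ld1 := &inst{op: opLoad, base: 31, off: 32, reg: 3, next: ld2}
	st2 := &inst{op: opStore, base: 31, off: 24, reg: 2, next: ld1}
	st1 := &inst{op: opStore, base: 31, off: 16, reg: 1, next: st2}
	fmt.Println("fused pairs:", fusePairs(st1))
	fmt.Println("store pair:", st1.op == opStorePair, "nop:", st2.op == opNop)
}
```

In this sketch the two stores fuse, while the two loads do not because the
second load is marked as a branch target. Turning the second node into a
no-op instead of unlinking it mirrors the ANOP choice described above.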

Fusion is gated on several safety and profitability conditions:

  • same base register and same Addr.Name (AUTO or PARAM, the only
    classes spill slots use), distinct destination registers (LDP with
    Rt1 == Rt2 is CONSTRAINED UNPREDICTABLE), and no pre/post-index or
    register-offset addressing
  • the resolved offset must encode in LDP/STP's signed 7-bit scaled
    immediate, [-512, 504]: the assembler rewrites an AUTO offset to
    off+framesize+8 and a PARAM offset to off+framesize+24 or +32
    depending on frame alignment (off+8 in a frameless leaf, which is
    not decided until assembly, so a frameless PARAM must fit both).
    Checking the resolved value against the final frame size both
    admits deep spill slots in large frames, where spills are most
    common, and refuses fusions that would need an
    assembler-synthesized address (ADD + LDP), which is no smaller
    than the original pair and serializes through REGTMP
  • the first load of a pair must not write the base register: executed
    sequentially, the second load computes its address from the
    just-loaded value, while LDP computes both addresses from the
    original base
  • the second instruction is not a branch or jump-table target
    (otherwise paths that jump directly to it would skip the work the
    LDP/STP now does at the first instruction's position) and does not
    carry a statement boundary: genssa promotes instructions it reuses
    as inline marks to statements, and inline marks must never become
    zero-sized, while plain statement boundaries must keep their line
    table entries
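
The offset condition can be made concrete with a small calculation. The
sketch below is an assumption-laden illustration rather than the PR's code:
`fitsPairImm` and `resolveAuto` are hypothetical helper names, and only the
AUTO adjustment quoted above (off+framesize+8) is modeled.

```go
package main

import "fmt"

// fitsPairImm reports whether a byte offset is encodable in the signed
// 7-bit immediate of a 64-bit LDP/STP, which is scaled by 8.
func fitsPairImm(off int64) bool {
	return off >= -512 && off <= 504 && off%8 == 0
}

// resolveAuto applies the AUTO adjustment quoted above: off+framesize+8.
func resolveAuto(off, framesize int64) int64 {
	return off + framesize + 8
}

func main() {
	const framesize = 1024 // example frame size, chosen for illustration
	for _, off := range []int64{-600, -592, -16} {
		r := resolveAuto(off, framesize)
		fmt.Printf("auto off=%d resolved=%d fits=%v\n", off, r, fitsPairImm(r))
	}
}
```

With a 1024-byte frame, the AUTO slots at -600 and -592 resolve to 432 and
440 and stay within range, while the slot at -16 resolves to 1016 and would
need a synthesized address, so a pair starting there would be left alone.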

The prologue's register-argument spills around morestack, which the
assembler inserts during preprocess, are already emitted as STP/LDP
pairs (CL 621556).

TestPairSpills in cmd/compile/internal/arm64 drives pairSpills directly
with hand-constructed Prog chains, asserting the fused operands and
covering each fusion path and each gating condition.
test/codegen/memcombine.go pins down the spill/reload pattern that the
SSA pair pass misses but this pass catches.
test/fixedbugs/spillreload_arm64_pair.go exercises the conditional-call-
with-adjacent-reloads pattern from runtime.schedule that miscompiled
before the branch-target check was added. BenchmarkSpillReloadPair in
cmd/compile/internal/test improves from about 1.92 to 1.78 ns/op on an
Apple M4 Max (~7%).

armlint reports that adjacent STR/LDR pairings drop from 4354 to 235 on
gofmt (a 94.6% reduction) and from 26022 to 772 on cmd/go (a 97.0%
reduction). The text section shrinks by 16720 bytes (1.40%) on gofmt and
by 101312 bytes (1.52%) on cmd/go.

@gopherbot

Contributor

This PR (HEAD: 8da7882) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

@gopherbot

Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
After addressing review feedback, remember to publish your drafts!

@gopherbot

Contributor

Message from Keith Randall:

Patch Set 1:

(1 comment)



@gopherbot

Contributor

Message from Cherry Mui:

Patch Set 1:

(1 comment)



gaul force-pushed the arm64/ldp-stp branch 2 times, most recently from bcc3282 to 8a00d61, on June 4, 2026 16:43
@gopherbot

Contributor

Message from Andrew Gaul:

Patch Set 1:

(1 comment)



@gopherbot

Contributor

This PR (HEAD: 8a00d61) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.


@gopherbot

Contributor

Message from Keith Randall:

Patch Set 2:

(16 comments)



gaul changed the title from "cmd/internal/obj/arm64: fuse adjacent spill/reload LDR/STR into LDP/STP" to "cmd/compile/internal/arm64: fuse adjacent spill/reload STR/LDR into STP/LDP" on Jun 11, 2026
@gopherbot

Contributor

Message from Andrew Gaul:

Patch Set 2:

(17 comments)



@gopherbot

Contributor

This PR (HEAD: b153dbd) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.

