cmd/compile/internal/arm64: fuse adjacent spill/reload STR/LDR into STP/LDP #79689
gaul wants to merge 1 commit into
This PR (HEAD: 8da7882) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
Message from Keith Randall: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
Message from Cherry Mui: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
bcc3282 to 8a00d61
Message from Andrew Gaul: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
This PR (HEAD: 8a00d61) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.
Message from Keith Randall: Patch Set 2: (16 comments) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
The SSA pair pass (cmd/compile/internal/ssa/pair.go) runs before regalloc
and only sees source-level loads and stores. Spill/reload code that
regalloc inserts later for OpStoreReg/OpLoadReg becomes individual STR/LDR
instructions that never get a chance to be paired, even when two spills
target adjacent 8-byte stack slots.
Fuse those pairs as the final step of code generation, in the compiler
rather than the assembler. A new ssagen.ArchInfo.SSAGenFinish hook runs
after genssa has emitted all of a function's Progs, resolved its branch
and jump-table targets, and finalized the frame size in defframe (so the
register-argument spills defframe inserts participate too); on arm64 it
walks the Prog list and rewrites strictly-adjacent AMOVD spill pairs that
share a base register and have consecutive 8-byte offsets into a single
ASTP/ALDP. The second Prog is reduced to a 0-byte ANOP rather than
unlinked so that branch targets referencing it remain valid. The pass is
skipped under -N to keep unoptimized builds unoptimized. Doing this in
the compiler keeps the assembler a simple translator.
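The walk described above can be sketched roughly as follows. This is a minimal illustration, not the code in the CL: `Op`, `Prog`, the `Target` flag, and `pairStores` are hypothetical stand-ins for the real `cmd/internal/obj` types, and the sketch only handles stores whose offsets ascend. It shows the two ideas from the paragraph: fusing strictly adjacent 8-byte stores that share a base register, and turning the second Prog into a NOP instead of unlinking it.

```go
package main

import "fmt"

// Op and Prog are simplified stand-ins invented for this sketch.
// They are NOT the real cmd/internal/obj types.
type Op int

const (
	Nop   Op = iota // zero-size placeholder
	Str             // 8-byte store: STR Reg, [Base, #Off]
	Stp             // paired store: STP Reg, Reg2, [Base, #Off]
	Other           // anything else
)

type Prog struct {
	Op        Op
	Reg, Reg2 int   // source register(s)
	Base      int   // base register
	Off       int64 // byte offset from Base
	Target    bool  // hypothetical flag: a branch jumps to this Prog
	Next      *Prog
}

// pairStores fuses strictly adjacent 8-byte stores that share a base
// register and whose offsets are consecutive (second = first + 8).
// The second Prog becomes a NOP and stays linked, so any pointer that
// already refers to it remains valid. It returns the number of fusions.
func pairStores(first *Prog) int {
	fused := 0
	for p := first; p != nil && p.Next != nil; p = p.Next {
		q := p.Next
		if p.Op != Str || q.Op != Str {
			continue
		}
		if p.Base != q.Base || q.Off != p.Off+8 || q.Target {
			continue
		}
		p.Op = Stp
		p.Reg2 = q.Reg
		*q = Prog{Op: Nop, Next: q.Next} // keep the node, drop its work
		fused++
	}
	return fused
}

func main() {
	tail := &Prog{Op: Other}
	second := &Prog{Op: Str, Reg: 2, Base: 31, Off: 24, Next: tail}
	first := &Prog{Op: Str, Reg: 1, Base: 31, Off: 16, Next: second}
	n := pairStores(first)
	fmt.Println(n, first.Op == Stp, second.Op == Nop, first.Next == second)
}
```

Keeping the NOP linked costs one list node but avoids having to find and retarget every reference to the removed instruction.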
Fusion is gated on several safety and profitability conditions:
- same base register and same Addr.Name (AUTO or PARAM, the only
  classes spill slots use), distinct destination registers (LDP with
  Rt1 == Rt2 is CONSTRAINED UNPREDICTABLE), and no pre/post-index or
  register-offset addressing
- the resolved offset must encode in LDP/STP's signed 7-bit scaled
  immediate, [-512, 504]: the assembler rewrites an AUTO offset to
  off+framesize+8 and a PARAM offset to off+framesize+24 or +32
  depending on frame alignment (off+8 in a frameless leaf, which is
  not decided until assembly, so a frameless PARAM must fit both).
  Checking the resolved value against the final frame size both
  admits deep spill slots in large frames, where spills are most
  common, and refuses fusions that would need an
  assembler-synthesized address (ADD + LDP), which is no smaller
  than the original pair and serializes through REGTMP
- the first load of a pair must not write the base register: executed
  sequentially, the second load computes its address from the
  just-loaded value, while LDP computes both addresses from the
  original base
- the second instruction is not a branch or jump-table target
  (otherwise paths that jump directly to it would skip the work the
  LDP/STP now does at the first instruction's position) and does not
  carry a statement boundary: genssa promotes instructions it reuses
  as inline marks to statements, and inline marks must never become
  zero-sized, while plain statement boundaries must keep their line
  table entries
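Two of the conditions above lend themselves to a small worked example. The sketch below is illustrative only: `fitsPairImm`, `resolveAuto`, the toy `machine` type, and `clobberDemo` are hypothetical names, and the frame size and offsets in `main` are assumed values. The first part checks the range of a signed 7-bit immediate scaled by 8, which is 8 × [-64, 63] = [-512, 504], against an AUTO offset resolved as off+framesize+8. The second part models why a first load that overwrites the base register cannot be fused.

```go
package main

import "fmt"

// fitsPairImm reports whether off is encodable as the offset of a
// 64-bit LDP/STP: a multiple of 8 within [-512, 504].
func fitsPairImm(off int64) bool {
	return off%8 == 0 && off >= -512 && off <= 504
}

// resolveAuto applies the AUTO rewrite described above.
func resolveAuto(off, framesize int64) int64 {
	return off + framesize + 8
}

// machine is a toy model with four registers and word-addressed memory.
type machine struct {
	reg [4]int64
	mem map[int64]int64
}

// ldr models LDR Rt, [Rb, #off].
func (m *machine) ldr(rt, rb int, off int64) {
	m.reg[rt] = m.mem[m.reg[rb]+off]
}

// ldp models LDP Rt1, Rt2, [Rb, #off]: both addresses are computed
// from the base value as it was before either register is written.
func (m *machine) ldp(rt1, rt2, rb int, off int64) {
	base := m.reg[rb]
	m.reg[rt1] = m.mem[base+off]
	m.reg[rt2] = m.mem[base+off+8]
}

// clobberDemo runs "LDR R0,[R0]; LDR R1,[R0,#8]" and the would-be
// fused "LDP R0,R1,[R0]" on identical state and returns R1 for each.
func clobberDemo() (sequential, paired int64) {
	newMachine := func() *machine {
		m := &machine{mem: map[int64]int64{100: 200, 108: 7, 208: 9}}
		m.reg[0] = 100
		return m
	}
	s := newMachine()
	s.ldr(0, 0, 0) // R0 now holds the loaded value 200
	s.ldr(1, 0, 8) // address is 200+8, not 100+8
	p := newMachine()
	p.ldp(0, 1, 0, 0) // both addresses derive from the original 100
	return s.reg[1], p.reg[1]
}

func main() {
	const framesize = 1024 // assumed frame size for illustration
	for _, off := range []int64{-600, -8} {
		r := resolveAuto(off, framesize)
		fmt.Println(off, r, fitsPairImm(r))
	}
	fmt.Println(clobberDemo())
}
```

With the assumed 1024-byte frame, the slot at -600 resolves to an encodable offset while the slot at -8 does not, and the two load sequences leave different values in R1, so the fused form would change behavior.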
The prologue's register-argument spills around morestack, which the
assembler inserts during preprocess, are already emitted as STP/LDP
pairs (CL 621556).
TestPairSpills in cmd/compile/internal/arm64 drives pairSpills directly
with hand-constructed Prog chains, asserting the fused operands and
covering each fusion path and each gating condition.
test/codegen/memcombine.go pins down the spill/reload pattern that the
SSA pair pass misses but this pass catches.
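For orientation, the kind of source that tends to produce adjacent spills and reloads looks something like the sketch below. It is not the test added by this change: `opaque` and `sum4` are hypothetical names, and whether a particular compiler version spills exactly these values to adjacent slots is an assumption, not a guarantee. The idea is only that several values are live across a call that cannot be inlined, so they have to survive the call somewhere other than in registers.

```go
package main

import "fmt"

// opaque is a call the compiler may not inline, so values that are
// live across it cannot simply stay in registers.
//
//go:noinline
func opaque() {}

// sum4 keeps a, b, c and d live across the call to opaque. On arm64
// these values are expected to be spilled before the call and
// reloaded after it; the spill slots are chosen by regalloc.
//
//go:noinline
func sum4(a, b, c, d int64) int64 {
	opaque()
	return a + b + c + d
}

func main() {
	fmt.Println(sum4(1, 2, 3, 4))
}
```

Compiling with `GOARCH=arm64 go build -gcflags=-S` prints the generated assembly, which is one way to see whether the stores and loads around the call come out as separate instructions or as pairs.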
test/fixedbugs/spillreload_arm64_pair.go exercises the conditional-call-
with-adjacent-reloads pattern from runtime.schedule that miscompiled
before the branch-target check was added. BenchmarkSpillReloadPair in
cmd/compile/internal/test improves from about 1.92 to 1.78 ns/op on an
Apple M4 Max (~7%).
armlint reports that adjacent STR/LDR pairings drop from 4354 to 235 on
gofmt (94.6% reduction) and from 26022 to 772 on cmd/go (97.0% reduction).
The text section shrinks by 16720 bytes (1.40%) on gofmt and 101312
bytes (1.52%) on cmd/go.
Message from Andrew Gaul: Patch Set 2: (17 comments) Please don’t reply on this GitHub thread. Visit golang.org/cl/783660.
This PR (HEAD: b153dbd) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/783660.