Skip to content

Performance Series 2: Overview#3879

Draft
unp1 wants to merge 18 commits into
mainfrom
bubel/perf-preview
Draft

Performance Series 2: Overview#3879
unp1 wants to merge 18 commits into
mainfrom
bubel/perf-preview

Conversation

@unp1

@unp1 unp1 commented Jun 30, 2026

Copy link
Copy Markdown
Member

Intended Change

Overview / integration branch for Performance Series 2 — a set of behaviour-preserving performance improvements to the KeY prover, used here for the combined end-to-end benchmark and the determinism check before any part merges to main. The individual changes are:

This branch additionally carries the runAllProofs per-fork statistics tooling used for measurement.

Benchmark — combined effect

Automode time, median of 3 runs, serial forks (-PrapForks=1), isolated user.home per run; baseline = merge-base 993820ec37.

perfTest problem baseline (ms) this branch (ms) Δ
PUZ031p1 13712 7399 −46%
SimplifiedLinkedList.remove 17469 12996 −26%
Saddleback_search 13026 9656 −26%
SYN550p1 821 606 −26%
symmArray 12737 9912 −22%
coincidence_count/project 2181 1890 −13%
gemplusDecimal/add 8378 7595 −9%
ArrayList.remove.1 3339 3176 −5%
Total 71663 53230 −26% (1.35×)

Per-PR attribution benchmarks are being added to the individual PRs.

Type of pull request

  • Refactoring (behaviour should not change or only minimally change)
  • There are changes to the (Java) code
  • There are changes to the deployment/CI infrastructure (runAllProofs measurement tooling)

Ensuring quality

  • Behaviour-preserving across the series (identical proof node counts in runAllProofs).
  • A REPEATS=3 determinism corpus run is the gate before any of this merges to main.

Additional information and contact(s)

Umbrella/integration branch; the individual changes are reviewed in #3874#3878.

Created with AI tooling support.

The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.

unp1 added 18 commits June 30, 2026 23:30
RenamingTermProperty.equalsModThisProperty (equality modulo bound renaming) compares two
terms with the O(term) unifyHelp recursion. It is hot across the prover -- the quantifier
cost heuristic (PredictCostProver.directConsequenceOrContradictionOfAxiom), taclet matching
(MatchSchemaVariableInstruction), and sequent-redundancy checks all call it -- and it shows
up as a top self-time leaf on arith/quant-heavy proofs.

Cache each term's renaming-invariant hashCode (RenamingTermProperty.hashCodeModThisProperty)
lazily on TermImpl, exposed as JTerm#hashCodeModRenaming, and use it as an O(1) reject in
equalsModThisProperty: if the two hashes differ the terms cannot be equal modulo renaming,
so the unifyHelp walk is skipped. Terms that ARE equal modulo renaming share the hash (by
the hashCodeModThisProperty / equalsModThisProperty contract), so the filter never rejects a
true match -- behaviour-preserving.

Proofs stay byte-identical (runAllProofs green; symmArray/SLL/saddleback keep identical node
and rule-app counts) while Automode time drops ~13% on those proofs. Cost: one cached int
per TermImpl.

Created with AI tooling support
When a NoPosTacletApp is added, addedNoPosTacletApp re-walks the affected
formulas and, at every position, fetched that position's full operator
taclet list and filtered it through a SetRuleFilter to keep the single new
taclet. For common operators that list is large, so this membership scan
dominated re-indexing (~12.8% of CPU on the wide-branching bike example;
60.7% of all HashMap lookups).

Instead, feed the walk a one-taclet index built from just the new taclet,
reusing the identical getList/match logic, so matchTaclets iterates ~one
candidate per position. The single-threaded index is used deliberately:
the multi-threaded matcher only parallelizes above 256 taclets, so a
one-taclet index always takes its sequential path anyway.

Byte-identical -- the multi-taclet addedNoPosTacletApps path is unchanged.
New differential + unit tests in TestTermTacletAppIndex compare the
mini-index re-index against the SetRuleFilter path (identical fired apps
and apps at every position); full testRAP is green. On bike this cuts
automode time ~16% (HashMap.getNode 12.1% -> 4.9%).

Created with AI tooling support
Pin the current TriggersSet output for representative quantifier shapes (uni/multi/powerset, equality exclusion, negated implication guards, existential variables, conjunction clauses, and the no-outer-variable case) ahead of a rename/cleanup pass. A behavior-preservation net, not a claim the current output is ideal.

Created with AI tooling support
The three-argument intersect computed all three sizes from set0, so the 'intersect the two smaller sets first' optimization never engaged. Use the respective set sizes. The three-way intersection is order-independent so the result is unchanged; the only caller (SplittableQuantifiedFormulaFeature) inspects just the result's emptiness/size.

Created with AI tooling support
Rename for readability (triggers->elements, qvs->clauseVariables, setMultiSubstitution->combineElementSubstitutions, unifySubstitution->mergeIfCompatible, cryptic locals) and document the class. No behavior change.

Created with AI tooling support
Rename uqvs->universalVariables and the matching helpers, and turn the cycle detection copied from EqualityConstraint into readable, documented containsCycle/reachesItself. No behavior change.

Created with AI tooling support
Rename the trigger-discovery methods and locals (collectUniversalVariables, addMaximalUniTriggers, buildCoveringMultiTriggers, tryAddCoveringMultiTrigger, clauseVariables, ...), rename inner ClauseTrigger to ClauseTriggerFinder, document the multi-trigger powerset pruning, and fix smells (HashSet-declared LinkedHashSet, unused import, generic-array suppression). No behavior change.

Created with AI tooling support
Substitution.applyWithoutCasts applied each variable in its own
ClashFreeSubst pass -- one full traversal of the term per variable. As
the substitution is ground, all variables can be replaced in a single
shadow-aware traversal, keeping the per-variable path only as a fallback
for the rare cast case. Behaviour-preserving; cuts substitution time
roughly threefold on quantifier-heavy proofs.

Created with AI tooling support
When estimating quantifier instantiation cost, PredictCostProver calls
HandleArith.provedByArith for every literal/axiom pair. For a
non-arithmetic literal it allocated a cache key, took the cache lock and
ran formatArithTerm only to return the literal unproved -- and such
literals dominate (~98% of calls even on arithmetic proofs). Both entry
points now bail on a cheap arithmetic-shape check (>=, <=, plus = for the
single-argument variant), exactly the condition under which the full
prover would have returned the literal unchanged. Behaviour-preserving;
cuts refinement time by roughly a third.

Created with AI tooling support
AbstractMonomialSmallerThanFeature.introductionTime walks the entire
appliedRuleApps() history to find the polySimp_newSmallSym taclet that
introduced an op (used to order monomial/atom factors). The common
monomial atoms -- program variables and ordinary functions -- are never
introduced that way, yet hit the full O(history) walk on every comparison
and, since the "not introduced" answer (-1) is deliberately not cached,
re-walked it every time. On arith-heavy proofs this was a top self-time
hotspot (inNewSmallSymRuleSet 4-8%).

A polySimp_newSmallSym taclet introduces its symbol as a SkolemTermSV
instantiation, which is always a skolem-constant function
(TacletApp.createSkolemConstant). So an op that is not a skolem-constant
function can never have been introduced by one -- its time is -1 with no
history walk. Short-circuit those.

Behaviour-preserving: only ops that are structurally impossible to
introduce are short-circuited (a newSmallSym taclet always mints a fresh
skolem function, never an existing non-skolem op), so every result is
unchanged. runAllProofs stays green (every closed=0 is a declared
notprovable); symmArray/SLL/saddleback keep identical node and rule-app
counts while Automode time drops ~8% on those arith proofs. Covers
AtomsSmallerThanFeature too (shared base).

Created with AI tooling support
QueueRuleApplicationManager parks assumes-incomplete bases under each of
their wake operators in parkedByOp. The buckets were LinkedHashSets, but
RuleAppContainer uses identity equality, a base is parked at most once
per bucket (removed from all its buckets when woken; see unindexParked),
and the cross-operator de-duplication lives in the transient `woken` set
-- so the buckets never actually de-dup, yet a set spends a 16-slot table
plus a 40-byte Entry per element to hold a handful of refs. Use plain
ArrayLists instead (also in the per-branch clone()).

On the wide-branching bike example this removes ~70-85 MB at the
381-open-goal peak (~12-15% of the live set): per-container map-entry
overhead drops 2.76 -> 1.04, the ~206k per-op LinkedHashSets collapse,
and HashMap$Node[] arrays fall 179k -> 56k. Byte-identical (full testRAP
green; bike still ends at 26 goals / 100k steps).

Created with AI tooling support
The comparison-result cache grew to 100000 entries and then dropped all
of them at once, discarding the hot entries on every overflow and
averaging ~50000 live entries. Replace it with an LRUCache at a smaller
steady bound (default 10000, tunable via -Dkey.lexpath.cachesize).

10000 is well below the old average -- a memory saving -- and A/B-safe:
on the heaviest LexPath proof (the wide-branching bike example) shrinking
the cache 100000 -> 1000 cost <2% automode time, so the cache barely
helps. Byte-identical (pure memoization; full testRAP green).

Created with AI tooling support
Splitting a goal clones its TacletIndex, which deep-copied the rwList /
antecList / succList maps even though a child goal usually only reads
them. Wrap them in a copy-on-write map (CopyOnWriteIndexMap) that shares
the backing map across copy() and clones lazily on the first mutation, so
sibling goals share one map until one of them adds or removes a taclet.

A small structural win -- the heap profile shows the index is a minor
slice of per-goal state -- kept for the reduced clone cost. Byte-identical
(full testRAP green).

Created with AI tooling support
The per-operator parking buckets are ArrayLists for the memory saving, but
unindexParked removes by value (O(n)). A few operators collect thousands of
parked bases (the update operator on long straight-line proofs), so one giant
bucket made wake/unpark O(n^2) -- a 2.7x regression on
performance-test/updateSimplification/straightline_8000. Promote a bucket to a
LinkedHashSet once it grows past 16 elements (O(1) removal); small buckets stay
ArrayLists. Both insertion-ordered, so byte-identical (full testRAP green).
straightline_8000 32953->12817 ms (~main). Squash into the parking-buckets commit.

Created with AI tooling support
@unp1 unp1 force-pushed the bubel/perf-preview branch from 5daf836 to dc3f267 Compare June 30, 2026 21:41
unp1 added a commit that referenced this pull request Jun 30, 2026
LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature),
shared across all goals, and the parallel prover evaluates rule-application cost
concurrently -- but its two caches were not thread-safe: the comparison-result
cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a
WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent
access could corrupt either.

The MT stress gate never caught this: its proofs are heap-shaped (symmArray,
SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features
that use LexPathOrdering.

 * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the
   previously unbounded growth)
 * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys,
   makes each op atomic)
 * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature
   caches are exercised under concurrency

Surfaced while reviewing Performance Series 2 (PR #3879), which changes the
comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.
unp1 added a commit that referenced this pull request Jul 1, 2026
ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution
macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which
run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp
on every worker, off the commit lock, so its two caches raced:
 * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption)
   -> ConcurrentLruCache.
 * the single-element sequent cache (sequent + sequentValue) was updated non-
   atomically -> a torn read could return another sequent's value (a *correctness*
   bug, not just variance) -> now a per-thread ThreadLocal.

Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps):
the heap/try-close proofs did not exercise the SE cost-feature caches, which is
why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a
consistent open-goal count across reps.

Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).
unp1 added a commit that referenced this pull request Jul 2, 2026
LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature),
shared across all goals, and the parallel prover evaluates rule-application cost
concurrently -- but its two caches were not thread-safe: the comparison-result
cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a
WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent
access could corrupt either.

The MT stress gate never caught this: its proofs are heap-shaped (symmArray,
SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features
that use LexPathOrdering.

 * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the
   previously unbounded growth)
 * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys,
   makes each op atomic)
 * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature
   caches are exercised under concurrency

Surfaced while reviewing Performance Series 2 (PR #3879), which changes the
comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.
unp1 added a commit that referenced this pull request Jul 2, 2026
ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution
macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which
run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp
on every worker, off the commit lock, so its two caches raced:
 * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption)
   -> ConcurrentLruCache.
 * the single-element sequent cache (sequent + sequentValue) was updated non-
   atomically -> a torn read could return another sequent's value (a *correctness*
   bug, not just variance) -> now a per-thread ThreadLocal.

Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps):
the heap/try-close proofs did not exercise the SE cost-feature caches, which is
why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a
consistent open-goal count across reps.

Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant