Performance Series 2: Overview by unp1 · Pull Request #3879 · KeYProject/key

unp1 · 2026-06-30T17:39:10Z

Intended Change

Overview / integration branch for Performance Series 2 — a set of behaviour-preserving performance improvements to the KeY prover, used here for the combined end-to-end benchmark and the determinism check before any part merges to main. The individual changes are:

(1/5) Performance Series 2 (1/5): Faster cost computation by skipping an unnecessary proof-history walk #3874 — Faster cost computation by skipping an unnecessary proof-history walk
(2/5) Performance Series 2 (2/5): Faster term equality modulo renaming via a hash pre-filter #3875 — Faster term equality modulo renaming via a hash pre-filter
(3/5) Performance Series 2 (3/5): Incremental taclet re-indexing for a single added taclet #3876 — Incremental taclet re-indexing for a single added taclet
(4/5) Performance Series 2 (4/5): Lower memory for the per-goal rule-application queue #3877 — Lower memory for the per-goal rule-application queue
(5/5) Performance Series 2 (5/5): Faster quantifier-instantiation heuristic, plus code cleanup #3878 — Faster quantifier-instantiation heuristic, plus code cleanup

This branch additionally carries the runAllProofs per-fork statistics tooling used for measurement.

Benchmark — combined effect

Automode time, median of 3 runs, serial forks (-PrapForks=1), isolated user.home per run; baseline = merge-base 993820ec37.

perfTest problem	baseline (ms)	this branch (ms)	Δ
PUZ031p1	13712	7399	−46%
SimplifiedLinkedList.remove	17469	12996	−26%
Saddleback_search	13026	9656	−26%
SYN550p1	821	606	−26%
symmArray	12737	9912	−22%
coincidence_count/project	2181	1890	−13%
gemplusDecimal/add	8378	7595	−9%
ArrayList.remove.1	3339	3176	−5%
Total	71663	53230	−26% (1.35×)

Per-PR attribution benchmarks are being added to the individual PRs.

Type of pull request

Refactoring (behaviour should not change or only minimally change)
There are changes to the (Java) code
There are changes to the deployment/CI infrastructure (runAllProofs measurement tooling)

Ensuring quality

Behaviour-preserving across the series (identical proof node counts in runAllProofs).
A REPEATS=3 determinism corpus run is the gate before any of this merges to main.

Additional information and contact(s)

Umbrella/integration branch; the individual changes are reviewed in #3874–#3878.

Created with AI tooling support.

The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.

RenamingTermProperty.equalsModThisProperty (equality modulo bound renaming) compares two terms with the O(term) unifyHelp recursion. It is hot across the prover -- the quantifier cost heuristic (PredictCostProver.directConsequenceOrContradictionOfAxiom), taclet matching (MatchSchemaVariableInstruction), and sequent-redundancy checks all call it -- and it shows up as a top self-time leaf on arith/quant-heavy proofs. Cache each term's renaming-invariant hashCode (RenamingTermProperty.hashCodeModThisProperty) lazily on TermImpl, exposed as JTerm#hashCodeModRenaming, and use it as an O(1) reject in equalsModThisProperty: if the two hashes differ the terms cannot be equal modulo renaming, so the unifyHelp walk is skipped. Terms that ARE equal modulo renaming share the hash (by the hashCodeModThisProperty / equalsModThisProperty contract), so the filter never rejects a true match -- behaviour-preserving. Proofs stay byte-identical (runAllProofs green; symmArray/SLL/saddleback keep identical node and rule-app counts) while Automode time drops ~13% on those proofs. Cost: one cached int per TermImpl. Created with AI tooling support

When a NoPosTacletApp is added, addedNoPosTacletApp re-walks the affected formulas and, at every position, fetched that position's full operator taclet list and filtered it through a SetRuleFilter to keep the single new taclet. For common operators that list is large, so this membership scan dominated re-indexing (~12.8% of CPU on the wide-branching bike example; 60.7% of all HashMap lookups). Instead, feed the walk a one-taclet index built from just the new taclet, reusing the identical getList/match logic, so matchTaclets iterates ~one candidate per position. The single-threaded index is used deliberately: the multi-threaded matcher only parallelizes above 256 taclets, so a one-taclet index always takes its sequential path anyway. Byte-identical -- the multi-taclet addedNoPosTacletApps path is unchanged. New differential + unit tests in TestTermTacletAppIndex compare the mini-index re-index against the SetRuleFilter path (identical fired apps and apps at every position); full testRAP is green. On bike this cuts automode time ~16% (HashMap.getNode 12.1% -> 4.9%). Created with AI tooling support

Pin the current TriggersSet output for representative quantifier shapes (uni/multi/powerset, equality exclusion, negated implication guards, existential variables, conjunction clauses, and the no-outer-variable case) ahead of a rename/cleanup pass. A behavior-preservation net, not a claim the current output is ideal. Created with AI tooling support

The three-argument intersect computed all three sizes from set0, so the 'intersect the two smaller sets first' optimization never engaged. Use the respective set sizes. The three-way intersection is order-independent so the result is unchanged; the only caller (SplittableQuantifiedFormulaFeature) inspects just the result's emptiness/size. Created with AI tooling support

Rename for readability (triggers->elements, qvs->clauseVariables, setMultiSubstitution->combineElementSubstitutions, unifySubstitution->mergeIfCompatible, cryptic locals) and document the class. No behavior change. Created with AI tooling support

Rename uqvs->universalVariables and the matching helpers, and turn the cycle detection copied from EqualityConstraint into readable, documented containsCycle/reachesItself. No behavior change. Created with AI tooling support

Rename the trigger-discovery methods and locals (collectUniversalVariables, addMaximalUniTriggers, buildCoveringMultiTriggers, tryAddCoveringMultiTrigger, clauseVariables, ...), rename inner ClauseTrigger to ClauseTriggerFinder, document the multi-trigger powerset pruning, and fix smells (HashSet-declared LinkedHashSet, unused import, generic-array suppression). No behavior change. Created with AI tooling support

Substitution.applyWithoutCasts applied each variable in its own ClashFreeSubst pass -- one full traversal of the term per variable. As the substitution is ground, all variables can be replaced in a single shadow-aware traversal, keeping the per-variable path only as a fallback for the rare cast case. Behaviour-preserving; cuts substitution time roughly threefold on quantifier-heavy proofs. Created with AI tooling support

When estimating quantifier instantiation cost, PredictCostProver calls HandleArith.provedByArith for every literal/axiom pair. For a non-arithmetic literal it allocated a cache key, took the cache lock and ran formatArithTerm only to return the literal unproved -- and such literals dominate (~98% of calls even on arithmetic proofs). Both entry points now bail on a cheap arithmetic-shape check (>=, <=, plus = for the single-argument variant), exactly the condition under which the full prover would have returned the literal unchanged. Behaviour-preserving; cuts refinement time by roughly a third. Created with AI tooling support

AbstractMonomialSmallerThanFeature.introductionTime walks the entire appliedRuleApps() history to find the polySimp_newSmallSym taclet that introduced an op (used to order monomial/atom factors). The common monomial atoms -- program variables and ordinary functions -- are never introduced that way, yet hit the full O(history) walk on every comparison and, since the "not introduced" answer (-1) is deliberately not cached, re-walked it every time. On arith-heavy proofs this was a top self-time hotspot (inNewSmallSymRuleSet 4-8%). A polySimp_newSmallSym taclet introduces its symbol as a SkolemTermSV instantiation, which is always a skolem-constant function (TacletApp.createSkolemConstant). So an op that is not a skolem-constant function can never have been introduced by one -- its time is -1 with no history walk. Short-circuit those. Behaviour-preserving: only ops that are structurally impossible to introduce are short-circuited (a newSmallSym taclet always mints a fresh skolem function, never an existing non-skolem op), so every result is unchanged. runAllProofs stays green (every closed=0 is a declared notprovable); symmArray/SLL/saddleback keep identical node and rule-app counts while Automode time drops ~8% on those arith proofs. Covers AtomsSmallerThanFeature too (shared base). Created with AI tooling support

QueueRuleApplicationManager parks assumes-incomplete bases under each of their wake operators in parkedByOp. The buckets were LinkedHashSets, but RuleAppContainer uses identity equality, a base is parked at most once per bucket (removed from all its buckets when woken; see unindexParked), and the cross-operator de-duplication lives in the transient `woken` set -- so the buckets never actually de-dup, yet a set spends a 16-slot table plus a 40-byte Entry per element to hold a handful of refs. Use plain ArrayLists instead (also in the per-branch clone()). On the wide-branching bike example this removes ~70-85 MB at the 381-open-goal peak (~12-15% of the live set): per-container map-entry overhead drops 2.76 -> 1.04, the ~206k per-op LinkedHashSets collapse, and HashMap$Node[] arrays fall 179k -> 56k. Byte-identical (full testRAP green; bike still ends at 26 goals / 100k steps). Created with AI tooling support

The comparison-result cache grew to 100000 entries and then dropped all of them at once, discarding the hot entries on every overflow and averaging ~50000 live entries. Replace it with an LRUCache at a smaller steady bound (default 10000, tunable via -Dkey.lexpath.cachesize). 10000 is well below the old average -- a memory saving -- and A/B-safe: on the heaviest LexPath proof (the wide-branching bike example) shrinking the cache 100000 -> 1000 cost <2% automode time, so the cache barely helps. Byte-identical (pure memoization; full testRAP green). Created with AI tooling support

Splitting a goal clones its TacletIndex, which deep-copied the rwList / antecList / succList maps even though a child goal usually only reads them. Wrap them in a copy-on-write map (CopyOnWriteIndexMap) that shares the backing map across copy() and clones lazily on the first mutation, so sibling goals share one map until one of them adds or removes a taclet. A small structural win -- the heap profile shows the index is a minor slice of per-goal state -- kept for the reduced clone cost. Byte-identical (full testRAP green). Created with AI tooling support

The per-operator parking buckets are ArrayLists for the memory saving, but unindexParked removes by value (O(n)). A few operators collect thousands of parked bases (the update operator on long straight-line proofs), so one giant bucket made wake/unpark O(n^2) -- a 2.7x regression on performance-test/updateSimplification/straightline_8000. Promote a bucket to a LinkedHashSet once it grows past 16 elements (O(1) removal); small buckets stay ArrayLists. Both insertion-ordered, so byte-identical (full testRAP green). straightline_8000 32953->12817 ms (~main). Squash into the parking-buckets commit. Created with AI tooling support

LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature), shared across all goals, and the parallel prover evaluates rule-application cost concurrently -- but its two caches were not thread-safe: the comparison-result cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent access could corrupt either. The MT stress gate never caught this: its proofs are heap-shaped (symmArray, SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features that use LexPathOrdering. * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the previously unbounded growth) * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys, makes each op atomic) * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature caches are exercised under concurrency Surfaced while reviewing Performance Series 2 (PR #3879), which changes the comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.

ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp on every worker, off the commit lock, so its two caches raced: * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption) -> ConcurrentLruCache. * the single-element sequent cache (sequent + sequentValue) was updated non- atomically -> a torn read could return another sequent's value (a *correctness* bug, not just variance) -> now a per-thread ThreadLocal. Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps): the heap/try-close proofs did not exercise the SE cost-feature caches, which is why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a consistent open-goal count across reps. Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).

LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature), shared across all goals, and the parallel prover evaluates rule-application cost concurrently -- but its two caches were not thread-safe: the comparison-result cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent access could corrupt either. The MT stress gate never caught this: its proofs are heap-shaped (symmArray, SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features that use LexPathOrdering. * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the previously unbounded growth) * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys, makes each op atomic) * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature caches are exercised under concurrency Surfaced while reviewing Performance Series 2 (PR #3879), which changes the comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.

ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp on every worker, off the commit lock, so its two caches raced: * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption) -> ConcurrentLruCache. * the single-element sequent cache (sequent + sequentValue) was updated non- atomically -> a torn read could return another sequent's value (a *correctness* bug, not just variance) -> now a per-thread ThreadLocal. Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps): the heap/try-close proofs did not exercise the SE cost-feature caches, which is why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a consistent open-goal count across reps. Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).

unp1 self-assigned this Jun 30, 2026

unp1 added the 🚀 Performance label Jun 30, 2026

unp1 changed the title ~~[integration] Performance PR series — combined benchmark + determinism gate~~ [integration] Performance Series 2: Overview Jun 30, 2026

unp1 changed the title ~~[integration] Performance Series 2: Overview~~ Performance Series 2: Overview Jun 30, 2026

unp1 force-pushed the bubel/perf-preview branch from af409f4 to 5daf836 Compare June 30, 2026 21:08

unp1 added 18 commits June 30, 2026 23:30

refactor: clarify names in MultiTrigger

a6b5971

Rename for readability (triggers->elements, qvs->clauseVariables, setMultiSubstitution->combineElementSubstitutions, unifySubstitution->mergeIfCompatible, cryptic locals) and document the class. No behavior change. Created with AI tooling support

refactor: clarify names and cycle detection in UniTrigger

8f0d7af

Rename uqvs->universalVariables and the matching helpers, and turn the cycle detection copied from EqualityConstraint into readable, documented containsCycle/reachesItself. No behavior change. Created with AI tooling support

Merge branch 'bubel/renaming-prefilter' into bubel/perf-preview

d7032d2

Merge branch 'bubel/reindex-fastpath' into bubel/perf-preview

6f9415b

Merge branch 'bubel/branching-memory' into bubel/perf-preview

e658663

Merge branch 'bubel/quantheur-cleanup' into bubel/perf-preview

dc3f267

unp1 force-pushed the bubel/perf-preview branch from 5daf836 to dc3f267 Compare June 30, 2026 21:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance Series 2: Overview#3879

Performance Series 2: Overview#3879
unp1 wants to merge 18 commits into
mainfrom
bubel/perf-preview

unp1 commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

unp1 commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Intended Change

Benchmark — combined effect

Type of pull request

Ensuring quality

Additional information and contact(s)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

unp1 commented Jun 30, 2026 •

edited

Loading