Performance Series 2: Overview#3879
Draft
unp1 wants to merge 18 commits into
Draft
Conversation
This was referenced Jun 30, 2026
RenamingTermProperty.equalsModThisProperty (equality modulo bound renaming) compares two terms with the O(term) unifyHelp recursion. It is hot across the prover -- the quantifier cost heuristic (PredictCostProver.directConsequenceOrContradictionOfAxiom), taclet matching (MatchSchemaVariableInstruction), and sequent-redundancy checks all call it -- and it shows up as a top self-time leaf on arith/quant-heavy proofs. Cache each term's renaming-invariant hashCode (RenamingTermProperty.hashCodeModThisProperty) lazily on TermImpl, exposed as JTerm#hashCodeModRenaming, and use it as an O(1) reject in equalsModThisProperty: if the two hashes differ the terms cannot be equal modulo renaming, so the unifyHelp walk is skipped. Terms that ARE equal modulo renaming share the hash (by the hashCodeModThisProperty / equalsModThisProperty contract), so the filter never rejects a true match -- behaviour-preserving. Proofs stay byte-identical (runAllProofs green; symmArray/SLL/saddleback keep identical node and rule-app counts) while Automode time drops ~13% on those proofs. Cost: one cached int per TermImpl. Created with AI tooling support
When a NoPosTacletApp is added, addedNoPosTacletApp re-walks the affected formulas and, at every position, fetched that position's full operator taclet list and filtered it through a SetRuleFilter to keep the single new taclet. For common operators that list is large, so this membership scan dominated re-indexing (~12.8% of CPU on the wide-branching bike example; 60.7% of all HashMap lookups). Instead, feed the walk a one-taclet index built from just the new taclet, reusing the identical getList/match logic, so matchTaclets iterates ~one candidate per position. The single-threaded index is used deliberately: the multi-threaded matcher only parallelizes above 256 taclets, so a one-taclet index always takes its sequential path anyway. Byte-identical -- the multi-taclet addedNoPosTacletApps path is unchanged. New differential + unit tests in TestTermTacletAppIndex compare the mini-index re-index against the SetRuleFilter path (identical fired apps and apps at every position); full testRAP is green. On bike this cuts automode time ~16% (HashMap.getNode 12.1% -> 4.9%). Created with AI tooling support
Pin the current TriggersSet output for representative quantifier shapes (uni/multi/powerset, equality exclusion, negated implication guards, existential variables, conjunction clauses, and the no-outer-variable case) ahead of a rename/cleanup pass. A behavior-preservation net, not a claim the current output is ideal. Created with AI tooling support
The three-argument intersect computed all three sizes from set0, so the 'intersect the two smaller sets first' optimization never engaged. Use the respective set sizes. The three-way intersection is order-independent so the result is unchanged; the only caller (SplittableQuantifiedFormulaFeature) inspects just the result's emptiness/size. Created with AI tooling support
Rename for readability (triggers->elements, qvs->clauseVariables, setMultiSubstitution->combineElementSubstitutions, unifySubstitution->mergeIfCompatible, cryptic locals) and document the class. No behavior change. Created with AI tooling support
Rename uqvs->universalVariables and the matching helpers, and turn the cycle detection copied from EqualityConstraint into readable, documented containsCycle/reachesItself. No behavior change. Created with AI tooling support
Rename the trigger-discovery methods and locals (collectUniversalVariables, addMaximalUniTriggers, buildCoveringMultiTriggers, tryAddCoveringMultiTrigger, clauseVariables, ...), rename inner ClauseTrigger to ClauseTriggerFinder, document the multi-trigger powerset pruning, and fix smells (HashSet-declared LinkedHashSet, unused import, generic-array suppression). No behavior change. Created with AI tooling support
Substitution.applyWithoutCasts applied each variable in its own ClashFreeSubst pass -- one full traversal of the term per variable. As the substitution is ground, all variables can be replaced in a single shadow-aware traversal, keeping the per-variable path only as a fallback for the rare cast case. Behaviour-preserving; cuts substitution time roughly threefold on quantifier-heavy proofs. Created with AI tooling support
When estimating quantifier instantiation cost, PredictCostProver calls HandleArith.provedByArith for every literal/axiom pair. For a non-arithmetic literal it allocated a cache key, took the cache lock and ran formatArithTerm only to return the literal unproved -- and such literals dominate (~98% of calls even on arithmetic proofs). Both entry points now bail on a cheap arithmetic-shape check (>=, <=, plus = for the single-argument variant), exactly the condition under which the full prover would have returned the literal unchanged. Behaviour-preserving; cuts refinement time by roughly a third. Created with AI tooling support
AbstractMonomialSmallerThanFeature.introductionTime walks the entire appliedRuleApps() history to find the polySimp_newSmallSym taclet that introduced an op (used to order monomial/atom factors). The common monomial atoms -- program variables and ordinary functions -- are never introduced that way, yet hit the full O(history) walk on every comparison and, since the "not introduced" answer (-1) is deliberately not cached, re-walked it every time. On arith-heavy proofs this was a top self-time hotspot (inNewSmallSymRuleSet 4-8%). A polySimp_newSmallSym taclet introduces its symbol as a SkolemTermSV instantiation, which is always a skolem-constant function (TacletApp.createSkolemConstant). So an op that is not a skolem-constant function can never have been introduced by one -- its time is -1 with no history walk. Short-circuit those. Behaviour-preserving: only ops that are structurally impossible to introduce are short-circuited (a newSmallSym taclet always mints a fresh skolem function, never an existing non-skolem op), so every result is unchanged. runAllProofs stays green (every closed=0 is a declared notprovable); symmArray/SLL/saddleback keep identical node and rule-app counts while Automode time drops ~8% on those arith proofs. Covers AtomsSmallerThanFeature too (shared base). Created with AI tooling support
QueueRuleApplicationManager parks assumes-incomplete bases under each of their wake operators in parkedByOp. The buckets were LinkedHashSets, but RuleAppContainer uses identity equality, a base is parked at most once per bucket (removed from all its buckets when woken; see unindexParked), and the cross-operator de-duplication lives in the transient `woken` set -- so the buckets never actually de-dup, yet a set spends a 16-slot table plus a 40-byte Entry per element to hold a handful of refs. Use plain ArrayLists instead (also in the per-branch clone()). On the wide-branching bike example this removes ~70-85 MB at the 381-open-goal peak (~12-15% of the live set): per-container map-entry overhead drops 2.76 -> 1.04, the ~206k per-op LinkedHashSets collapse, and HashMap$Node[] arrays fall 179k -> 56k. Byte-identical (full testRAP green; bike still ends at 26 goals / 100k steps). Created with AI tooling support
The comparison-result cache grew to 100000 entries and then dropped all of them at once, discarding the hot entries on every overflow and averaging ~50000 live entries. Replace it with an LRUCache at a smaller steady bound (default 10000, tunable via -Dkey.lexpath.cachesize). 10000 is well below the old average -- a memory saving -- and A/B-safe: on the heaviest LexPath proof (the wide-branching bike example) shrinking the cache 100000 -> 1000 cost <2% automode time, so the cache barely helps. Byte-identical (pure memoization; full testRAP green). Created with AI tooling support
Splitting a goal clones its TacletIndex, which deep-copied the rwList / antecList / succList maps even though a child goal usually only reads them. Wrap them in a copy-on-write map (CopyOnWriteIndexMap) that shares the backing map across copy() and clones lazily on the first mutation, so sibling goals share one map until one of them adds or removes a taclet. A small structural win -- the heap profile shows the index is a minor slice of per-goal state -- kept for the reduced clone cost. Byte-identical (full testRAP green). Created with AI tooling support
The per-operator parking buckets are ArrayLists for the memory saving, but unindexParked removes by value (O(n)). A few operators collect thousands of parked bases (the update operator on long straight-line proofs), so one giant bucket made wake/unpark O(n^2) -- a 2.7x regression on performance-test/updateSimplification/straightline_8000. Promote a bucket to a LinkedHashSet once it grows past 16 elements (O(1) removal); small buckets stay ArrayLists. Both insertion-ordered, so byte-identical (full testRAP green). straightline_8000 32953->12817 ms (~main). Squash into the parking-buckets commit. Created with AI tooling support
unp1
added a commit
that referenced
this pull request
Jun 30, 2026
LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature), shared across all goals, and the parallel prover evaluates rule-application cost concurrently -- but its two caches were not thread-safe: the comparison-result cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent access could corrupt either. The MT stress gate never caught this: its proofs are heap-shaped (symmArray, SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features that use LexPathOrdering. * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the previously unbounded growth) * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys, makes each op atomic) * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature caches are exercised under concurrency Surfaced while reviewing Performance Series 2 (PR #3879), which changes the comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.
unp1
added a commit
that referenced
this pull request
Jul 1, 2026
ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp on every worker, off the commit lock, so its two caches raced: * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption) -> ConcurrentLruCache. * the single-element sequent cache (sequent + sequentValue) was updated non- atomically -> a torn read could return another sequent's value (a *correctness* bug, not just variance) -> now a per-thread ThreadLocal. Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps): the heap/try-close proofs did not exercise the SE cost-feature caches, which is why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a consistent open-goal count across reps. Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).
unp1
added a commit
that referenced
this pull request
Jul 2, 2026
LexPathOrdering lives in a per-proof strategy cost feature (SmallerThanFeature), shared across all goals, and the parallel prover evaluates rule-application cost concurrently -- but its two caches were not thread-safe: the comparison-result cache (a HashMap cleared wholesale at 100k) and the sort-depth cache (a WeakHashMap, which mutates even on get via stale-entry expunge). Concurrent access could corrupt either. The MT stress gate never caught this: its proofs are heap-shaped (symmArray, SimplifiedLinkedList.remove) and do not exercise the arithmetic cost features that use LexPathOrdering. * comparison cache -> ConcurrentLruCache (thread-safe bounded LRU; also caps the previously unbounded growth) * sort-depth cache -> Collections.synchronizedMap(WeakHashMap) (keeps weak keys, makes each op atomic) * add an arithmetic proof (gemplusDecimal/add) to MtStressTest so the cost-feature caches are exercised under concurrency Surfaced while reviewing Performance Series 2 (PR #3879), which changes the comparison cache to a plain (also non-thread-safe, access-ordered) LRUCache.
unp1
added a commit
that referenced
this pull request
Jul 2, 2026
ModalityCache is held by the FilterSymbexStrategy of the symbolic-execution macros (FinishSymbolicExecution, AutoMacro, AutoPilotPrepareProof, ...), which run on the multi-core prover. hasModality() is reached from Strategy#isApprovedApp on every worker, off the commit lock, so its two caches raced: * termCache was a plain LRUCache (access-ordered -> get mutates -> corruption) -> ConcurrentLruCache. * the single-element sequent cache (sequent + sequentValue) was updated non- atomically -> a torn read could return another sequent's value (a *correctness* bug, not just variance) -> now a per-thread ThreadLocal. Add a FinishSymbolicExecutionMacro run to MtMacroStressTest (8 workers, 5 reps): the heap/try-close proofs did not exercise the SE cost-feature caches, which is why this and the LexPathOrdering race stayed latent. Asserts SE finishes with a consistent open-goal count across reps. Found while scanning LRUCache usages after the LexPathOrdering fix (#3879 review).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Intended Change
Overview / integration branch for Performance Series 2 — a set of behaviour-preserving performance improvements to the KeY prover, used here for the combined end-to-end benchmark and the determinism check before any part merges to
main. The individual changes are:This branch additionally carries the runAllProofs per-fork statistics tooling used for measurement.
Benchmark — combined effect
Automode time, median of 3 runs, serial forks (
-PrapForks=1), isolateduser.homeper run; baseline = merge-base993820ec37.Per-PR attribution benchmarks are being added to the individual PRs.
Type of pull request
Ensuring quality
main.Additional information and contact(s)
Umbrella/integration branch; the individual changes are reviewed in #3874–#3878.
Created with AI tooling support.
The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.