test: speedracer ranking on real workloads by davdhacs · Pull Request #21118 · stackrox/stackrox

davdhacs · 2026-06-13T14:36:57Z

Testing the new rank-based speedracer (davdhacs/speedracer@cfb0410) on real CI workloads.

This carries the changes from PR #20998 (speedracer-minimal) and points the workflows at the ranking branch instead of the old binary-gate version.

What's different:

- Runners read their CPU rank from runner-pool.tsv (line number = rank)
- Rank embedded in check-run external_id: spec-{letter}-r{rank}-{run_id}-{check_name}
- Higher-ranked siblings always win; same-rank → higher letter wins
- No slow-cpu-pattern input: ranking replaces the binary gate
- No copy-a special case: all copies follow the same rank+letter priority
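The external_id layout above can be sketched as a small formatter and parser. This is only an illustration in Python, not the action's actual code: the function names are hypothetical, and it assumes a single lowercase copy letter, a numeric rank, and a numeric run_id. Because the check name can itself contain hyphens and spaces, the parser treats everything after the run_id as the check name.

```python
import re

# Hypothetical helpers illustrating spec-{letter}-r{rank}-{run_id}-{check_name}.
# Assumptions: one lowercase copy letter, numeric rank, numeric run_id.
_EXTERNAL_ID = re.compile(
    r"^spec-(?P<letter>[a-z])-r(?P<rank>\d+)-(?P<run_id>\d+)-(?P<check_name>.+)$"
)


def make_external_id(letter: str, rank: int, run_id: int, check_name: str) -> str:
    """Build an external_id in the layout described in the PR text."""
    return f"spec-{letter}-r{rank}-{run_id}-{check_name}"


def parse_external_id(external_id: str) -> tuple[str, int, int, str]:
    """Split an external_id back into (letter, rank, run_id, check_name)."""
    match = _EXTERNAL_ID.match(external_id)
    if match is None:
        raise ValueError(f"not a speedracer external_id: {external_id!r}")
    return (
        match["letter"],
        int(match["rank"]),
        int(match["run_id"]),
        match["check_name"],  # may contain hyphens, spaces, parentheses
    )


if __name__ == "__main__":
    eid = make_external_id("b", 3, 123456, "pre-build-go-binaries (default, amd64)")
    print(eid)
    print(parse_external_id(eid))
```

Anchoring the pattern on the fixed prefix fields keeps the parse unambiguous even though the check name is free-form.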

What to watch:

- Do runners correctly identify their rank from the TSV?
- Does the highest-ranked copy win?
- Are check-runs posted and resolved correctly?
- Does branch protection work (success check-run for the winning copy)?

This branch is disposable and will be deleted after testing.

pre-build-go-binaries is on the critical path for every image build: it blocks build-and-push-main, build-and-push-operator, and build-and-push-scanner. When it lands on a slow runner (EPYC 7763, ~42% of the pool), the job takes 380s vs 185s on fast hardware, adding 3.25 minutes to the build pipeline wall-clock.

Speedracer runs 2 copies in parallel. The first copy to land on a fast runner wins; the other exits as 'cancelled' within seconds. arm64 is excluded from the extra copy (consistent hardware, no variance).

The action (.github/actions/speedracer) is self-contained: callers add 'speedracer: [a, b]' to their matrix and a single action call. It auto-derives the check-run name from the matrix, auto-detects the last-copy fallback, and resolves check-runs via a post-step hook.

AI-generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

With 42% slow-runner rate:
- 2 copies: 18% chance both slow (0.42²)
- 3 copies: 7% chance all slow (0.42³)

3 copies nearly eliminates the slow-runner penalty on this critical-path job. arm64 excluded from b+c (consistent hardware).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
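The all-slow percentages follow from raising the slow-runner rate to the number of copies. A minimal sketch of that arithmetic, assuming each copy lands on a slow runner independently with probability 0.42 (the independence assumption is implied by the 0.42² notation but not stated outright; the function name is hypothetical):

```python
def p_all_slow(slow_rate: float, copies: int) -> float:
    """Probability that every one of `copies` independent copies lands on a slow runner."""
    if not 0.0 <= slow_rate <= 1.0:
        raise ValueError("slow_rate must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return slow_rate ** copies


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} copies: {p_all_slow(0.42, n):.1%} chance all slow")
```

If runner assignment within one workflow run is correlated (for example, sibling jobs scheduled onto the same host pool at the same moment), the real all-slow rate would be higher than this estimate.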

Critical-path job: 4 copies reduces all-slow probability to 3.1%. Cancelled copies cost ~15s each, trivial vs the 195s slow-runner penalty.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Removes the blocking gate. All copies begin work in parallel. A background monitor polls for a sibling's completed/success check-run every 5s (after 8-10s initial delay).

When a winner is found:
1. Monitor writes /tmp/speedracer-killed marker
2. Monitor SIGTERMs the runner ($PPID)
3. Post-step detects the marker and PATCHes the check-run to 'cancelled' (not 'failure') so the PR UI shows the correct status

This is a prototype, testing whether:
- SIGTERM from a background process during work steps fires the post-step
- The /tmp/speedracer-killed marker survives the SIGTERM
- The check-run PATCH to 'cancelled' works

If SIGTERM doesn't fire the post-step, we'll try other signals.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The check-run (e.g. 'pre-build-go-binaries (default, amd64)') belongs to the winner. When a copy is killed by the monitor, its post-step should do nothing: the winning copy's post-step already POSTed completed/success.

The losing copy's GHA job shows as 'failure' in the matrix, but that's cosmetic: branch protection watches the check-run name, not the job name.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The monitor previously watched check-runs for conclusion=success, but that isn't set until the post-step fires (after ALL work steps). By then the duplicate has already finished and uploaded artifacts.

Now watches the jobs API for sibling jobs that completed successfully. GHA marks a job as completed/success as soon as its last step finishes, BEFORE the post-step. This gives the monitor a window to kill duplicates before they reach their upload/push steps.

Also reduced initial delay from 8-10s to 3-4s and poll interval from 5s to 3s for faster reaction.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…tiebreak)

Copy a always wins if fast (no monitor). Copy b watches a. Copy c watches a and b. Copy d watches a, b, and c.

When a lower-lettered sibling has posted an in_progress check-run (meaning it passed the CPU gate and is on a fast runner), this copy kills itself as a duplicate. This fires within seconds of the siblings starting (5-7s initial delay + first poll), before real work produces artifacts or pushes images.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
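The watch relationship described in this commit (each copy watches every lower-lettered sibling) can be sketched as follows. This is an illustrative Python model of the letter-priority scheme at this commit only, not the action's shell implementation; the function name is hypothetical.

```python
def siblings_to_watch(my_copy: str, copies: list[str]) -> list[str]:
    """Return the lower-lettered siblings that `my_copy` should monitor.

    Copy "a" watches nobody; "b" watches "a"; "c" watches "a" and "b"; and so on.
    """
    ordered = sorted(copies)
    if my_copy not in ordered:
        raise ValueError(f"{my_copy!r} is not one of {ordered}")
    return ordered[: ordered.index(my_copy)]


if __name__ == "__main__":
    matrix = ["a", "b", "c", "d"]
    for copy in matrix:
        print(copy, "watches", siblings_to_watch(copy, matrix) or "nobody")
```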

Two bugs from the last run where ALL copies were killed:
1. fail-fast:true (default) caused copy c's 'failure' (monitor kill) to cascade-cancel all other matrix entries including arm64. Fixed with fail-fast:false.
2. CPU gate's kill -TERM is async: bash continues executing past the kill line before the runner processes the signal. Added exit 0 after sleep 600 to prevent fallthrough to the check-run POST.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

An orphaned 'disown' from the old version ran for all copies, including copy a, which has no background job. With set -e (GHA default for bash), disown with nothing to disown exits non-zero, killing the gate script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Temporary debug step to see the background monitor's log output. The monitor writes to /tmp/speedracer-monitor.log but we can't see it from the API. Adding an always-run step to dump it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Inside 'nohup bash -c', $PPID is the nohup bash's parent (the step's bash process), not the runner worker. SIGTERMing the step bash kills the step but the runner starts the next step normally.

Fixed by interpolating $PPID from the outer script into the heredoc string at construction time, so the inner bash gets the literal runner worker PID.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copy b was killed by the monitor despite its sibling (a) never posting an in_progress check-run. Adding total check-run count, per-sibling found count, and full matched check-run dump to the monitor log.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Uses stackrox/stackrox/.github/actions/speedracer@davdhacs/speedracer-minimal to load the action directly from the repo without needing checkout first. If GHA resolves the branch name with slashes correctly, this eliminates the checkout→speedracer dependency and lets the check-run POST happen ~3-5s earlier.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Branch-ref download took 5.5-8.8s (full repo tarball). Sparse checkout with depth:1 should transfer only the action file's blob, which should be much faster.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…kout

Tests the separate-repo approach. The action is a single file in davdhacs/speedracer, tagged v0.1.0. GHA downloads it during "Set up job", which should be near-instant for a single-file repo vs ~5-9s for the full stackrox repo archive.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…trix dims

github.job is "pre-build-go-binaries" for BOTH amd64 and arm64 matrix entries. The arm64 copy's in_progress check-run had external_id "spec-a-<run_id>-pre-build-go-binaries", identical to what amd64 copy d's monitor was watching. d's monitor found arm64's check-run and incorrectly killed d, leaving no amd64 copy alive.

Fixed by using CHECK_NAME (which includes the matrix values, e.g. "pre-build-go-binaries (default, amd64)") instead of JOB_KEY in the external_id. This makes external_id unique per matrix dimension combo.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

CHECK_NAME includes matrix values like "pre-build-go-binaries (default, amd64)". The spaces broke the 'for sid in ${WATCH_IDS}' word splitting in the nohup heredoc. Siblings were never found because the jq query received fragments like "(default," instead of the full external_id.

Fixed by using | as delimiter and IFS="|" read -ra to split correctly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
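The splitting bug is easy to reproduce outside the shell. The sketch below is a Python analogue, not the action's bash code: whitespace splitting stands in for the unquoted `${WATCH_IDS}` expansion, and splitting on `|` stands in for the `IFS="|" read -ra` fix. The sample IDs and the run_id value 1 are made up for illustration.

```python
def split_on_whitespace(packed: str) -> list[str]:
    """Analogue of unquoted shell word splitting: breaks on any whitespace."""
    return packed.split()


def split_on_pipe(packed: str) -> list[str]:
    """Analogue of IFS='|' read -ra: breaks only on the pipe delimiter."""
    return packed.split("|")


if __name__ == "__main__":
    # Hypothetical watch list; the names contain spaces from the matrix values.
    watch_ids = [
        "spec-a-1-pre-build-go-binaries (default, amd64)",
        "spec-b-1-pre-build-go-binaries (default, amd64)",
    ]

    broken = split_on_whitespace(" ".join(watch_ids))
    fixed = split_on_pipe("|".join(watch_ids))

    print(broken)  # fragments such as "(default," appear as separate items
    print(fixed)   # the two original IDs, intact
```

The pipe delimiter works here only because check names are not expected to contain `|`; any delimiter that can appear inside a name would reintroduce the same problem.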

Full rollout for validation, same set as #20923:

Build:
- pre-build-go-binaries [a,b,c,d] (arm64 excluded from b/c/d)
- build-and-push-main [a,b,c] (arm64 excluded from b/c)
- build-and-push-operator [a,b,c] (arm64 excluded from b/c)
- build-and-push-scanner [a,b] (arm64 excluded from b)

Unit Tests:
- go [a,b]
- go-postgres [a,b]

Style:
- check-generated-files [a,b]
- style-check [a,b]

All using davdhacs/speedracer@v0.1.0 (before checkout). Will revert to minimal set after validation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions-pin-check requires SHA-pinned references with ratchet comments for all external actions.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Master already has this condition. My addition created a duplicate, which caused a YAML parse error ("'if' is already defined") and silently prevented the Build workflow from triggering on PR pushes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…lings are slow

When all non-last copies are CPU-gated, the last-copy fallback is the only one that should survive. But the last copy's monitor could find a sibling's in_progress check-run (posted before the sibling's CPU gate killed it) and kill itself, leaving no copy alive.

Fixed: if MY_COPY == LAST_COPY, skip monitoring entirely.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Updates davdhacs/speedracer to latest (marker-based post-step).

Adds 'run: touch /tmp/speedracer-success' as last step in:
- pre-build-go-binaries (build.yaml)
- build-and-push-main (build.yaml)
- build-and-push-operator (build.yaml)
- build-and-push-scanner (build.yaml)
- go, go-postgres (unit-tests.yaml)
- check-generated-files, style-check (style.yaml)

The marker signals job success to the speedracer post-step. Without it, the post-step can't distinguish success from failure.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…cer SHA

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

These jobs have little slow/fast CPU timing variance (about 6% or less):
- check-generated-files: fast ~1020s, slow ~1080s (1.06x ratio)
- style-check: fast ~600s, slow ~610s (1.02x ratio)

Speedracer adds value when the slow/fast ratio is significant (>1.2x). For CPU-bound jobs with minimal I/O, the EPYC 7763 vs 9V74 difference is negligible: both CPUs have similar single-thread performance. The variance comes from disk I/O (builds, artifact uploads), which these style/lint jobs don't do.

Rule of thumb for applying speedracer:
- >1.5x slow/fast ratio → strong candidate (builds, tests with I/O)
- 1.2-1.5x ratio → marginal (consider if on critical path)
- <1.2x ratio → don't apply (overhead > savings)

Jobs that benefit most:
- build-and-push-scanner: 3.18x ratio → 49% avg saving
- pre-build-go-binaries: 2.05x ratio → 21% avg saving
- build-and-push-main: 1.87x ratio → 26% avg saving
- go unit tests: 1.53x ratio → 19% avg saving
- go-postgres: 1.50x ratio → 17% avg saving

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
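The ratio rule of thumb can be written as a small classifier. This is a hedged sketch: the function name and labels are hypothetical, and how the exact boundary values (1.2x and 1.5x) are bucketed is an assumption, since the commit message only gives ranges.

```python
def speedracer_fit(slow_seconds: float, fast_seconds: float) -> str:
    """Classify a job by its slow/fast runtime ratio.

    Thresholds come from the rule of thumb in the commit message; treating
    exactly 1.5x as "marginal" and exactly 1.2x as "marginal" is an assumption.
    """
    if fast_seconds <= 0 or slow_seconds <= 0:
        raise ValueError("durations must be positive")
    ratio = slow_seconds / fast_seconds
    if ratio > 1.5:
        return "strong"
    if ratio >= 1.2:
        return "marginal"
    return "skip"


if __name__ == "__main__":
    # Timings quoted in the PR text.
    print("pre-build-go-binaries:", speedracer_fit(380, 185))
    print("check-generated-files:", speedracer_fit(1080, 1020))
    print("style-check:", speedracer_fit(610, 600))
```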

Jobs with higher absolute time waste from slow runners get more copies:
- go, go-postgres [a,b] → [a,b,c,d]: 730s waste (12 min) justifies 4 copies. P(all slow) drops from 17.6% to 3.1%. Expected waste: 129s → 23s.
- pre-build-go-binaries [a,b,c,d] → [a,b,c]: 195s waste. 3 copies is sufficient: P(all slow) = 7.4%, expected waste = 14s.
- build-and-push-scanner [a,b] → [a,b,c]: 185s waste with 3.18x ratio (highest). 3 copies gives P(all slow) = 7.4%.
- build-and-push-main, build-and-push-operator: [a,b,c] unchanged.

Copy count rule of thumb: more copies for longer jobs, not just higher ratios. A 1.5x ratio on a 23-minute job wastes 12 minutes, which is worth 4 copies. A 3.2x ratio on a 1.5-minute job wastes 3 minutes, so 2-3 copies is enough.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
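The expected-waste figures are the slow-runner penalty multiplied by the probability that every copy is slow. A minimal sketch, assuming independent runner assignment at a 42% slow rate and ignoring the small per-copy cancellation cost (the function name is hypothetical):

```python
def expected_waste(penalty_seconds: float, copies: int, slow_rate: float = 0.42) -> float:
    """Expected extra seconds per run: penalty times P(all copies land on slow runners)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return penalty_seconds * slow_rate ** copies


if __name__ == "__main__":
    # Penalties quoted in the commit message: 730s for go/go-postgres, 195s for pre-build-go-binaries.
    for job, penalty, copies in [
        ("go / go-postgres", 730, 2),
        ("go / go-postgres", 730, 4),
        ("pre-build-go-binaries", 195, 3),
    ]:
        print(f"{job} with {copies} copies: ~{expected_waste(penalty, copies):.0f}s expected waste")
```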

With 3 copies, 1 in 13 PRs hits all-slow on each critical path job. That's multiple slow PRs per week on an active team. With 4 copies, it drops to 1 in 32, roughly weekly.

All jobs now [a,b,c,d]:
- pre-build-go-binaries: P(all slow) 7.4% → 3.1%
- build-and-push-main: 7.4% → 3.1%
- build-and-push-operator: 7.4% → 3.1%
- build-and-push-scanner: 7.4% → 3.1%
- go unit tests: already [a,b,c,d]
- go-postgres: already [a,b,c,d]

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…n time

${{ job.status }} in gacts post: field is evaluated at post-step time, not registration time. Earlier failures were from broken YAML >- folding.

Removes 'run: touch /tmp/speedracer-success' from all 6 speedracer jobs. Callers now need only the speedracer action call, with zero extra steps.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

pre-build-go-binaries had an inadvertent fail-fast:false not present in master. Removed: the blocking gate within the composite step produces 'cancelled' status, so fail-fast is safe without this override.

Also collapsed double blank lines in build.yaml and unit-tests.yaml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

… repo

All workflows reference davdhacs/speedracer@SHA (downloaded before checkout). The local copy in .github/actions/speedracer/ is unused.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…-minimal

openshift-ci · 2026-06-13T14:37:01Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-13T14:37:07Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


github-actions · 2026-06-13T14:46:44Z

🚀 Build Images Ready

Images are ready for commit cd18726. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-151-gcd18726309

davdhacs and others added 30 commits June 6, 2026 08:07

perf(ci): use 4 copies for pre-build-go-binaries

e28f001

test(ci): try sparse checkout + depth:1 for faster speedracer load

7cc1730

fix(ci): ratchet-pin davdhacs/speedracer to SHA

4fe8d4b

ci: trigger CI run — verify speedracer across all workflows

962f175

fix(ci): restore checkout SHA — previous sed replaced it with speedra…

1ebbd13

…cer SHA

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: verification run 2 — collect more timing data

ab9523b

ci: verification run 2 — 4-copy on all jobs

1476361

ci: verification run 3 — final confirmation

e58355d

davdhacs and others added 7 commits June 7, 2026 16:54

ci: verification run 4

5706ed9

Merge remote-tracking branch 'origin/master' into davdhacs/speedracer…

0875487

…-minimal

Apply suggestion from @davdhacs

4d12b50

test: point speedracer at ranking branch for testing

ad390fe

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 13, 2026

github-actions Bot added area/ci ai-review coderabbit-review labels Jun 13, 2026

davdhacs added 5 commits June 13, 2026 08:48

test: update speedracer to 817d41f (r003 format, no filter)

ed5388c

fix: use full SHA for speedracer ref

f117ae3

test: go unit tests use stackrox/actions for download time comparison

3261aa4

test: all jobs use stackrox/actions/speculative-run-matrix with per-j…

ce0ed5e

…ob rankings

fix: remove stale ratchet comments referencing davdhacs/speedracer

cd18726

test: speedracer ranking on real workloads #21118
davdhacs wants to merge 42 commits into master from davdhacs/speedracer-ranking-test
