test: speedracer ranking on real workloads #21118
pre-build-go-binaries is on the critical path for every image build: it blocks build-and-push-main, build-and-push-operator, and build-and-push-scanner. When it lands on a slow runner (EPYC 7763, ~42% of the pool), the job takes 380s vs 185s on fast hardware, adding 3.25 minutes to the build pipeline wall-clock time.
Speedracer runs 2 copies in parallel. The first copy to land on a fast runner wins; the other exits as 'cancelled' within seconds. arm64 is excluded from the extra copy (consistent hardware, no variance).
The action (.github/actions/speedracer) is self-contained: callers add 'speedracer: [a, b]' to their matrix and a single action call. It auto-derives the check-run name from the matrix, auto-detects the last-copy fallback, and resolves check-runs via a post-step hook.
AI-generated with Claude Code
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
With a 42% slow-runner rate:
2 copies: 18% chance both are slow (0.42²)
3 copies: 7% chance all are slow (0.42³)
Three copies nearly eliminate the slow-runner penalty on this critical-path job. arm64 is excluded from b+c (consistent hardware).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Critical-path job — 4 copies reduces all-slow probability to 3.1%. Cancelled copies cost ~15s each, trivial vs the 195s slow-runner penalty. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
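The all-slow percentages quoted in these commits follow from raising the slow-runner rate to the number of copies. A minimal sketch, assuming each copy lands on a slow runner independently with probability 0.42; the helper name p_all_slow is hypothetical and not part of the action:

```shell
# p_all_slow SLOW_RATE COPIES
# Prints the percent chance that every copy lands on a slow runner,
# assuming runner assignment is independent per copy (an assumption).
p_all_slow() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * p ^ n }'
}

for n in 2 3 4; do
  printf '%d copies: %s%% chance all slow\n' "$n" "$(p_all_slow 0.42 "$n")"
done
```

If runner assignment is correlated (for example, copies queued at the same moment drawing from the same pool segment), the real all-slow rate would be higher than this estimate.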
Removes the blocking gate. All copies begin work in parallel. A background monitor polls for a sibling's completed/success check-run every 5s (after an 8-10s initial delay). When a winner is found:
1. Monitor writes the /tmp/speedracer-killed marker
2. Monitor SIGTERMs the runner ($PPID)
3. Post-step detects the marker and PATCHes the check-run to 'cancelled' (not 'failure') so the PR UI shows the correct status
This is a prototype, testing whether:
- SIGTERM from a background process during work steps fires the post-step
- The /tmp/speedracer-killed marker survives the SIGTERM
- The check-run PATCH to 'cancelled' works
If SIGTERM doesn't fire the post-step, we'll try other signals.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
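The monitor's first two steps (write marker, SIGTERM) can be sketched roughly as below. This is a hypothetical illustration, not the action's code: sibling_won stands in for the real check-run query and simply tests for a flag file, and the function and variable names are made up. The post-step PATCH (step 3) is not shown.

```shell
# sibling_won is a placeholder for the real check-run lookup (assumption):
# it succeeds when the file named by SPEEDRACER_WINNER_FLAG exists.
sibling_won() {
  [ -e "${SPEEDRACER_WINNER_FLAG:-/tmp/speedracer-winner-flag}" ]
}

# monitor TARGET_PID MARKER [INITIAL_DELAY] [INTERVAL]
# Polls until a sibling has won, then leaves a marker and SIGTERMs the target.
monitor() {
  local target_pid="$1" marker="$2"
  local initial_delay="${3:-8}" interval="${4:-5}"
  sleep "$initial_delay"
  # Keep polling only while the target process is still alive.
  while kill -0 "$target_pid" 2>/dev/null; do
    if sibling_won; then
      touch "$marker"            # step 1: marker for the post-step to find
      kill -TERM "$target_pid"   # step 2: stop the duplicate's work
      return 0
    fi
    sleep "$interval"
  done
  return 1  # target exited by itself; nothing to cancel
}
```

In the prototype the target would be the runner process and the monitor would run in the background; here it runs in the foreground so the behaviour is easy to observe.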
The check-run (e.g. 'pre-build-go-binaries (default, amd64)') belongs to the winner. When a copy is killed by the monitor, its post-step should do nothing — the winning copy's post-step already POSTed completed/success. The losing copy's GHA job shows as 'failure' in the matrix, but that's cosmetic — branch protection watches the check-run name, not the job name. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The monitor previously watched check-runs for conclusion=success, but that isn't set until the post-step fires (after ALL work steps). By then the duplicate has already finished and uploaded artifacts. Now watches the jobs API for sibling jobs that completed successfully. GHA marks a job as completed/success as soon as its last step finishes, BEFORE the post-step. This gives the monitor a window to kill duplicates before they reach their upload/push steps. Also reduced initial delay from 8-10s to 3-4s and poll interval from 5s to 3s for faster reaction. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tiebreak) Copy a always wins if fast (no monitor). Copy b watches a. Copy c watches a and b. Copy d watches a, b, and c. When a lower-lettered sibling has posted an in_progress check-run (meaning it passed the CPU gate and is on a fast runner), this copy kills itself — it's a duplicate. This fires within seconds of the siblings starting (5-7s initial delay + first poll), before real work produces artifacts or pushes images. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
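The "watch only lower-lettered siblings" rule amounts to taking the prefix of the copy list before your own letter. A small sketch, with a hypothetical helper name (watch_list) that is not taken from the action:

```shell
# watch_list MY_COPY ALL_COPIES...
# Prints the lower-lettered siblings that MY_COPY should watch,
# space-separated. Copy a gets an empty list (no monitor).
watch_list() {
  local my_copy="$1" out="" c
  shift
  for c in "$@"; do
    if [ "$c" = "$my_copy" ]; then
      break
    fi
    out="${out:+$out }$c"
  done
  printf '%s\n' "$out"
}

for copy in a b c d; do
  printf '%s watches: [%s]\n' "$copy" "$(watch_list "$copy" a b c d)"
done
```

Because the ordering is fixed, at most one fast copy survives a tie: the lowest letter never yields, and every higher letter yields to it.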
Two bugs from the last run where ALL copies were killed:
1. fail-fast:true (default) caused copy c's 'failure' (monitor kill) to cascade-cancel all other matrix entries including arm64. Fixed with fail-fast:false.
2. CPU gate's kill -TERM is async: bash continues executing past the kill line before the runner processes the signal. Added exit 0 after sleep 600 to prevent fallthrough to the check-run POST.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
An orphaned 'disown' from the old version ran for all copies — including copy a which has no background job. With set -e (GHA default for bash), disown with nothing to disown exits non-zero, killing the gate script. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Temporary debug step to see the background monitor's log output. The monitor writes to /tmp/speedracer-monitor.log but we can't see it from the API. Adding an always-run step to dump it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Inside 'nohup bash -c', $PPID is the nohup bash's parent (the step's bash process), not the runner worker. SIGTERMing the step bash kills the step but the runner starts the next step normally. Fixed by interpolating $PPID from the outer script into the heredoc string at construction time, so the inner bash gets the literal runner worker PID. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
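The difference between expanding $PPID inside the inner shell and interpolating it when the script string is built can be seen in isolation. A sketch with made-up variable names; it prints PIDs rather than signalling anything:

```shell
# Captured by the outer script: the parent of the shell running this code.
target_pid=$PPID

# Single quotes: $PPID is expanded later, by the inner bash, so it names
# the inner bash's own parent rather than the process captured above.
late='echo "inner sees parent: $PPID"'

# Double quotes: ${target_pid} is expanded now, while the string is built,
# so the inner script carries a literal PID.
early="echo \"inner targets: ${target_pid}\""

bash -c "$late"
bash -c "$early"
```

The second form is the one the fix describes: the inner bash no longer computes the PID itself, it just receives the number the outer script chose.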
Copy b was killed by the monitor despite its sibling (a) never posting an in_progress check-run. Adding total check-run count, per-sibling found count, and full matched check-run dump to the monitor log. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Uses stackrox/stackrox/.github/actions/speedracer@davdhacs/speedracer-minimal to load the action directly from the repo without needing checkout first. If GHA resolves the branch name with slashes correctly, this eliminates the checkout→speedracer dependency and lets the check-run POST happen ~3-5s earlier. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Branch-ref download took 5.5-8.8s (full repo tarball). Sparse checkout with depth:1 should transfer only the action file's blob — much faster. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…kout Tests the separate-repo approach. The action is a single file in davdhacs/speedracer, tagged v0.1.0. GHA downloads it during "Set up job" — should be near-instant for a single-file repo vs ~5-9s for the full stackrox repo archive. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…trix dims github.job is "pre-build-go-binaries" for BOTH amd64 and arm64 matrix entries. The arm64 copy's in_progress check-run had external_id "spec-a-<run_id>-pre-build-go-binaries" — identical to what amd64 copy d's monitor was watching. d's monitor found arm64's check-run and incorrectly killed d, leaving no amd64 copy alive. Fixed by using CHECK_NAME (which includes the matrix values, e.g. "pre-build-go-binaries (default, amd64)") instead of JOB_KEY in the external_id. This makes external_id unique per matrix dimension combo. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
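The collision is easy to reproduce with the spec-<letter>-<run_id>-<key> shape quoted above. In this sketch the helper name, the run id 12345, and the arm64 check name (assumed by analogy with the amd64 one) are illustrative only:

```shell
# external_id LETTER RUN_ID KEY -> "spec-<letter>-<run_id>-<key>"
external_id() {
  printf 'spec-%s-%s-%s\n' "$1" "$2" "$3"
}

run_id=12345  # made-up run id

# Keyed on the job name: both architectures get the same id for copy a.
amd64_by_job=$(external_id a "$run_id" "pre-build-go-binaries")
arm64_by_job=$(external_id a "$run_id" "pre-build-go-binaries")

# Keyed on the check name, which includes the matrix values.
amd64_by_check=$(external_id a "$run_id" "pre-build-go-binaries (default, amd64)")
arm64_by_check=$(external_id a "$run_id" "pre-build-go-binaries (default, arm64)")

[ "$amd64_by_job" = "$arm64_by_job" ] && echo "job key: ids collide"
[ "$amd64_by_check" != "$arm64_by_check" ] && echo "check name: ids are distinct"
```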
CHECK_NAME includes matrix values like "pre-build-go-binaries (default,
amd64)" — the spaces broke the 'for sid in ${WATCH_IDS}' word splitting
in the nohup heredoc. Siblings were never found because the jq query
received fragments like "(default," instead of the full external_id.
Fixed by using | as delimiter and IFS="|" read -ra to split correctly.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
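For reference, the word-splitting failure and the delimiter fix look like this in plain bash (arrays and read -a are bash features). The ids below use a made-up run id of 1:

```shell
watch_ids='spec-a-1-pre-build-go-binaries (default, amd64)|spec-b-1-pre-build-go-binaries (default, amd64)'

# Broken: an unquoted expansion splits on whitespace, so each id is
# chopped into fragments such as "(default,".
broken=()
for sid in ${watch_ids}; do
  broken+=("$sid")
done

# Fixed: split on "|" only. Setting IFS as a prefix to read scopes the
# change to that one command.
IFS='|' read -ra fixed <<< "$watch_ids"

echo "broken fragments: ${#broken[@]}"
echo "fixed ids: ${#fixed[@]}"
printf '  %s\n' "${fixed[@]}"
```

This relies on "|" never appearing inside a check name, which holds for the names shown in this PR but is an assumption in general.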
Full rollout for validation, same set as #20923:
Build:
- pre-build-go-binaries [a,b,c,d] (arm64 excluded from b/c/d)
- build-and-push-main [a,b,c] (arm64 excluded from b/c)
- build-and-push-operator [a,b,c] (arm64 excluded from b/c)
- build-and-push-scanner [a,b] (arm64 excluded from b)
Unit Tests:
- go [a,b]
- go-postgres [a,b]
Style:
- check-generated-files [a,b]
- style-check [a,b]
All using davdhacs/speedracer@v0.1.0 (before checkout). Will revert to minimal set after validation.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
github-actions-pin-check requires SHA-pinned references with ratchet comments for all external actions. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Master already has this condition. My addition created a duplicate which caused a YAML parse error: "'if' is already defined" — silently preventing the Build workflow from triggering on PR pushes. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lings are slow When all non-last copies are CPU-gated, the last-copy fallback is the only one that should survive. But the last copy's monitor could find a sibling's in_progress check-run (posted before the sibling's CPU gate killed it) and kill itself — leaving no copy alive. Fixed: if MY_COPY == LAST_COPY, skip monitoring entirely. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Updates davdhacs/speedracer to latest (marker-based post-step). Adds 'run: touch /tmp/speedracer-success' as last step in:
- pre-build-go-binaries (build.yaml)
- build-and-push-main (build.yaml)
- build-and-push-operator (build.yaml)
- build-and-push-scanner (build.yaml)
- go, go-postgres (unit-tests.yaml)
- check-generated-files, style-check (style.yaml)
The marker signals job success to the speedracer post-step. Without it, the post-step can't distinguish success from failure.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…cer SHA Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
These jobs have little slow/fast CPU timing variance (roughly 2-6%):
check-generated-files: fast ~1020s, slow ~1080s (1.06x ratio)
style-check: fast ~600s, slow ~610s (1.02x ratio)
Speedracer adds value when the slow/fast ratio is significant (>1.2x). For CPU-bound jobs with minimal I/O, the EPYC 7763 vs 9V74 difference is negligible: both CPUs have similar single-thread performance. The variance comes from disk I/O (builds, artifact uploads), which these style/lint jobs don't do.
Rule of thumb for applying speedracer:
>1.5x slow/fast ratio → strong candidate (builds, tests with I/O)
1.2-1.5x ratio → marginal (consider if on critical path)
<1.2x ratio → don't apply (overhead > savings)
Jobs that benefit most:
build-and-push-scanner: 3.18x ratio → 49% avg saving
pre-build-go-binaries: 2.05x ratio → 21% avg saving
build-and-push-main: 1.87x ratio → 26% avg saving
go unit tests: 1.53x ratio → 19% avg saving
go-postgres: 1.50x ratio → 17% avg saving
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
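The rule of thumb can be applied mechanically to a job's slow and fast timings. A sketch with a hypothetical helper name (classify); the thresholds are the ones stated in the commit message, and the timings are the ones quoted in this PR:

```shell
# classify SLOW_SECONDS FAST_SECONDS
# Prints the slow/fast ratio and the rule-of-thumb verdict.
classify() {
  awk -v slow="$1" -v fast="$2" 'BEGIN {
    r = slow / fast
    if (r > 1.5)       verdict = "strong candidate"
    else if (r >= 1.2) verdict = "marginal"
    else               verdict = "do not apply"
    printf "%.2fx %s\n", r, verdict
  }'
}

classify 380 185     # pre-build-go-binaries
classify 1080 1020   # check-generated-files
classify 610 600     # style-check
```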
Jobs with higher absolute time waste from slow runners get more copies:
go, go-postgres [a,b] → [a,b,c,d]: 730s waste (12 min) justifies 4 copies.
P(all slow) drops from 17.6% to 3.1%. Expected waste: 129s → 23s.
pre-build-go-binaries [a,b,c,d] → [a,b,c]: 195s waste. 3 copies is
sufficient — P(all slow) = 7.4%, expected waste = 14s.
build-and-push-scanner [a,b] → [a,b,c]: 185s waste with 3.18x ratio
(highest). 3 copies gives P(all slow) = 7.4%.
build-and-push-main, build-and-push-operator: [a,b,c] unchanged.
Copy count rule of thumb: more copies for longer jobs, not just higher ratios.
A 1.5x ratio on a 23-minute job wastes 12 minutes — worth 4 copies.
A 3.2x ratio on a 1.5-minute job wastes 3 minutes — 2-3 copies is enough.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
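The expected-waste figures above are the slow-runner penalty multiplied by the all-slow probability. A sketch under the same assumptions as the commit message (42% slow rate, independent runner assignment); expected_waste is a hypothetical helper name:

```shell
# expected_waste PENALTY_SECONDS SLOW_RATE COPIES
# Expected seconds lost per run: penalty * P(all copies slow).
expected_waste() {
  awk -v w="$1" -v p="$2" -v n="$3" 'BEGIN { printf "%.0f\n", w * p ^ n }'
}

echo "go tests, 2 copies:   $(expected_waste 730 0.42 2)s"
echo "go tests, 4 copies:   $(expected_waste 730 0.42 4)s"
echo "pre-build, 3 copies:  $(expected_waste 195 0.42 3)s"
```

This counts only the slow-runner penalty; it does not subtract the runner time consumed by cancelled copies.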
With 3 copies, 1 in 13 PRs hits all-slow on each critical path job. That's multiple slow PRs per week on an active team. With 4 copies, it drops to 1 in 32, roughly weekly.
All jobs now [a,b,c,d]:
pre-build-go-binaries: P(all slow) 7.4% → 3.1%
build-and-push-main: 7.4% → 3.1%
build-and-push-operator: 7.4% → 3.1%
build-and-push-scanner: 7.4% → 3.1%
go unit tests: already [a,b,c,d]
go-postgres: already [a,b,c,d]
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n time
${{ job.status }} in gacts post: field is evaluated at post-step time,
not registration time. Earlier failures were from broken YAML >- folding.
Removes 'run: touch /tmp/speedracer-success' from all 6 speedracer jobs.
Callers now need only the speedracer action call — zero extra steps.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
pre-build-go-binaries had an inadvertent fail-fast:false not present in master. Removed — the blocking gate within the composite step produces 'cancelled' status, so fail-fast is safe without this override. Also collapsed double blank lines in build.yaml and unit-tests.yaml. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… repo All workflows reference davdhacs/speedracer@SHA (downloaded before checkout). The local copy in .github/actions/speedracer/ is unused. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
🚀 Build Images Ready
Images are ready for commit cd18726. To use with deploy scripts: export MAIN_IMAGE_TAG=4.12.x-151-gcd18726309
Testing the new rank-based speedracer (davdhacs/speedracer@cfb0410) on real CI workloads.
Changes from PR #20998 (speedracer-minimal) plus pointing at the ranking branch instead of the old binary-gate version.
What's different:
spec-{letter}-r{rank}-{run_id}-{check_name}
slow-cpu-pattern input: ranking replaces the binary gate
What to watch:
This branch is disposable — will be deleted after testing.