
test: speedracer ranking on real workloads#21118

Draft
davdhacs wants to merge 42 commits into master from davdhacs/speedracer-ranking-test

@davdhacs


Testing the new rank-based speedracer (davdhacs/speedracer@cfb0410) on real CI workloads.

This PR carries the changes from PR #20998 (speedracer-minimal) and points the workflows at the ranking branch instead of the old binary-gate version.

What's different:

  • Runners read their CPU rank from runner-pool.tsv (line number = rank)
  • Rank embedded in check-run external_id: spec-{letter}-r{rank}-{run_id}-{check_name}
  • Higher-ranked siblings always win; same-rank → higher letter wins
  • No slow-cpu-pattern input — ranking replaces the binary gate
  • No copy-a special case — all copies follow the same rank+letter priority
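
The ranking rules above can be sketched in a few lines of Python. This is an illustration only, not the action's code: the TSV column layout (CPU model substring in the first column), the helper names, and the reading of "higher-ranked" as "lower line number" and "higher letter" as "earlier in the alphabet" are all assumptions.

```python
import re
from typing import List, Optional, Tuple


def rank_from_pool(tsv_text: str, cpu_model: str) -> Optional[int]:
    """Return the 1-based line number of the first row matching cpu_model.

    Assumption: the first TSV column holds a CPU model substring.
    """
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        pattern = line.split("\t")[0].strip()
        if pattern and pattern in cpu_model:
            return line_no
    return None


def make_external_id(letter: str, rank: int, run_id: int, check_name: str) -> str:
    """Build spec-{letter}-r{rank}-{run_id}-{check_name}."""
    return f"spec-{letter}-r{rank}-{run_id}-{check_name}"


# check_name may itself contain hyphens, so it is captured greedily at the end.
_ID_RE = re.compile(r"^spec-([a-z])-r(\d+)-(\d+)-(.+)$")


def parse_external_id(external_id: str) -> Tuple[str, int, int, str]:
    match = _ID_RE.match(external_id)
    if match is None:
        raise ValueError(f"not a speedracer external_id: {external_id!r}")
    letter, rank, run_id, check_name = match.groups()
    return letter, int(rank), int(run_id), check_name


def pick_winner(copies: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Pick the winning (letter, rank) pair.

    Assumption: rank 1 is the best rank, and on equal rank 'a' beats 'b'.
    """
    return min(copies, key=lambda copy: (copy[1], copy[0]))


# Hypothetical pool file: line 1 is the fastest CPU.
POOL = "AMD EPYC 9V74\nAMD EPYC 7763\n"
my_rank = rank_from_pool(POOL, "AMD EPYC 7763 64-Core Processor")
my_id = make_external_id("b", my_rank, 123, "pre-build-go-binaries (default, amd64)")
```

If the real ordering is the reverse (higher line number wins, or later letters win ties), only the sort key in `pick_winner` changes.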

What to watch:

  • Do runners correctly identify their rank from the TSV?
  • Does the highest-ranked copy win?
  • Are check-runs posted and resolved correctly?
  • Does branch protection work (success check-run for the winning copy)?

This branch is disposable — will be deleted after testing.

davdhacs and others added 30 commits June 6, 2026 08:07
pre-build-go-binaries is on the critical path for every image build —
it blocks build-and-push-main, build-and-push-operator, and
build-and-push-scanner. When it lands on a slow runner (EPYC 7763,
~42% of the pool), the job takes 380s vs 185s on fast hardware —
adding 3.25 minutes to the build pipeline wall-clock.

Speedracer runs 2 copies in parallel. The first copy to land on a fast
runner wins; the other exits as 'cancelled' within seconds. arm64 is
excluded from the extra copy (consistent hardware, no variance).

The action (.github/actions/speedracer) is self-contained — callers
add 'speedracer: [a, b]' to their matrix and a single action call.
It auto-derives the check-run name from the matrix, auto-detects the
last-copy fallback, and resolves check-runs via a post-step hook.

AI-generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
With 42% slow-runner rate:
  2 copies: 18% chance both slow (0.42²)
  3 copies:  7% chance all slow (0.42³)

3 copies nearly eliminates the slow-runner penalty on this critical-path
job. arm64 excluded from b+c (consistent hardware).
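
The percentages above follow from treating each copy's runner assignment as an independent draw with a 42% slow rate. A minimal sketch of that arithmetic (the function name is illustrative, and independence is an assumption implied by the 0.42² form):

```python
def p_all_slow(slow_rate: float, copies: int) -> float:
    """Probability that every copy lands on a slow runner, assuming independent draws."""
    return slow_rate ** copies


for n in (1, 2, 3, 4):
    print(f"{n} copies: {p_all_slow(0.42, n):.1%} chance all slow")
```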

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Critical-path job — 4 copies reduces all-slow probability to 3.1%.
Cancelled copies cost ~15s each, trivial vs the 195s slow-runner penalty.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Removes the blocking gate. All copies begin work in parallel. A background
monitor polls for a sibling's completed/success check-run every 5s (after
8-10s initial delay). When a winner is found:

1. Monitor writes /tmp/speedracer-killed marker
2. Monitor SIGTERMs the runner ($PPID)
3. Post-step detects the marker and PATCHes the check-run to 'cancelled'
   (not 'failure') so the PR UI shows the correct status
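
The three steps above can be sketched as follows. This is a Python illustration rather than the action's shell code; `sibling_won`, the injected `sleep` and `kill` hooks, and the `post_step_conclusion` helper are hypothetical names introduced so the flow can be exercised without a real runner.

```python
import os
import signal
import time
from pathlib import Path
from typing import Callable, Optional


def run_monitor(
    sibling_won: Callable[[], bool],
    target_pid: int,
    marker: Path,
    initial_delay: float = 8.0,
    interval: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    kill: Callable[[int, int], None] = os.kill,
) -> bool:
    """Poll until a sibling has won, then mark and signal the target process."""
    sleep(initial_delay)
    polls = 0
    while max_polls is None or polls < max_polls:
        if sibling_won():
            marker.touch()                    # step 1: leave the killed marker
            kill(target_pid, signal.SIGTERM)  # step 2: SIGTERM the runner
            return True
        polls += 1
        sleep(interval)
    return False


def post_step_conclusion(marker: Path, job_succeeded: bool) -> str:
    """Step 3: report 'cancelled' instead of 'failure' when the marker exists."""
    if marker.exists():
        return "cancelled"
    return "success" if job_succeeded else "failure"
```

Whether the post-step actually runs after the SIGTERM is exactly what the prototype sets out to test, so the sketch only shows the intended flow.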

This is a prototype — testing whether:
- SIGTERM from a background process during work steps fires the post-step
- The /tmp/speedracer-killed marker survives the SIGTERM
- The check-run PATCH to 'cancelled' works

If SIGTERM doesn't fire the post-step, we'll try other signals.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The check-run (e.g. 'pre-build-go-binaries (default, amd64)') belongs to
the winner. When a copy is killed by the monitor, its post-step should do
nothing — the winning copy's post-step already POSTed completed/success.

The losing copy's GHA job shows as 'failure' in the matrix, but that's
cosmetic — branch protection watches the check-run name, not the job name.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The monitor previously watched check-runs for conclusion=success, but
that isn't set until the post-step fires (after ALL work steps). By then
the duplicate has already finished and uploaded artifacts.

Now watches the jobs API for sibling jobs that completed successfully.
GHA marks a job as completed/success as soon as its last step finishes,
BEFORE the post-step. This gives the monitor a window to kill duplicates
before they reach their upload/push steps.

Also reduced initial delay from 8-10s to 3-4s and poll interval from
5s to 3s for faster reaction.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tiebreak)

Copy a always wins if fast (no monitor). Copy b watches a. Copy c watches
a and b. Copy d watches a, b, and c. When a lower-lettered sibling has
posted an in_progress check-run (meaning it passed the CPU gate and is on
a fast runner), this copy kills itself — it's a duplicate.

This fires within seconds of the siblings starting (5-7s initial delay +
first poll), before real work produces artifacts or pushes images.
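
The watch assignment described above (each copy watches every lower-lettered sibling) reduces to a list prefix. A minimal sketch, with a hypothetical helper name:

```python
from typing import List


def siblings_to_watch(my_copy: str, copies: List[str]) -> List[str]:
    """Return the lower-lettered siblings that my_copy should watch."""
    return copies[: copies.index(my_copy)]


COPIES = ["a", "b", "c", "d"]
for copy in COPIES:
    print(copy, "watches", siblings_to_watch(copy, COPIES))
```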

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two bugs from the last run where ALL copies were killed:

1. fail-fast:true (default) caused copy c's 'failure' (monitor kill) to
   cascade-cancel all other matrix entries including arm64. Fixed with
   fail-fast:false.

2. CPU gate's kill -TERM is async — bash continues executing past the
   kill line before the runner processes the signal. Added exit 0 after
   sleep 600 to prevent fallthrough to the check-run POST.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
An orphaned 'disown' from the old version ran for all copies — including
copy a which has no background job. With set -e (GHA default for bash),
disown with nothing to disown exits non-zero, killing the gate script.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Temporary debug step to see the background monitor's log output.
The monitor writes to /tmp/speedracer-monitor.log but we can't see it
from the API. Adding an always-run step to dump it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Inside 'nohup bash -c', $PPID is the nohup bash's parent (the step's
bash process), not the runner worker. SIGTERMing the step bash kills
the step but the runner starts the next step normally.

Fixed by interpolating $PPID from the outer script into the heredoc
string at construction time, so the inner bash gets the literal runner
worker PID.
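
The same pitfall can be reproduced outside bash. In this Python analogue (a sketch on a POSIX system, not the action's code), the spawning script plays the role of the step's bash and its own parent plays the role of the runner worker: the child sees the spawner as its parent, so the target PID has to be captured in the outer process and passed in as a literal value.

```python
import os
import subprocess
import sys
from typing import Tuple

# The child prints its own parent PID and the PID it was handed on the command line.
CHILD = "import os, sys; print(os.getppid(), sys.argv[1])"


def spawn_and_report(target_pid: int) -> Tuple[int, int]:
    """Spawn a child and return (child's parent PID, PID passed in by the caller)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(target_pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    child_parent, passed = result.stdout.split()
    return int(child_parent), int(passed)


# Capture the outer script's parent before spawning, like interpolating $PPID.
outer_parent = os.getppid()
child_parent, passed = spawn_and_report(outer_parent)
```

Here `child_parent` is the spawning script itself, while `passed` is the PID that was actually meant to receive the signal.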

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copy b was killed by the monitor despite its sibling (a) never posting
an in_progress check-run. Adding total check-run count, per-sibling
found count, and full matched check-run dump to the monitor log.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Uses stackrox/stackrox/.github/actions/speedracer@davdhacs/speedracer-minimal
to load the action directly from the repo without needing checkout first.
If GHA resolves the branch name with slashes correctly, this eliminates
the checkout→speedracer dependency and lets the check-run POST happen
~3-5s earlier.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Branch-ref download took 5.5-8.8s (full repo tarball). Sparse checkout
with depth:1 should transfer only the action file's blob — much faster.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…kout

Tests the separate-repo approach. The action is a single file in
davdhacs/speedracer, tagged v0.1.0. GHA downloads it during "Set up job"
— should be near-instant for a single-file repo vs ~5-9s for the full
stackrox repo archive.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…trix dims

github.job is "pre-build-go-binaries" for BOTH amd64 and arm64 matrix
entries. The arm64 copy's in_progress check-run had external_id
"spec-a-<run_id>-pre-build-go-binaries" — identical to what amd64 copy d's
monitor was watching. d's monitor found arm64's check-run and incorrectly
killed d, leaving no amd64 copy alive.

Fixed by using CHECK_NAME (which includes the matrix values, e.g.
"pre-build-go-binaries (default, amd64)") instead of JOB_KEY in the
external_id. This makes external_id unique per matrix dimension combo.
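
A short sketch of the collision, using the pre-ranking ID shape quoted above (`spec-{letter}-{run_id}-{name}`). The run ID and the arm64 check name are made-up values for illustration:

```python
def external_id(letter: str, run_id: int, name: str) -> str:
    """Build the pre-ranking external_id: spec-{letter}-{run_id}-{name}."""
    return f"spec-{letter}-{run_id}-{name}"


RUN_ID = 1001  # hypothetical run ID
JOB_KEY = "pre-build-go-binaries"  # github.job: same for every matrix entry

# Keyed on the job name, copy a on amd64 and copy a on arm64 collide.
amd64_by_job = external_id("a", RUN_ID, JOB_KEY)
arm64_by_job = external_id("a", RUN_ID, JOB_KEY)

# Keyed on the check name, the matrix values keep them apart.
amd64_by_check = external_id("a", RUN_ID, "pre-build-go-binaries (default, amd64)")
arm64_by_check = external_id("a", RUN_ID, "pre-build-go-binaries (default, arm64)")
```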

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CHECK_NAME includes matrix values like "pre-build-go-binaries (default,
amd64)" — the spaces broke the 'for sid in ${WATCH_IDS}' word splitting
in the nohup heredoc. Siblings were never found because the jq query
received fragments like "(default," instead of the full external_id.

Fixed by using | as delimiter and IFS="|" read -ra to split correctly.
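
The word-splitting failure is easy to reproduce. The sketch below is a Python analogue of the shell behaviour, with made-up IDs: splitting on whitespace breaks each ID into fragments, while splitting on `|` keeps them whole.

```python
from typing import List

WATCH_IDS = (
    "spec-a-1001-pre-build-go-binaries (default, amd64)"
    "|spec-b-1001-pre-build-go-binaries (default, amd64)"
)


def split_on_whitespace(ids: str) -> List[str]:
    """Mimics unquoted 'for sid in ${WATCH_IDS}' word splitting."""
    return ids.split()


def split_on_pipe(ids: str) -> List[str]:
    """Mimics IFS="|" read -ra: one element per external_id."""
    return ids.split("|")


print(split_on_whitespace(WATCH_IDS))
print(split_on_pipe(WATCH_IDS))
```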

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Full rollout for validation — same set as #20923:

Build:
  - pre-build-go-binaries [a,b,c,d] (arm64 excluded from b/c/d)
  - build-and-push-main [a,b,c] (arm64 excluded from b/c)
  - build-and-push-operator [a,b,c] (arm64 excluded from b/c)
  - build-and-push-scanner [a,b] (arm64 excluded from b)

Unit Tests:
  - go [a,b]
  - go-postgres [a,b]

Style:
  - check-generated-files [a,b]
  - style-check [a,b]

All using davdhacs/speedracer@v0.1.0 (before checkout).
Will revert to minimal set after validation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
github-actions-pin-check requires SHA-pinned references with ratchet
comments for all external actions.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Master already has this condition. My addition created a duplicate
which caused a YAML parse error: "'if' is already defined" — silently
preventing the Build workflow from triggering on PR pushes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lings are slow

When all non-last copies are CPU-gated, the last-copy fallback is the
only one that should survive. But the last copy's monitor could find a
sibling's in_progress check-run (posted before the sibling's CPU gate
killed it) and kill itself — leaving no copy alive.

Fixed: if MY_COPY == LAST_COPY, skip monitoring entirely.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Updates davdhacs/speedracer to latest (marker-based post-step).
Adds 'run: touch /tmp/speedracer-success' as last step in:
  - pre-build-go-binaries (build.yaml)
  - build-and-push-main (build.yaml)
  - build-and-push-operator (build.yaml)
  - build-and-push-scanner (build.yaml)
  - go, go-postgres (unit-tests.yaml)
  - check-generated-files, style-check (style.yaml)

The marker signals job success to the speedracer post-step.
Without it, the post-step can't distinguish success from failure.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…cer SHA

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
These jobs show little slow/fast CPU timing variance (about 2-6%):
  check-generated-files: fast ~1020s, slow ~1080s (1.06x ratio)
  style-check:           fast ~600s,  slow ~610s  (1.02x ratio)

Speedracer adds value when the slow/fast ratio is significant (>1.2x).
For CPU-bound jobs with minimal I/O, the EPYC 7763 vs 9V74 difference
is negligible — both CPUs have similar single-thread performance. The
variance comes from disk I/O (builds, artifact uploads), which these
style/lint jobs don't do.

Rule of thumb for applying speedracer:
  >1.5x slow/fast ratio → strong candidate (builds, tests with I/O)
  1.2-1.5x ratio → marginal (consider if on critical path)
  <1.2x ratio → don't apply (overhead > savings)

Jobs that benefit most:
  build-and-push-scanner:  3.18x ratio → 49% avg saving
  pre-build-go-binaries:   2.05x ratio → 21% avg saving
  build-and-push-main:     1.87x ratio → 26% avg saving
  go unit tests:           1.53x ratio → 19% avg saving
  go-postgres:             1.50x ratio → 17% avg saving
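
The rule of thumb above can be written as a small classifier. The function name and labels are illustrative, and the handling of the exact boundaries (1.2 and 1.5 both count as marginal) is an assumption, since the thresholds above do not say which side they fall on.

```python
def classify(slow_fast_ratio: float) -> str:
    """Classify a job by its slow/fast runtime ratio."""
    if slow_fast_ratio > 1.5:
        return "strong candidate"
    if slow_fast_ratio >= 1.2:
        return "marginal"
    return "don't apply"


# Ratios computed from the timings quoted above.
print("check-generated-files:", classify(1080 / 1020))
print("style-check:", classify(610 / 600))
print("build-and-push-scanner:", classify(3.18))
```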

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Jobs with higher absolute time waste from slow runners get more copies:

  go, go-postgres [a,b] → [a,b,c,d]: 730s waste (12 min) justifies 4 copies.
    P(all slow) drops from 17.6% to 3.1%. Expected waste: 129s → 23s.

  pre-build-go-binaries [a,b,c,d] → [a,b,c]: 195s waste. 3 copies is
    sufficient — P(all slow) = 7.4%, expected waste = 14s.

  build-and-push-scanner [a,b] → [a,b,c]: 185s waste with 3.18x ratio
    (highest). 3 copies gives P(all slow) = 7.4%.

  build-and-push-main, build-and-push-operator: [a,b,c] unchanged.

Copy count rule of thumb: more copies for longer jobs, not just higher ratios.
A 1.5x ratio on a 23-minute job wastes 12 minutes — worth 4 copies.
A 3.2x ratio on a 1.5-minute job wastes 3 minutes — 2-3 copies is enough.
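
The expected-waste figures above are the slow-runner penalty multiplied by the probability that every copy is slow. A minimal sketch, assuming independent runner assignment and the 42% slow rate quoted earlier (the function name is illustrative):

```python
def expected_waste(penalty_s: float, copies: int, slow_rate: float = 0.42) -> float:
    """Expected seconds lost per run: the penalty is paid only if all copies are slow."""
    return penalty_s * slow_rate ** copies


print(f"go tests, 2 copies: {expected_waste(730, 2):.0f}s")
print(f"go tests, 4 copies: {expected_waste(730, 4):.0f}s")
print(f"pre-build-go-binaries, 3 copies: {expected_waste(195, 3):.0f}s")
```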

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
With 3 copies, 1 in 13 PRs hits all-slow on each critical path job.
That's multiple slow PRs per week on an active team. With 4 copies,
it drops to 1 in 32 — roughly weekly.

All jobs now [a,b,c,d]:
  pre-build-go-binaries:   P(all slow) 7.4% → 3.1%
  build-and-push-main:     7.4% → 3.1%
  build-and-push-operator:  7.4% → 3.1%
  build-and-push-scanner:   7.4% → 3.1%
  go unit tests:            already [a,b,c,d]
  go-postgres:              already [a,b,c,d]

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
davdhacs and others added 7 commits June 7, 2026 16:54
…n time

${{ job.status }} in gacts post: field is evaluated at post-step time,
not registration time. Earlier failures were from broken YAML >- folding.

Removes 'run: touch /tmp/speedracer-success' from all 6 speedracer jobs.
Callers now need only the speedracer action call — zero extra steps.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
pre-build-go-binaries had an inadvertent fail-fast:false not present in
master. Removed — the blocking gate within the composite step produces
'cancelled' status, so fail-fast is safe without this override.

Also collapsed double blank lines in build.yaml and unit-tests.yaml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… repo

All workflows reference davdhacs/speedracer@SHA (downloaded before checkout).
The local copy in .github/actions/speedracer/ is unused.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jun 13, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 13, 2026


Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions

github-actions Bot commented Jun 13, 2026


🚀 Build Images Ready

Images are ready for commit cd18726. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-151-gcd18726309
