fix(connect): start retry backoff small instead of at ~1.4s #289

Open: mishushakov wants to merge 1 commit into zeromq:master from mishushakov:fix-connect-backoff

Problem

connect_forever (src/util.rs) is the retry loop behind every Socket::connect. When the peer's port isn't bound yet, transport::connect returns ConnectionRefused (correctly classified as retryable) and the loop sleeps before retrying.

The delay was computed as:

```rust
if try_num < 5 { try_num += 1 }                  // bumped to 1 BEFORE the first sleep
let delay = E.powf(try_num as f64 / 3.0)         // e^(1/3) ≈ 1.40
          + rng.random_range(0.0..0.1);          // + jitter
```

So the first retry already waits ~1.4s (then ~1.95s, ~2.7s, …). A peer whose kernel binds in ~330ms therefore isn't reached until ~1.4s — attempt #1 fails, the loop sleeps ~1.4s, attempt #2 succeeds.
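The old schedule can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the crate's code: `old_delay_secs` is a hypothetical helper, jitter is left out, and it assumes `try_num` starts at 0, as the "bumped to 1" comment implies.

```rust
use std::f64::consts::E;

/// Delay in seconds (jitter excluded) that the old formula gives for the
/// n-th sleep, counting from 1. Hypothetical helper for illustration only.
fn old_delay_secs(sleep_index: u32) -> f64 {
    // Assumption: try_num starts at 0, is incremented before each sleep,
    // and stops growing once it reaches 5.
    let try_num = sleep_index.min(5);
    E.powf(try_num as f64 / 3.0)
}

fn main() {
    // Print the first few sleeps; the sixth repeats the fifth.
    for n in 1..=6 {
        println!("sleep #{n}: {:.2}s", old_delay_secs(n));
    }
}
```

Under these assumptions the sleeps are about 1.40s, 1.95s, 2.72s, 3.79s and 5.29s, and every later sleep stays at about 5.29s.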

Fix

Start the backoff small and grow it, mirroring the crate's existing ReconnectConfig semantics (exponential, capped, jittered):

```rust
const INITIAL: Duration = Duration::from_millis(50);
const MAX: Duration = Duration::from_secs(30);
let mut delay = INITIAL;
// ...
let jitter = rand::rng().random_range(0.0f64..0.1f64);
async_rt::task::sleep(delay + delay.mul_f64(jitter)).await;
delay = (delay * 2).min(MAX);
```

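Sleeps now run ~50, 100, 200, 400ms…, so retries land at roughly 50, 150, 350, 750ms after the first failed attempt. A peer ready at ~330ms is reached at ~350ms instead of ~1.4s. The shape is otherwise the same: still exponential, still capped, still jittered. Two details differ from the old formula quoted above: the cap rises from e^(5/3) ≈ 5.3s (implied by try_num < 5) to 30s, and jitter is now up to 10% of the delay rather than a flat 0 to 0.1s. The outer run_with_timeout(connect_timeout) (default 30s) still bounds a peer that never binds.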
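The ~350ms figure follows from summing the sleeps. The sketch below is a simplified model rather than the patched loop: `first_success_at` is a hypothetical helper, and it ignores both jitter and the time each connect attempt takes.

```rust
use std::time::Duration;

const INITIAL: Duration = Duration::from_millis(50);
const MAX: Duration = Duration::from_secs(30);

/// Elapsed time at which the first attempt lands at or after `ready_at`.
/// Hypothetical helper: jitter and per-attempt connect time are ignored.
fn first_success_at(ready_at: Duration) -> Duration {
    let mut elapsed = Duration::ZERO; // attempt #1 happens at t = 0
    let mut delay = INITIAL;
    while elapsed < ready_at {
        // The attempt at `elapsed` was refused: sleep, then double the delay.
        elapsed += delay;
        delay = (delay * 2).min(MAX);
    }
    elapsed
}

fn main() {
    let reached = first_success_at(Duration::from_millis(330));
    println!("peer ready at 330ms is reached at {reached:?}");
}
```

With up to 10% jitter on each sleep, the real figure would sit somewhere between 350ms and about 385ms, plus whatever the failed attempts themselves cost.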

Testing

  • cargo build succeeds
  • cargo clippy --lib is clean

🤖 Generated with Claude Code

connect_forever opened its retry backoff at e^(1/3) ≈ 1.4s, so a peer
whose port wasn't bound yet (ConnectionRefused) wasn't reached until the
first ~1.4s sleep elapsed, even though the kernel typically binds in
~330ms.

Replace the e^(try_num/3) formula with exponential backoff starting at
50ms and doubling (capped at 30s), mirroring ReconnectConfig semantics.
A peer ready at ~330ms is now reached at ~350ms. Jitter and the outer
connect timeout are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>