Skip to content

fix(redis-ha): recover no-master cycle when sentinel names a replica as master#410

Open
mohamed3laa333 wants to merge 1 commit into
DandyDeveloper:masterfrom
mohamed3laa333:fix/redis-ha-no-master-cycle-promote
Open

fix(redis-ha): recover no-master cycle when sentinel names a replica as master#410
mohamed3laa333 wants to merge 1 commit into
DandyDeveloper:masterfrom
mohamed3laa333:fix/redis-ha-no-master-cycle-promote

Conversation

@mohamed3laa333

Copy link
Copy Markdown

Problem

After disorderly pod restarts (e.g. several redis pods restarting near-simultaneously), the Sentinel quorum can settle on a master address that points at a node whose role is actually slave. Every node ends up a replica of another (an all-replicas / no-master cycle), so HAProxy bk_redis_master has 0 backends and writes fail until a human intervenes with sentinel reset.

#404 (closing #397/#398) added a sentinel reset self-heal, but it only fires when get-master-addr-by-name returns an empty master. The non-empty variant — Sentinel returns a real address that answers role:slave — is not covered.

Root cause

In fix-split-brain.sh, when MASTER == ANNOUNCE_IP (Sentinel names this pod as master) but the local role is slave, the script calls reinit. reinit re-runs init.sh (which re-derives slaveof for any pod that is not StatefulSet ordinal 0) and shutdowns the pod. So the pod restarts straight back into a replica while Sentinel keeps naming it → an endless reinit/shutdown CrashLoop (the "unnecessary master shutdown" reported in #383) that never elects a master, and the cycle never breaks.

Fix

When Sentinel names this pod as master but Redis here is a replica, promote it with replicaof no one instead of reinit-looping it:

  • Sentinel has already elected this pod, so promoting it makes reality match Sentinel's view.
  • Only one pod can match get-master-addr-by-name, so this cannot create two masters.
  • The other replicas already target this address, so their replication links recover once it is a real master.
  • It stops the reinit/shutdown CrashLoop.

This generalizes the self-healing introduced in #404 to the pingable-slave variant. The else branch keeps the original "no need to reinitialize" log for the healthy case.

Validation

  • helm lint passes; helm template confirms promote_self / replicaof no one render into the split-brain ConfigMap and the rest of the chart is unchanged.
  • Diff is 2 files (Chart.yaml version bump 4.39.0 → 4.39.1, and the split-brain script).

Refs #398, #397, #383.

…as master

After disorderly pod restarts the Sentinel quorum can settle on a master
address that points at a node whose role is actually `slave`, leaving the
cluster in an all-replicas / no-master cycle with HAProxy `bk_redis_master`
serving 0 backends (a write outage).

PR DandyDeveloper#404 (closing DandyDeveloper#397/DandyDeveloper#398) only added a `sentinel reset` for the case where
`get-master-addr-by-name` returns an EMPTY master. The non-empty case (Sentinel
names a real address that answers `role:slave`) instead falls into the
`MASTER = ANNOUNCE_IP` branch, which calls `reinit`. For any pod that is not
StatefulSet ordinal 0, `reinit` re-runs init.sh (which re-derives `slaveof`)
and shuts the pod down, so it restarts straight back into a replica while
Sentinel keeps naming it: an endless reinit/shutdown CrashLoop (the
"unnecessary master shutdown" of DandyDeveloper#383) that never elects a master.

Fix: when Sentinel names this pod as master but Redis here is a replica,
promote it with `replicaof no one` instead of reinit-looping it. Sentinel has
already elected this pod, and only one pod can match get-master-addr-by-name,
so this cannot create two masters; the other replicas already target this
address, so their links recover once it is a real master. This generalizes the
self-healing of DandyDeveloper#404 to the pingable-slave variant.

Refs DandyDeveloper#398, DandyDeveloper#397, DandyDeveloper#383.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant