fix(redis-ha): recover no-master cycle when sentinel names a replica as master by mohamed3laa333 · Pull Request #410 · DandyDeveloper/charts

mohamed3laa333 · 2026-06-29T09:36:57Z

Problem

After disorderly pod restarts (e.g. several redis pods restarting near-simultaneously), the Sentinel quorum can settle on a master address that points at a node whose role is actually slave. Every node ends up a replica of another (an all-replicas / no-master cycle), so HAProxy bk_redis_master has 0 backends and writes fail until a human intervenes with sentinel reset.

#404 (closing #397/#398) added a sentinel reset self-heal, but it only fires when get-master-addr-by-name returns an empty master. The non-empty variant — Sentinel returns a real address that answers role:slave — is not covered.

Root cause

In fix-split-brain.sh, when MASTER == ANNOUNCE_IP (Sentinel names this pod as master) but the local role is slave, the script calls reinit. reinit re-runs init.sh (which re-derives slaveof for any pod that is not StatefulSet ordinal 0) and shutdowns the pod. So the pod restarts straight back into a replica while Sentinel keeps naming it → an endless reinit/shutdown CrashLoop (the "unnecessary master shutdown" reported in #383) that never elects a master, and the cycle never breaks.

Fix

When Sentinel names this pod as master but Redis here is a replica, promote it with replicaof no one instead of reinit-looping it:

Sentinel has already elected this pod, so promoting it makes reality match Sentinel's view.
Only one pod can match get-master-addr-by-name, so this cannot create two masters.
The other replicas already target this address, so their replication links recover once it is a real master.
It stops the reinit/shutdown CrashLoop.

This generalizes the self-healing introduced in #404 to the pingable-slave variant. The else branch keeps the original "no need to reinitialize" log for the healthy case.

Validation

helm lint passes; helm template confirms promote_self / replicaof no one render into the split-brain ConfigMap and the rest of the chart is unchanged.
Diff is 2 files (Chart.yaml version bump 4.39.0 → 4.39.1, and the split-brain script).

Refs #398, #397, #383.

…as master After disorderly pod restarts the Sentinel quorum can settle on a master address that points at a node whose role is actually `slave`, leaving the cluster in an all-replicas / no-master cycle with HAProxy `bk_redis_master` serving 0 backends (a write outage). PR DandyDeveloper#404 (closing DandyDeveloper#397/DandyDeveloper#398) only added a `sentinel reset` for the case where `get-master-addr-by-name` returns an EMPTY master. The non-empty case (Sentinel names a real address that answers `role:slave`) instead falls into the `MASTER = ANNOUNCE_IP` branch, which calls `reinit`. For any pod that is not StatefulSet ordinal 0, `reinit` re-runs init.sh (which re-derives `slaveof`) and shuts the pod down, so it restarts straight back into a replica while Sentinel keeps naming it: an endless reinit/shutdown CrashLoop (the "unnecessary master shutdown" of DandyDeveloper#383) that never elects a master. Fix: when Sentinel names this pod as master but Redis here is a replica, promote it with `replicaof no one` instead of reinit-looping it. Sentinel has already elected this pod, and only one pod can match get-master-addr-by-name, so this cannot create two masters; the other replicas already target this address, so their links recover once it is a real master. This generalizes the self-healing of DandyDeveloper#404 to the pingable-slave variant. Refs DandyDeveloper#398, DandyDeveloper#397, DandyDeveloper#383.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(redis-ha): recover no-master cycle when sentinel names a replica as master#410

fix(redis-ha): recover no-master cycle when sentinel names a replica as master#410
mohamed3laa333 wants to merge 1 commit into
DandyDeveloper:masterfrom
mohamed3laa333:fix/redis-ha-no-master-cycle-promote

mohamed3laa333 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mohamed3laa333 commented Jun 29, 2026

Problem

Root cause

Fix

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant