OCPBUGS-67134: add grace period before reporting Available=False #1179
sg00dwin wants to merge 1 commit into
The console operator immediately reports Available=False when deployment replicas drop to zero, even during brief disruptions (~10s) that self-recover. Add a 2-minute grace period that suppresses the condition when the deployment was recently available, preventing false alarms during disruptive CI tests while still reporting genuine outages. Co-Authored-By: Claude Opus 4.6
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-67134, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sg00dwin. |
|
No actionable comments were generated in the recent review. 🎉
Walkthrough
Changes: Deployment Availability Grace Period
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 15 passed
|
/retest |
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-67134, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (yapei@redhat.com), skipping review request. |
|
/retest-required |
|
/test e2e-aws-console |
|
/test e2e-aws-console |
|
@sg00dwin good investigation on this one, the root cause analysis is solid and the tests are well written 👍
I've been thinking about this more though, and I'm not sure suppressing
The real question is: why are all replicas going down during a single node reboot? If they're colocated on the same node, that's the root cause we should fix.
A few alternatives worth exploring:
Can you dig into whether we have topology constraints on the console deployment? That feels like the most impactful fix here. |
|
/test e2e-aws-console |
@jhadvig Thanks for the thorough review! On suggestions 1 and 2, the console deployment already has both:
The blip still happens because the conformance-serial tests do involuntary node reboots, which bypass the PDB. When the test hits the right nodes, both pods go offline for ~10 seconds regardless.
On option 3: there's already an origin exception demoting this to a flake, but OTA-362 is moving to remove those exceptions rather than add smarter ones. So the test layer is heading in the opposite direction.
The operator-level grace period was modeled after the sibling fixes (OCPBUGS-24041, OCPBUGS-38676, OCPBUGS-64688), but your point about Available being intentionally real-time is a fair one. Would you prefer we pursue the library-go path instead, proposing inertia for Available on StatusSyncer so it's handled consistently across operators? Happy to go either direction. |
|
/test e2e-aws-console |
|
/test e2e-aws-console |
|
@sg00dwin: all tests passed! |
Summary
Available=False with Deployment_InsufficientReplicas, triggering OTA invariant test failures
What changed
lastDeploymentAvailableTime on the operator struct)
evaluateDeploymentAvailability() method that checks the grace window before reporting Available=False
Related
Test plan
make test-unit clean (go test, gofmt, govet)
console + Deployment_InsufficientReplicas
Co-Authored-By: Claude Opus 4.6