docs: add Rook-Ceph upgrade network troubleshooting#791

Merged
jing2uo merged 1 commit into main from docs/acp-53205-rook-ceph-network-kb on Jun 18, 2026

Muyan0828 commented Jun 18, 2026

Summary

  • Add a Chinese troubleshooting KB for ACP-53205: Rook-Ceph upgrade stalls when the pod network cannot reach the Ceph storage network.
  • Add the matching English KB under docs/en/solutions.
  • Document diagnosis, long-term network remediation, temporary hostNetwork recovery, and pre-upgrade checks.

Test Plan

  • git diff --check -- docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md
  • Front matter and code fence count check with awk
  • rg -n "TBD|TODO|Required field|filed|rook-cef|rook-eph|对应实例" ...
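
The awk check named in the test plan is not shown in this conversation. As a rough Python equivalent (a minimal sketch, not the author's script; `check_markdown` and `FENCE` are hypothetical names), a front-matter and code-fence sanity check might look like this:

```python
FENCE = "`" * 3  # a Markdown code-fence marker, built here to avoid a literal fence


def check_markdown(text: str) -> list[str]:
    """Return a list of structural problems found in a Markdown document."""
    problems = []
    lines = text.splitlines()

    # Front matter: the first line must be '---' and a closing '---' must follow.
    if not lines or lines[0].strip() != "---":
        problems.append("missing opening front-matter delimiter")
    elif "---" not in (line.strip() for line in lines[1:]):
        problems.append("missing closing front-matter delimiter")

    # Code fences: every opening fence needs a closing one, so the count must be even.
    fence_count = sum(1 for line in lines if line.lstrip().startswith(FENCE))
    if fence_count % 2 != 0:
        problems.append(f"unbalanced code fences ({fence_count} fence lines)")

    return problems
```

An empty list means both checks passed. The sketch only counts fence lines, so it would not notice a fence nested inside another block.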

Notes

  • yarn build could not be run in this workspace because node_modules is absent and Yarn reports: Couldn't find the node_modules state file - running an install might help.

Summary by CodeRabbit

  • Documentation
    • Added troubleshooting guides for Rook-Ceph upgrades on Alauda Container Platform (available in English and Chinese), including diagnostic procedures and recovery steps.

coderabbitai Bot commented Jun 18, 2026

Walkthrough

Two new troubleshooting documents (English and Chinese) are added describing how Rook-Ceph upgrades on ACP can stall with CephCluster stuck in Progressing when hostNetwork overrides are reverted by OLM reconciliation. The docs cover diagnosis, a long-term network fix, a temporary recovery procedure, and pre-upgrade prevention checks.
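
The KB's own diagnostic commands are not reproduced in this conversation. As an illustration of the hostNetwork check the walkthrough describes, the following minimal sketch assumes the input is the JSON printed by `kubectl get pods -n rook-ceph -o json` and reads the standard Pod field `spec.hostNetwork`; the function name is hypothetical:

```python
import json


def pods_without_host_network(raw_json: str) -> list[str]:
    """Return the names of pods whose spec.hostNetwork is absent or false."""
    pod_list = json.loads(raw_json)
    names = []
    for pod in pod_list.get("items", []):
        # Kubernetes omits hostNetwork from the spec when it is false.
        if not pod.get("spec", {}).get("hostNetwork", False):
            names.append(pod["metadata"]["name"])
    return names
```

If an operator, tools, or CSI pod that previously ran with hostNetwork shows up in the result after an upgrade, that would be consistent with the override having been reverted.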

Changes

Rook-Ceph Upgrade hostNetwork Troubleshooting (EN + ZH)

| Layer / File(s) | Summary |
| --- | --- |
| Front-matter and issue/root-cause description: docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md, docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md | Document metadata and the issue statement describing CephCluster stuck in Progressing when pre-upgrade hostNetwork customizations are overwritten by OLM CSV/configmap reconciliation during upgrade from Reef to Squid. |
| Diagnostic steps: docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md, docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md | Commands to verify CephCluster state, check whether operator/CSI/tool pods run with hostNetwork, inspect CSV/configmap settings for reverted values, and test connectivity from a regular pod to storage-network endpoints. |
| Long-term resolution: docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md, docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md | Describes restoring pod-network-to-Ceph-storage-network routing, required MON/OSD port access, optional NetworkPolicy/SNAT considerations, and follow-up commands to verify reconciliation and upgrade completion. |
| Temporary recovery procedure and prevention checklist: docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md, docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md | Emergency recovery by editing the rook-ceph CSV to re-enable hostNetwork: true and restoring rook-ceph-operator-config parameters, plus a pre-upgrade prevention checklist and Jira reference (ACP-53205). |
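
The connectivity test from a regular pod to the storage network could be approximated with a plain TCP probe. This is a sketch under stated assumptions rather than the KB's procedure: it uses Ceph's default MON ports (3300 for msgr2, 6789 for msgr1), and the helper names are hypothetical. OSD ports are left out because they are allocated from a range.

```python
import socket

# Ceph MON default ports: 3300 (msgr2) and 6789 (msgr1).
MON_PORTS = (3300, 6789)


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def probe(hosts, ports=MON_PORTS, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}
```

Run from a pod that does not use hostNetwork, `probe([...])` with the cluster's MON addresses would show which endpoints the pod network can reach. A result of all `False` values points at routing or policy between the two networks, not at Ceph itself.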

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • fanzy618

Poem

🐰 A rabbit hops through storage lanes,
Where pod networks meet Ceph's domains,
hostNetwork lost, then found again,
The docs now guide through upgrade pains —
No more stuck clusters in the rain! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding Rook-Ceph upgrade network troubleshooting documentation. It is concise, clear, and directly summarizes the primary purpose of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md (1)

104-108: 💤 Low value

Clarify phrasing at line 104 for better readability.

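The phrase "在与...相关的 Deployment 模板中补回" (roughly, "add it back in the Deployment templates related to ...") is grammatically acceptable but slightly awkward. Per LanguageTool feedback, consider rephrasing for clarity:

Current: "编辑 rook-ceph 命名空间中的 Rook-Ceph CSV,在与 rook-ceph-operator、rook-ceph-tools 或现场确认需要访问存储网络的 CSI provisioner 相关的 Deployment 模板中补回" (roughly, "Edit the Rook-Ceph CSV in the rook-ceph namespace and, in the Deployment templates related to rook-ceph-operator, rook-ceph-tools, or any CSI provisioner confirmed on site to need storage-network access, add back ...")

Suggested: "编辑 rook-ceph 命名空间中的 Rook-Ceph CSV,在 rook-ceph-operator、rook-ceph-tools 或现场确认需要访问存储网络的 CSI provisioner 相关的 Deployment 模板中补回"

(Remove the particle "与" after "在" to streamline the clause; the English meaning is unchanged.)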

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md`
around lines 104 - 108, Locate the phrase containing "在与...相关的 Deployment 模板中补回"
in the documentation section about editing the Rook-Ceph CSV. Remove the
character "与" that appears after "在" to streamline the clause and improve
readability. The corrected phrase should read "在
`rook-ceph-operator`、`rook-ceph-tools` 或现场确认需要访问存储网络的 CSI provisioner 相关的
Deployment 模板中补回" without the connecting particle.

Source: Linters/SAST tools

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2cb122cd-1c7f-48e7-a5d2-59f136668a92

📥 Commits

Reviewing files that changed from the base of the PR and between 746f212 and 303981f.

📒 Files selected for processing (2)
  • docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md
  • docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md

products:
- Alauda Container Platform
ProductsVersion:
- 4.3.1

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate KB ID: KB260500167 conflicts with existing knowledge-base article.

The front-matter ID must be unique across all troubleshooting documents. The context reference snippet shows that docs/en/solutions/Capturing_data_for_an_intermittent_cluster_issue_with_paired_DaemonSets.md already uses this ID. Assign a new, distinct KB ID to this article.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md`
at line 7, The front-matter KB ID is set to the duplicate value `KB260500167`,
which is already used in another troubleshooting document. Replace this
duplicate ID with a new, distinct KB ID that is not currently used in any other
documentation files. Update the KB ID value in the front-matter of this document
to ensure uniqueness across all troubleshooting articles.
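
A uniqueness check like the one this comment asks for could be scripted. The sketch below is an assumption-laden illustration, not the repository's tooling: it assumes the ID sits on a top-level `id:` line inside the leading `---` block, and `front_matter_id` and `duplicate_ids` are hypothetical names.

```python
import re
from collections import defaultdict

ID_LINE = re.compile(r"^id:\s*(\S+)\s*$")


def front_matter_id(text: str):
    """Return the id value from a leading '---' front-matter block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        match = ID_LINE.match(line)
        if match:
            return match.group(1).strip("'\"")  # tolerate quoted values
    return None


def duplicate_ids(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each id used by more than one path to the sorted list of paths."""
    seen = defaultdict(list)
    for path, text in docs.items():
        kb_id = front_matter_id(text)
        if kb_id is not None:
            seen[kb_id].append(path)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}
```

`docs` maps a file path to its contents, for example built by reading every Markdown file under docs/en/solutions. Whether the EN and ZH versions of one article are meant to share an ID is not stated here, so it is safer to run the check on one language tree at a time.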

Comment on lines +1 to +8
---
kind:
- Troubleshooting
products:
- Alauda Container Platform
ProductsVersion:
- '4.3.1'
---

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CRITICAL: Front-matter is missing required id field.

The front-matter template is incomplete. Per the context reference snippets (both EN and ZH Capturing_data_for_an_intermittent_cluster_issue_with_paired_DaemonSets.md), the id field is required after ProductsVersion and before the closing ---. Add a unique KB ID to both the EN and ZH files.

Example format:

---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - '4.3.1'
id: <UNIQUE_KB_ID>
---

Replace <UNIQUE_KB_ID> with a new, distinct identifier (do not reuse existing KB IDs from other troubleshooting documents).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md`
around lines 1 - 8, The YAML front-matter in this troubleshooting document is
missing the required id field. Add a new line after the ProductsVersion field
and before the closing --- delimiter with the key id and assign it a unique KB
identifier that does not conflict with any existing troubleshooting document
IDs. The complete front-matter should have kind, products, ProductsVersion, and
id fields in that order.
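
The required-field rule described in this comment (kind, products, ProductsVersion, id) can be checked without a YAML parser. The following is a minimal sketch with hypothetical names; it only looks at top-level keys and does not verify their order or values.

```python
REQUIRED_KEYS = ["kind", "products", "ProductsVersion", "id"]


def top_level_keys(text: str) -> list[str]:
    """Return the top-level keys of the leading front-matter block, in order."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        # A top-level key starts in column 0 and is not a list item.
        if line and not line[0].isspace() and not line.startswith("-") and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def missing_keys(text: str) -> list[str]:
    """Return the required front-matter keys that the document lacks."""
    keys = top_level_keys(text)
    return [k for k in REQUIRED_KEYS if k not in keys]
```

Applied to the front matter quoted in this comment, `missing_keys` would report `id`; applied to the example format with an `id:` line, it would report nothing.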

jing2uo merged commit bd597a2 into main on Jun 18, 2026
2 checks passed
jing2uo deleted the docs/acp-53205-rook-ceph-network-kb branch on June 18, 2026, 05:21