Bug 2310385
| Summary: | Upon CephFS volume recovery network fencing fails on external mode cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Joy John Pinto <jopinto> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED ERRATA | QA Contact: | Joy John Pinto <jopinto> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.17 | CC: | mrajanna, nberry, odf-bz-bot, sapillai, srai, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.17.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.17.0-107 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2024-10-30 14:33:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Please update the RDT flag/text appropriately.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676
Description of problem (please be detailed as possible and provide log snippets):

Upon CephFS volume recovery, network fencing fails on an external mode cluster.

Version of all relevant components (if applicable):

OCP 4.17 and ODF 4.17.0-92

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

NA

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

1

Can this issue be reproduced?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:

NA

Steps to Reproduce:
1. Install OpenShift Data Foundation and deploy an app pod in an external mode cluster.
2. Shut down the node on which the CephFS RWO pod is deployed.
3. Once the node is down, add the taint:

   ```
   oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
   ```

4. Wait for some time (if the application pod and the rook operator are on the same node, wait a bit longer), then check the NetworkFence CR status.

Actual results:

Network fence creation fails on the external mode cluster with the error:

```
2024-09-06 10:14:57.907050 D | op-k8sutil: creating endpoint "rook-ceph-mgr-external". [{[{10.1.160.145 <nil> nil}] [] [{http-external-metrics 9283 TCP <nil>}]}]
2024-09-06 10:14:57.982565 D | exec: Running command: ceph tell mds.fsvol001.osd-0.icwduo client ls --format json --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring --format json
2024-09-06 10:14:58.140259 E | ceph-cluster-controller: failed to handle node failure.
failed to create network fence for node "compute-0".: failed to fence cephfs subvolumes: failed to list watchers for cephfs subvolumeName csi-vol-4bab4cfd-bc94-4886-a2c4-beeecab6dfb2. exit status 13
```

Expected results:

Network fence creation should succeed upon tainting the node.

Additional info:
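For context, when the out-of-service taint is handled successfully, rook creates a csi-addons NetworkFence CR to blocklist the failed node's addresses. The following is a minimal sketch of such a CR; the object name, CIDR, secret name, and clusterID shown here are illustrative assumptions, not values taken from this cluster:

```yaml
# Illustrative sketch of a csi-addons NetworkFence CR as created by rook
# when fencing a failed node. All concrete values below are assumptions.
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: compute-0                                  # typically named after the fenced node
spec:
  driver: openshift-storage.cephfs.csi.ceph.com    # CephFS CSI driver for the cluster
  fenceState: Fenced
  cidrs:
    - 10.1.160.145/32                              # address of the failed node (assumed)
  secret:
    name: rook-csi-cephfs-provisioner              # assumed credentials secret
    namespace: openshift-storage
  parameters:
    clusterID: openshift-storage
```

Inspecting the CR status (for example, `oc get networkfence -o yaml`) shows whether the fence operation succeeded; in this bug the CR never reaches a successful state because listing CephFS subvolume watchers fails with `exit status 13`, as shown in the log above.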