Bug 2310385

Summary: Upon CephFS volume recovery network fencing fails on external mode cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Joy John Pinto <jopinto>
Component: rook Assignee: Subham Rai <srai>
Status: CLOSED ERRATA QA Contact: Joy John Pinto <jopinto>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.17 CC: mrajanna, nberry, odf-bz-bot, sapillai, srai, tnielsen
Target Milestone: ---   
Target Release: ODF 4.17.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.17.0-107 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-10-30 14:33:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joy John Pinto 2024-09-06 10:30:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Upon CephFS volume recovery, network fencing fails on an external mode cluster.


Version of all relevant components (if applicable):
OCP 4.17 and ODF 4.17.0-92


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OpenShift Data Foundation and deploy an app pod on an external mode cluster
2. Shut down the node on which the CephFS RWO pod is deployed
3. Once the node is down, add the taint
```oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute```
Wait for some time (if the application pod and the Rook operator are on the same node, wait a bit longer), then check the NetworkFence CR status
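
The taint and the follow-up check can be sketched as small shell helpers. This is only an illustration: the function names are hypothetical, and it assumes that `oc` is logged in to the cluster, that the NetworkFence CRD is served as `networkfences.csiaddons.openshift.io`, and that the Rook operator runs as the `rook-ceph-operator` deployment in the `openshift-storage` namespace.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; resource, namespace, and
# deployment names are assumptions about a typical ODF install.

# Step 3: mark a shut-down node as out of service (command from the steps above).
taint_out_of_service() {
  local node="$1"
  oc taint nodes "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
}

# List NetworkFence CRs so their state can be inspected.
list_network_fences() {
  oc get networkfences.csiaddons.openshift.io
}

# Show fencing-related messages from the Rook operator log.
operator_fence_messages() {
  oc logs -n openshift-storage deploy/rook-ceph-operator | grep -i 'fence'
}
```

For example, `taint_out_of_service compute-0` followed by `list_network_fences` after a short wait; if no NetworkFence CR appears, `operator_fence_messages` may show why.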



Actual results:
Network fence creation fails on the external mode cluster with the following error:

2024-09-06 10:14:57.907050 D | op-k8sutil: creating endpoint "rook-ceph-mgr-external". [{[{10.1.160.145  <nil> nil}] [] [{http-external-metrics 9283 TCP <nil>}]}]
2024-09-06 10:14:57.982565 D | exec: Running command: ceph tell mds.fsvol001.osd-0.icwduo client ls --format json --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring --format json
2024-09-06 10:14:58.140259 E | ceph-cluster-controller: failed to handle node failure. failed to create network fence for node "compute-0".: failed to fence cephfs subvolumes: failed to list watchers for cephfs subvolumeName csi-vol-4bab4cfd-bc94-4886-a2c4-beeecab6dfb2. exit status 13
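
When scanning operator logs for this failure, the node, subvolume, and exit status can be pulled out of the error line with a small filter. The helper name `parse_fence_error` is hypothetical, and the pattern is written only against the message format shown above, so other Rook versions may need a different expression. As a side note, 13 is the Linux errno value for EACCES (permission denied); whether the `ceph` CLI is passing an errno through here is an assumption, not something this report confirms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads operator log lines on stdin and prints
# node, subvolume, and exit status for "failed to create network fence" errors.
# Lines that do not match the expected format are silently skipped.
parse_fence_error() {
  sed -nE 's/.*failed to create network fence for node "([^"]+)".*subvolumeName ([^ .]+)\. exit status ([0-9]+).*/node=\1 subvolume=\2 exit=\3/p'
}
```

Piping the operator log through `parse_fence_error` yields one `node=... subvolume=... exit=...` line per matching error.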


Expected results:
Network fence creation should succeed once the node is tainted.

Additional info:

Comment 10 Sunil Kumar Acharya 2024-09-27 06:46:45 UTC
Please update the RDT flag/text appropriately.

Comment 12 errata-xmlrpc 2024-10-30 14:33:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676