Description of problem (please be as detailed as possible and provide log snippets):
[External mode]: Failed to run rbd commands from rook ceph operator pod

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-09-15-233408
odf-operator.v4.14.0-135.stable

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install an OpenShift Data Foundation external mode cluster and deploy an app pod.
2. Shut down the node on which the RBD RWO app pod is deployed.
3. Once the node is down, add the out-of-service taint:
```
oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```
4. Check the NetworkFence CR status and make sure it is fenced and ready:
```
oc get networkfences.csiaddons.openshift.io
```

Actual results:
The NetworkFence CR is not created.

Expected results:
The NetworkFence CR should be created.

Additional info:
sh-5.1$ rbd status rbd/csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring
2023-09-20T09:19:30.497+0000 7fa83af02640 -1 librbd::image::OpenRequest: failed to stat v2 image header: (1) Operation not permitted
2023-09-20T09:19:30.497+0000 7fa83b703640 -1 librbd::ImageState: 0x562bb5dc1010 failed to open image: (1) Operation not permitted
rbd: error opening image csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b: (1) Operation not permitted

sh-5.1$ ceph status --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring
  cluster:
    id:     bcff658e-5a7e-11ed-9895-0050568fc7cd
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum rhcs-1-node-1,rhcs-1-node-2,rhcs-1-node-3 (age 13d)
    mgr: rhcs-1-node-2.lchaow(active, since 13d), standbys: rhcs-1-node-3.vxyzcy, rhcs-1-node-1.khuych
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 13d), 6 in (since 13d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   16 pools, 481 pgs
    objects: 237.35k objects, 645 GiB
    usage:   1.3 TiB used, 4.7 TiB / 6.0 TiB avail
    pgs:     481 active+clean

  io:
    client:   65 KiB/s wr, 0 op/s rd, 1 op/s wr
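
The check in step 4 of the reproduction steps could be scripted roughly as follows. This is only a sketch: `fence_is_ready` is a hypothetical helper, not part of ODF or csi-addons. It assumes the default `oc get` column order ends with FENCESTATE, AGE, RESULT, and that the NetworkFence CR is named after the tainted node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an ODF tool): reads the table printed by
# `oc get networkfences.csiaddons.openshift.io` on stdin and succeeds only
# if the named NetworkFence reports FENCESTATE=Fenced and RESULT=Succeeded.
# Assumption: the last three columns are FENCESTATE, AGE, RESULT.
fence_is_ready() {
  local name="$1"
  awk -v name="$name" '
    # Skip the header row; match the CR by its NAME column.
    NR > 1 && $1 == name && $(NF-2) == "Fenced" && $NF == "Succeeded" { found = 1 }
    END { exit !found }
  '
}

# Example usage (NODE is assumed to hold the tainted node name):
#   oc get networkfences.csiaddons.openshift.io | fence_is_ready "$NODE" \
#     && echo "fenced" || echo "NetworkFence missing or not ready"
```

With the failure described in this report, `oc get` prints no table rows, so the helper returns a non-zero status.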
With ODF 4.14.0-156, client.healthchecker does not have sufficient permissions (https://bugzilla.redhat.com/show_bug.cgi?id=2246484), and network fences are not created after the node is tainted as out-of-service (oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute). Hence marking the bug as FailedQA.

(venv) [jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found
(venv) [jopinto@jopinto new]$
Verified in OCP 4.14.0-0.nightly-2023-10-31-145859 and odf-operator.v4.14.0-158.

1. Created an external mode cluster and created an app pod on compute-1.

2. Powered off compute-1 and tainted the node:
[jopinto@jopinto new]$ oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
node/compute-1 tainted

3. A NetworkFence CR and a CIDR blocklist entry were created, and the pod was running on a new node:
[jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
NAME        DRIVER                               CIDRS                 FENCESTATE   AGE   RESULT
compute-1   openshift-storage.rbd.csi.ceph.com   ["10.1.160.199/32"]   Fenced       53s   Succeeded

sh-5.1# ceph osd blocklist ls
....
10.0.211.1:6801/577936614 2023-11-03T07:05:05.293134+0000
cidr:10.1.160.199:0/32 2028-11-02T11:51:57.039577+0000
listed 18 entries

4. Untainted the node:
oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-

5. The NetworkFence CR and the CIDR blocklist entry were removed:
[jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found

sh-5.1# ceph osd blocklist ls
...
listed 17 entries

It is working as expected on an external mode cluster, hence closing the bug.
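
The blocklist check in steps 3 and 5 could be automated along these lines. This is a minimal sketch: `blocklist_has_cidr` is a hypothetical helper, and the `cidr:<ip>:0/<prefix>` entry format is inferred only from the IPv4 output shown in this comment, so other address families or Ceph versions may print it differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Ceph tool): reads `ceph osd blocklist ls`
# output on stdin and succeeds if the given CIDR has a blocklist entry.
# Assumption: entries for fenced CIDRs look like "cidr:<ip>:0/<prefix>",
# as in the output above (IPv4 only).
blocklist_has_cidr() {
  local cidr="$1"              # e.g. 10.1.160.199/32
  local ip="${cidr%/*}"        # part before the slash
  local prefix="${cidr#*/}"    # part after the slash
  awk -v want="cidr:${ip}:0/${prefix}" '
    $1 == want { found = 1 }   # exact match on the first field
    END { exit !found }
  '
}

# Example usage, run where the ceph CLI is available:
#   ceph osd blocklist ls | blocklist_has_cidr 10.1.160.199/32 \
#     && echo "CIDR is blocklisted" || echo "CIDR is not blocklisted"
```

Matching the whole first field, rather than a substring, avoids false positives from addresses that share a prefix with the fenced one.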
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832