Bug 2239802

Summary: [External mode]: Failed to run rbd commands from rook ceph operator pod
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Joy John Pinto <jopinto>
Component: rook Assignee: Subham Rai <srai>
Status: CLOSED ERRATA QA Contact: Joy John Pinto <jopinto>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.14 CC: ebenahar, kramdoss, mrajanna, muagarwa, odf-bz-bot, tnielsen
Target Milestone: ---   
Target Release: ODF 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.14.0-157 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 18:54:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joy John Pinto 2023-09-20 09:41:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[External mode]: Failed to run rbd commands from rook ceph operator pod

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-09-15-233408
odf-operator.v4.14.0-135.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install an OpenShift Data Foundation external mode cluster and deploy an app pod
2. Shut down the node on which the RBD RWO app pod is deployed
3. Once the node is down, add the out-of-service taint:
   oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
4. Check the NetworkFence CR status and make sure it is fenced and ready:
   oc get networkfences.csiaddons.openshift.io
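
Steps 3 and 4 can be scripted for repeated runs. The following is a minimal sketch, not part of the original report: the function names, the timeout, and the polling interval are hypothetical, and it assumes `oc` is logged in to the cluster and that the node is already powered off.

```shell
# Hypothetical helper names; only the oc commands themselves come from this bug.

# Step 3: apply the out-of-service taint to a node that is already shut down.
taint_out_of_service() {
  local node="$1"
  oc adm taint nodes "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
}

# Count NetworkFence CRs. With --no-headers an empty list prints nothing on
# stdout ("No resources found" goes to stderr), so the count is 0.
count_network_fences() {
  oc get networkfences.csiaddons.openshift.io --no-headers 2>/dev/null | grep -c . || true
}

# Step 4: poll until at least one NetworkFence CR exists or the timeout expires.
# Arguments: timeout in seconds (assumed default 120), interval (assumed default 5).
wait_for_network_fence() {
  local timeout="${1:-120}" interval="${2:-5}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if [ "$(count_network_fences)" -gt 0 ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

Usage would be along the lines of `taint_out_of_service <node-name> && wait_for_network_fence 300 10`. With the behaviour reported here, the wait is expected to time out because no NetworkFence CR is created.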


Actual results:
NetworkFence CR is not created.

Expected results:
NetworkFence CR should be created.

Additional info:

sh-5.1$ rbd status rbd/csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b  --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring
2023-09-20T09:19:30.497+0000 7fa83af02640 -1 librbd::image::OpenRequest: failed to stat v2 image header: (1) Operation not permitted

2023-09-20T09:19:30.497+0000 7fa83b703640 -1 librbd::ImageState: 0x562bb5dc1010 failed to open image: (1) Operation not permitted

rbd: error opening image csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b: (1) Operation not permitted


ceph status --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring

  cluster:
    id:     bcff658e-5a7e-11ed-9895-0050568fc7cd
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum rhcs-1-node-1,rhcs-1-node-2,rhcs-1-node-3 (age 13d)
    mgr: rhcs-1-node-2.lchaow(active, since 13d), standbys: rhcs-1-node-3.vxyzcy, rhcs-1-node-1.khuych
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 13d), 6 in (since 13d)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   16 pools, 481 pgs
    objects: 237.35k objects, 645 GiB
    usage:   1.3 TiB used, 4.7 TiB / 6.0 TiB avail
    pgs:     481 active+clean
 
  io:
    client:   65 KiB/s wr, 0 op/s rd, 1 op/s wr
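
Note on the two outputs above: `ceph status` succeeds with the client.healthchecker key while `rbd status` fails with "(1) Operation not permitted". That pattern suggests the key authenticates but lacks permission on the RBD image, rather than a connectivity problem. A minimal sketch for telling these cases apart from captured stderr follows; the function name and verdict strings are hypothetical, and only the error text is taken from the log above.

```shell
# classify_ceph_error: read the stderr of an rbd/ceph command on stdin and
# print a coarse verdict. Verdict names are illustrative, not Ceph terminology.
classify_ceph_error() {
  local text
  text=$(cat)
  case "$text" in
    *"(1) Operation not permitted"*) echo "permission-denied" ;;
    "") echo "no-error-output" ;;
    *) echo "other-error" ;;
  esac
}
```

From the operator pod this could be fed with stderr only, for example `rbd status <pool>/<image> <same flags as above> 2>&1 >/dev/null | classify_ceph_error`.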

Comment 8 Joy John Pinto 2023-10-27 07:27:46 UTC
With ODF 4.14.0-156, client.healthchecker does not have sufficient permissions (https://bugzilla.redhat.com/show_bug.cgi?id=2246484), and network fences are not created after applying the out-of-service taint to the node (oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute). Hence marking the bug as FailedQA.

(venv) [jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found
(venv) [jopinto@jopinto new]$

Comment 9 Joy John Pinto 2023-11-03 07:43:23 UTC
Verified in OCP 4.14.0-0.nightly-2023-10-31-145859 and odf-operator.v4.14.0-158


1. Created an external mode cluster and an app pod on compute-1.

2. Powered off compute-1 and tainted the node:
   [jopinto@jopinto new]$ oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
   node/compute-1 tainted
3. The NetworkFence CR and the CIDR blocklist entry were created, and the pod was running on a new node:
   [jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
NAME        DRIVER                               CIDRS                 FENCESTATE   AGE   RESULT
compute-1   openshift-storage.rbd.csi.ceph.com   ["10.1.160.199/32"]   Fenced       53s   Succeeded

sh-5.1# ceph osd blocklist ls
....
10.0.211.1:6801/577936614 2023-11-03T07:05:05.293134+0000
cidr:10.1.160.199:0/32 2028-11-02T11:51:57.039577+0000
listed 18 entries

4. Untainted the node using command 'oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-'

5. The NetworkFence CR and the CIDR blocklist entry were removed:
[jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found
sh-5.1# ceph osd blocklist ls
...
listed 17 entries

It's working as expected on an external mode cluster. Hence closing the bug.
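
The checks in steps 3 and 5 can also be scripted. This is a minimal sketch, assuming the column layout of `oc get networkfences.csiaddons.openshift.io` and the `cidr:<ip>:0/<prefix>` blocklist format exactly as they appear in this comment. The function names are hypothetical, and the output of either command may differ on other versions.

```shell
# fence_status NAME: read `oc get networkfences.csiaddons.openshift.io` table
# output on stdin and print "<FENCESTATE> <RESULT>" for the named CR.
# Columns are located by header name, so the order does not need to be fixed.
fence_status() {
  awk -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $1 == name { print $(col["FENCESTATE"]), $(col["RESULT"]) }
  '
}

# Convert "10.1.160.199/32" to the "cidr:10.1.160.199:0/32" form seen in the
# blocklist output above (format inferred from this report only).
cidr_to_blocklist_key() {
  local addr="${1%/*}" bits="${1#*/}"
  printf 'cidr:%s:0/%s\n' "$addr" "$bits"
}

# blocklist_has_cidr CIDR: read `ceph osd blocklist ls` output on stdin and
# succeed if an entry for the given CIDR is present.
blocklist_has_cidr() {
  grep -qF "$(cidr_to_blocklist_key "$1") "
}
```

For example, `oc get networkfences.csiaddons.openshift.io | fence_status compute-1` and `ceph osd blocklist ls | blocklist_has_cidr 10.1.160.199/32` correspond to the step 3 checks; after untainting, the first prints nothing and the second returns non-zero.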

Comment 11 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832