Bug 2239802 - [External mode]: Failed to run rbd commands from rook ceph operator pod
Summary: [External mode]: Failed to run rbd commands from rook ceph operator pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-20 09:41 UTC by Joy John Pinto
Modified: 2023-11-08 18:56 UTC
CC List: 6 users

Fixed In Version: 4.14.0-157
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 18:54:58 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 523 0 None open Bug 2239802: external: update healthchecker caps for rbd command 2023-09-28 13:37:43 UTC
Github red-hat-storage rook pull 532 0 None open Bug 2239802: pool: rbd cmd shouldn't use admin in external mode 2023-10-30 06:00:13 UTC
Github rook rook pull 12941 0 None open external: update healthchecker caps for rbd command 2023-09-22 10:43:51 UTC
Github rook rook pull 13114 0 None open pool: rbd cmd shouldn't use admin in external mode 2023-10-27 09:16:18 UTC
Red Hat Product Errata RHSA-2023:6832 0 None None None 2023-11-08 18:56:32 UTC

Description Joy John Pinto 2023-09-20 09:41:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):
[External mode]: Failed to run rbd commands from rook ceph operator pod

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-09-15-233408
odf-operator.v4.14.0-135.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install an OpenShift Data Foundation external mode cluster and deploy an app pod
2. Shut down the node on which the RBD RWO app pod is running
3. Once the node is down, add the taint:
   `oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`
4. Check the NetworkFence CR status and make sure it reaches the Fenced state and is ready (see the sketch below):
   `oc get networkfences.csiaddons.openshift.io`
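For reference, a condensed shell sketch of steps 2-4; the node name and app namespace are placeholders, and the expected FENCESTATE/RESULT values are taken from the verification output later in this bug:

```
# Taint the powered-off node so csi-addons fences it (placeholder node name).
oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Watch for the NetworkFence CR; FENCESTATE should become "Fenced" and RESULT "Succeeded".
oc get networkfences.csiaddons.openshift.io

# Confirm the RBD RWO app pod is rescheduled to a surviving node (placeholder namespace).
oc get pods -n <app-namespace> -o wide
```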


Actual results:
NetworkFence CR is not created

Expected results:
NetworkFence CR should be created

Additional info:

sh-5.1$ rbd status rbd/csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b  --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring
2023-09-20T09:19:30.497+0000 7fa83af02640 -1 librbd::image::OpenRequest: failed to stat v2 image header: (1) Operation not permitted

2023-09-20T09:19:30.497+0000 7fa83b703640 -1 librbd::ImageState: 0x562bb5dc1010 failed to open image: (1) Operation not permitted

rbd: error opening image csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b: (1) Operation not permitted


sh-5.1$ ceph status --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.healthchecker --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring

  cluster:
    id:     bcff658e-5a7e-11ed-9895-0050568fc7cd
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum rhcs-1-node-1,rhcs-1-node-2,rhcs-1-node-3 (age 13d)
    mgr: rhcs-1-node-2.lchaow(active, since 13d), standbys: rhcs-1-node-3.vxyzcy, rhcs-1-node-1.khuych
    mds: 1/1 daemons up, 1 standby
    osd: 6 osds: 6 up (since 13d), 6 in (since 13d)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   16 pools, 481 pgs
    objects: 237.35k objects, 645 GiB
    usage:   1.3 TiB used, 4.7 TiB / 6.0 TiB avail
    pgs:     481 active+clean
 
  io:
    client:   65 KiB/s wr, 0 op/s rd, 1 op/s wr
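The failing rbd command above was run with the client.healthchecker keyring from the rook-ceph-operator pod. As a hedged sketch, the same check can be reproduced from outside the pod like this (the deployment name, image name, and paths are taken from the logs above and will differ per cluster):

```
# Re-run the operator's rbd health check with the healthchecker identity
# (conf/keyring paths are the ones shown in the failing log above).
oc -n openshift-storage exec deploy/rook-ceph-operator -- \
  rbd status rbd/csi-vol-22e6259d-90ff-460f-9adc-c401995d0e8b \
    --cluster=openshift-storage \
    --conf=/var/lib/rook/openshift-storage/openshift-storage.config \
    --name=client.healthchecker \
    --keyring=/var/lib/rook/openshift-storage/client.healthchecker.keyring
```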

Comment 8 Joy John Pinto 2023-10-27 07:27:46 UTC
With ODF 4.14.0-156, client.healthchecker does not have sufficient permissions (https://bugzilla.redhat.com/show_bug.cgi?id=2246484) and network fences are not created after marking the node as unschedulable via the taint (oc taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute). Hence marking the bug as FailedQA.

(venv) [jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found
(venv) [jopinto@jopinto new]$
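A hedged sketch of checking the external user's caps when hitting this: the linked rook PRs update the client.healthchecker caps so the rbd health check no longer relies on the admin key, and the authoritative cap strings come from the external cluster resource script in the fixed build, not from this sketch:

```
# On the external RHCS cluster (or its toolbox), inspect the caps currently
# granted to the health-check user.
ceph auth get client.healthchecker

# With the fixed build (4.14.0-157 or later), re-running the external cluster
# resource script and re-importing its output refreshes these caps; the exact
# cap strings are version-dependent and intentionally not reproduced here.
```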

Comment 9 Joy John Pinto 2023-11-03 07:43:23 UTC
Verified in OCP 4.14.0-0.nightly-2023-10-31-145859 and odf-operator.v4.14.0-158


1. Created external mode cluster, and created an app pod on compute-1

2. Powered off compute-1 and tainted the node using command '[jopinto@jopinto new]$ oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute'
   >>>node/compute-1 tainted
3. NetworkFence CR and blocklist CIDR entry were created, and the pod was running on a new node
   [jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
NAME        DRIVER                               CIDRS                 FENCESTATE   AGE   RESULT
compute-1   openshift-storage.rbd.csi.ceph.com   ["10.1.160.199/32"]   Fenced       53s   Succeeded

sh-5.1# ceph osd blocklist ls
....
10.0.211.1:6801/577936614 2023-11-03T07:05:05.293134+0000
cidr:10.1.160.199:0/32 2028-11-02T11:51:57.039577+0000
listed 18 entries

4. Untainted the node using command 'oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-'

5. NetworkFence CR and blocklist CIDR entry were removed
[jopinto@jopinto new]$ oc get networkfences.csiaddons.openshift.io
No resources found
sh-5.1# ceph osd blocklist ls
...
listed 17 entries
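
A condensed sketch of the post-untaint checks in steps 4-5 (the node name and CIDR are taken from the outputs above; the blocklist check runs on the external RHCS cluster):

```
# Remove the out-of-service taint once the node is back.
oc adm taint nodes compute-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-

# The NetworkFence CR should be removed.
oc get networkfences.csiaddons.openshift.io   # expect: No resources found

# On the external RHCS cluster, the fenced CIDR should no longer be blocklisted.
ceph osd blocklist ls | grep 10.1.160.199 || echo "no blocklist entry for 10.1.160.199"
```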

It's working as expected on the external mode cluster. Hence, closing the bug.

Comment 11 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

