Bug 2308304 - ObjectNotFound('RADOS object not found (error calling conf_read_file)') message is displayed in rook-ceph-operator log upon trying CephFS volume recovery
Summary: ObjectNotFound('RADOS object not found (error calling conf_read_file)') message is displayed in rook-ceph-operator log upon trying CephFS volume recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.17.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-28 07:39 UTC by Joy John Pinto
Modified: 2024-10-30 14:32 UTC (History)
CC: 3 users

Fixed In Version: 4.17.0-98
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-10-30 14:32:31 UTC
Embargoed:




Links:
Github red-hat-storage/rook pull 725 (open): Bug 2308304: csi: stop deleting csi-operator resources (last updated 2024-09-10 12:19:35 UTC)
Red Hat Issue Tracker OCSBZM-8877 (last updated 2024-08-28 07:39:41 UTC)
Red Hat Product Errata RHSA-2024:8676 (last updated 2024-10-30 14:32:34 UTC)

Description Joy John Pinto 2024-08-28 07:39:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
ObjectNotFound('RADOS object not found (error calling conf_read_file)') message is displayed in the rook-ceph-operator log upon trying CephFS volume recovery

Version of all relevant components (if applicable):
OCP 4.17.0-0.nightly-2024-08-19-165854
ODF 4.17.0-84.stable provided by Red Hat

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install ODF 4.17.0-84 on IBM Cloud
2. Create a deployment pod; on my test setup I created a logwriter-ceph pod
3. Add a taint to the node:
```oc adm taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute```
Wait for some time (if the application pod and the rook operator are on the same node, wait a bit longer), then check the NetworkFence CR status and make sure its state is Fenced
4. As the network fence is not created, check the rook-ceph-operator log; you can see the following message (example commands for these checks are shown below the quoted message):

ceph-cluster-controller: failed to handle node failure. failed to create network fence for node "jopinto-clu19-c4pcf-worker-0-jbcww".: failed to fence cephfs subvolumes: failed to get ceph status for check active mds: failed to get status. . Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)'): exit status 1
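
For reference, a rough sketch of how the checks in steps 3 and 4 can be run from the CLI. The node name and the openshift-storage namespace are taken from this reproduction; the NetworkFence resource name (networkfences.csiaddons.openshift.io, provided by CSI Addons) is an assumption about the environment and its status fields may differ by version:

```
# Confirm the out-of-service taint is present on the node (node name is from this reproduction).
oc get node jopinto-clu19-c4pcf-worker-0-jbcww -o jsonpath='{.spec.taints}'

# List NetworkFence CRs and inspect their fencing state.
# Assumes the CSI Addons CRD; exact field names may differ by version.
oc get networkfences.csiaddons.openshift.io
oc get networkfences.csiaddons.openshift.io -o yaml | grep -E 'fenceState|result'

# Check the rook-ceph-operator log for the fencing error quoted above.
oc logs -n openshift-storage deploy/rook-ceph-operator | grep -i 'network fence'
```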

Actual results:
Network fence is not created

Expected results:
NetworkFence should be created

Additional info:
rook-ceph-operator log:
2024-08-28 05:35:52.772621 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down and a possible node drain is detected
2024-08-28 05:35:52.929902 I | ceph-cluster-controller: Found taint: Key=node.kubernetes.io/out-of-service, Value=nodeshutdown on node jopinto-clu19-c4pcf-worker-0-jbcww
2024-08-28 05:35:52.929931 I | ceph-cluster-controller: volumeInUse after split based on '^' [csi.vsphere.vmware.com 64c02db6-3edd-458d-bc79-36461858cb42]
2024-08-28 05:35:52.929938 I | ceph-cluster-controller: volumeInUse after split based on '^' [csi.vsphere.vmware.com b1954e96-d763-4344-993a-9f8022d0520c]
2024-08-28 05:35:52.929943 I | ceph-cluster-controller: volumeInUse after split based on '^' [openshift-storage.cephfs.csi.ceph.com 0001-0011-openshift-storage-0000000000000001-48286e7e-84e5-443f-a064-39a3f0609885]
2024-08-28 05:35:53.545803 I | ceph-cluster-controller: node "jopinto-clu19-c4pcf-worker-0-jbcww" require fencing, found cephfs subvolumes in use
2024-08-28 05:35:54.107446 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:54.199578 I | ceph-block-pool-controller: skipping reconcile since operator is still initializing
2024-08-28 05:35:54.509763 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:54.509856 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2024-08-28 05:35:54.509985 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2024-08-28 05:35:54.543839 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/noobaa-ceph-objectstore-user". failed to initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": skipping reconcile since operator is still initializing
2024-08-28 05:35:54.906487 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:54.906522 I | ceph-cluster-controller: fencing cephfs subvolume "pvc-0ca18ad2-1123-4ff8-8e94-500b565c0dc4" on node "jopinto-clu19-c4pcf-worker-0-jbcww"
2024-08-28 05:35:55.006105 E | ceph-cluster-controller: failed to handle node failure. failed to create network fence for node "jopinto-clu19-c4pcf-worker-0-jbcww".: failed to fence cephfs subvolumes: failed to get ceph status for check active mds: failed to get status. . Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)'): exit status 1
2024-08-28 05:35:55.306810 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:55.508108 E | ceph-csi: failed to delete CSI-operator Ceph Connection "". cephconnections.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "cephconnections" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.509410 E | ceph-csi: failed to delete CSI-operator client profile "". clientprofiles.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "clientprofiles" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.510457 E | ceph-csi: failed to delete CSI-operator Ceph Connection "". cephconnections.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "cephconnections" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.511470 E | ceph-csi: failed to delete CSI-operator client profile "". clientprofiles.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "clientprofiles" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.512458 E | ceph-csi: failed to delete CSI-operator driver config "". drivers.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "drivers" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.513542 E | ceph-csi: failed to delete CSI-operator operator config "". operatorconfigs.csi.ceph.io is forbidden: User "system:serviceaccount:openshift-storage:rook-ceph-system" cannot deletecollection resource "operatorconfigs" in API group "csi.ceph.io" in the namespace "openshift-storage"
2024-08-28 05:35:55.517559 I | ceph-csi: Kubernetes version is 1.30
2024-08-28 05:35:55.920387 I | ceph-csi: skipping csi version check, since unsupported versions are allowed or csi is disabled
2024-08-28 05:35:56.043670 I | ceph-csi: successfully started CSI Ceph RBD driver
2024-08-28 05:35:56.118578 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:56.118667 I | ceph-fs-subvolumegroup-controller: creating ceph filesystem subvolume group ocs-storagecluster-cephfilesystem-csi in namespace openshift-storage
2024-08-28 05:35:56.118674 I | cephclient: creating cephfs "ocs-storagecluster-cephfilesystem" subvolume group "csi"
2024-08-28 05:35:56.165126 I | ceph-csi: successfully started CSI CephFS driver
2024-08-28 05:35:56.186958 I | ceph-csi: CSIDriver object updated for driver "openshift-storage.rbd.csi.ceph.com"
2024-08-28 05:35:56.200826 I | ceph-csi: CSIDriver object updated for driver "openshift-storage.cephfs.csi.ceph.com"
2024-08-28 05:35:56.200850 I | op-k8sutil: removing daemonset csi-nfsplugin if it exists
2024-08-28 05:35:56.204169 I | op-k8sutil: removing deployment csi-nfsplugin-provisioner if it exists
2024-08-28 05:35:56.302405 I | ceph-fs-subvolumegroup-controller: skipping reconcile since operator is still initializing
2024-08-28 05:35:56.310362 I | ceph-csi: successfully removed CSI NFS driver
2024-08-28 05:35:56.864420 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:56.965705 I | ceph-block-pool-controller: skipping reconcile since operator is still initializing
2024-08-28 05:35:57.116948 I | ceph-spec: parsing mon endpoints: a=172.30.195.199:3300,b=172.30.15.18:3300,c=172.30.195.182:3300
2024-08-28 05:35:57.117085 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2024-08-28 05:35:57.117357 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2024-08-28 05:35:57.163901 E | ceph-object-store-user-controller: failed to reconcile CephObjectStoreUser "openshift-storage/ocs-storagecluster-cephobjectstoreuser". failed to initialized rgw admin ops client api: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": skipping reconcile since operator is still initializing
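
The repeated "cannot deletecollection" errors above indicate that the rook-ceph-system service account is missing RBAC permissions for the csi.ceph.io resources it tries to clean up; per the linked rook PR, the fix is to stop deleting these csi-operator resources rather than to grant the permission. A rough sketch of how the denial can be confirmed from the CLI, using only the verb, resources, and service account named in the log lines above (illustrative check, not a workaround):

```
# Check whether the rook-ceph-system service account may deletecollection the csi.ceph.io resources.
# Each command prints "yes" or "no"; per the log above, these were all denied.
for r in cephconnections clientprofiles drivers operatorconfigs; do
  oc auth can-i deletecollection "${r}.csi.ceph.io" \
    --as=system:serviceaccount:openshift-storage:rook-ceph-system \
    -n openshift-storage
done
```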

Comment 6 Sunil Kumar Acharya 2024-09-18 12:06:54 UTC
Please update the RDT flag/text appropriately.

Comment 8 errata-xmlrpc 2024-10-30 14:32:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

