Description of problem (please be as detailed as possible and provide log snippets):

Reported while testing the OCS integration with the Assisted Installer on a 3 master + 3 worker bare-metal cluster. The OCS operator installation failed to complete because the CSI pods were not running; as a result, the noobaa pod could not start because its PVC could not be created.

Version of all relevant components (if applicable):
OCS 4.7.0
OCP 4.7.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Yes. The CSI pods started correctly after restarting the rook operator pod.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
No

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
The CSI pods are not running

Expected results:
All the OCS pods should be running

Additional info:
From the rook operator log:

2021-05-19 10:32:06.905743 I | ceph-csi: Kubernetes version is 1.20
2021-05-19 10:32:07.595903 E | ceph-csi: failed to start Ceph csi drivers. failed to load ROOK_CSI_RESIZER_IMAGE setting: error reading ConfigMap "rook-ceph-operator-config". etcdserver: leader changed
Assigning to Rakshith to take a look at retrying the CSI driver startup in case there is an intermittent error from K8s.
Rakshith, any updates so far? Thanks
I have opened a corresponding upstream issue for this, https://github.com/rook/rook/issues/7950, and updated it with a possible solution. Rook starts the CSI drivers in a goroutine, and retrying inside that goroutine may not be the best solution.
Rakshith, I see the issue has made some progress, are you working on a patch? FYI: dev freeze is Wed June 2nd.
The PR has been merged upstream: https://github.com/rook/rook/pull/8020
We also need to wait for this assisted-installer PR to be merged: https://github.com/openshift/assisted-service/pull/1970 It allows installing internal builds of OCS via the Assisted Installer.
On testing an OCS 4.8 internal build through the assisted-installer:

[root@mccarthy assisted-test-infra]# oc get po -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-7rv8t                                            3/3     Running     0          20m
csi-cephfsplugin-cwmrh                                            3/3     Running     0          20m
csi-cephfsplugin-provisioner-76bf754586-gkvzt                     6/6     Running     13         20m
csi-cephfsplugin-provisioner-76bf754586-jvxpg                     6/6     Running     2          20m
csi-cephfsplugin-xcdz7                                            3/3     Running     0          20m
csi-rbdplugin-2d2nl                                               3/3     Running     0          20m
csi-rbdplugin-provisioner-849964d8cc-fgwfl                        6/6     Running     13         20m
csi-rbdplugin-provisioner-849964d8cc-qdcs7                        6/6     Running     0          20m
csi-rbdplugin-vvfbd                                               3/3     Running     0          20m
csi-rbdplugin-x9hhj                                               3/3     Running     0          20m
noobaa-core-0                                                     1/1     Running     0          7m59s
noobaa-db-pg-0                                                    1/1     Running     0          8m8s
noobaa-endpoint-79f5b5d9f8-4vthj                                  1/1     Running     0          4m40s
noobaa-operator-745cd954d-kllgp                                   1/1     Running     0          24m
ocs-metrics-exporter-76f9dd4bcd-7wb5r                             1/1     Running     0          24m
ocs-operator-6d7b95d7f-l6nr7                                      1/1     Running     6          24m
rook-ceph-crashcollector-18ca7c023561d39cfc8cbaef22c720d1-wdvgk   1/1     Running     0          8m40s
rook-ceph-crashcollector-3014d915265a238318c62d8bde3ae1ad-sfzs6   1/1     Running     0          8m35s
rook-ceph-crashcollector-386203c863a9f85abfecefe891be7a10-njwvf   1/1     Running     0          8m16s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66ff69979z7lb   2/2     Running     0          7m43s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-776f9f59km4q4   2/2     Running     0          7m41s
rook-ceph-mgr-a-78556588f5-lpwc9                                  2/2     Running     0          8m50s
rook-ceph-mon-a-7547cdf796-d2mnf                                  2/2     Running     0          9m44s
rook-ceph-mon-b-765fb5b564-fswld                                  2/2     Running     0          9m28s
rook-ceph-mon-c-b4f966f44-v57x2                                   2/2     Running     0          9m9s
rook-ceph-operator-7489d68f79-rzvtg                               1/1     Running     2          24m
rook-ceph-osd-0-68876f97b9-kz7hs                                  2/2     Running     0          8m23s
rook-ceph-osd-1-7dc59b5f8f-8jlrm                                  2/2     Running     0          8m19s
rook-ceph-osd-2-59467d7855-h6kd9                                  2/2     Running     0          8m16s
rook-ceph-osd-prepare-ocs-deviceset-0-data-02gt9q-q4w7r           0/1     Completed   0          8m40s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0h285d-9jtp8           0/1     Completed   0          8m38s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0rxt67-vlznc           0/1     Completed   0          8m37s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bf4cc8vv5xj   2/2     Running     0          6m21s

[root@mccarthy assisted-test-infra]# oc get StorageCluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   24m   Ready              2021-06-17T08:53:35Z   4.8.0

[root@mccarthy assisted-test-infra]# oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-417.ci   OpenShift Container Storage   4.8.0-417.ci              Succeeded
Based on the comment from Priyanka, marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003