Description of problem (please be as detailed as possible and provide log snippets):

Reported while testing the OCS integration with the Assisted Installer on a 3 master + 3 worker bare-metal cluster. The OCS operator installation failed to complete because the CSI pods were not running; as a result, the noobaa pod could not start because its PVC could not be created.

Version of all relevant components (if applicable):
OCS 4.7.0
OCP 4.7.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Yes. The CSI pods started correctly after restarting the rook operator pod.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
No

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
The CSI pods are not running

Expected results:
All the OCS pods should be running

Additional info:
From the rook operator log:

2021-05-19 10:32:06.905743 I | ceph-csi: Kubernetes version is 1.20
2021-05-19 10:32:07.595903 E | ceph-csi: failed to start Ceph csi drivers. failed to load ROOK_CSI_RESIZER_IMAGE setting: error reading ConfigMap "rook-ceph-operator-config". etcdserver: leader changed
Assigning to Rakshith to take a look at retrying the CSI driver startup in case there is an intermittent error from K8s.
Rakshith, any updates so far? Thanks
I have opened a corresponding upstream issue for this, https://github.com/rook/rook/issues/7950, and updated it with a possible solution. Rook starts the CSI drivers in a goroutine, and retrying inside that goroutine may not be the best solution.
Rakshith, I see the issue has made some progress, are you working on a patch? FYI: dev freeze is Wed June 2nd.
The PR has been merged upstream: https://github.com/rook/rook/pull/8020
We also need to wait for this assisted-installer PR to be merged: https://github.com/openshift/assisted-service/pull/1970 It allows installing internal builds of OCS via the Assisted Installer.
On testing an OCS 4.8 internal build through the assisted-installer:

[root@mccarthy assisted-test-infra]# oc get po -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-7rv8t                                            3/3     Running     0          20m
csi-cephfsplugin-cwmrh                                            3/3     Running     0          20m
csi-cephfsplugin-provisioner-76bf754586-gkvzt                     6/6     Running     13         20m
csi-cephfsplugin-provisioner-76bf754586-jvxpg                     6/6     Running     2          20m
csi-cephfsplugin-xcdz7                                            3/3     Running     0          20m
csi-rbdplugin-2d2nl                                               3/3     Running     0          20m
csi-rbdplugin-provisioner-849964d8cc-fgwfl                        6/6     Running     13         20m
csi-rbdplugin-provisioner-849964d8cc-qdcs7                        6/6     Running     0          20m
csi-rbdplugin-vvfbd                                               3/3     Running     0          20m
csi-rbdplugin-x9hhj                                               3/3     Running     0          20m
noobaa-core-0                                                     1/1     Running     0          7m59s
noobaa-db-pg-0                                                    1/1     Running     0          8m8s
noobaa-endpoint-79f5b5d9f8-4vthj                                  1/1     Running     0          4m40s
noobaa-operator-745cd954d-kllgp                                   1/1     Running     0          24m
ocs-metrics-exporter-76f9dd4bcd-7wb5r                             1/1     Running     0          24m
ocs-operator-6d7b95d7f-l6nr7                                      1/1     Running     6          24m
rook-ceph-crashcollector-18ca7c023561d39cfc8cbaef22c720d1-wdvgk   1/1     Running     0          8m40s
rook-ceph-crashcollector-3014d915265a238318c62d8bde3ae1ad-sfzs6   1/1     Running     0          8m35s
rook-ceph-crashcollector-386203c863a9f85abfecefe891be7a10-njwvf   1/1     Running     0          8m16s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66ff69979z7lb   2/2     Running     0          7m43s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-776f9f59km4q4   2/2     Running     0          7m41s
rook-ceph-mgr-a-78556588f5-lpwc9                                  2/2     Running     0          8m50s
rook-ceph-mon-a-7547cdf796-d2mnf                                  2/2     Running     0          9m44s
rook-ceph-mon-b-765fb5b564-fswld                                  2/2     Running     0          9m28s
rook-ceph-mon-c-b4f966f44-v57x2                                   2/2     Running     0          9m9s
rook-ceph-operator-7489d68f79-rzvtg                               1/1     Running     2          24m
rook-ceph-osd-0-68876f97b9-kz7hs                                  2/2     Running     0          8m23s
rook-ceph-osd-1-7dc59b5f8f-8jlrm                                  2/2     Running     0          8m19s
rook-ceph-osd-2-59467d7855-h6kd9                                  2/2     Running     0          8m16s
rook-ceph-osd-prepare-ocs-deviceset-0-data-02gt9q-q4w7r           0/1     Completed   0          8m40s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0h285d-9jtp8           0/1     Completed   0          8m38s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0rxt67-vlznc           0/1     Completed   0          8m37s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6bf4cc8vv5xj   2/2     Running     0          6m21s

[root@mccarthy assisted-test-infra]# oc get StorageCluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   24m   Ready              2021-06-17T08:53:35Z   4.8.0

[root@mccarthy assisted-test-infra]# oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-417.ci   OpenShift Container Storage   4.8.0-417.ci              Succeeded
Based on the comment from Priyanka, marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003