Description of problem (please be as detailed as possible and provide log snippets):
When deploying a cluster with the configuration AWS 1 AZ, OCP 4.7, OCS 4.6, the ocs-operator is stuck with status "Installing" and the noobaa pod "noobaa-db-0" is in a "Pending" state.

Version of all relevant components (if applicable):

OCP version:
Client Version: 4.6.0-0.nightly-2020-12-08-021151
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

OCS version:
ocs-operator.v4.6.0   OpenShift Container Storage   4.6.0   Installing

cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-21-131655   True        False         107m    Cluster version is 4.7.0-0.nightly-2020-12-21-131655

Rook version:
rook: 4.6-80.1ae5ac6a.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, we can't install the OCS operator successfully.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
It may be a regression, because I didn't notice this issue on OCP 4.6.

Steps to Reproduce:
1. Install an OCP 4.7 cluster on AWS with 1 AZ (this issue may also be reproducible on VMware).
2. Install OCS 4.6 from OperatorHub.
3. Create the storage cluster.

Actual results:
The noobaa-db-0 pod is in status "Pending", and the ocs-operator is stuck in the "Installing" phase.

Expected results:
All the pods should be in status "Running", and the ocs-operator should be in the "Succeeded" phase.
Additional info:
Jenkins job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15865/
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logs-20201224-200538/
In the must gather it looks like noobaa-db-0 is pending, waiting for a PV.

From the noobaa-db-0 pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-12-24T14:32:08Z"
    message: '0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.'
    reason: Unschedulable

From describe pvc:

Name:          db-noobaa-db-0
Namespace:     openshift-storage
StorageClass:  ocs-storagecluster-ceph-rbd
Status:        Pending
Volume:
Labels:        app=noobaa
               noobaa-db=noobaa
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    noobaa-db-0
Events:
  Type     Reason                Age                      From                                                                                                               Message
  ----     ------                ----                     ----                                                                                                               -------
  Warning  ProvisioningFailed    177m (x10 over 3h30m)    openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-b46dd5c7-bmqqp_520fb3a8-db03-49a6-9f34-8636f580900e   failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 already exists
  Normal   ExternalProvisioning  3m38s (x863 over 3h33m)  persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
  Normal   Provisioning          2m51s (x51 over 3h33m)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-b46dd5c7-bmqqp_520fb3a8-db03-49a6-9f34-8636f580900e   External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-0"
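For reference, the same information can be re-checked on a live cluster with standard oc commands; the namespace and resource names below are the ones from this report's environment:

$ oc -n openshift-storage get pod noobaa-db-0 -o yaml          # pod conditions / scheduling message
$ oc -n openshift-storage describe pvc db-noobaa-db-0          # PVC status and provisioning events
$ oc -n openshift-storage get events --sort-by=.lastTimestamp | grep db-noobaa-db-0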
Niels, PTAL
Looks like this is happening because cephblockpool is not yet created:

csi.storage.k8s.io/node-stage-secret-name:rook-csi-rbd-node csi.storage.k8s.io/node-stage-secret-namespace:openshift-storage csi.storage.k8s.io/provisioner-secret-name:rook-csi-rbd-provisioner csi.storage.k8s.io/provisioner-secret-namespace:openshift-storage imageFeatures:layering imageFormat:2 pool:ocs-storagecluster-cephblockpool] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
2020-12-24T14:32:08.690567060Z I1224 14:32:08.690541 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
2020-12-24T14:32:08.690849103Z I1224 14:32:08.690556 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2","parameters":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2020-12-24T14:32:08.756186544Z I1224 14:32:08.756093 1 connection.go:185] GRPC response: {}
2020-12-24T14:32:08.756186544Z I1224 14:32:08.756172 1 connection.go:186] GRPC error: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2020-12-24T14:32:08.756210761Z I1224 14:32:08.756197 1 controller.go:642] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Finished: rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2020-12-24T14:32:08.756245976Z I1224 14:32:08.756231 1 controller.go:1084] Final error received, removing PVC 135dd866-4826-4aff-b4f2-d90bd4d327f2 from claims in progress
2020-12-24T14:32:08.756253895Z W1224 14:32:08.756243 1 controller.go:943] Retrying syncing claim "135dd866-4826-4aff-b4f2-d90bd4d327f2", failure 0
2020-12-24T14:32:08.756274547Z E1224 14:32:08.756261 1 controller.go:966] error syncing claim "135dd866-4826-4aff-b4f2-d90bd4d327f2": failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2020-12-24T14:32:08.756302696Z I1224 14:32:08.756291 1 controller.go:1317] provision "openshift-storage/db-noobaa-db-0" class "ocs-storagecluster-ceph-rbd": started
2020-12-24T14:32:08.756363375Z I1224 14:32:08.756330 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-storage", Name:"db-noobaa-db-0", UID:"135dd866-4826-4aff-b4f2-d90bd4d327f2", APIVersion:"v1", ResourceVersion:"49156", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster

Moving it to rook
Rook took about 4 minutes to initialize the cluster, then created the pool [1]:

2020-12-24T14:32:16.585089490Z 2020-12-24 14:32:16.585050 I | ceph-block-pool-controller: creating pool "ocs-storagecluster-cephblockpool" in namespace "openshift-storage"

This timestamp is about 10s after the CSI driver logged above that it could not find the pool. After that, the CSI provisioner seems to be stuck in a loop indicating that the volume already exists [2]:

2020-12-24T14:35:41.314353576Z I1224 14:35:41.314337 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-storage", Name:"db-noobaa-db-0", UID:"135dd866-4826-4aff-b4f2-d90bd4d327f2", APIVersion:"v1", ResourceVersion:"49156", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 already exists

CSI team, can you take another look at this?

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logs-20201224-200538/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-fbcaa53e653a3bb4c281dbb3e98706cfe9ee5f8effb1d177cbf6a694555c4fe5/namespaces/openshift-storage/pods/rook-ceph-operator-868c759664-w4k57/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logs-20201224-200538/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-fbcaa53e653a3bb4c281dbb3e98706cfe9ee5f8effb1d177cbf6a694555c4fe5/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-b46dd5c7-bmqqp/csi-provisioner/csi-provisioner/logs/current.log
Thanks Travis for looking.

>> Yeah, initially it was failing because of the unavailability of the pool:

2020-12-24T14:32:15.288395183Z I1224 14:32:15.288384 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-storage", Name:"db-noobaa-db-0", UID:"135dd866-4826-4aff-b4f2-d90bd4d327f2", APIVersion:"v1", ResourceVersion:"49156", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Internal desc = pool not found: pool (ocs-storagecluster-cephblockpool) not found in Ceph cluster
2020-12-24T14:32:23.288535724Z I1224 14:32:23.288497 1 controller.go:1317] provision "openshift-storage/db-noobaa-db-0" class "ocs-storagecluster-ceph-rbd": started
2020-12-24T14:32:23.288654535Z I1224 14:32:23.288621 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-storage", Name:"db-noobaa-db-0", UID:"135dd866-4826-4aff-b4f2-d90bd4d327f2", APIVersion:"v1", ResourceVersion:"49156", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-0"

>> Once the pool got created, the same request failed with DeadlineExceeded error:

2020-12-24T14:32:23.297253984Z I1224 14:32:23.297236 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
2020-12-24T14:32:23.297335778Z I1224 14:32:23.297248 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2","parameters":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2020-12-24T14:34:53.297438383Z I1224 14:34:53.297381 1 connection.go:185] GRPC response: {}
2020-12-24T14:34:53.297438383Z I1224 14:34:53.297421 1 connection.go:186] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020-12-24T14:34:53.297484023Z I1224 14:34:53.297445 1 controller.go:642] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded

>> Ultimately, it went into an endless loop saying another operation is in progress for the same volume:

2020-12-24T14:35:09.303817190Z I1224 14:35:09.303799 1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
2020-12-24T14:35:09.303906196Z I1224 14:35:09.303812 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2","parameters":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2020-12-24T14:35:09.305811989Z I1224 14:35:09.305783 1 connection.go:185] GRPC response: {}
2020-12-24T14:35:09.305822036Z I1224 14:35:09.305807 1 connection.go:186] GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 already exists
Looks like this was hit only once, reducing the severity and removing the blocker flag. We might need to reproduce it to see what process was actually running on the volume.
Debugging this might require getting the goroutines from the provisioner to see if something is stuck. That requires https://github.com/ceph/ceph-csi/issues/1699 (not yet available).
I seem to have hit the same issue deploying an internal OCS 4.6.1 cluster running in OCP 4.6.12 on VMware (VMware ESXi, 6.7.0, 15160138)
From the logs mentioned in comment #2, we can confirm that the first CreateVolume request after the pool was created did not return, and is indeed still in progress: "ID: 26" in http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/logs-20201224-200538/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-fbcaa53e653a3bb4c281dbb3e98706cfe9ee5f8effb1d177cbf6a694555c4fe5/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-b46dd5c7-bmqqp/csi-rbdplugin/csi-rbdplugin/logs/current.log

2020-12-24T14:32:23.297624597Z I1224 14:32:23.297583 1 utils.go:159] ID: 26 Req-ID: pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 GRPC call: /csi.v1.Controller/CreateVolume
2020-12-24T14:32:23.298876783Z I1224 14:32:23.298860 1 utils.go:160] ID: 26 Req-ID: pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 GRPC request: {"capacity_range":{"required_bytes":53687091200},"name":"pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2","parameters":{"clusterID":"openshift-storage","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
2020-12-24T14:32:23.299017801Z I1224 14:32:23.299004 1 rbd_util.go:771] ID: 26 Req-ID: pvc-135dd866-4826-4aff-b4f2-d90bd4d327f2 setting disableInUseChecks on rbd volume to: false

It is not possible to identify where exactly, and how, this request is stuck. It is possibly blocking on some operation in the Ceph libraries (through go-ceph). The ability to get a stack of the running goroutines would help with finding the place where the request is blocked. I do not think there is any further analysis that can be done with the currently available data.

A workaround could be to only start deploying NooBaa once the CephBlockPool is successfully created.
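As a rough sketch of that workaround (the pool name is the default from this environment, and the exact status fields of the CephBlockPool resource may differ between Rook/OCS versions), pool readiness could be checked before letting the NooBaa deployment proceed:

$ oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool
$ # wait until the pool reports Ready before the noobaa-db PVC is created:
$ until [ "$(oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool -o jsonpath='{.status.phase}')" = "Ready" ]; do sleep 10; done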
Right now, we don't have enough data to debug this issue. As Niels mentioned in the previous comment, enabling golang profiling will help us debug such issues, so I am moving this to the next release, where hopefully we will have profiling enabled. That will not help in debugging this occurrence, but it will help with similar issues in the future. Not closing it so that we don't lose track of it.
I also came across this issue twice while deploying OCS 4.6.3 on 2 clusters running OCP 4.6.19, a UPI bare metal install on VMware ESXi, 6.5.0, 15256549. On a 3rd OCP setup the installation of OCS went through without any issues. The environment and versions were exactly the same on all installs.
I am currently experiencing this issue with an OCP 4.7.2 "Compact Cluster" configuration on 3 bare metal machines:
CPU: i7-8750H
RAM: 64 GB
Storage: 1x 256 GB NVMe for boot, 1x 2 TB 2.5" Seagate HDD
NIC: 1x 10GbE Mellanox ConnectX-3
I'm also experiencing this bug on Red Hat Virtualization 4.4.3.
Client Version: 4.7.5
Server Version: 4.7.8
Kubernetes Version: v1.20.0+7d0a2b2
I upgraded my cluster from 4.7.5 to 4.7.8 and then installed OCS 4.6 using the OperatorHub.
As it seems that the provisioning is stuck somewhere in the Ceph libraries (our current best guess), we may be able to add a workaround in the ocs-operator, which takes care of the deployment.

The ocs-operator creates the CephCluster, StorageClass(es) and related objects. It currently waits for the CephCluster to become healthy before continuing with the next steps. It would be cleaner (and would probably prevent the issue in this BZ) if the ocs-operator created the CephBlockPool (and CephFilesystem), waited for their completion, and only then created the StorageClasses. This would result in Ceph-CSI getting requests only once the CephBlockPool is ready to be used.

Umanga, does the above seem correct? Should I create a separate BZ for the ocs-operator to introduce the additional status checks before creating the StorageClasses?
(In reply to Niels de Vos from comment #21)
> As it seems that the provisioning is stuck somewhere in the Ceph libraries
> (our current best guess), we may be able to add a workaround in the
> ocs-operator, which takes care of the deployment.
>
> The ocs-operator creates the CephCluster, StorageClass(es) and related
> objects. It currently waits for the CephCluster to become healthy before
> continuing with the next steps. It would be cleaner (and would probably
> prevent the issue in this BZ) if the ocs-operator created the CephBlockPool
> (and CephFilesystem), waited for their completion, and only then created
> the StorageClasses. This would result in Ceph-CSI getting requests only
> once the CephBlockPool is ready to be used.
>
> Umanga, does the above seem correct? Should I create a separate BZ for the
> ocs-operator to introduce the additional status checks before creating the
> StorageClasses?

Yes, you should. Or this BZ is also fine if that is the only fix required.
Is there a workaround that can be applied on a cluster once a customer hits this?
I think the below steps can help to recover (see the example commands after this list):
* Make sure the Ceph cluster is healthy.
* Confirm that the pool creation has succeeded.
* Restart the rbd provisioner pod.
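A minimal sketch of that recovery flow, assuming the default openshift-storage namespace and the resource names used in this report (adjust for your environment):

$ oc -n openshift-storage get cephcluster                                        # HEALTH should report HEALTH_OK
$ oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool     # the pool must exist
$ oc -n openshift-storage delete pod -l app=csi-rbdplugin-provisioner            # restart the RBD provisioner pods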
Umanga, are we using this bug for fix? If so, can you reassign and set the component correctly?
I expect that this can be solved (prevented in the future) by a change in the ocs-operator. By creating the StorageClass(es) only after the CephBlockPool and CephFilesystem are created and ready, there should not be any problems/hangs during PVC creation.
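For illustration, the ordering that such a change should guarantee can be checked manually today; the resource names below are the defaults created by the ocs-operator in this report's environment:

$ oc -n openshift-storage get cephblockpool,cephfilesystem                       # should be created and ready
$ oc get storageclass ocs-storagecluster-ceph-rbd ocs-storagecluster-cephfs      # should only be consumed once the above are ready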
Since we've hit this more than once, it would be good to finally get this resolved. Giving devel_ack+.
Based on today's bug triage and original bug description, providing qa ack.
This has existed since 4.6 and we don't have a fix ready yet. Moving it out of 4.8; it can be taken in 4.8.z if required.
I have a client with this problem on AWS. (running 4.6.32 I think) There is a description up above that says a potential work-around is to "Restart the rbd provisioner pod". Can you tell me which NS and the exact name of the pod(s) that need restarting? csi-rbdplugin-provisioner-xxxxx.xxxxx and rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xxxxx in openshift-storage. Is it one of these?
(In reply to Stu Lipshires from comment #37)
> I have a client with this problem on AWS. (running 4.6.32 I think) There is
> a description up above that says a potential work-around is to "Restart the
> rbd provisioner pod". Can you tell me which NS and the exact name of the
> pod(s) that need restarting? csi-rbdplugin-provisioner-xxxxx.xxxxx and
> rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xxxxx in
> openshift-storage. Is it one of these?

The provisioner pods are in the openshift-storage namespace and have the label app=csi-rbdplugin-provisioner. You can restart them by running this command:

$ oc -n openshift-storage delete pod -l app=csi-rbdplugin-provisioner
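After the restart, the replacement provisioner pods and the stuck PVC can be checked with the commands below (the PVC name is the one reported earlier in this bug; yours may differ):

$ oc -n openshift-storage get pods -l app=csi-rbdplugin-provisioner    # new pods should reach Running
$ oc -n openshift-storage get pvc db-noobaa-db-0                       # should become Bound once provisioning is retried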
*** Bug 1931875 has been marked as a duplicate of this bug. ***
Please test with the latest build.
I tried to test the bug again.

Steps I did to reproduce:
1. I deployed an AWS cluster with 1 AZ, OCP 4.9, without OCS.
2. Installed OCS 4.9, the latest build, from the UI.
3. Created the storage cluster.

After all the steps above, the OCS pods were running, and the ocs-operator was in the "Succeeded" phase.

Link to the Jenkins job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/5909/

Versions of all components:

OCP version:
Client Version: 4.8.0-0.nightly-2021-09-11-042202
Server Version: 4.9.0-0.nightly-2021-09-10-170926
Kubernetes Version: v1.22.0-rc.0+75ee307

OCS version:
ocs-operator.v4.9.0-132.ci   OpenShift Container Storage   4.9.0-132.ci   Succeeded

cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-10-170926   True        False         102m    Cluster version is 4.9.0-0.nightly-2021-09-10-170926

Rook version:
rook: 4.9-187.c4c788f.release_4.9
go: go1.16.6

Ceph version:
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
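For reference, a typical way to collect the version information above (the rook version command assumes the default rook-ceph-operator deployment name in the openshift-storage namespace):

$ oc version
$ oc get clusterversion
$ oc -n openshift-storage get csv
$ oc -n openshift-storage exec deploy/rook-ceph-operator -- rook version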
Can I move the bug to Verified according to the results above?
lgtm
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086