Bug 2006322
| Summary: | failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = DeadlineExceeded desc = context deadline exceeded | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Daniel Horák <dahorak> |
| Component: | csi-driver | Assignee: | Humble Chirammal <hchiramm> |
| Status: | CLOSED DUPLICATE | QA Contact: | Daniel Horák <dahorak> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | madam, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, pbalogh, rar, sapillai, tnielsen |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-23 06:32:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Daniel, please collect an OCS must-gather, or at least logs and pod descriptions for the pods in the openshift-storage namespace.

This looks like the same issue as https://github.com/rook/rook/issues/8696.

(In reply to Daniel Horák from comment #0)
> Can this issue reproduce from the UI?
> Yes

I wrongly marked it as reproducible from the UI (we haven't automated UI deployment for 4.9, so it wasn't tried from the UI). Moving to CSI to finalize whether it's a duplicate or needs further investigation.

Sorry, I missed that earlier comment...
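The diagnostics requested above could be collected along these lines. This is only a sketch: the must-gather image name/tag is an assumption for ODF 4.9 and should be verified against the release in use, and the `run` wrapper is a helper invented here, not a product command.

```shell
# Assumed must-gather image for ODF 4.9 -- verify for your release.
MG_IMAGE="registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.9"

# Helper (invented here): print each command, and execute it only when
# an oc client is available and logged in to a cluster.
run() {
  echo "+ $*"
  if oc whoami >/dev/null 2>&1; then "$@"; fi
}

run oc adm must-gather --image="$MG_IMAGE"
run oc describe pods -n openshift-storage
run oc get events -n openshift-storage --sort-by=.lastTimestamp
```

Without cluster access the script only prints the commands it would run, which makes it safe to paste into a comment or runbook.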
The mons and OSDs are crashing because their backing PVCs are gone. For example, see the following pod descriptions. Has the environment already started to be torn down?
From mon-a:
Warning FailedAttachVolume 5m15s (x694 over 23h) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-4c28ccac-290b-4cfd-8476-c2212e04b1e1" : InvalidVolume.NotFound: The volume 'vol-00a919752a2bb4b87' does not exist.
From osd-0:
Warning FailedAttachVolume 12s (x697 over 23h) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-1ecfa93e-10b8-4780-bb26-fedec4b23f86" : InvalidVolume.NotFound: The volume 'vol-056695aa5b8572b9a' does not exist.
From osd-1:
Warning FailedAttachVolume 5m56s (x694 over 23h) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-ab759aa9-fa15-48b5-85ae-22bdc5f11824" : InvalidVolume.NotFound: The volume 'vol-09556bc7548b700a2' does not exist. status code: 400, request id: aa37ed7e-6290-464b-ae54-32921c5f1eb0
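The `InvalidVolume.NotFound` errors above suggest the EBS volumes behind the mon/OSD PVs were deleted out from under the cluster. As a rough cross-check (a sketch, not part of the original report; `extract_volume_ids` is a helper invented here), the volume IDs can be pulled out of the event text and queried against EC2 when the aws CLI is configured:

```shell
# Helper (invented here): print each EBS volume ID ("vol-...") found on
# stdin, de-duplicated, one per line.
extract_volume_ids() {
  grep -Eo "vol-[0-9a-f]+" | sort -u
}

# Event text taken from the attach failures above.
events="The volume 'vol-00a919752a2bb4b87' does not exist.
The volume 'vol-056695aa5b8572b9a' does not exist.
The volume 'vol-09556bc7548b700a2' does not exist."

# When the aws CLI is available, ask EC2 about each volume directly.
for vol in $(printf '%s\n' "$events" | extract_volume_ids); do
  if command -v aws >/dev/null 2>&1; then
    aws ec2 describe-volumes --volume-ids "$vol" >/dev/null 2>&1 \
      || echo "$vol: not found in EC2"
  fi
done
```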
Description of problem (please be detailed as possible and provide log snippets):

A fresh ODF deployment sometimes fails to run some of the pods, with the following events:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Warning  FailedScheduling  151m  default-scheduler  0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The reason seems to be that all the PVCs related to the ocs-storagecluster-ceph-rbd StorageClass are stuck in Pending state:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc get pvc -A | grep -v Bound
NAMESPACE              NAME                                        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
openshift-monitoring   my-alertmanager-claim-alertmanager-main-0   Pending                                      ocs-storagecluster-ceph-rbd   134m
openshift-monitoring   my-alertmanager-claim-alertmanager-main-1   Pending                                      ocs-storagecluster-ceph-rbd   134m
openshift-monitoring   my-alertmanager-claim-alertmanager-main-2   Pending                                      ocs-storagecluster-ceph-rbd   134m
openshift-monitoring   my-prometheus-claim-prometheus-k8s-0        Pending                                      ocs-storagecluster-ceph-rbd   134m
openshift-monitoring   my-prometheus-claim-prometheus-k8s-1        Pending                                      ocs-storagecluster-ceph-rbd   134m
openshift-storage      db-noobaa-db-pg-0                           Pending                                      ocs-storagecluster-ceph-rbd   134m
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc describe pvc -n openshift-storage db-noobaa-db-pg-0
Name:          db-noobaa-db-pg-0
Namespace:     openshift-storage
StorageClass:  ocs-storagecluster-ceph-rbd
Status:        Pending
Volume:
Labels:        app=noobaa
               noobaa-db=postgres
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       noobaa-db-pg-0
Events:
  Type     Reason                Age                    From                                                                                                                 Message
  ----     ------                ----                   ----                                                                                                                 -------
  Warning  ProvisioningFailed    154m                   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd   failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  ProvisioningFailed    130m (x14 over 154m)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd   failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-e258cfbd-dd4b-41cb-83cf-53f42c310115 already exists
  Normal   ExternalProvisioning  109s (x636 over 156m)  persistentvolume-controller                                                                                          waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
  Normal   Provisioning          46s (x50 over 156m)    openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd   External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-pg-0"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version of all relevant components (if applicable):

$ oc version
Client Version: 4.9.0-0.nightly-2021-09-20-203004
Server Version: 4.9.0-0.nightly-2021-09-20-203004
Kubernetes Version: v1.22.0-rc.0+af080cb

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES   PHASE
noobaa-operator.v4.9.0-142.ci   NooBaa Operator               4.9.0-142.ci              Succeeded
ocs-operator.v4.9.0-142.ci      OpenShift Container Storage   4.9.0-142.ci              Succeeded
odf-operator.v4.9.0-142.ci      OpenShift Data Foundation     4.9.0-142.ci              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the ocs-storagecluster is not correctly deployed.

Is there any workaround available to the best of your knowledge?
No (I'm not aware of any).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Occasionally; spotted already a few times.
It seems to be more likely reproducible on an AWS IPI deployment with 3 master, 3 worker and 3 infra,worker nodes, but it has also been seen on a vSphere UPI deployment with only 3 master and 3 worker nodes.

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:
I think yes; we didn't see this issue on previous versions.

Steps to Reproduce:
1. Install ODF on top of a cluster with 3 worker and 3 infra,worker nodes on AWS.

Actual results:
PVCs assigned to ocs-storagecluster-ceph-rbd are stuck in Pending state, and because of that some of the pods are also stuck in Pending state.

Expected results:
All PVCs are correctly Bound (and all underlying resources are correctly created) -> all relevant pods are running.

Additional info:
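For re-checking the failure state described above, a small filter over `oc get pvc -A` output can list every claim that is not yet Bound (a sketch; `pending_pvcs` is a helper name invented here, not part of the product):

```shell
# Helper (invented here): read `oc get pvc -A --no-headers` output on
# stdin and print every claim that is not Bound as "namespace/name: STATUS".
# Column 3 is STATUS regardless of whether VOLUME/CAPACITY are empty,
# because empty trailing columns do not shift awk's field numbering.
pending_pvcs() {
  awk '$3 != "Bound" {print $1 "/" $2 ": " $3}'
}

# Run against the live cluster only when an oc client is logged in.
if oc whoami >/dev/null 2>&1; then
  oc get pvc -A --no-headers | pending_pvcs
fi
```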