Bug 2006322

Summary: failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Daniel Horák <dahorak>
Component: csi-driver Assignee: Humble Chirammal <hchiramm>
Status: CLOSED DUPLICATE QA Contact: Daniel Horák <dahorak>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.9 CC: madam, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, pbalogh, rar, sapillai, tnielsen
Target Milestone: --- Keywords: Automation, Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-23 06:32:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Daniel Horák 2021-09-21 13:41:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

  A fresh ODF deployment sometimes fails to run some of the pods, with the
  following events:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Warning  FailedScheduling  151m  default-scheduler  0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
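
  To list every pod blocked this way, one can filter on the Pending phase
  (a minimal triage sketch using a standard oc field selector):

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  $ # List all pods not yet scheduled or running, across all namespaces:
  $ oc get pods -A --field-selector=status.phase=Pending
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~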

  The reason seems to be that all the PVCs related to the
  ocs-storagecluster-ceph-rbd StorageClass are stuck in Pending state:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  $ oc get pvc -A | grep -v Bound
  NAMESPACE              NAME                                        STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
  openshift-monitoring   my-alertmanager-claim-alertmanager-main-0   Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  openshift-monitoring   my-alertmanager-claim-alertmanager-main-1   Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  openshift-monitoring   my-alertmanager-claim-alertmanager-main-2   Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  openshift-monitoring   my-prometheus-claim-prometheus-k8s-0        Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  openshift-monitoring   my-prometheus-claim-prometheus-k8s-1        Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  openshift-storage      db-noobaa-db-pg-0                           Pending                                                                        ocs-storagecluster-ceph-rbd   134m
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  $ oc describe pvc -n openshift-storage db-noobaa-db-pg-0
  Name:          db-noobaa-db-pg-0
  Namespace:     openshift-storage
  StorageClass:  ocs-storagecluster-ceph-rbd
  Status:        Pending
  Volume:        
  Labels:        app=noobaa
                 noobaa-db=postgres
  Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  Finalizers:    [kubernetes.io/pvc-protection]
  Capacity:      
  Access Modes:  
  VolumeMode:    Filesystem
  Used By:       noobaa-db-pg-0
  Events:
    Type     Reason                Age                    From                                                                                                                Message
    ----     ------                ----                   ----                                                                                                                -------
    Warning  ProvisioningFailed    154m                   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = DeadlineExceeded desc = context deadline exceeded
    Warning  ProvisioningFailed    130m (x14 over 154m)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-e258cfbd-dd4b-41cb-83cf-53f42c310115 already exists
    Normal   ExternalProvisioning  109s (x636 over 156m)  persistentvolume-controller                                                                                         waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
    Normal   Provisioning          46s (x50 over 156m)    openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5c85747995-mmwz2_9d200108-8fcc-4470-b9ee-0c2ce86c23cd  External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-pg-0"
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
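
  The "operation with the given Volume ID ... already exists" error suggests
  that the initial CreateVolume call timed out and the provisioner keeps
  retrying while the first operation is still in flight. A sketch of how the
  provisioner-side logs could be pulled for more detail (pod name taken from
  the events above; the container names are assumed from a standard ceph-csi
  provisioner deployment):

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  $ # External provisioner sidecar -- shows the retry loop and gRPC errors
  $ # (container names assumed from a standard ceph-csi provisioner pod):
  $ oc logs -n openshift-storage csi-rbdplugin-provisioner-5c85747995-mmwz2 -c csi-provisioner
  $ # ceph-csi RBD driver container -- shows what CreateVolume is stuck on:
  $ oc logs -n openshift-storage csi-rbdplugin-provisioner-5c85747995-mmwz2 -c csi-rbdplugin
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~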


Version of all relevant components (if applicable):
  $ oc version
  Client Version: 4.9.0-0.nightly-2021-09-20-203004
  Server Version: 4.9.0-0.nightly-2021-09-20-203004
  Kubernetes Version: v1.22.0-rc.0+af080cb

  $ oc get csv -n openshift-storage
  NAME                            DISPLAY                       VERSION        REPLACES   PHASE
  noobaa-operator.v4.9.0-142.ci   NooBaa Operator               4.9.0-142.ci              Succeeded
  ocs-operator.v4.9.0-142.ci      OpenShift Container Storage   4.9.0-142.ci              Succeeded
  odf-operator.v4.9.0-142.ci      OpenShift Data Foundation     4.9.0-142.ci              Succeeded


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
  Yes, the ocs-storagecluster is not correctly deployed.


Is there any workaround available to the best of your knowledge?
  No (I'm not aware of any)


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
  3


Is this issue reproducible?
  Occasionally; it has already been spotted a few times.

  It seems more likely to reproduce on an AWS IPI deployment with 3 masters,
  3 workers and 3 infra,worker nodes, but it was also seen on a vSphere UPI
  deployment with only 3 masters and 3 worker nodes.


Can this issue be reproduced from the UI?
  Yes


If this is a regression, please provide more details to justify this:
  I think yes; we didn't see this issue in previous versions.


Steps to Reproduce:
1. Install ODF on top of cluster with 3 worker and 3 infra,worker nodes on AWS


Actual results:
  PVCs assigned to ocs-storagecluster-ceph-rbd are stuck in Pending state,
  and because of that some of the pods are also stuck in Pending state.


Expected results:
  All PVCs are correctly Bound (and all underlying resources are correctly
  created) -> all relevant pods are running.


Additional info:

Comment 2 Travis Nielsen 2021-09-21 13:50:18 UTC
Daniel, please collect an OCS must-gather, or at least logs and pod descriptions for the pods in the openshift-storage namespace.
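
For reference, something like the following should collect the requested data (a sketch; the exact must-gather image tag depends on the installed release, v4.9 is assumed here):

  $ # Image tag assumed; use the must-gather image matching the installed release:
  $ oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.9
  $ # Pod descriptions from the storage namespace:
  $ oc describe pods -n openshift-storage > openshift-storage-pods.txt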

Comment 3 Santosh Pillai 2021-09-21 13:58:33 UTC
This looks like the same issue as https://github.com/rook/rook/issues/8696

Comment 5 Daniel Horák 2021-09-21 14:07:55 UTC
(In reply to Daniel Horák from comment #0)
> Can this issue be reproduced from the UI?
>   Yes

I wrongly marked it as reproducible from the UI (we haven't automated UI deployment for 4.9, so it wasn't tried from the UI).

Comment 9 Travis Nielsen 2021-09-22 13:58:02 UTC
Moving to CSI to determine whether it's a duplicate or needs further investigation.

Comment 11 Travis Nielsen 2021-09-22 14:18:24 UTC
Sorry, I missed that earlier comment...

The mons and osds are crashing because their backing PVCs are gone. For example, see the following pod descriptions. Has the environment already started to be destroyed?

From mon-a:
  Warning  FailedAttachVolume  5m15s (x694 over 23h)  attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-4c28ccac-290b-4cfd-8476-c2212e04b1e1" : InvalidVolume.NotFound: The volume 'vol-00a919752a2bb4b87' does not exist.

From osd-0:
  Warning  FailedAttachVolume  12s (x697 over 23h)      attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-1ecfa93e-10b8-4780-bb26-fedec4b23f86" : InvalidVolume.NotFound: The volume 'vol-056695aa5b8572b9a' does not exist.

From osd-1:
  Warning  FailedAttachVolume  5m56s (x694 over 23h)    attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-ab759aa9-fa15-48b5-85ae-22bdc5f11824" : InvalidVolume.NotFound: The volume 'vol-09556bc7548b700a2' does not exist.
           status code: 400, request id: aa37ed7e-6290-464b-ae54-32921c5f1eb0
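
One way to cross-check whether the EBS volumes really disappeared underneath the PVs (a sketch, assuming the in-tree AWS EBS provisioner for these PVs and an authenticated aws CLI):

  $ # Map the PV back to its EBS volume ID (field assumes the in-tree AWS EBS provisioner):
  $ oc get pv pvc-4c28ccac-290b-4cfd-8476-c2212e04b1e1 -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'
  $ # Then ask AWS whether that volume still exists:
  $ aws ec2 describe-volumes --volume-ids vol-00a919752a2bb4b87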