Bug 1816688 - Reduced data availability right after installation
Summary: Reduced data availability right after installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jose A. Rivera
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-24 14:27 UTC by Michal Minar
Modified: 2020-04-06 08:26 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-25 12:55:29 UTC
Embargoed:


Attachments
OCS container logs and ceph health (14.36 MB, application/gzip), 2020-03-24 15:51 UTC, Michal Minar
Described storage related objects (2.81 KB, application/gzip), 2020-03-24 16:02 UTC, Michal Minar

Description Michal Minar 2020-03-24 14:27:44 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
  OCS 4.2 does not provision volumes.
  A partner installed OCS 4.2 on OCP 4.2 on KVM/libvirt with spinning disks underneath.
  OCS sits on top of PVs provisioned by the Local Storage Operator, using qemu images attached as disks to the VMs.

Version of all relevant components (if applicable):
  oc version
  Client Version: openshift-clients-4.2
  Server Version: 4.2.20
  Kubernetes Version: v1.14.6+999bb21

  oc get csv -n openshift-storage
  NAME                  DISPLAY                       VERSION   REPLACES   PHASE
  ocs-operator.v4.2.2   OpenShift Container Storage   4.2.2                Installing

  oc get csv -n local-storage
  NAME                                         DISPLAY         VERSION               REPLACES   PHASE
  local-storage-operator.4.2.22-202003020552   Local Storage   4.2.22-202003020552              Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? Yes.

Is there any workaround available to the best of your knowledge? Using LSO directly, without OCS on top.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1

Is this issue reproducible? Not sure; the installation has not been repeated yet.

Can this issue be reproduced from the UI? The steps were mostly performed from the CLI.

If this is a regression, please provide more details to justify this: Probably not.


Steps to Reproduce:
1. Deploy OCP 4.2 on KVM/libvirt with spinning drives underneath
2. Deploy LSO with enough volumes
3. Deploy OCS 4.2 on top of LSO PVs
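
Before deploying OCS in step 3, it is worth confirming that LSO actually
published enough volumes. A minimal sanity check, assuming only that the
LSO-backed PVs are the ones still unclaimed at that point:

  # All LSO-backed PVs should exist and show STATUS=Available.
  oc get pv -o wide

  # Count the Available PVs; OCS needs one per OSD it will create.
  oc get pv --no-headers | grep -c Available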

Actual results:
  Installation does not finish because NooBaa's PV, although attached, cannot be mounted:

      Warning  FailedMount             16s (x8 over 2m25s)   kubelet, pvx180.wdf.sap.corp  MountVolume.MountDevice failed for volume "pvc-14e97ec0-685b-11ea-b5ce-52540017001e" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-213a69c7-685b-11ea-9cc1-0a580a820214 already exists

  Subsequent attempts to provision volumes using the OCS storage classes end up with PVs stuck in Pending.
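
The "operation with the given Volume ID ... already exists" Aborted error means
the RBD CSI driver still holds an in-flight operation for that volume, so every
retry is rejected. One way to see where it is stuck is to read the plugin logs;
the csi-rbdplugin container name and the app=csi-rbdplugin label on the node
plugin daemonset are assumptions based on how rook-ceph names them:

  # Provisioner-side logs (the pod name appears in the PVC events below):
  oc logs -n openshift-storage csi-rbdplugin-provisioner-5dcdb49bb9-s254c -c csi-rbdplugin

  # Node-side plugin logs on the node that failed the mount:
  oc logs -n openshift-storage -l app=csi-rbdplugin -c csi-rbdplugin --tail=100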

Expected results:
  The installation completes and OCS provisions usable volumes.

Additional info:
  The following guide was followed to deploy LSO and OCS: https://blog.openshift.com/ocs-4-2-in-ocp-4-2-14-upi-installation-in-rhv/

  oc rsh -n openshift-storage $TOOLS_POD ceph health detail                                                                                                                                                                                                                                                                                                                                 
  HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 42 pgs inactive; Degraded data redundancy: 62 pgs undersized
  MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsocs-storagecluster-cephfilesystem-a(mds.0): 14 slow metadata IOs are blocked > 30 secs, oldest blocked for 600793 secs
  PG_AVAILABILITY Reduced data availability: 42 pgs inactive
    pg 1.3 is stuck inactive for 600826.571739, current state undersized+peered, last acting [1]
    pg 2.4 is stuck inactive for 600824.554328, current state undersized+peered, last acting [1]
    pg 2.6 is stuck inactive for 600824.554328, current state undersized+peered, last acting [1]
    pg 3.1 is stuck inactive for 600821.758956, current state undersized+peered, last acting [1]
    pg 3.4 is stuck inactive for 600821.758956, current state undersized+peered, last acting [1]
    pg 3.6 is stuck inactive for 600821.758956, current state undersized+peered, last acting [1]
    pg 3.7 is stuck inactive for 600821.758956, current state undersized+peered, last acting [1]
    pg 4.0 is stuck inactive for 600816.054681, current state undersized+peered, last acting [1]
    ...
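
  The undersized+peered state with "last acting [1]" means each of these PGs
  currently has only a single OSD to place replicas on, which is what keeps
  them inactive. To confirm whether the other OSDs ever came up, and whether
  CRUSH sees them on distinct hosts, the same toolbox pod can be used:

  oc rsh -n openshift-storage $TOOLS_POD ceph osd tree
  oc rsh -n openshift-storage $TOOLS_POD ceph osd df
  oc rsh -n openshift-storage $TOOLS_POD ceph osd pool ls detail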

  oc describe pod noobaa-core-0 -n openshift-storage
  <snip>
  Events:
  Type     Reason                  Age                   From                          Message
  ----     ------                  ----                  ----                          -------
  Warning  FailedScheduling        4m43s (x5 over 5m7s)  default-scheduler             pod has unbound immediate PersistentVolumeClaims (repeated 4 times)
  Normal   Scheduled               4m42s                 default-scheduler             Successfully assigned openshift-storage/noobaa-core-0 to pvx180.wdf.sap.corp
  Normal   SuccessfulAttachVolume  4m42s                 attachdetach-controller       AttachVolume.Attach succeeded for volume "pvc-14e97ec0-685b-11ea-b5ce-52540017001e"
  Warning  FailedMount             2m26s                 kubelet, pvx180.wdf.sap.corp  MountVolume.MountDevice failed for volume "pvc-14e97ec0-685b-11ea-b5ce-52540017001e" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             25s (x2 over 2m39s)   kubelet, pvx180.wdf.sap.corp  Unable to mount volumes for pod "noobaa-core-0_openshift-storage(15139cdb-685b-11ea-b5ce-52540017001e)": timeout expired waiting for volumes to attach or mount for pod "openshift-storage"/"noobaa-core-0". list of unmounted volumes=[db]. list of unattached volumes=[db logs mgmt-secret s3-secret noobaa-token-gzql6]
  Warning  FailedMount             16s (x8 over 2m25s)   kubelet, pvx180.wdf.sap.corp  MountVolume.MountDevice failed for volume "pvc-14e97ec0-685b-11ea-b5ce-52540017001e" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-213a69c7-685b-11ea-9cc1-0a580a820214 already exists
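
  The attach itself succeeds and only MountDevice (staging) times out, so the
  attachdetach-controller's view can be cross-checked against the failing PV
  via the VolumeAttachment objects:

  oc get volumeattachments
  oc describe volumeattachments | grep -B5 pvc-14e97ec0-685b-11ea-b5ce-52540017001e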
  
  oc describe pvc db-noobaa-core-0
  Name:          db-noobaa-core-0
  Namespace:     openshift-storage
  StorageClass:  ocs-storagecluster-ceph-rbd
  Status:        Bound
  Volume:        pvc-8b6a17cb-6863-11ea-b5ce-52540017001e
  Labels:        noobaa-core=noobaa
  Annotations:   pv.kubernetes.io/bind-completed: yes
                 pv.kubernetes.io/bound-by-controller: yes
                 volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  Finalizers:    [kubernetes.io/pvc-protection]
  Capacity:      50Gi
  Access Modes:  RWO
  VolumeMode:    Filesystem
  Events:
    Type       Reason                 Age   From                                                                                                                Message
    ----       ------                 ----  ----                                                                                                                -------
    Normal     ExternalProvisioning   15m   persistentvolume-controller                                                                                         waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
    Normal     Provisioning           15m   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5dcdb49bb9-s254c_6b362ba7-685a-11ea-95f7-0a580a820214  External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-core-0"
    Normal     ProvisioningSucceeded  14m   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5dcdb49bb9-s254c_6b362ba7-685a-11ea-95f7-0a580a820214  Successfully provisioned volume pvc-8b6a17cb-6863-11ea-b5ce-52540017001e
  Mounted By:  noobaa-core-0
  oc describe pv pvc-8b6a17cb-6863-11ea-b5ce-52540017001e 
  Name:            pvc-8b6a17cb-6863-11ea-b5ce-52540017001e
  Labels:          <none>
  Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
  Finalizers:      [kubernetes.io/pv-protection]
  StorageClass:    ocs-storagecluster-ceph-rbd
  Status:          Bound
  Claim:           openshift-storage/db-noobaa-core-0
  Reclaim Policy:  Delete
  Access Modes:    RWO
  VolumeMode:      Filesystem
  Capacity:        50Gi
  Node Affinity:   <none>
  Message:         
  Source:
      Type:              CSI (a Container Storage Interface (CSI) volume source)
      Driver:            openshift-storage.rbd.csi.ceph.com
      VolumeHandle:      0001-0011-openshift-storage-0000000000000001-8ba71bf5-6863-11ea-9cc1-0a580a820214
      ReadOnly:          false
      VolumeAttributes:      clusterID=openshift-storage
                             imageFeatures=layering
                             imageFormat=2
                             pool=ocs-storagecluster-cephblockpool
                             storage.kubernetes.io/csiProvisionerIdentity=1584454814327-8081-openshift-storage.rbd.csi.ceph.com
  Events:                <none>
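
  Since the claim is Bound, the RBD image behind this PV was created. Whether
  it actually exists and whether a stale client still watches it can be checked
  from the toolbox; the csi-vol-<uuid> image name is derived here from the
  VolumeHandle suffix per the ceph-csi naming convention, so treat it as an
  assumption:

  oc rsh -n openshift-storage $TOOLS_POD rbd ls -p ocs-storagecluster-cephblockpool
  oc rsh -n openshift-storage $TOOLS_POD rbd status ocs-storagecluster-cephblockpool/csi-vol-8ba71bf5-6863-11ea-9cc1-0a580a820214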
  
  oc describe sc ocs-storagecluster-ceph-rbd
  Name:                  ocs-storagecluster-ceph-rbd
  IsDefaultClass:        No
  Annotations:           <none>
  Provisioner:           openshift-storage.rbd.csi.ceph.com
  Parameters:            clusterID=openshift-storage,csi.storage.k8s.io/fstype=ext4,csi.storage.k8s.io/node-stage-secret-name=rook-csi-rbd-node,csi.storage.k8s.io/node-stage-secret-namespace=openshift-storage,csi.storage.k8s.io/provisioner-secret-name=rook-csi-rbd-provisioner,csi.storage.k8s.io/provisioner-secret-namespace=openshift-storage,imageFeatures=layering,imageFormat=2,pool=ocs-storagecluster-cephblockpool
  AllowVolumeExpansion:  <unset>
  MountOptions:          <none>
  ReclaimPolicy:         Delete
  VolumeBindingMode:     Immediate
  Events:                <none>
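
  With VolumeBindingMode Immediate, a fresh claim against this class should be
  provisioned and bound within a minute on a healthy cluster, which makes for a
  quick smoke test of the Pending symptom; the claim name below is made up for
  illustration:

  # rbd-smoke-test.yaml
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: rbd-smoke-test
    namespace: openshift-storage
  spec:
    storageClassName: ocs-storagecluster-ceph-rbd
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 1Gi

  oc apply -f rbd-smoke-test.yaml
  oc get pvc rbd-smoke-test -n openshift-storage -w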


More information is in the attachments.

Comment 2 Michal Minar 2020-03-24 15:51:05 UTC
Created attachment 1673137 [details]
OCS container logs and ceph health

Comment 3 Michal Minar 2020-03-24 16:02:22 UTC
Created attachment 1673160 [details]
Described storage related objects

