Description    Daniel Osypenko    2023-01-19 18:15:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
PVC fails to reach Bound status within the timeout after a worker node is drained, ODF 4.12
Version of all relevant components (if applicable):
Observed on configurations:
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-EXTERNAL-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-KMS-VAULT-V1-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-EXTERNAL-3M-3W
OCS4-12-Downstream-OCP4-12-BAREMETAL-UPI-1AZ-RHCOS-NVME-INTEL-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-13-AZURE-IPI-3AZ-RHCOS-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-KMS-VAULT-V1-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-FIPS-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-AWS-IPI-3AZ-RHCOS-3M-3W
Version details:
Ceph Version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
Cluster Version 4.12.0-0.nightly-2023-01-10-062211
OCS operator 4.12.0-164
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, temporarily: the PVC stays in Pending status for at least 60 sec and up to 2 min 20 sec
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Yes, at least 10 times. The issue has also happened in the past (earliest occurrence September 2022)
Can this issue be reproduced from the UI?
No
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
OCS-1269/OCS-1272:
- Put 1 worker/master node into maintenance (mark it unschedulable and drain it); see the command sketch after this list
- Check cluster functionality by creating resources
(pools, storageclasses, PVCs, pods - both CephFS and RBD)
- Mark the node as schedulable again
- Check cluster and Ceph health
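For reference, the maintenance flow can be reproduced from the command line roughly as follows. This is only a sketch: the node name is a placeholder and the drain flags are the standard "oc adm drain" ones, not the exact calls the ocs-ci test makes.

NODE=compute-0   # placeholder; pick a worker from "oc get nodes"

# mark the node unschedulable and drain it
oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force

# ... create test resources here (pools, storage classes, PVCs, pods) ...

# return the node to service and verify cluster/Ceph health
oc adm uncordon "$NODE"
oc get cephcluster -n openshift-storage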
Actual results:
After the worker node is drained, at the "Create resources" step the PVC fails to reach status Bound within the 60 sec timeout.
Right before the timeout is reached, we can see how loaded the cluster is:
================================================================================
Throughput: 0.46 MB/s || Latency: 0.93 ms || IOPS: 13.93 || Used Space: 43.49 GB
================================================================================
PVC YAML:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-58a23346164f4ecf9390d244faa0450
  namespace: namespace-test-40830713a86b43cfa9b4da45b
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
Expected results:
The PVC should reach status Bound within 60 sec
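The check the test performs can be approximated from the command line. A minimal sketch, assuming the PVC YAML above is saved as pvc-test.yaml and an oc client recent enough to support --for=jsonpath (this mirrors the 60 sec limit but is not the ocs-ci helper itself):

oc apply -f pvc-test.yaml
oc wait pvc/pvc-test-58a23346164f4ecf9390d244faa0450 \
  -n namespace-test-40830713a86b43cfa9b4da45b \
  --for=jsonpath='{.status.phase}'=Bound --timeout=60s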
Additional info:
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.ocs.ocp - ERROR - Wait for PersistentVolumeClaim resource pvc-test-58a23346164f4ecf9390d244faa0450 at column STATUS to reach desired condition Bound failed, last actual status was Pending
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.helpers.helpers - ERROR - PersistentVolumeClaim pvc-test-58a23346164f4ecf9390d244faa0450 failed to reach Bound
Description of the resource at the time of failure:
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.ocs.ocp - WARNING - Description of the resource(s) we were waiting for:
2023-01-11 23:23:40 Name: pvc-test-58a23346164f4ecf9390d244faa0450
2023-01-11 23:23:40 Namespace: namespace-test-40830713a86b43cfa9b4da45b
2023-01-11 23:23:40 StorageClass: ocs-storagecluster-cephfs
2023-01-11 23:23:40 Status: Pending
2023-01-11 23:23:40 Volume:
2023-01-11 23:23:40 Labels:
2023-01-11 23:23:40 Annotations: volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2023-01-11 23:23:40 volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2023-01-11 23:23:40 Finalizers: [kubernetes.io/pvc-protection]
2023-01-11 23:23:40 Capacity:
2023-01-11 23:23:40 Access Modes:
2023-01-11 23:23:40 VolumeMode: Filesystem
2023-01-11 23:23:40 Used By:
2023-01-11 23:23:40 Events:
2023-01-11 23:23:40 Type Reason Age From Message
2023-01-11 23:23:40 ---- ------ ---- ---- -------
2023-01-11 23:23:40 Normal ExternalProvisioning 6s (x5 over 60s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
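The ExternalProvisioning event above means the claim is still waiting on the external CSI provisioner. A quick way to check whether the CephFS provisioner pods were evicted or rescheduled during the drain (the label selector is assumed from the default ODF/Rook CSI deployment names, not taken from this must-gather):

oc get pods -n openshift-storage -l app=csi-cephfsplugin-provisioner -o wide
oc get events -n openshift-storage --sort-by=.lastTimestamp | tail -n 20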
At the same time, the must-gather logs show that the PVC pvc-test-58a23346164f4ecf9390d244faa0450 was bound no later than 2m20s after it was created.
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
bg-fio-load-6763b pvc-test-632f4b61f3504874aac4795c7ab4e5d Bound pvc-84af41ea-8221-45f1-8c88-63e6729a6fb9 7Gi RWO ocs-storagecluster-ceph-rbd 8h
namespace-test-40830713a86b43cfa9b4da45b pvc-test-3b55e4a218824042af47a8b3b40d2d2 Bound pvc-86cae5e2-d721-4a5d-ae3f-7895740b9f7e 3Gi RWO ocs-storagecluster-ceph-rbd 2m41s
namespace-test-40830713a86b43cfa9b4da45b pvc-test-58a23346164f4ecf9390d244faa0450 Bound pvc-495e1572-49d1-4b2f-9fe9-ba93cc59fe69 3Gi RWO ocs-storagecluster-cephfs 2m20s
...
See /ocsci-jenkins/openshift-clusters/j-253vuf1cs33-t1/j-253vuf1cs33-t1_20230111T115745/logs/failed_testcase_ocs_logs_1673441794/test_node_maintenance[worker]_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1d2e3b562c65a633844855c65f0017e18d735d22ef3fbd51ff6fbb258ff23497/namespaces/all/pvc_all_namespaces
Full must-gather log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-253vuf1cs33-t1/j-253vuf1cs33-t1_20230111T115745/logs/failed_testcase_ocs_logs_1673441794/test_node_maintenance%5bworker%5d_ocs_logs/ocs_must_gather/
Report portal: OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-FIPS-1AZ-RHCOS-VSAN-3M-3W-tier1-jenkins
The test has 9 failures logged in Report Portal, but it never failed when run from a local computer (5 test runs performed). Failures also happened earlier, the first dating back to September 2022. Therefore 60 sec for PVC creation after a node drain is clearly not enough.
Links to other Report Portal occurrences:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8178/361568/361845/361847/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8175/361258/361533/361534/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8148/359492/359760/359761/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8144/359135/359405/359406/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8118/357719/357987/357988/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8102/356826/357082/357083/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8086/356026/356281/356282/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8080/355418/355688/355690/log
@Daniel Osypenko, @Elad
Per your question: "What timeout will be efficient for PVC creation & bound after node drain?"
We do not have a performance test or a performance requirement for PVC creation and binding after a node drain.
The time limit for PVC creation in our tests without the node-drain flavour is 1 sec (for the PVC to reach status Bound), while the actual average results are much shorter than 1 sec (up to 0.2 sec) for 5/15/25 GB PVCs on both AWS and VMware LSO platforms.
Therefore a limit of 60 sec sounds like a very long time for a PVC to be created, compared to the results we have on a cluster without a node drain.
I would suggest asking the product team to define a limit for PVC creation during a node drain, rather than just increasing the limit to make the test pass.
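If such a limit is to be defined, time-to-Bound during a drain can be measured directly. A minimal sketch using the PVC name from this report (the 1-second poll loop is illustrative only, not the ocs-ci implementation):

start=$(date +%s)
until [ "$(oc get pvc pvc-test-58a23346164f4ecf9390d244faa0450 \
        -n namespace-test-40830713a86b43cfa9b4da45b \
        -o jsonpath='{.status.phase}')" = "Bound" ]; do
  sleep 1
done
echo "time to Bound: $(( $(date +%s) - start ))s"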