Description    Daniel Osypenko    2023-01-19 18:15:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
PVC fails to reach Bound status within the timeout after a worker node is drained, ODF 4.12
Version of all relevant components (if applicable):
Observed on configurations:
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-EXTERNAL-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-KMS-VAULT-V1-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-EXTERNAL-3M-3W
OCS4-12-Downstream-OCP4-12-BAREMETAL-UPI-1AZ-RHCOS-NVME-INTEL-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-13-AZURE-IPI-3AZ-RHCOS-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-KMS-VAULT-V1-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-FIPS-1AZ-RHCOS-VSAN-3M-3W
OCS4-12-Downstream-OCP4-12-AWS-IPI-3AZ-RHCOS-3M-3W
Version details:
Ceph Version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
Cluster Version 4.12.0-0.nightly-2023-01-10-062211
OCS operator 4.12.0-164
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, temporarily: the PVC stays in Pending status for at least 60 sec and up to 2 min 20 sec
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Yes, at least 10 times. The issue has also happened in the past (earliest occurrence September 2022)
Can this issue be reproduced from the UI?
No
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
OCS-1269/OCS-1272:
- Put 1 worker/master node into maintenance (mark it unschedulable and drain it); see the command sketch after this list
- Check cluster functionality by creating resources
(pools, storageclasses, PVCs, pods - both CephFS and RBD)
- Mark the node as schedulable again
- Check cluster and Ceph health
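For reference, the maintenance flow can be reproduced from the command line roughly as follows. This is only a sketch: the node name is a placeholder and the drain flags are the standard "oc adm drain" ones, not the exact calls the ocs-ci test makes.

NODE=compute-0   # placeholder; pick a worker from "oc get nodes"

# mark the node unschedulable and drain it
oc adm cordon "$NODE"
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force

# ... create test resources here (pools, storage classes, PVCs, pods) ...

# return the node to service and verify cluster/Ceph health
oc adm uncordon "$NODE"
oc get cephcluster -n openshift-storage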
Actual results:
After the worker node is drained, at the "Create resources" step the PVC fails to reach status Bound within the 60 sec timeout.
Right before the timeout is reached, we can see how loaded the cluster is:
================================================================================
Throughput: 0.46 MB/s || Latency: 0.93 ms || IOPS: 13.93 || Used Space: 43.49 GB
================================================================================
PVC YAML:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-58a23346164f4ecf9390d244faa0450
  namespace: namespace-test-40830713a86b43cfa9b4da45b
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
Expected results:
The PVC should reach status Bound within 60 sec
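The check the test performs can be approximated from the command line. A minimal sketch, assuming the PVC YAML above is saved as pvc-test.yaml and an oc client recent enough to support --for=jsonpath (this mirrors the 60 sec limit but is not the ocs-ci helper itself):

oc apply -f pvc-test.yaml
oc wait pvc/pvc-test-58a23346164f4ecf9390d244faa0450 \
  -n namespace-test-40830713a86b43cfa9b4da45b \
  --for=jsonpath='{.status.phase}'=Bound --timeout=60s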
Additional info:
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.ocs.ocp - ERROR - Wait for PersistentVolumeClaim resource pvc-test-58a23346164f4ecf9390d244faa0450 at column STATUS to reach desired condition Bound failed, last actual status was Pending
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.helpers.helpers - ERROR - PersistentVolumeClaim pvc-test-58a23346164f4ecf9390d244faa0450 failed to reach Bound
Description of the resource at the time of failure:
2023-01-11 23:23:40 21:23:40 - MainThread - ocs_ci.ocs.ocp - WARNING - Description of the resource(s) we were waiting for:
2023-01-11 23:23:40 Name: pvc-test-58a23346164f4ecf9390d244faa0450
2023-01-11 23:23:40 Namespace: namespace-test-40830713a86b43cfa9b4da45b
2023-01-11 23:23:40 StorageClass: ocs-storagecluster-cephfs
2023-01-11 23:23:40 Status: Pending
2023-01-11 23:23:40 Volume:
2023-01-11 23:23:40 Labels:
2023-01-11 23:23:40 Annotations: volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2023-01-11 23:23:40 volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2023-01-11 23:23:40 Finalizers: [kubernetes.io/pvc-protection]
2023-01-11 23:23:40 Capacity:
2023-01-11 23:23:40 Access Modes:
2023-01-11 23:23:40 VolumeMode: Filesystem
2023-01-11 23:23:40 Used By:
2023-01-11 23:23:40 Events:
2023-01-11 23:23:40 Type Reason Age From Message
2023-01-11 23:23:40 ---- ------ ---- ---- -------
2023-01-11 23:23:40 Normal ExternalProvisioning 6s (x5 over 60s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
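The ExternalProvisioning event above means the claim is still waiting on the external CSI provisioner. A quick way to check whether the CephFS provisioner pods were evicted or rescheduled during the drain (the label selector is assumed from the default ODF/Rook CSI deployment names, not taken from this must-gather):

oc get pods -n openshift-storage -l app=csi-cephfsplugin-provisioner -o wide
oc get events -n openshift-storage --sort-by=.lastTimestamp | tail -n 20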
At the same time, the must-gather logs show that the PVC pvc-test-58a23346164f4ecf9390d244faa0450 was bound no later than 2m20s after it was created.
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
bg-fio-load-6763b pvc-test-632f4b61f3504874aac4795c7ab4e5d Bound pvc-84af41ea-8221-45f1-8c88-63e6729a6fb9 7Gi RWO ocs-storagecluster-ceph-rbd 8h
namespace-test-40830713a86b43cfa9b4da45b pvc-test-3b55e4a218824042af47a8b3b40d2d2 Bound pvc-86cae5e2-d721-4a5d-ae3f-7895740b9f7e 3Gi RWO ocs-storagecluster-ceph-rbd 2m41s
namespace-test-40830713a86b43cfa9b4da45b pvc-test-58a23346164f4ecf9390d244faa0450 Bound pvc-495e1572-49d1-4b2f-9fe9-ba93cc59fe69 3Gi RWO ocs-storagecluster-cephfs 2m20s
...
See /ocsci-jenkins/openshift-clusters/j-253vuf1cs33-t1/j-253vuf1cs33-t1_20230111T115745/logs/failed_testcase_ocs_logs_1673441794/test_node_maintenance[worker]_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1d2e3b562c65a633844855c65f0017e18d735d22ef3fbd51ff6fbb258ff23497/namespaces/all/pvc_all_namespaces
Full must-gather log: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-253vuf1cs33-t1/j-253vuf1cs33-t1_20230111T115745/logs/failed_testcase_ocs_logs_1673441794/test_node_maintenance%5bworker%5d_ocs_logs/ocs_must_gather/
Report portal: OCS4-12-Downstream-OCP4-12-VSPHERE6-UPI-FIPS-1AZ-RHCOS-VSAN-3M-3W-tier1-jenkins
The test has 9 failures logged in Report Portal, but it never failed when run from a local computer (5 test runs performed). Failures also happened earlier, the first dating back to September 2022. Therefore 60 sec for PVC creation after a node drain is clearly not enough.
Links to other Report Portal occurrences:
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8178/361568/361845/361847/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8175/361258/361533/361534/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8148/359492/359760/359761/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8144/359135/359405/359406/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8118/357719/357987/357988/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8102/356826/357082/357083/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8086/356026/356281/356282/log
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/362/8080/355418/355688/355690/log
@Daniel Osypenko, @Elad
Per your question: "What timeout will be efficient for PVC creation & bound after node drain?"
We do not have a performance test or a performance requirement for PVC creation and binding after a node drain.
The time limit for PVC creation in our tests without the node-drain flavour is 1 sec (for the PVC to reach status Bound), while the actual average results are much shorter than 1 sec (up to 0.2 sec) for 5/15/25 GB PVCs on both AWS and VMware LSO platforms.
Therefore a limit of 60 sec sounds like a very long time for a PVC to be created, compared to the results we have on a cluster without a node drain.
I would suggest asking the product team to define a limit for PVC creation during a node drain, rather than just increasing the limit to make the test pass.
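If such a limit is to be defined, time-to-Bound during a drain can be measured directly. A minimal sketch using the PVC name from this report (the 1-second poll loop is illustrative only, not the ocs-ci implementation):

start=$(date +%s)
until [ "$(oc get pvc pvc-test-58a23346164f4ecf9390d244faa0450 \
        -n namespace-test-40830713a86b43cfa9b4da45b \
        -o jsonpath='{.status.phase}')" = "Bound" ]; do
  sleep 1
done
echo "time to Bound: $(( $(date +%s) - start ))s"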