Description of problem (please be as detailed as possible and provide log snippets):

Creating a large number of clones of a CephFS PVC (10 GB) fails: around clone #95 the new PVC stays Pending and provisioning times out (see "Actual results" below).

Version of all relevant components (if applicable):

(yulienv) [ypersky@ypersky ocs-ci]$ oc version
Client Version: 4.6.0-0.nightly-2020-10-10-041109
Server Version: 4.6.0-0.nightly-2020-10-22-034051
Kubernetes Version: v1.19.0+d59ce34

(yulienv) [ypersky@ypersky ocs-ci]$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') rook version
rook: 4.6-67.afaf3353.release_4.6
go: go1.15.0

(yulienv) [ypersky@ypersky ocs-ci]$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph version
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

(yulienv) [ypersky@ypersky ocs-ci]$ oc describe pod $(oc get pods -n openshift-storage|grep operator|awk '{print$1}') -n openshift-storage|grep Image
containerImage: quay.io/ocs-dev/ocs-operator:4.6.0
Image:    quay.io/rhceph-dev/mcg-operator@sha256:8ce296e311cb4b0d73f2ad01473b1a8de92c0e7445894f485e6e03e2c5517725
Image ID: quay.io/rhceph-dev/mcg-operator@sha256:8ce296e311cb4b0d73f2ad01473b1a8de92c0e7445894f485e6e03e2c5517725
containerImage: quay.io/ocs-dev/ocs-operator:4.6.0
Image:    quay.io/rhceph-dev/ocs-operator@sha256:e1243317f52e64a0003d53dde8451dc08b00e5a9f02afa50aef5db67035e67e7
Image ID: quay.io/rhceph-dev/ocs-operator@sha256:4f45847238d8a241a33e163304934de0db0bbe8c511239007710bf49fd42b500
containerImage: quay.io/ocs-dev/ocs-operator:4.6.0
Image:    quay.io/rhceph-dev/rook-ceph@sha256:1f45cf3c5e3ccf1ca43f74275131ad70bcc025fa3b64eae6c21bb80d78a6f249
Image ID: quay.io/rhceph-dev/rook-ceph@sha256:1f45cf3c5e3ccf1ca43f74275131ad70bcc025fa3b64eae6c21bb80d78a6f249

(yulienv) [ypersky@ypersky ocs-ci]$ oc describe pod $(oc get pods -n openshift-storage|grep mon-a|awk '{print$1}') -n openshift-storage|grep Image
Image:    quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
Image ID: quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
Image:    quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
Image ID: quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
Image:    quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
Image ID: quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Cannot create the required number of clones of a CephFS PVC.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Run the test_pvc_clone_performance.py test (which includes test_multiple_clone_create_delete_performance) with num_of_clones set to at least 100. (A rough standalone sketch of the same clone loop is included under "Additional info" below.)

Actual results:
When the test gets to creating clone number 95 of the CephFS PVC (PVC size 10 GB), it fails after the 5 hour timeout with the following errors (the 95th clone is never created):

08:21:52 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-a31c53b1da484837962831b66229c2cb --kubeconfig /home/ypersky/ypersky-clusters/vmware/ypersky-perf/auth/kubeconfig get PersistentVolumeClaim clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd -n namespace-test-a31c53b1da484837962831b66229c2cb
08:21:53 - MainThread - ocs_ci.ocs.ocp - INFO - status of clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd at column STATUS was Pending, but we were waiting for Bound
08:21:53 - MainThread - ocs_ci.ocs.ocp - ERROR - timeout expired: Timed Out: (18000,)
08:21:53 - MainThread - ocs_ci.ocs.ocp - ERROR - Wait for PersistentVolumeClaim resource clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd at column STATUS to reach desired condition Bound failed, last actual status was Pending
08:21:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-a31c53b1da484837962831b66229c2cb --kubeconfig /home/ypersky/ypersky-clusters/vmware/ypersky-perf/auth/kubeconfig describe PersistentVolumeClaim clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd
08:21:55 - MainThread - ocs_ci.ocs.ocp - WARNING - Description of the resource(s) we were waiting for:
Name:          clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd
Namespace:     namespace-test-a31c53b1da484837962831b66229c2cb
StorageClass:  ocs-storagecluster-cephfs
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
DataSource:
  Kind:  PersistentVolumeClaim
  Name:  pvc-test-57f7954413f04d5188d5225024fda238
Mounted By:    <none>
Events:
  Type     Reason                Age                    From  Message
  ----     ------                ----                   ----  -------
  Warning  ProvisioningFailed    115m (x37 over 4h55m)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-76d985866b-trqwk_d50980ff-8638-4250-9bec-05c3b58f311a  (combined from similar events): failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Unknown desc = an error (exit status 31) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-fdc81702-17e6-11eb-81ce-0a580a800217 csi-vol-a18687bd-180c-11eb-81ce-0a580a800217 --group_name csi -m 172.30.181.205:6789,172.30.220.166:6789,172.30.118.126:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
  Normal   ExternalProvisioning  4m57s (x1213 over 5h)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
  Normal   Provisioning          4m23s (x68 over 5h)    openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-76d985866b-trqwk_d50980ff-8638-4250-9bec-05c3b58f311a  External provisioner is provisioning volume for claim "namespace-test-a31c53b1da484837962831b66229c2cb/clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd"
08:21:55 - MainThread - tests.helpers - ERROR - PersistentVolumeClaim clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd failed to reach Bound
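A minimal way to pull the underlying error for the stuck clone, for anyone triaging this (a sketch only; the namespace, PVC, and provisioner pod names are copied from the events above, and --all-containers is used so the exact container names do not have to be known):

```
# Names copied from the events above; adjust for your own run.
NS=namespace-test-a31c53b1da484837962831b66229c2cb
PVC=clone-pvc-test-57f7954413f04d-afebc8631ecf42ce86c05c2cbd90b8fd

# Confirm the clone is still Pending and see the latest provisioning events.
oc describe pvc "$PVC" -n "$NS"

# Look for the underlying ceph-csi error (e.g. the "exit status 31" above).
oc logs -n openshift-storage csi-cephfsplugin-provisioner-76d985866b-trqwk \
  --all-containers=true --tail=2000 | grep -iE 'error|stderr' | tail -n 40
```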
Expected results:
At least 512 clones should be created.

Additional info:
A link to the logs will be posted shortly.
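In case it helps to reproduce this outside ocs-ci, a rough sketch of the same scenario with plain oc is below. This is hypothetical: the namespace, the parent PVC name "parent-pvc", and the access mode are placeholders, not names from the actual test run; the parent PVC is assumed to already exist with data on it.

```
# Clone one 10 GB CephFS PVC ~100 times and watch for clones that stay Pending.
NS=clone-repro
for i in $(seq 1 100); do
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-${i}
  namespace: ${NS}
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: parent-pvc
EOF
  # Wait for each clone to bind before creating the next one.
  until [ "$(oc get pvc clone-${i} -n ${NS} -o jsonpath='{.status.phase}')" = "Bound" ]; do
    sleep 10
  done
done
```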
Yug, PTAL. Yuli, please provide must-gather logs.
Logs are available here (including must-gather; the upload is still in progress): http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1892234/
Reproduced the issue multiple times with a 10 GB PVC containing 2 GB of data. When creating approximately 100 clones, the last 5-10 clones remained Pending in my case, stuck at snapshot create. Once this threshold is reached, any further attempt to create PVCs or snapshots also stays in a PENDING state. After adding stdErr logging, the following was captured:
```
E1103 14:11:26.331208 1 util.go:39] ID: 6431 Req-ID: pvc-aac6ae48-20eb-439c-b68e-9d59121cf13f stdErr occured : Error EMLINK: error in mkdir /volumes/csi/csi-vol-194955fe-1ddd-11eb-a9e4-0a580a8002ca/.snap/csi-vol-75be277d-1dde-11eb-a9e4-0a580a8002ca
```
In some runs, failures to remove snapshots were also observed. Currently debugging further to find the root cause.
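For what it's worth, EMLINK from mkdir under .snap looks consistent with the per-directory snapshot limit being hit (if I recall correctly, the MDS caps snapshots per directory at 100 by default via mds_max_snaps_per_dir). A hedged way to check how many snapshots already exist on the parent subvolume from the toolbox pod, reusing the toolbox lookup from earlier in this bug (the subvolume name is taken from the log line above):

```
TOOLS=$(oc get pods -n openshift-storage | grep tool | awk '{print $1}')

# List the snapshots that already exist on the parent subvolume of the clone.
# If the count is at the MDS per-directory limit (100 by default), every
# further "fs subvolume snapshot create" will fail with EMLINK.
oc rsh -n openshift-storage "$TOOLS" \
  ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem \
  csi-vol-194955fe-1ddd-11eb-a9e4-0a580a8002ca --group_name csi
```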
Proposing this as a blocker and providing dev ack. This needs to get in as there is a bug in cephcsi which is causing stale clones/snapshots. Thanks Madhu and Yug for catching it. Yug will provide a detailed RCA.
There were a couple of issues on our side causing this:

1. Subvolumes were leaking on multiple requests because the appropriate omap could not be found.
2. This also led to snapshot leaking; the snapshot count hit 100 very quickly, so many later clone attempts stayed Pending because the maximum count was reached early.
3. The CephFS CSI driver still uses the ceph CLI for all CephFS operations, and when ceph-csi gets a request from Kubernetes it internally starts threads for each call. Due to the default PID limit, we hit a core dump caused by the resource crunch.
4. The ceph-csi code was missing support for the `Pending` state of a clone, so a clone in the Pending state could not be identified (see the sketch at the end of this comment).
5. Faulty error handling: in some places err was checked instead of stdErr, so some error checks always returned false.

The following PRs fix all of the above issues:

1. https://github.com/ceph/ceph-csi/pull/1671
2. https://github.com/ceph/ceph-csi/pull/1669
3. https://github.com/rook/rook/pull/6561

Thanks, Madhu and John, for the PRs. Thanks a lot, Madhu, for the debugging help :)
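For reference, a rough sketch of how the leaked subvolumes (issue 1) and the clone states (issue 4) can be observed from the toolbox on this cluster. The volume and group names match this setup; `csi-vol-<clone-uuid>` is a placeholder for an actual clone subvolume name, and this assumes the `fs clone status` command is available in this downstream Ceph build (it is the same interface ceph-csi queries):

```
TOOLS=$(oc get pods -n openshift-storage | grep tool | awk '{print $1}')

# Leaked subvolumes show up here even when no PVC references them any more.
oc rsh -n openshift-storage "$TOOLS" \
  ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi

# A CephFS clone moves through pending -> in-progress -> complete (or a
# failure state); issue 4 above is about the driver not recognising "pending".
oc rsh -n openshift-storage "$TOOLS" \
  ceph fs clone status ocs-storagecluster-cephfilesystem \
  csi-vol-<clone-uuid> --group_name csi
```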
To summarize, the CephFS clone feature works, but because of the issues Yug mentioned in the comment above it cannot scale to the required numbers. These are not corner cases and should have been caught earlier in testing; I would request QE to run full feature coverage, including scale/performance testing, as soon as possible.
Verified in 4.6.0-156.ci. Successfully created 512 clones of a CephFS PVC.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605