Description of problem (please be as detailed as possible and provide log snippets):

We are using Kasten (K10) to back up our applications, and part of that involves taking snapshots. We test-deployed an application with a PVC and tried to take a snapshot; it throws an error and appears to be timing out:

Type     Reason            Age    From                  Message
----     ------            ----   ----                  -------
Normal   CreatingSnapshot  3m22s  snapshot-controller   Waiting for a snapshot default/test-k10-snapshot to be created by the CSI driver.

Additionally, the customer is unable to upload a must-gather due to a timeout error. We suspect there is an issue with the CSI drivers and taking volume snapshots - we don't see any issues with the Ceph cluster.

NOTE: the customer's prod env Kasten (K10) backup works as expected on the same OCS version.

Per Kasten K10 support, the volume snapshot issue seems to be a bug associated with OCS. Please see below:

I have completed reviewing the csi provisioner log files. Here are my findings.
1. "failed to snapshot of volume" errors were observed 330 times.
2. The above errors were caused by ceph snapshot_controller.go:292. It appears to be a Ceph bug based on my research; please read the following post: https://bugzilla.redhat.com/show_bug.cgi?id=1892234
3. Due to the above error, we observed 2320 API calls, which took a significant amount of time. Not sure if this is the reason that you observed large etcd API calls.

Version of all relevant components (if applicable):
# ocs version 4.6.13

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Customer is unable to back up applications.

Is there any workaround available to the best of your knowledge?
No. Customer tried scaling deployment.apps/csi-cephfsplugin-provisioner down and back up with no relief.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
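For reference, a minimal sketch of how to check whether the stuck snapshot was ever bound to a VolumeSnapshotContent and what error the CSI driver reported (object names taken from the events above; the commands are standard oc/kubectl):

```bash
# Did the snapshot ever get bound to a VolumeSnapshotContent, and is it ready?
oc -n default get volumesnapshot test-k10-snapshot -o yaml

# The matching VolumeSnapshotContent (if one exists) carries the CSI driver's error
oc get volumesnapshotcontent | grep test-k10-snapshot

# Events on the snapshot object repeat the snapshot-controller message seen above
oc -n default describe volumesnapshot test-k10-snapshot
```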
The K10 primer check was run against the CephFS storage class:

```
curl -s https://docs.kasten.io/tools/k10_primer.sh > primer
bash primer -c "storage csi-checker -s ocs-storagecluster-cephfs --runAsUser=1000"
```

Test deployed an app with a PVC and snapshot. Output:

```
Creating application
  -> Created pod (kubestr-csi-original-podm4ffx) and pvc (kubestr-csi-original-pvc24gsm)
Taking a snapshot
Cleaning up resources
Error deleting PVC (kubestr-csi-original-pvc24gsm) - (context deadline exceeded)
Error deleting Pod (kubestr-csi-original-podm4ffx) - (context deadline exceeded)
CSI Snapshot Walkthrough:
  Using annotated VolumeSnapshotClass (k10-clone-ocs-storagecluster-cephfsplugin-snapclass)
  Using annotated VolumeSnapshotClass (ocs-storagecluster-cephfsplugin-snapclass)
  Failed to create Snapshot: CSI Driver failed to create snapshot for PVC (kubestr-csi-original-pvc24gsm) in Namespace (default): Context done while polling: context deadline exceeded - Error
```
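The "context deadline exceeded" on the test PVC/pod cleanup can simply mean finalizers are still held while the snapshot attempt is pending; a quick check, assuming the object names from the primer output above:

```bash
# Finalizers left on the primer's test PVC and pod
oc -n default get pvc kubestr-csi-original-pvc24gsm -o jsonpath='{.metadata.finalizers}{"\n"}'
oc -n default get pod kubestr-csi-original-podm4ffx -o jsonpath='{.metadata.finalizers}{"\n"}'

# Any VolumeSnapshots still pending against that PVC will keep cleanup blocked
oc -n default get volumesnapshot
```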
@ypadia Snapshots are successful when a PVC is freshly created and then snapshotted using a VolumeSnapshot object. It only fails when using Kasten. Strangely, the customer's prod env Kasten (K10) backup works as expected on the same OCS version. I can set up access to the cluster since the must-gathers are failing. How would you like to approach that, a remote session or remote access? Thanks
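For the record, the manual path that works is essentially the following sketch (the PVC and snapshot names are illustrative; the snapshot class is the OCS default seen in the primer output, and the API version shown is v1beta1, which matches the OCP 4.6 era - newer clusters use v1):

```bash
# Illustrative manual test: snapshot an existing CephFS PVC directly
cat <<'EOF' | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: manual-test-snap          # hypothetical name for this test
  namespace: default
spec:
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
  source:
    persistentVolumeClaimName: test-pvc   # hypothetical existing PVC
EOF

# The snapshot should report readyToUse: true within a minute or so
oc -n default get volumesnapshot manual-test-snap -o jsonpath='{.status.readyToUse}{"\n"}'
```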
I don't have much idea about Kasten and how it differs from the normal process, but you can share remote access and I can check the logs.
What will you need for remote access to the customer's cluster? I have never set this up with a customer env before.
Must-gather works for me, but since we don't have it, access to the customer's cluster or a similar cluster also works for me.
@khover As a replacement for ocs-must-gather, the following details would also work instead of remote access to the cluster (see the collection sketch after this list):
1. Provisioner pod logs
2. mgr and mds logs
3. Subvolume info
4. ceph -s
5. Describe output for PVC, PV, VolumeSnapshot and VolumeSnapshotContent

I see a few details are already shared in the customer portal here (https://access.redhat.com/support/cases/#/case/03248921/discussion?commentId=a0a6R00000SirHVQAZ), but that is not sufficient.
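A rough sketch of how those items can be collected, assuming the usual OCS defaults (namespace openshift-storage, the standard provisioner container names, and a rook-ceph toolbox deployment); names may differ on the customer cluster:

```bash
# 1. CSI CephFS provisioner logs (snapshotter and plugin containers)
oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-snapshotter
oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-cephfsplugin

# 2. mgr and mds logs (deployment names vary; list them first)
oc -n openshift-storage get deploy | grep -E 'rook-ceph-(mgr|mds)'

# 3-4. From a shell in the rook-ceph toolbox pod (if deployed), run:
#   ceph -s
#   ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi
#   ceph fs subvolume info ocs-storagecluster-cephfilesystem <subvolume-name> --group_name csi
oc -n openshift-storage rsh deploy/rook-ceph-tools

# 5. Kubernetes-side objects
oc -n default describe pvc,volumesnapshot
oc describe pv,volumesnapshotcontent
```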
The customer has uploaded the requested data to the case: 27692_OCS_logs_06242022.tar.gz
@khover On checking the logs, here is what I found.

From the provisioner logs:
```
CSI CreateSnapshot: snapshot-e0eea82c-7300-4d6d-a391-34b38c43cbc2
I0624 14:01:54.993796 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c/.snap/csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
I0624 14:01:54.993841 1 snapshot_controller.go:142] updateContentStatusWithEvent[snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b]
I0624 14:01:54.994046 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-750d0a3b-b404-4ea2-8a57-a9e99a6d844b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c/.snap/csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
```

The above Error EMLINK occurs when the snapshot limit is reached on the volume, and hence it fails to create the snapshot. The same can be seen in the mgr logs, which report "Too many links" and fail to create the dir:
```
debug 2022-06-24 13:19:00.244 7f46ded93700 -1 mgr.server reply reply (31) Too many links error in mkdir /volumes/csi/csi-vol-395e6ac7-94d0-11ec-9d87-0a580a830633/.snap/csi-snap-356b2bcb-f3c0-11ec-86af-0a580a83060b
```
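To confirm, it may help to count how many snapshots have accumulated on the affected subvolumes and compare that against the MDS snapshot cap. A rough sketch, run from the toolbox; the subvolume name is taken from the log paths above, and treating mds_max_snaps_per_dir as the relevant limit is an assumption on my part:

```bash
# List snapshots on one of the affected subvolumes and count the entries
ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem \
    csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c --group_name csi

# Assumption: the EMLINK / "Too many links" failure corresponds to the MDS
# per-directory snapshot cap; check the currently configured value
ceph config get mds mds_max_snaps_per_dir
```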
Would this solution be applicable here for the customer? https://access.redhat.com/solutions/45676
Yes, that should work.
Hello Yati, the customer is unable to generate must-gathers due to etcd slowness and API issues being worked on in a parallel OCP case. Is there anything specific needed that we can capture during the remote session scheduled for 6/28 at 2:30 PM NA/EST?
Snapshot info for each PVC and the subvolume info would be enough for now.
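A sketch of how to map each PVC to its CephFS subvolume so the right snapshot/subvolume info gets pulled; the `<pvc-name>`/`<uuid>` placeholders are illustrative, and the assumption that the trailing UUID of the CSI volumeHandle maps to a `csi-vol-<uuid>` subvolume is based on the csi-vol-* names seen in the provisioner logs above:

```bash
# For each PVC, find the backing PV and its CSI volumeHandle
PV=$(oc -n default get pvc <pvc-name> -o jsonpath='{.spec.volumeName}')
oc get pv "$PV" -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'

# Then, from the toolbox, pull the subvolume and snapshot info for that volume
ceph fs subvolume info ocs-storagecluster-cephfilesystem csi-vol-<uuid> --group_name csi
ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem csi-vol-<uuid> --group_name csi
```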
Hey, in that case, can we close this bug?
2. [ @mrajanna ] -- Assuming Patrick agrees with 'EDQUOT', can you update [4] & delete the failed volumesnapshot/volumesnapshotcontent?

[4] - https://github.com/ceph/ceph-csi/blob/devel/internal/cephfs/core/snapshot.go#L90

From CephCSI we cannot delete the failed volumesnapshot/volumesnapshotcontent. Do you want us to handle deleting the CephFS snapshot if CephFS snapshot creation fails? The CephFS snapshot create is a synchronous call; if Ceph fails to create the snapshot, it should take care of automatically deleting the failed snapshot.
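Since CephCSI will not clean these up itself, a minimal sketch of the manual cleanup for objects a failed attempt leaves behind (names below are illustrative, following the log format above; this is a workaround sketch, not the fix under discussion):

```bash
# Remove the Kubernetes-side objects; delete the VolumeSnapshot first so the
# content can be released (a stuck content may need its finalizers inspected)
oc -n default delete volumesnapshot <failed-snapshot-name>
oc delete volumesnapshotcontent <snapcontent-name>

# From the toolbox, remove a leftover snapshot on the CephFS subvolume, if one
# was partially created
ceph fs subvolume snapshot rm ocs-storagecluster-cephfilesystem \
    csi-vol-<uuid> csi-snap-<uuid> --group_name csi
```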