Bug 2100186
| Summary: | [GSS]Issue with CSI drivers and volume snapshots | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | csi-driver | Assignee: | yati padia <ypadia> |
| Status: | CLOSED NOTABUG | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.6 | CC: | assingh, hchiramm, hnallurv, kelwhite, khartsoe, madam, mhackett, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, pdonnell, r.martinez, ypadia |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-10-04 02:23:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
khover
2022-06-22 17:00:07 UTC
```
curl -s https://docs.kasten.io/tools/k10_primer.sh > primer
bash primer -c "storage csi-checker -s ocs-storagecluster-cephfs --runAsUser=1000"
```

The test deployed an app with a PVC and a snapshot:

```
Creating application
  -> Created pod (kubestr-csi-original-podm4ffx) and pvc (kubestr-csi-original-pvc24gsm)
Taking a snapshot
Cleaning up resources
  Error deleteing PVC (kubestr-csi-original-pvc24gsm) - (context deadline exceeded)
  Error deleteing Pod (kubestr-csi-original-podm4ffx) - (context deadline exceeded)
CSI Snapshot Walkthrough:
  Using annotated VolumeSnapshotClass (k10-clone-ocs-storagecluster-cephfsplugin-snapclass)
  Using annotated VolumeSnapshotClass (ocs-storagecluster-cephfsplugin-snapclass)
  Failed to create Snapshot: CSI Driver failed to create snapshot for PVC (kubestr-csi-original-pvc24gsm) in Namespace (default): Context done while polling: context deadline exceeded - Error
```

@ypadia Snapshots succeed when a PVC is freshly created and then snapshotted via the VolumeSnapshot object; it only fails when using Kasten. Strangely, the customer's production Kasten (K10) backups work as expected on the same OCS version.

I can set up access to the cluster since the must-gathers are failing. How would you like to approach that, remote session or remote access? Thanks.

I don't have any idea about Kasten or how it differs from the normal process, but you can share remote access and I can check the logs.

What will you need for remote access to the customer's cluster? I have never set this up in a customer environment before.

Must-gather works for me, but since we don't have it, access to the customer's cluster or a similar cluster also works for me.

@khover As a replacement for ocs-must-gather, the following details would also work instead of remote access to the cluster (a rough collection sketch follows after this comment):

1. Provisioner pod logs
2. mgr and mds logs
3. Subvolume info
4. `ceph -s`
5. Describe output for the PVC, PV, VolumeSnapshot and VolumeSnapshotContent

I see a few details are already shared in the customer portal here (https://access.redhat.com/support/cases/#/case/03248921/discussion?commentId=a0a6R00000SirHVQAZ), but that is not sufficient.

The customer has uploaded the requested data to the case.
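As a rough illustration only, the diagnostics listed above could be gathered along these lines. The namespace, deployment, and label names below are assumptions based on a typical ODF/OCS install (including an enabled rook-ceph toolbox) and are not taken from the case, so they may differ in this environment:

```
# Sketch only -- namespace, deployment, and label names are assumptions.
NS=openshift-storage

# 1. CephFS provisioner pod logs (the csi-snapshotter sidecar issues the snapshot RPCs)
oc -n "$NS" logs deploy/csi-cephfsplugin-provisioner -c csi-snapshotter > provisioner-snapshotter.log

# 2. mgr and mds logs
for p in $(oc -n "$NS" get pods -l 'app in (rook-ceph-mgr, rook-ceph-mds)' -o name); do
  oc -n "$NS" logs "$p" --all-containers > "$(basename "$p").log"
done

# 3. Subvolume info for the CephFS subvolumes backing the PVCs
oc -n "$NS" rsh deploy/rook-ceph-tools \
  ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi

# 4. Overall cluster health
oc -n "$NS" rsh deploy/rook-ceph-tools ceph -s

# 5. Describe output for the snapshot-related objects
oc describe pvc,volumesnapshot -n default  > snapshot-objects.txt
oc describe pv,volumesnapshotcontent      >> snapshot-objects.txt
```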
27692_OCS_logs_06242022.tar.gz

@khover On checking the logs, here is what I found.

From the provisioner logs:

```
CSI CreateSnapshot: snapshot-e0eea82c-7300-4d6d-a391-34b38c43cbc2
I0624 14:01:54.993796 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c/.snap/csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
I0624 14:01:54.993841 1 snapshot_controller.go:142] updateContentStatusWithEvent[snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b]
I0624 14:01:54.994046 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-750d0a3b-b404-4ea2-8a57-a9e99a6d844b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c/.snap/csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
```

The EMLINK error above occurs when the snapshot limit on the volume is reached, and hence the snapshot cannot be created. The same can be seen in the mgr logs, which report "Too many links" and therefore fail to create the snapshot directory:

```
debug 2022-06-24 13:19:00.244 7f46ded93700 -1 mgr.server reply reply (31) Too many links error in mkdir /volumes/csi/csi-vol-395e6ac7-94d0-11ec-9d87-0a580a830633/.snap/csi-snap-356b2bcb-f3c0-11ec-86af-0a580a83060b
```

Would this solution be applicable here for the customer? https://access.redhat.com/solutions/45676

Yes, that should work.

Hello Yati, the customer is unable to generate must-gathers due to etcd slowness and API issues being worked on in a parallel OCP case. Is there anything specific we can capture during the remote session scheduled for 6/28 2:30pm NA/EST?

Snapshot info for each PVC and subvolume info would be enough as of now.

Hey, in that case can we close this bug?

> 2. [ @mrajanna ] -- Assuming Patrick agrees with 'EDQUOT', can you update[4] & delete the failed volumesnapshot/volumesnapshotcontent?
> [4] - https://github.com/ceph/ceph-csi/blob/devel/internal/cephfs/core/snapshot.go#L90

From CephCSI we cannot delete the failed VolumeSnapshot/VolumeSnapshotContent. Do you want us to handle deleting the CephFS snapshot if CephFS snapshot creation fails? CephFS snapshot creation is a synchronous call; if CephFS fails to create the snapshot, it should take care of automatically deleting the failed snapshot.
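For context on the EMLINK ("Too many links") failure discussed above, the existing snapshot count on the affected subvolume can be compared against the MDS per-directory snapshot limit. A minimal sketch, run from the rook-ceph toolbox: the filesystem and subvolume names are copied from the provisioner log excerpt, while `mds_max_snaps_per_dir` is an assumption on my part about which limit is being hit (it caps snapshots per directory, default 100):

```
# Sketch only -- subvolume name taken from the provisioner log above;
# mds_max_snaps_per_dir is an assumed candidate for the limit being hit.

# List existing snapshots on the subvolume that failed to snapshot
ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem \
    csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c --group_name csi

# Show the per-directory snapshot limit enforced by the MDS
ceph config get mds mds_max_snaps_per_dir

# Once stale snapshots are confirmed unneeded, they can be removed individually, e.g.:
# ceph fs subvolume snapshot rm ocs-storagecluster-cephfilesystem \
#     csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c <snapshot-name> --group_name csi
```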