Description of problem (please be as detailed as possible and provide log snippets):

We are using Kasten (K10) to back up our applications, and part of that involves taking snapshots. We test-deployed an application with a PVC and tried to take a snapshot; it throws an error and appears to be timing out:

Type     Reason            Age    From                  Message
----     ------            ----   ----                  -------
Normal   CreatingSnapshot  3m22s  snapshot-controller   Waiting for a snapshot default/test-k10-snapshot to be created by the CSI driver.

Additionally, the customer is unable to upload a must-gather due to a timeout error. We suspect there is an issue with the CSI drivers and taking volume snapshots - we don't see any issues with the Ceph cluster.

NOTE: the customer's prod env Kasten (K10) backup works as expected on the same OCS version.

Per Kasten K10 support, the volume snapshot issue seems to be a bug associated with OCS. Please see below:

I have completed reviewing the csi provisioner log files. Here are my findings.
1. "failed to snapshot of volume" errors were observed 330 times.
2. The above errors were caused by ceph snapshot_controller.go:292. It appears to be a Ceph bug based on my research; please read the following post: https://bugzilla.redhat.com/show_bug.cgi?id=1892234
3. Due to the above error, we observed 2320 API calls, which took a significant amount of time. Not sure if this is the reason that you observed large etcd API calls.

Version of all relevant components (if applicable):
# ocs version 4.6.13

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Customer is unable to back up applications.

Is there any workaround available to the best of your knowledge?
No. Customer tried scaling deployment.apps/csi-cephfsplugin-provisioner down and back up with no relief.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
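For reference, a minimal sketch of how to check whether the stuck snapshot was ever bound to a VolumeSnapshotContent and what error the CSI driver reported (object names taken from the events above; the commands are standard oc/kubectl):

```bash
# Did the snapshot ever get bound to a VolumeSnapshotContent, and is it ready?
oc -n default get volumesnapshot test-k10-snapshot -o yaml

# The matching VolumeSnapshotContent (if one exists) carries the CSI driver's error
oc get volumesnapshotcontent | grep test-k10-snapshot

# Events on the snapshot object repeat the snapshot-controller message seen above
oc -n default describe volumesnapshot test-k10-snapshot
```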
The K10 primer check was run against the CephFS storage class:

```
curl -s https://docs.kasten.io/tools/k10_primer.sh > primer
bash primer -c "storage csi-checker -s ocs-storagecluster-cephfs --runAsUser=1000"
```

Test deployed an app with a PVC and snapshot. Output:

```
Creating application
  -> Created pod (kubestr-csi-original-podm4ffx) and pvc (kubestr-csi-original-pvc24gsm)
Taking a snapshot
Cleaning up resources
Error deleting PVC (kubestr-csi-original-pvc24gsm) - (context deadline exceeded)
Error deleting Pod (kubestr-csi-original-podm4ffx) - (context deadline exceeded)
CSI Snapshot Walkthrough:
  Using annotated VolumeSnapshotClass (k10-clone-ocs-storagecluster-cephfsplugin-snapclass)
  Using annotated VolumeSnapshotClass (ocs-storagecluster-cephfsplugin-snapclass)
  Failed to create Snapshot: CSI Driver failed to create snapshot for PVC (kubestr-csi-original-pvc24gsm) in Namespace (default): Context done while polling: context deadline exceeded - Error
```
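The "context deadline exceeded" on the test PVC/pod cleanup can simply mean finalizers are still held while the snapshot attempt is pending; a quick check, assuming the object names from the primer output above:

```bash
# Finalizers left on the primer's test PVC and pod
oc -n default get pvc kubestr-csi-original-pvc24gsm -o jsonpath='{.metadata.finalizers}{"\n"}'
oc -n default get pod kubestr-csi-original-podm4ffx -o jsonpath='{.metadata.finalizers}{"\n"}'

# Any VolumeSnapshots still pending against that PVC will keep cleanup blocked
oc -n default get volumesnapshot
```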
@ypadia Snapshots are successful when a PVC is freshly created and then snapshotted using a VolumeSnapshot object. It only fails when using Kasten. Strangely, the customer's prod env Kasten (K10) backup works as expected on the same OCS version. I can set up access to the cluster since the must-gathers are failing. How would you like to approach that, a remote session or remote access? Thanks
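For the record, the manual path that works is essentially the following sketch (the PVC and snapshot names are illustrative; the snapshot class is the OCS default seen in the primer output, and the API version shown is v1beta1, which matches the OCP 4.6 era - newer clusters use v1):

```bash
# Illustrative manual test: snapshot an existing CephFS PVC directly
cat <<'EOF' | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: manual-test-snap          # hypothetical name for this test
  namespace: default
spec:
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
  source:
    persistentVolumeClaimName: test-pvc   # hypothetical existing PVC
EOF

# The snapshot should report readyToUse: true within a minute or so
oc -n default get volumesnapshot manual-test-snap -o jsonpath='{.status.readyToUse}{"\n"}'
```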
I don't have much idea about Kasten and how it differs from the normal process, but you can share remote access and I can check the logs.
What will you need for remote access to the customer's cluster? I have never set this up with a customer env before.
Must-gather works for me, but since we don't have it, access to the customer's cluster or a similar cluster also works for me.
@khover As a replacement for ocs-must-gather, the following details would also work instead of remote access to the cluster (see the collection sketch after this list):
1. Provisioner pod logs
2. mgr and mds logs
3. Subvolume info
4. ceph -s
5. Describe output for PVC, PV, VolumeSnapshot and VolumeSnapshotContent

I see a few details are already shared in the customer portal here (https://access.redhat.com/support/cases/#/case/03248921/discussion?commentId=a0a6R00000SirHVQAZ), but that is not sufficient.
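A rough sketch of how those items can be collected, assuming the usual OCS defaults (namespace openshift-storage, the standard provisioner container names, and a rook-ceph toolbox deployment); names may differ on the customer cluster:

```bash
# 1. CSI CephFS provisioner logs (snapshotter and plugin containers)
oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-snapshotter
oc -n openshift-storage logs deploy/csi-cephfsplugin-provisioner -c csi-cephfsplugin

# 2. mgr and mds logs (deployment names vary; list them first)
oc -n openshift-storage get deploy | grep -E 'rook-ceph-(mgr|mds)'

# 3-4. From a shell in the rook-ceph toolbox pod (if deployed), run:
#   ceph -s
#   ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi
#   ceph fs subvolume info ocs-storagecluster-cephfilesystem <subvolume-name> --group_name csi
oc -n openshift-storage rsh deploy/rook-ceph-tools

# 5. Kubernetes-side objects
oc -n default describe pvc,volumesnapshot
oc describe pv,volumesnapshotcontent
```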
The customer has uploaded the requested data to the case: 27692_OCS_logs_06242022.tar.gz
@khover On checking the logs, here is what I found.

From the provisioner logs:
```
CSI CreateSnapshot: snapshot-e0eea82c-7300-4d6d-a391-34b38c43cbc2
I0624 14:01:54.993796 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c/.snap/csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c csi-snap-3414aff2-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
I0624 14:01:54.993841 1 snapshot_controller.go:142] updateContentStatusWithEvent[snapcontent-ef6afab5-e6cc-4d42-9303-3f7d6b072f6b]
I0624 14:01:54.994046 1 snapshot_controller.go:292] createSnapshotWrapper: CreateSnapshot for content snapcontent-750d0a3b-b404-4ea2-8a57-a9e99a6d844b returned error: rpc error: code = Internal desc = an error (exit status 31) and stdError (Error EMLINK: error in mkdir /volumes/csi/csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c/.snap/csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b ) occurred while running ceph args: [fs subvolume snapshot create ocs-storagecluster-cephfilesystem csi-vol-2c43ca03-285b-11ec-b5d9-0a580a83040c csi-snap-33f966c4-f3c6-11ec-86af-0a580a83060b --group_name csi -m 172.30.37.32:6789,172.30.194.45:6789,172.30.89.65:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
```

The above Error EMLINK occurs when the snapshot limit is reached on the volume, and hence it fails to create the snapshot. The same can be seen in the mgr logs, which report "Too many links" and fail to create the dir:
```
debug 2022-06-24 13:19:00.244 7f46ded93700 -1 mgr.server reply reply (31) Too many links error in mkdir /volumes/csi/csi-vol-395e6ac7-94d0-11ec-9d87-0a580a830633/.snap/csi-snap-356b2bcb-f3c0-11ec-86af-0a580a83060b
```
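To confirm, it may help to count how many snapshots have accumulated on the affected subvolumes and compare that against the MDS snapshot cap. A rough sketch, run from the toolbox; the subvolume name is taken from the log paths above, and treating mds_max_snaps_per_dir as the relevant limit is an assumption on my part:

```bash
# List snapshots on one of the affected subvolumes and count the entries
ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem \
    csi-vol-3531bc42-2c3f-11ec-b5d9-0a580a83040c --group_name csi

# Assumption: the EMLINK / "Too many links" failure corresponds to the MDS
# per-directory snapshot cap; check the currently configured value
ceph config get mds mds_max_snaps_per_dir
```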
Would this solution be applicable here for the customer? https://access.redhat.com/solutions/45676
Yes, that should work.
Hello Yati, the customer is unable to generate must-gathers due to etcd slowness and API issues being worked on in a parallel OCP case. Is there anything specific needed that we can capture during the remote session scheduled for 6/28 at 2:30 PM NA/EST?
Snapshot info for each PVC and the subvolume info would be enough for now.
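A sketch of how to map each PVC to its CephFS subvolume so the right snapshot/subvolume info gets pulled; the `<pvc-name>`/`<uuid>` placeholders are illustrative, and the assumption that the trailing UUID of the CSI volumeHandle maps to a `csi-vol-<uuid>` subvolume is based on the csi-vol-* names seen in the provisioner logs above:

```bash
# For each PVC, find the backing PV and its CSI volumeHandle
PV=$(oc -n default get pvc <pvc-name> -o jsonpath='{.spec.volumeName}')
oc get pv "$PV" -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'

# Then, from the toolbox, pull the subvolume and snapshot info for that volume
ceph fs subvolume info ocs-storagecluster-cephfilesystem csi-vol-<uuid> --group_name csi
ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem csi-vol-<uuid> --group_name csi
```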
Hey, in that case, can we close this bug?
2. [ @mrajanna ] -- Assuming Patrick agrees with 'EDQUOT', can you update [4] & delete the failed volumesnapshot/volumesnapshotcontent?

[4] - https://github.com/ceph/ceph-csi/blob/devel/internal/cephfs/core/snapshot.go#L90

From CephCSI we cannot delete the failed volumesnapshot/volumesnapshotcontent. Do you want us to handle deleting the CephFS snapshot if CephFS snapshot creation fails? The CephFS snapshot create is a synchronous call; if Ceph fails to create the snapshot, it should take care of automatically deleting the failed snapshot.
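Since CephCSI will not clean these up itself, a minimal sketch of the manual cleanup for objects a failed attempt leaves behind (names below are illustrative, following the log format above; this is a workaround sketch, not the fix under discussion):

```bash
# Remove the Kubernetes-side objects; delete the VolumeSnapshot first so the
# content can be released (a stuck content may need its finalizers inspected)
oc -n default delete volumesnapshot <failed-snapshot-name>
oc delete volumesnapshotcontent <snapcontent-name>

# From the toolbox, remove a leftover snapshot on the CephFS subvolume, if one
# was partially created
ceph fs subvolume snapshot rm ocs-storagecluster-cephfilesystem \
    csi-vol-<uuid> csi-snap-<uuid> --group_name csi
```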