Description of problem (please be as detailed as possible and provide log snippets):

After restoring from an OADP backup with CephFS CSI volumes, one PVC, "wmla-conda", is consistently stuck Pending. The PVC is around 13GB and 185K files.

```
[chudg@wml1x171 /home/chudg]$ oc get pvc
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
wmla-conda         Pending                                                                        ocs-storagecluster-cephfs   33h
wmla-cws-share     Bound     pvc-3aa99b42-922b-435a-aa92-2b9ba181b489   15Gi       RWX            ocs-storagecluster-cephfs   33h
wmla-edi           Bound     pvc-00a8de84-9ad3-4523-add0-a2ed8ebb522a   10Gi       RWX            ocs-storagecluster-cephfs   33h
wmla-infoservice   Bound     pvc-09009aa2-8b6c-4be7-a24f-f9a3d0af6b2b   4Gi        RWX            ocs-storagecluster-cephfs   33h
wmla-logging       Bound     pvc-6e2399f8-2938-47a8-84fa-fcbcceed9934   10Gi       RWX            ocs-storagecluster-cephfs   33h
wmla-mygpfs        Bound     pvc-2d74707d-870e-4e9c-a74b-93beb77ec52a   20Gi       RWX            ocs-storagecluster-cephfs   33h

[chudg@wml1x171 /home/chudg]$ oc describe pvc wmla-conda
Name:          wmla-conda
Namespace:     chudg
StorageClass:  ocs-storagecluster-cephfs
Status:        Pending
Volume:
Labels:        app.kubernetes.io/instance=wmla-cr---wmla
               app.kubernetes.io/managed-by=Ansible
               app.kubernetes.io/name=wmla-dlpd
               app.kubernetes.io/version=1.0.0
               env=conda-share
               icpd-addon/status=1618990653296570
               icpdsupport/addOnId=wml-accelerator
               icpdsupport/addOnName=WML-Accelerator
               icpdsupport/addOnType=component
               icpdsupport/app=wml-accelerator
               icpdsupport/controlPlaneNamespace=zen-chudg
               icpdsupport/serviceInstanceId=1618990653296570
               release=wmla
               velero.io/backup-name=b8
               velero.io/restore-name=r8
               velero.io/volume-snapshot-name=velero-wmla-conda-rldfg
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      velero-wmla-conda-rldfg
Mounted By:  wmla-conda-849674cf4c-vk8q7
Events:
  Type    Reason                Age                   From                                                                                                                       Message
  ----    ------                ----                  ----                                                                                                                       -------
  Normal  Provisioning          82s (x170 over 13h)   openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-66c59d467f-7hvc6_8d07eb8b-3b62-4dd6-91ec-d80aa49c92d8  External provisioner is provisioning volume for claim "chudg/wmla-conda"
  Normal  ExternalProvisioning  25s (x4143 over 16h)  persistentvolume-controller                                                                                                waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
```

```
oc logs csi-cephfsplugin-provisioner-66c59d467f-7hvc6 -c csi-provisioner
I0422 21:04:06.986781       1 connection.go:186] GRPC error: rpc error: code = Aborted desc = in progress
I0422 21:04:06.986848       1 controller.go:642] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = Aborted desc = in progress
I0422 21:04:06.986926       1 controller.go:1091] Temporary error received, adding PVC 4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4 to claims in progress
W0422 21:04:06.986952       1 controller.go:943] Retrying syncing claim "4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4", failure 157
E0422 21:04:06.986976       1 controller.go:966] error syncing claim "4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4": failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = in progress
I0422 21:04:06.987013       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"chudg", Name:"wmla-conda", UID:"4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4", APIVersion:"v1", ResourceVersion:"296904287", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = in progress
I0422 21:09:06.987266       1 controller.go:1317] provision "chudg/wmla-conda" class "ocs-storagecluster-cephfs": started
I0422 21:09:06.987484       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"chudg", Name:"wmla-conda", UID:"4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4", APIVersion:"v1", ResourceVersion:"296904287", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "chudg/wmla-conda"
```

Version of all relevant components (if applicable):
OADP 0.2.1 with CSI plugin
OCP 4.6.8
OCS 4.6.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Using OADP as a backup and restore utility, users cannot restore certain CephFS CSI volumes. It's unknown what's special about the PVC.

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 4

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? No

If this is a regression, please provide more details to justify this: n/a

Steps to Reproduce:
I haven't been able to reproduce this with a standalone application, so it may be specific to the application being backed up. The general steps are:
1. Perform an OADP backup of the Cloud Pak for Data wml-accelerator application. The storage class is ocs-storagecluster-cephfs. The particular PVC having problems is around 13GB with 185000 files.
2. Delete the namespace.
3. Restore from the backup (a hedged sketch of the corresponding velero commands follows below).

Actual results:
The restore of PVC "wmla-conda" is stuck Pending; no PV is created. In the first occurrence, the wmla team waited over 2 days. The wmla team retried the steps on a different cluster and hit the same issue, with the same PVC now stuck for 33+ hours.

Expected results:
After an OADP restore, all PVCs should be Bound.
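For reference, a minimal sketch of the reproduction flow as driven by the velero CLI. This is an assumption-laden convenience sketch, not the exact commands the wmla team ran: the backup/restore names b8 and r8 are taken from the PVC labels above, and it assumes the velero CLI is installed and the OADP CSI plugin and a velero-labeled VolumeSnapshotClass are already configured.

```
# Hedged sketch of the backup/restore flow (names b8/r8 match the PVC labels).
velero backup create b8 --include-namespaces chudg

# Simulate the data-loss scenario by deleting the application namespace.
oc delete namespace chudg

# Restore everything captured by backup b8.
velero restore create r8 --from-backup b8

# All restored PVCs should reach Bound; wmla-conda stays Pending instead.
oc -n chudg get pvc -w
```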
Additional info: ceph status looks OK.

```
sh-4.4# ceph -s
  cluster:
    id:     7b132412-410e-4e70-bb91-e9c4d2bbdf11
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,e (age 2h)
    mgr: a(active, since 9h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2h), 3 in (since 5w)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 1.74M objects, 129 GiB
    usage:   544 GiB used, 1.8 TiB / 2.3 TiB avail
    pgs:     176 active+clean

  io:
    client:   1.4 MiB/s rd, 1.5 MiB/s wr, 78 op/s rd, 363 op/s wr

sh-4.4# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED
    hdd       2.3 TiB     1.8 TiB     532 GiB     545 GiB         22.69
    TOTAL     2.3 TiB     1.8 TiB     532 GiB     545 GiB         22.69

POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    ocs-storagecluster-cephblockpool                           1     3.1 GiB         877     9.4 GiB      0.63       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.control             2         0 B           8         0 B         0       498 GiB
    ocs-storagecluster-cephfilesystem-metadata                 3     610 MiB     246.12k     1.1 GiB      0.07       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.meta                4     3.2 KiB          12     1.9 MiB         0       498 GiB
    ocs-storagecluster-cephfilesystem-data0                    5     125 GiB       1.50M     522 GiB     25.88       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.log                 6     1.2 MiB         212     7.6 MiB         0       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.index       7       342 B          22       342 B         0       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      8         0 B           0         0 B         0       498 GiB
    .rgw.root                                                  9     4.7 KiB          16     2.8 MiB         0       498 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.data       10       1 KiB           1     192 KiB         0       498 GiB
```
Created attachment 1774628 [details] csi-cephfsplugin-provisioner logs
Looks like the snapshot from which the restore was attempted is not proper:

```
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      velero-wmla-conda-rldfg
```

```
I0422 08:31:10.363168       1 controller.go:964] VolumeContentSource_Snapshot {Snapshot:snapshot_id:"0001-0011-openshift-storage-0000000000000001-c1d7d4e1-a29a-11eb-b5d8-0a580a800542" }
I0422 08:31:10.363264       1 controller.go:972] Requested volume size is 21474836480 and snapshot size is 0 for the source snapshot velero-wmla-conda-rldfg
W0422 08:31:10.363276       1 controller.go:979] requested volume size 21474836480 is greater than the size 0 for the source snapshot velero-wmla-conda-rldfg. Volume plugin needs to handle volume expansion.
```

But if I interpret the logs correctly, this is not what is causing the issue. CreateVolume is always returning with this error:

```
I0422 08:31:54.987982       1 connection.go:185] GRPC response: {}
I0422 08:31:54.988093       1 connection.go:186] GRPC error: rpc error: code = Aborted desc = in progress
I0422 08:31:54.988126       1 controller.go:642] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = Aborted desc = in progress
I0422 08:32:32.633750       1 controller.go:1091] Temporary error received, adding PVC 4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4 to claims in progress
W0422 08:32:32.633771       1 controller.go:943] Retrying syncing claim "4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4", failure 6
E0422 08:32:32.633797       1 controller.go:966] error syncing claim "4b61e5ee-0ffc-41d5-b2a8-fa6ee5fd6ed4": failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = in progress
```

This means that the request to create a clone is never completing on the Ceph side. If we have access to the Ceph cluster, we should check why the clone is not getting created there.
The wmla team still has the cluster, but probably not for much longer. Let me know what is needed.

```
$ oc get pvc
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
wmla-conda         Pending                                                                        ocs-storagecluster-cephfs   5d12h
wmla-cws-share     Bound     pvc-3aa99b42-922b-435a-aa92-2b9ba181b489   15Gi       RWX            ocs-storagecluster-cephfs   5d12h
wmla-edi           Bound     pvc-00a8de84-9ad3-4523-add0-a2ed8ebb522a   10Gi       RWX            ocs-storagecluster-cephfs   5d12h
wmla-infoservice   Bound     pvc-09009aa2-8b6c-4be7-a24f-f9a3d0af6b2b   4Gi        RWX            ocs-storagecluster-cephfs   5d12h
wmla-logging       Bound     pvc-6e2399f8-2938-47a8-84fa-fcbcceed9934   10Gi       RWX            ocs-storagecluster-cephfs   5d12h
wmla-mygpfs        Bound     pvc-2d74707d-870e-4e9c-a74b-93beb77ec52a   20Gi       RWX            ocs-storagecluster-cephfs   5d12h
```
@henrychi, if you can provide access to the cluster, that would be helpful. Otherwise, can you get the output of the below command from the Ceph cluster?

```
ceph fs clone status ocs-storagecluster-cephfilesystem csi-vol-118b914c-a38d-11eb-9e8b-0a580a81000d --group_name csi
```
Hi @henrychi.com,

As requested in https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c5, can you please share the clone status? It would also be great if we could get the must-gather.
On reproducing, this issue was encountered both in a local velero setup and when setting up velero via OADP. After restoring a backup, the PVC did not come back to the Bound state and stayed Pending with the following error:

```
Warning  ProvisioningFailed  148m (x15 over 176m)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-66c59d467f-2pjgw_b9d3b9a4-818c-42a1-b585-b9f7202694ca  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"
```

csi-cephfsplugin:

```
[ygupta@localhost ~]$ oc logs csi-cephfsplugin-provisioner-66c59d467f-2pjgw csi-cephfsplugin -n openshift-storage | grep pvc-ab71a168-21f9-430c-b89a-80a416de477f
I0429 09:07:07.290072       1 utils.go:159] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f GRPC call: /csi.v1.Controller/CreateVolume
I0429 09:07:07.291693       1 utils.go:160] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-ab71a168-21f9-430c-b89a-80a416de477f","parameters":{"clusterID":"openshift-storage","fsName":"ocs-storagecluster-cephfilesystem"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Snapshot":{"snapshot_id":"0001-0011-openshift-storage-0000000000000001-6702cdf8-a8c9-11eb-9bc2-0a580a81022a"}}}}
I0429 09:07:07.607927       1 cephcmds.go:59] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f command succeeded: ceph [-m 172.30.33.171:6789,172.30.213.18:6789,172.30.255.196:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs get ocs-storagecluster-cephfilesystem --format=json]
I0429 09:07:07.919624       1 cephcmds.go:59] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f command succeeded: ceph [-m 172.30.33.171:6789,172.30.213.18:6789,172.30.255.196:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs ls --format=json]
I0429 09:07:08.234498       1 cephcmds.go:59] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f command succeeded: ceph [-m 172.30.33.171:6789,172.30.213.18:6789,172.30.255.196:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs dump --format=json]
I0429 09:07:08.547046       1 cephcmds.go:59] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f command succeeded: ceph [-m 172.30.33.171:6789,172.30.213.18:6789,172.30.255.196:6789 --id csi-cephfs-provisioner --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs ls --format=json]
E0429 09:07:08.548774       1 omap.go:77] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f omap not found (pool="ocs-storagecluster-cephfilesystem-metadata", namespace="csi", name="csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"): rados: ret=-2, No such file or directory
W0429 09:07:08.548805       1 voljournal.go:624] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f unable to read omap keys: pool or key missing: key not found: rados: ret=-2, No such file or directory
E0429 09:07:08.548841       1 utils.go:163] ID: 50 Req-ID: pvc-ab71a168-21f9-430c-b89a-80a416de477f GRPC error: rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"
```

csi-cephfsplugin-provisioner:

```
[ygupta@localhost ~]$ oc logs csi-cephfsplugin-provisioner-66c59d467f-2pjgw csi-provisioner -n openshift-storage
I0429 12:36:35.831384       1 controller.go:964] VolumeContentSource_Snapshot {Snapshot:snapshot_id:"0001-0011-openshift-storage-0000000000000001-6702cdf8-a8c9-11eb-9bc2-0a580a81022a" }
I0429 12:36:35.831404       1 controller.go:972] Requested volume size is 1073741824 and snapshot size is 0 for the source snapshot velero-test-pvc-q2zw7
W0429 12:36:35.831423       1 controller.go:979] requested volume size 1073741824 is greater than the size 0 for the source snapshot velero-test-pvc-q2zw7. Volume plugin needs to handle volume expansion.
I0429 12:36:35.831431       1 controller.go:570] CreateVolumeRequest {Name:pvc-ab71a168-21f9-430c-b89a-80a416de477f CapacityRange:required_bytes:1073741824 VolumeCapabilities:[mount:<> access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[clusterID:openshift-storage csi.storage.k8s.io/controller-expand-secret-name:rook-csi-cephfs-provisioner csi.storage.k8s.io/controller-expand-secret-namespace:openshift-storage csi.storage.k8s.io/node-stage-secret-name:rook-csi-cephfs-node csi.storage.k8s.io/node-stage-secret-namespace:openshift-storage csi.storage.k8s.io/provisioner-secret-name:rook-csi-cephfs-provisioner csi.storage.k8s.io/provisioner-secret-namespace:openshift-storage fsName:ocs-storagecluster-cephfilesystem] Secrets:map[] VolumeContentSource:snapshot:<snapshot_id:"0001-0011-openshift-storage-0000000000000001-6702cdf8-a8c9-11eb-9bc2-0a580a81022a" > AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0429 12:36:35.834643       1 connection.go:182] GRPC call: /csi.v1.Controller/CreateVolume
I0429 12:36:35.834656       1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-ab71a168-21f9-430c-b89a-80a416de477f","parameters":{"clusterID":"openshift-storage","fsName":"ocs-storagecluster-cephfilesystem"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Snapshot":{"snapshot_id":"0001-0011-openshift-storage-0000000000000001-6702cdf8-a8c9-11eb-9bc2-0a580a81022a"}}}}
I0429 12:36:37.140280       1 connection.go:185] GRPC response: {}
I0429 12:36:37.140321       1 connection.go:186] GRPC error: rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"
I0429 12:36:37.140348       1 controller.go:642] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Finished: rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"
I0429 12:36:37.140390       1 controller.go:1084] Final error received, removing PVC ab71a168-21f9-430c-b89a-80a416de477f from claims in progress
```

We need further analysis to identify the reason for this behavior.
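For what it's worth, the missing omap object named in that error can be inspected directly from the rook-ceph toolbox. This is a hedged diagnostic sketch: the pool and `csi` RADOS namespace are taken from the `omap not found` log line above, and `listomapvals` will simply return ENOENT if the object is really gone.

```
# Hedged diagnostic: check whether the snapshot's journal object still exists
# in the metadata pool (pool/namespace/name taken from the error above).
rados -p ocs-storagecluster-cephfilesystem-metadata --namespace csi \
    listomapvals csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a

# Listing all csi.snap.* objects in the namespace shows whether any snapshot
# journal entries survived at all.
rados -p ocs-storagecluster-cephfilesystem-metadata --namespace csi ls | grep csi.snap
```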
That's great Yug, thanks for reproducing it. Can you also check the status of the source snapshot? Are the omap values proper for it?
I'm checking with the wmla team to see if the cluster is still available. Will get back.

Regarding the "no snap source in omap", that seems like a different issue than mine. Just FYI, my VolumeSnapshotClass deletionPolicy is set to Retain, so the volumesnapshotcontent isn't deleted when the volumesnapshot is deleted.

```
cat ocs-storagecluster-cephfsplugin-snapclass-velero.yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
deletionPolicy: Retain
driver: openshift-storage.cephfs.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  name: ocs-storagecluster-cephfsplugin-snapclass-velero
  labels:
    velero.io/csi-volumesnapshot-class: "true"
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
```
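As a quick sanity check, the deletion policy of every snapshot class can be confirmed in one command. A hedged convenience one-liner; it relies on `deletionPolicy` being a top-level field on VolumeSnapshotClass, which it is in the v1beta1 API shown above.

```
# List every VolumeSnapshotClass with its driver and deletionPolicy so a
# stray "Delete" policy is easy to spot.
oc get volumesnapshotclass -o custom-columns='NAME:.metadata.name,DRIVER:.driver,POLICY:.deletionPolicy'
```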
We tried to reproduce the issue on OCS 4.6 with OADP, and after the restore operation the PVC reached the Bound state without any issue. As Mudit and Madhu suggested, it looks like the clone creation is stuck on the Ceph side.
Unfortunately I haven't been able to reach the wmla team and provide the clone status. I'll update if they get back to me, but their cluster may no longer be available.
I was able to get more info from the cluster. From comment #5, this is the clone status:

```
sh-4.4# ceph fs clone status ocs-storagecluster-cephfilesystem csi-vol-118b914c-a38d-11eb-9e8b-0a580a81000d --group_name csi
Error ENOENT: subvolume 'csi-vol-118b914c-a38d-11eb-9e8b-0a580a81000d' does not exist
```
From comment #7, the must-gather can be downloaded below. Please let me know if more information is needed. https://ibm.box.com/s/hg0aexs550q85qafl0mycqg55dxl8gtg
(In reply to henrychi from comment #14)
> From comment #7, the must-gather can be downloaded below. Please let me
> know if more information is needed.
>
> https://ibm.box.com/s/hg0aexs550q85qafl0mycqg55dxl8gtg

The shared must-gather doesn't seem to have any logs from pods in the openshift-storage namespace. Can you please share the logs so that we can perform a deeper analysis of the same?
That must have been an oversight, found the logs in the attachment.
Ok. Let me know if more info is needed.
Looks like the velero restore operation performs a clone from snapshots created during the backup operation. From the attached logs, it seems like `ceph fs clone status` keeps failing in the backend, due to which the restored PVC never gets to the Bound state. For debugging the issue further, we need to look into the logs of other pods like the ceph manager, ceph mons, etc. Can you provide the logs from all the pods in the `openshift-storage` namespace? Also, can you please provide the output of `ceph -s` so that we can verify that the ceph cluster is in a healthy state?

In the meantime, to verify that clone creation is working fine on the backend, we can try creating clones manually and monitor the status.

To create a new clone named "test":

```
ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-54f89e48-b869-11eb-ac15-fe59eb9b76df csi-snap-c1d7d4e1-a29a-11eb-b5d8-0a580a800542 test --group_name csi --target_group_name csi
```

To check that the subvolume got created:

```
ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi
```

During the clone operation, monitor the clone status with:

```
ceph fs clone status ocs-storagecluster-cephfilesystem test --group_name csi
```
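Since a clone of a large subvolume can take a long time, a small polling loop saves re-running the status command by hand. A hedged convenience sketch, run from the rook-ceph toolbox; the grep pattern matches the `"state": "in-progress"` field in the JSON output shown later in this thread.

```
# Poll the clone status every minute until it leaves "in-progress"
# (ends up "complete", "failed", or "canceled").
while true; do
  status=$(ceph fs clone status ocs-storagecluster-cephfilesystem test --group_name csi)
  echo "$status"
  echo "$status" | grep -q '"state": "in-progress"' || break
  sleep 60
done
```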
1) Logs from openshift-storage pods:

```
for i in $(kubectl get pods |cut -d" " -f1); do kubectl logs $i --all-containers=true > $i.log; done
```

https://ibm.box.com/s/m7153uaptx9n1qqeqxvvvhibtkd562qb

2) Output of ceph -s:

```
sh-4.4# ceph -s
  cluster:
    id:     7b132412-410e-4e70-bb91-e9c4d2bbdf11
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum g,i,j (age 5d)
    mgr: a(active, since 5d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 5d), 3 in (since 9w)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 2.36M objects, 180 GiB
    usage:   767 GiB used, 1.6 TiB / 2.3 TiB avail
    pgs:     176 active+clean

  io:
    client:   1.5 MiB/s rd, 1.9 MiB/s wr, 97 op/s rd, 477 op/s wr
```

3) Test clone creation:

3a) The command given in comment #18 results in "does not exist":

```
sh-4.4# ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-54f89e48-b869-11eb-ac15-fe59eb9b76df csi-snap-c1d7d4e1-a29a-11eb-b5d8-0a580a800542 test --group_name csi --target_group_name csi
Error ENOENT: subvolume 'csi-vol-54f89e48-b869-11eb-ac15-fe59eb9b76df' does not exist
```

3b) I tried with another similar subvolume and snapshot:

```
sh-4.4# ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542 csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542 test --group_name csi --target_group_name csi
```

3c) Subvolume "test" got created:

```
sh-4.4# ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi | grep test
        "name": "test"
```

3d) The clone was in-progress for about an hour and eventually failed. I'm not sure if it's related to the original problem.

```
sh-4.4# ceph fs clone status ocs-storagecluster-cephfilesystem test --group_name csi
{
  "status": {
    "state": "failed",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542",
      "snapshot": "csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542",
      "group": "csi"
    }
  }
}
```

4) New logs after the clone test:

https://ibm.box.com/s/1a0ot0ika28is86vufinfsn68ze63znq
Thanks Henrychi for testing it out. As the manual clone is failing, this looks more like a CephFS issue than a CSI issue. @Patrick @Kotresh, can anyone from the CephFS team help here?
Hard to diagnose the problem without debugging turned up on the mgr. Please do:

```
ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1
ceph config set mgr debug_client 20
```

and reproduce.

@Humble, is this a normal error for CSI?

> I0429 12:36:37.140321       1 connection.go:186] GRPC error: rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"
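Once the logs have been captured, the overrides can be dropped so the mgr returns to its default verbosity. A hedged housekeeping note: `ceph config rm` removes the override rather than setting an explicit level.

```
# Remove the debug overrides after log collection to avoid keeping the
# mgr log at high verbosity indefinitely.
ceph config rm mgr debug_mgr
ceph config rm mgr debug_ms
ceph config rm mgr debug_client
```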
(In reply to Patrick Donnelly from comment #21)
> Hard to diagnose the problem without debugging turned up on the mgr. Please do:
>
> > ceph config set mgr debug_mgr 20
> > ceph config set mgr debug_ms 1
> > ceph config set mgr debug_client 20
>
> and reproduce.
>
> @Humble, is this a normal error for CSI?
>
> > I0429 12:36:37.140321       1 connection.go:186] GRPC error: rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a"

No Patrick, this is not a normal error. Rather, it clearly says the snap entry required for the clone operation is missing from the OMAP. This is a workflow error.
Henry, please rerun with debug logs as asked in https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c21
I can contact the wmla team to see if I can rerun the clone test on their cluster, but I don't think I can rerun the backup and restore. The original PVC was around 13GB with 185000 files; the wmla team reduced the size and number of files as a workaround so they could continue work.

Regarding the no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a": I remember someone in comment #8 was able to reproduce it. Is that still relevant, or is it a different issue?
1) Here are the debug logs for clone test "test3". This time, after waiting two hours, the status is still in-progress.

https://ibm.box.com/s/1dw1yshvyol17vp29mfnuljmrkgvbp2k

```
ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1
ceph config set mgr debug_client 20

ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542 csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542 test3 --group_name csi --target_group_name csi

ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi | grep test3
        "name": "test3"

ceph fs clone status ocs-storagecluster-cephfilesystem test3 --group_name csi
{
  "status": {
    "state": "in-progress",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542",
      "snapshot": "csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542",
      "group": "csi"
    }
  }
}
```

2) Also FYI, the original PVC from the problem description is still there in Pending state:

```
$ oc get pvc -n chudg
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
wmla-conda         Pending                                                                        ocs-storagecluster-cephfs   33d
wmla-cws-share     Bound     pvc-3aa99b42-922b-435a-aa92-2b9ba181b489   15Gi       RWX            ocs-storagecluster-cephfs   33d
wmla-edi           Bound     pvc-00a8de84-9ad3-4523-add0-a2ed8ebb522a   10Gi       RWX            ocs-storagecluster-cephfs   33d
wmla-infoservice   Bound     pvc-09009aa2-8b6c-4be7-a24f-f9a3d0af6b2b   4Gi        RWX            ocs-storagecluster-cephfs   33d
wmla-logging       Bound     pvc-6e2399f8-2938-47a8-84fa-fcbcceed9934   10Gi       RWX            ocs-storagecluster-cephfs   33d
wmla-mygpfs        Bound     pvc-2d74707d-870e-4e9c-a74b-93beb77ec52a   20Gi       RWX            ocs-storagecluster-cephfs   33d
```
(In reply to henrychi from comment #24)
> Regarding the no snap source in omap for "csi.snap.6702cdf8-a8c9-11eb-9bc2-0a580a81022a",
> I remember someone in comment #8 was able to reproduce. Is that still relevant, or a different issue?

That was a different issue: there, the `deletionPolicy` in the snapshot class was incorrectly set to `Delete`. For a successful backup-restore operation by velero, the deletionPolicy should be `Retain`. Also, could you re-verify that this is not the case for you?
Correct, Cloud Pak for Data service teams are using volumesnapshotclasses with the Retain deletionPolicy. Thanks for checking.
Setting the NI back on Patrick, please check https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c25
Patrick, did you get a chance to look into the logs?
I don't see any update in the mgr log related to the clone, or to any operation apart from a few config sets. Could you please verify?
comment #25 shows the steps and logs of the clone test "test3". The logs were taken after waiting two hours. The logs are from all pods in the openshift-storage namespace.
Script used to get logs:

```
for i in $(kubectl get pods |cut -d" " -f1); do kubectl logs $i --all-containers=true > $i.log; done
```
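As an aside, a slightly more robust variant of that loop, offered as a hedged sketch: `-o name` avoids capturing the `NAME` header row that the `cut`-based version also picks up.

```
# Collect logs from every pod in openshift-storage; -o name yields "pod/<name>",
# so strip the "pod/" prefix when building each log file name.
for p in $(oc -n openshift-storage get pods -o name); do
  oc -n openshift-storage logs "$p" --all-containers=true > "${p##*/}.log"
done
```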
Not sure if it helps, but there is also another set of logs in comment #19, with a similar clone test called "test".
Hi Henry,

Sorry for the delay in response. I was away on PTO and am still getting caught up.

The logs from rook-ceph-mgr-a-6ccf987f58-59fzg.log in https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c25 do not look like normal ceph-mgr logs:

> better match: 1.5 > 1.5: config show-with-defaults <who>
> better match: 5.5 > 1.5: config set <who> <name> <value> {--force}
> bestcmds_sorted:
> [{'flags': 0,
>   'help': 'Set a configuration option for one or more entities',
>   'module': 'config',
>   'perm': 'rw',
>   'sig': [argdesc(<class 'ceph_argparse.CephPrefix'>, req=True, name=prefix, n=1, numseen=0, prefix=config),
>           argdesc(<class 'ceph_argparse.CephPrefix'>, req=True, name=prefix, n=1, numseen=0, prefix=set),
>           argdesc(<class 'ceph_argparse.CephString'>, req=True, name=who, n=1, numseen=0),
>           argdesc(<class 'ceph_argparse.CephString'>, req=True, name=name, n=1, numseen=0),
>           argdesc(<class 'ceph_argparse.CephString'>, req=True, name=value, n=1, numseen=0),
>           argdesc(<class 'ceph_argparse.CephBool'>, req=False, name=force, n=1, numseen=0)]}]
> Submitting command: {'prefix': 'config set', 'who': 'mgr.a', 'name': 'mgr/prometheus/a/server_addr', 'value': '10.129.2.16', 'force': True}
> failed to try resolving symlinks in path /var/log/pods/openshift-storage_rook-ceph-mgr-a-6ccf987f58-59fzg_e3964605-1f9b-4b54-b616-48934a539ee6/mgr/0.log: lstat /var/log/pods/openshift-storage_rook-ceph-mgr-a-6ccf987f58-59fzg_e3964605-1f9b-4b54-b616-48934a539ee6/mgr/0.log: no such file or directory

I'm unsure what's going on here. Any idea Humble/Madhu?
(In reply to Patrick Donnelly from comment #34)
> The logs from rook-ceph-mgr-a-6ccf987f58-59fzg.log in
> https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c25 do not look like
> normal ceph-mgr logs:
>
> [...]
>
> I'm unsure what's going on here. Any idea Humble/Madhu?

AFAIC this is something related to the kubernetes logging. Just a guess: when collecting the logs, the mgr pod might have restarted or something like that happened. @Travis or @Seb, any idea about this?
(In reply to Madhu Rajanna from comment #35)
> (In reply to Patrick Donnelly from comment #34)
> > The logs from rook-ceph-mgr-a-6ccf987f58-59fzg.log in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1952708#c25 do not look like
> > normal ceph-mgr logs:
> >
> > [...]
> >
> > I'm unsure what's going on here. Any idea Humble/Madhu?
>
> AFAIC this is something related to the kubernetes logging. Just a guess: when
> collecting the logs, the mgr pod might have restarted or something like that
> happened. @Travis or @Seb, any idea about this?

Is this coming from the container runtime? Not sure what happened here, but a broken symlink sounds plausible after a restart.
Moving it to 4.9 as we don't have an RCA yet, let's continue the discussion.
I'm not sure either what happened with the mgr log; I haven't seen that before. It's as if the log was replaced with the ceph command-line help.
I think to make progress on this we will need the ceph-mgr log. Can another attempt at reproducing this be made? Hopefully the logs are not corrupted this time.
Here are new debug logs for clone test "test4".

https://ibm.box.com/s/ngt0li196q118ro1rp9rhjknmlpodmim

```
sh-4.4# ceph config set mgr debug_mgr 20
sh-4.4# ceph config set mgr debug_ms 1
sh-4.4# ceph config set mgr debug_client 20
sh-4.4# ceph fs subvolume snapshot clone ocs-storagecluster-cephfilesystem csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542 csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542 test4 --group_name csi --target_group_name csi
sh-4.4# ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi | grep test4
        "name": "test4"
sh-4.4# ceph fs clone status ocs-storagecluster-cephfilesystem test4 --group_name csi
{
  "status": {
    "state": "in-progress",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542",
      "snapshot": "csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542",
      "group": "csi"
    }
  }
}
```
Thanks. It looks like the ceph-mgr is just chewing through the clone slowly and is not stuck. There are indeed a lot of files:

> The PVC is around 13GB and 185K files.

I'm not sure there is anything we can do about this within the ceph-mgr yet, but maybe +Venky will have some thoughts. Eventually we would like to solve this by having a container spun up to execute the copies, rather than doing them in the ceph-mgr. That's a larger project, though.
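For context, two knobs may be worth knowing about here, offered with the caveat that they are tunables and escape hatches, not fixes for the underlying per-file copy cost: the mgr volumes module caps how many clones run concurrently, and a clone that is no longer wanted can be cancelled.

```
# The volumes mgr module limits concurrent clone jobs (default 4); raising it
# may help when several restores queue up, though one large clone still
# copies file by file inside the mgr.
ceph config set mgr mgr/volumes/max_concurrent_clones 8

# A clone can be cancelled; it then reports state "canceled" and the partial
# target subvolume has to be removed with --force.
ceph fs clone cancel ocs-storagecluster-cephfilesystem test4 --group_name csi
```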
Created attachment 1800121 [details] rook-ceph-mgr-a-5f65476665-jdwwk.log.tgz
A bit more info: I checked this morning and the clone eventually failed. Attached rook-ceph-mgr-a-5f65476665-jdwwk.log.tgz.

```
sh-4.4# ceph fs clone status ocs-storagecluster-cephfilesystem test4 --group_name csi
{
  "status": {
    "state": "failed",
    "source": {
      "volume": "ocs-storagecluster-cephfilesystem",
      "subvolume": "csi-vol-cf0ee0b9-a28e-11eb-b5d8-0a580a800542",
      "snapshot": "csi-snap-0e600e2d-a292-11eb-b5d8-0a580a800542",
      "group": "csi"
    }
  }
}
```
(In reply to Patrick Donnelly from comment #41)
> It looks like the ceph-mgr is just chewing through the clone slowly and is not stuck.
> There are indeed a lot of files:
>
> > The PVC is around 13GB and 185K files.
>
> I'm not sure there is anything we can do about this within the ceph-mgr yet,
> but maybe +Venky will have some thoughts.

Nothing that I can think of.
Hi Patrick, Any further suggestions on how to proceed with this BZ? Should we document this limitation?
(In reply to Mudit Agarwal from comment #45)
> Hi Patrick,
>
> Any further suggestions on how to proceed with this BZ?
> Should we document this limitation?

I think document and move on. An ODF RFE BZ to improve this by pulling clones out of the ceph-mgr could be created (and tracked) so we can begin prioritizing the potential enhancement.
A few questions:
1. Why did the clone fail?
2. From a customer point of view, what conditions cause this issue to occur?
3. Is there a workaround? If the customer did a backup using snapshots, they need to be able to recover the volume.
As suggested by Patrick, we need to create an ODF RFE BZ to track the ceph-mgr enhancement. Keeping this BZ open for now till the discussions are complete.
I noticed the same issue while restoring a snapshot into a PVC on an ODF 4.10 development driver, while working with the OADP operator/velero for backup and restore:

```
Warning  ProvisioningFailed  24m (x14 over 42m)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-5cb6b9665c-b2pzx_4955c704-cb9a-44ea-95bb-b0850854cc9d  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.b5e1106a-a524-11ec-8f27-0a580a81021b"
```
I am not able to reproduce it with manual tests with these YAML definitions:

RBD:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-4172f02505b142da854b0b4af64f564
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd-85b3dd544c8f4b7396cc4fe19c1
spec:
  containers:
    - image: quay.io/ocsci/nginx:latest
      name: web-server
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-test-4172f02505b142da854b0b4af64f564
        readOnly: false
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: snapshot-test-1848ac13564a41f9ac4789f3e4
spec:
  source:
    persistentVolumeClaimName: pvc-test-4172f02505b142da854b0b4af64f564
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc-test-2c87a03576504f1dbbf5199
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snapshot-test-1848ac13564a41f9ac4789f3e4
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd-fee5ccb8700047c6b2c071cfa5e
spec:
  containers:
    - image: quay.io/ocsci/nginx:latest
      name: web-server
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: restore-pvc-test-2c87a03576504f1dbbf5199
        readOnly: false
```

CephFS:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-cee52b42858e4c949577e6879faa88c
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-cephfs-ad6b9ee20c3047ed872e9143
spec:
  containers:
    - image: quay.io/ocsci/nginx:latest
      name: web-server
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-test-cee52b42858e4c949577e6879faa88c
        readOnly: false
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: snapshot-test-54f67af346c844d7b388bfb2ee
spec:
  source:
    persistentVolumeClaimName: pvc-test-cee52b42858e4c949577e6879faa88c
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc-test-0f29660a8df34418b4b6d48
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: snapshot-test-54f67af346c844d7b388bfb2ee
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-cephfs-eaf064d360124ca98352f632
spec:
  containers:
    - image: quay.io/ocsci/nginx:latest
      name: web-server
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: restore-pvc-test-0f29660a8df34418b4b6d48
        readOnly: false
```
The above problem goes away when I change the deletion policy on the volumesnapshotclass to 'Retain'. FYI.
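For anyone hitting the same thing, one hedged way to flip the policy on an existing class; the class name is taken from the manifests above, and `deletionPolicy` is a top-level, patchable field on VolumeSnapshotClass.

```
# Switch the CephFS snapshot class to Retain so the VolumeSnapshotContent
# (and the backing CephFS snapshot) survives deletion of the VolumeSnapshot.
oc patch volumesnapshotclass ocs-storagecluster-cephfsplugin-snapclass \
  --type merge -p '{"deletionPolicy": "Retain"}'
```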