Bug 2264900 - PVC cloning is failing with error "RBD image not found"
Summary: PVC cloning is failing with error "RBD image not found"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.14
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Rakshith
QA Contact: Yuli Persky
URL:
Whiteboard:
Depends On:
Blocks: 2260844
 
Reported: 2024-02-19 15:50 UTC by nijin ashok
Modified: 2024-09-10 08:59 UTC
CC: 9 users

Fixed In Version: 4.16.0-86
Doc Type: Bug Fix
Doc Text:
.PVC cloning failed with an error "RBD image not found"
Previously, volume snapshot restore failed when the parent of the snapshot no longer existed, because a bug in the CephCSI driver caused an RBD image in trash to be falsely identified as existing. With this fix, the driver identifies images in trash correctly, and as a result the volume snapshot is restored successfully even when the parent of the snapshot does not exist.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:14:03 UTC
Embargoed:


Attachments


Links
Github ceph ceph-csi pull 4522 (open): rbd: add ParentInTrash parameter in rbdImage struct (last updated 2024-04-01 10:23:36 UTC)
Github red-hat-storage ceph-csi pull 298 (open): BUG 2264900: rbd: add ParentInTrash parameter in rbdImage struct (last updated 2024-04-24 09:31:36 UTC)
Github red-hat-storage ocs-ci pull 10357 (merged): <GSS Closed Loop BZ: 2264900> Tests to verify restore a pvc from snapshot when the parent PVC is deleted (last updated 2024-09-10 08:59:57 UTC)
Red Hat Knowledge Base (Solution) 7057265 (last updated 2024-02-22 14:10:17 UTC)
Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:14:13 UTC)

Description nijin ashok 2024-02-19 15:50:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

This appears to be the same issue as https://github.com/ceph/ceph-csi/issues/4013. We have a customer hitting a similar problem in OpenShift Virtualization, and it is easy to hit.

OpenShift Virtualization provides golden images for its templates, which are used as VM boot disks[1]. The source image is either a PVC (before 4.14) or a VolumeSnapshot. VMs created from these templates use CSI cloning to create the new PVC.

PVC and image details of the VM created from a rhel9 golden image:

~~~
rhel9-minimum-mammal             Bound    pvc-f9685b54-fd9e-44d0-b882-aa7d707e588b   30Gi       RWX            ocs-external-storagecluster-ceph-rbd   39s


# oc get pv pvc-f9685b54-fd9e-44d0-b882-aa7d707e588b -o json |jq '.spec.csi.volumeAttributes.imageName'
"csi-vol-8f886291-10e7-49f6-9d7f-69db4bc6d21e"


# rbd info sno/csi-vol-8f886291-10e7-49f6-9d7f-69db4bc6d21e |grep parent
	parent: sno/csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c@csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c
~~~

Below is the golden image and corresponding RBD image:

~~~
# oc get volumesnapshot rhel9-6c486c3e5f8c -o json  |jq '.status.boundVolumeSnapshotContentName'
"snapcontent-34a43465-0341-4c26-84af-9da741e91b81"

# oc get volumesnapshotcontent snapcontent-34a43465-0341-4c26-84af-9da741e91b81 -o json |jq '.status.snapshotHandle,.spec.source.volumeHandle'
"0001-0011-openshift-storage-0000000000000001-df6fed52-f9eb-4711-af8b-5d6fe1940a7c"
"0001-0011-openshift-storage-0000000000000001-0d5f7a5d-3650-4efc-830f-fd8be1b4bf06"


# rbd snap ls sno/csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c
SNAPID  NAME                                           SIZE    PROTECTED  TIMESTAMP
    89  csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c  30 GiB             Mon Feb 19 15:10:25 2024

# rbd info sno/csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c |grep parent
	op_features: clone-parent, clone-child
	parent: sno/csi-vol-0d5f7a5d-3650-4efc-830f-fd8be1b4bf06@1a58acf9-08dc-4a14-845f-44e918ff718e (trash b5aa5e1635793)
~~~
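
The "(trash b5aa5e1635793)" marker in the parent line is the key detail: the deleted golden image still exists, but only in the RBD trash. go-ceph (which ceph-csi is built on) exposes this state through a Trash flag on the parent image info. Below is a minimal sketch of how the parent's trash state can be read, assuming the pool and image names from the output above; it is illustrative, not the driver's actual code:

~~~
package main

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

func main() {
	// Connect to the cluster using the default ceph.conf and keyring.
	conn, err := rados.NewConn()
	if err != nil {
		panic(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		panic(err)
	}
	if err := conn.Connect(); err != nil {
		panic(err)
	}
	defer conn.Shutdown()

	// Open the pool and the child image from the example above.
	ioctx, err := conn.OpenIOContext("sno")
	if err != nil {
		panic(err)
	}
	defer ioctx.Destroy()

	img, err := rbd.OpenImageReadOnly(ioctx,
		"csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c", rbd.NoSnapshot)
	if err != nil {
		panic(err)
	}
	defer img.Close()

	// GetParent reports the parent spec; Trash is true when the parent
	// image is pending deletion, the state shown in `rbd info` above.
	parent, err := img.GetParent()
	if err != nil {
		panic(err)
	}
	fmt.Printf("parent %s/%s in trash: %v\n",
		parent.Image.PoolName, parent.Image.ImageName, parent.Image.Trash)
}
~~~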
 
CDI deletes older PVCs when a new version of an OS image is imported and, by default, keeps only three versions of each image[2].

Once the above golden image is deleted, cloning of VMs created from the image will fail:

~~~
# oc delete volumesnapshot rhel9-6c486c3e5f8c

# rbd trash ls --pool sno |grep csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c
b5aa5204dc99d csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c
~~~

The cloning of rhel9-minimum-mammal will fail:

~~~
# oc get pvc tmp-pvc-ac6e3d87-98a9-493f-a5dd-f7f7bed409f8 -n nijin-cnv -o json |jq '.spec.dataSource'
{
  "apiGroup": null,
  "kind": "PersistentVolumeClaim",
  "name": "rhel9-minimum-mammal"
}


I0219 15:45:29.679359       1 utils.go:206] ID: 298 Req-ID: pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f GRPC request: {"capacity_range":{"required_bytes":32212254720},"name":"pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f","csi.storage.k8s.io/pvc/name":"tmp-pvc-ac6e3d87-98a9-493f-a5dd-f7f7bed409f8","csi.storage.k8s.io/pvc/namespace":"nijin-cnv","imageFeatures":"layering,deep-flatten,exclusive-lock,object-map,fast-diff","imageFormat":"2","pool":"sno"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":5}}],"volume_content_source":{"Type":{"Volume":{"volume_id":"0001-0011-openshift-storage-0000000000000001-8f886291-10e7-49f6-9d7f-69db4bc6d21e"}}}}
I0219 15:45:29.679512       1 rbd_util.go:1308] ID: 298 Req-ID: pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f setting disableInUseChecks: true image features: [layering fast-diff exclusive-lock object-map deep-flatten] mounter: rbd
I0219 15:45:29.681111       1 omap.go:88] ID: 298 Req-ID: pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f got omap values: (pool="sno", namespace="", name="csi.volume.8f886291-10e7-49f6-9d7f-69db4bc6d21e"): map[csi.imageid:b5aa5232a500f csi.imagename:csi-vol-8f886291-10e7-49f6-9d7f-69db4bc6d21e csi.volname:pvc-f9685b54-fd9e-44d0-b882-aa7d707e588b csi.volume.owner:openshift-virtualization-os-images]
I0219 15:45:29.709824       1 omap.go:88] ID: 298 Req-ID: pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f got omap values: (pool="sno", namespace="", name="csi.volumes.default"): map[]
E0219 15:45:29.739129       1 utils.go:210] ID: 298 Req-ID: pvc-f2322676-d577-48b9-abd1-2c8972a2ae5f GRPC error: rpc error: code = Internal desc = image not found: RBD image not found
~~~
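
The error comes from the driver treating the trashed parent as a live image: it resolves the parent of the clone source and then fails the whole request when that parent cannot be opened. The linked ceph-csi PR ("rbd: add ParentInTrash parameter in rbdImage struct") records the trash state so a parent in trash is no longer handled as an existing image. A rough sketch of the idea follows, with illustrative field and function names rather than the exact ceph-csi code:

~~~
package main

import "fmt"

type rbdImage struct {
	Pool          string
	RbdImageName  string
	ParentName    string
	ParentPool    string
	ParentInTrash bool // populated from go-ceph's ParentImageInfo.Trash
}

// getParent returns the parent only when it exists as a regular image.
// Before the fix, a parent that lived in trash was treated as present,
// and a later attempt to open it failed with "RBD image not found".
func (ri *rbdImage) getParent() *rbdImage {
	if ri.ParentName == "" || ri.ParentInTrash {
		// Either no parent, or the parent is pending deletion in trash
		// and must not be opened; the clone proceeds without it.
		return nil
	}
	return &rbdImage{Pool: ri.ParentPool, RbdImageName: ri.ParentName}
}

func main() {
	img := rbdImage{
		Pool:          "sno",
		RbdImageName:  "csi-vol-8f886291-10e7-49f6-9d7f-69db4bc6d21e",
		ParentName:    "csi-snap-df6fed52-f9eb-4711-af8b-5d6fe1940a7c",
		ParentPool:    "sno",
		ParentInTrash: true, // the golden-image snapshot was deleted
	}
	fmt.Println("usable parent:", img.getParent()) // prints <nil>
}
~~~

With a flag like this set, the clone and snapshot-restore paths can skip the parent lookup instead of failing the CreateVolume call.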

Version of all relevant components (if applicable):

odf-operator.v4.14.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Users will be unable to clone VMs if the source VMs' golden images have been garbage collected by CDI.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create a VM from a golden image using a Ceph RBD storage class with block volume mode.
2. Delete the golden image VolumeSnapshot.
3. Try to clone the VM created in step [1].


Actual results:

PVC cloning is failing with error "RBD image not found"

Expected results:

Cloning should work.

Additional info:

[1] https://docs.openshift.com/container-platform/4.14/virt/virtual_machines/creating_vms_rh/virt-creating-vms-from-rh-images-overview.html#virt-about-golden-images_virt-creating-vms-from-rh-images-overview

[2] https://github.com/kubevirt/containerized-data-importer/blob/42ec627e3593c45027c2ebffb32f26a182c9cdee/pkg/controller/dataimportcron-controller.go#L769

Comment 2 nijin ashok 2024-02-19 15:57:57 UTC
Adding

Comment 9 Yuli Persky 2024-05-12 14:17:45 UTC
Looks like the bug is still not fixed in 4.16.0-94. 

I've tried the first 4 steps of the scenario specified in comment#3:

- Create PVC (created pvc1)
- Create Snapshot (created pvc1snap1)
- Delete PVC (deleted pvc1)
- Restore Snapshot into pvc-restore (tried to restore pvc1snap1 to a new PVC)

Expected result: the restore action should be enabled in the UI and should work. 
Actual result: the option to restore is greyed out. 

I did check that it is possible to restore from a snapshot when the initial PVC is not deleted.

Comment 11 Yuli Persky 2024-05-12 14:43:50 UTC
Due to comments #9 and #10, the BZ has failed QA and was reopened (status changed to Assigned).

Comment 14 Yuli Persky 2024-05-16 20:40:27 UTC
@Rakshith, 

I've deployed another 4.16 cluster and reproduced this scenario again with an RWX block mode PVC, and the problem reproduced once again.

To be more specific: 

1) I've created an RWX block mode PVC (ypersky-pvc1).

2) I've successfully created a snapshot (ypersky-pvc1-snapshot1).

3) I've successfully restored the snapshot ypersky-pvc1-snapshot1 to a PVC (ypersky-pvc1-snapshot1-restore1), to make sure that restore is possible while the initial PVC is not deleted.

4) I've deleted ypersky-pvc1.

5) When I try to restore again from ypersky-pvc1-snapshot1, the Restore option is greyed out, as in the attached screenshot.
I tried changing the access mode to each of RWO, RWX, and ROX; for every one of those options, Restore is not possible (greyed out).

You are welcome to check on this cluster: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-simple-deploy-odf-cluster/66/
The above cluster will be available for a few more days. 

Reopening the BZ again.

Comment 17 Yuli Persky 2024-05-21 09:50:09 UTC
As for BZ verification on the CLI: it is possible to restore a PVC from a snapshot on 4.16.0-94 when the parent PVC is deleted.

Scenario:

1) create pvc1
2) create ypersky-pvc1-snapshot1
3) delete pvc1
4) create pvc1-1-snap1-restore with the following command: 

oc create -f <restore_yaml> 

The content of the YAML file is:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1-snapshot1-restore1-cli
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ypersky-pvc1-snapshot1
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block


Moving this BZ to verified state and will open a new BZ for the UI.

Comment 21 errata-xmlrpc 2024-07-17 13:14:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

