+++ This bug was initially created as a clone of Bug #1989521 +++

Running the `rbd info` command may return ErrImageNotFound while an rbd image is undergoing a flatten operation. Cephcsi uses getImageInfo() in various places, listed here: https://github.com/ceph/ceph-csi/search?q=getImageInfo%28%29. If it returns a false-positive ErrImageNotFound while the image is still present (but undergoing flattening), it will leave stale images in the Ceph cluster.

Operations which may leave stale images: create/delete snapshot, create PVC from a data source (PVC or snapshot), and delete a PVC created from another data source.

Note: the task to flatten images is added by cephcsi when we hit the snapshot / image clone depth limit.

Steps to Reproduce:
see https://github.com/ceph/ceph-csi/issues/2327

Actual results:
Running getImageInfo() / the `rbd info` command may return ErrImageNotFound while rbd images are undergoing a flatten operation.

Expected results:
Running getImageInfo() / the `rbd info` command does not return ErrImageNotFound while rbd images are undergoing a flatten operation.

Additional info:
see https://github.com/ceph/ceph-csi/issues/2327

[root@rook-ceph-tools-7b96766574-wh7sf /]# ceph rbd task add flatten replicapool/tmp
{"sequence": 9, "id": "e2a2df32-3edb-48c4-b538-a41985232e99", "message": "Flattening image replicapool/tmp", "refs": {"action": "flatten", "pool_name": "replicapool", "pool_namespace": "", "image_name": "tmp", "image_id": "41c9e3608f0b"}}

[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 41c9e3608f0b
	block_name_prefix: rbd_data.41c9e3608f0b
	format: 2
	features: layering, deep-flatten, operations
	op_features: clone-child
	flags:
	create_timestamp: Tue Jul 27 11:29:40 2021
	access_timestamp: Tue Jul 27 11:29:40 2021
	modify_timestamp: Tue Jul 27 11:29:40 2021
	parent: replicapool/csi-vol-673a80bb-eea1-11eb-80f6-0242ac110006@18f674b3-62f5-4f3c-b248-add99476c0c0
	overlap: 1 GiB
...

[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
rbd: error opening image tmp: (2) No such file or directory

[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 41c9e3608f0b
	block_name_prefix: rbd_data.41c9e3608f0b
	format: 2
	features: layering, deep-flatten
	op_features:
	flags:
	create_timestamp: Tue Jul 27 11:29:40 2021
	access_timestamp: Tue Jul 27 11:29:40 2021
	modify_timestamp: Tue Jul 27 11:29:40 2021
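For reference, a minimal sketch of how to observe the race from the toolbox pod, based only on the commands in the transcript above. It assumes a clone named replicapool/tmp already exists (the pool and image names are just the ones used in the transcript), and the 60-second poll window is arbitrary; some polls intermittently report ENOENT even though the image never actually disappears:

# start a flatten task for the cloned image, then poll `rbd info` while it runs
ceph rbd task add flatten replicapool/tmp
for i in $(seq 1 60); do
    rbd info replicapool/tmp > /dev/null 2>&1 || \
        echo "poll $i: rbd info returned ENOENT while flatten was in progress"
    sleep 1
done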
Ilya, can this be considered for 5.0z1?
AFAIK, this is not that urgent; we can wait till 5.0z2, but I may be wrong. Rakshith, do you have any thoughts? Is it OK if we don't fix it in 4.9?
Setting the target release to 5.0z2 based on the above conversation; please re-target if required.
Not completed in time for 5.0 z4, moving to 5.1
*** Bug 2049202 has been marked as a duplicate of this bug. ***
I am running performance tests with CNV 4.9.3 and it looks like I reproduced the issue: I created sequential VMs from a golden image, 10 seconds apart, and after ~450 VMs the snapshots started to get stuck. My system's DVs:

[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c Succeeded
468
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c SnapshotForSmartCloneInProgress
21
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c CloneScheduled
12

Please advise if additional information is needed for debugging.
This has been an issue for some time, and it will definitely impact VM deployments at scale. This is a high-priority defect for us; please advise how we can help make more progress on fixing this before it becomes a fire drill in a production cluster.
We are past the code freeze date for 5.1 z1, but let's consider this one a blocker/exception.
Any update on this BZ? When can we expect it to move to ON_QA? We are close to test phase completion. We need it by the 6th for QE to verify this as part of the 5.1 z1 release.
We can no longer hold the 5.1 z1 release for this one.
Any update on this BZ? When can we expect it to move to ON_QA?
Note this is NOT a DR issue. We will leave it here for now, but it may be moved to 5.3 z1 if there is not enough extra time to complete it.
Yes, this is not a DR issue, but we are hitting it very frequently in upstream CI. It is required for one of our features in 4.12, and it is also causing delays in perf testing with the CNV team. If we don't fix it, it will leave a lot of stale rbd resources. Can we please target it for 5.3 only?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0076