Bug 1989527

Summary: RBD: `rbd info` cmd on RBD images on which flattening is in progress throws ErrImageNotFound
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Mudit Agarwal <muagarwa>
Component: RBD    Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA QA Contact: Preethi <pnataraj>
Severity: high Docs Contact: Akash Raj <akraj>
Priority: high    
Version: 5.0CC: akraj, bbenshab, bniver, ceph-eng-bugs, danken, dholler, fdeutsch, guchen, hchiramm, idryomov, madam, mhackett, mrajanna, muagarwa, ndevos, ocs-bugs, owasserm, pelauter, pnataraj, rar, sostapov, tserlin, vashastr, vereddy
Target Milestone: ---   
Target Release: 5.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-16.2.10-37.el8cp Doc Type: Bug Fix
Doc Text:
.`rbd info` command no longer fails if executed when the image is being flattened
Previously, due to an implementation defect, the `rbd info` command would occasionally fail when run while the image was being flattened, producing a transient _No such file or directory_ error; on rerun, the command always succeeded. With this fix, the defect is corrected and the `rbd info` command no longer fails even when executed while the image is being flattened.
Story Points: ---
Clone Of: 1989521 Environment:
Last Closed: 2023-01-11 17:38:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1989521, 2039269, 2049202, 2126049    

Description Mudit Agarwal 2021-08-03 12:02:19 UTC
+++ This bug was initially created as a clone of Bug #1989521 +++

Running the `rbd info` command may return ErrImageNotFound when RBD images are undergoing a flatten operation.

Ceph CSI uses getImageInfo() in various places, listed below.

If it returns a false-positive ErrImageNotFound when the image is still present (but undergoing flattening), stale images will be left in the Ceph cluster.

https://github.com/ceph/ceph-csi/search?q=getImageInfo%28%29

Operations which may leave stale images: create/delete snapshot, create a PVC from a data source (PVC or snapshot), and
delete a PVC created from another data source.

Note: A task to flatten images is added by Ceph CSI when the snapshot/image clone depth limit is hit.
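
To illustrate why a false-positive ErrImageNotFound is dangerous here, the following Go sketch shows how a cleanup path that treats any "image not found" result as "nothing left to clean up" silently leaves the image behind when the error was only transient. This is purely illustrative, not the actual ceph-csi getImageInfo() code; the helper names and the sentinel error are hypothetical.

package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// errImageNotFound is a hypothetical sentinel standing in for the
// ErrImageNotFound that ceph-csi's getImageInfo() returns.
var errImageNotFound = errors.New("image not found")

// imageExists shells out to `rbd info`; a transient failure while the
// image is being flattened looks identical to a genuinely missing image.
func imageExists(spec string) error {
	out, err := exec.Command("rbd", "info", spec).CombinedOutput()
	if err != nil && strings.Contains(string(out), "No such file or directory") {
		return errImageNotFound // may be a false positive during flatten
	}
	return err
}

// cleanup mimics the problematic pattern: "not found" is taken to mean
// "already deleted", so a still-present image is never removed.
func cleanup(spec string) error {
	err := imageExists(spec)
	if errors.Is(err, errImageNotFound) {
		fmt.Println("assuming image is already gone, skipping delete:", spec)
		return nil // if the error was transient, the stale image is left behind
	}
	if err != nil {
		return err
	}
	return exec.Command("rbd", "rm", spec).Run()
}

func main() {
	if err := cleanup("replicapool/tmp"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}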



Steps to Reproduce:
see https://github.com/ceph/ceph-csi/issues/2327


Actual results:
Running getImageInfo() / the `rbd info` command may return ErrImageNotFound when RBD images are undergoing a flatten operation.


Expected results:
Running getImageInfo() / the `rbd info` command does not return ErrImageNotFound when RBD images are undergoing a flatten operation.


Additional info:
see https://github.com/ceph/ceph-csi/issues/2327


[root@rook-ceph-tools-7b96766574-wh7sf /]# ceph rbd task add flatten replicapool/tmp
{"sequence": 9, "id": "e2a2df32-3edb-48c4-b538-a41985232e99", "message": "Flattening image replicapool/tmp", "refs": {"action": "flatten", "pool_name": "replicapool", "pool_namespace": "", "image_name": "tmp", "image_id": "41c9e3608f0b"}}
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten, operations
        op_features: clone-child
        flags: 
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
        parent: replicapool/csi-vol-673a80bb-eea1-11eb-80f6-0242ac110006@18f674b3-62f5-4f3c-b248-add99476c0c0
        overlap: 1 GiB
...
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
rbd: error opening image tmp: (2) No such file or directory
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten
        op_features: 
        flags: 
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
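
As the transcript shows, the failure is transient: the same `rbd info` succeeds on the next invocation. Below is a minimal Go sketch of the retry-based workaround a caller could apply until the fix is available; the function name, retry count, and delay are illustrative assumptions, not part of ceph-csi or librbd.

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// rbdInfoWithRetry runs `rbd info` and retries a few times when the
// transient "No such file or directory" error is hit, since the image
// may simply be in the middle of being flattened.
func rbdInfoWithRetry(spec string, attempts int, delay time.Duration) (string, error) {
	var out []byte
	var err error
	for i := 0; i < attempts; i++ {
		out, err = exec.Command("rbd", "info", spec).CombinedOutput()
		if err == nil {
			return string(out), nil
		}
		if !strings.Contains(string(out), "No such file or directory") {
			return "", err // a different error: do not retry
		}
		time.Sleep(delay)
	}
	return "", fmt.Errorf("rbd info %s kept failing after %d attempts: %w", spec, attempts, err)
}

func main() {
	info, err := rbdInfoWithRetry("replicapool/tmp", 3, 2*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(info)
}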

Comment 1 Mudit Agarwal 2021-08-20 02:46:46 UTC
Ilya, can this be considered for 5.0z1?

Comment 5 Mudit Agarwal 2021-09-30 14:08:34 UTC
AFAIK, this is not that urgent; we can wait till 5.0z2, but I may be wrong.
Rakshith, do you have any thoughts? Is it OK if we don't fix it in 4.9?

Comment 7 Mudit Agarwal 2021-10-04 09:14:31 UTC
Setting the target release to 5.0z2 based on the above conversation; please re-target if required.

Comment 11 Scott Ostapovicz 2022-01-26 16:56:42 UTC
Not completed in time for 5.0 z4, moving to 5.1

Comment 16 Boaz 2022-03-01 12:31:23 UTC
*** Bug 2049202 has been marked as a duplicate of this bug. ***

Comment 17 guy chen 2022-03-08 08:49:55 UTC
I am running performance tests with CNV 4.9.3 and it looks like I reproduced the issue:

Created sequential VMs from a golden image, 10 seconds apart, and after ~450 VMs the snapshots started to get stuck.
My system DVs:
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c Succeeded
468
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c SnapshotForSmartCloneInProgress
21
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c CloneScheduled
12

Please advise if additional information is needed for debugging.

Comment 18 Peter Lauterbach 2022-03-16 12:28:50 UTC
This has been an issue for some time, and will definitely impact VM deployments at scale.
This is a high-priority defect for us; please advise how we can help make more progress on fixing this before it becomes a fire drill in a production cluster.

Comment 22 Scott Ostapovicz 2022-04-20 14:40:03 UTC
We are past the code freeze date for 5.1 z1, but let's consider this one a blocker/exception.

Comment 23 Preethi 2022-05-04 02:28:03 UTC
Any update on this BZ? When can we expect it to be ON_QA? We are close to test phase completion. We need it by the 6th for QE to verify this as part of the 5.1z1 release.

Comment 24 Scott Ostapovicz 2022-05-04 14:26:03 UTC
We can no longer hold the 5.1 z1 release for this one.

Comment 25 Preethi 2022-06-01 04:53:37 UTC
Any update on this BZ? When can we expect it to be ON_QA?

Comment 29 Scott Ostapovicz 2022-08-23 17:34:10 UTC
Note this is NOT a DR issue.  We will leave this here for now, but this may be moved to 5.3 z1 if there is not enough extra time to complete this.

Comment 30 Mudit Agarwal 2022-08-24 03:38:36 UTC
Yes, this is not a DR issue, but we are hitting it very frequently in upstream CI.
It is required for one of our features in 4.12; it is also causing delays in perf testing with the CNV team.
If we don't fix it, it will leave a lot of stale RBD resources.

Can we please target it for 5.3?

Comment 56 errata-xmlrpc 2023-01-11 17:38:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076

Comment 57 Ilya Dryomov 2023-04-11 12:04:19 UTC
*** Bug 2049202 has been marked as a duplicate of this bug. ***