Bug 1989527 - RBD: `rbd info` cmd on rbd images on which flattening is in progress throws ErrImageNotFound
Summary: RBD: `rbd info` cmd on rbd images on which flattening is in progress throws ErrImageNotFound
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.3
Assignee: Ilya Dryomov
QA Contact: Preethi
Docs Contact: Akash Raj
URL:
Whiteboard:
Duplicates: 2049202
Depends On:
Blocks: 1989521 2039269 2049202 2126049
 
Reported: 2021-08-03 12:02 UTC by Mudit Agarwal
Modified: 2023-04-11 12:04 UTC
CC List: 24 users

Fixed In Version: ceph-16.2.10-37.el8cp
Doc Type: Bug Fix
Doc Text:
.`rbd info` command no longer fails if executed while the image is being flattened
Previously, due to an implementation defect, the `rbd info` command would occasionally fail if run while the image was being flattened. This caused a transient _No such file or directory_ error, although the command always succeeded on rerun. With this fix, the defect has been corrected and the `rbd info` command no longer fails even if executed while the image is being flattened.
Clone Of: 1989521
Environment:
Last Closed: 2023-01-11 17:38:53 UTC
Embargoed:




Links
Ceph Project Bug Tracker 52810 (last updated 2021-10-04 09:19:00 UTC)
Red Hat Issue Tracker RHCEPH-675 (last updated 2021-09-21 11:05:58 UTC)
Red Hat Product Errata RHSA-2023:0076 (last updated 2023-01-11 17:39:49 UTC)

Internal Links: 2060374

Description Mudit Agarwal 2021-08-03 12:02:19 UTC
+++ This bug was initially created as a clone of Bug #1989521 +++

Running the `rbd info` cmd may return ErrImageNotFound when rbd images are undergoing a flatten operation.

Cephcsi uses getImageInfo() in various places, listed below.

If it returns a false-positive ErrImageNotFound in situations where the image is still present (but undergoing flattening), it will leave stale images in the Ceph cluster.

https://github.com/ceph/ceph-csi/search?q=getImageInfo%28%29

Operations which may leave stale images are create/delete snapshot, create PVC from a data source (PVC or snapshot), and delete PVC created from another data source.

Note: A task to flatten images is added by cephcsi when we hit the snapshot / image clone depth limit.
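
As an aside, a consumer can work around the race by treating the transient failure as retryable rather than concluding the image is gone. Below is a minimal bash sketch of such a wrapper around `rbd info`; it is illustrative only (not cephcsi code or the eventual librbd fix), and the pool/image names and retry counts are assumptions for the example.

#!/bin/bash
# Hypothetical wrapper: retry `rbd info` when it fails with the transient
# "No such file or directory" seen while an image is being flattened;
# fail fast on any other error.
POOL="replicapool"
IMAGE="tmp"

rbd_info_with_retry() {
    local attempts=5 delay=1 out
    for ((i = 1; i <= attempts; i++)); do
        if out=$(rbd info "${POOL}/${IMAGE}" 2>&1); then
            echo "${out}"
            return 0
        fi
        # Any failure other than the transient ENOENT is reported immediately.
        if ! grep -q "No such file or directory" <<< "${out}"; then
            echo "${out}" >&2
            return 1
        fi
        sleep "${delay}"
        delay=$((delay * 2))
    done
    echo "rbd info still failing after ${attempts} attempts" >&2
    return 1
}

rbd_info_with_retry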



Steps to Reproduce:
see https://github.com/ceph/ceph-csi/issues/2327


Actual results:
Running getImageInfo() / the `rbd info` cmd may return ErrImageNotFound when rbd images are undergoing a flatten operation.


Expected results:
Running getImageInfo() / the `rbd info` cmd does not return ErrImageNotFound when rbd images are undergoing a flatten operation.


Additional info:
see https://github.com/ceph/ceph-csi/issues/2327


[root@rook-ceph-tools-7b96766574-wh7sf /]# ceph rbd task add flatten replicapool/tmp
{"sequence": 9, "id": "e2a2df32-3edb-48c4-b538-a41985232e99", "message": "Flattening image replicapool/tmp", "refs": {"action": "flatten", "pool_name": "replicapool", "pool_namespace": "", "image_name": "tmp", "image_id": "41c9e3608f0b"}}
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten, operations
        op_features: clone-child
        flags: 
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
        parent: replicapool/csi-vol-673a80bb-eea1-11eb-80f6-0242ac110006@18f674b3-62f5-4f3c-b248-add99476c0c0
        overlap: 1 GiB
...
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
rbd: error opening image tmp: (2) No such file or directory
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten
        op_features: 
        flags: 
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
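
The race can also be reproduced without cephcsi by starting a flatten task and polling `rbd info` in a tight loop until the transient failure shows up. A minimal sketch, reusing the pool/image names from the transcript above (loop count and polling interval are arbitrary assumptions):

#!/bin/bash
# Reproduction sketch: kick off a background flatten task, then poll
# `rbd info` and report every transient failure observed while the
# flatten is in progress.
POOL="replicapool"
IMAGE="tmp"

ceph rbd task add flatten "${POOL}/${IMAGE}"

for i in $(seq 1 200); do
    if ! rbd info "${POOL}/${IMAGE}" > /dev/null 2>&1; then
        echo "attempt ${i}: rbd info failed while flatten was in progress"
    fi
    sleep 0.1
done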

Comment 1 Mudit Agarwal 2021-08-20 02:46:46 UTC
Ilya, can this be considered for 5.0z1?

Comment 5 Mudit Agarwal 2021-09-30 14:08:34 UTC
AFAIK this is not that urgent; we can wait till 5.0z2, but I may be wrong.
Rakshith, do you have any thoughts? Is it OK if we don't fix it in 4.9?

Comment 7 Mudit Agarwal 2021-10-04 09:14:31 UTC
Putting the target release as 5.0z2 based on the above conversation; please re-target if required.

Comment 11 Scott Ostapovicz 2022-01-26 16:56:42 UTC
Not completed in time for 5.0 z4, moving to 5.1

Comment 16 Boaz 2022-03-01 12:31:23 UTC
*** Bug 2049202 has been marked as a duplicate of this bug. ***

Comment 17 guy chen 2022-03-08 08:49:55 UTC
I am running performance tests with CNV 4.9.3 and it looks like I have reproduced the issue:

Created sequential VMs from a golden image, 10 seconds apart, and after ~450 VMs the snapshots started to get stuck.
My system DVs:
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c Succeeded
468
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c SnapshotForSmartCloneInProgress
21
[kni@f12-h17-b07-5039ms ~]$ oc get dv -A | grep -c CloneScheduled
12

Please advise if additional information is needed for debugging.

Comment 18 Peter Lauterbach 2022-03-16 12:28:50 UTC
This has been an issue for some time, and will definitely impact VM deployments at scale.
This is a high-priority defect for us; please advise how we can help make more progress on fixing this before it becomes a fire drill in a production cluster.

Comment 22 Scott Ostapovicz 2022-04-20 14:40:03 UTC
We are past the code freeze date for 5.1 z1, but let's consider this one a blocker/exception.

Comment 23 Preethi 2022-05-04 02:28:03 UTC
Any update on this BZ? When can we expect it to be ON_QA? We are close to test phase completion. We need it by the 6th for QE to verify this as part of the 5.1z1 release.

Comment 24 Scott Ostapovicz 2022-05-04 14:26:03 UTC
We can no longer hold the 5.1 z1 release for this one.

Comment 25 Preethi 2022-06-01 04:53:37 UTC
Any update on this BZ? When can we expect it to be ON_QA?

Comment 29 Scott Ostapovicz 2022-08-23 17:34:10 UTC
Note this is NOT a DR issue.  We will leave this here for now, but this may be moved to 5.3 z1 if there is not enough extra time to complete this.

Comment 30 Mudit Agarwal 2022-08-24 03:38:36 UTC
Yes, this is not a DR issue but we are hitting it very frequently in upstream CI.
It is required for one of our features in 4.12, and it is also causing delays in perf testing with the CNV team.
If we don't fix it, it will leave a lot of stale rbd resources.

Can we please target it for 5.3 only?

Comment 56 errata-xmlrpc 2023-01-11 17:38:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076

Comment 57 Ilya Dryomov 2023-04-11 12:04:19 UTC
*** Bug 2049202 has been marked as a duplicate of this bug. ***

