Bug 2049202

Summary: image clones become disproportionately slow as image cloning parallelism increases
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Boaz <bbenshab>
Component: ceph Assignee: Ilya Dryomov <idryomov>
ceph sub component: RBD QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: akamra, alayani, alitke, awels, bniver, ceph-eng-bugs, danken, fdeutsch, idryomov, jhopper, mrajanna, muagarwa, ndevos, ocs-bugs, odf-bz-bot, owasserm, pnataraj, rar
Version: 4.9 Keywords: Performance, Reopened, Tracking
Target Milestone: ---   
Target Release: ODF 4.13.0   
Hardware: All   
OS: Linux   
Last Closed: 2023-04-11 12:04:19 UTC Type: Bug
Bug Depends On: 1989527    
Bug Blocks:    

Comment 42 Dan Kenigsberg 2022-03-24 13:16:21 UTC
Speaking with Orit Wasserman, I do not believe that this is a dup of 1989527: The customer problem is not just slow `rbd info`, but it could be the fact that CSI clone creates so many images.

Our customer would like to create thousands of clones of the same PV. IMHO this should be supported by OCS-CSI regardless of the implementation on the rbd level. Am I mistaken, @ndevos ?

Comment 43 Ilya Dryomov 2022-03-24 14:11:31 UTC
BZ 1989527 extends far beyond "rbd info" -- any attempt to open the image that is being flattened can spuriously return ENOENT.  It is a rather narrow race window, but the belief is that it nonetheless contributes to the overall issue.

Comment 44 Niels de Vos 2022-04-28 07:02:21 UTC
As Ilya mentioned in comment #43, the fix in Ceph is expected to address this problem. Because Ceph RBD returns a spurious error, Ceph-CSI cannot handle the workflow correctly while cloning some RBD images. This results in a resource leak, which causes further problems down the line.