Bug 2259054

Summary: Improve rbd_diff_iterate2() performance in fast-diff mode [5.3z]
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ilya Dryomov <idryomov>
Component: RBD
Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA
QA Contact: Sunil Angadi <sangadi>
Severity: high
Priority: unspecified
Version: 6.1
CC: asriram, ceph-eng-bugs, cephqe-warriors, dwalveka, sangadi, tserlin
Target Release: 5.3z7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.10-257.el8cp
Doc Type: Enhancement
Doc Text:
Previously, RBD diff-iterate was not guaranteed to execute locally, even if exclusive lock was available, when diffing against the beginning of time (`fromsnapname == NULL`) in fast-diff mode (`whole_object == true` with the `fast-diff` image feature enabled and valid). With this enhancement, diff-iterate is guaranteed to execute locally in that case, significantly improving `rbd_diff_iterate2()` API performance for QEMU live disk synchronization and backup use cases where the `fast-diff` image feature is enabled.
Clone Of: 2258997
Last Closed: 2024-06-26 10:01:49 UTC
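To make the Doc Text concrete, here is a minimal, untested sketch of the affected call: an `rbd_diff_iterate2()` diff against the beginning of time (`fromsnapname == NULL`) in fast-diff mode (`whole_object` enabled), summing allocated bytes the way `rbd du` reports them. The pool and image names are placeholders, connection setup assumes a default ceph.conf and client.admin keyring, and error handling is elided.

    #include <stdio.h>
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    /* Invoked once per extent that differs from the beginning of time,
     * i.e. once per allocated extent; sum up the used bytes. */
    static int diff_cb(uint64_t offset, size_t len, int exists, void *arg)
    {
        (void)offset;
        if (exists)
            *(uint64_t *)arg += len;
        return 0;
    }

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t ioctx;
        rbd_image_t image;
        uint64_t size, used = 0;

        /* Placeholder setup: default ceph.conf, client.admin keyring. */
        rados_create(&cluster, NULL);
        rados_conf_read_file(cluster, NULL);
        rados_connect(cluster);
        rados_ioctx_create(cluster, "rbd", &ioctx);  /* pool: placeholder */
        rbd_open(ioctx, "image", &image, NULL);      /* image: placeholder */
        rbd_get_size(image, &size);

        /* The call this enhancement targets: fromsnapname == NULL and
         * whole_object == 1, with fast-diff enabled and valid. */
        rbd_diff_iterate2(image, NULL, 0, size,
                          1 /* include_parent */, 1 /* whole_object */,
                          diff_cb, &used);
        printf("used: %llu bytes\n", (unsigned long long)used);

        rbd_close(image);
        rados_ioctx_destroy(ioctx);
        rados_shutdown(cluster);
        return 0;
    }

Per the Doc Text above, with the fix such a call is guaranteed to execute locally when exclusive lock is available, which is where the speedup for QEMU block jobs and `rbd du` comes from.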

Description Ilya Dryomov 2024-01-18 21:39:51 UTC
+++ This bug was initially created as a clone of Bug #2258997 +++

Description of problem:
When a user runs `rbd du <pool>/<image>`, the command takes quite a long time to report the image's disk usage, irrespective of the image size. This performance can be improved.


Version-Release number of selected component (if applicable):
The issue exists in RHCS 5, RHCS 6, and RHCS 7.

How reproducible:
Always 

Steps to Reproduce:
1. Deploy Ceph and create a pool and images.
2. Drive heavy I/O to one image and light I/O to another.
3. Note the time taken by `rbd du` for each image (see the timing sketch below).
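The timing sketch referenced in step 3, in C: `rbd du` essentially boils down to a whole-object diff-iterate over the image (see the upstream tracker under Additional info), so timing that call directly on an open image handle exposes the same slowness. The helper name `time_whole_object_diff` is hypothetical, and the image is assumed to be opened as in the earlier sketch.

    #include <time.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <rbd/librbd.h>

    /* No-op callback: only the duration of the iteration matters here. */
    static int noop_cb(uint64_t ofs, size_t len, int exists, void *arg)
    {
        (void)ofs; (void)len; (void)exists; (void)arg;
        return 0;
    }

    /* Returns wall-clock seconds spent in a whole-object diff-iterate
     * over the full image, or a negative librbd error code. */
    static double time_whole_object_diff(rbd_image_t image)
    {
        struct timespec t0, t1;
        uint64_t size;
        int r;

        r = rbd_get_size(image, &size);
        if (r < 0)
            return (double)r;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        r = rbd_diff_iterate2(image, NULL /* beginning of time */, 0, size,
                              1 /* include_parent */, 1 /* whole_object */,
                              noop_cb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (r < 0)
            return (double)r;
        return (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

Comparing a fixed and an unfixed client build against the same cluster (as suggested in comment 5 below) should show the difference directly.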

Actual results:
Irrespective of the image size, `rbd du` takes quite a long time to report disk usage.

Expected results:
The performance of the fast-diff path can be improved so that `rbd du` completes quickly.

Additional info:
https://tracker.ceph.com/issues/63341
https://gitlab.com/qemu-project/qemu/-/issues/1026

Comment 5 Ilya Dryomov 2024-06-05 08:20:10 UTC
(In reply to Sunil Angadi from comment #4)
> Tested using
> ceph version 16.2.10-260.el8cp (b20e1a5452628262667a6b060687917fde010343)
> pacific (stable)

Hi Sunil,

Is this the version of Ceph installed on the client node too (i.e. where the rbdbackup_with_lock.sh script is run)?

> 
> QEMU available for latest rhel8.9 is
> "qemu-kvm-6.2.0-40.module+el8.9.0+20867+9a6a0901.2"

QEMU 6.2 should be affected, both according to my understanding of the code and to the original report at https://gitlab.com/qemu-project/qemu/-/issues/1026.

> Timestamp for "event":"JOB_STATUS_CHANGE" with "status":"running":
> 1717495086.944589
> Timestamp for "event":"BLOCK_JOB_COMPLETED": 1717495131.119845
> Now, subtract the first timestamp from the second:
> 
> 1717495131.119845 - 1717495086.944589 ≈ 44.18 seconds
> 
> this performance result is not the same as on RHEL 9

Have you tried running the same script on a build without the fix?  Is there any difference in performance?

(Again, what matters is the version of Ceph installed on the client node -- you should be able to perform both tests against the same cluster.)

Comment 8 errata-xmlrpc 2024-06-26 10:01:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4118