Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1906595

Summary:	[cee/sd][bluestore] Performance issues with bluefs_buffered_io=false in RHCS 4 in certain scenarios
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Tomas Petr <tpetr>
Component:	RADOS	Assignee:	Neha Ojha <nojha>
Status:	CLOSED DUPLICATE	QA Contact:	Manohar Murthy <mmurthy>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.1	CC:	akupczyk, bhubbard, cbodley, ceph-eng-bugs, dzafman, frederic.nass, gsitlani, jharriga, kchai, mbenjamin, mkogan, mmuench, mnelson, nojha, pdhiran, rzarzyns, sseshasa, twilkins, vumrao
Target Milestone:	---	Keywords:	Performance
Target Release:	5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-04-29 18:38:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Tomas Petr 2020-12-10 20:47:55 UTC

Description of problem:
We have a report of a performance regression after upgrading a cluster from RHCS 3.3 to RHCS 4.1z2.
The regression was found to be related to bluefs_buffered_io, which is now set to false
 - changed since RHCS 4.1 in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1802199 

During RBD snapshot removal, the SSDs hosting the RocksDB databases went to 100% util in iostat, with over 770 MB/s and 4.5k IOPS of reads, and the OSD log files contained many lines like these:
-----
2020-12-10 15:25:04.824 7f11ed711700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 5.92839s
2020-12-10 15:25:04.824 7f11e46ff700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 6.43955s
2020-12-10 15:25:04.824 7f11eff16700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 5.88503s
2020-12-10 15:25:04.824 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.92865s, txc = 0x55f3c2ed22c0
2020-12-10 15:25:04.824 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.43986s, txc = 0x55f518282dc0
2020-12-10 15:25:04.825 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.88556s, txc = 0x55f43af3a2c0
-----
The only way to lower the load on the SSDs was to set osd_snap_trim_sleep to 20s, but with such a high value, trimming all PGs took a long time.
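
For triage, the slow-operation lines quoted above can be summarized per operation. The following is a minimal sketch, not part of the original report: the regular expression is an assumption based only on the line format shown in the excerpt, and the function name summarize_slow_ops is hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the log lines quoted in this report.
SLOW_OP = re.compile(
    r"log_latency(?:_fn)? slow operation observed for (?P<op>\w+), "
    r"latency = (?P<secs>[0-9.]+)s"
)


def summarize_slow_ops(lines):
    """Return {operation: (count, max_latency_seconds)} for matching lines."""
    latencies = defaultdict(list)
    for line in lines:
        match = SLOW_OP.search(line)
        if match:
            latencies[match.group("op")].append(float(match.group("secs")))
    return {op: (len(vals), max(vals)) for op, vals in latencies.items()}


# Two lines copied from the excerpt above.
sample = [
    "2020-12-10 15:25:04.824 7f11ed711700  0 bluestore(/var/lib/ceph/osd/ceph-109) "
    "log_latency slow operation observed for submit_transact, latency = 5.92839s",
    "2020-12-10 15:25:04.824 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) "
    "log_latency_fn slow operation observed for _txc_committed_kv, "
    "latency = 6.43986s, txc = 0x55f518282dc0",
]

summary = summarize_slow_ops(sample)
for op, (count, worst) in summary.items():
    print(f"{op}: {count} slow op(s), worst {worst:.2f}s")
```

In practice the input would be the lines of an OSD log file rather than the two sample strings.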



After changing the bluefs_buffered_io parameter to true, these issues were no longer observed. Trimming was fast again, as with RHCS 3, with no load at all on the SSDs even with an osd_snap_trim_sleep of 0.1s.
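
The report does not say how the two settings were changed. As a hedged sketch, assuming the centralized `ceph config set` interface is used and the options are applied to all OSDs, the commands could be assembled like this. The helper config_set_command is hypothetical, and the code only prints the commands rather than running them. Whether an OSD restart is needed for bluefs_buffered_io to take effect is not stated in this report.

```python
def config_set_command(option, value, who="osd"):
    """Build (but do not run) a `ceph config set` command line."""
    return ["ceph", "config", "set", who, option, str(value)]


# Values taken from the workaround described in this report.
commands = [
    config_set_command("bluefs_buffered_io", "true"),
    config_set_command("osd_snap_trim_sleep", 0.1),
]

# Dry run: print the commands for review instead of executing them.
for cmd in commands:
    print(" ".join(cmd))
```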

A large difference was also observed in the time needed to list RGW buckets, which improved after setting the bluefs_buffered_io parameter to true. Listing the same buckets:
RHCS 3.3: 14s
RHCS 4.1z2 + bluefs_buffered_io=false: 32s
RHCS 4.1z2 + bluefs_buffered_io=true: 22s
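
To put the listing times above in relative terms, the short sketch below computes each time as a multiple of the RHCS 3.3 baseline. It uses only the three figures from this report; the dictionary keys are illustrative labels.

```python
# Bucket listing times in seconds, as reported above.
listing_seconds = {
    "RHCS 3.3": 14,
    "RHCS 4.1z2, bluefs_buffered_io=false": 32,
    "RHCS 4.1z2, bluefs_buffered_io=true": 22,
}

baseline = listing_seconds["RHCS 3.3"]

# Each configuration's listing time as a multiple of the RHCS 3.3 time.
slowdown = {name: secs / baseline for name, secs in listing_seconds.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x the RHCS 3.3 listing time")
```

With buffered I/O disabled the listing takes roughly 2.3 times as long as on RHCS 3.3. Enabling it narrows the gap to about 1.6 times but does not close it.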

Version-Release number of selected component (if applicable):
14.2.8-111

How reproducible:
during snapshot removal

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Red Hat Bugzilla 2023-09-15 00:52:49 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days