
Bug 1906595

Summary: [cee/sd][bluestore] Performance issues with bluefs_buffered_io=false in RHCS 4 in certain scenarios
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 4.1
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Keywords: Performance
Reporter: Tomas Petr <tpetr>
Assignee: Neha Ojha <nojha>
QA Contact: Manohar Murthy <mmurthy>
CC: akupczyk, bhubbard, cbodley, ceph-eng-bugs, dzafman, frederic.nass, gsitlani, jharriga, kchai, mbenjamin, mkogan, mmuench, mnelson, nojha, pdhiran, rzarzyns, sseshasa, twilkins, vumrao
Type: Bug
Last Closed: 2021-04-29 18:38:36 UTC

Description Tomas Petr 2020-12-10 20:47:55 UTC
Description of problem:
We have a report of a performance regression after upgrading a cluster from RHCS 3.3 to RHCS 4.1z2. It was found to be related to bluefs_buffered_io now being set to false
 - changed since RHCS 4.1 in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1802199
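For reference, the effective value can be checked on a running OSD and overridden cluster-wide with the standard config commands in RHCS 4 (Nautilus). A minimal sketch, where osd.0 is just an example daemon; depending on the exact build, the OSDs may need a restart for the new value to take effect:
-----
# Show the value a running OSD is actually using (via its admin socket)
ceph daemon osd.0 config get bluefs_buffered_io

# Override the default for all OSDs in the monitor config store
ceph config set osd bluefs_buffered_io true
-----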

During RBD snapshot removal, the SSDs hosting the RocksDB databases would go to 100% utilization in iostat, with over 770 MB/s and 4.5k read IOPS, and the OSD log files filled with many lines like these:
-----
2020-12-10 15:25:04.824 7f11ed711700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 5.92839s
2020-12-10 15:25:04.824 7f11e46ff700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 6.43955s
2020-12-10 15:25:04.824 7f11eff16700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency slow operation observed for submit_transact, latency = 5.88503s
2020-12-10 15:25:04.824 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.92865s, txc = 0x55f3c2ed22c0
2020-12-10 15:25:04.824 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.43986s, txc = 0x55f518282dc0
2020-12-10 15:25:04.825 7f1205f42700  0 bluestore(/var/lib/ceph/osd/ceph-109) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.88556s, txc = 0x55f43af3a2c0
-----
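(These messages are emitted once an operation exceeds bluestore_log_op_age, 5 seconds by default.) The device-level symptoms can be watched with iostat from sysstat; a sketch, where /dev/sdX stands in for the SSD holding the RocksDB partitions:
-----
# Extended per-device statistics every 5 seconds; watch r/s, rMB/s and %util
iostat -xm 5 /dev/sdX
-----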
The only way to lower the load on the SSDs was to set osd_snap_trim_sleep to 20s, but with such a high value it then took a long time to trim all PGs.
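The sleep was applied at runtime roughly as follows (a sketch using the 20s value from the case; re-running with a smaller value reverts it):
-----
# Inject a 20-second pause between snapshot trim operations on all OSDs
ceph tell osd.* injectargs '--osd_snap_trim_sleep 20'
-----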



After changing the bluefs_buffered_io parameter back to true, these issues were no longer observed. Trimming was fast again, as with RHCS 3, with no load at all on the SSDs even with an osd_snap_trim_sleep of 0.1s.

A big difference was also observed in the time to list RGW buckets. Listing the same buckets:
RHCS 3.3: 14s
RHCS 4.1z2 + bluefs_buffered_io=false: 32s
RHCS 4.1z2 + bluefs_buffered_io=true: 22s
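Timings like these can be taken with radosgw-admin; a sketch, where BUCKET_NAME is a placeholder (depending on the version, --max-entries may need to be raised to cover the whole bucket):
-----
# Time a listing of one bucket's objects, discarding the output
time radosgw-admin bucket list --bucket=BUCKET_NAME > /dev/null
-----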

Version-Release number of selected component (if applicable):
14.2.8-111

How reproducible:
during snapshot removal
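A minimal sketch of triggering snapshot trimming, assuming an existing RBD image (pool/image and snap1 are placeholders; a meaningful trim load also requires data written under the snapshot and overwritten or deleted afterwards):
-----
# Create and later remove a snapshot; removal queues the PGs for snap trimming
rbd snap create pool/image@snap1
rbd snap rm pool/image@snap1
-----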


Comment 8 Red Hat Bugzilla 2023-09-15 00:52:49 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days