Bug 1991677
Summary: | [Workload-DFG] RHCS 5 has a very high number of OSD compaction events as compared to 4.x | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
Component: | RADOS | Assignee: | Adam Kupczyk <akupczyk> |
Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.0 | CC: | akupczyk, bhubbard, ceph-eng-bugs, gsitlani, mmuench, mnelson, nojha, pdhiran, racpatel, rmandyam, rzarzyns, sseshasa, tserlin, twilkins, vereddy, vumrao |
Target Milestone: | --- | ||
Target Release: | 5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | ceph-16.2.0-116.el8cp | Doc Type: | Bug Fix |
Doc Text: |
.The trigger condition for RocksDB flush and compactions works as expected
BlueStore organizes data into chunks called blobs, which are 64K by default. Large writes are split into a sequence of 64K blob writes.
Previously, when the deferred size was equal to or greater than the blob size, all the data was deferred and placed under the “L” column family. A typical example is the HDD configuration, where both the `bluestore_prefer_deferred_size_hdd` and `bluestore_max_blob_size_hdd` parameters default to 64K. This filled the “L” column family faster, making RocksDB flushes and compactions more frequent. The trigger condition for this scenario was `data size in blob` <= `minimum deferred size`.
With this release, the deferred trigger condition checks the size of extents on disk, not blobs. Extents smaller than `deferred_size` go through the deferred mechanism, and larger extents are written to the disk immediately. The trigger condition is changed to `data size in extent` < `minimum deferred size`.
Small writes are placed under the “L” column family, which now grows slowly, with no extra compactions.
The `bluestore_prefer_deferred_size` parameter now controls deferred writes without interference from the blob size, and works as per its description of “writes smaller than this size”.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-08-30 08:31:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1959686 |
Description
Vikhyat Umrao
2021-08-09 16:49:55 UTC
From Mark's email feedback:

I also notice something goofy about those 4.2z2 results. There are maybe 5 compaction events grouped together at ~6.1K seconds, again at 12.1K seconds, then again at 16.1K seconds, 20.2K seconds, etc. The clusters are really tight, with a span of only a couple of seconds each, but then compaction is idle for over an hour until the next small grouping. That would imply a really light write workload with minimal data versus a much heavier one in the 5.0 case. Can you think of anything that might have been different on that 4.2z2 OSD vs the 5.0 one?

OK, the difference in OSD log sizes is directly related to the difference in RocksDB behavior between these two runs. I went back and dug through the respective 4.2 and 5.0 OSD logs looking at memtable flushing behavior, given the big difference in compaction stats between them. As you said, Tim, the memtable sizes are correct in both cases (256MB). In the 5.0 case, though, we are filling up memtables extremely fast, often flushing them when they are full, but also sometimes having RocksDB's "write buffer manager" flush early for some column families when global limits are hit. In the 4.2 case there were 120 flush events; in 5.0 there were 20110. 9135 of those appeared to happen because the "L" column family completely filled a memtable, forcing a flush. The others were due to the write buffer manager deciding it needed to prematurely flush memtables before they were completely full (always for other column families like onode or omap).

Example 5.0 flush associated with the "L" column family:

2021-07-31T22:56:45.302+0000 7ff98ab62700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772205303825, "job": 23798, "event": "flush_started", "num_memtables": 1, "num_entries": 535, "num_deletes": 0, "total_data_size": 254484297, "memory_usage": 260062264, "flush_reason": "Write Buffer Full"}

~500 entries means roughly 512KB of data per transaction on average.
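The flush counts cited above (120 events in 4.2 vs 20110 in 5.0, with 9135 attributed to the "L" column family) can be pulled out of the OSD logs mechanically. A minimal sketch, assuming the `EVENT_LOG_v1` JSON format shown in the example lines; `tally_flushes` and the 64K average-entry-size heuristic for spotting large deferred writes are illustrative choices, not part of the original analysis:

```python
import json
import re
from collections import Counter

# Matches the JSON payload that follows "EVENT_LOG_v1" in an OSD log line.
EVENT_RE = re.compile(r"EVENT_LOG_v1 (\{.*\})")

def tally_flushes(lines):
    """Count flush_started events by flush_reason, and flag flushes whose
    average entry size suggests large deferred ("L" column family) writes."""
    reasons = Counter()
    large_entry_flushes = 0
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ev = json.loads(m.group(1))
        if ev.get("event") != "flush_started":
            continue
        reasons[ev.get("flush_reason", "unknown")] += 1
        entries = ev.get("num_entries", 0)
        # Heuristic: >64K of data per memtable entry hints at deferred blob writes.
        if entries and ev.get("total_data_size", 0) / entries > 64 * 1024:
            large_entry_flushes += 1
    return reasons, large_entry_flushes
```

Run against the two example 5.0 lines, the first (535 entries, ~254MB) is flagged as a large-entry flush while the second (5147 entries, ~1.9MB) is not.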
Example 5.0 flush associated with other column families:

2021-07-31T22:56:37.134+0000 7ff98ab62700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772197135248, "job": 23797, "event": "flush_started", "num_memtables": 1, "num_entries": 5147, "num_deletes": 2540, "total_data_size": 1872446, "memory_usage": 1974888, "flush_reason": "Write Buffer Manager"}

Example 4.2 flush (4.2 has one big column family):

2021-08-01 07:25:32.766 7fc7072d4700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627802732767384, "job": 329, "event": "flush_started", "num_memtables": 1, "num_entries": 779408, "num_deletes": 138087, "total_data_size": 245589616, "memory_usage": 260049232, "flush_reason": "Write Buffer Full"}

Much better behavior here, more along the lines of what we expect (and also roughly what I see with master on our Mako cluster with NVMe drives doing RBD tests). The "L" column family is for deferred writes, so for some reason in 5.0 we seem to be seeing large (~512KB?) deferred writes. In Pacific the default bluestore_prefer_deferred_size_hdd value is 64_K, so unless I'm missing something we shouldn't be seeing huge transactions hitting the L column family by default like this. If you haven't changed that setting, this might be a bug.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

I was able to reproduce this issue with upstream Pacific on Mako using NVMe drives. Increasing bluestore_prefer_deferred_size to the HDD default (64K) dramatically increased RocksDB memtable flushing due to excessive writes to the "L" deferred column family (and thus greatly increased compaction overhead). Increasing bluestore_max_blob_size mitigates the issue to some extent, though it's clear that the deferred I/O path (even at 4K) is introducing fairly significant memtable flush overhead. It becomes extremely dramatic, though, when the deferred I/O size gets close to or exceeds the blob size.
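The interaction between the deferred size and the blob size can be restated as a toy model. This is not BlueStore code; `deferred_old` and `deferred_new` simply encode the two trigger conditions from the doc text (`data size in blob` <= `minimum deferred size` before the fix, `data size in extent` < `minimum deferred size` after it), using the HDD defaults where both `bluestore_prefer_deferred_size_hdd` and `bluestore_max_blob_size_hdd` are 64K:

```python
BLOB_SIZE = 64 * 1024        # bluestore_max_blob_size_hdd default
PREFER_DEFERRED = 64 * 1024  # bluestore_prefer_deferred_size_hdd default

def deferred_old(blob_data_size, prefer_deferred=PREFER_DEFERRED):
    # Pre-fix trigger: data size in blob <= minimum deferred size
    return blob_data_size <= prefer_deferred

def deferred_new(extent_size, prefer_deferred=PREFER_DEFERRED):
    # Post-fix trigger: data size in extent < minimum deferred size
    return extent_size < prefer_deferred

# A 512 KiB write split into 64 KiB blob-sized chunks:
chunks = [BLOB_SIZE] * 8
assert all(deferred_old(c) for c in chunks)      # old: everything deferred -> "L" column
assert not any(deferred_new(c) for c in chunks)  # new: full-size extents written directly
assert deferred_new(4 * 1024)                    # genuinely small writes still deferred
```

With the old inclusive comparison, every 64K chunk of a large write satisfies the trigger and lands in the "L" column family, which is exactly the ~512KB-per-transaction deferred traffic observed in the 5.0 flush logs; the new strict, extent-based comparison defers only writes that are actually smaller than the threshold.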
Upstream we probably want to get Igor involved in looking at these results, since this is pretty excessive. Again, this doesn't occur with the default SSD/NVMe settings, only the default HDD settings.

The compaction stats listed above were collected from a workload using the DFG's usual object size histogram:

[sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB]

Further tests were run using a smaller set of objects:

[sizes=h(1|2|25,2|4|40,4|8|25,8|256|10)KB]

Those results are included here.

RHCS 4.2.z2: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_42z2-bb_20210803.log

Compaction Statistics | ceph-osd.0.log-20210803
---|---
Total OSD Log Duration (seconds) | 172375.56
Number of Compaction Events | 695
Avg Compaction Time (seconds) | 1.4297389007194243
Total Compaction Time (seconds) | 993.6685359999999
Avg Output Size (MB) | 317.3482826452461
Total Output Size (MB) | 220557.05643844604
Total Input Records | 516063612
Total Output Records | 438248858
Avg Output Throughput (MB/s) | 216.54726637849893
Avg Input Records/second | 597336.6266970418
Avg Output Records/second | 490898.5575847847
Avg Output/Input Ratio | 0.8390318513188124

RHCS 5: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_50-bb_20210810.log

Compaction Statistics | ceph-osd.0.log-20210810
---|---
Total OSD Log Duration (seconds) | 173069.541
Number of Compaction Events | 1260
Avg Compaction Time (seconds) | 1.300772765079365
Total Compaction Time (seconds) | 1638.973684
Avg Output Size (MB) | 207.90438506489707
Total Output Size (MB) | 261959.52518177032
Total Input Records | 767957338
Total Output Records | 512552498
Avg Output Throughput (MB/s) | 139.84079579789753
Avg Input Records/second | 503980.3206531971
Avg Output Records/second | 314987.9575686694
Avg Output/Input Ratio | 0.7265201888226785

(In reply to Tim Wilkinson from comment #2)
> [compaction statistics for both object-size histograms, quoted above]

Thanks, Tim. This is good; it confirms that this happens mostly with objects which are bigger than 64MB.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:3294
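As a closing sanity check on the compaction tables above: the "Avg Compaction Time" and "Avg Output Size" figures are consistent with simply dividing the quoted totals by the event counts (the throughput, records/second, and output/input-ratio figures do not match total-over-total division and appear to be averages of per-event rates, so they are not re-derived here). A small hedged sketch, using only numbers quoted in the comment:

```python
# Quoted totals from the two compaction summaries (4.2.z2 and 5.0 runs).
runs = {
    "RHCS 4.2.z2": dict(events=695, total_time_s=993.6685359999999,
                        total_output_mb=220557.05643844604),
    "RHCS 5":      dict(events=1260, total_time_s=1638.973684,
                        total_output_mb=261959.52518177032),
}

# (avg compaction time in seconds, avg output size in MB) per run.
averages = {
    name: (r["total_time_s"] / r["events"], r["total_output_mb"] / r["events"])
    for name, r in runs.items()
}
```

This reproduces the quoted ~1.43 s / ~317.35 MB averages for 4.2.z2 and ~1.30 s / ~207.90 MB for 5.0: RHCS 5 ran almost twice as many compaction events, each moving less data.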