Description of problem:
[Workload-DFG] RHCS 5 has a very high number of OSD compaction events as compared to RHCS 4.x:
  RHCS 4.2.z2 - compaction events - 169
  RHCS 5.0    - compaction events - 5103

Version-Release number of selected component (if applicable):
RHCS 4.2z2 (14.2.11-181.el8cp)
RHCS 5.0 stable (16.2.0-72.el8cp)

How reproducible:
Always

Tooling - compaction stats script:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py

Additional info:

RHCS 4.2.z2: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_42z2_20210731.log

Compaction Statistics               ceph-osd.0.log-20210731
Total OSD Log Duration (seconds)    173400.204
Number of Compaction Events         169  <=========================
Avg Compaction Time (seconds)       1.293960958579882
Total Compaction Time (seconds)     218.67940200000004
Avg Output Size (MB)                280.22014126975154
Total Output Size (MB)              47357.20387458801
Total Input Records                 102641789
Total Output Records                77039937
Avg Output Throughput (MB/s)        214.8391434610067
Avg Input Records/second            516788.7242412715
Avg Output Records/second           361864.6868155501
Avg Output/Input Ratio              0.707900981798514

RHCS 5: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_50_20210731.log

Compaction Statistics               ceph-osd.0.log-20210731
Total OSD Log Duration (seconds)    176346.023
Number of Compaction Events         5103  <=================
Avg Compaction Time (seconds)       0.6740464569860873
Total Compaction Time (seconds)     3439.659070000004
Avg Output Size (MB)                78.40701729910714
Total Output Size (MB)              400111.00927734375
Total Input Records                 2865990255
Total Output Records                2817877787
Avg Output Throughput (MB/s)        122.18567838544968
Avg Input Records/second            475908.4835702056
Avg Output Records/second           465314.5987771648
Avg Output/Input Ratio              0.5896704305992323
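Put differently: 169 compactions over a 173400-second log works out to roughly one compaction every 17 minutes on the 4.2.z2 OSD, while 5103 compactions over 176346 seconds is roughly one every 35 seconds on the 5.0 OSD, i.e. about a 30x higher compaction frequency over essentially the same wall-clock duration.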
From Mark's email feedback:

I also notice something goofy about those 4.2z2 results. There are maybe 5 compaction events grouped together at ~6.1K seconds, again at 12.1K seconds, then again at 16.1K seconds, 20.2K seconds, etc. The clusters are really tight, spanning only a couple of seconds each, but then there's over an hour where compaction is idle until the next small grouping. That would imply a really light write workload with minimal data versus a much heavier one in the 5.0 case. Can you think of anything that might have been different on that 4.2z2 OSD vs the 5.0 one?

OK, the difference in OSD log sizes is directly related to the difference in rocksdb behavior between these two runs. I went back and dug through the respective 4.2 and 5.0 OSD logs looking at memtable flushing behavior, given the big difference in compaction stats between them. As you said, Tim, the memtable sizes are correct in both cases (256MB). In the 5.0 case, though, we are filling up memtables extremely fast, often flushing them when they are full, but also sometimes having rocksdb's "write buffer manager" flush early for some column families when global limits are hit. In the 4.2 case there were 120 flush events; in 5.0 there were 20110. 9135 of those appeared to happen because the "L" column family completely filled a memtable, forcing a flush. The others were due to the write buffer manager deciding it needed to prematurely flush memtables before they were completely full (always for other column families like onode or omap).

Example 5.0 flush associated with the "L" column family:

2021-07-31T22:56:45.302+0000 7ff98ab62700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772205303825, "job": 23798, "event": "flush_started", "num_memtables": 1, "num_entries": 535, "num_deletes": 0, "total_data_size": 254484297, "memory_usage": 260062264, "flush_reason": "Write Buffer Full"}

~500 entries in a ~256MB memtable means roughly 512KB of data per transaction on average.

Example 5.0 flush associated with other column families:

2021-07-31T22:56:37.134+0000 7ff98ab62700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772197135248, "job": 23797, "event": "flush_started", "num_memtables": 1, "num_entries": 5147, "num_deletes": 2540, "total_data_size": 1872446, "memory_usage": 1974888, "flush_reason": "Write Buffer Manager"}

Example 4.2 flush (4.2 has one big column family):

2021-08-01 07:25:32.766 7fc7072d4700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627802732767384, "job": 329, "event": "flush_started", "num_memtables": 1, "num_entries": 779408, "num_deletes": 138087, "total_data_size": 245589616, "memory_usage": 260049232, "flush_reason": "Write Buffer Full"}

Much better behavior here, more along the lines of what we expect (and also roughly what I see with master on our mako cluster with NVMe drives doing RBD tests).

The "L" column family is for deferred writes, so for some reason in 5.0 we seem to be seeing large (~512KB?) deferred writes. In pacific the default bluestore_prefer_deferred_size_hdd value is 64K, so unless I'm missing something we shouldn't be seeing huge transactions hitting the L column family by default like this. If you haven't changed that setting, this might be a bug.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

I was able to reproduce this issue with upstream pacific on Mako using NVMe drives.
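For anyone wanting to repeat this flush tally, a minimal sketch (not the cbt parser linked above) that counts flush_started events by flush_reason in a ceph-osd log, assuming the JSON payload directly follows the "EVENT_LOG_v1 " marker as in the excerpts quoted above:

#!/usr/bin/env python3
# Tally rocksdb memtable flush events by flush_reason from a ceph-osd log.
# Usage: ./flush_tally.py /var/log/ceph/ceph-osd.0.log
import json
import sys
from collections import Counter

MARKER = "EVENT_LOG_v1 "

reasons = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        idx = line.find(MARKER)
        if idx < 0:
            continue
        try:
            event = json.loads(line[idx + len(MARKER):])
        except ValueError:
            continue  # skip lines whose payload isn't valid JSON
        if event.get("event") == "flush_started":
            reasons[event.get("flush_reason", "unknown")] += 1

for reason, count in reasons.most_common():
    print("%8d  %s" % (count, reason))

A count dominated by "Write Buffer Full" versus "Write Buffer Manager" distinguishes the two flush triggers discussed above.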
Increasing bluestore_prefer_deferred_size to the HDD default (64K) dramatically increased rocksdb memtable flushing due to excessive writes to the "L" deferred column family (and thus greatly increased compaction overhead). Increasing the max_blob_size mitigates the issue to some extent, though it's clear that the deferred IO path (even at 4K) introduces fairly significant memtable flush overhead. It becomes extremely dramatic, though, when the deferred IO size gets close to or exceeds the blob size. Upstream we probably want to get Igor involved in looking at these results, since this is pretty excessive. Again, this does not occur with the default SSD/NVMe settings, only with the default HDD settings.
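For reference, a minimal ceph.conf sketch of the kind of override described above for reproducing this on SSD/NVMe OSDs. The exact option variants and values used in the upstream repro are not recorded here, so the generic (non-_hdd/_ssd) option names and the 512K blob size below are assumptions for illustration only:

[osd]
# Force the 64K HDD-style deferred-write threshold onto these OSDs
# (assumption: the generic option overrides the per-device-class defaults).
bluestore_prefer_deferred_size = 65536
# Mitigation noted above: keep deferred IOs well below the blob size.
# 512K is an illustrative value, not necessarily the one used in the repro.
bluestore_max_blob_size = 524288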
The compaction stats listed above were collected from a workload using the DFG's usual object size histogram:

[sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB]

Further tests were run using a smaller set of objects:

[sizes=h(1|2|25,2|4|40,4|8|25,8|256|10)KB]

Those results are included here.

RHCS 4.2.z2: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_42z2-bb_20210803.log

Compaction Statistics               ceph-osd.0.log-20210803
Total OSD Log Duration (seconds)    172375.56
Number of Compaction Events         695
Avg Compaction Time (seconds)       1.4297389007194243
Total Compaction Time (seconds)     993.6685359999999
Avg Output Size (MB)                317.3482826452461
Total Output Size (MB)              220557.05643844604
Total Input Records                 516063612
Total Output Records                438248858
Avg Output Throughput (MB/s)        216.54726637849893
Avg Input Records/second            597336.6266970418
Avg Output Records/second           490898.5575847847
Avg Output/Input Ratio              0.8390318513188124

RHCS 5: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_50-bb_20210810.log

Compaction Statistics               ceph-osd.0.log-20210810
Total OSD Log Duration (seconds)    173069.541
Number of Compaction Events         1260
Avg Compaction Time (seconds)       1.300772765079365
Total Compaction Time (seconds)     1638.973684
Avg Output Size (MB)                207.90438506489707
Total Output Size (MB)              261959.52518177032
Total Input Records                 767957338
Total Output Records                512552498
Avg Output Throughput (MB/s)        139.84079579789753
Avg Input Records/second            503980.3206531971
Avg Output Records/second           314987.9575686694
Avg Output/Input Ratio              0.7265201888226785
(In reply to Tim Wilkinson from comment #2)
> Further tests were run using a smaller set of objects:
> [sizes=h(1|2|25,2|4|40,4|8|25,8|256|10)KB]
> Those results are included here.

Thanks, Tim. This is good; it confirms that this occurs mostly with objects which are bigger than 64MB.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294