Bug 1991677
Summary: | [Workload-DFG] RHCS 5 has a very high number of OSD compaction events as compared to 4.x | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
Component: | RADOS | Assignee: | Adam Kupczyk <akupczyk> |
Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 5.0 | CC: | akupczyk, bhubbard, ceph-eng-bugs, gsitlani, mmuench, mnelson, nojha, pdhiran, racpatel, rmandyam, rzarzyns, sseshasa, tserlin, twilkins, vereddy, vumrao |
Target Milestone: | --- | ||
Target Release: | 5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | ceph-16.2.0-116.el8cp | Doc Type: | Bug Fix |
Doc Text: |
.The trigger condition for RocksDB flush and compactions works as expected
BlueStore organizes data into chunks called blobs, which are 64K by default. Large writes are split into a sequence of 64K blob writes.
Previously, when the deferred size was equal to or greater than the blob size, all the data was deferred and placed under the “L” column family. A typical example is the HDD configuration, where both the `bluestore_prefer_deferred_size_hdd` and `bluestore_max_blob_size_hdd` parameters default to 64K. This filled the “L” column family faster, making RocksDB flushes and compactions more frequent. The trigger condition for this scenario was `data size in blob` <= `minimum deferred size`.
With this release, the deferred trigger condition checks the size of extents on disk, not blobs. Extents smaller than `deferred_size` go through the deferred mechanism, and larger extents are written to the disk immediately. The trigger condition is changed to `data size in extent` < `minimum deferred size`.
Small writes are placed under the “L” column family, which now grows slowly, with no extra compactions.
The `bluestore_prefer_deferred_size` parameter now controls deferred writes without interference from the blob size, and works as per its description of “writes smaller than this size”.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-08-30 08:31:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1959686 |
Description
Vikhyat Umrao
2021-08-09 16:49:55 UTC
From Mark's email feedback:

I also notice something goofy about those 4.2z2 results. There are maybe 5 compaction events grouped together at ~6.1K seconds, again at 12.1K seconds, then again at 16.1K seconds, 20.2K seconds, etc. The clusters are really tight, with a span of only a couple of seconds each, but then compaction is idle for over an hour until the next small grouping. That would imply a really light write workload with minimal data versus a much heavier one in the 5.0 case. Can you think of anything that might have been different on that 4.2z2 OSD vs the 5.0 one?

OK, the difference in OSD log sizes is directly related to the difference in RocksDB behavior between these two runs. I went back and dug through the respective 4.2 and 5.0 OSD logs looking at memtable flushing behavior, given the big difference in compaction stats between them. As you said, Tim, the memtable sizes are correct in both cases (256MB). In the 5.0 case, though, we are filling up memtables extremely fast, often flushing them when they are full, but also sometimes having RocksDB's "write buffer manager" flush early for some column families when global limits are hit. In the 4.2 case there were 120 flush events; in 5.0 there were 20110. 9135 of those appeared to happen because the "L" column family completely filled a memtable, forcing a flush. The others were due to the write buffer manager deciding it needed to prematurely flush memtables before they were completely full (always for other column families like onode or omap).

Example 5.0 flush associated with the "L" column family:

2021-07-31T22:56:45.302+0000 7ff98ab62700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772205303825, "job": 23798, "event": "flush_started", "num_memtables": 1, "num_entries": 535, "num_deletes": 0, "total_data_size": 254484297, "memory_usage": 260062264, "flush_reason": "Write Buffer Full"}

~500 entries means roughly 512KB of data per transaction on average.
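The flush counts cited above (120 events in 4.2 vs 20110 in 5.0, with 9135 attributed to the "L" column family) can be pulled out of the OSD logs mechanically. A minimal sketch, assuming the `EVENT_LOG_v1` JSON format shown in the example lines; `tally_flushes` and the 64K average-entry-size heuristic for spotting large deferred writes are illustrative choices, not part of the original analysis:

```python
import json
import re
from collections import Counter

# Matches the JSON payload that follows "EVENT_LOG_v1" in an OSD log line.
EVENT_RE = re.compile(r"EVENT_LOG_v1 (\{.*\})")

def tally_flushes(lines):
    """Count flush_started events by flush_reason, and flag flushes whose
    average entry size suggests large deferred ("L" column family) writes."""
    reasons = Counter()
    large_entry_flushes = 0
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ev = json.loads(m.group(1))
        if ev.get("event") != "flush_started":
            continue
        reasons[ev.get("flush_reason", "unknown")] += 1
        entries = ev.get("num_entries", 0)
        # Heuristic: >64K of data per memtable entry hints at deferred blob writes.
        if entries and ev.get("total_data_size", 0) / entries > 64 * 1024:
            large_entry_flushes += 1
    return reasons, large_entry_flushes
```

Run against the two example 5.0 lines, the first (535 entries, ~254MB) is flagged as a large-entry flush while the second (5147 entries, ~1.9MB) is not.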
Example 5.0 flush associated with other column families:

2021-07-31T22:56:37.134+0000 7ff98ab62700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627772197135248, "job": 23797, "event": "flush_started", "num_memtables": 1, "num_entries": 5147, "num_deletes": 2540, "total_data_size": 1872446, "memory_usage": 1974888, "flush_reason": "Write Buffer Manager"}

Example 4.2 flush (4.2 has one big column family):

2021-08-01 07:25:32.766 7fc7072d4700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1627802732767384, "job": 329, "event": "flush_started", "num_memtables": 1, "num_entries": 779408, "num_deletes": 138087, "total_data_size": 245589616, "memory_usage": 260049232, "flush_reason": "Write Buffer Full"}

Much better behavior here, more along the lines of what we expect (and also roughly what I see with master on our Mako cluster with NVMe drives doing RBD tests). The "L" column family is for deferred writes, so for some reason in 5.0 we seem to be seeing large (~512KB?) deferred writes. In Pacific the default bluestore_prefer_deferred_size_hdd value is 64_K, so unless I'm missing something we shouldn't be seeing huge transactions hitting the L column family by default like this. If you haven't changed that setting, this might be a bug.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

I was able to reproduce this issue with upstream Pacific on Mako using NVMe drives. Increasing bluestore_prefer_deferred_size to the HDD default (64K) dramatically increased RocksDB memtable flushing due to excessive writes to the "L" deferred column family (and thus greatly increased compaction overhead). Increasing bluestore_max_blob_size mitigates the issue to some extent, though it's clear that the deferred I/O path (even at 4K) is introducing fairly significant memtable flush overhead. It becomes extremely dramatic, though, when the deferred I/O size gets close to or exceeds the blob size.
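The interaction between the deferred size and the blob size can be restated as a toy model. This is not BlueStore code; `deferred_old` and `deferred_new` simply encode the two trigger conditions from the doc text (`data size in blob` <= `minimum deferred size` before the fix, `data size in extent` < `minimum deferred size` after it), using the HDD defaults where both `bluestore_prefer_deferred_size_hdd` and `bluestore_max_blob_size_hdd` are 64K:

```python
BLOB_SIZE = 64 * 1024        # bluestore_max_blob_size_hdd default
PREFER_DEFERRED = 64 * 1024  # bluestore_prefer_deferred_size_hdd default

def deferred_old(blob_data_size, prefer_deferred=PREFER_DEFERRED):
    # Pre-fix trigger: data size in blob <= minimum deferred size
    return blob_data_size <= prefer_deferred

def deferred_new(extent_size, prefer_deferred=PREFER_DEFERRED):
    # Post-fix trigger: data size in extent < minimum deferred size
    return extent_size < prefer_deferred

# A 512 KiB write split into 64 KiB blob-sized chunks:
chunks = [BLOB_SIZE] * 8
assert all(deferred_old(c) for c in chunks)      # old: everything deferred -> "L" column
assert not any(deferred_new(c) for c in chunks)  # new: full-size extents written directly
assert deferred_new(4 * 1024)                    # genuinely small writes still deferred
```

With the old inclusive comparison, every 64K chunk of a large write satisfies the trigger and lands in the "L" column family, which is exactly the ~512KB-per-transaction deferred traffic observed in the 5.0 flush logs; the new strict, extent-based comparison defers only writes that are actually smaller than the threshold.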
Upstream we probably want to get Igor involved in looking at these results, since this is pretty excessive. Again, this doesn't occur with the default SSD/NVMe settings, only the default HDD settings.

The compaction stats listed above were collected from a workload using the DFG's usual object size histogram:

[sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB]

Further tests were run using a smaller set of objects:

[sizes=h(1|2|25,2|4|40,4|8|25,8|256|10)KB]

Those results are included here.

RHCS 4.2.z2: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_42z2-bb_20210803.log

Compaction Statistics | ceph-osd.0.log-20210803
---|---
Total OSD Log Duration (seconds) | 172375.56
Number of Compaction Events | 695
Avg Compaction Time (seconds) | 1.4297389007194243
Total Compaction Time (seconds) | 993.6685359999999
Avg Output Size (MB) | 317.3482826452461
Total Output Size (MB) | 220557.05643844604
Total Input Records | 516063612
Total Output Records | 438248858
Avg Output Throughput (MB/s) | 216.54726637849893
Avg Input Records/second | 597336.6266970418
Avg Output Records/second | 490898.5575847847
Avg Output/Input Ratio | 0.8390318513188124

RHCS 5: http://perf1.lab.bos.redhat.com/tim/tmp/rocksdb_50-bb_20210810.log

Compaction Statistics | ceph-osd.0.log-20210810
---|---
Total OSD Log Duration (seconds) | 173069.541
Number of Compaction Events | 1260
Avg Compaction Time (seconds) | 1.300772765079365
Total Compaction Time (seconds) | 1638.973684
Avg Output Size (MB) | 207.90438506489707
Total Output Size (MB) | 261959.52518177032
Total Input Records | 767957338
Total Output Records | 512552498
Avg Output Throughput (MB/s) | 139.84079579789753
Avg Input Records/second | 503980.3206531971
Avg Output Records/second | 314987.9575686694
Avg Output/Input Ratio | 0.7265201888226785

(In reply to Tim Wilkinson from comment #2)
> [compaction statistics for both object-size histograms, quoted above]

Thanks, Tim. This is good; it confirms that this happens mostly with objects which are bigger than 64MB.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:3294
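As a closing sanity check on the compaction tables above: the "Avg Compaction Time" and "Avg Output Size" figures are consistent with simply dividing the quoted totals by the event counts (the throughput, records/second, and output/input-ratio figures do not match total-over-total division and appear to be averages of per-event rates, so they are not re-derived here). A small hedged sketch, using only numbers quoted in the comment:

```python
# Quoted totals from the two compaction summaries (4.2.z2 and 5.0 runs).
runs = {
    "RHCS 4.2.z2": dict(events=695, total_time_s=993.6685359999999,
                        total_output_mb=220557.05643844604),
    "RHCS 5":      dict(events=1260, total_time_s=1638.973684,
                        total_output_mb=261959.52518177032),
}

# (avg compaction time in seconds, avg output size in MB) per run.
averages = {
    name: (r["total_time_s"] / r["events"], r["total_output_mb"] / r["events"])
    for name, r in runs.items()
}
```

This reproduces the quoted ~1.43 s / ~317.35 MB averages for 4.2.z2 and ~1.30 s / ~207.90 MB for 5.0: RHCS 5 ran almost twice as many compaction events, each moving less data.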