Description of problem:
When large numbers of objects are written to a pool with a reasonable number of placement groups (100 per OSD / replica count), the number of directories created within each placement group has a negative effect on performance.

How reproducible:
Very

Steps to Reproduce:
1. Benchmark the system with the default values for filestore merge threshold and filestore split multiple.
2. After many objects have been written, watch performance degrade.
3. Delete all data and rerun the test.

Actual results:
Performance degrades significantly in many cases.

Expected results:
Consistent performance.

Additional info:
In many cases, setting "filestore merge threshold = 40" and "filestore split multiple = 8" resolved the performance degradation caused by the large number of directories in the placement groups.
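For reference, a minimal ceph.conf sketch of the settings that resolved the degradation for us (the values are the ones from the tests above; they will need tuning per deployment):

    [osd]
    # raise the per-directory object count before filestore splits PG directories
    filestore merge threshold = 40
    filestore split multiple = 8

As far as I know this only changes the thresholds for future splits and merges; it does not re-collapse directories that have already been split.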
Did you run the tests long enough to establish whether you were seeing long-term performance lower than the average, or only the cost of splitting? Was the improvement with 40/8 because of fewer folders, or because it didn't split during the test period? I'm sure the values we have right now aren't optimal (and have suggested increasing them more than once), but changing them is going to be a tradeoff and depend on the particulars of the system. We'll probably need some benchmarking resources to validate any changes we make.
We ran benchmarks back around Dumpling: http://nhm.ceph.com/4k-wip-6286-compare-splits.pdf

Basically, under normal operation increasing the split and merge thresholds improves performance and helps smooth out the degradation as the object count increases. The big concern is probably what happens during recovery. FWIW, I spoke to the XFS guys a while back and they thought we should be optimizing around thousands of files per directory rather than hundreds.

The results above include swift-bench tests for default, aggressive, and relaxed directory splitting. Please keep in mind that swift-bench gives you a running view of the average, so sudden drops or increases in performance will only show up as a gradual change in the curve.

Here are the settings used:

Default (320 objects per directory):
filestore merge threshold = 10
filestore split multiple = 2

Aggressive (80 objects per directory):
filestore merge threshold = 5
filestore split multiple = 1

Relaxed (1600 objects per directory):
filestore merge threshold = 20
filestore split multiple = 5
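If I'm reading the filestore code right, the split point works out to roughly 16 * filestore split multiple * filestore merge threshold objects per subdirectory, which is where the per-directory counts above come from:

    default:    16 * 2 * 10 = 320 objects per directory
    aggressive: 16 * 1 * 5  = 80 objects per directory
    relaxed:    16 * 5 * 20 = 1600 objects per directory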
> The big concern is probably what happens during recovery.

If we have a very relaxed directory splitting setting, we might end up with directories with too many object chunks in them, and that will hurt performance when the OSD lists chunks for scrubbing (where the number of chunks is defined by osd_scrub_chunk_{min,max}). Is that your concern when it comes to recovery?
That would be my main concern since you need to list all of the contents of a directory at once (max 320 with the default settings).
Should we push this to 1.3.2, or is it done upstream?
We haven't really made a decision here, probably best to push it to 1.3.2.
Pushing. See you next time, dear Bug :)
Chiming in as a user. We are hitting issues with this, due to the defaults being too small. It doesn't take much of a cluster before this starts to have an impact. We've probably mitigated it somewhat on our side by reducing vfs_cache_pressure and over-provisioning the systems with memory.
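In case it helps anyone else, the vfs_cache_pressure part of that is just a sysctl; something along these lines (the value of 10 is only the example we settled on, not a recommendation from this bug, and it only helps if the box has enough RAM to actually hold the inode/dentry caches):

    # /etc/sysctl.conf
    # keep cached dentries/inodes around longer than the default of 100
    vm.vfs_cache_pressure = 10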
No staffing in performance to do it in this cycle...
*** Bug 1367448 has been marked as a duplicate of this bug. ***
*** Bug 1332874 has been marked as a duplicate of this bug. ***
Increasing the defaults doesn't fix the problem, it just delays it. Closing in favor of randomizing the split thresholds. *** This bug has been marked as a duplicate of bug 1337018 ***
While randomizing the split/merge thresholds will avoid splitting at the same time on all OSDs, it won't help reduce the complexity of the PGs' directory trees and the performance degradation that goes with it. Our recent experience led us to believe that XFS gets unhappy with more than 200,000 files per PG directory tree (and 20 million inodes used) in this context of 320 files max per directory. Even after reducing this number by increasing the number of PGs on the pool, the OSDs did not recover to their initial level of performance and we had to rebuild all of them with 40/8 values. xfs_db showed no fragmentation, but the OSDs still had a huge apply_latency and the hardware was not at fault. The defaults should be raised.
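For the record, assuming the same 16 * split multiple * merge threshold formula mentioned earlier in this bug, the 40/8 values we rebuilt with put the split point at roughly:

    16 * 8 * 40 = 5120 files per directory (vs. 320 with the current defaults)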