Description of problem:
When large numbers of objects are written to a pool with a reasonable number of placement groups (100 per OSD / replica count), the number of directories created within each placement group has a negative effect on performance.

How reproducible:
Very

Steps to Reproduce:
1. Benchmark the system with the default values for filestore merge threshold and filestore split multiple.
2. After many objects have been written, watch performance degrade.
3. Delete all data and rerun the test.

Actual results:
Performance degrades significantly in many cases.

Expected results:
Consistent performance.

Additional info:
In many cases, setting "filestore merge threshold = 40" and "filestore split multiple = 8" resolved the performance degradation caused by the large number of directories in the placement groups.
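For reference, a minimal ceph.conf sketch of the settings that resolved the degradation for us (the values are the ones from the tests above; they will need tuning per deployment):

    [osd]
    # raise the per-directory object count before filestore splits PG directories
    filestore merge threshold = 40
    filestore split multiple = 8

As far as I know this only changes the thresholds for future splits and merges; it does not re-collapse directories that have already been split.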
Did you run the tests long enough to establish whether you were seeing long-term performance lower than the average, or only the cost of splitting? Was the improvement with 40/8 because of fewer folders, or because it didn't split during the test period? I'm sure the values we have right now aren't optimal (and have suggested increasing them more than once), but changing them is going to be a tradeoff and depend on the particulars of the system. We'll probably need some benchmarking resources to validate any changes we make.
We ran benchmarks back around Dumpling: http://nhm.ceph.com/4k-wip-6286-compare-splits.pdf

Basically, under normal operation increasing the split and merge thresholds improves performance and helps smooth out the degradation as the object count increases. The big concern is probably what happens during recovery. FWIW, I spoke to the XFS guys a while back and they thought we should be optimizing around thousands of files per directory rather than hundreds.

The results above include swift-bench tests for default, aggressive, and relaxed directory splitting. Please keep in mind that swift-bench gives you a running view of the average, so sudden drops or increases in performance will only show up as a gradual change in the curve.

Here are the settings used:

Default (320 objects per directory):
filestore merge threshold = 10
filestore split multiple = 2

Aggressive (80 objects per directory):
filestore merge threshold = 5
filestore split multiple = 1

Relaxed (1600 objects per directory):
filestore merge threshold = 20
filestore split multiple = 5
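If I'm reading the filestore code right, the split point works out to roughly 16 * filestore split multiple * filestore merge threshold objects per subdirectory, which is where the per-directory counts above come from:

    default:    16 * 2 * 10 = 320 objects per directory
    aggressive: 16 * 1 * 5  = 80 objects per directory
    relaxed:    16 * 5 * 20 = 1600 objects per directory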
> The big concern is probably what happens during recovery.

If we have a very relaxed directory splitting setting, we might end up with directories with too many object chunks in them, and that will hurt performance when the OSD lists chunks for scrubbing (where the number of chunks is defined by osd_scrub_chunk_{min,max}). Is that your concern when it comes to recovery?
That would be my main concern since you need to list all of the contents of a directory at once (max 320 with the default settings).
Should we push this to 1.3.2, or is it done upstream?
We haven't really made a decision here, probably best to push it to 1.3.2.
Pushing. See you next time, dear Bug :)
Chiming in as a user. We are hitting issues with this, due to the defaults being too small. It doesn't take much of a cluster before this starts to have an impact. We've probably mitigated it somewhat on our side by reducing vfs_cache_pressure and over-provisioning the systems with memory.
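In case it helps anyone else, the vfs_cache_pressure part of that is just a sysctl; something along these lines (the value of 10 is only the example we settled on, not a recommendation from this bug, and it only helps if the box has enough RAM to actually hold the inode/dentry caches):

    # /etc/sysctl.conf
    # keep cached dentries/inodes around longer than the default of 100
    vm.vfs_cache_pressure = 10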
No staffing in performance to do it in this cycle...
*** Bug 1367448 has been marked as a duplicate of this bug. ***
*** Bug 1332874 has been marked as a duplicate of this bug. ***
Increasing the defaults doesn't fix the problem, it just delays it. Closing in favor of randomizing the split thresholds. *** This bug has been marked as a duplicate of bug 1337018 ***
While randomizing the split/merge thresholds will avoid splitting at the same time on all OSDs, it won't help reduce the complexity of the PGs' directory trees and the performance degradation that goes with it. Our recent experience led us to believe that XFS gets unhappy with more than 200,000 files per PG directory tree (and 20 million inodes used) in this context of 320 files max per directory. Even after reducing this number by increasing the number of PGs on the pool, the OSDs did not recover to their initial level of performance and we had to rebuild all of them with 40/8 values. xfs_db showed no fragmentation, but the OSDs still had a huge apply_latency and the hardware was not at fault. The defaults should be raised.
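For the record, assuming the same 16 * split multiple * merge threshold formula mentioned earlier in this bug, the 40/8 values we rebuilt with put the split point at roughly:

    16 * 8 * 40 = 5120 files per directory (vs. 320 with the current defaults)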