Bug 1332874
| Summary: | Slow/blocked requests for a specific pool "rbd" which has approx 66 million objects | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED DUPLICATE | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 1.3.2 | CC: | ceph-eng-bugs, dzafman, jbiao, jdurgin, kchai, kdreyer, linuxkidd, sjust, vikumar |
| Target Milestone: | rc | | |
| Target Release: | 1.3.4 | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-26 12:49:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vikhyat Umrao
2016-05-04 09:12:00 UTC
Comment 55
Josh Durgin

After some discussion, I have a theory - they may have just hit the split threshold on many OSDs at once, resulting in high latency as they were all splitting directories at the same time (an expensive operation). Increasing the threshold may have stopped the splitting temporarily, but they will run into the same issue once they reach the larger threshold of 9600 files/dir.

Continuing to increase the threshold increases the cost of background processes like backfill, scrub, and PG splitting, though we don't have good data on how high the threshold can be before causing issues there.

Can we get tree output from, say, 100 random PGs in the rbd pool to verify that they were near the former split threshold of 5120 files/dir?

Added http://tracker.ceph.com/issues/15835 upstream as a possible way to mitigate this if this theory is correct.

Comment 59
Vikhyat Umrao

(In reply to Josh Durgin from comment #55)
> After some discussion, I have a theory - they may have just hit the split
> threshold on many osds at once, resulting in high latency as they were all
> splitting directories at once (an expensive operation). Increasing the
> threshold may have stopped the splitting temporarily, but they will run into
> the same issue once they reach the larger threshold of 9600 files/dir.
>
> Continuing to increase the threshold increases the cost of background
> processes like backfill, scrub, and pg splitting, though we don't have good
> data on how high the threshold can be before causing issues there.
>
> Can we get tree output from say 100 random pgs in the rbd pool
> to verify that they were near the former split threshold of 5120 files/dir?

The customer captured this tree output almost a week after the new filestore settings were put in place. That should not matter, though, since the goal is only "to verify that they were near the former split threshold of 5120 files/dir", and with the current output it looks like all of them were close to that threshold, which largely confirms our theory. Am I right?

Thanks,
Vikhyat

Josh Durgin

(In reply to Vikhyat Umrao from comment #59)
> with the current output it looks like all of them were close to that
> threshold, which largely confirms our theory. Am I right?

Yes, it looks like there is very little variation in the number of files per PG, so they were very likely all just crossing the 5120 threshold when the slow requests started.

*** This bug has been marked as a duplicate of bug 1219974 ***
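For context on the numbers discussed above, here is a minimal sketch of how the thresholds relate to the filestore settings and how the "files per PG directory" check could be approximated. It assumes the usual FileStore behaviour where a leaf directory is split once it holds more than 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects; for example, filestore_merge_threshold = 40 with filestore_split_multiple = 8 yields the 5120 files/dir figure mentioned in comment 55, and raising filestore_split_multiple to 15 yields 9600. The specific settings, the OSD data path, and the pool id below are illustrative assumptions, not values taken from this bug.

```python
#!/usr/bin/env python3
# Sketch only: estimate how close each PG's leaf directories are to the
# FileStore split point. Settings, paths, and pool id here are assumptions.

import glob
import os

# Assumed values; the real ones can be read with
#   ceph daemon osd.<N> config get filestore_merge_threshold
FILESTORE_MERGE_THRESHOLD = 40
FILESTORE_SPLIT_MULTIPLE = 8   # raising this to 15 gives the 9600 files/dir threshold

# Hypothetical location: FileStore keeps PG data under current/<pgid>_head;
# pool id 0 is assumed to be the "rbd" pool here.
PG_DIR_GLOB = "/var/lib/ceph/osd/ceph-*/current/0.*_head"


def split_threshold(split_multiple, merge_threshold):
    """Files per leaf directory at which FileStore starts splitting."""
    return 16 * split_multiple * abs(merge_threshold)


def max_files_in_leaf_dirs(pg_dir):
    """Return the largest file count found in any leaf directory of a PG."""
    worst = 0
    for root, dirs, files in os.walk(pg_dir):
        if not dirs:  # leaf directory: no further hash subdirectories
            worst = max(worst, len(files))
    return worst


def main():
    threshold = split_threshold(FILESTORE_SPLIT_MULTIPLE, FILESTORE_MERGE_THRESHOLD)
    print("split threshold: %d files/dir" % threshold)

    for pg_dir in sorted(glob.glob(PG_DIR_GLOB)):
        worst = max_files_in_leaf_dirs(pg_dir)
        pct = 100.0 * worst / threshold if threshold else 0.0
        print("%-55s max %5d files/dir (%3.0f%% of split point)" % (pg_dir, worst, pct))


if __name__ == "__main__":
    main()
```

If most PGs report leaf directories at or near 100% of the 5120 figure, that would match the theory in comment 55 that many OSDs began splitting directories at roughly the same time; running the same check against 9600 shows how much headroom the raised threshold leaves before the splitting repeats.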