Bug 1332874

Summary: Slow/blocked requests for a specific pool "rbd" which has approx 66 million objects
Product: Red Hat Ceph Storage [Red Hat Storage]
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Status: CLOSED DUPLICATE
Reporter: Vikhyat Umrao <vumrao>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, dzafman, jbiao, jdurgin, kchai, kdreyer, linuxkidd, sjust, vikumar
Target Milestone: rc
Target Release: 1.3.4
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-09-26 12:49:47 UTC

Description Vikhyat Umrao 2016-05-04 09:12:00 UTC
Description of problem:
Slow/blocked requests for a specific pool, "rbd", which has approximately 66 million objects:

NAME    ID      USED    %USED    MAX AVAIL     OBJECTS
rbd     57      252T     7.90         345T    66220069
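
For context, a per-pool usage line in this format is what "ceph df" prints in its POOLS section; a minimal sketch of the likely command (the exact invocation used by the customer is not recorded here):

# Print cluster-wide and per-pool usage, including per-pool object counts
ceph df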

- The customer is using *filestore merge threshold = 40* and *filestore split multiple = 8*, set as per this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1219974 (a ceph.conf sketch of these settings follows this list).

- They have 846 OSDs in total, but the "rbd" pool that is facing this issue uses ruleset 1, which contains 576 OSDs, all 4 TB.

- Each OSD node has 32 OSDs, of which 24 are the 4 TB OSDs that belong to the *rbd* pool seeing the slow requests.

- OSD node configuration:
  - 125 GB RAM
  - 24-core CPU
  - Networking (OSD nodes): bonded, IEEE 802.3ad dynamic link aggregation, 2 x 10 Gbps NICs = 20 Gbps
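
For reference, a minimal ceph.conf sketch (my illustration, not the customer's actual file) of the filestore settings mentioned above:

# /etc/ceph/ceph.conf -- [osd] section sketch with the values reported in this BZ
[osd]
filestore merge threshold = 40
filestore split multiple = 8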

- We asked for dump_ops_in_flight output from two OSDs that were seeing slow requests:

osd.397.dump_ops_in_flight :

{
    "ops": [
        {
            "description": "osd_op(client.1862229.0:3035 benchmark_data_rcprsdc1r70-01-ac_2891081_object812 [delete] 57.37f66ef6 ack+ondisk+write+known_if_redirected e213940)",
            "initiated_at": "2016-05-03 16:45:27.170510",
            "age": 24121.264136,
            "duration": 0.000000,
            "type_data": [
                "no flag points reached",
                {
                    "client": "client.1862229",
                    "tid": 3035
                },
                [
                    {
                        "time": "2016-05-03 16:45:27.170510",
                        "event": "initiated"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

- Both ops were stuck in the "no flag points reached" state.
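
For completeness, dumps like the one above are normally gathered through the OSD admin socket; a sketch of the usual commands (the socket path shown is the default and may differ on the customer's nodes):

# Dump in-flight (possibly slow/blocked) ops on osd.397
ceph daemon osd.397 dump_ops_in_flight

# Equivalent invocation against the admin socket directly (default path assumed)
ceph --admin-daemon /var/run/ceph/ceph-osd.397.asok dump_ops_in_flight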


Version-Release number of selected component (if applicable):
Upstream Hammer : ceph-0.94.3-0.el7.x86_64

Comment 55 Josh Durgin 2016-05-11 01:02:47 UTC
After some discussion, I have a theory - they may have just hit the split threshold on many osds at once, resulting in high latency as they were all splitting directories at once (an expensive operation). Increasing the threshold may have stopped the splitting temporarily, but they will run into the same issue once they reach the larger threshold of 9600 files/dir.
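
To make the arithmetic behind these numbers explicit, assuming the standard FileStore rule that a PG subdirectory is split once it exceeds filestore_split_multiple * abs(filestore_merge_threshold) * 16 files, the former split point works out as below (the 9600 figure corresponds to whatever larger values were applied afterwards, which are not listed in this excerpt):

# Former split point with filestore split multiple = 8 and filestore merge threshold = 40
echo $(( 8 * 40 * 16 ))   # 5120 files/dir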

Continuing to increase the threshold increases the cost of background processes like backfill, scrub, and pg splitting, though we don't have good data on how high the threshold can be before causing issues there.

Can we get tree output from, say, 100 random PGs in the rbd pool to verify that they were near the former split threshold of 5120 files/dir?
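
A rough sketch of how such per-PG file counts could be collected on one OSD node, assuming the default FileStore layout /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head and reusing pool id 57 and osd.397 from above (adjust as needed); the raw counts are what matters for comparing against the 5120 files/dir split point:

# Sample ~100 random PG directories of pool 57 on osd.397 and count the files in each
ls -d /var/lib/ceph/osd/ceph-397/current/57.*_head | shuf -n 100 | while read -r pg; do
    printf '%8d  %s\n' "$(find "$pg" -type f | wc -l)" "$pg"
done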

Comment 56 Josh Durgin 2016-05-11 01:07:58 UTC
Added http://tracker.ceph.com/issues/15835 upstream as a possible way to mitigate this if this theory is correct.

Comment 59 Vikhyat Umrao 2016-05-17 06:21:15 UTC
(In reply to Josh Durgin from comment #55)
> After some discussion, I have a theory - they may have just hit the split
> threshold on many osds at once, resulting in high latency as they were all
> splitting directories at once (an expensive operation). Increasing the
> threshold may have stopped the splitting temporarily, but they will run into
> the same issue once they reach the larger threshold of 9600 files/dir.
> 
> Continuing to increase the threshold increases the cost of background
> processes like backfill, scrub, and pg splitting, though we don't have good
> data on how high the threshold can be before causing issues there.
> 
> Can we get tree output from, say, 100 random PGs in the rbd pool to verify
> that they were near the former split threshold of 5120 files/dir?

The customer captured this tree output almost a week after the new filestore settings were applied. That should not matter, though, since we are only interested in verifying "that they were near the former split threshold of 5120 files/dir".

With the current output it looks like they were all near the threshold, which largely confirms our theory. Am I right?

Thanks,
Vikhyat

Comment 60 Josh Durgin 2016-05-17 07:16:15 UTC
(In reply to Vikhyat Umrao from comment #59)
> (In reply to Josh Durgin from comment #55)
> > After some discussion, I have a theory - they may have just hit the split
> > threshold on many osds at once, resulting in high latency as they were all
> > splitting directories at once (an expensive operation). Increasing the
> > threshold may have stopped the splitting temporarily, but they will run into
> > the same issue once they reach the larger threshold of 9600 files/dir.
> > 
> > Continuing to increase the threshold increases the cost of background
> > processes like backfill, scrub, and pg splitting, though we don't have good
> > data on how high the threshold can be before causing issues there.
> > 
> > Can we get tree output from, say, 100 random PGs in the rbd pool to
> > verify that they were near the former split threshold of 5120 files/dir?
> 
> The customer captured this tree output almost a week after the new
> filestore settings were applied. That should not matter, though, since we
> are only interested in verifying "that they were near the former split
> threshold of 5120 files/dir".
> 
> With the current output it looks like they were all near the threshold,
> which largely confirms our theory. Am I right?

Yes, it looks like there's very little variation in number of files/pg, so they were very likely all just crossing the 5120 threshold when the slow requests started.

Comment 68 Vikhyat Umrao 2016-09-26 12:49:47 UTC

*** This bug has been marked as a duplicate of bug 1219974 ***