Bug 1210825

Summary: RFE: OSD Suicide Timeout should be configurable
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tupper Cole <tcole>
Component: RADOS
Assignee: Samuel Just <sjust>
Status: CLOSED ERRATA
QA Contact: Tamil <tmuthami>
Severity: medium
Priority: medium
Version: 1.2.3
CC: bhubbard, ceph-eng-bugs, dzafman, flucifre, hnallurv, icolle, jdang, kchai, kdreyer, nlevine, shmohan, sjust, vumrao
Target Milestone: rc
Target Release: 1.3.1
Keywords: FutureFeature
Hardware: x86_64
OS: Linux
Fixed In Version: ceph-0.94.2-3.el7cp
Doc Type: Enhancement
Type: Bug
Last Closed: 2015-11-23 20:20:46 UTC
Bug Blocks: 1319075

Description Tupper Cole 2015-04-10 15:56:56 UTC
Description of problem:
The OSD suicide timeout is currently hard-coded at 10x the thread timeout value (default 15 seconds, giving a 150-second suicide timeout). This allows an OSD to run in a degraded state for extended periods, slowing traffic across the cluster. This value should be configurable.
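
For context, the base per-thread timeouts (from which the hard-coded 10x suicide value is derived) can be inspected on a running OSD through its admin socket. A minimal sketch, assuming the ceph CLI is run on the node hosting osd.0:

# list the base thread timeouts on a running OSD via its admin socket
# (the suicide value itself is not a separate option at this point)
ceph daemon osd.0 config show | grep thread_timeout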

Version-Release number of selected component (if applicable):
1.2.3

How reproducible:
Very

Steps to Reproduce:
1. OSD gets into a degraded state.
2. Timeouts result, e.g.:
/var/log/ceph/ceph-osd.152.log.9.gz:2015-04-01 18:20:19.411110 7fa162d83700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fa162d83700' had timed out after 15
3. The filesystem responds before the suicide timeout is reached, resulting in degraded performance.

Actual results:
In this particular case: VM soft lockups, resulting in a client outage.

Expected results:
The suicide timeout should be configurable to ensure a poorly behaving OSD will die rather than slow traffic across the cluster.
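
For illustration, once configurable, an override might look like the following in ceph.conf; the option name matches the one that eventually shipped (see comment 13), and the 60-second value is only an example:

# append an example suicide-timeout override to ceph.conf on each node
# (value and file path are illustrative)
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd op thread suicide timeout = 60
EOF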

Additional info:

Comment 2 Samuel Just 2015-05-11 17:44:33 UTC
I've got a branch that does it as of Friday. I'm testing it now. It should merge to master some time this week, and a downstream backport will follow at some point later.

Comment 3 Brad Hubbard 2015-05-11 17:56:24 UTC
Thanks Samuel, 

Can we get a devel_ack then?

Comment 4 Samuel Just 2015-05-11 18:03:23 UTC
Yes, yes you can! (Sorry!)

Comment 9 Samuel Just 2015-07-13 18:39:44 UTC
Hammer backport in progress https://github.com/ceph/ceph/pull/5159.

Comment 10 Samuel Just 2015-07-16 21:59:40 UTC
Backported to hammer (see above pull request for sha1s).

This can be tested by picking a function in each of the relevant work queues and using gdb to cause it to hang.
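
A minimal sketch of such a test, assuming a single ceph-osd process on the node, installed debug symbols, and the work-queue function names listed in comment 13 below:

# attach to the OSD and stall its op worker thread at a breakpoint
# (pidof returns multiple pids if several OSDs run here; pick one instead)
gdb -p "$(pidof ceph-osd)" \
    -ex 'break OSD::ShardedOpWQ::_process' \
    -ex 'continue'
# after the breakpoint hits, keep the thread stopped for longer than
# osd_op_thread_suicide_timeout, then in gdb:
#   (gdb) delete breakpoints
#   (gdb) continue
# the suicide check fires once the stalled thread resumes, so the OSD
# should abort with SIGABRT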

Comment 13 shylesh 2015-10-28 12:20:23 UTC
Verification procedure
======================
The different suicide timeout parameters are:

$> grep suicide  config_opts.h
 
OPTION(osd_op_thread_suicide_timeout, OPT_INT, 150)
OPTION(osd_recovery_thread_suicide_timeout, OPT_INT, 300)
OPTION(osd_remove_thread_suicide_timeout, OPT_INT, 10*60*60)
OPTION(osd_command_thread_suicide_timeout, OPT_INT, 15*60)
OPTION(filestore_op_thread_suicide_timeout, OPT_INT, 180)
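
These values can also be injected into running OSDs without a restart, which is convenient while experimenting (a sketch; injected values are not persisted across daemon restarts):

# push a new suicide timeout into all running OSDs at runtime
ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 150'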
 

Functions which use these parameters
====================================


timeout param                        function                      triggering op
=============                        ========                      =============
osd_op_thread_suicide_timeout        OSD::ShardedOpWQ::_process    read/write on an object
osd_recovery_thread_suicide_timeout  OSD::do_recovery              cluster recovery
osd_remove_thread_suicide_timeout    OSD::RemoveWQ::_process       pool deletion
osd_command_thread_suicide_timeout   OSD::do_command               ceph pg query
filestore_op_thread_suicide_timeout  FileStore::_do_op             any I/O


Procedure
=========
1. Configure the timeout parameter in ceph.conf and push it out to all the nodes.
2. Restart all the daemons so that they pick up the newly configured value.
3. Attach gdb to one of the OSDs (make sure the debug symbol packages are installed).
4. Set breakpoints on the functions from the table above, based on the timeout params under test.
5. Perform the operations that call the respective function (example commands below).
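
For step 5, some example operations that exercise the respective queues (pool and object names are placeholders; the pg query needs a real pgid, e.g. from ceph pg dump):

# read/write on an object -> OSD::ShardedOpWQ::_process
rados -p rbd put testobj /etc/hosts
# pool deletion -> OSD::RemoveWQ::_process
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
# pg query -> OSD::do_command
ceph pg 0.1 query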

Results:
========
Case 1:
-------
I tried a timeout value of 15 for each of the params.

After the breakpoint is hit, wait for 20 seconds, then delete the breakpoint and continue: in this case the OSD died with SIGABRT.

Case 2:
-------
After the breakpoint is hit, wait for 10 seconds, then delete the breakpoints and continue: in this case the OSD did not die and continued the operations.

  
Sam,

If you can confirm that these steps are valid, I will mark this bug as verified.

Comment 14 Samuel Just 2015-11-02 16:31:54 UTC
Looks good to me.

Comment 16 errata-xmlrpc 2015-11-23 20:20:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2512

Comment 17 Siddharth Sharma 2015-11-23 21:54:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2066