Description of problem:
The OSD suicide timeout is currently hard-coded at 10X the thread timeout value (default 15 seconds). This allows an OSD to run in a degraded state for extended periods, slowing traffic across the cluster. This value should be configurable.

Version-Release number of selected component (if applicable):
1.2.3

How reproducible:
Very

Steps to Reproduce:
1. OSD gets into a degraded state
2. The thread timeout fires, e.g.:
   /var/log/ceph/ceph-osd.152.log.9.gz:2015-04-01 18:20:19.411110 7fa162d83700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fa162d83700' had timed out after 15
3. The filesystem responds before the suicide timeout, resulting in degraded performance.

Actual results:
In this particular case, VM soft lockups, resulting in a client outage.

Expected results:
The suicide timeout should be configurable to ensure a poorly behaving OSD will die rather than slow traffic.

Additional info:
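For reference, a minimal ceph.conf sketch of the relationship described above (the value is the default named in this report; the 10X multiplier is the hard-coded behaviour this bug asks to make configurable):

[osd]
# per-op thread heartbeat timeout, as seen in the log excerpt above (seconds)
osd op thread timeout = 15
# the suicide timeout is currently NOT configurable; it is hard-coded at
# 10x this value (i.e. 150 seconds), which is what this request is about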
I've got a branch that does it as of Friday. I'm testing it now. It should merge to master some time this week. Then there will be a downstream backport at some point later.
Thanks, Samuel. Can we get a devel_ack then?
Yes, yes you can! (Sorry!)
Hammer backport in progress https://github.com/ceph/ceph/pull/5159.
Backported to hammer (see the pull request above for the sha1s). This can be tested by picking a function in each of the relevant work queues and using gdb to cause it to hang.
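A minimal gdb sketch of that idea, assuming the ceph-osd debuginfo packages are installed and using one of the work-queue functions listed in the verification comment below (the pid is a placeholder):

# attach to the target OSD process
gdb -p <pid-of-ceph-osd>
(gdb) break OSD::ShardedOpWQ::_process
(gdb) continue
# when the breakpoint is hit, leave the process stopped for longer than the
# configured suicide timeout, then resume:
(gdb) delete breakpoints
(gdb) continue
# on resume, the heartbeat check should notice the stale timestamp and the
# OSD should abort (SIGABRT)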
Verification procedure
======================

The different suicide timeout parameters are:

$> grep suicide config_opts.h
OPTION(osd_op_thread_suicide_timeout, OPT_INT, 150)
OPTION(osd_recovery_thread_suicide_timeout, OPT_INT, 300)
OPTION(osd_remove_thread_suicide_timeout, OPT_INT, 10*60*60)
OPTION(osd_command_thread_suicide_timeout, OPT_INT, 15*60)
OPTION(filestore_op_thread_suicide_timeout, OPT_INT, 180)

Find the functions which use these parameters
=============================================

timeout param                         function                      triggering op
=============                         ========                      =============
osd_op_thread_suicide_timeout         OSD::ShardedOpWQ::_process    read/write on an object
osd_recovery_thread_suicide_timeout   OSD::do_recovery              cluster recovery
osd_remove_thread_suicide_timeout     OSD::RemoveWQ::_process       pool deletion
osd_command_thread_suicide_timeout    OSD::do_command               ceph pg query
filestore_op_thread_suicide_timeout   FileStore::_do_op             any I/O

Procedure
=========
1. Configure the timeout parameters in ceph.conf and distribute it across all the nodes (a sketch of the fragment used for case 1 is shown after this comment)
2. Restart all the OSD processes so that they pick up the newly configured values
3. Attach gdb to one of the OSDs (make sure the debug symbol packages are installed)
4. Set breakpoints on the functions from the table above, based on the timeout params being tested
5. Perform the operations that call the respective functions

Results
=======
Case 1:
-------
I tried with a timeout value of 15 for each of the params. After a breakpoint is hit, wait for 20 seconds, then delete the breakpoint and continue: in this case the OSD died with SIGABRT.

Case 2:
-------
After a breakpoint is hit, wait for 10 seconds, then delete the breakpoints and continue: in this case the OSD did not die and continued the operations.

Sam, if you can confirm that these steps are valid, I will mark this bug as verified.
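For reference, a sketch of the ceph.conf fragment used for case 1 above (the option names are the ones from config_opts.h; 15 is the test value, not a recommended setting):

[osd]
osd_op_thread_suicide_timeout = 15
osd_recovery_thread_suicide_timeout = 15
osd_remove_thread_suicide_timeout = 15
osd_command_thread_suicide_timeout = 15
filestore_op_thread_suicide_timeout = 15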
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2512
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2066