Description of problem:
The OSD suicide timeout is currently hard-coded at 10X the thread timeout value (default 15 seconds). This allows an OSD to run in a degraded state for extended periods, slowing traffic across the cluster. This value should be configurable.

Version-Release number of selected component (if applicable):
1.2.3

How reproducible:
Very

Steps to Reproduce:
1. OSD gets into a degraded state
2. The thread timeout fires, e.g.:
   /var/log/ceph/ceph-osd.152.log.9.gz:2015-04-01 18:20:19.411110 7fa162d83700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fa162d83700' had timed out after 15
3. The filesystem responds before the suicide timeout, resulting in degraded performance.

Actual results:
In this particular case, VM soft lockups, resulting in a client outage.

Expected results:
The suicide timeout should be configurable to ensure a poorly behaving OSD will die rather than slow traffic.

Additional info:
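For reference, a minimal ceph.conf sketch of the relationship described above (the value is the default named in this report; the 10X multiplier is the hard-coded behaviour this bug asks to make configurable):

[osd]
# per-op thread heartbeat timeout, as seen in the log excerpt above (seconds)
osd op thread timeout = 15
# the suicide timeout is currently NOT configurable; it is hard-coded at
# 10x this value (i.e. 150 seconds), which is what this request is about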
I've got a branch that does it as of Friday. I'm testing it now. It should merge to master some time this week. Then there will be a downstream backport at some point later.
Thanks, Samuel. Can we get a devel_ack then?
Yes, yes you can! (Sorry!)
Hammer backport in progress https://github.com/ceph/ceph/pull/5159.
Backported to hammer (see the pull request above for the sha1s). This can be tested by picking a function in each of the relevant work queues and using gdb to cause it to hang.
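A minimal gdb sketch of that idea, assuming the ceph-osd debuginfo packages are installed and using one of the work-queue functions listed in the verification comment below (the pid is a placeholder):

# attach to the target OSD process
gdb -p <pid-of-ceph-osd>
(gdb) break OSD::ShardedOpWQ::_process
(gdb) continue
# when the breakpoint is hit, leave the process stopped for longer than the
# configured suicide timeout, then resume:
(gdb) delete breakpoints
(gdb) continue
# on resume, the heartbeat check should notice the stale timestamp and the
# OSD should abort (SIGABRT)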
Verification procedure
======================

The different suicide timeout parameters are:

$> grep suicide config_opts.h
OPTION(osd_op_thread_suicide_timeout, OPT_INT, 150)
OPTION(osd_recovery_thread_suicide_timeout, OPT_INT, 300)
OPTION(osd_remove_thread_suicide_timeout, OPT_INT, 10*60*60)
OPTION(osd_command_thread_suicide_timeout, OPT_INT, 15*60)
OPTION(filestore_op_thread_suicide_timeout, OPT_INT, 180)

Find the functions which use these parameters
=============================================

timeout param                         function                      triggering op
=============                         ========                      =============
osd_op_thread_suicide_timeout         OSD::ShardedOpWQ::_process    read/write on an object
osd_recovery_thread_suicide_timeout   OSD::do_recovery              cluster recovery
osd_remove_thread_suicide_timeout     OSD::RemoveWQ::_process       pool deletion
osd_command_thread_suicide_timeout    OSD::do_command               ceph pg query
filestore_op_thread_suicide_timeout   FileStore::_do_op             any I/O

Procedure
=========
1. Configure the timeout parameters in ceph.conf and distribute it across all the nodes (a sketch of the fragment used for case 1 is shown after this comment)
2. Restart all the OSD processes so that they pick up the newly configured values
3. Attach gdb to one of the OSDs (make sure the debug symbol packages are installed)
4. Set breakpoints on the functions from the table above, based on the timeout params being tested
5. Perform the operations that call the respective functions

Results
=======
Case 1:
-------
I tried with a timeout value of 15 for each of the params. After a breakpoint is hit, wait for 20 seconds, then delete the breakpoint and continue: in this case the OSD died with SIGABRT.

Case 2:
-------
After a breakpoint is hit, wait for 10 seconds, then delete the breakpoints and continue: in this case the OSD did not die and continued the operations.

Sam, if you can confirm that these steps are valid, I will mark this bug as verified.
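For reference, a sketch of the ceph.conf fragment used for case 1 above (the option names are the ones from config_opts.h; 15 is the test value, not a recommended setting):

[osd]
osd_op_thread_suicide_timeout = 15
osd_recovery_thread_suicide_timeout = 15
osd_remove_thread_suicide_timeout = 15
osd_command_thread_suicide_timeout = 15
filestore_op_thread_suicide_timeout = 15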
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2512
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2066