Bug 1210825
| Summary: | RFE: OSD Suicide Timeout should be configurable | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tupper Cole <tcole> |
| Component: | RADOS | Assignee: | Samuel Just <sjust> |
| Status: | CLOSED ERRATA | QA Contact: | Tamil <tmuthami> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 1.2.3 | CC: | bhubbard, ceph-eng-bugs, dzafman, flucifre, hnallurv, icolle, jdang, kchai, kdreyer, nlevine, shmohan, sjust, vumrao |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | 1.3.1 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-0.94.2-3.el7cp | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-11-23 20:20:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1319075 | ||
Description
Tupper Cole
2015-04-10 15:56:56 UTC

I've got a branch that does it as of Friday. I'm testing it now. It should merge to master some time this week. Then there will be a downstream thing at some point later.

Thanks Samuel. Can we get a devel_ack then?

Yes, yes you can! (Sorry!)

Hammer backport in progress: https://github.com/ceph/ceph/pull/5159

Backported to hammer (see the above pull request for sha1s).

This can be tested by picking a function in each of the relevant work queues and using gdb to cause it to hang.

Verification procedure
======================

The different suicide timeout parameters are:

$> grep suicide config_opts.h
OPTION(osd_op_thread_suicide_timeout, OPT_INT, 150)
OPTION(osd_recovery_thread_suicide_timeout, OPT_INT, 300)
OPTION(osd_remove_thread_suicide_timeout, OPT_INT, 10*60*60)
OPTION(osd_command_thread_suicide_timeout, OPT_INT, 15*60)
OPTION(filestore_op_thread_suicide_timeout, OPT_INT, 180)

Functions that use these parameters
===================================

timeout param                        function                    Triggering op
===================================  ==========================  ====================
osd_op_thread_suicide_timeout        OSD::ShardedOpWQ::_process  Read/write on object
osd_recovery_thread_suicide_timeout  OSD::do_recovery            cluster recovery
osd_remove_thread_suicide_timeout    OSD::RemoveWQ::_process     Pool deletion
osd_command_thread_suicide_timeout   OSD::do_command             ceph pg query
filestore_op_thread_suicide_timeout  FileStore::_do_op           any I/O

Procedure
=========

1. Configure the timeout parameter in ceph.conf and copy it to all the nodes.
2. Restart all the processes so that they pick up the newly configured value.
3. Attach gdb to one of the OSDs (make sure that the debug symbol packages are installed).
4. Set breakpoints on the functions from the table above, based on the timeout parameter under test.
5. Then run the operations that call the respective function.

Results
=======

Case 1
------

I set a timeout value of 15 for each of the params. After the breakpoint was hit, I waited 20 seconds, deleted the breakpoint, and continued. In this case the OSD died with SIGABRT.

Case 2
------

After the breakpoint was hit, I waited 10 seconds, deleted the breakpoints, and continued. In this case the OSD didn't die and continued the operations.

Sam, if you can confirm that these steps are valid then I will mark this bug as verified.

Looks good to me.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2512

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2066
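For anyone repeating the verification, the two result cases (a 20 second stall versus a 10 second stall against a timeout of 15) can be reasoned about with a small model. This is only a sketch and not Ceph's heartbeat code: the names `SuicideWatchdog` and `simulate_stall`, and the one-second check interval, are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SuicideWatchdog:
    """Toy model of a per-thread suicide timeout (hypothetical, not Ceph code)."""

    suicide_timeout: float  # seconds, standing in for the value set in ceph.conf
    last_reset: float = 0.0  # time at which the worker last reported progress

    def reset(self, now: float) -> None:
        # A worker thread calls this whenever it makes progress.
        self.last_reset = now

    def is_healthy(self, now: float) -> bool:
        # Healthy as long as the time since the last reset is within the timeout.
        return (now - self.last_reset) <= self.suicide_timeout


def simulate_stall(suicide_timeout: float, stall_seconds: float,
                   check_interval: float = 1.0) -> str:
    """Model a worker held at a gdb breakpoint for stall_seconds.

    Returns "SIGABRT" if a periodic check sees the timeout exceeded while the
    worker is stalled, otherwise "continued".
    """
    watchdog = SuicideWatchdog(suicide_timeout)
    now = 0.0
    watchdog.reset(now)  # worker enters the function and then stalls
    while now < stall_seconds:
        now += check_interval  # assumed periodic health check
        if not watchdog.is_healthy(now):
            return "SIGABRT"
    watchdog.reset(now)  # breakpoint deleted, worker makes progress again
    return "continued"


if __name__ == "__main__":
    print(simulate_stall(15, 20))  # Case 1: prints SIGABRT
    print(simulate_stall(15, 10))  # Case 2: prints continued
```

The model only captures the comparison between stall duration and the configured timeout; it says nothing about how the real OSD schedules its checks.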