Bug 596398
Summary: | [RFE] NOT_RESPONDING_WANT_CORE=TRUE: follow SIGABRT with SIGKILL | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
Component: | condor | Assignee: | Matthew Farrellee <matt> |
Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 1.2 | CC: | ltoscano, tao |
Target Milestone: | 1.3.2 | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | condor-7.4.5-0.1 | Doc Type: | Enhancement |
Doc Text: |
Cause: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE.
Consequence: If the child does not exit in response to SIGABRT, the child may never exit.
Fix: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
Result: The hung child will eventually be terminated by the condor_master.
|
Last Closed: | 2011-02-15 12:16:33 UTC | Type: | --- |
Description
Matthew Farrellee
2010-05-26 17:38:14 UTC
Description of problem:
We've set NOT_RESPONDING_WANT_CORE=True based on wanting to collect condor core dumps for support purposes:
http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#SECTION00435000000000000000

This parameter tells a parent process (e.g. condor_master) to send a kill -ABRT instead of a kill -KILL to a hung child. What I've discovered when conducting HA testing is that if you kill -STOP a process, condor_master will recognize the process as hung and send it a kill -ABRT exactly as documented. However, a stopped process can't handle the ABRT, so the hung process stays hung and remains so forever.

DaemonCore should escalate the set of signals sent to a child daemon - if a child doesn't exit from an ABRT, there should be an associated timeout around that and then it should be sent a KILL.

How reproducible:
100%

Steps to Reproduce:
1. Set NOT_RESPONDING_WANT_CORE=true.
2. Set NOT_RESPONDING_TIMEOUT to something more sane than 1 hour for your testing purposes.
3. kill -STOP a process like condor_negotiator.
4. Wait for the NOT_RESPONDING_TIMEOUT duration.
5. Monitor the MasterLog to see condor_master say something like:
   05/21 15:28:41 ERROR: Child pid 28198 appears hung! Killing it hard.

Note that the child process is *not* killed hard, and never will be.

Actual results:

Expected results:
I'd expect a tunable called something like NOT_RESPONDING_CORE_TIMEOUT, and if NOT_RESPONDING_WANT_CORE=true a process is sent an ABRT and then after the duration of NOT_RESPONDING_CORE_TIMEOUT a KILL would be sent.

*** Bug 609692 has been marked as a duplicate of this bug. ***

Resolved in 7.5.5 - https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1688

GT1688 pulled into V7_4-BZ596398-sigabrt-escalation-backport-branch, to be merged for condor post 7.4.4-0.17

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE.
Consequence: If the child does not exit in response to SIGABRT, the child may never exit.
Fix: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
Result: The hung child will eventually be terminated by the condor_master.

The behavior described in #c5 has been tested and verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
condor-7.4.5-0.7

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0217.html