Bug 596398

Summary: [RFE] NOT_RESPONDING_WANT_CORE=TRUE: follow SIGABRT with SIGKILL
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Luigi Toscano <ltoscano>
Severity: medium
Docs Contact:
Priority: low
Version: 1.2
CC: ltoscano, tao
Target Milestone: 1.3.2
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: condor-7.4.5-0.1
Doc Type: Enhancement
Doc Text:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.
Last Closed: 2011-02-15 12:16:33 UTC

Description Matthew Farrellee 2010-05-26 17:38:14 UTC
Upstream at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=746

During development or during deployment when trying to capture a rare event, it is useful to have the master send a SIGABRT instead of SIGKILL when a daemon stops responding. However, the daemon may be in a state where the SIGABRT does not cause the daemon to exit. In such a situation, the master should follow up the SIGABRT with a SIGKILL.

Comment 1 Matthew Farrellee 2010-07-01 12:21:13 UTC
Description of problem:

We've set NOT_RESPONDING_WANT_CORE=True because we want to collect condor core
dumps for support purposes:

http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#SECTION00435000000000000000

This parameter tells a parent process (e.g. condor_master) to send a kill -ABRT
instead of a kill -KILL to a hung child.

What I've discovered when conducting HA testing is that if you kill -STOP a
process, condor_master will recognize the process as hung and send it a kill
-ABRT exactly as documented.  However, a stopped process can't handle the ABRT,
so the hung process stays hung and remains so forever.
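
The stopped-process behavior described above can be reproduced outside of condor. The following is a minimal sketch, assuming Linux and Python 3; the function name and the 0.5 second grace period are illustrative and are not part of condor. It stops a throwaway child, sends it SIGABRT, checks that the child is still present, and then sends SIGKILL.

```python
import os
import signal
import subprocess
import sys
import time


def abrt_then_kill_on_stopped_child(grace=0.5):
    """Return (alive_after_abrt, final_returncode) for a SIGSTOPped child.

    Hypothetical helper for illustration only; it does not touch condor.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # Wait until the kernel reports the child as stopped, so the
        # SIGABRT below cannot be delivered before the stop takes effect.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        if not os.WIFSTOPPED(status):
            raise RuntimeError("child did not stop")

        # SIGABRT stays pending while the child is stopped.
        child.send_signal(signal.SIGABRT)
        time.sleep(grace)
        alive_after_abrt = child.poll() is None

        # SIGKILL is acted on even though the child is stopped.
        child.kill()
        return alive_after_abrt, child.wait(timeout=5)
    finally:
        if child.poll() is None:
            child.kill()
            child.wait()


if __name__ == "__main__":
    print(abrt_then_kill_on_stopped_child())
```

On Linux the child is expected to survive the SIGABRT for as long as it remains stopped and to be reaped only after the SIGKILL.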

DaemonCore should escalate the set of signals sent to a child daemon - if a
child doesn't exit from an ABRT, there should be an associated timeout around
that and then it should be sent a KILL.
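
The requested escalation can be sketched as follows. This is only an illustration in Python, not DaemonCore code: the names terminate_with_escalation, spawn_child and abrt_timeout are hypothetical, and the child used here ignores SIGABRT to stand in for a daemon that does not exit.

```python
import signal
import subprocess
import sys


def terminate_with_escalation(child, abrt_timeout):
    """Send SIGABRT; if the child has not exited within abrt_timeout
    seconds, follow up with SIGKILL. Returns the child's return code."""
    child.send_signal(signal.SIGABRT)
    try:
        return child.wait(timeout=abrt_timeout)
    except subprocess.TimeoutExpired:
        child.kill()  # SIGKILL cannot be caught or ignored
        return child.wait()


def spawn_child(ignore_abrt):
    """Start a sleeping child process, optionally one that ignores SIGABRT."""
    setup = "signal.signal(signal.SIGABRT, signal.SIG_IGN);" if ignore_abrt else ""
    code = f"import signal, time; {setup} print('ready', flush=True); time.sleep(60)"
    child = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # block until the child has finished its setup
    return child


if __name__ == "__main__":
    stuck = spawn_child(ignore_abrt=True)
    print(terminate_with_escalation(stuck, abrt_timeout=1.0))
    stuck.stdout.close()
```

A child that honors SIGABRT exits on the first signal and can still leave a core file; only a child that outlives the timeout is sent SIGKILL.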

How reproducible:

100%

Steps to Reproduce:

Set NOT_RESPONDING_WANT_CORE=true.  Set NOT_RESPONDING_TIMEOUT to something
more sane than 1 hour for your testing purposes.  kill -STOP a process like
condor_negotiator.  Wait for the NOT_RESPONDING_TIMEOUT duration.  Monitor the
MasterLog to see condor_master say something like:

05/21 15:28:41 ERROR: Child pid 28198 appears hung! Killing it hard.

Note that the child process is *not* killed hard, and never will be.
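
When checking this by hand, a small script can pull the reported pids out of the MasterLog and test whether each one still exists. This is a sketch that assumes every relevant log line has the same shape as the single line quoted above; the function names are hypothetical.

```python
import os
import re

# Pattern derived from the one MasterLog line quoted in this report.
HUNG_CHILD = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR: "
    r"Child pid (?P<pid>\d+) appears hung!"
)


def hung_child_pids(lines):
    """Return (timestamp, pid) for each 'appears hung' line."""
    found = []
    for line in lines:
        match = HUNG_CHILD.match(line)
        if match:
            found.append((match["stamp"], int(match["pid"])))
    return found


def still_running(pid):
    """True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A pid that is still reported as running well after the log line was written indicates that the child was not actually killed. Pids can be reused, so the result is only a hint.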

Actual results:

Expected results:

I'd expect a tunable called something like NOT_RESPONDING_CORE_TIMEOUT: with
NOT_RESPONDING_WANT_CORE=true, a hung process would be sent an ABRT and, if it
is still running once NOT_RESPONDING_CORE_TIMEOUT has elapsed, a KILL.

Comment 2 Matthew Farrellee 2010-07-01 12:21:31 UTC
*** Bug 609692 has been marked as a duplicate of this bug. ***

Comment 3 Matthew Farrellee 2010-10-05 02:25:50 UTC
Resolved upstream in 7.5.5:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1688

Comment 4 Matthew Farrellee 2010-11-18 17:59:42 UTC
GT1688 pulled into V7_4-BZ596398-sigabrt-escalation-backport-branch, to be merged into condor builds after 7.4.4-0.17

Comment 5 Matthew Farrellee 2010-11-18 21:51:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.

Comment 8 Luigi Toscano 2011-02-01 16:02:12 UTC
The behavior described in #c5 has been tested and verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
condor-7.4.5-0.7

Comment 9 errata-xmlrpc 2011-02-15 12:16:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html