Bug 596398 - [RFE] NOT_RESPONDING_WANT_CORE=TRUE: follow SIGABRT with SIGKILL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.3.2
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Luigi Toscano
URL:
Whiteboard:
Duplicates: 609692
Depends On:
Blocks:
Reported: 2010-05-26 17:38 UTC by Matthew Farrellee
Modified: 2018-11-14 19:11 UTC (History)

Fixed In Version: condor-7.4.5-0.1
Doc Type: Enhancement
Doc Text:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:16:33 UTC
Target Upstream Version:




Links
System: Red Hat Product Errata
ID: RHBA-2011:0217
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update
Last Updated: 2011-02-15 12:10:15 UTC

Description Matthew Farrellee 2010-05-26 17:38:14 UTC
Upstream at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=746

During development or during deployment when trying to capture a rare event, it is useful to have the master send a SIGABRT instead of SIGKILL when a daemon stops responding. However, the daemon may be in a state where the SIGABRT does not cause the daemon to exit. In such a situation, the master should follow up the SIGABRT with a SIGKILL.

Comment 1 Matthew Farrellee 2010-07-01 12:21:13 UTC
Description of problem:

We've set NOT_RESPONDING_WANT_CORE=True based on wanting to collect condor core
dumps for support purposes:

http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#SECTION00435000000000000000

This parameter tells a parent process (e.g. condor_master) to send a kill -ABRT
instead of a kill -KILL to a hung child.

What I've discovered when conducting HA testing is that if you kill -STOP a
process, condor_master will recognize the process as hung and send it a kill
-ABRT exactly as documented.  However, a stopped process does not act on the
ABRT (the signal stays pending until the process is continued), so the hung
process stays hung and remains so forever.
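
The stopped-process behavior can be reproduced outside Condor. The following is a minimal Python sketch, assuming a Linux host; the function name and the `grace` parameter are hypothetical, and the child is a plain sleeping Python process standing in for a daemon such as condor_negotiator. It stops the child, sends SIGABRT, confirms the child is still alive, and then shows that SIGKILL does take effect.

```python
import os
import signal
import subprocess
import sys
import time


def abrt_then_kill_on_stopped_child(grace=0.5):
    """Return (alive_after_abrt, exit_code_after_kill) for a SIGSTOP'd child.

    Hypothetical helper for illustration only; not part of Condor.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # Block until the kernel reports the child as stopped, so the
        # SIGABRT below cannot be delivered ahead of the stop.
        os.waitpid(child.pid, os.WUNTRACED)

        # SIGABRT stays pending while the child is stopped.
        os.kill(child.pid, signal.SIGABRT)
        time.sleep(grace)
        alive_after_abrt = child.poll() is None

        # SIGKILL terminates the child even though it is stopped.
        os.kill(child.pid, signal.SIGKILL)
        exit_code = child.wait(timeout=5)
        return alive_after_abrt, exit_code
    finally:
        # Make sure no child is left behind if something above failed.
        if child.poll() is None:
            child.kill()
            child.wait()


if __name__ == "__main__":
    print(abrt_then_kill_on_stopped_child())
```

On Linux the child should still be alive after the SIGABRT, and the exit code after the SIGKILL should be the negative of SIGKILL's signal number, which is how Python reports death by signal.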

DaemonCore should escalate the set of signals sent to a child daemon - if a
child doesn't exit from an ABRT, there should be an associated timeout around
that and then it should be sent a KILL.
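
The requested escalation could look roughly like the sketch below. It is written in Python for brevity and is not the DaemonCore implementation; `terminate_with_escalation` and `abrt_timeout` are hypothetical names, and the 600 second default mirrors the value given in this bug's Doc Text.

```python
import os
import signal
import subprocess
import sys


def terminate_with_escalation(proc, abrt_timeout=600.0):
    """Send SIGABRT, then SIGKILL if proc has not exited within abrt_timeout.

    Returns the name of the signal that was sent last. Illustration only;
    proc is a subprocess.Popen object.
    """
    proc.send_signal(signal.SIGABRT)
    try:
        # A healthy child aborts (and may dump core) within the timeout.
        proc.wait(timeout=abrt_timeout)
        return "SIGABRT"
    except subprocess.TimeoutExpired:
        # The child did not act on SIGABRT (e.g. it is stopped): escalate.
        proc.kill()
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    hung = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    os.kill(hung.pid, signal.SIGSTOP)
    os.waitpid(hung.pid, os.WUNTRACED)  # wait until it is really stopped
    print(terminate_with_escalation(hung, abrt_timeout=1.0))
```

A child that responds to SIGABRT is never sent SIGKILL, so the core dump is still collected in the normal case; only a child that outlives the timeout is killed hard.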

How reproducible:

100%

Steps to Reproduce:

Set NOT_RESPONDING_WANT_CORE=true.  Set NOT_RESPONDING_TIMEOUT to something
more sane than 1 hour for your testing purposes.  kill -STOP a process like
condor_negotiator.  Wait for the NOT_RESPONDING_TIMEOUT duration.  Monitor the
MasterLog to see condor_master say something like:

05/21 15:28:41 ERROR: Child pid 28198 appears hung! Killing it hard.

Note that the child process is *not* killed hard, and never will be.

Actual results:

Expected results:

I'd expect a tunable called something like NOT_RESPONDING_CORE_TIMEOUT: if
NOT_RESPONDING_WANT_CORE=true, a hung process is sent an ABRT and then, after
the duration of NOT_RESPONDING_CORE_TIMEOUT, a KILL.

Comment 2 Matthew Farrellee 2010-07-01 12:21:31 UTC
*** Bug 609692 has been marked as a duplicate of this bug. ***

Comment 3 Matthew Farrellee 2010-10-05 02:25:50 UTC
Resolved in 7.5.5 -

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1688

Comment 4 Matthew Farrellee 2010-11-18 17:59:42 UTC
GT1688 pulled into V7_4-BZ596398-sigabrt-escalation-backport-branch, to be merged for condor post 7.4.4-0.17

Comment 5 Matthew Farrellee 2010-11-18 21:51:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.

Comment 8 Luigi Toscano 2011-02-01 16:02:12 UTC
The behavior described in #c5 has been tested and verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
condor-7.4.5-0.7

Comment 9 errata-xmlrpc 2011-02-15 12:16:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

