Upstream at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=746
During development or during deployment when trying to capture a rare event, it is useful to have the master send a SIGABRT instead of SIGKILL when a daemon stops responding. However, the daemon may be in a state where the SIGABRT does not cause the daemon to exit. In such a situation, the master should follow up the SIGABRT with a SIGKILL.
Description of problem:
We've set NOT_RESPONDING_WANT_CORE=True so that we can collect condor core
dumps for support purposes.
This parameter tells a parent process (e.g. condor_master) to send a kill -ABRT
instead of a kill -KILL to a hung child.
What I've discovered when conducting HA testing is that if you kill -STOP a
process, condor_master will recognize the process as hung and send it a kill
-ABRT exactly as documented. However, a stopped process does not act on the
ABRT (the signal stays pending until the process is continued), so the hung
process stays hung forever.
DaemonCore should escalate the signals it sends to a child daemon: if a
child doesn't exit after an ABRT, a timeout should expire and the child
should then be sent a KILL.
Steps to Reproduce:
Set NOT_RESPONDING_WANT_CORE=true. Set NOT_RESPONDING_TIMEOUT to something
more sane than 1 hour for your testing purposes. kill -STOP a process like
condor_negotiator. Wait for the NOT_RESPONDING_TIMEOUT duration. Monitor the
MasterLog to see condor_master say something like:
05/21 15:28:41 ERROR: Child pid 28198 appears hung! Killing it hard.
Note that the child process is *not* killed hard, and never will be.
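
When watching the MasterLog for this message, a small filter can pull out the affected pids. This is only a sketch: the pattern is inferred from the single log line quoted above, so other condor versions or log settings may format the line differently, and the function name is hypothetical.

```python
import re

# Pattern assumed from the one MasterLog line quoted in this report.
HUNG_CHILD = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"ERROR: Child pid (?P<pid>\d+) appears hung!"
)


def hung_child_pids(log_lines):
    """Return (timestamp, pid) pairs for every 'appears hung' line."""
    hits = []
    for line in log_lines:
        match = HUNG_CHILD.match(line)
        if match:
            hits.append((match.group("stamp"), int(match.group("pid"))))
    return hits
```

Each returned pid can then be checked by hand (for example with ps) to confirm that the process is still present after the master claims to have killed it.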
I'd expect a tunable called something like NOT_RESPONDING_CORE_TIMEOUT. With
NOT_RESPONDING_WANT_CORE=true, a hung process would first be sent an ABRT, and
if it has not exited after NOT_RESPONDING_CORE_TIMEOUT, it would be sent a KILL.
*** Bug 609692 has been marked as a duplicate of this bug. ***
Resolved in 7.5.5.
GT1688 pulled into V7_4-BZ596398-sigabrt-escalation-backport-branch, to be merged for condor post 7.4.4-0.17
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Cause: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE.
Consequence: If the child does not exit in response to SIGABRT, the child may never exit.
Fix: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
Result: The hung child will eventually be terminated by the condor_master.
The behavior described in #c5 has been tested and verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.