Bug 707498 - condor_triggerd ignores SIGTERM and SIGQUIT signals on RHEL6/x86_64
Summary: condor_triggerd ignores SIGTERM and SIGQUIT signals on RHEL6/x86_64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0
Target Release: ---
Assignee: Robert Rati
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-05-25 09:12 UTC by Tomas Rusnak
Modified: 2011-06-27 15:33 UTC

Fixed In Version: qpid-qmf-0.10-7
Doc Type: Bug Fix
Doc Text:
N/A
Clone Of:
Environment:
Last Closed: 2011-06-27 15:33:31 UTC
Target Upstream Version:


Description Tomas Rusnak 2011-05-25 09:12:46 UTC
Description of problem:
The condor_triggerd daemon ignores the SIGTERM and SIGQUIT signals.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Configure the system to use triggerd.
2. Run service condor restart or service condor stop, or simply send SIGTERM or SIGQUIT to the triggerd daemon.
3. tail -f /var/log/condor/MasterLog
4. After a long time (depending on SHUTDOWN_FAST_TIMEOUT), the master kills triggerd with signal 9.
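
Steps 2 and 4 can also be checked by hand with a small shell helper. This is only a minimal sketch and not part of the original report: the function name signal_and_wait and the polling approach are assumptions, and the PID of the triggerd has to be looked up separately (for example from the MasterLog).

```shell
# Hypothetical helper (not from the report): send a signal to a PID and
# poll until the process is gone or the timeout (in seconds) runs out.
# Returns 0 if the process exited, 1 if it is still alive after the
# timeout, 2 if the signal could not be delivered.
signal_and_wait() {
    pid=$1
    sig=$2
    timeout=$3
    kill -s "$sig" "$pid" 2>/dev/null || return 2
    elapsed=0
    # kill -0 delivers no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Called as signal_and_wait PID TERM 30 with the PID of the triggerd, a return code of 1 would correspond to the behaviour reported here: the signal is delivered, but the process is still alive when the timeout expires.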
  
Actual results:
The daemon ignores the signals.

Expected results:
The daemon responds to the signals properly.

Additional info:

Config:

QMF_BROKER_HOST=localhost
ALL_DEBUG=D_FULLDEBUG
CONFIGD_ARGS = -d

ALLOW_WRITE = *
ALLOW_READ = *
ALLOW_NEGOTIATOR = *
ALLOW_ADMINISTRATOR_READ = *

QMF_PUBLISH_SUBMISSIONS = False
HISTORY = $(SPOOL)/history
JOB_SERVER = $(SBIN)/condor_job_server
JOB_SERVER_ARGS = -f
JOB_SERVER.JOB_SERVER_LOG = $(LOG)/JobServerLog
JOB_SERVER.JOB_SERVER_ADDRESS_FILE = $(LOG)/.job_server_address
JOB_SERVER.SCHEDD_NAME = schedd

STARTD_CRON_NAME = TRIGGER_DATA
STARTD_CRON_AUTOPUBLISH = If_Changed
TRIGGER_DATA_JOBLIST = GetData
TRIGGER_DATA_GETDATA_PREFIX = Triggerd
TRIGGER_DATA_GETDATA_EXECUTABLE = $(BIN)/get_trigger_data
TRIGGER_DATA_GETDATA_PERIOD = 5m
TRIGGER_DATA_GETDATA_RECONFIG = FALSE

DAEMON_LIST = $(DAEMON_LIST), JOB_SERVER, TRIGGERD
ENABLE_ABSENT_NODES_DETECTION=True
DC_DAEMON_LIST = $(DAEMON_LIST), JOB_SERVER, TRIGGERD

QMF_BROKER_AUTH_MECH = ANONYMOUS

MasterLog:
05/25/11 11:48:46 Timeout for fast shutdown has expired for TRIGGERD.
05/25/11 11:48:46 ProcAPI::buildFamily() Found daddypid on the system: 2436
05/25/11 11:48:46 Sent SIGKILL to TRIGGERD (pid 2436) and all its children.
05/25/11 11:48:46 DaemonCore: No more children processes to reap.
05/25/11 11:48:46 The TRIGGERD (pid 2436) died due to signal 9 (Killed)
05/25/11 11:48:46 ProcAPI::buildFamily failed: parent 2436 not found on system.
05/25/11 11:48:46 ProcAPI::getProcInfo() pid 2436 does not exist.
05/25/11 11:48:46 ProcAPI::getProcInfo() pid 2436 does not exist.
05/25/11 11:48:46 ProcAPI::getProcInfo() pid 2436 does not exist.
05/25/11 11:48:46 ProcAPI::getProcInfo() pid 2436 does not exist.
05/25/11 11:48:46 ProcAPI::getProcInfo() pid 2436 does not exist.
05/25/11 11:48:46 NumberOfChildren() returning 0
05/25/11 11:48:46 All daemons are gone.  Exiting.
05/25/11 11:48:46 MgmtMasterPlugin: shutting down...
05/25/11 11:48:47 **** condor_master (condor_MASTER) pid 2424 EXITING WITH STATUS 0
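
A MasterLog like the one above can be scanned for how the master saw a given daemon end. The following is a minimal sketch, assuming the two line formats that appear in the MasterLog excerpts of this report ("The NAME (pid N) died due to signal S" and "The NAME (pid N) exited with status S"); the function name daemon_outcome is hypothetical.

```shell
# Hypothetical helper (not from the report): print how the master logged
# the end of a daemon, based on the MasterLog line formats in this report.
# Usage: daemon_outcome LOGFILE DAEMON_NAME
daemon_outcome() {
    log=$1
    name=$2
    # sed -n suppresses normal output; the p flag prints only the lines
    # that one of the two substitutions rewrote.
    sed -n \
        -e "s/.*The $name (pid \([0-9]*\)) died due to signal \([0-9]*\).*/pid \1 killed by signal \2/p" \
        -e "s/.*The $name (pid \([0-9]*\)) exited with status \([0-9]*\).*/pid \1 exited with status \2/p" \
        "$log"
}
```

For the excerpt above, daemon_outcome /var/log/condor/MasterLog TRIGGERD would print "pid 2436 killed by signal 9", which indicates that the master had to escalate to SIGKILL.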

Comment 1 Tomas Rusnak 2011-05-25 12:52:43 UTC
Retrying with the corrected config gives the same results. The triggerd completely ignores these signals, and the TriggerLog stays untouched.

Parameter changed to:
DC_DAEMON_LIST =+ TRIGGERD, JOB_SERVER

Comment 3 Robert Rati 2011-05-25 17:07:37 UTC
Which packages on RHEL6 were you using when you found the issue? Does it still occur if you run with the latest qpid packages for RHEL6?

Comment 4 Tomas Rusnak 2011-05-26 09:42:14 UTC
Retested with the current packages:

# rpm -qa | egrep '(qpid|qmf|condor)' | sort
condor-7.6.1-0.6.el6.x86_64
condor-aviary-7.6.1-0.6.el6.x86_64
condor-classads-7.6.1-0.6.el6.x86_64
condor-debuginfo-7.6.1-0.6.el6.x86_64
condor-kbdd-7.6.1-0.6.el6.x86_64
condor-qmf-7.6.1-0.6.el6.x86_64
condor-vm-gahp-7.6.1-0.6.el6.x86_64
condor-wallaby-base-db-1.12-1.el6.noarch
condor-wallaby-client-4.0-6.el6.noarch
condor-wallaby-tools-4.0-6.el6.noarch
python-condorutils-1.5-3.el6.noarch
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-7.el6.x86_64
qpid-cpp-client-0.10-5.el6.x86_64
qpid-cpp-client-devel-0.10-5.el6.x86_64
qpid-cpp-client-devel-docs-0.10-5.el6.noarch
qpid-cpp-client-rdma-0.10-5.el6.x86_64
qpid-cpp-client-ssl-0.10-5.el6.x86_64
qpid-cpp-server-0.10-5.el6.x86_64
qpid-cpp-server-cluster-0.10-5.el6.x86_64
qpid-cpp-server-devel-0.10-5.el6.x86_64
qpid-cpp-server-rdma-0.10-5.el6.x86_64
qpid-cpp-server-ssl-0.10-5.el6.x86_64
qpid-cpp-server-store-0.10-5.el6.x86_64
qpid-cpp-server-xml-0.10-5.el6.x86_64
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-7.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-5.el6.x86_64
ruby-qpid-0.7.946106-2.el6.x86_64
ruby-qpid-qmf-0.10-7.el6.x86_64

MasterLog:
05/26/11 12:39:02 Got SIGQUIT.  Performing fast shutdown.
05/26/11 12:39:02 Trying to update collector <127.0.0.1:9618>
05/26/11 12:39:02 Attempting to send update via UDP to collector localhost.localdomain <127.0.0.1:9618>
05/26/11 12:39:02 NumberOfChildren() returning 6
05/26/11 12:39:02 Send_Signal(): Doing kill(8073,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to COLLECTOR (pid 8073)
05/26/11 12:39:02 Send_Signal(): Doing kill(8083,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to CONFIGD (pid 8083)
05/26/11 12:39:02 Send_Signal(): Doing kill(8079,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to JOB_SERVER (pid 8079)
05/26/11 12:39:02 Send_Signal(): Doing kill(8080,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to NEGOTIATOR (pid 8080)
05/26/11 12:39:02 Send_Signal(): Doing kill(8081,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to STARTD (pid 8081)
05/26/11 12:39:02 Send_Signal(): Doing kill(8082,3) [SIGQUIT]
05/26/11 12:39:02 Sent SIGQUIT to TRIGGERD (pid 8082)
05/26/11 12:39:02 DaemonCore: No more children processes to reap.
05/26/11 12:39:02 The NEGOTIATOR (pid 8080) exited with status 0
05/26/11 12:39:02 ProcAPI::buildFamily failed: parent 8080 not found on system.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8080 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8080 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8080 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8080 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8080 does not exist.
05/26/11 12:39:02 NumberOfChildren() returning 5
05/26/11 12:39:02 DaemonCore: No more children processes to reap.
05/26/11 12:39:02 The CONFIGD (pid 8083) exited with status 0
05/26/11 12:39:02 ProcAPI::buildFamily failed: parent 8083 not found on system.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8083 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8083 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8083 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8083 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8083 does not exist.
05/26/11 12:39:02 NumberOfChildren() returning 4
05/26/11 12:39:02 DaemonCore: No more children processes to reap.
05/26/11 12:39:02 The TRIGGERD (pid 8082) exited with status 0
05/26/11 12:39:02 ProcAPI::buildFamily failed: parent 8082 not found on system.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8082 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8082 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8082 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8082 does not exist.
05/26/11 12:39:02 ProcAPI::getProcInfo() pid 8082 does not exist.
05/26/11 12:39:02 NumberOfChildren() returning 3
05/26/11 12:39:03 DaemonCore: No more children processes to reap.

The error no longer occurs with the latest qpid packages.

>>> VERIFIED

Comment 5 Matthew Farrellee 2011-06-13 16:00:14 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
N/A

