Bug 610773

Summary: condor_master alive after SIGQUIT from service command with QMF plugins, RHEL4 only
Product: Red Hat Enterprise MRG
Reporter: Tomas Rusnak <trusnak>
Component: condor
Assignee: grid-maint-list <grid-maint-list>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: high
Version: beta
CC: astitcher, gsim, kgiusti, matt
Target Milestone: 1.3
Keywords: Reopened
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-07-16 14:00:34 UTC
Bug Blocks: 534073, 578137, 596210    
Attachments:
condor configuration with qmf plugins
strace from condor_master
pstack from condor_master
strace of condor when a SIGQUIT is sent and hang results
annotated strace on failing rhel4 machine (mrg8)
annotated strace of successful shutdown on rhel5 machine

Description Tomas Rusnak 2010-07-02 12:16:59 UTC
Description of problem:
After "service condor restart" or "service condor stop" is called, the condor_master, condor_schedd and condor_procd still alive, when qmf plugins are configured. 

Version-Release number of selected component (if applicable):
RHEL4 - i386, x86_64

condor-7.4.4-0.4.el4
condor-debuginfo-7.4.4-0.4.el4
condor-qmf-7.4.4-0.4.el4
condor-test-7.4.4-0.4.el4
qmf-0.7.946106-6.el4
qmf-devel-0.7.946106-6.el4
classads-1.0.8-1.el4
gsoap-2.7.12-2.el4

How reproducible:
always


Steps to Reproduce:
1. configure condor with QMF plugins - sample config in attachment
2. start condor and qpidd
3. try to stop condor with "service condor stop" or "service condor restart"
  
Actual results:
Some condor daemons remain alive and never stop.

Expected results:
condor will stop as expected

Additional info:

Comment 1 Tomas Rusnak 2010-07-02 12:17:50 UTC
Created attachment 429045 [details]
condor configuration with qmf plugins

Comment 2 Tomas Rusnak 2010-07-02 12:18:28 UTC
Created attachment 429047 [details]
strace from condor_master

Comment 3 Tomas Rusnak 2010-07-02 12:20:14 UTC
Created attachment 429048 [details]
pstack from condor_master

Comment 4 Pete MacKinnon 2010-07-13 03:44:55 UTC
Re-assigning to Messaging. I can reproduce this on mrg8 (RHEL4U8); the process is stuck in epoll_wait. I tried reverting mrg1 git commit c3f1bef98288 using a local source build, but there was no change in the outcome.

There have been reports of bugs causing epoll_wait hangs in RHEL4, but everything looks up to date.

Comment 5 Gordon Sim 2010-07-13 09:19:10 UTC
This looks to me to be an issue with shutdown of the C++ agent/console API. The plugin is apparently deleting the agent as the mechanism for initiating shutdown. Either that is the wrong mechanism to use and something more explicit is provided, or that mechanism is broken. Ted, any thoughts?

Comment 6 Ken Giusti 2010-07-13 14:20:44 UTC
Gordon - agreed, I've started looking at this problem.

Comment 7 Ken Giusti 2010-07-14 14:01:45 UTC
Update:  I can repro this on a RHEL 4 VM.  

Gordon is correct - the plugin should be deleting the agent singleton on shutdown in order to clean up all qmf-related threads.  That is the correct procedure.

Each plugin will issue a debug log message on shutdown when it deletes the agent singleton object. This log message contains the text:

"<blah-blah>Plugin: shutting down..."

When condor shuts down properly, I see this log message appear in all qmf-related log files (Collector, Master, Schedd).   However, when I reproduce the hang, this log message never appears.

In addition, the QMF connection and publish threads are still active.  These threads run in a periodic loop that checks a "shutdown" boolean flag.  If shutdown is set, these threads will exit.   The shutdown flag is only set through the destructor of the agent singleton object.
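
To illustrate the mechanism (a minimal sketch, not the actual qmf source; the class and member names are hypothetical): the connection/publish threads run a periodic loop gated on a shutdown flag that only the agent destructor sets, so if the destructor never runs, the loop never exits.

// Minimal sketch of the shutdown handshake described above.
// "Agent", "workerLoop" and "publishUpdates" are hypothetical names.
#include <atomic>
#include <chrono>
#include <thread>

class Agent {
public:
    Agent() : worker_(&Agent::workerLoop, this) {}
    ~Agent() {
        shutdown_.store(true);   // the destructor is the only place the flag is set
        worker_.join();          // wait for the periodic loop to notice the flag and exit
    }
private:
    void workerLoop() {
        // Periodic loop run by the connection/publish thread.
        while (!shutdown_.load()) {
            publishUpdates();    // hypothetical unit of periodic work
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }
    void publishUpdates() {}
    std::atomic<bool> shutdown_{false};
    std::thread worker_;
};

int main() {
    {
        Agent agent;   // starts the worker thread
        std::this_thread::sleep_for(std::chrono::seconds(3));
    }   // ~Agent() sets the flag and joins; if the destructor never runs, the worker runs forever
    return 0;
}

If condor never reaches the point of deleting the agent singleton, nothing ever sets the flag, which matches the still-running connection and publish threads observed above.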

Furthermore - if I dump the /proc/xxx/status file of the condor_master, I see a SIGQUIT shared signal pending (this was not set prior to attempting to stop condor):

ShdPnd: 0000000000000004
....
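
For reference, ShdPnd is a hexadecimal bitmask of signals pending for the process as a whole, with bit (signum - 1) set for each pending signal; 0x4 therefore corresponds to SIGQUIT (signal 3). A minimal, standalone decoding example (illustration only, not condor code):

// Decode a ShdPnd value from /proc/<pid>/status and test for a pending SIGQUIT.
#include <csignal>
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t shdpnd = 0x4;   // value taken from the status output above
    const bool quit_pending = shdpnd & (1ULL << (SIGQUIT - 1));
    std::printf("SIGQUIT pending: %s\n", quit_pending ? "yes" : "no");
    return 0;
}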


Could the SIGQUIT handling on RHEL 4 not be aborting all threads via exit()? It appears as if the plugin shutdown code is not getting invoked.

Comment 8 Ken Giusti 2010-07-14 15:07:07 UTC
More "good" news:

The behavior of the bug is totally different when I repro this on mrg8.

Differences:

1) the "shutting down" message is seen in the logs, indicating that the agent singleton is being deleted properly

2) the pstack shows only the Poller thread as being present - there is no Connection or Publish qmf thread.

3) All "peer" processes - other than condor_master - are gone.  In my VM, all processes are present after failed shutdown

4) SigQuit is NOT in the pending mask of condor_master.

What appears to be happening here is that condor_master is cleaning up the agent singleton correctly, which shuts down the connection thread. This *should* cause the poller thread to exit, but for some reason it does not.

There may be two issues here - I'll try to solve the mrg8 issue first as that appears to be easier to reproduce.
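
For context on why a process in this state shows up parked in epoll_wait (Comment 4): pollers of this kind typically block in epoll_wait and rely on an explicit wakeup, such as a write to a self-pipe, to notice a shutdown request. The following is a generic sketch under that assumption, not qpid's actual Poller implementation:

// Generic epoll-based poller sketch (illustration only, not qpid source).
#include <sys/epoll.h>
#include <unistd.h>

struct Poller {
    int epfd = -1;
    int wake_pipe[2] = {-1, -1};   // self-pipe used to interrupt epoll_wait
    volatile bool stopping = false;

    bool init() {
        epfd = epoll_create(16);
        if (epfd < 0 || pipe(wake_pipe) < 0) return false;
        epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = wake_pipe[0];
        return epoll_ctl(epfd, EPOLL_CTL_ADD, wake_pipe[0], &ev) == 0;
    }

    void run() {
        epoll_event events[16];
        while (!stopping) {
            // Blocks indefinitely until an fd is ready or the wake pipe is written;
            // if the wakeup never arrives, the thread sits here forever.
            int n = epoll_wait(epfd, events, 16, -1);
            if (n < 0) continue;   // e.g. interrupted by a signal (EINTR)
            // ... dispatch ready fds, then loop and re-check 'stopping' ...
        }
    }

    void shutdown() {
        stopping = true;
        char c = 1;
        (void)write(wake_pipe[1], &c, 1);   // wake the thread blocked in epoll_wait
    }
};

int main() {
    Poller p;
    if (!p.init()) return 1;
    // In a real system run() executes on its own thread and shutdown() is called
    // from another thread; here shutdown() is called first so run() returns at once.
    p.shutdown();
    p.run();
    return 0;
}

Whether the actual hang is in the poller's wakeup path or, as the later comments show, in a pthread bug in the installed glibc, the visible symptom is the same: the thread never returns from its blocking wait.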

Comment 9 Ken Giusti 2010-07-15 15:19:50 UTC
Created attachment 432110 [details]
strace of condor when a SIGQUIT is sent and hang results.

Comment 10 Ken Giusti 2010-07-15 18:18:40 UTC
Created attachment 432188 [details]
annotated strace on failing rhel4 machine (mrg8)

Comment 11 Ken Giusti 2010-07-15 18:19:28 UTC
Created attachment 432189 [details]
annotated strace of successful shutdown on rhel5 machine

Comment 12 Ken Giusti 2010-07-15 18:21:38 UTC
See the two annotated strace attachments: one shows the behavior of the failure on a rhel4 machine, the other is an strace of the same code on rhel5, which does not show the failure.

I'm -assuming- the futex activity is due to the pthread library. If that is the case, I believe the failure strace shows that the problem is an infinite loop in the pthread library.

The failing machine has glibc-2.3.4-2.43 installed.

Comment 13 Ken Giusti 2010-07-15 19:24:01 UTC
There is a pthread issue with that version of the glibc library.

I have upgraded mrg8 to the latest glibc which contains a fix:

glibc-common-2.3.4-2.43.el4_8.3.x86_64.rpm
glibc-2.3.4-2.43.el4_8.3.x86_64.rpm

and the hang of condor_master on exit no longer occurs.   

I am going to try to reproduce the problem I have seen in Comment 7 using these latest libraries on mrg8.

Comment 14 Ken Giusti 2010-07-15 20:17:39 UTC
I've run "service condor start; sleep 30; service condor stop" for over an hour on mrg8 without issue now that the latest glibc libraries have been installed.


Several pthread-related fixes have gone into the glibc package above, and the bugs they address could have caused this problem. Upgrading to the latest glibc package via RHN should fix this issue.

Comment 15 Ken Giusti 2010-07-16 12:35:29 UTC
I'm reopening this bug, and reassigning it back to grid for analysis.

I have a clean RHEL4 x86_64 VM. I have configured condor as described above. I reboot my machine. Condor comes up fine.

If I then try to stop condor, it fails:


[kgiusti@localhost ~]$ sudo /sbin/service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
[kgiusti@localhost ~]$ ps -efw | grep condor
condor    5721     1  0 08:09 ?        00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor    5804  5721  0 08:09 ?        00:00:00 condor_collector -f
condor    6507  5721  0 08:09 ?        00:00:00 condor_negotiator -f
condor    6515  5721  0 08:09 ?        00:00:00 condor_schedd -f
condor    6519  5721  0 08:09 ?        00:00:04 condor_startd -f
root      6587  6515  0 08:09 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64


[kgiusti@localhost ~]$ sudo /sbin/service condor stop
Password:
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail
[kgiusti@localhost ~]$ ps -efw | grep condor
condor    5721     1  0 08:09 ?        00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor    5804  5721  0 08:09 ?        00:00:00 condor_collector -f
condor    6507  5721  0 08:09 ?        00:00:00 condor_negotiator -f
condor    6515  5721  0 08:09 ?        00:00:00 condor_schedd -f
condor    6519  5721  0 08:09 ?        00:00:04 condor_startd -f
root      6587  6515  0 08:09 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
kgiusti   7177  6877  0 08:20 pts/1    00:00:00 grep condor


[kgiusti@localhost ~]$ sudo kill -QUIT 5721
[kgiusti@localhost ~]$ ps -efw | grep condor
condor    5721     1  0 08:09 ?        00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
condor    5804  5721  0 08:09 ?        00:00:00 condor_collector -f
condor    6507  5721  0 08:09 ?        00:00:00 condor_negotiator -f
condor    6515  5721  0 08:09 ?        00:00:00 condor_schedd -f
condor    6519  5721  0 08:09 ?        00:00:04 condor_startd -f
root      6587  6515  0 08:09 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -S 60 -C 64
kgiusti   7183  6877  0 08:20 pts/1    00:00:00 grep condor
[kgiusti@localhost ~]$


*HOWEVER* - I do *not* believe this is related to the qmf/mrg8 failure for the following reasons:

1) condor has not gotten to the point where it attempts to shut down QMF. Note well that condor_master will shut down all other condor processes first, *then* shut down the QMF agent (roughly sketched at the end of this comment). All other condor processes are still active.

2) MOST IMPORTANT: there is -no- activity in the condor master log file indicating the SIGQUIT was seen!  In a normal shutdown, the logs would indicate the receipt of the SIGQUIT - I'm seeing nothing:


[kgiusti@localhost condor]$ sudo tail -f MasterLog
Password:
07/16 08:09:44 Received UDP command 60008 (DC_CHILDALIVE) from  <127.0.0.1:41916>, access level DAEMON
07/16 08:09:49 Received UDP command 60008 (DC_CHILDALIVE) from  <127.0.0.1:41916>, access level DAEMON
07/16 08:10:39 ProcAPI::buildFamily() Found daddypid on the system: 5804
07/16 08:10:41 ProcAPI::buildFamily() Found daddypid on the system: 6507
07/16 08:10:41 ProcAPI::buildFamily() Found daddypid on the system: 6515
07/16 08:10:41 ProcAPI::buildFamily() Found daddypid on the system: 6519
07/16 08:11:39 ProcAPI::buildFamily() Found daddypid on the system: 5804
07/16 08:11:41 ProcAPI::buildFamily() Found daddypid on the system: 6507
07/16 08:11:41 ProcAPI::buildFamily() Found daddypid on the system: 6515
07/16 08:11:41 ProcAPI::buildFamily() Found daddypid on the system: 6519

3) There is a pending SIGQUIT in the /proc/xxx/status mask for condor_master.  The status info does not suggest SIGQUIT is blocked, so I don't know why it is stuck pending.

I'm using condor*-7.4.4-0.4.el4 packages.
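
For reference, the master's shutdown ordering referred to in point 1 (as it appears in the MasterLog excerpt in Comment 16) is roughly the following. This is an illustrative sketch of the sequence, not condor source; the function names are hypothetical stubs.

// Rough sketch of condor_master's fast-shutdown sequence on SIGQUIT,
// reconstructed from the MasterLog excerpt in Comment 16.
#include <csignal>
#include <cstdio>
#include <cstdlib>

static void sendSignalToChildren(int sig) { std::printf("signal %d sent to children\n", sig); }
static void reapChildrenUntilNoneLeft()   { std::printf("all daemons are gone\n"); }
static void shutdownPlugins()             { std::printf("plugin shutdown (deletes the QMF agent)\n"); }

static void fastShutdown() {
    sendSignalToChildren(SIGQUIT);   // COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
    reapChildrenUntilNoneLeft();     // "All daemons are gone.  Exiting."
    shutdownPlugins();               // "MasterPluginManager::Shutdown" / "MgmtMasterPlugin: shutting down..."
    std::exit(0);                    // "EXITING WITH STATUS 0"
}

int main() { fastShutdown(); return 0; }

The point of the ordering is that the QMF agent is only torn down at the very end; in the failure above the master never even acts on the SIGQUIT, so none of these steps run.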

Comment 16 Ken Giusti 2010-07-16 12:46:17 UTC
quick update:

I have to kill -9 condor to shut it down.

Once I kill condor, I can "service condor start/stop" to my heart's content without a problem.

I've also restarted my VM and could not reproduce this problem. It does not reproduce 100% of the time.

FYI: here is the condor master log output when SIGQUIT is correctly received:

07/16 08:36:33 Time stamp of running /usr/sbin/condor_startd: 1277904555
07/16 08:36:33 GetTimeStamp returned: 1277904555
07/16 08:36:33 exit Daemons::CheckForNewExecutable


07/16 08:36:57 Got SIGQUIT.  Performing fast shutdown.
07/16 08:36:57 NumberOfChildren() returning 4
07/16 08:36:57 Send_Signal(): Doing kill(7313,3) [SIGQUIT]
07/16 08:36:57 Sent SIGQUIT to COLLECTOR (pid 7313)
07/16 08:36:57 Send_Signal(): Doing kill(7320,3) [SIGQUIT]
07/16 08:36:57 Sent SIGQUIT to NEGOTIATOR (pid 7320)
07/16 08:36:57 Send_Signal(): Doing kill(7321,3) [SIGQUIT]
07/16 08:36:57 Sent SIGQUIT to SCHEDD (pid 7321)
07/16 08:36:57 Send_Signal(): Doing kill(7322,3) [SIGQUIT]
07/16 08:36:57 Sent SIGQUIT to STARTD (pid 7322)
07/16 08:36:57 DaemonCore: No more children processes to reap.
07/16 08:36:57 The SCHEDD (pid 7321) exited with status 0
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::buildFamily failed: parent 7321 not found on system.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7321 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7321 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7321 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7321 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7321 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7323 does not exist.
07/16 08:36:57 NumberOfChildren() returning 3
07/16 08:36:57 DaemonCore: No more children processes to reap.
07/16 08:36:57 The COLLECTOR (pid 7313) exited with status 0
07/16 08:36:57 ProcAPI::buildFamily failed: parent 7313 not found on system.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7313 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7313 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7313 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7313 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7313 does not exist.
07/16 08:36:57 NumberOfChildren() returning 2
07/16 08:36:57 The STARTD (pid 7322) exited with status 0
07/16 08:36:57 ProcAPI::buildFamily failed: parent 7322 not found on system.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7322 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7322 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7322 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7322 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7322 does not exist.
07/16 08:36:57 NumberOfChildren() returning 1
07/16 08:36:57 DaemonCore: No more children processes to reap.
07/16 08:36:57 The NEGOTIATOR (pid 7320) exited with status 0
07/16 08:36:57 ProcAPI::buildFamily failed: parent 7320 not found on system.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7320 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7320 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7320 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7320 does not exist.
07/16 08:36:57 ProcAPI::getProcInfo() pid 7320 does not exist.
07/16 08:36:57 NumberOfChildren() returning 0
07/16 08:36:57 All daemons are gone.  Exiting.
07/16 08:36:57 MasterPluginManager::Shutdown
07/16 08:36:57 MgmtMasterPlugin: shutting down...
07/16 08:36:58 **** condor_master (condor_MASTER) pid 7308 EXITING WITH STATUS 0

Comment 17 Ken Giusti 2010-07-16 14:00:34 UTC
Closed per request from the grid team - opened a new bug dedicated to the shutdown problem that occurs with the latest glibc.

See https://bugzilla.redhat.com/show_bug.cgi?id=615321