Bug 625450

Summary:	condor_master stays alive with QMF plugins when qpidd was stopped beforehand (RHEL4)
Product:	Red Hat Enterprise MRG	Reporter:	Tomas Rusnak <trusnak>
Component:	qpid-qmf	Assignee:	Matthew Farrellee <matt>
Status:	CLOSED ERRATA	QA Contact:	Tomas Rusnak <trusnak>
Severity:	high	Docs Contact:
Priority:	high
Version:	beta	CC:	freznice, gsim, matt, pmackinn, tross
Target Milestone:	1.3
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-10-20 11:30:36 UTC	Type:	---
Bug Depends On:
Bug Blocks:	534073, 596210

Description Tomas Rusnak 2010-08-19 13:38:46 UTC

Description of problem:

When condor is configured with the QMF plugins, the condor daemons stay alive after 'service condor stop' is called if qpidd was stopped beforehand.

Version-Release number of selected component (if applicable):

condor-7.4.4-0.9.el4
condor-debuginfo-7.4.4-0.9.el4
condor-kbdd-7.4.4-0.9.el4
condor-qmf-7.4.4-0.9.el4
condor-test-7.4.4-0.9.el4
python-qmf-0.7.946106-8.el4
python-qpid-0.7.946106-11.el4
qmf-0.7.946106-11.el4
qmf-devel-0.7.946106-11.el4
qpid-cpp-client-0.7.946106-11.el4
qpid-cpp-client-devel-0.7.946106-11.el4
qpid-cpp-client-devel-docs-0.7.946106-11.el4
qpid-cpp-client-ssl-0.7.946106-11.el4
qpid-cpp-mrg-debuginfo-0.7.946106-11.el4
qpid-cpp-server-0.7.946106-11.el4
qpid-cpp-server-devel-0.7.946106-11.el4
qpid-cpp-server-ssl-0.7.946106-11.el4
qpid-cpp-server-store-0.7.946106-11.el4
qpid-cpp-server-xml-0.7.946106-11.el4
qpid-java-client-0.7.946106-7.el4
qpid-java-common-0.7.946106-7.el4
qpid-tests-0.7.946106-1.el4
qpid-tools-0.7.946106-8.el4
rhm-docs-0.7.946106-4.el4


How reproducible:
always

Steps to Reproduce:
1. install RHEL4, MRG
2. configure qmf plugins
3. service qpidd start
4. service condor start
5. kill -STOP <pid of qpidd>  (suspends the broker without terminating it)
6. service condor stop
7. the following message is displayed:
     Stopping Condor daemons: [  OK  ]
     Warning: condor_master may not have exited, start/restart may fail
8. run ps ax | grep condor; the condor daemons are still alive
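
The key point of step 5 is that SIGSTOP suspends a process rather than terminating it, so its sockets stay open while it stops responding. A minimal sketch of that behaviour, assuming a POSIX system; the child process here is a stand-in for qpidd and the helper name suspend_and_probe is hypothetical:

```python
import os
import signal
import subprocess
import sys
import time


def suspend_and_probe():
    """Suspend a child with SIGSTOP and report whether it still exists."""
    # Stand-in for the broker: a child that would otherwise just sleep.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)  # same signal as `kill -STOP`
        time.sleep(0.1)
        # poll() returns None while the process has not exited; a stopped
        # process has not exited, it is merely not being scheduled.
        return child.poll() is None
    finally:
        child.kill()  # SIGKILL is still delivered to a stopped process
        child.wait()
```

Anything connected to such a process sees an open connection with no traffic, which is the situation the condor daemons are in during step 6.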
  
Actual results:
condor is alive after shutdown

Expected results:
condor exits after shutdown

Additional info:

Configuration used:
CREATE_CORE_FILES=True
ABORT_ON_EXCEPTION=True
MAX_HISTORY_LOG=300*1024*1024
MAX_HISTORY_ROTATIONS=10
QMF_BROKER_HOST=
MAX_COLLECTOR_LOG=100000000
MAX_STARTD_LOG=100000000
MAX_STARTER_LOG=100000000
MAX_SCHEDD_LOG=100000000
MAX_MASTER_LOG=100000000
MAX_NEGOTIATOR_LOG=100000000
SCHEDD.PLUGINS = $(LIB)/plugins/MgmtScheddPlugin-plugin.so
COLLECTOR.PLUGINS = $(LIB)/plugins/MgmtCollectorPlugin-plugin.so
NEGOTIATOR.PLUGINS = $(LIB)/plugins/MgmtNegotiatorPlugin-plugin.so
MASTER.PLUGINS = $(LIB)/plugins/MgmtMasterPlugin-plugin.so
QMF_BROKER_HOST=
CREATE_CORE_FILES = True
ALL_DEBUG = D_FULLDEBUG
QMF_DELETE_ON_SHUTDOWN=TRUE

Comment 1 Matthew Farrellee 2010-08-19 13:52:25 UTC

Possibly related to Bug 615321

Comment 2 Pete MacKinnon 2010-08-19 15:51:33 UTC

pstack from mrg42 (grid pool machine) exhibiting problem...

[root@mrg42 condor]# pstack 9921
Thread 4 (Thread -1208812640 (LWP 9924)):
#0  0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0058fef6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0x0029a0ed in qpid::client::StateManager::waitFor ()
#3  0x00238115 in qpid::client::ConnectionHandler::close ()
#4  0x00240fc9 in qpid::client::ConnectionImpl::close ()
#5  0x00230bdc in qpid::client::Connection::close ()
#6  0x0016143e in qpid::management::ManagementAgentImpl::ConnectionThread::run
#7  0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#8  0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#9  0x003f3f8e in clone () from /lib/tls/libc.so.6
Thread 3 (Thread -1219302496 (LWP 9925)):
#0  0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0058fef6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2  0x00291f10 in qpid::client::SessionImpl::waitForCompletionImpl ()
#3  0x00292081 in qpid::client::SessionImpl::waitForCompletion ()
#4  0x00270976 in qpid::client::Future::wait ()
#5  0x002305e5 in qpid::client::Completion::wait ()
#6  0x00225387 in qpid::client::no_keyword::Session_0_10::messageTransfer ()
#7  0x0015447d in qpid::management::ManagementAgentImpl::ConnectionThread::sendMessage () from /usr/lib/libqmf.so.2
#8  0x0015522f in qpid::management::ManagementAgentImpl::ConnectionThread::sendBuffer () from /usr/lib/libqmf.so.2
#9  0x0015e44d in qpid::management::ManagementAgentImpl::sendHeartbeat ()
#10 0x0015f788 in qpid::management::ManagementAgentImpl::periodicProcessing ()
#11 0x001608fa in qpid::management::ManagementAgentImpl::PublishThread::run ()
#12 0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#13 0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#14 0x003f3f8e in clone () from /lib/tls/libc.so.6
Thread 2 (Thread -1229792352 (LWP 9927)):
#0  0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x003f460e in epoll_wait () from /lib/tls/libc.so.6
#2  0x0097d4a0 in qpid::sys::Poller::wait () from /usr/lib/libqpidcommon.so.3
#3  0x0097e566 in qpid::sys::Poller::run () from /usr/lib/libqpidcommon.so.3
#4  0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#5  0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#6  0x003f3f8e in clone () from /lib/tls/libc.so.6

Comment 3 Matthew Farrellee 2010-08-19 15:52:14 UTC

Strike comment 1

Comment 4 Ted Ross 2010-08-19 18:45:53 UTC

This appears to be an issue with the Messaging client (c++, old style).  If a connection is closed when the broker has been suspended (i.e. TCP connection is open but the broker is non-responsive), the close call hangs waiting for a response from the broker.

Comment 5 Ted Ross 2010-08-19 18:46:28 UTC

Also, I can readily reproduce this symptom on Fedora 11.

Comment 6 Gordon Sim 2010-08-19 18:57:21 UTC

That is 'expected'. As with any communication to the broker the client will only detect a problem if it is notified that the tcp socket is disconnected or if it uses heartbeats. If it expects a response it will otherwise keep waiting for it.
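
To illustrate the behaviour described in comments 4 and 6, here is a minimal sketch using plain sockets rather than the qpid client. The function close_handshake and the "close"/"close-ok" byte strings are hypothetical stand-ins for the real protocol exchange; the silent end of the socket pair plays the role of a suspended broker.

```python
import socket


def close_handshake(sock, timeout=None):
    """Send a close request and wait for the peer's acknowledgement.

    Returns True if the peer acknowledged and False if nothing arrived
    within `timeout` seconds. With timeout=None the recv() call blocks
    until the peer answers or the socket is torn down.
    """
    sock.settimeout(timeout)
    sock.sendall(b"close")
    try:
        reply = sock.recv(16)
    except socket.timeout:
        return False
    return reply == b"close-ok"


def demo():
    # Two connected endpoints; `broker` never reads or writes, so the
    # connection is open but the peer is unresponsive.
    client, broker = socket.socketpair()
    try:
        return close_handshake(client, timeout=0.2)
    finally:
        client.close()
        broker.close()
```

With a finite timeout the caller gets control back and can give up on the peer; with timeout=None the same call would wait indefinitely, matching the close() frames in the pstack output.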

Comment 7 Gordon Sim 2010-08-19 19:04:03 UTC

I don't think this is a blocker as described. It is not a regression and is the 'expected' behaviour for an unresponsive broker where the tcp socket is not detected as disconnected.

Comment 8 Tomas Rusnak 2010-08-25 09:32:54 UTC

I do not understand why this can be 'expected'. If the user wants to stop the daemons, SIGTERM is sent, and all parts of the service must shut down.
Imagine an admin runs 'service condor stop' and condor keeps running; the only way to shut it down is then kill -9. I do not think that is an expected or recommended way to shut down daemons.

Comment 9 Matthew Farrellee 2010-08-31 15:11:32 UTC

With Bug 625541 MODI, it should be possible to fix this if the C++ QMF Agent enables heartbeats, via management::ConnectionSettings.
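
As a rough sketch of why heartbeats help here: a client that tracks when it last heard from the broker can declare the peer dead after a period of silence instead of waiting forever. The class below is hypothetical and is not the qpid implementation; in particular, the policy of tolerating two missed intervals is an assumption made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time any traffic was received from the peer."""

    def __init__(self, interval, missed_allowed=2, clock=time.monotonic):
        self.interval = interval              # seconds between heartbeats
        self.missed_allowed = missed_allowed  # assumed policy, see lead-in
        self.clock = clock                    # injectable for testing
        self.last_seen = clock()

    def on_traffic(self):
        # Call whenever a frame or heartbeat arrives from the peer.
        self.last_seen = self.clock()

    def peer_is_dead(self):
        silence = self.clock() - self.last_seen
        return silence > self.interval * self.missed_allowed
```

A shutdown path that checks peer_is_dead() before (or while) waiting for a close acknowledgement can abandon the connection once the broker has been silent for too long.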

Comment 11 Ted Ross 2010-09-07 12:55:49 UTC

In upstream commit revision 993339, the management agent library was changed so that the init function that does not take ConnectionSettings now sets a heartbeat interval itself.

Users of the other init function (with the ConnectionSettings argument) will need to set ConnectionSettings.heartbeat themselves.
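
The split between the two init paths can be pictured roughly as follows. This is a Python illustration only, not the C++ library API: the function names, the field layout, and the default of 10 seconds are all assumptions made for the example, and the actual interval chosen in the commit is not stated in this report.

```python
from dataclasses import dataclass

DEFAULT_HEARTBEAT_SECONDS = 10  # assumed value, not taken from the commit


@dataclass
class ConnectionSettings:
    host: str = "localhost"
    port: int = 5672       # standard AMQP port
    heartbeat: int = 0     # 0 means heartbeats are disabled in this sketch


def init_with_settings(settings):
    """Explicit path: the caller's settings are used untouched."""
    return settings


def init_simple(host, port):
    """Convenience path: builds the settings and enables heartbeats."""
    settings = ConnectionSettings(host, port, DEFAULT_HEARTBEAT_SECONDS)
    return init_with_settings(settings)
```

Callers on the explicit path keep whatever heartbeat value they passed in, so they only get the hang protection if they set the field themselves.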

Comment 12 Matthew Farrellee 2010-09-07 13:09:28 UTC

This sounds like it is now just a spec file change to require a specific qmf version. What version will that be?

Comment 13 Ted Ross 2010-09-07 14:04:33 UTC

It will be in builds after 946106-12.

Comment 14 Tomas Rusnak 2010-09-15 14:30:48 UTC

Retested with current packages on all supported platforms for RHEL4 (x86, x86_64) using:

qpid-tools-0.7.946106-10.el5
condor-wallaby-tools-3.6-1.el5
condor-qmf-7.4.4-0.13.el5
condor-7.4.4-0.13.el5
condor-wallaby-client-3.6-1.el5
qmf-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
python-condorutils-1.4-5.el5

No shutdown errors found. AMQP heartbeats are now enabled by default.

>>> VERIFIED