Bug 625450
Summary: | condor_master stay alive with QMF plugins while qpidd was stopped before (RHEL4) | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Tomas Rusnak <trusnak> |
Component: | qpid-qmf | Assignee: | Matthew Farrellee <matt> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | beta | CC: | freznice, gsim, matt, pmackinn, tross |
Target Milestone: | 1.3 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-10-20 11:30:36 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 534073, 596210 |
Description
Tomas Rusnak
2010-08-19 13:38:46 UTC
Possibly related to Bug 615321

pstack from mrg42 (grid pool machine) exhibiting problem...

[root@mrg42 condor]# pstack 9921
Thread 4 (Thread -1208812640 (LWP 9924)):
#0 0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0058fef6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2 0x0029a0ed in qpid::client::StateManager::waitFor ()
#3 0x00238115 in qpid::client::ConnectionHandler::close ()
#4 0x00240fc9 in qpid::client::ConnectionImpl::close ()
#5 0x00230bdc in qpid::client::Connection::close ()
#6 0x0016143e in qpid::management::ManagementAgentImpl::ConnectionThread::run
#7 0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#8 0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#9 0x003f3f8e in clone () from /lib/tls/libc.so.6
Thread 3 (Thread -1219302496 (LWP 9925)):
#0 0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0058fef6 in pthread_cond_wait@@GLIBC_2.3.2 ()
#2 0x00291f10 in qpid::client::SessionImpl::waitForCompletionImpl ()
#3 0x00292081 in qpid::client::SessionImpl::waitForCompletion ()
#4 0x00270976 in qpid::client::Future::wait ()
#5 0x002305e5 in qpid::client::Completion::wait ()
#6 0x00225387 in qpid::client::no_keyword::Session_0_10::messageTransfer ()
#7 0x0015447d in qpid::management::ManagementAgentImpl::ConnectionThread::sendMessage () from /usr/lib/libqmf.so.2
#8 0x0015522f in qpid::management::ManagementAgentImpl::ConnectionThread::sendBuffer () from /usr/lib/libqmf.so.2
#9 0x0015e44d in qpid::management::ManagementAgentImpl::sendHeartbeat ()
#10 0x0015f788 in qpid::management::ManagementAgentImpl::periodicProcessing ()
#11 0x001608fa in qpid::management::ManagementAgentImpl::PublishThread::run ()
#12 0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#13 0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#14 0x003f3f8e in clone () from /lib/tls/libc.so.6
Thread 2 (Thread -1229792352 (LWP 9927)):
#0 0x0030e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x003f460e in epoll_wait () from /lib/tls/libc.so.6
#2 0x0097d4a0 in qpid::sys::Poller::wait () from /usr/lib/libqpidcommon.so.3
#3 0x0097e566 in qpid::sys::Poller::run () from /usr/lib/libqpidcommon.so.3
#4 0x0096ff11 in qpid::sys::(anonymous namespace)::runRunnable ()
#5 0x0058d5cc in start_thread () from /lib/tls/libpthread.so.0
#6 0x003f3f8e in clone () from /lib/tls/libc.so.6

This appears to be an issue with the Messaging client (c++, old style). If a connection is closed when the broker has been suspended (i.e. TCP connection is open but the broker is non-responsive), the close call hangs waiting for a response from the broker.

Also, I can readily reproduce this symptom on Fedora 11.

That is 'expected'. As with any communication to the broker, the client will only detect a problem if it is notified that the TCP socket is disconnected or if it uses heartbeats. If it expects a response it will otherwise keep waiting for it.

I don't think this is a blocker as described. It is not a regression and is the 'expected' behaviour for an unresponsive broker where the TCP socket is not detected as disconnected.

I do not understand why this can be 'expected'. If the user wants to stop the daemons, SIGTERM will be sent, and all parts of the service must be shut down. Imagine an admin runs 'service condor stop' and condor keeps running; the only way to shut it down is then kill -9. I think this is neither an expected nor a recommended way to shut down daemons.

With Bug 625541 MODI, it should be possible to fix this if the C++ QMF Agent enables heartbeats, via management::ConnectionSettings.

In upstream commit revision 993339, the management agent library was changed such that the init function that does not take ConnectionSettings sets a heartbeat interval. Users of the other init function (with the ConnectionSettings argument) will need to set ConnectionSettings.heartbeat themselves.

This sounds like it is now just a spec file change to require a specific qmf version. What version will that be?

It will be in builds after 946106-12.

Retested over current packages on all supported platforms for RHEL4 - x86,x86_64 using:
qpid-tools-0.7.946106-10.el5
condor-wallaby-tools-3.6-1.el5
condor-qmf-7.4.4-0.13.el5
condor-7.4.4-0.13.el5
condor-wallaby-client-3.6-1.el5
qmf-0.7.946106-14.el5
python-qmf-0.7.946106-13.el5
python-condorutils-1.4-5.el5
No shutdown errors found. Heartbeat in AMQP is now enabled by default.
>>> VERIFIED
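For readers following the heartbeat discussion above, here is a minimal sketch of why a heartbeat lets a close call give up instead of waiting forever on a suspended broker. It is not the qpid implementation: HeartbeatMonitor, CloseState and pollClose are hypothetical names invented for illustration, and the rule that the peer counts as lost after two silent heartbeat intervals is an assumption of this sketch.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper, not part of the qpid API: remembers when the broker
// was last heard from and decides whether it should be treated as lost.
struct HeartbeatMonitor {
    std::chrono::seconds interval;  // heartbeat interval configured by the client
    Clock::time_point lastTraffic;  // time the last frame arrived from the broker

    // Record that some frame (data or heartbeat) arrived at time `now`.
    void onTraffic(Clock::time_point now) { lastTraffic = now; }

    // Assumption of this sketch: the peer is considered lost once more than
    // two heartbeat intervals pass without any traffic.
    bool peerLost(Clock::time_point now) const {
        return now - lastTraffic > 2 * interval;
    }
};

enum class CloseState { Closed, Waiting, Aborted };

// One polling step of a close that cannot hang forever.
//   Closed  - the broker acknowledged the close.
//   Aborted - no acknowledgement and the heartbeat window has expired.
//   Waiting - no acknowledgement yet, but the broker may still be alive.
CloseState pollClose(const HeartbeatMonitor& monitor,
                     bool closeAcknowledged,
                     Clock::time_point now) {
    if (closeAcknowledged) {
        return CloseState::Closed;
    }
    return monitor.peerLost(now) ? CloseState::Aborted : CloseState::Waiting;
}
```

Without a heartbeat there is no equivalent of peerLost, so a caller in the Waiting state has nothing that ever moves it to Aborted, which matches the threads blocked in pthread_cond_wait in the pstack output.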