Bug 589675
Summary: | job_server: QMF init/connection failure from Qpid "Invalid URL" | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> | ||||||
Component: | qpid-cpp | Assignee: | Alan Conway <aconway> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | Development | CC: | esammons, freznice, gsim, matt | ||||||
Target Milestone: | 1.3 | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | Type: | --- | |||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | 560693 | ||||||||
Bug Blocks: | 586348 | ||||||||
Attachments: |
|
Description
Pete MacKinnon
2010-05-06 16:39:45 UTC
(gdb) where #0 AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType") at attrlist.cpp:1001 #1 0x08156d44 in AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType") at attrlist.cpp:994 #2 0x08156eda in AttrList::Lookup (this=0xbfffe5fc, attr=0x825a858) at attrlist.cpp:1037 #3 0x08156663 in AttrList::Insert (this=0xbfffe5fc, expr=0x8259e48, check_for_dups=true) at attrlist.cpp:705 #4 0x08154ebe in AttrList::AttrList (this=0xbfffe5fc, file=0x825a5c8, delimitor=0x81ae2e4 "***", isEOF=@0xbfffe728, error=@0xbfffe724, empty=@0xbfffe720) at attrlist.cpp:222 #5 0x0815b3f8 in ClassAd::ClassAd (this=0xbfffe5fc, f=0x825a5c8, d=0x81ae2e4 "***", i=@0xbfffe728, err=@0xbfffe724, empty=@0xbfffe720) at classad.cpp:102 #6 0x080d1679 in HistoryFile::poll (this=0x8217300) at HistoryFile.cpp:205 #7 0x080d4620 in ProcessCurrentHistory () at HistoryProcessingUtils.cpp:234 #8 0x080c63ae in ProcessHistoryTimer () at job_server_main.cpp:245 #9 0x080f7553 in TimerManager::Timeout (this=0x8217348) at timer_manager.cpp:397 #10 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900 #11 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281 Culprit...bulk load? (gdb) p hash $19 = (HashTable<YourString, AttrListElem*> *) 0x0 (gdb) where #0 AttrList::Lookup (this=0x8259a70, name=0x81bf388 "JobStatus") at attrlist.cpp:1001 #1 0x080c7cec in LiveJob::Get (this=0x8233ac0, _name=0x81bf388 "JobStatus", _attribute=@0xbfffe79c) at Job.cpp:176 #2 0x080c8013 in LiveJob::GetStatus (this=0x8233ac0) at Job.cpp:236 #3 0x080cb90a in SubmissionObject::Decrement (this=0x0, job=0x8233ac0) at SubmissionObject.cpp:117 #4 0x080c782f in Job::DecrementSubmission (this=0x8233ac0) at Job.cpp:107 #5 0x080c7aa4 in LiveJob::~LiveJob (this=0x8233ac0, __in_chrg=<value optimized out>) at Job.cpp:127 #6 0x080cd5be in JobServerJobLogConsumer::Reset (this=0x8258478) at JobServerJobLogConsumer.cpp:50 #7 0x0812c925 in JobLogReader::BulkLoad (this=0x8258994) at JobLogReader.cpp:97 #8 0x0812c8b7 in JobLogReader::Poll (this=0x8258994) at JobLogReader.cpp:75 #9 0x080cf1ee in JobLogMirror::TimerHandler_JobLogPolling (this=0x8258990) at JobLogMirror.cpp:79 #10 0x080f753a in TimerManager::Timeout (this=0x8217348) at timer_manager.cpp:394 #11 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900 #12 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281 Fixed an order problem in LiveJob dtor. Delete the ad AFTER decrementing the submission, due to a lookup in the decrement. :-) Same symptom but different trace... 05/21 16:57:50 ****************************************************** 05/21 16:57:50 ** condor_job_server (CONDOR_JOB_SERVER) STARTING UP 05/21 16:57:50 ** /usr/sbin/condor_job_server 05/21 16:57:50 ** SubsystemInfo: name=JOB_SERVER type=DAEMON(11) class=DAEMON(1) 05/21 16:57:50 ** Configuration: subsystem:JOB_SERVER local:<NONE> class:DAEMON 05/21 16:57:50 ** $CondorVersion: 7.4.3 May 19 2010 BuildID: RH-7.4.3-0.14.el5 PRE-RELEASE $ 05/21 16:57:50 ** $CondorPlatform: X86_64-LINUX_RHEL5 $ 05/21 16:57:50 ** PID = 15135 05/21 16:57:50 ** Log last touched 5/21 15:57:50 05/21 16:57:50 ****************************************************** 05/21 16:57:50 Using config source: /etc/condor/condor_config 05/21 16:57:50 Using local config sources: 05/21 16:57:50 /var/lib/condor/condor_config.local 05/21 16:57:50 /var/lib/condor/config/99configd.config 05/21 16:57:50 /var/lib/condor/config/ZQMF_BROKER_HOST 05/21 16:57:50 /var/lib/condor/config/qmf_condor_config 05/21 16:57:50 /var/lib/condor/config/zmaster 05/21 16:57:50 /var/lib/condor/config/zmatt 05/21 16:57:50 /var/lib/condor/config/zpmackinn 05/21 16:57:50 DaemonCore: Command Socket at <10.16.44.232:31118> 05/21 16:57:50 main_init() called 05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819' 05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry. 05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819' 05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry. Stack dump for process 15135 at timestamp 1274475484 (15 frames) condor_job_server(dprintf_dump_stack+0x4e)[0x4a0f9e] condor_job_server[0x4a2d72] /lib64/libpthread.so.0[0x320f80e930] /lib64/libc.so.6(gsignal+0x35)[0x320ec30265] /lib64/libc.so.6(abort+0x110)[0x320ec31d10] /lib64/libc.so.6[0x320ec6a84b] /lib64/libc.so.6[0x320ec723e5] /lib64/libc.so.6(cfree+0x4b)[0x320ec7273b] /usr/lib64/libqpidclient.so.2(_ZN4qpid6client12TCPConnectorD0Ev+0xff)[0x2b7bbc40329f] /usr/lib64/libqpidclient.so.2(_ZN4qpid6client14ConnectionImplD0Ev+0xe8)[0x2b7bbc3b9248] /usr/lib64/libqpidclient.so.2(_ZN4qpid6client10Connection4openERKNS0_18ConnectionSettingsE+0x569)[0x2b7bbc3acdc9] /usr/lib64/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0x25b)[0x2b7bbc65051b] /usr/lib64/libqpidcommon.so.2[0x2b7bbbfe458a] /lib64/libpthread.so.0[0x320f806617] /lib64/libc.so.6(clone+0x6d)[0x320ecd3c2d] Running condor_job_server -t -f tells me... 2010-05-22 09:01:54 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211) 2010-05-22 09:01:56 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211) 2010-05-22 09:02:00 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211) Don't see this in the plug-in daemons. ld issue? Think we need a reproducer with the sample or test C++ agent in a pkg we can try in the pool. condor_job_server failure trace... gdb) where #0 qpid::Url::parse (this=<value optimized out>, url=<value optimized out>) at qpid/Url.cpp:211 #1 0x00000031a9c5b10c in qpid::client::ConnectionHandler::openOk ( this=<value optimized out>, knownBrokers=<value optimized out>) at ../include/qpid/Url.h:54 #2 0x00000031a9763352 in qpid::framing::AMQP_ClientOperations::ConnectionHandler::Invoker::visit (this=<value optimized out>, body=<value optimized out>) at qpid/framing/ConnectionOpenOkBody.h:63 #3 0x00000031a9c60244 in qpid::framing::invoke<qpid::framing::AMQP_ClientOperations::ConnectionHandler> (target=<value optimized out>, body=<value optimized out>) at qpid/framing/Invoker.h:80 #4 0x00000031a9c5b907 in qpid::client::ConnectionHandler::incoming ( this=<value optimized out>, frame=<value optimized out>) at qpid/client/ConnectionHandler.cpp:105 #5 0x00000031a9cb07bd in qpid::client::TCPConnector::decode ( this=<value optimized out>, buffer=<value optimized out>, size=<value optimized out>) at qpid/client/TCPConnector.cpp:299 #6 0x00000031a9cacea0 in qpid::client::TCPConnector::readbuff ( this=<value optimized out>, aio=<value optimized out>, buff=<value optimized out>) at qpid/client/TCPConnector.cpp:267 #7 0x00000031a9720fca in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator() ( this=<value optimized out>, a0=<value optimized out>, a1=<value optimized out>) at /usr/include/boost/function/function_template.hpp:576 #8 0x00000031a971e753 in qpid::sys::posix::AsynchIO::readable ( this=<value optimized out>, h=<value optimized out>) at qpid/sys/posix/AsynchIO.cpp:418 #9 0x00000031a97f6067 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:576 #10 0x00000031a97f2637 in qpid::sys::DispatchHandle::processEvent ( this=<value optimized out>, type=<value optimized out>) at qpid/sys/DispatchHandle.cpp:278 #11 0x00000031a972c9ff in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/Poller.h:123 #12 0x00000031a972358a in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35 #13 0x000000320f806617 in start_thread () from /lib64/libpthread.so.0 #14 0x000000320ecd3c2d in clone () from /lib64/libc.so.6 For some reason the Address object is not added to the URL vector. Seems like some kind of initialization gap that doesn't affect the plug-ins which are dlopened. Created attachment 419436 [details]
condor_job_server connection failure trace
Note the absence of the line found in the good trace...
2010-06-02 12:20:47 debug Created IO thread: 0
Created attachment 419438 [details]
condor_master plug-in successful connection
Fix in trunk r951441, release repo http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=4b96352abe88aa513434d59812a6d44679d2acd8 Was not able to reproduce independently, fix was tested directly on the system that showed the problem. Fix looks good to me. No error on connection to broker by condor daemon. The issue has been fixed, tested on RHEL 4.8 / 5.5, i386 / x86_64 on packages: python-qmf-0.7.946106-3.el5 python-qpid-0.7.946106-1.el5 qmf-0.7.946106-3.el5 qmf-devel-0.7.946106-3.el5 qpid-cpp-client-0.7.946106-3.el5 qpid-cpp-client-devel-0.7.946106-3.el5 qpid-cpp-client-devel-docs-0.7.946106-3.el5 qpid-cpp-client-rdma-0.7.946106-3.el5 qpid-cpp-client-ssl-0.7.946106-3.el5 qpid-cpp-mrg-debuginfo-0.7.946106-3.el5 qpid-cpp-server-0.7.946106-3.el5 qpid-cpp-server-cluster-0.7.946106-3.el5 qpid-cpp-server-devel-0.7.946106-3.el5 qpid-cpp-server-rdma-0.7.946106-3.el5 qpid-cpp-server-ssl-0.7.946106-3.el5 qpid-cpp-server-store-0.7.946106-3.el5 qpid-cpp-server-xml-0.7.946106-3.el5 qpid-java-client-0.7.946106-3.el5 qpid-java-common-0.7.946106-3.el5 qpid-tools-0.7.946106-4.el5 rh-qpid-cpp-tests-0.7.946106-3.el5 ruby-qmf-0.7.946106-3.el5 -> VERIFIED |