Bug 589675

Summary: job_server: QMF init/connection failure from Qpid "Invalid URL"
Product: Red Hat Enterprise MRG Reporter: Pete MacKinnon <pmackinn>
Component: qpid-cppAssignee: Alan Conway <aconway>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: esammons, freznice, gsim, matt
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 560693    
Bug Blocks: 586348    
Attachments:
Description Flags
condor_job_server connection failure trace
none
condor_master plug-in successful connection none

Description Pete MacKinnon 2010-05-06 16:39:45 UTC
05/06 12:37:57 (fd:2) (pid:4002) === Current Probing Information ===
05/06 12:37:57 (fd:2) (pid:4002) fsize: 80		mtime: 1273150920
05/06 12:37:57 (fd:2) (pid:4002) first log entry: 2 CreationTimestamp 1273105563
05/06 12:37:57 (fd:2) (pid:4002) ERROR "Assertion ERROR on (hash)" at line 1001 in file attrlist.cpp
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Stack dump for process 4002 at timestamp 1273163887 (23 frames)
./condor_job_server(dprintf_dump_stack+0xc7)[0x80f9b63]
./condor_job_server[0x80f9d3a]
[0xe24400]
[0xe24422]
/lib/libc.so.6(gsignal+0x51)[0x7d9781]
/lib/libc.so.6(abort+0x17a)[0x7db04a]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x14f)[0x56df44f]
/usr/lib/libstdc++.so.6[0x56dd385]
/usr/lib/libstdc++.so.6[0x56dd3c2]
/usr/lib/libstdc++.so.6[0x56dd501]
/usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x7f)[0x5677d1f]
/usr/lib/libstdc++.so.6(_ZNSs4_Rep9_S_createEjjRKSaIcE+0x85)[0x56b8b95]
/usr/lib/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEj+0x38)[0x56b97e8]
/usr/lib/libstdc++.so.6(_ZNSs6assignERKSs+0x6f)[0x56ba57f]
/usr/lib/libqpidcommon.so.2(_ZN4qpid7framing18DeliveryProperties13setRoutingKeyERKSs+0x2c)[0x25534c]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread11sendMessageENS_6client7MessageERKSsS6_+0xc4)[0x712f84]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread10sendBufferERKSsS4_St3mapISsNS_5types7VariantESt4lessISsESaISt4pairIS3_S7_EEES4_S4_S4_+0x171)[0x713681]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl13sendHeartbeatEv+0x4ff)[0x713c6f]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl18periodicProcessingEv+0x7ba)[0x7165aa]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl13PublishThread3runEv+0x2a)[0x7170ca]
/usr/lib/libqpidcommon.so.2[0x21de31]
/lib/libpthread.so.0[0x1158f5]
/lib/libc.so.6(clone+0x5e)[0x88bfce]
Aborted

Comment 1 Pete MacKinnon 2010-05-06 19:26:03 UTC
(gdb) where
#0  AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType")
    at attrlist.cpp:1001
#1  0x08156d44 in AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType")
    at attrlist.cpp:994
#2  0x08156eda in AttrList::Lookup (this=0xbfffe5fc, attr=0x825a858)
    at attrlist.cpp:1037
#3  0x08156663 in AttrList::Insert (this=0xbfffe5fc, expr=0x8259e48, 
    check_for_dups=true) at attrlist.cpp:705
#4  0x08154ebe in AttrList::AttrList (this=0xbfffe5fc, file=0x825a5c8, 
    delimitor=0x81ae2e4 "***", isEOF=@0xbfffe728, error=@0xbfffe724, 
    empty=@0xbfffe720) at attrlist.cpp:222
#5  0x0815b3f8 in ClassAd::ClassAd (this=0xbfffe5fc, f=0x825a5c8, 
    d=0x81ae2e4 "***", i=@0xbfffe728, err=@0xbfffe724, empty=@0xbfffe720)
    at classad.cpp:102
#6  0x080d1679 in HistoryFile::poll (this=0x8217300) at HistoryFile.cpp:205
#7  0x080d4620 in ProcessCurrentHistory () at HistoryProcessingUtils.cpp:234
#8  0x080c63ae in ProcessHistoryTimer () at job_server_main.cpp:245
#9  0x080f7553 in TimerManager::Timeout (this=0x8217348)
    at timer_manager.cpp:397
#10 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900
#11 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281

Comment 2 Pete MacKinnon 2010-05-06 20:08:16 UTC
Culprit...bulk load?

(gdb) p hash
$19 = (HashTable<YourString, AttrListElem*> *) 0x0
(gdb) where
#0  AttrList::Lookup (this=0x8259a70, name=0x81bf388 "JobStatus")
    at attrlist.cpp:1001
#1  0x080c7cec in LiveJob::Get (this=0x8233ac0, _name=0x81bf388 "JobStatus", 
    _attribute=@0xbfffe79c) at Job.cpp:176
#2  0x080c8013 in LiveJob::GetStatus (this=0x8233ac0) at Job.cpp:236
#3  0x080cb90a in SubmissionObject::Decrement (this=0x0, job=0x8233ac0)
    at SubmissionObject.cpp:117
#4  0x080c782f in Job::DecrementSubmission (this=0x8233ac0) at Job.cpp:107
#5  0x080c7aa4 in LiveJob::~LiveJob (this=0x8233ac0, 
    __in_chrg=<value optimized out>) at Job.cpp:127
#6  0x080cd5be in JobServerJobLogConsumer::Reset (this=0x8258478)
    at JobServerJobLogConsumer.cpp:50
#7  0x0812c925 in JobLogReader::BulkLoad (this=0x8258994)
    at JobLogReader.cpp:97
#8  0x0812c8b7 in JobLogReader::Poll (this=0x8258994) at JobLogReader.cpp:75
#9  0x080cf1ee in JobLogMirror::TimerHandler_JobLogPolling (this=0x8258990)
    at JobLogMirror.cpp:79
#10 0x080f753a in TimerManager::Timeout (this=0x8217348)
    at timer_manager.cpp:394
#11 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900
#12 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281

Comment 3 Pete MacKinnon 2010-05-11 14:22:42 UTC
Fixed an order problem in LiveJob dtor. Delete the ad AFTER decrementing the submission, due to a lookup in the decrement. :-)

Comment 4 Pete MacKinnon 2010-05-21 21:07:30 UTC
Same symptom but different trace...

05/21 16:57:50 ******************************************************
05/21 16:57:50 ** condor_job_server (CONDOR_JOB_SERVER) STARTING UP
05/21 16:57:50 ** /usr/sbin/condor_job_server
05/21 16:57:50 ** SubsystemInfo: name=JOB_SERVER type=DAEMON(11) class=DAEMON(1)
05/21 16:57:50 ** Configuration: subsystem:JOB_SERVER local:<NONE> class:DAEMON
05/21 16:57:50 ** $CondorVersion: 7.4.3 May 19 2010 BuildID: RH-7.4.3-0.14.el5 PRE-RELEASE $
05/21 16:57:50 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
05/21 16:57:50 ** PID = 15135
05/21 16:57:50 ** Log last touched 5/21 15:57:50
05/21 16:57:50 ******************************************************
05/21 16:57:50 Using config source: /etc/condor/condor_config
05/21 16:57:50 Using local config sources:
05/21 16:57:50    /var/lib/condor/condor_config.local
05/21 16:57:50    /var/lib/condor/config/99configd.config
05/21 16:57:50    /var/lib/condor/config/ZQMF_BROKER_HOST
05/21 16:57:50    /var/lib/condor/config/qmf_condor_config
05/21 16:57:50    /var/lib/condor/config/zmaster
05/21 16:57:50    /var/lib/condor/config/zmatt
05/21 16:57:50    /var/lib/condor/config/zpmackinn
05/21 16:57:50 DaemonCore: Command Socket at <10.16.44.232:31118>
05/21 16:57:50 main_init() called
05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819'
05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry.
05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819'
05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry.
Stack dump for process 15135 at timestamp 1274475484 (15 frames)
condor_job_server(dprintf_dump_stack+0x4e)[0x4a0f9e]
condor_job_server[0x4a2d72]
/lib64/libpthread.so.0[0x320f80e930]
/lib64/libc.so.6(gsignal+0x35)[0x320ec30265]
/lib64/libc.so.6(abort+0x110)[0x320ec31d10]
/lib64/libc.so.6[0x320ec6a84b]
/lib64/libc.so.6[0x320ec723e5]
/lib64/libc.so.6(cfree+0x4b)[0x320ec7273b]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client12TCPConnectorD0Ev+0xff)[0x2b7bbc40329f]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client14ConnectionImplD0Ev+0xe8)[0x2b7bbc3b9248]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client10Connection4openERKNS0_18ConnectionSettingsE+0x569)[0x2b7bbc3acdc9]
/usr/lib64/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0x25b)[0x2b7bbc65051b]
/usr/lib64/libqpidcommon.so.2[0x2b7bbbfe458a]
/lib64/libpthread.so.0[0x320f806617]
/lib64/libc.so.6(clone+0x6d)[0x320ecd3c2d]

Comment 5 Pete MacKinnon 2010-05-22 13:14:21 UTC
Running condor_job_server -t -f tells me...

2010-05-22 09:01:54 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)
2010-05-22 09:01:56 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)
2010-05-22 09:02:00 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)


Don't see this in the plug-in daemons. ld issue?

Comment 6 Pete MacKinnon 2010-05-25 14:04:20 UTC
Think we need a reproducer with the sample or test C++ agent in a pkg we can try in the pool.

Comment 7 Pete MacKinnon 2010-05-26 15:31:26 UTC
condor_job_server failure trace...

gdb) where
#0  qpid::Url::parse (this=<value optimized out>, url=<value optimized out>)
    at qpid/Url.cpp:211
#1  0x00000031a9c5b10c in qpid::client::ConnectionHandler::openOk (
    this=<value optimized out>, knownBrokers=<value optimized out>)
    at ../include/qpid/Url.h:54
#2  0x00000031a9763352 in qpid::framing::AMQP_ClientOperations::ConnectionHandler::Invoker::visit (this=<value optimized out>, body=<value optimized out>)
    at qpid/framing/ConnectionOpenOkBody.h:63
#3  0x00000031a9c60244 in qpid::framing::invoke<qpid::framing::AMQP_ClientOperations::ConnectionHandler> (target=<value optimized out>, 
    body=<value optimized out>) at qpid/framing/Invoker.h:80
#4  0x00000031a9c5b907 in qpid::client::ConnectionHandler::incoming (
    this=<value optimized out>, frame=<value optimized out>)
    at qpid/client/ConnectionHandler.cpp:105
#5  0x00000031a9cb07bd in qpid::client::TCPConnector::decode (
    this=<value optimized out>, buffer=<value optimized out>, 
    size=<value optimized out>) at qpid/client/TCPConnector.cpp:299
#6  0x00000031a9cacea0 in qpid::client::TCPConnector::readbuff (
    this=<value optimized out>, aio=<value optimized out>, 
    buff=<value optimized out>) at qpid/client/TCPConnector.cpp:267
#7  0x00000031a9720fca in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator() (
    this=<value optimized out>, a0=<value optimized out>, 
    a1=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:576
#8  0x00000031a971e753 in qpid::sys::posix::AsynchIO::readable (
    this=<value optimized out>, h=<value optimized out>)
    at qpid/sys/posix/AsynchIO.cpp:418
#9  0x00000031a97f6067 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=<value optimized out>, 
    a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:576
#10 0x00000031a97f2637 in qpid::sys::DispatchHandle::processEvent (
    this=<value optimized out>, type=<value optimized out>)
    at qpid/sys/DispatchHandle.cpp:278
#11 0x00000031a972c9ff in qpid::sys::Poller::run (this=<value optimized out>)
    at qpid/sys/Poller.h:123
#12 0x00000031a972358a in runRunnable (p=<value optimized out>)
    at qpid/sys/posix/Thread.cpp:35
#13 0x000000320f806617 in start_thread () from /lib64/libpthread.so.0
#14 0x000000320ecd3c2d in clone () from /lib64/libc.so.6

Comment 8 Pete MacKinnon 2010-06-02 22:35:45 UTC
For some reason the Address object is not added to the URL vector. Seems like some kind of initialization gap that doesn't affect the plug-ins which are dlopened.

Comment 9 Pete MacKinnon 2010-06-03 16:20:07 UTC
Created attachment 419436 [details]
condor_job_server connection failure trace

Note the absence of the line found in the good trace...

2010-06-02 12:20:47 debug Created IO thread: 0

Comment 10 Pete MacKinnon 2010-06-03 16:21:26 UTC
Created attachment 419438 [details]
condor_master plug-in successful connection

Comment 11 Alan Conway 2010-06-04 15:34:03 UTC
Fix in trunk r951441, release repo
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=4b96352abe88aa513434d59812a6d44679d2acd8

Was not able to reproduce independently, fix was tested directly on the system that showed the problem.

Comment 12 Pete MacKinnon 2010-06-04 16:00:14 UTC
Fix looks good to me. No error on connection to broker by condor daemon.

Comment 14 Frantisek Reznicek 2010-06-16 13:51:18 UTC
The issue has been fixed, tested on RHEL 4.8 / 5.5, i386 / x86_64 on packages:

python-qmf-0.7.946106-3.el5
python-qpid-0.7.946106-1.el5
qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-3.el5
qpid-cpp-client-devel-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-3.el5
qpid-cpp-client-rdma-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-3.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-3.el5
qpid-cpp-server-cluster-0.7.946106-3.el5
qpid-cpp-server-devel-0.7.946106-3.el5
qpid-cpp-server-rdma-0.7.946106-3.el5
qpid-cpp-server-ssl-0.7.946106-3.el5
qpid-cpp-server-store-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-3.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rh-qpid-cpp-tests-0.7.946106-3.el5
ruby-qmf-0.7.946106-3.el5


-> VERIFIED