Description of problem:
When trying to send a QMF method out argument with strings that grow to over 64K, the agent will throw an exception and crash.

Version-Release number of selected component (if applicable):
qmf-0.5.752600-5.fc10.i386
qpidc-0.5.752600-5.fc10.i386

How reproducible:
100%

Steps to Reproduce:
1. Start condor_job_server with a queue containing jobs with long attribute values, e.g. a big environment.
2. Use qpid-tool to call the server's "GetJob" method.
3. Watch the server crash.

Actual results:
terminate called after throwing an instance of 'qpid::framing::OutOfBounds'
  what():  Out of Bounds

Stack dump for process 22429 at timestamp 1247238103 (21 frames)
./condor_job_server(dprintf_dump_stack+0xd0)[0x80ff64c]
./condor_job_server[0x80ff80a]
[0x276400]
/lib/libc.so.6(abort+0x188)[0x71fe28]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x158)[0x7eb3c48]
/usr/lib/libstdc++.so.6[0x7eb1b35]
/usr/lib/libstdc++.so.6[0x7eb1b72]
/usr/lib/libstdc++.so.6[0x7eb1caa]
/usr/lib/libqpidcommon.so.0(_ZN4qpid7framing6Buffer10getRawDataERSsj+0xbb)[0xc8f83b]
/usr/lib/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread10sendBufferERNS_7framing6BufferEjRKSsS7_+0xf6)[0xeeca36]
/usr/lib/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl19invokeMethodRequestERNS_7framing6BufferEjSs+0x2c3)[0xef0fa3]
/usr/lib/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl13pollCallbacksEj+0x10c)[0xef16ec]
./condor_job_server(_Z16HandleMgmtSocketP7ServiceP6Stream+0x1c)[0x80c2353]
./condor_job_server(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x177)[0x80ec889]
./condor_job_server(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x34)[0x80ecb88]
./condor_job_server(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x29)[0x8141f35]
./condor_job_server(_ZN10DaemonCore17CallSocketHandlerERib+0x1b4)[0x80df028]
./condor_job_server(_ZN10DaemonCore6DriverEv+0x1959)[0x80e0a0b]
./condor_job_server(main+0x17d3)[0x80f6dc6]
/lib/libc.so.6(__libc_start_main+0xe5)[0x7096e5]
./condor_job_server[0x80bb2f1]
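The throw originates in qpid::framing::Buffer while the agent serializes the oversized out argument. A minimal Python sketch of the likely failure mode, assuming the wire format stores such strings with a 16-bit length prefix (the function name and the ValueError stand in for the actual qpid code and its OutOfBounds exception):

```python
import struct

def encode_short_string(buf: bytearray, s: bytes) -> None:
    """Append a string with a 16-bit length prefix to buf.

    Raises ValueError (analogous to qpid::framing::OutOfBounds) when the
    payload cannot be described by the 2-byte length field.
    """
    if len(s) > 0xFFFF:  # 65535 bytes is the most a uint16 can express
        raise ValueError("string exceeds 64K: out of bounds")
    buf += struct.pack("!H", len(s)) + s

buf = bytearray()
encode_short_string(buf, b"x" * 100)        # fits, encodes fine
try:
    encode_short_string(buf, b"y" * 70000)  # >64K, rejected
except ValueError as e:
    print("error:", e)
```

Any method whose out arguments exceed that 2-byte limit hits the same wall, which is consistent with the 64K threshold reported above.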
*** Bug 508145 has been marked as a duplicate of this bug. ***
Fix committed upstream at revision 929716.
May I ask you for more info, please? An example would be much appreciated. Raising NEEDINFO.
I ran into the bug when submitting jobs to a schedd and then querying for all of them. However, the broker has an echo method, so you may be able to reproduce it simply by sending >64K of data to that method.
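A sketch of that reproduction idea using the python-qmf console API of this era (the qmf.console module, addBroker/getObjects/echo names are assumed from memory of the 0.5–0.7 packages; treat this as illustrative, not a tested script):

```python
# Reproducer sketch: invoke the broker's "echo" method with a payload
# just past the 64K (65535-byte) limit, which is what crashed the agent.

def build_payload(size: int = 70000) -> str:
    """Build a body slightly larger than 64K."""
    return "x" * size

def reproduce(broker_url: str = "localhost") -> None:
    # Import inside the function so the payload helper works without
    # python-qmf installed; the import path is an assumption.
    from qmf.console import Session
    sess = Session()
    broker = sess.addBroker(broker_url)
    obj = sess.getObjects(_class="broker")[0]
    obj.echo(1, build_payload())  # out argument >64K triggers the crash
    sess.delBroker(broker)

if __name__ == "__main__":
    print(len(build_payload()))  # 70000, comfortably past 65535
```

Running reproduce() against an unpatched broker should make the agent terminate with the OutOfBounds abort shown in the original report.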
There are some uncertainties:
1. There is no condor_job_server in the 1.1.1 Grid release.
2. When I try to run the 1.3RC Grid against a 1.1.1 broker, I cannot access grid objects via qpid-tool (either 1.3RC or 1.1.1), and I think this may be caused by a QMF version mismatch.
How should I verify it?
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously,the Qpid Management Framework (QMF) method would exit with a segmentation fault when the result was larger than 10 MB.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -Previously,the Qpid Management Framework (QMF) method would exit with a segmentation fault when the result was larger than 10 MB.+Previously,the Qpid Management Framework (QMF) method would exit with a segmentation fault when the result was larger than 64kB. With this update, this method works as expected, even for larger results.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -Previously,the Qpid Management Framework (QMF) method would exit with a segmentation fault when the result was larger than 64kB. With this update, this method works as expected, even for larger results.+Previously, a QMF method would exit with a segmentation fault when the result was larger than 64kB. With this update, this method works as expected, even for larger results.
Created attachment 452111 [details]
Verification scripts

During a phone meeting I was told this bug does not have to be reproduced in order to verify the fix.

[user@host bz601828]$ ./runtest.sh
x86_64 redhat-release-5Server-5.5.0.2 condor-qmf-7.4.4-0.16.el5 python-qpid-0.7.946106-14.el5 python-qmf-0.7.946106-13.el5
Clean: .
Submit: ..Submitting job(s)............
12 job(s) submitted to cluster 1.
Verify: ...SUCCESS

I will have to finish it for RHEL4 and RHEL5 i386 tomorrow.
Cleaning NEEDINFO.
x86_64 redhat-release-4AS-9 condor-qmf-7.4.4-0.16.el4 python-qpid-0.7.946106-14.el4 python-qmf-0.7.946106-13.el4
Clean: .
Submit: ..Submitting job(s)............
12 job(s) submitted to cluster 1.
Verify: SUCCESS
i686 redhat-release-4AS-9 condor-qmf-7.4.4-0.16.el4 python-qpid-0.7.946106-14.el4 python-qmf-0.7.946106-13.el4 qpid-cpp-server-0.7.946106-17.el4
Clean: .
Submit: ..Submitting job(s)............
12 job(s) submitted to cluster 1.
Verify: SUCCESS
It is vital to set auth=no on the broker for the Condor QMF agents to appear with all the latest qpid-cpp-server-0.7.946106-17 builds. The test should run as an unprivileged user with sudo NOPASSWD rights; see the sudoers(5) manual page for more info.
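For reference, the auth=no setting above goes in the broker's configuration file (the /etc/qpidd.conf path is the usual default for these builds, but verify it on your system):

```ini
# /etc/qpidd.conf -- disable SASL authentication so the Condor QMF
# agents appear in qpid-tool (only appropriate on a test machine)
auth=no
```

Restart the qpidd service after changing the file so the option takes effect.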
Created attachment 452296 [details]
Updated verification scripts

$ ./runtest.sh 100
i686 redhat-release-5Server-5.5.0.2 qpid-cpp-server-0.7.946106-17.el5 condor-qmf-7.4.4-0.16.el5 python-qpid-0.7.946106-14.el5 python-qmf-0.7.946106-13.el5
Clean: .
Submit: ..Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 1.
Verify: SUCCESS
Verified on all supported architectures and RHEL versions.
Created attachment 452353 [details]
Fixed verification scripts

Sorry, the test was deleting /var/lib/qpidd, which contains the SASL database. Here are the updated scripts. No need for auth=no anymore.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html