Description of problem:
qpid-perftest segfaults when running over RDMA against qpidd --auth yes, without explicitly specifying a SASL mechanism (on either the client or server side), while testing as described in bz631969.

Program terminated with signal 11, Segmentation fault.
#0  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109
109         return encrypted || codec->canEncode();

(gdb) info threads
  17 Thread 18083  qpid::client::RdmaConnector::activateSecurityLayer (this=<value optimized out>, sl=<value optimized out>) at qpid/client/RdmaConnector.cpp:429
  16 Thread 18086  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  15 Thread 18089  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  14 Thread 18091  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 18092  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 18093  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  11 Thread 18099  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 18100  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
   9 Thread 18102  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   8 Thread 18103  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
   7 Thread 18105  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   6 Thread 18106  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
   5 Thread 18108  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   4 Thread 18109  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
   3 Thread 18111  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   2 Thread 18115  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
*  1 Thread 18112  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109

(gdb) thread apply 1 bt
Thread 1 (Thread 18112):
#0  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109
#1  0x00002b55fd138ff5 in qpid::client::RdmaConnector::writebuff (this=0x2aaab0013a80) at qpid/client/RdmaConnector.cpp:356
#2  0x00002b55fd3764ba in boost::function1<void, Rdma::AsynchIO&, std::allocator<boost::function_base> >::operator() (this=0x1, a0=...) at /usr/include/boost/function/function_template.hpp:576
#3  0x00002b55fd36df8d in Rdma::AsynchIO::doWriteCallback (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:448
#4  0x00002b55fd3738fb in Rdma::AsynchIO::writeEvent (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:325
#5  0x00002b55fd374060 in Rdma::AsynchIO::dataEvent (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:305
#6  0x00002b55fc81c83a in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=0x1, a0=...) at /usr/include/boost/function/function_template.hpp:576
#7  0x00002b55fc81bf3f in qpid::sys::DispatchHandle::processEvent (this=0x1fd6a7d0, type=READABLE) at qpid/sys/DispatchHandle.cpp:280
#8  0x00002b55fc74f764 in process (this=0x1bba16e0) at qpid/sys/Poller.h:125
#9  qpid::sys::Poller::run (this=0x1bba16e0) at qpid/sys/epoll/EpollPoller.cpp:519
#10 0x00002b55fc74598a in qpid::sys::(anonymous namespace)::runRunnable (p=0x0) at qpid/sys/posix/Thread.cpp:35
#11 0x0000003463a0673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003462ed3f6d in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
redhat-release-5Server-5.5.0.2 x86_64

# rpm -qa | grep qpid | sort -u
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-26.el5
qpid-cpp-client-devel-0.7.946106-26.el5
qpid-cpp-client-devel-docs-0.7.946106-26.el5
qpid-cpp-client-rdma-0.7.946106-26.el5
qpid-cpp-client-ssl-0.7.946106-26.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-cpp-server-0.7.946106-26.el5
qpid-cpp-server-cluster-0.7.946106-26.el5
qpid-cpp-server-devel-0.7.946106-26.el5
qpid-cpp-server-rdma-0.7.946106-26.el5
qpid-cpp-server-ssl-0.7.946106-26.el5
qpid-cpp-server-store-0.7.946106-26.el5
qpid-cpp-server-xml-0.7.946106-26.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-11.el5

# rpm -qa | grep sasl | sort -u
cyrus-sasl-2.1.22-5.el5_4.3
cyrus-sasl-devel-2.1.22-5.el5_4.3
cyrus-sasl-lib-2.1.22-5.el5_4.3
cyrus-sasl-md5-2.1.22-5.el5_4.3
cyrus-sasl-plain-2.1.22-5.el5_4.3
python-saslwrapper-0.1.934605-2.el5
ruby-saslwrapper-0.1.934605-2.el5
saslwrapper-0.1.934605-2.el5
saslwrapper-devel-0.1.934605-2.el5

How reproducible:
~20%

Steps to Reproduce:
# run qpidd
1. qpidd -d
# continuously run qpid-perftest
2. while [ ! -f core.* ]; do date; qpid-perftest --username guest --password guest -Prdma -b $(ip r l | grep ib[0-9] | cut -d" " -f 12) --qt 4 --count 10; done

Actual results:
qpid client segfaults

Expected results:
qpid-perftest runs and authenticates

Additional info:
When mech_list is specified in /etc/sasl2/qpidd.conf, or --mechanism is specified for qpid-perftest, no crash is observed (tested for more than one hour).
Can you verify that specifying --max-ssf 0 to qpid-perftest also resolves the issue?
(In reply to comment #1)
> Can you verify that specifying --max-ssf 0 to qpid-perftest also resolves the
> issue?

With --max-ssf 0 no segfault was observed.
So far I've only been able to reproduce this when running against a broker on the same machine as qpid-perftest (which isn't a likely deployment scenario)
This bug is also present in mrg_1.3_errata (qpid-cpp-mrg-0.7.946106-17), so it is not a recent regression.
(In reply to comment #3)
> So far I've only been able to reproduce this when running against a broker on
> the same machine as qpid-perftest (which isn't a likely deployment scenario)

Although for me it did take a good while (tens of minutes) to reproduce the bug even on the same machine. So it's possible that if I'd run for longer, I'd eventually have encountered the bug when running against a broker on a different machine.
Trying to replicate this using mrg_1.2.2 fails because it runs into a different bug first and crashes with a different stack trace. There were a number of bugs fixed in 1.3 that could account for this.
Testing with perftest and broker on the same system, I have managed to reproduce the problem reported above AS WELL AS ANOTHER PROBLEM, which seems likely to be related. After more than 133 iterations of qpid-perftest, I have a *broker-side* core, but in the same shared code that produced the client-side core already reported. It has a bad this-pointer in the CyrusSecurityLayer at level 10 in the stack:

#0  0x0000003f8d030265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003f8d031d10 in abort () at abort.c:88
#2  0x0000003f8d0296e6 in __assert_fail (assertion=0x2ad960c410aa "completionsNeeded.get() > 0", file=0x2ad960c40e58 "./qpid/broker/AsyncCompletion.h", line=167, function=0x2ad960c41440 "void qpid::broker::AsyncCompletion::end(qpid::broker::AsyncCompletion::Callback&)") at assert.c:78
#3  0x00002ad960be3935 in qpid::broker::AsyncCompletion::end (this=0x2aaab04a5660, cb=<value optimized out>) at ./qpid/broker/AsyncCompletion.h:167
#4  0x00002ad960be03f3 in qpid::broker::SessionState::handleContent (this=0x2aaab04da490, frame=..., id=<value optimized out>) at qpid/broker/SessionState.cpp:265
#5  0x00002ad960be0830 in qpid::broker::SessionState::handleIn (this=0x2aaab04da490, frame=...) at qpid/broker/SessionState.cpp:355
#6  0x00002ad96107a9ac in qpid::amqp_0_10::SessionHandler::handleIn (this=0x2aaab04d9270, f=...) at qpid/amqp_0_10/SessionHandler.cpp:91
#7  0x00002ad960b2722b in operator() (this=0x2aaab046d830, frame=...) at ./qpid/framing/Handler.h:42
#8  qpid::broker::Connection::received (this=0x2aaab046d830, frame=...) at qpid/broker/Connection.cpp:164
#9  0x00002ad960af9300 in qpid::amqp_0_10::Connection::decode (this=0x2aaab0548d30, buffer=<value optimized out>, size=<value optimized out>) at qpid/amqp_0_10/Connection.cpp:58
#10 0x00002ad9610c23d0 in qpid::sys::cyrus::CyrusSecurityLayer::encode (this=0x90, buffer=0xa4 <Address 0xa4 out of bounds>, size=164) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:77
For the above comment, here is my broker start script, and my perftest script:

============ start of qpidd script ==================
#! /bin/bash

export LD_LIBRARY_PATH=$TRUNK/qpid/cpp/src/.libs

QPID_SRC=$TRUNK/qpid/cpp/src
QPIDD=${QPID_SRC}/.libs/qpidd
echo $QPIDD

rm -rf /tmp/mick
mkdir /tmp/mick

$QPIDD \
  --no-module-dir \
  --load-module ${QPID_SRC}/.libs/rdma.so \
  --data-dir /tmp/mick/data_1 \
  --auth=yes \
  --mgmt-enable=yes \
  --log-enable info+ \
  --log-to-file /tmp/mick/qpidd_1.log \
  --log-source yes \
  --sasl-config=${QPID_SRC}/tests/sasl_config \
  -d

echo "started broker from $QPIDD"
============= end of script ============================

================= start of perftest script =========================
#! /bin/bash

rm core.*

count=0
while [ $count -lt 2000 ]
do
  echo "==========================================="
  echo "TEST $count"
  echo "==========================================="
  sleep 2

  core_files=`ls -l core.* | wc -l`
  echo "core files: ${core_files}"
  if [ ${core_files} -gt 0 ]; then
    echo "core files found!"
    exit 1
  else
    echo "no core files found."
  fi

  ./qpid-perftest --username zig --password zig --protocol rdma --broker 20.0.40.14 --qt 4 --count 10

  count=$(( $count + 1 ))
done
================= end of script ==================================
Fixed by rev 1087047.

The registration of the codec happens on a different thread from the use of the codec, so it is possible for the registration to occur after the first attempted use. In my testing this happened 3% of the time (165 times out of 5000 tests) when using the RDMA transport, and 0 times out of 5000 when using TCP, which is why we didn't notice it earlier.

We have a function that tells us when we are ready to encode, CyrusSecurityLayer::canEncode, but it does not check the validity of the codec pointer before using it, so it cores in this situation. I believe simply checking that pointer is probably the best solution. With that check introduced, the crash did not show up in 10,000 trials. There were also no hangs.
Re-posting this BZ. The fix was never back-ported to the package build.
I was able to reproduce the crash using the dash-7 source RPM, after manually fixing the small issue in broker/SessionState.cpp that broke the build. (Just fixing an unused variable -- no effect on code function.)

Using an 8-broker 8-client test -- similar to PPecka's testing strategy, except multiplied by 4 -- I was able to get 31 crashes in 32,000 tests. The core files were identical to the ones I was seeing before, when I thought I had fixed the problem. So I looked at the source, and my little fix was not there. I assume this is because I failed to check the fix into the proper branch, and only checked it into trunk.

After re-introducing that fix to the dash-7 code, I was able to run the same 8x8 test for a total of 80,000 iterations with zero failures. (The test was 4 brokers and 4 clients on mrg25, and 4 brokers and 4 clients on mrg26; each client talks to a single broker and is never on the same box with it.)
Also PLEASE NOTE: in all the testing I have done, which totaled close to 200,000 tests, I have never seen another *broker-side* core like the one I mentioned in comment #8 above. All cores have been client-side only, SEGV, and zero have appeared in any case where my small fix was present.

Also, I was never able to produce any core when the broker and client were on the same box. All successful (core-producing) tests involved clients talking to brokers on a different box.
Verified on RHEL5.6 x86_64, 60000 runs.

# rpm -qa | grep qpid
qpid-cpp-client-rdma-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-qmf-0.10-10.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
python-qpid-0.10-1.el5
qpid-java-example-0.10-6.el5
qpid-cpp-client-0.10-8.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-tools-0.10-6.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
rh-qpid-cpp-tests-0.10-8.el5
qpid-cpp-server-rdma-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-common-0.10-6.el5
qpid-cpp-mrg-debuginfo-0.10-8.el5

--> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html