Bug 671369 - RDMA client can segfault when no SASL mechanism specified
Summary: RDMA client can segfault when no SASL mechanism specified
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0
Assignee: mick
QA Contact: ppecka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-21 12:52 UTC by ppecka
Modified: 2011-08-12 16:21 UTC (History)
4 users

Fixed In Version: qpid-cpp-mrg-0.10-8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-23 15:45:59 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0890 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 2.0 Release 2011-06-23 15:42:41 UTC

Description ppecka 2011-01-21 12:52:43 UTC
Description of problem:
qpid-perftest segfaults when running over RDMA against qpidd --auth yes without an explicitly specified SASL mechanism (on either the client or the server side).

While testing as described in bz631969


Program terminated with signal 11, Segmentation fault.
#0  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109
109         return encrypted || codec->canEncode();
(gdb) info threads
  17 Thread 18083  qpid::client::RdmaConnector::activateSecurityLayer (this=<value optimized out>, sl=<value optimized out>) at qpid/client/RdmaConnector.cpp:429
  16 Thread 18086  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  15 Thread 18089  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  14 Thread 18091  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 18092  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 18093  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  11 Thread 18099  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 18100  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  9 Thread 18102  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 18103  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  7 Thread 18105  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 18106  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  5 Thread 18108  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4 Thread 18109  0x0000003462ed4358 in epoll_wait () from /lib64/libc.so.6
  3 Thread 18111  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 18115  0x0000003463a0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 18112  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109
(gdb) thread apply 1 bt

Thread 1 (Thread 18112):
#0  0x00002b55fc82bca9 in qpid::sys::cyrus::CyrusSecurityLayer::canEncode (this=<value optimized out>) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:109
#1  0x00002b55fd138ff5 in qpid::client::RdmaConnector::writebuff (this=0x2aaab0013a80) at qpid/client/RdmaConnector.cpp:356
#2  0x00002b55fd3764ba in boost::function1<void, Rdma::AsynchIO&, std::allocator<boost::function_base> >::operator() (this=0x1, a0=...) at /usr/include/boost/function/function_template.hpp:576
#3  0x00002b55fd36df8d in Rdma::AsynchIO::doWriteCallback (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:448
#4  0x00002b55fd3738fb in Rdma::AsynchIO::writeEvent (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:325
#5  0x00002b55fd374060 in Rdma::AsynchIO::dataEvent (this=0x1fd6a6d0) at qpid/sys/rdma/RdmaIO.cpp:305
#6  0x00002b55fc81c83a in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=0x1, a0=...)
    at /usr/include/boost/function/function_template.hpp:576
#7  0x00002b55fc81bf3f in qpid::sys::DispatchHandle::processEvent (this=0x1fd6a7d0, type=READABLE) at qpid/sys/DispatchHandle.cpp:280
#8  0x00002b55fc74f764 in process (this=0x1bba16e0) at qpid/sys/Poller.h:125
#9  qpid::sys::Poller::run (this=0x1bba16e0) at qpid/sys/epoll/EpollPoller.cpp:519
#10 0x00002b55fc74598a in qpid::sys::(anonymous namespace)::runRunnable (p=0x0) at qpid/sys/posix/Thread.cpp:35
#11 0x0000003463a0673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003462ed3f6d in clone () from /lib64/libc.so.6


Version-Release number of selected component (if applicable):

redhat-release-5Server-5.5.0.2 x86_64

# rpm -qa  | grep qpid | sort -u
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-26.el5
qpid-cpp-client-devel-0.7.946106-26.el5
qpid-cpp-client-devel-docs-0.7.946106-26.el5
qpid-cpp-client-rdma-0.7.946106-26.el5
qpid-cpp-client-ssl-0.7.946106-26.el5
qpid-cpp-mrg-debuginfo-0.7.946106-26.el5
qpid-cpp-server-0.7.946106-26.el5
qpid-cpp-server-cluster-0.7.946106-26.el5
qpid-cpp-server-devel-0.7.946106-26.el5
qpid-cpp-server-rdma-0.7.946106-26.el5
qpid-cpp-server-ssl-0.7.946106-26.el5
qpid-cpp-server-store-0.7.946106-26.el5
qpid-cpp-server-xml-0.7.946106-26.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-11.el5

# rpm -qa | grep sasl | sort -u
cyrus-sasl-2.1.22-5.el5_4.3
cyrus-sasl-devel-2.1.22-5.el5_4.3
cyrus-sasl-lib-2.1.22-5.el5_4.3
cyrus-sasl-md5-2.1.22-5.el5_4.3
cyrus-sasl-plain-2.1.22-5.el5_4.3
python-saslwrapper-0.1.934605-2.el5
ruby-saslwrapper-0.1.934605-2.el5
saslwrapper-0.1.934605-2.el5
saslwrapper-devel-0.1.934605-2.el5


How reproducible:
~20%

Steps to Reproduce:
#run qpidd
1.qpidd -d

#continuously run qpid-perftest
2.while [ ! -f core.* ]; do date; qpid-perftest --username guest --password guest -Prdma -b $(ip r l | grep ib[0-9] | cut -d" " -f 12) --qt 4 --count 10; done

Actual results:
qpid client segfaults

Expected results:
qpid-perftest running and authenticated

Additional info:
When mech_list is specified in /etc/sasl2/qpidd.conf, or --mechanism is specified for qpid-perftest, no crash is observed (tested for more than one hour).
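For reference, a minimal sketch of such a mech_list entry; the exact mechanism set is an assumption and should match what the deployment actually supports:

```
# /etc/sasl2/qpidd.conf (sketch -- restrict the broker to explicit mechanisms)
mech_list: DIGEST-MD5 PLAIN
```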

Comment 1 Gordon Sim 2011-01-25 15:46:26 UTC
Can you verify that specifying --max-ssf 0 to qpid-perftest also resolves the issue?

Comment 2 ppecka 2011-01-25 16:10:05 UTC
(In reply to comment #1)
> Can you verify that specifying --max-ssf 0 to qpid-perftest also resolves the
> issue?

With --max-ssf 0, no segfault observed.

Comment 3 Andrew Stitcher 2011-01-25 18:31:07 UTC
So far I've only been able to reproduce this when running against a broker on the same machine as qpid-perftest (which isn't a likely deployment scenario)

Comment 4 Andrew Stitcher 2011-01-25 19:16:11 UTC
This bug is also present in mrg_1.3_errata (qpid-cpp-mrg-0.7.946106-17), so it is not a recent regression.

Comment 5 Andrew Stitcher 2011-01-25 19:23:29 UTC
(In reply to comment #3)
> So far I've only been able to reproduce this when running against a broker on
> the same machine as qpid-perftest (which isn't a likely deployment scenario)

Although for me it did take a good while (10s of minutes) to reproduce the bug even on the same machine. So it's possible that if I'd run for longer then I'd have encountered the bug eventually when running against a broker on a different machine.

Comment 6 Andrew Stitcher 2011-01-25 19:42:01 UTC
Trying to replicate using mrg_1.2.2 fails because it runs into a different bug first and crashes with a different stack trace. A number of bugs fixed in 1.3 could account for this.

Comment 8 mick 2011-03-29 13:09:10 UTC
Testing with perftest and the broker on the same system, I have managed to reproduce the problem reported above AS WELL AS ANOTHER PROBLEM -- which seems likely to be related.

After more than 133 iterations of qpid-perftest, I have a *broker-side* core, but in the same shared code that produced the client-side core already reported.

It has a bad this-pointer in the CyrusSecurityLayer at level 10 in the stack:

    #0  0x0000003f8d030265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
    #1  0x0000003f8d031d10 in abort () at abort.c:88
    #2  0x0000003f8d0296e6 in __assert_fail (assertion=0x2ad960c410aa "completionsNeeded.get() > 0",
        file=0x2ad960c40e58 "./qpid/broker/AsyncCompletion.h", line=167,
        function=0x2ad960c41440 "void qpid::broker::AsyncCompletion::end(qpid::broker::AsyncCompletion::Callback&)")
        at assert.c:78
    #3  0x00002ad960be3935 in qpid::broker::AsyncCompletion::end (this=0x2aaab04a5660, cb=<value optimized out>)
        at ./qpid/broker/AsyncCompletion.h:167
    #4  0x00002ad960be03f3 in qpid::broker::SessionState::handleContent (this=0x2aaab04da490, frame=...,
        id=<value optimized out>) at qpid/broker/SessionState.cpp:265
    #5  0x00002ad960be0830 in qpid::broker::SessionState::handleIn (this=0x2aaab04da490, frame=...)
        at qpid/broker/SessionState.cpp:355
    #6  0x00002ad96107a9ac in qpid::amqp_0_10::SessionHandler::handleIn (this=0x2aaab04d9270, f=...)
        at qpid/amqp_0_10/SessionHandler.cpp:91
    #7  0x00002ad960b2722b in operator() (this=0x2aaab046d830, frame=...) at ./qpid/framing/Handler.h:42
    #8  qpid::broker::Connection::received (this=0x2aaab046d830, frame=...) at qpid/broker/Connection.cpp:164
    #9  0x00002ad960af9300 in qpid::amqp_0_10::Connection::decode (this=0x2aaab0548d30, buffer=<value optimized out>,
        size=<value optimized out>) at qpid/amqp_0_10/Connection.cpp:58
    #10 0x00002ad9610c23d0 in qpid::sys::cyrus::CyrusSecurityLayer::encode (this=0x90,
        buffer=0xa4 <Address 0xa4 out of bounds>, size=164) at qpid/sys/cyrus/CyrusSecurityLayer.cpp:77

Comment 9 mick 2011-03-29 13:32:39 UTC
for the above comment, here is my broker start script,
and my perftest script:


============ start of qpidd script ==================
#! /bin/bash

export LD_LIBRARY_PATH=$TRUNK/qpid/cpp/src/.libs

QPID_SRC=$TRUNK/qpid/cpp/src
QPIDD=${QPID_SRC}/.libs/qpidd

echo $QPIDD

rm -rf /tmp/mick
mkdir /tmp/mick

$QPIDD                                        \
  --no-module-dir                             \
  --load-module ${QPID_SRC}/.libs/rdma.so     \
  --data-dir /tmp/mick/data_1                 \
  --auth=yes                                  \
  --mgmt-enable=yes                           \
  --log-enable info+                          \
  --log-to-file /tmp/mick/qpidd_1.log         \
  --log-source yes                            \
  --sasl-config=${QPID_SRC}/tests/sasl_config \
  -d

echo "started broker from $QPIDD"
============= end of script ============================




================= start of perftest script =========================
#! /bin/bash

rm core.*

count=0
while [ $count -lt 2000 ]
do
  echo "==========================================="
  echo "TEST $count"
  echo "==========================================="
  sleep 2
  core_files=`ls -l core.* | wc -l`
  echo "core files:  ${core_files}"

  if [ ${core_files} -gt 0 ]; then
    echo "core files found!"
    exit 1
  else
    echo "no core files found."
  fi
  ./qpid-perftest --username zig --password zig --protocol rdma --broker 20.0.40.14 --qt 4 --count 10
  count=$(( $count + 1 ))
done
================= end of script ==================================

Comment 10 mick 2011-03-30 19:15:00 UTC
fixed by rev 1087047

     The registration of the codec happens on a different thread from the
    use of the codec.  It is possible for the registration to occur after
    the first attempted use.  In my testing, this happened 3% of the time
    -- 165 times out of 5000 tests -- when using RDMA transport, and 0 times
    out of 5000 when using TCP, which is why we didn't notice it earlier.

    We have a function that tells when we are ready to encode --
    CyrusSecurityLayer::canEncode.  But it does not check the validity of
    the codec pointer before using it, so it cores in this situation.

    I believe simply checking that pointer is probably the best solution.
    Introducing that check caused the crash not to show up in 10,000
    trials.  There were also no hangs.

Comment 16 Ted Ross 2011-06-06 21:34:10 UTC
Re-posting this BZ.  The fix was never back-ported to the package build.

Comment 17 mick 2011-06-07 12:54:45 UTC
I was able to reproduce the crash using the dash-7 source RPM, after manually fixing the small issue in broker/SessionState.cpp that broke the build. (Just fixing an unused variable -- no effect on code function.)

Using an 8-broker 8-client test -- similar to PPecka's testing strategy, except multiplied by 4 -- I was able to get 31 crashes in 32,000 tests.

The core files were identical to the ones I was seeing before I thought I had fixed the problem. So I looked at the source, and my little fix was not there. I assume this is because I failed to check the fix into the proper branch, and only checked it into trunk.

After re-introducing that fix to the dash-7 code, I was able to run the same 8x8 test for a total of 80,000 iterations with zero failures.  


( The test was 4 brokers and 4 clients on mrg25, 4 brokers and 4 clients on mrg26, each client talks to a single broker and never on the same box with it. )

Comment 18 mick 2011-06-07 12:55:32 UTC

Also PLEASE NOTE -- in all the testing I have done, which totaled close to 200,000 tests, I have never seen another *broker-side* core like the one mentioned in comment #8 above. All cores have been client-side only, SEGV, and zero have appeared in any case where my small fix was present.

Also, I was never able to produce any core when the broker and client were on the same box. All successful (core-producing) tests involved clients talking to brokers on a different box.

Comment 19 ppecka 2011-06-13 07:31:46 UTC
Verified on RHEL5.6 x86_64

60000 runs

# rpm -qa | grep qpid
qpid-cpp-client-rdma-0.10-8.el5
qpid-cpp-server-store-0.10-8.el5
qpid-qmf-0.10-10.el5
qpid-cpp-client-devel-docs-0.10-8.el5
qpid-cpp-server-ssl-0.10-8.el5
python-qpid-0.10-1.el5
qpid-java-example-0.10-6.el5
qpid-cpp-client-0.10-8.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-devel-0.10-8.el5
qpid-tools-0.10-6.el5
qpid-cpp-server-0.10-8.el5
qpid-cpp-client-ssl-0.10-8.el5
rh-qpid-cpp-tests-0.10-8.el5
qpid-cpp-server-rdma-0.10-8.el5
qpid-cpp-server-devel-0.10-8.el5
qpid-java-client-0.10-6.el5
qpid-cpp-server-cluster-0.10-8.el5
qpid-cpp-server-xml-0.10-8.el5
qpid-java-common-0.10-6.el5
qpid-cpp-mrg-debuginfo-0.10-8.el5

--> VERIFIED

Comment 20 errata-xmlrpc 2011-06-23 15:45:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html

