Bug 604688

Summary: rdma stability issues, including the client crashing when the broker is killed from underneath it (and likewise the broker when the client is killed)
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: beta
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Gordon Sim <gsim>
Assignee: Andrew Stitcher <astitcher>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: jross
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-12-11 18:53:55 UTC

Description Gordon Sim 2010-06-16 13:55:25 UTC
Description of problem:

rdma client crashing when broker is killed from underneath it

Version-Release number of selected component (if applicable):

qpid trunk r954983

How reproducible:

Usually within a few minutes for me

Steps to Reproduce:
In two separate windows:
1. while true; do src/qpidd --auth no --load-module ./src/.libs/rdma.so --load-module ./src/.libs/ssl.so --ssl-cert-db ~/ssl_test_certs/server_db/ --ssl-cert-password-file ~/ssl_test_certs/cert.password & sleep 10 ; kill -9 %%; sleep 1; done
2. while [ ! -f core.* ]; do ./src/tests/qpid-perftest --count 15000 --protocol rdma --broker 30.0.20.15; done
  
Actual results:

Segfaults

Expected results:

No segfaults

Additional info:

Core was generated by `/home/gordon/qpid/cpp/build/src/tests/.libs/lt-qpid-perftest --count 15000 --pr'.
Program terminated with signal 11, Segmentation fault.
[New process 13913]
[New process 13912]
[New process 13911]
[New process 13910]
[New process 13909]
[New process 13893]
#0  0x00000038fa408ac0 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038fa408ac0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00002b6ad089dffd in std::for_each<__gnu_cxx::__normal_iterator<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus**, std::vector<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus*, std::allocator<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus*> > >, qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::AllThreadsStatuses::handleAdder> (__first=<value optimized out>, 
    __last={_M_current = 0x18986028}, __f={handle = {px = 0x43be2c40, pn = {pi_ = 0x18986028}}}) at ../../include/qpid/sys/posix/Mutex.h:116
#2  0x00002b6ad089e15c in qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::markForDeletion (handle=0x2aaab0001cd0)
    at ../../src/qpid/sys/DeletionManager.h:154
#3  0x00002b6ad095f51a in ~DispatchHandle (this=0x2aaab0001b90) at ../../src/qpid/sys/DispatchHandle.cpp:44
#4  0x00002b6ad0961971 in qpid::sys::DispatchHandle::processEvent (this=0x2aaab0001b90, type=qpid::sys::Poller::READABLE)
    at ../../src/qpid/sys/DispatchHandle.cpp:334
#5  0x00002b6ad089b0b8 in qpid::sys::Poller::run (this=0x1897b7a0) at ../../src/qpid/sys/Poller.h:125
#6  0x00002b6ad089336a in runRunnable (p=0x0) at ../../src/qpid/sys/posix/Thread.cpp:35
#7  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#8  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (process 13893):
#0  0x00000038fa40ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b6ad051015a in qpid::client::LocalQueueImpl::get (this=0x2aaab101f140, result=@0x7fff8e59c010, timeout=<value optimized out>)
    at ../../include/qpid/sys/posix/Condition.h:63
#2  0x00002b6ad0510ab9 in qpid::client::LocalQueueImpl::get (this=0x2aaab101f140, timeout={nanosecs = 9223372036854775807})
    at ../../src/qpid/client/LocalQueueImpl.cpp:40
#3  0x00002b6ad0510c59 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout={nanosecs = 1}) at ../../src/qpid/client/LocalQueueImpl.cpp:36
#4  0x00002b6ad050db0c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout={nanosecs = 1})
    at ../../src/qpid/client/LocalQueue.cpp:43
#5  0x00000000004132e4 in qpid::tests::Controller::process (this=<value optimized out>, n=1, lq=
        {<qpid::client::Handle<qpid::client::LocalQueueImpl>> = {impl = 0x7fff8e59c360}, <No data fields>}, queue=<value optimized out>, msgFn=
        {<boost::function1<void,const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&,std::allocator<void> >> = {<boost::function_base> = {manager = 0x7fff8e59c1e0, functor = {obj_ptr = 0x7fff8e59c360, const_obj_ptr = 0x7fff8e59c360, func_ptr = 0x7fff8e59c360, data = "`"}}, static args = <optimized out>, static arity = <optimized out>, invoker = 0}, <No data fields>}) at ../../../src/tests/qpid-perftest.cpp:382
#6  0x00000000004138f1 in qpid::tests::Controller::run (this=0x7fff8e59c570) at ../../../src/tests/qpid-perftest.cpp:422
#7  0x000000000040bcd8 in main (argc=-1333701040, argv=<value optimized out>) at ../../../src/tests/qpid-perftest.cpp:719

Thread 5 (process 13909):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b6ad089a563 in qpid::sys::Poller::wait (this=0x1897b7a0, timeout=<value optimized out>) at ../../src/qpid/sys/epoll/EpollPoller.cpp:570
#2  0x00002b6ad089b0c7 in qpid::sys::Poller::run (this=0x1897b7a0) at ../../src/qpid/sys/epoll/EpollPoller.cpp:517
#3  0x00002b6ad089336a in runRunnable (p=0x5) at ../../src/qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 4 (process 13910):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b6ad089a563 in qpid::sys::Poller::wait (this=0x1897b7a0, timeout=<value optimized out>) at ../../src/qpid/sys/epoll/EpollPoller.cpp:570
#2  0x00002b6ad089b0c7 in qpid::sys::Poller::run (this=0x1897b7a0) at ../../src/qpid/sys/epoll/EpollPoller.cpp:517
#3  0x00002b6ad089336a in runRunnable (p=0x5) at ../../src/qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 3 (process 13911):
#0  0x00000038fccbca69 in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#1  0x00000038fbc088e8 in ?? () from /lib64/libgcc_s.so.1
#2  0x00000038fbc08c5a in _Unwind_Resume () from /lib64/libgcc_s.so.1
#3  0x00002b6ad051f13d in qpid::client::SessionImpl::sendContent (this=0x2aaab0808470, content=@0x1897e260)
    at ../../src/qpid/client/SessionImpl.cpp:451
#4  0x00002b6ad05245c2 in qpid::client::SessionImpl::sendCommand (this=0x2aaab0808470, command=@0x427e0d00, content=0x1897e260)
    at ../../src/qpid/client/SessionImpl.cpp:398
#5  0x00002b6ad0524769 in qpid::client::SessionImpl::send (this=0x0, command=@0x191d1c20, content=@0x427e0598)
    at ../../src/qpid/client/SessionImpl.cpp:306
#6  0x00002b6ad04d5b3b in qpid::client::no_keyword::AsyncSession_0_10::messageTransfer (this=0x1919ca80, destination=<value optimized out>, 
    acceptMode=<value optimized out>, acquireMode=0 '\0', content=@0x427e0e60, sync=false) at ./qpid/client/no_keyword/AsyncSession_0_10.cpp:63
#7  0x0000000000418357 in qpid::tests::PublishThread::run (this=0x1919ca40) at ../../include/qpid/client/AsyncSession_0_10.h:306
#8  0x00002b6ad089336a in runRunnable (p=0x1) at ../../src/qpid/sys/posix/Thread.cpp:35
#9  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 13912):
#0  0x00000038fa40ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b6ad051015a in qpid::client::LocalQueueImpl::get (this=0x1897e7d0, result=@0x431e1f40, timeout=<value optimized out>)
    at ../../include/qpid/sys/posix/Condition.h:63
#2  0x00002b6ad0510ab9 in qpid::client::LocalQueueImpl::get (this=0x1897e7d0, timeout={nanosecs = 9223372036854775807})
    at ../../src/qpid/client/LocalQueueImpl.cpp:40
#3  0x00002b6ad0510c59 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout={nanosecs = 107}) at ../../src/qpid/client/LocalQueueImpl.cpp:36
#4  0x00002b6ad050db0c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout={nanosecs = 107})
    at ../../src/qpid/client/LocalQueue.cpp:43
#5  0x0000000000416a7c in qpid::tests::SubscribeThread::run (this=0x2aaab0808aa0) at ../../../src/tests/qpid-perftest.cpp:626
#6  0x00002b6ad089336a in runRunnable (p=0x18986adc) at ../../src/qpid/sys/posix/Thread.cpp:35
#7  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#8  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 13913):
#0  0x00000038fa408ac0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00002b6ad089dffd in std::for_each<__gnu_cxx::__normal_iterator<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus**, std::vector<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus*, std::allocator<qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::ThreadStatus*> > >, qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::AllThreadsStatuses::handleAdder> (__first=<value optimized out>, 
    __last={_M_current = 0x18986028}, __f={handle = {px = 0x43be2c40, pn = {pi_ = 0x18986028}}}) at ../../include/qpid/sys/posix/Mutex.h:116
#2  0x00002b6ad089e15c in qpid::sys::DeletionManager<qpid::sys::PollerHandlePrivate>::markForDeletion (handle=0x2aaab0001cd0)
    at ../../src/qpid/sys/DeletionManager.h:154
#3  0x00002b6ad095f51a in ~DispatchHandle (this=0x2aaab0001b90) at ../../src/qpid/sys/DispatchHandle.cpp:44
#4  0x00002b6ad0961971 in qpid::sys::DispatchHandle::processEvent (this=0x2aaab0001b90, type=qpid::sys::Poller::READABLE)
    at ../../src/qpid/sys/DispatchHandle.cpp:334
#5  0x00002b6ad089b0b8 in qpid::sys::Poller::run (this=0x1897b7a0) at ../../src/qpid/sys/Poller.h:125
#6  0x00002b6ad089336a in runRunnable (p=0x0) at ../../src/qpid/sys/posix/Thread.cpp:35
#7  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#8  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb)

Comment 1 Gordon Sim 2010-06-16 14:01:11 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=603802#c3 for a different crash. That was without setting ulimit -l 131072 for both client and server. The above trace was after setting this.
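
(Aside, for context: "ulimit -l" sets RLIMIT_MEMLOCK, the limit on memory a process may lock, and RDMA relies on locked (registered) buffers on both the client and the broker side. The sketch below is illustrative only, not part of this report or of the qpid code; it just shows how a process could check that limit at startup, with the 131072 KiB value mirroring the ulimit mentioned above.)

// Illustrative only: verify the locked-memory limit before attempting RDMA
// buffer registration. "ulimit -l" is expressed in KiB, RLIMIT_MEMLOCK in bytes.
#include <sys/resource.h>
#include <cstdio>

int main() {
    rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
    }
    const rlim_t needed = static_cast<rlim_t>(131072) * 1024;  // 128 MiB
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < needed) {
        std::fprintf(stderr,
                     "RLIMIT_MEMLOCK is %llu bytes; RDMA buffer registration may fail"
                     " (raise it with 'ulimit -l 131072')\n",
                     static_cast<unsigned long long>(rl.rlim_cur));
        return 1;
    }
    std::printf("locked-memory limit looks sufficient for RDMA\n");
    return 0;
}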

Comment 2 Andrew Stitcher 2010-06-16 14:10:29 UTC
Is the IP address 30.0.20.15 in the above reproducer an error?

Comment 3 Gordon Sim 2010-06-16 14:13:17 UTC
No, that's the IP of the IB interface on mrg15 where I was testing. Why? It appears to work (perftest will get through its run if allowed to complete, though it will often core dump on exit).

Comment 4 Andrew Stitcher 2010-06-16 14:31:28 UTC
That address is not using the same convention as the other machines "nearby". The addresses in use on mrg12/mrg14 are 20.0.40.12/14.

Also, there seem to be two IB interfaces (ib0/ib1) on mrg15, each with this same address, which seems odd.

Comment 5 Gordon Sim 2010-06-16 14:44:34 UTC
The interfaces are 30.0.10.15 and 30.0.20.15. The first one doesn't work for me, so I used the second, which does (barring these issues).

Comment 6 Andrew Stitcher 2010-07-19 09:47:55 UTC
Having looked at crash dumps from this and similar crashes, I'm tending toward the conclusion that there is a heap corruption bug somewhere in the rdma code that only exhibits on the client side.

A leading candidate would be freeing a block of memory back to the memory allocator before it has been removed from the rdma receive ring.
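
To picture the kind of ordering bug suspected here, the sketch below shows the unsafe order versus the intended one. All of the names (RecvBuffer, ReceiveRing, and so on) are hypothetical stand-ins, not the actual qpid rdma classes; the point is only that a buffer handed back to the allocator while it is still posted on the hardware receive ring can be reused for another allocation and then overwritten by an incoming message, corrupting the heap.

// Hypothetical sketch of the suspected bug (none of these names are the real
// qpid rdma API): a receive buffer must be removed from the RDMA receive ring,
// i.e. no longer posted to the hardware, *before* it goes back to the
// allocator; otherwise the NIC can DMA an incoming message into freed memory.
#include <cassert>
#include <cstddef>
#include <set>

struct RecvBuffer { char* data; std::size_t size; };    // hypothetical

// Stand-in for the hardware receive queue: tracks which buffers are posted.
class ReceiveRing {
    std::set<const RecvBuffer*> posted;
public:
    void post(const RecvBuffer* b)   { posted.insert(b); }
    void unpost(const RecvBuffer* b) { posted.erase(b); }
    bool isPosted(const RecvBuffer* b) const { return posted.count(b) != 0; }
};

// Unsafe order (the suspected client-side bug): the buffer is freed while the
// ring still has it posted, so the allocator can hand the memory out again
// while the hardware may still write an incoming message into it.
void releaseBufferUnsafe(ReceiveRing& ring, RecvBuffer* b) {
    delete[] b->data;
    delete b;
    // b was never unposted; this is the use-after-free window.
}

// Intended order: take the buffer off the ring first, then free it.
void releaseBufferSafe(ReceiveRing& ring, RecvBuffer* b) {
    ring.unpost(b);
    assert(!ring.isPosted(b));
    delete[] b->data;
    delete b;
}

int main() {
    ReceiveRing ring;
    RecvBuffer* b = new RecvBuffer;
    b->data = new char[4096];
    b->size = 4096;
    ring.post(b);
    releaseBufferSafe(ring, b);   // the correct ordering
    return 0;
}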

Comment 7 Andrew Stitcher 2010-07-22 12:07:15 UTC
I've carried out an experiment that would seem to disprove the idea that it's the rdma hardware overwriting things:

If I stop the client-side code from ever freeing the buffers that are used by the rdma hardware, the crash seems to happen faster rather than slower (and in different places).

It still seems consistent with memory corruption of some sort though.
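
The experiment can be pictured with the same hypothetical names as the sketch above: instead of freeing receive buffers, park them for the lifetime of the process. If the crashes were caused by the hardware DMA-ing into freed-and-reused memory, leaking every buffer should make them disappear; the observation that they instead arrive sooner points at corruption from some other source.

// Hypothetical sketch of the experiment (again, not the real qpid rdma code):
// never return receive buffers to the allocator, so no later allocation can
// reuse memory the hardware might still write into.
#include <cstddef>
#include <vector>

struct RecvBuffer { char* data; std::size_t size; };   // hypothetical

static std::vector<RecvBuffer*> parked;                 // kept for process lifetime

void releaseBufferExperiment(RecvBuffer* b) {
    // Deliberately leak instead of freeing: if the hardware were the thing
    // overwriting freed memory, this change should make the crashes stop.
    parked.push_back(b);
}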

Comment 8 Andrew Stitcher 2010-09-08 18:45:18 UTC
A big raft of changes that massively improve rdma stability has been checked in as of trunk r995165.

Also checked into the mrg_1.3 release branch.

Comment 9 Andrew Stitcher 2010-09-08 18:54:56 UTC
There are essentially three tests to use to verify the bug.
All of them run the broker and a client in parallel, in different windows:

Prerequisites:

Infiniband/IBoE installed and working on the machine.
Either install packages (including the qpid-rdma* packages), or build and install from source (make install) with a --prefix that is writable (to hold the necessary loadable modules).

1. Run the client in a loop and kill the broker.

Broker:
while [ ! -f core.* ]; do date; src/qpidd --auth no & sleep 1; kill %%; sleep 1; done

client:
while [ ! -f core.* ]; do date; src/tests/qpid-perftest -Prdma -b 20.0.40.14 --qt 4 --count 10; done

2. Run the broker; loop killing the client.

Broker:
src/qpidd --auth no

Client:
while [ ! -f core.* ]; do date; src/tests/qpid-perftest --qt 4 -Prdma -b 20.0.40.14 & sleep 5 ; kill %%; sleep 1; done

3. Run the broker; loop the client with very small counts (so exercising the teardown and setup logic frequently).

Broker:
src/qpidd --auth no

Client:
while [ ! -f core.* ]; do date; src/tests/qpid-perftest -Prdma -b 20.0.40.14 --qt 4 --count 10; done