Bug 674011

Summary: iWARP configuration qpid-perftest hangs
Product: Red Hat Enterprise MRG Reporter: ppecka <ppecka>
Component: qpid-cppAssignee: Andrew Stitcher <astitcher>
Status: CLOSED WONTFIX QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: medium    
Version: 1.3CC: astitcher, gsim, iboverma, jross
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-22 14:35:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 700156, 484691, 681313, 717875    

Description ppecka 2011-01-31 10:23:27 UTC
While verifying bz484691 following errors appeared in qpidd log. Used iWARP
over Chelsio S310-CR.
Both qpid-perftest and qpid-latency-test were running simultaneously over rdma
against qpidd on the other side. Qpid-perftest found in hang state, after couple of hundred runs (56148).



Description of problem:

(*): Shared library is missing debugging information.
(gdb)   6 Thread 0x41572940 (LWP 11893)  0x00000036226d44b8 in epoll_wait ()
   from /lib64/libc.so.6
  5 Thread 0x41f73940 (LWP 11898)  0x00000036226d44b8 in epoll_wait ()
   from /lib64/libc.so.6
  4 Thread 0x42974940 (LWP 11900)  0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x43375940 (LWP 11904)  0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x43d76940 (LWP 11905)  0x00000036226d44b8 in epoll_wait ()
   from /lib64/libc.so.6
* 1 Thread 0x2b5924673020 (LWP 11892)  0x0000003623207b35 in pthread_join ()
   from /lib64/libpthread.so.0
(gdb) 
Thread 6 (Thread 0x41572940 (LWP 11893)):
#0  0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, 
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6)
    at qpid/sys/posix/Thread.cpp:35
#4  0x000000362320673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000036226d40cd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x41f73940 (LWP 11898)):
#0  0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, 
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6)
    at qpid/sys/posix/Thread.cpp:35
#4  0x000000362320673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000036226d40cd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x42974940 (LWP 11900)):
#0  0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000003edf08520b in pop (this=0x1d7ee900, result=..., 
    timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63
#2  qpid::client::LocalQueueImpl::get (this=0x1d7ee900, result=..., 
    timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49
#3  0x0000003edf085a79 in qpid::client::LocalQueueImpl::get (this=0x1d7ee900, 
    timeout=...) at qpid/client/LocalQueueImpl.cpp:40
#4  0x0000003edf085c19 in qpid::client::LocalQueueImpl::pop (this=0x80, 
    timeout=...) at qpid/client/LocalQueueImpl.cpp:36
#5  0x0000003edf082c2c in qpid::client::LocalQueue::pop (
    this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43
#6  0x000000000041c564 in qpid::tests::PublishThread::run (this=0x1d7ecd60)
    at qpid-perftest.cpp:515
#7  0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x1d7f982c) at qpid/sys/posix/Thread.cpp:35
#8  0x000000362320673d in start_thread () from /lib64/libpthread.so.0
#9  0x00000036226d40cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x43375940 (LWP 11904)):
#0  0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000003edf08520b in pop (this=0x1d7f1300, result=..., 
    timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63
#2  qpid::client::LocalQueueImpl::get (this=0x1d7f1300, result=..., 
    timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49
#3  0x0000003edf085a79 in qpid::client::LocalQueueImpl::get (this=0x1d7f1300, 
    timeout=...) at qpid/client/LocalQueueImpl.cpp:40
#4  0x0000003edf085c19 in qpid::client::LocalQueueImpl::pop (this=0x80, 
    timeout=...) at qpid/client/LocalQueueImpl.cpp:36
#5  0x0000003edf082c2c in qpid::client::LocalQueue::pop (
    this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43
#6  0x000000000041b0f2 in qpid::tests::SubscribeThread::run (
    this=0x2aaab0805330) at qpid-perftest.cpp:626
#7  0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (
    p=0x1d7fe85c) at qpid/sys/posix/Thread.cpp:35
#8  0x000000362320673d in start_thread () from /lib64/libpthread.so.0
#9  0x00000036226d40cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x43d76940 (LWP 11905)):
#0  0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, 
    timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910)
    at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6)
    at qpid/sys/posix/Thread.cpp:35
#4  0x000000362320673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000036226d40cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b5924673020 (LWP 11892)):
#0  0x0000003623207b35 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000003eddf216fd in qpid::sys::Thread::join (this=<value optimized out>)
    at qpid/sys/posix/Thread.cpp:70
#2  0x000000000040cf49 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at qpid-perftest.cpp:731
(gdb) Detaching from program: /usr/bin/qpid-perftest, process 11892


Version-Release number of selected component (if applicable):
rpm -qa | grep qpid | sort -u
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5

libcxgb3-1.2.5-2.el5
kernel-2.6.18-238.el5


How reproducible:


Steps to Reproduce:
HostA (192.168.1.5 )
1. qpidd --auth no --mgmt-enable no --log-to-file /tmp/qpidd.log -d

HostB (192.168.1.4)
2.
while true; do date; qpid-perftest -b 192.168.1.5 --count 100 --protocol rdma
--log-to-file /tmp/qpid-perftest.log --log-to-stderr no --base-name
"perf.$(date +%s%N)"  2>&1 ; sleep 0.5; done>>/tmp/qpid-perftest.log

3.
while true; do date; qpid-latency-test -b 192.168.1.5 --count 100 --protocol
rdma --log-to-file /tmp/qpid-latency-test.log --log-to-stderr no
--queue-base-name "latency.$(date +%s%N)" 2>&1 ; sleep 0.5;
done>>/tmp/qpid-latency-test.log


Actual results:
qpid-perftest hangs
Processing 1 messages from perf.1296229075621256000_sub_ready . done.
Sending start 1 times to perf.1296229075621256000_pub_start
Processing 1 messages from perf.1296229075621256000_pub_done . done.
Processing 1 messages from perf.1296229075621256000_sub_done . done.

Total 200 transfers of 1024 bytes in 0.002958 seconds.

Publish transfers/sec:    
52521
Average: 52521

Subscribe transfers/sec:  
467.502
Average: 467.502

Total transfers/sec:      67613.3
Total Mbytes/sec: 66.0286
Fri Jan 28 10:37:56 EST 2011
2011-01-28 10:37:57 warning Connection  closed

Connection  closed


Expected results:
qpid-perftest works

Additional info:

Comment 1 Andrew Stitcher 2011-02-01 15:10:52 UTC
In the reproducer are you running perftest and latencytest in parallel on Host B? Or just running them one after the other?

Comment 2 Andrew Stitcher 2011-02-01 15:11:56 UTC
sorry, didn't read report carefully enough.

Comment 3 Andrew Stitcher 2011-02-01 15:16:10 UTC
Looking at the output from perftest you can see that it has actually finished the run and has hung exiting.

This is clear because it has output the test results.
Looking at the stack trace you can see this because the main thread has hung in pthread_join().

Comment 4 Andrew Stitcher 2011-02-01 19:05:37 UTC
Failed to replicate this bug using Infiniband hardware - test ran for over 3 hours with nothing hanging.

This probably indicates that the problem is specifically triggered by different timing of the iWarp hardware/drivers on the test system.

Comment 5 ppecka 2011-02-02 09:49:14 UTC
(In reply to comment #4)
> Failed to replicate this bug using Infiniband hardware - test ran for over 3
> hours with nothing hanging.
> 
> This probably indicates that the problem is specifically triggered by different
> timing of the iWarp hardware/drivers on the test system.

it's not obvious from my first post, but it happened in iteration 56148 of qpid-perftest while-loop

Comment 6 Andrew Stitcher 2011-02-02 15:29:18 UTC
My run on Infinband went for approximately 17,000 iterations of qpid-perftest without failing. So I'd still say it is not reproduced on Infiniband.

Comment 8 ppecka 2011-05-27 11:36:57 UTC
QA unable to verify bz484691 due to this issue, due to fact that that qpid-lanecy-test is unable to connect to a broker. Andrew can you confirm that this backtrace is the same one or similar as the former one from comment#0?

# rpm -qa | grep qpid | sort -u
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-9.el5
qpid-cpp-client-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-client-devel-docs-0.10-7.el5
qpid-cpp-client-rdma-0.10-7.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-mrg-debuginfo-0.10-7.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-cpp-server-rdma-0.10-7.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-xml-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-qmf-0.10-9.el5
qpid-qmf-devel-0.10-9.el5
qpid-tests-0.10-1.el5
qpid-tools-0.10-5.el5
rh-qpid-cpp-tests-0.10-7.el5



(gdb) info threads
  6 Thread 32731  0x0000003e1fe07b35 in pthread_join () from /lib64/libpthread.so.0
  5 Thread 32743  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
  4 Thread 32742  0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 32737  0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 32736  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 32732  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (Thread 32731):
#0  0x0000003e1fe07b35 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000003e20f2386d in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70
#2  0x000000000040cf2c in main (argc=<value optimized out>, argv=<value optimized out>) at qpid-perftest.cpp:736

Thread 5 (Thread 32743):
#0  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>)
    at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
#4  0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003e1f2d44bd in clone () from /lib64/libc.so.6

Thread 4 (Thread 32742):
#0  0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003e21886dd1 in pop (this=0xa25240, result=..., timeout=<value optimized out>)
    at ../include/qpid/sys/posix/Condition.h:63
#2  qpid::client::LocalQueueImpl::get (this=0xa25240, result=..., timeout=<value optimized out>)
    at qpid/client/LocalQueueImpl.cpp:49
#3  0x0000003e218876e9 in qpid::client::LocalQueueImpl::get (this=0xa25240, timeout=...)
    at qpid/client/LocalQueueImpl.cpp:40
#4  0x0000003e21887889 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...)
    at qpid/client/LocalQueueImpl.cpp:36
#5  0x0000003e2188488c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout=...)
    at qpid/client/LocalQueue.cpp:43
#6  0x000000000041b14a in qpid::tests::SubscribeThread::run (this=0x2aaab00010b0) at qpid-perftest.cpp:631
---Type <return> to continue, or q <return> to quit---
#7  0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0xa3d7fc) at qpid/sys/posix/Thread.cpp:35
#8  0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0
#9  0x0000003e1f2d44bd in clone () from /lib64/libc.so.6

Thread 3 (Thread 32737):
#0  0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003e21886dd1 in pop (this=0xa27ad0, result=..., timeout=<value optimized out>)
    at ../include/qpid/sys/posix/Condition.h:63
#2  qpid::client::LocalQueueImpl::get (this=0xa27ad0, result=..., timeout=<value optimized out>)
    at qpid/client/LocalQueueImpl.cpp:49
#3  0x0000003e218876e9 in qpid::client::LocalQueueImpl::get (this=0xa27ad0, timeout=...)
    at qpid/client/LocalQueueImpl.cpp:40
#4  0x0000003e21887889 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...)
    at qpid/client/LocalQueueImpl.cpp:36
#5  0x0000003e2188488c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout=...)
    at qpid/client/LocalQueue.cpp:43
#6  0x000000000041c58a in qpid::tests::PublishThread::run (this=0x2aaab0000c30) at qpid-perftest.cpp:517
#7  0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0xa3776c) at qpid/sys/posix/Thread.cpp:35
#8  0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0
#9  0x0000003e1f2d44bd in clone () from /lib64/libc.so.6

Thread 2 (Thread 32736):
#0  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>)
    at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
#4  0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#5  0x0000003e1f2d44bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 32732):
#0  0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>)
    at qpid/sys/epoll/EpollPoller.cpp:563
#2  0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515
#3  0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
#4  0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003e1f2d44bd in clone () from /lib64/libc.so.6

Comment 9 ppecka 2011-05-27 11:58:46 UTC
and i forgot to note that: 
- used mrg{4,5} with iWARP
- bt is from qpid-perftest
- it hangs in 50-60% runs
- this means that qpid-perftest hangs, and it blocks qpid-latency-test to connect

count=0;while true; do date; echo "COUNTperf: $((count++))";\
 qpid-perftest -b 192.168.1.5 --count 100 --protocol rdma\
 --log-to-file ./qpid-perftest.log --log-to-stderr no \
 --base-name "perf.$(uname -n)"  2>&1 ; sleep 0.5; \
 done|tee ./tee_qpid-perftest.log

Comment 10 Andrew Stitcher 2011-05-27 13:44:36 UTC
It looks to me like the latest stack trace is the same as the original.