Hide Forgot
While verifying bz484691 following errors appeared in qpidd log. Used iWARP over Chelsio S310-CR. Both qpid-perftest and qpid-latency-test were running simultaneously over rdma against qpidd on the other side. Qpid-perftest found in hang state, after couple of hundred runs (56148). Description of problem: (*): Shared library is missing debugging information. (gdb) 6 Thread 0x41572940 (LWP 11893) 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 5 Thread 0x41f73940 (LWP 11898) 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 4 Thread 0x42974940 (LWP 11900) 0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 3 Thread 0x43375940 (LWP 11904) 0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 2 Thread 0x43d76940 (LWP 11905) 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 * 1 Thread 0x2b5924673020 (LWP 11892) 0x0000003623207b35 in pthread_join () from /lib64/libpthread.so.0 (gdb) Thread 6 (Thread 0x41572940 (LWP 11893)): #0 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x000000362320673d in start_thread () from /lib64/libpthread.so.0 #5 0x00000036226d40cd in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x41f73940 (LWP 11898)): #0 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x000000362320673d in start_thread () from /lib64/libpthread.so.0 #5 0x00000036226d40cd in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x42974940 (LWP 11900)): #0 0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000003edf08520b in pop (this=0x1d7ee900, result=..., timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 qpid::client::LocalQueueImpl::get (this=0x1d7ee900, result=..., timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49 #3 0x0000003edf085a79 in qpid::client::LocalQueueImpl::get (this=0x1d7ee900, timeout=...) at qpid/client/LocalQueueImpl.cpp:40 #4 0x0000003edf085c19 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...) at qpid/client/LocalQueueImpl.cpp:36 #5 0x0000003edf082c2c in qpid::client::LocalQueue::pop ( this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43 #6 0x000000000041c564 in qpid::tests::PublishThread::run (this=0x1d7ecd60) at qpid-perftest.cpp:515 #7 0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable ( p=0x1d7f982c) at qpid/sys/posix/Thread.cpp:35 #8 0x000000362320673d in start_thread () from /lib64/libpthread.so.0 #9 0x00000036226d40cd in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x43375940 (LWP 11904)): #0 0x000000362320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000003edf08520b in pop (this=0x1d7f1300, result=..., timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 qpid::client::LocalQueueImpl::get (this=0x1d7f1300, result=..., timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49 #3 0x0000003edf085a79 in qpid::client::LocalQueueImpl::get (this=0x1d7f1300, timeout=...) at qpid/client/LocalQueueImpl.cpp:40 #4 0x0000003edf085c19 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...) at qpid/client/LocalQueueImpl.cpp:36 #5 0x0000003edf082c2c in qpid::client::LocalQueue::pop ( this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43 #6 0x000000000041b0f2 in qpid::tests::SubscribeThread::run ( this=0x2aaab0805330) at qpid-perftest.cpp:626 #7 0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable ( p=0x1d7fe85c) at qpid/sys/posix/Thread.cpp:35 #8 0x000000362320673d in start_thread () from /lib64/libpthread.so.0 #9 0x00000036226d40cd in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x43d76940 (LWP 11905)): #0 0x00000036226d44b8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003eddf2a4c1 in qpid::sys::Poller::wait (this=0x1d7e9910, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003eddf2af57 in qpid::sys::Poller::run (this=0x1d7e9910) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003eddf211aa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x000000362320673d in start_thread () from /lib64/libpthread.so.0 #5 0x00000036226d40cd in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x2b5924673020 (LWP 11892)): #0 0x0000003623207b35 in pthread_join () from /lib64/libpthread.so.0 #1 0x0000003eddf216fd in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70 #2 0x000000000040cf49 in main (argc=<value optimized out>, argv=<value optimized out>) at qpid-perftest.cpp:731 (gdb) Detaching from program: /usr/bin/qpid-perftest, process 11892 Version-Release number of selected component (if applicable): rpm -qa | grep qpid | sort -u python-qpid-0.7.946106-15.el5 qpid-cpp-client-0.7.946106-27.el5 qpid-cpp-client-devel-0.7.946106-27.el5 qpid-cpp-client-devel-docs-0.7.946106-27.el5 qpid-cpp-client-rdma-0.7.946106-27.el5 qpid-cpp-client-ssl-0.7.946106-27.el5 qpid-cpp-server-0.7.946106-27.el5 qpid-cpp-server-cluster-0.7.946106-27.el5 qpid-cpp-server-devel-0.7.946106-27.el5 qpid-cpp-server-rdma-0.7.946106-27.el5 qpid-cpp-server-ssl-0.7.946106-27.el5 qpid-cpp-server-store-0.7.946106-27.el5 qpid-cpp-server-xml-0.7.946106-27.el5 qpid-java-client-0.7.946106-14.el5 qpid-java-common-0.7.946106-14.el5 qpid-java-example-0.7.946106-14.el5 qpid-tools-0.7.946106-12.el5 libcxgb3-1.2.5-2.el5 kernel-2.6.18-238.el5 How reproducible: Steps to Reproduce: HostA (192.168.1.5 ) 1. qpidd --auth no --mgmt-enable no --log-to-file /tmp/qpidd.log -d HostB (192.168.1.4) 2. while true; do date; qpid-perftest -b 192.168.1.5 --count 100 --protocol rdma --log-to-file /tmp/qpid-perftest.log --log-to-stderr no --base-name "perf.$(date +%s%N)" 2>&1 ; sleep 0.5; done>>/tmp/qpid-perftest.log 3. while true; do date; qpid-latency-test -b 192.168.1.5 --count 100 --protocol rdma --log-to-file /tmp/qpid-latency-test.log --log-to-stderr no --queue-base-name "latency.$(date +%s%N)" 2>&1 ; sleep 0.5; done>>/tmp/qpid-latency-test.log Actual results: qpid-perftest hangs Processing 1 messages from perf.1296229075621256000_sub_ready . done. Sending start 1 times to perf.1296229075621256000_pub_start Processing 1 messages from perf.1296229075621256000_pub_done . done. Processing 1 messages from perf.1296229075621256000_sub_done . done. Total 200 transfers of 1024 bytes in 0.002958 seconds. Publish transfers/sec: 52521 Average: 52521 Subscribe transfers/sec: 467.502 Average: 467.502 Total transfers/sec: 67613.3 Total Mbytes/sec: 66.0286 Fri Jan 28 10:37:56 EST 2011 2011-01-28 10:37:57 warning Connection closed Connection closed Expected results: qpid-perftest works Additional info:
In the reproducer are you running perftest and latencytest in parallel on Host B? Or just running them one after the other?
sorry, didn't read report carefully enough.
Looking at the output from perftest you can see that it has actually finished the run and has hung exiting. This is clear because it has output the test results. Looking at the stack trace you can see this because the main thread has hung in pthread_join().
Failed to replicate this bug using Infiniband hardware - test ran for over 3 hours with nothing hanging. This probably indicates that the problem is specifically triggered by different timing of the iWarp hardware/drivers on the test system.
(In reply to comment #4) > Failed to replicate this bug using Infiniband hardware - test ran for over 3 > hours with nothing hanging. > > This probably indicates that the problem is specifically triggered by different > timing of the iWarp hardware/drivers on the test system. it's not obvious from my first post, but it happened in iteration 56148 of qpid-perftest while-loop
My run on Infinband went for approximately 17,000 iterations of qpid-perftest without failing. So I'd still say it is not reproduced on Infiniband.
QA unable to verify bz484691 due to this issue, due to fact that that qpid-lanecy-test is unable to connect to a broker. Andrew can you confirm that this backtrace is the same one or similar as the former one from comment#0? # rpm -qa | grep qpid | sort -u python-qpid-0.10-1.el5 python-qpid-qmf-0.10-9.el5 qpid-cpp-client-0.10-7.el5 qpid-cpp-client-devel-0.10-7.el5 qpid-cpp-client-devel-docs-0.10-7.el5 qpid-cpp-client-rdma-0.10-7.el5 qpid-cpp-client-ssl-0.10-7.el5 qpid-cpp-mrg-debuginfo-0.10-7.el5 qpid-cpp-server-0.10-7.el5 qpid-cpp-server-cluster-0.10-7.el5 qpid-cpp-server-devel-0.10-7.el5 qpid-cpp-server-rdma-0.10-7.el5 qpid-cpp-server-ssl-0.10-7.el5 qpid-cpp-server-store-0.10-7.el5 qpid-cpp-server-xml-0.10-7.el5 qpid-java-client-0.10-6.el5 qpid-java-common-0.10-6.el5 qpid-java-example-0.10-6.el5 qpid-qmf-0.10-9.el5 qpid-qmf-devel-0.10-9.el5 qpid-tests-0.10-1.el5 qpid-tools-0.10-5.el5 rh-qpid-cpp-tests-0.10-7.el5 (gdb) info threads 6 Thread 32731 0x0000003e1fe07b35 in pthread_join () from /lib64/libpthread.so.0 5 Thread 32743 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 4 Thread 32742 0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 3 Thread 32737 0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 2 Thread 32736 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 * 1 Thread 32732 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 (gdb) thread apply all bt Thread 6 (Thread 32731): #0 0x0000003e1fe07b35 in pthread_join () from /lib64/libpthread.so.0 #1 0x0000003e20f2386d in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70 #2 0x000000000040cf2c in main (argc=<value optimized out>, argv=<value optimized out>) at qpid-perftest.cpp:736 Thread 5 (Thread 32743): #0 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0 #5 0x0000003e1f2d44bd in clone () from /lib64/libc.so.6 Thread 4 (Thread 32742): #0 0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000003e21886dd1 in pop (this=0xa25240, result=..., timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 qpid::client::LocalQueueImpl::get (this=0xa25240, result=..., timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49 #3 0x0000003e218876e9 in qpid::client::LocalQueueImpl::get (this=0xa25240, timeout=...) at qpid/client/LocalQueueImpl.cpp:40 #4 0x0000003e21887889 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...) at qpid/client/LocalQueueImpl.cpp:36 #5 0x0000003e2188488c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43 #6 0x000000000041b14a in qpid::tests::SubscribeThread::run (this=0x2aaab00010b0) at qpid-perftest.cpp:631 ---Type <return> to continue, or q <return> to quit--- #7 0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0xa3d7fc) at qpid/sys/posix/Thread.cpp:35 #8 0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0 #9 0x0000003e1f2d44bd in clone () from /lib64/libc.so.6 Thread 3 (Thread 32737): #0 0x0000003e1fe0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000003e21886dd1 in pop (this=0xa27ad0, result=..., timeout=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63 #2 qpid::client::LocalQueueImpl::get (this=0xa27ad0, result=..., timeout=<value optimized out>) at qpid/client/LocalQueueImpl.cpp:49 #3 0x0000003e218876e9 in qpid::client::LocalQueueImpl::get (this=0xa27ad0, timeout=...) at qpid/client/LocalQueueImpl.cpp:40 #4 0x0000003e21887889 in qpid::client::LocalQueueImpl::pop (this=0x80, timeout=...) at qpid/client/LocalQueueImpl.cpp:36 #5 0x0000003e2188488c in qpid::client::LocalQueue::pop (this=<value optimized out>, timeout=...) at qpid/client/LocalQueue.cpp:43 #6 0x000000000041c58a in qpid::tests::PublishThread::run (this=0x2aaab0000c30) at qpid-perftest.cpp:517 #7 0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0xa3776c) at qpid/sys/posix/Thread.cpp:35 #8 0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0 #9 0x0000003e1f2d44bd in clone () from /lib64/libc.so.6 Thread 2 (Thread 32736): #0 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #5 0x0000003e1f2d44bd in clone () from /lib64/libc.so.6 Thread 1 (Thread 32732): #0 0x0000003e1f2d48a8 in epoll_wait () from /lib64/libc.so.6 #1 0x0000003e20f2c631 in qpid::sys::Poller::wait (this=0xa24680, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:563 #2 0x0000003e20f2d0c7 in qpid::sys::Poller::run (this=0xa24680) at qpid/sys/epoll/EpollPoller.cpp:515 #3 0x0000003e20f2331a in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35 #4 0x0000003e1fe0673d in start_thread () from /lib64/libpthread.so.0 #5 0x0000003e1f2d44bd in clone () from /lib64/libc.so.6
and i forgot to note that: - used mrg{4,5} with iWARP - bt is from qpid-perftest - it hangs in 50-60% runs - this means that qpid-perftest hangs, and it blocks qpid-latency-test to connect count=0;while true; do date; echo "COUNTperf: $((count++))";\ qpid-perftest -b 192.168.1.5 --count 100 --protocol rdma\ --log-to-file ./qpid-perftest.log --log-to-stderr no \ --base-name "perf.$(uname -n)" 2>&1 ; sleep 0.5; \ done|tee ./tee_qpid-perftest.log
It looks to me like the latest stack trace is the same as the original.