Red Hat Bugzilla – Bug 473936
perftest hanging during durable perftest
Last modified: 2015-11-15 19:06:05 EST
Description of problem: While running perftest RHTS tests I can see strange perftest behavior. perftest hangs one of first perftest<--->qpidd sessions. Rapidly reproducable on my RHEL5.2 x86_64 using qpid_test_broker_perftest_ecte_bz468428 RHTS test (http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/distribution/MRG_Messaging/qpid_test_broker_perftest_ecte_bz468428/). Part A loops following command: PERFTEST_TEST_PARAMS="--log-enable info+" perftest --durable yes --tcp-nodelay ${PERFTEST_TEST_PARAMS} \ --npubs 4 --nsubs 4 --sub-tx 10 Version-Release number of selected component (if applicable): rhm-0.3.2804-4.el5, qpidd-0.3.719671-1.el5, qpidc-perftest-0.3.719671-1.el5 How reproducible: 95-100% on quad code Intel(R) Xeon(R) CPU X5460 @ 3.16GHz Steps to Reproduce: alpha] 1. Download qpid_test_broker_perftest_ecte_bz468428 2. Run test beta] 1. Run broker (non-daemon) 2. Run perftest --durable yes --tcp-nodelay --log-enable info+ --npubs 4 --nsubs 4 --sub-tx 10 Actual results: perftest hangs during perftest-qpidd session (qpid_test_broker_perftest_ecte_bz468428 test): [root@dhcp-lab-200 qpid_test_broker_perftest_ecte_bz468428]# gps 'make rb' root 10977 0.0 0.0 59116 864 pts/13 S 12:02 0:00 make rb root 25699 0.0 0.0 61108 748 pts/13 S+ 15:16 0:00 grep make rb [root@dhcp-lab-200 qpid_test_broker_perftest_ecte_bz468428]# pstree -p 10977 make(10977)───runtest.sh(10981)─┬─runtest.sh(11003)─┬─perftest(11145)─┬─{perftest}(11155) │ │ ├─{perftest}(11156) │ │ ├─{perftest}(11157) │ │ ├─{perftest}(11158) │ │ ├─{perftest}(11159) │ │ ├─{perftest}(11161) │ │ ├─{perftest}(11162) │ │ ├─{perftest}(11163) │ │ └─{perftest}(11164) │ └─runtest.sh(11091)───runtest.sh(11092)───qpidd(11099)─┬─{qpidd}(11100) │ ├─{qpidd}(11101) │ ├─{qpidd}(11102) │ ├─{qpidd}(11103) │ ├─{qpidd}(11104) │ ├─{qpidd}(11106) │ ├─{qpidd}(11107) │ ├─{qpidd}(11109) │ ├─{qpidd}(11110) │ ├─{qpidd}(11111) │ ├─{qpidd}(11112) │ └─{qpidd}(11113) └─tee(11005) perftest program stack trace: [root@dhcp-lab-200 qpid_test_broker_perftest_ecte_bz468428]# pstack 11145 Thread 10 (Thread 1175386432 (LWP 11155)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a5261f2b in qpid::client::TCPConnector::run () #4 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #5 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 9 (Thread 1185876288 (LWP 11156)): #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () #1 0x00000034a5285529 in qpid::sys::Waitable::wait () #2 0x00000034a5284108 in qpid::client::LocalQueue::get () #3 0x00000034a5284969 in qpid::client::LocalQueue::get () #4 0x00000034a5284c29 in qpid::client::LocalQueue::pop () #5 0x0000000000419c55 in qpid::log::Options::~Options () #6 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #7 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #8 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 8 (Thread 1196366144 (LWP 11157)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a5261f2b in qpid::client::TCPConnector::run () #4 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #5 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 7 (Thread 1206856000 (LWP 11158)): #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () #1 0x00000034a5285529 in qpid::sys::Waitable::wait () #2 0x00000034a5284108 in qpid::client::LocalQueue::get () #3 0x00000034a5284969 in qpid::client::LocalQueue::get () #4 0x00000034a5284c29 in qpid::client::LocalQueue::pop () #5 0x0000000000419c55 in qpid::log::Options::~Options () #6 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #7 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #8 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 6 (Thread 1217345856 (LWP 11159)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a5261f2b in qpid::client::TCPConnector::run () #4 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #5 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 5 (Thread 1227835712 (LWP 11161)): #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () #1 0x00000034a5285529 in qpid::sys::Waitable::wait () #2 0x00000034a5284108 in qpid::client::LocalQueue::get () #3 0x00000034a5284969 in qpid::client::LocalQueue::get () #4 0x00000034a5284c29 in qpid::client::LocalQueue::pop () #5 0x0000000000419c55 in qpid::log::Options::~Options () #6 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #7 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #8 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 4 (Thread 1238325568 (LWP 11162)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a5261f2b in qpid::client::TCPConnector::run () #4 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #5 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 3 (Thread 1248815424 (LWP 11163)): #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () #1 0x00000034a5285529 in qpid::sys::Waitable::wait () #2 0x00000034a5284108 in qpid::client::LocalQueue::get () #3 0x00000034a5284969 in qpid::client::LocalQueue::get () #4 0x00000034a5284c29 in qpid::client::LocalQueue::pop () #5 0x0000000000419c55 in qpid::log::Options::~Options () #6 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #7 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #8 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 2 (Thread 1259305280 (LWP 11164)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a5261f2b in qpid::client::TCPConnector::run () #4 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #5 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 1 (Thread 47361525085888 (LWP 11145)): #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () #1 0x00000034a5285529 in qpid::sys::Waitable::wait () #2 0x00000034a5284108 in qpid::client::LocalQueue::get () #3 0x00000034a5284969 in qpid::client::LocalQueue::get () #4 0x00000034a5284c29 in qpid::client::LocalQueue::pop () #5 0x0000000000414ba6 in qpid::optValue<std::string> () #6 0x00000000004158b8 in qpid::optValue<std::string> () #7 0x000000000040c6d4 in qpid::framing::TransferContent::getData () #8 0x000000350b01d8b4 in __libc_start_main () from /lib64/libc.so.6 #9 0x000000000040b7a9 in qpid::framing::TransferContent::getData () #10 0x00007fff73ccbcc8 in ?? () #11 0x0000000000000000 in ?? () #0 0x000000350bc0a496 in pthread_cond_wait@@GLIBC_2.3.2 () qpidd program stack trace: [root@dhcp-lab-200 qpid_test_broker_perftest_ecte_bz468428]# pstack 11099 Thread 13 (Thread 1110358336 (LWP 11100)): #0 0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () #1 0x00000034a536f2f6 in qpid::broker::Timer::run () #2 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #3 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #4 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 12 (Thread 1120848192 (LWP 11101)): #0 0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () #1 0x00000034a536f2f6 in qpid::broker::Timer::run () #2 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #3 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #4 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 11 (Thread 1131338048 (LWP 11102)): #0 0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () #1 0x00000034a536f2f6 in qpid::broker::Timer::run () #2 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #3 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #4 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 10 (Thread 1095444800 (LWP 11103)): #0 0x000000350bc0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () #1 0x00000034a536f2f6 in qpid::broker::Timer::run () #2 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #3 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #4 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 9 (Thread 1141827904 (LWP 11104)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 8 (Thread 1152317760 (LWP 11106)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 7 (Thread 1162807616 (LWP 11107)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 6 (Thread 1173297472 (LWP 11109)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 5 (Thread 1183787328 (LWP 11110)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 4 (Thread 1194277184 (LWP 11111)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 3 (Thread 1204767040 (LWP 11112)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 2 (Thread 1215256896 (LWP 11113)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a4d7311a in qpid::sys::AbsTime::AbsTime () #4 0x000000350bc062f7 in start_thread () from /lib64/libpthread.so.0 #5 0x000000350b0d1b6d in clone () from /lib64/libc.so.6 Thread 1 (Thread 47255519278992 (LWP 11099)): #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 #1 0x00000034a4d7b3cd in qpid::sys::Poller::wait () #2 0x00000034a4dbef18 in qpid::sys::Dispatcher::run () #3 0x00000034a52c72de in qpid::broker::Broker::run () #4 0x0000000000406858 in qpid::log::Options::~Options () #5 0x0000000000405438 in __cxa_pure_virtual () #6 0x000000350b01d8b4 in __libc_start_main () from /lib64/libc.so.6 #7 0x0000000000404eb9 in __cxa_pure_virtual () #8 0x00007fff223ce3f8 in ?? () #9 0x0000000000000000 in ?? () #0 0x000000350b0d1f58 in epoll_wait () from /lib64/libc.so.6 Expected results: No hang. Perftest should exit after some reasonable time. Additional info: I performed pstack dumps multiple-times and see the same ones over the time.
Reproduced now also on another machine RHTS 'nec-em15.rhts.bos.redhat.com' which is: [10:19:39] Machine and user info: [10:19:39] os: Red Hat Enterprise Linux Server release 5.2 (Tikanga) [10:19:39] uname: Linux nec-em15.rhts.bos.redhat.com 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux [10:19:39] user: root [10:19:39] pwd: /root/qpid_test_broker_perftest_ecte_bz468428 [10:19:39] Get CPU information: processor(s) : 0...3 {4 in total} vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 2.80GHz cpu MHz : 2793.093 cpu cores : 1 bogomips : 5588.46 bogomips : 5585.36 bogomips : 5585.35 bogomips : 5585.29 I saw it hang on the first durable perftest (pid:4281) After 45 minutes I cancelled the test and experiment package is available on mrg3.lab.bos.redhat.com:/root/qpid_test_broker_perftest_ecte_bz468428_hang_081202.tar.bz2 Both qpidd and perftest switched to trace log mode so log files available in: qpid_test_broker_perftest_ecte_bz468428/perftest.transcript.log qpid_test_broker_perftest_ecte_bz468428/qpidd.transcript.log pstree & ps gave this (afte 40 mins of inactivity) make(4131)───runtest.sh(4135)─┬─runtest.sh(4150)─┬─perftest(4281)─┬─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ ├─{perftest} │ │ └─{perftest} │ └─runtest.sh(4237)───runtest.sh(4238)───qpidd(4241)─┬─{qpidd}(4243) └─tee(4151) [root@nec-em15 qpid_test_broker_perftest_ecte_bz468428]# ps auxw | egrep '(4131|4135|4150|4151|4281|4237|4238|4241|429[1-9]|424[3-9]|4250)' root 4131 0.0 0.0 59164 896 pts/0 S+ 10:19 0:00 make rb root 4135 0.0 0.1 64480 1876 pts/0 S+ 10:19 0:00 /bin/bash ./runtest.sh root 4150 0.0 0.1 64480 1408 pts/0 S+ 10:19 0:00 /bin/bash ./runtest.sh root 4151 0.0 0.0 58904 584 pts/0 S+ 10:19 0:00 tee /root/qpid_test_broker_perftest_ecte_bz468428/qpid_test_broker_perftest_ecte_bz468428.log root 4237 0.0 0.1 64480 1068 pts/0 S+ 10:19 0:00 /bin/bash ./runtest.sh root 4238 0.0 0.1 64480 1288 pts/0 S+ 10:19 0:00 /bin/bash ./runtest.sh root 4241 0.7 3.7 394400 38728 pts/0 Sl+ 10:19 0:13 /usr/sbin/qpidd --data-dir /root/qpid_test_broker_perftest_ecte_bz468428/rhts_qpidd/20081202_101939/broker.4229 --port 5672 --auth no --log-enable trace+ root 4281 0.2 3.1 389512 32396 pts/0 Sl+ 10:19 0:03 /usr/bin/perftest --durable yes --tcp-nodelay --log-enable trace+ --npubs 4 --nsubs 4 --sub-tx 10 pstack gave to me "" output on both PIDs: 4281 (perftest), 4241(qpidd) I can re-run somewhere else if needed.
I believe this is not-a-bug (at least based on the logs on mrg3). The publisher is failing due to "PublishThread exception: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:510)" which is expected as the subscribers will be much slower than the publishers and the queue will grow until the publishers complete. As there is four of them using default count (500,000) and size (1k) the journal capacity needs to be above 2G to avoid running out of capacity. When the publisher fails midway, the subscribers will never get the required number of messages and will wait for ever.
I performed test with reduced number of messages: perftest --durable yes --tcp-nodelay --count 5000 --npubs 4 --nsubs 4 --sub-tx 10 and now do not see PublishThread exception: Enqueue capacity threshold exceeded on queue "perftest0" anymore and perftest normally passes also enlarging the journal will make perftest passing. ->CLOSE && NOT-A-BUG