Bug 454463

Summary:	topic tests fails to complete - not delivering all msgs
Product:	Red Hat Enterprise MRG	Reporter:	David Sommerseth <davids>
Component:	qpid-cpp	Assignee:	Kim van der Riet <kim.vdriet>
Status:	CLOSED NOTABUG	QA Contact:	Kim van der Riet <kim.vdriet>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	1.0	CC:	ovasik
Target Milestone:	1.0.1
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-07-15 20:26:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description David Sommerseth 2008-07-08 16:44:12 UTC

When running topic tests with BDB storage and topic_listener + topic_publisher
with --durable --transactional it seems to stop at some point ...

Qpid SVN rev: 674822
BDB SVN rev: 2179

--------------------------------------------------------------------------
*** pstack: broker
Thread 9 (Thread 1082132800 (LWP 23890)):
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00002b71d5b6b57f in qpid::broker::Timer::run ()
#2  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#3  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 8 (Thread 1090525504 (LWP 23891)):
#0  0x0000003b8480a496 in pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x00002b71d5b6b331 in qpid::broker::Timer::run ()
#2  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#3  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 7 (Thread 1098918208 (LWP 23892)):
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00002b71d5b6b57f in qpid::broker::Timer::run ()
#2  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#3  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 6 (Thread 1107310912 (LWP 23893)):
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00002b71d5b6b57f in qpid::broker::Timer::run ()
#2  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#3  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 5 (Thread 1115703616 (LWP 23894)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b71d5eff8cd in qpid::sys::Poller::wait ()
#2  0x00002b71d5f3f3b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#4  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 4 (Thread 1124096320 (LWP 23895)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b71d5eff8cd in qpid::sys::Poller::wait ()
#2  0x00002b71d5f3f3b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#4  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 3 (Thread 1132489024 (LWP 23896)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b71d5eff8cd in qpid::sys::Poller::wait ()
#2  0x00002b71d5f3f3b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#4  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 2 (Thread 1140881728 (LWP 23897)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b71d5eff8cd in qpid::sys::Poller::wait ()
#2  0x00002b71d5f3f3b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b71d5f0c57a in qpid::sys::Thread::runRunnable ()
#4  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 47767923555888 (LWP 23876)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b71d5eff8cd in qpid::sys::Poller::wait ()
#2  0x00002b71d5f3f3b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b71d5ae15af in qpid::broker::Broker::run ()
#4  0x0000000000407429 in main ()
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6


--------------------------------------------------------------------------
*** pstack: listener
Thread 2 (Thread 1082132800 (LWP 24017)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b6e360a58cd in qpid::sys::Poller::wait ()
#2  0x00002b6e360e53b8 in qpid::sys::Dispatcher::run ()
#3  0x00002b6e35d28646 in qpid::client::Connector::run ()
#4  0x00002b6e360b257a in qpid::sys::Thread::runRunnable ()
#5  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 47752356023536 (LWP 23901)):
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00002b6e35d43e42 in qpid::sys::Waitable::wait ()
#2  0x00002b6e35d41161 in qpid::client::Dispatcher::run ()
#3  0x000000000040bf6d in main ()
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()

--------------------------------------------------------------------------
*** pstack: publisher
Thread 2 (Thread 1082132800 (LWP 24075)):
#0  0x0000003b83cd1f58 in epoll_wait () from /lib64/libc.so.6
#1  0x00002af6df3518cd in qpid::sys::Poller::wait ()
#2  0x00002af6df3913b8 in qpid::sys::Dispatcher::run ()
#3  0x00002af6defd4646 in qpid::client::Connector::run ()
#4  0x00002af6df35e57a in qpid::sys::Thread::runRunnable ()
#5  0x0000003b848062f7 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003b83cd1b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 47239798099184 (LWP 24061)):
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00002af6defefe42 in qpid::sys::Waitable::wait ()
#2  0x00002af6deff527b in qpid::client::LocalQueue::get ()
#3  0x00002af6deff5d5e in qpid::client::LocalQueue::get ()
#4  0x00002af6deff5ea9 in qpid::client::LocalQueue::pop ()
#5  0x000000000040abf2 in Publisher::publish ()
#6  0x000000000040bc69 in main ()
#0  0x0000003b8480a687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()

Comment 1 David Sommerseth 2008-07-09 16:24:00 UTC

It seems to happen also without --transactional ... Is reproducible with
rhts_topictests.py from marple testkit.

rhts_topictests.py --rhts-mode -v -B /usr/sbin --topic-path /usr/bin
--broker-params "--data-dir /tmp/topic-qpiddata"  -L 50 --publisher-params
"--batches 30 --durable" --listener-params "--durable"

Reproduced on dell-pe1420-01.rhts.bos.redhat.com (RHTS managed)

Comment 2 Gordon Sim 2008-07-15 17:19:32 UTC

Is it possible that not all the listeners have registered their queues before
the publisher is started?

Comment 3 Kim van der Riet 2008-07-15 20:26:26 UTC

Gordon was spot on. The rhts_topictests.py script appears to wait 3 sec between
starting 50 listeners (in the example above) and the publisher. However, when
the test is run in durable mode, the store must create journal files for each of
the 50 queues. This can take around 10-20 secs for normal disks - each journal
has 8 files of 1.5MB. I increased the wait time to 60 secs in the script, and
the tests passed ok.

Ideally some kind of callback should be used when all the listeners have
finished initializing, but this would need support from the topic_listener program.

Comment 4 Kim van der Riet 2008-07-15 20:46:59 UTC

Note that the above scenario is for running the test on a clean journal dir
where the creation of durable queues for the first time results in journal files
being initialized on disk. If the 50 queues already exist, then time is taken
when the broker starts to recover the 50 journals - this could take several
seconds, but not as long as creating from scratch. In this case, once the broker
is recovered, starting the 50 listeners could fall within the 3 second window
because the queue already exists on the broker - or it could be a borderline
case. Much depends on the disk hardware for durable tests.