Bug 1093996 - Purging TTL expired messages blocks all other timers, causing connection drops
Summary: Purging TTL expired messages blocks all other timers, causing connection drops
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: All
OS: Linux
Target Milestone: 3.1
Assignee: Gordon Sim
QA Contact: Michal Toth
Duplicates: 666944 (view as bug list)
Depends On:
Blocks: 1142833
Reported: 2014-05-04 07:36 UTC by Pavel Moravec
Modified: 2019-06-13 07:59 UTC (History)
6 users

Fixed In Version: qpid-cpp-0.30-2
Doc Type: Bug Fix
Doc Text:
It was discovered that purging expired messages could take a long time, which prevented the timer thread from servicing other periodic tasks, such as heartbeats, in a timely manner. The timer thread is now used only to trigger the purge; the actual purge task is offloaded to one of the broker's worker threads. This change allows the timer thread to keep accurate track of time and lets other jobs, such as heartbeat reporting, run on schedule.
Clone Of:
: 1142833 (view as bug list)
Last Closed: 2015-04-14 13:48:00 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Apache JIRA QPID-5758 None None None Never
Red Hat Product Errata RHEA-2015:0805 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.1 Release 2015-04-14 17:45:54 UTC

Description Pavel Moravec 2014-05-04 07:36:37 UTC
Description of problem:
Purging TTL-expired messages prevents (possibly all) other timers from firing and executing their jobs. That especially causes problems for firing connection heartbeats.

With, e.g., a few hundred queues each holding a few thousand durable TTL-expired messages, the purge takes a few minutes, and during that whole time no heartbeat timer fires. That causes all affected connections (waiting for a heartbeat) to time out after 2*heartbeat seconds (at least the C++ client does not initiate connection.heartbeat itself but just waits for the frame from the broker; if the link is silent for 2*heartbeat, it drops the connection).

Version-Release number of selected component (if applicable):
qpid-cpp 0.22-38

How reproducible:

Steps to Reproduce:
1. Have in /etc/qpid/qpidd.conf:

2. Execute the commands below in a script (to have 200 queues with 10k expired messages each, purged in one shot; the while cycle below waits for the purge interval to elapse, ensuring all messages are enqueued before the _next_ purge so that one purge removes _all_ those messages):
queues=200
rm -rf /tmp/qpidd.log /var/lib/qpidd/*; service qpidd restart
for i in $(seq 1 $queues); do
	keyId=$((i % 4))	# assumption: spread bindings over the four keys qpid-send uses below
	qpid-receive -a "Durable_${i}; {create:always, node:{type:queue, durable:true, x-bindings:[{ exchange:'amq.fanout', queue:'Durable_${i}' }, { exchange:'amq.direct', queue:'Durable_${i}', key:'key_${keyId}' }] }}"
done

while [ $(grep -c Purged /tmp/qpidd.log) -eq 0 ]; do
	echo "$(date): waiting for qpid-send until purge happened"
	sleep 10
done
echo "$(date): starting qpid-send"
for i in $(seq 0 3); do
	qpid-send -a "amq.direct/key_${i}" -m 10000 --durable=yes --ttl=3000 &
	sleep 3
done

sleep 3
echo "$(date): starting the testing clients with heartbeat"

qpid-receive -a "tempQueue; {create:always}" -m 1 -f --connection-options "{heartbeat:10, reconnect:true}" &
qpid-send -a amq.fanout -m 10000 --durable=yes --connection-options "{heartbeat:10, reconnect:true}" &
# end of the script

3. Check if the qpid-send and qpid-receive will re-connect

Actual results:
Both clients reconnect a few times.

Expected results:
Neither client should reconnect.

Additional info:
Often the reconnect fails with a "Channel exception: session-busy: Session already attached" error - that is just a side effect.

It is worth comparing the "Purged 10000 expired messages from" entries in the qpidd log with a tcpdump taken when the broker stopped firing connection.heartbeat, to confirm that the purging really triggers the heartbeat issue.

Possible solutions:
1) Have an independent timer for each queue to run the purgeExpired method
2) Or make the timers independent of each other

Comment 1 Pavel Moravec 2014-05-09 10:33:47 UTC
There is a significant performance regression between 0.18-20 and 0.22-38 in the purgeExpired method, even when traversing transient messages without TTL. Reproducer:

1) have in qpidd.conf:

2) Provision 500 queues bound to amq.fanout:
for i in $(seq 1 500); do
  qpid-receive -a "queue_${i}; {create:always, node:{type:queue, durable:false, x-bindings:[{ exchange:'amq.fanout', queue:'queue_${i}' }] }}"
done

3) send there some messages:
qpid-send -a "amq.fanout" -m 50000 --durable=yes --content-size=1024

4) Wait for the next "Purged" message in the log and then stop the broker:
purged=$(grep -c Purged /tmp/qpidd.log)
while [ $(grep -c Purged /tmp/qpidd.log) -eq $purged ]; do
	echo "$(date): waiting until next purge happens"
	sleep 10
done
service qpidd stop

5) Compute the time difference between the first and last "Purged" log entries

On my VM, the purge check traversal takes 2.929318886s on 3.0 (qpid-cpp 0.22-38) but just 0.965582676s on 2.4 (qpid-cpp 0.18-20). That is three times slower (!).

(to see "Purged XYZ expired messages from .." on 0.18-20, apply this patch:
diff -rup _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp
--- _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp	2014-03-22 13:33:27.706897195 +0100
+++ _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp	2014-04-29 09:10:41.994011625 +0200
@@ -658,6 +658,7 @@ void Queue::purgeExpired(sys::Duration l
     if (seconds == 0 || count / seconds < 1) {
         std::deque<QueuedMessage> dequeued;
         dequeueIf(boost::bind(&isExpired, _1), dequeued);
+	QPID_LOG(debug, "Purged " << dequeued.size() << " expired messages from " << getName());
         if (dequeued.size()) {
             if (mgmtObject) {

Or ask me for rpms with the patch applied)

Comment 2 Pavel Moravec 2014-05-09 11:28:26 UTC
Some further comparison:

	        0.18 [s]	0.22 [s]	slowdown
500q, 50kM	0.965582676	2.929318886	3.0337318169
100q, 50kM	0.169874585	0.528339322	3.110172849
100q, 100kM	0.555664606	1.088116653	1.9582255937
1q, 1mM 	0.056135355	0.094856374	1.6897795338
1q, 2mM(e)	0.086394264	0.177540997	2.0550090802

"500q, 50kM" means 500 queues got 50k messages via fanout exchange, (e) means messages were empty.

I don't see a strong correlation with #queues, #messages or message length.. :-/

Comment 3 Gordon Sim 2014-05-13 13:51:41 UTC
Fixed upstream: https://svn.apache.org/r1594220

Comment 5 Justin Ross 2014-10-10 11:38:40 UTC
*** Bug 666944 has been marked as a duplicate of this bug. ***

Comment 8 Michal Toth 2015-03-09 07:26:11 UTC
Ran on RHEL 6.x and RHEL 7.x using

A bigger purge no longer disconnects clients with a quick heartbeat.
Marking as verified.

Comment 10 errata-xmlrpc 2015-04-14 13:48:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

