1142833 – Purging TTL expired messages blocks all other timers, causing connection drops

Bug 1142833 - Purging TTL expired messages blocks all other timers, causing connection drops

Summary: Purging TTL expired messages blocks all other timers, causing connection drops

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	2.5
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	2.5.13
Target Release:	---
Assignee:	messaging-bugs
QA Contact:	Michal Toth
Docs Contact:
URL:
Whiteboard:
Depends On:	1093996
Blocks:	1191757 1191759
TreeView+	depends on / blocked

Reported:	2014-09-17 13:36 UTC by Mike Cressman
Modified:	2019-09-12 08:00 UTC (History)
CC List:	11 users (show)
Fixed In Version:	qpid-cpp-0.18-38
Doc Type:	Bug Fix
Doc Text:	It was discovered that because message purging was performed on a timer thread, large purge events could have caused all other timer tasks to be delayed. Because heartbeats were also driven by a timer on this thread, this could have resulted in clients timing out because they were not receiving heartbeats. The fix moves expired message purging from the timer thread to a worker thread, which allow long-running expired message purges to not affect timer tasks such as the heartbeat timer.
Clone Of:	1093996
Environment:
Last Closed:	2015-03-09 13:50:06 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Apache JIRA	QPID-5758	0	None	None	None	Never
Red Hat Product Errata	RHSA-2015:0661	0	normal	SHIPPED_LIVE	Moderate: qpid-cpp security and bug fix update	2015-03-09 18:41:05 UTC

Description Mike Cressman 2014-09-17 13:36:06 UTC

+++ This bug was initially created as a clone of Bug #1093996 +++

Description of problem:
Purging TTL-expired messages prevents (all?) other timers to be fired / executing their job. That especially causes problems for connection heartbeat firing.

If having e.g. few hundred queues with few thousands durable TTL-expired messages each, the purge action takes few minutes and during whole that time, no heartbeat timer is fired. That causes all affected connections (waiting for heartbeat) to timeout after 2*heartbeat seconds (at least C++ client does not initiate connection.heartbeat but just waits for the frame from the broker; if the link is silent for 2*heartbeat, it drops the connection).


Version-Release number of selected component (if applicable):
qpid-cpp 0.22-38


How reproducible:
100%


Steps to Reproduce:
1. Have in /etc/qpid/qpidd.conf:
auth=no
log-to-file=/tmp/qpidd.log
log-enable=trace+:purge
log-enable=info+
log-thread=yes
queue-purge-interval=300
default-queue-limit=0

2. Execute below commands in a script (to have 200 queues with 10k expired messages to be purged in one shot; the while cycle below waits for purge interval to elapse to ensure all messages are enqueued before the _next_ purging, such that one purging removes _all_ those messages):
rm -rf /tmp/qpidd.log /var/lib/qpidd/*; service qpidd restart
queues=200
for i in $(seq 1 $queues); do
	keyId=$(($((i))/50))
	qpid-receive -a "Durable_${i}; {create:always, node:{type:queue, durable:true, x-bindings:[{ exchange:'amq.fanout', queue:'Durable_${i}' }, { exchange:'amq.direct', queue:'Durable_${i}', key:'key_${keyId}' }] }}"
done

while [ $(grep -c Purged /tmp/qpidd.log) -eq 0 ]; do
	echo "$(date): waiting for qpid-send until purge happened"
	sleep 10
done
echo "$(date): starting qpid-send"
for i in $(seq 0 3); do
        qpid-send -a "amq.direct/key_${i}" -m 10000 --durable=yes --ttl=3000 &
        sleep 3
done
wait

sleep 3
echo "$(date): starting the testing clients with heartbeat"

qpid-receive -a "tempQueue; {create:always}" -m 1 -f --connection-options "{heartbeat:10, reconnect:true}" &
qpid-send -a amq.fanout -m 10000 --durable=yes --connection-options "{heartbeat:10, reconnect:true}" &
# end of the script

3. Check if the qpid-send and qpid-receive will re-connect


Actual results:
Both clients reconnect few times.


Expected results:
Neither client should reconnect.


Additional info:
Often the reconnect fails with "Channel exception: session-busy: Session already attached:" error - that is just a side effect.

It is worth comparing "Purged 10000 expired messages from" in qpidd log with tcpdump when broker stopped firing connection.heartbeat, to see the purging really triggers the heartbeat issue.


Possible solutions:
1) Have independent timer for each queue to run purgeExpired method
2) Or have the timers independent on each other

--- Additional comment from Pavel Moravec on 2014-05-09 06:33:47 EDT ---

There is a significant performance regression between 0.18-20 and 0.22-38 in purgeExpired method, when traversing even transient messages without TTL. Reproducer:

1) have in qpidd.conf:
log-to-file=/tmp/qpidd.log
log-enable=trace+:purge
log-enable=info+
log-hires-timestamp=yes
queue-purge-interval=60
default-queue-limit=0

2) Provision 500 queue bound to amq.fanout:
for i in $(seq 1 500); do
  qpid-receive -a "queue_${i}; {create:always, node:{type:queue, durable:false, x-bindings:[{ exchange:'amq.fanout', queue:'queue_${i}' }] }}"
done

3) send there some messages:
qpid-send -a "amq.fanout" -m 50000 --durable=yes --content-size=1024

4) Wait for first "Purged" message in log and stop broker then:
purged=$(grep -c Purged /tmp/qpidd.log)
while [ $(grep -c Purged /tmp/qpidd.log) -eq $purged ]; do
	echo "$(date): waiting until next purge happens"
	sleep 10
done
service qpidd stop

5) Count time difference between 1st and last "Purged" log

On my VM, purge check traversal takes 2.929318886s on 3.0, while just 0.965582676s on 2.4. That means three times more (!).



(to see "Purged XYZ expired messages from .." on 0.18-20, let apply patch:
diff -rup _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp
--- _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp	2014-03-22 13:33:27.706897195 +0100
+++ _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp	2014-04-29 09:10:41.994011625 +0200
@@ -658,6 +658,7 @@ void Queue::purgeExpired(sys::Duration l
     if (seconds == 0 || count / seconds < 1) {
         std::deque<QueuedMessage> dequeued;
         dequeueIf(boost::bind(&isExpired, _1), dequeued);
+	QPID_LOG(debug, "Purged " << dequeued.size() << " expired messages from " << getName());
         if (dequeued.size()) {
             if (mgmtObject) {
                 mgmtObject->inc_discardsTtl(dequeued.size());

there. Or ask me for rpms with the patch applied)

--- Additional comment from Pavel Moravec on 2014-05-09 07:28:26 EDT ---

Some further comparison:

	        0.18	        0.22	        perf.degradation
500q, 50kM	0.965582676	2.929318886	3.0337318169
100q, 50kM	0.169874585	0.528339322	3.110172849
100q, 100kM	0.555664606	1.088116653	1.9582255937
1q, 1mM         0.056135355	0.094856374	1.6897795338
1q, 2mM(e)	0.086394264	0.177540997	2.0550090802

"500q, 50kM" means 500 queues got 50k messages via fanout exchange, (e) means messages were empty.

I dont see a big correlation among #queues, #messages or message length.. :-/

--- Additional comment from Gordon Sim on 2014-05-13 09:51:41 EDT ---

Fixed upstream: https://svn.apache.org/r1594220

Comment 12 Michal Toth 2015-02-25 07:52:02 UTC

Verified on 
qpid-cpp-client-0.18-38
qpid-cpp-server-0.18-38

works as expected. Clients are obtaining heartbeats while big (ttl) message purging occurs, so they are not getting disconnected due to missing heartbeats.
Marking as verified.

Comment 14 errata-xmlrpc 2015-03-09 13:50:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0661.html

Note You need to log in before you can comment on or make changes to this bug.