Bug 1093996

Summary: Purging TTL expired messages blocks all other timers, causing connection drops
Product: Red Hat Enterprise MRG
Reporter: Pavel Moravec <pmoravec>
Component: qpid-cpp
Assignee: Gordon Sim <gsim>
Status: CLOSED ERRATA
QA Contact: Michal Toth <mtoth>
Severity: medium
Priority: high
Version: 3.0
CC: freznice, gsim, jross, kgiusti, mcressma, mtoth
Target Milestone: 3.1
Keywords: TestCaseProvided
Hardware: All
OS: Linux
Fixed In Version: qpid-cpp-0.30-2
Doc Type: Bug Fix
Doc Text:
It was discovered that purging expired messages could take a long time, which prevented the timer thread from servicing other periodic tasks, such as heartbeats, in a timely manner. The timer thread is now used only to trigger the purge, with the actual purge offloaded to one of the broker's worker threads. This change allows the timer thread to keep track of time correctly and lets other jobs, such as heartbeat reporting, function correctly.
Last Closed: 2015-04-14 13:48:00 UTC
Type: Bug
Bug Blocks: 1142833    

Description Pavel Moravec 2014-05-04 07:36:37 UTC
Description of problem:
Purging TTL-expired messages prevents (all?) other timers from firing and executing their jobs. That especially causes problems for connection heartbeats.

With, for example, a few hundred queues each holding a few thousand durable TTL-expired messages, the purge takes a few minutes, and during that whole time no heartbeat timer fires. That causes all affected connections (waiting for a heartbeat) to time out after 2*heartbeat seconds (at least the C++ client does not initiate connection.heartbeat itself but just waits for the frame from the broker; if the link is silent for 2*heartbeat, it drops the connection).


Version-Release number of selected component (if applicable):
qpid-cpp 0.22-38


How reproducible:
100%


Steps to Reproduce:
1. Have in /etc/qpid/qpidd.conf:
auth=no
log-to-file=/tmp/qpidd.log
log-enable=trace+:purge
log-enable=info+
log-thread=yes
queue-purge-interval=300
default-queue-limit=0

2. Execute the commands below in a script (to create 200 queues with 10k expired messages each, to be purged in one shot; the while loop waits for a purge interval to elapse, ensuring all messages are enqueued before the _next_ purge, so that a single purge removes _all_ of them):
rm -rf /tmp/qpidd.log /var/lib/qpidd/*; service qpidd restart
queues=200
for i in $(seq 1 $queues); do
	keyId=$((i / 50))
	qpid-receive -a "Durable_${i}; {create:always, node:{type:queue, durable:true, x-bindings:[{ exchange:'amq.fanout', queue:'Durable_${i}' }, { exchange:'amq.direct', queue:'Durable_${i}', key:'key_${keyId}' }] }}"
done

while [ $(grep -c Purged /tmp/qpidd.log) -eq 0 ]; do
	echo "$(date): waiting for qpid-send until purge happened"
	sleep 10
done
echo "$(date): starting qpid-send"
for i in $(seq 0 3); do
        qpid-send -a "amq.direct/key_${i}" -m 10000 --durable=yes --ttl=3000 &
        sleep 3
done
wait

sleep 3
echo "$(date): starting the testing clients with heartbeat"

qpid-receive -a "tempQueue; {create:always}" -m 1 -f --connection-options "{heartbeat:10, reconnect:true}" &
qpid-send -a amq.fanout -m 10000 --durable=yes --connection-options "{heartbeat:10, reconnect:true}" &
# end of the script

3. Check whether qpid-send and qpid-receive reconnect.
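One way to check this (a sketch, not from the original report; it assumes the broker listens on the loopback interface on the default port 5672) is to count new TCP connections while the two clients run; with no reconnects, only the two initial connection attempts should appear:

# Count connection attempts to the broker; seeing more than 2 SYNs
# means at least one client dropped its connection and reconnected.
tcpdump -nn -i lo 'tcp dst port 5672 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'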


Actual results:
Both clients reconnect a few times.


Expected results:
Neither client should reconnect.


Additional info:
Often the reconnect fails with a "Channel exception: session-busy: Session already attached" error; that is just a side effect.

It is worth comparing the "Purged 10000 expired messages from" entries in the qpidd log with a tcpdump capture from the period when the broker stopped sending connection.heartbeat frames, to confirm that the purging really triggers the heartbeat issue.
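A possible capture for that comparison (a sketch; it assumes the loopback interface and the default port 5672): print a wall-clock timestamp per frame and look for silent gaps longer than the 10 s heartbeat interval around the "Purged" timestamps:

# -tttt prints an absolute timestamp per packet; a gap longer than the
# heartbeat interval on an otherwise idle connection means no
# connection.heartbeat frame was sent during that window.
tcpdump -nn -tttt -i lo port 5672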


Possible solutions:
1) Give each queue an independent timer to run the purgeExpired method
2) Or make the timers independent of each other, so that one long-running task cannot block the rest

Comment 1 Pavel Moravec 2014-05-09 10:33:47 UTC
There is a significant performance regression between 0.18-20 and 0.22-38 in the purgeExpired method, even when traversing transient messages without a TTL. Reproducer:

1) have in qpidd.conf:
log-to-file=/tmp/qpidd.log
log-enable=trace+:purge
log-enable=info+
log-hires-timestamp=yes
queue-purge-interval=60
default-queue-limit=0

2) Provision 500 queues bound to amq.fanout:
for i in $(seq 1 500); do
  qpid-receive -a "queue_${i}; {create:always, node:{type:queue, durable:false, x-bindings:[{ exchange:'amq.fanout', queue:'queue_${i}' }] }}"
done

3) send there some messages:
qpid-send -a "amq.fanout" -m 50000 --durable=yes --content-size=1024
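
(An optional sanity check, not part of the original reproducer: qpid-stat from the qpid-tools package should show each of the 500 queues at its full depth before moving on.)

# Each queue_N should report a depth of 50000 messages.
qpid-stat -q | head -n 20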

4) Wait for the next "Purged" message in the log, then stop the broker:
purged=$(grep -c Purged /tmp/qpidd.log)
while [ $(grep -c Purged /tmp/qpidd.log) -eq $purged ]; do
	echo "$(date): waiting until next purge happens"
	sleep 10
done
service qpidd stop

5) Compute the time difference between the first and last "Purged" log entries
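A sketch for this step, assuming log-hires-timestamp=yes puts a sortable date and a high-resolution time in the first two fields of every log line:

# Print the timestamps of the first and the last "Purged" entry;
# their difference is the total purge traversal time.
grep Purged /tmp/qpidd.log | awk '{print $1, $2}' | sed -n '1p;$p'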

On my VM, the purge check traversal takes 2.929318886 s on MRG 3.0 (qpid-cpp 0.22), while just 0.965582676 s on MRG 2.4 (qpid-cpp 0.18). That is roughly three times slower (!).



(To see "Purged XYZ expired messages from .." on 0.18-20, apply this patch:
diff -rup _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp
--- _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp	2014-03-22 13:33:27.706897195 +0100
+++ _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp	2014-04-29 09:10:41.994011625 +0200
@@ -658,6 +658,7 @@ void Queue::purgeExpired(sys::Duration l
     if (seconds == 0 || count / seconds < 1) {
         std::deque<QueuedMessage> dequeued;
         dequeueIf(boost::bind(&isExpired, _1), dequeued);
+	QPID_LOG(debug, "Purged " << dequeued.size() << " expired messages from " << getName());
         if (dequeued.size()) {
             if (mgmtObject) {
                 mgmtObject->inc_discardsTtl(dequeued.size());

there, or ask me for RPMs with the patch applied.)

Comment 2 Pavel Moravec 2014-05-09 11:28:26 UTC
Some further comparison:

	        0.18 (s)	0.22 (s)	slowdown (x)
500q, 50kM	0.965582676	2.929318886	3.0337318169
100q, 50kM	0.169874585	0.528339322	3.110172849
100q, 100kM	0.555664606	1.088116653	1.9582255937
1q, 1mM   	0.056135355	0.094856374	1.6897795338
1q, 2mM (e)	0.086394264	0.177540997	2.0550090802

"500q, 50kM" means 500 queues got 50k messages via fanout exchange, (e) means messages were empty.

I don't see a strong correlation with the number of queues, the number of messages, or the message length.. :-/

Comment 3 Gordon Sim 2014-05-13 13:51:41 UTC
Fixed upstream: https://svn.apache.org/r1594220

Comment 5 Justin Ross 2014-10-10 11:38:40 UTC
*** Bug 666944 has been marked as a duplicate of this bug. ***

Comment 8 Michal Toth 2015-03-09 07:26:11 UTC
Ran on RHEL 6.x and RHEL 7.x using
qpid-cpp-server-0.30-6.el6
qpid-cpp-server-0.30-6.el7.x86_64


A large purge no longer disconnects clients with a short heartbeat.
Marking as verified.

Comment 10 errata-xmlrpc 2015-04-14 13:48:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html