+++ This bug was initially created as a clone of Bug #1093996 +++ Description of problem: Purging TTL-expired messages prevents (all?) other timers to be fired / executing their job. That especially causes problems for connection heartbeat firing. If having e.g. few hundred queues with few thousands durable TTL-expired messages each, the purge action takes few minutes and during whole that time, no heartbeat timer is fired. That causes all affected connections (waiting for heartbeat) to timeout after 2*heartbeat seconds (at least C++ client does not initiate connection.heartbeat but just waits for the frame from the broker; if the link is silent for 2*heartbeat, it drops the connection). Version-Release number of selected component (if applicable): qpid-cpp 0.22-38 How reproducible: 100% Steps to Reproduce: 1. Have in /etc/qpid/qpidd.conf: auth=no log-to-file=/tmp/qpidd.log log-enable=trace+:purge log-enable=info+ log-thread=yes queue-purge-interval=300 default-queue-limit=0 2. Execute below commands in a script (to have 200 queues with 10k expired messages to be purged in one shot; the while cycle below waits for purge interval to elapse to ensure all messages are enqueued before the _next_ purging, such that one purging removes _all_ those messages): rm -rf /tmp/qpidd.log /var/lib/qpidd/*; service qpidd restart queues=200 for i in $(seq 1 $queues); do keyId=$(($((i))/50)) qpid-receive -a "Durable_${i}; {create:always, node:{type:queue, durable:true, x-bindings:[{ exchange:'amq.fanout', queue:'Durable_${i}' }, { exchange:'amq.direct', queue:'Durable_${i}', key:'key_${keyId}' }] }}" done while [ $(grep -c Purged /tmp/qpidd.log) -eq 0 ]; do echo "$(date): waiting for qpid-send until purge happened" sleep 10 done echo "$(date): starting qpid-send" for i in $(seq 0 3); do qpid-send -a "amq.direct/key_${i}" -m 10000 --durable=yes --ttl=3000 & sleep 3 done wait sleep 3 echo "$(date): starting the testing clients with heartbeat" qpid-receive -a "tempQueue; {create:always}" -m 1 -f --connection-options "{heartbeat:10, reconnect:true}" & qpid-send -a amq.fanout -m 10000 --durable=yes --connection-options "{heartbeat:10, reconnect:true}" & # end of the script 3. Check if the qpid-send and qpid-receive will re-connect Actual results: Both clients reconnect few times. Expected results: Neither client should reconnect. Additional info: Often the reconnect fails with "Channel exception: session-busy: Session already attached:" error - that is just a side effect. It is worth comparing "Purged 10000 expired messages from" in qpidd log with tcpdump when broker stopped firing connection.heartbeat, to see the purging really triggers the heartbeat issue. Possible solutions: 1) Have independent timer for each queue to run purgeExpired method 2) Or have the timers independent on each other --- Additional comment from Pavel Moravec on 2014-05-09 06:33:47 EDT --- There is a significant performance regression between 0.18-20 and 0.22-38 in purgeExpired method, when traversing even transient messages without TTL. Reproducer: 1) have in qpidd.conf: log-to-file=/tmp/qpidd.log log-enable=trace+:purge log-enable=info+ log-hires-timestamp=yes queue-purge-interval=60 default-queue-limit=0 2) Provision 500 queue bound to amq.fanout: for i in $(seq 1 500); do qpid-receive -a "queue_${i}; {create:always, node:{type:queue, durable:false, x-bindings:[{ exchange:'amq.fanout', queue:'queue_${i}' }] }}" done 3) send there some messages: qpid-send -a "amq.fanout" -m 50000 --durable=yes --content-size=1024 4) Wait for first "Purged" message in log and stop broker then: purged=$(grep -c Purged /tmp/qpidd.log) while [ $(grep -c Purged /tmp/qpidd.log) -eq $purged ]; do echo "$(date): waiting until next purge happens" sleep 10 done service qpidd stop 5) Count time difference between 1st and last "Purged" log On my VM, purge check traversal takes 2.929318886s on 3.0, while just 0.965582676s on 2.4. That means three times more (!). (to see "Purged XYZ expired messages from .." on 0.18-20, let apply patch: diff -rup _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp --- _qpid-0.18_orig/cpp/src/qpid/broker/Queue.cpp 2014-03-22 13:33:27.706897195 +0100 +++ _qpid-0.18_new/cpp/src/qpid/broker/Queue.cpp 2014-04-29 09:10:41.994011625 +0200 @@ -658,6 +658,7 @@ void Queue::purgeExpired(sys::Duration l if (seconds == 0 || count / seconds < 1) { std::deque<QueuedMessage> dequeued; dequeueIf(boost::bind(&isExpired, _1), dequeued); + QPID_LOG(debug, "Purged " << dequeued.size() << " expired messages from " << getName()); if (dequeued.size()) { if (mgmtObject) { mgmtObject->inc_discardsTtl(dequeued.size()); there. Or ask me for rpms with the patch applied) --- Additional comment from Pavel Moravec on 2014-05-09 07:28:26 EDT --- Some further comparison: 0.18 0.22 perf.degradation 500q, 50kM 0.965582676 2.929318886 3.0337318169 100q, 50kM 0.169874585 0.528339322 3.110172849 100q, 100kM 0.555664606 1.088116653 1.9582255937 1q, 1mM 0.056135355 0.094856374 1.6897795338 1q, 2mM(e) 0.086394264 0.177540997 2.0550090802 "500q, 50kM" means 500 queues got 50k messages via fanout exchange, (e) means messages were empty. I dont see a big correlation among #queues, #messages or message length.. :-/ --- Additional comment from Gordon Sim on 2014-05-13 09:51:41 EDT --- Fixed upstream: https://svn.apache.org/r1594220
Verified on qpid-cpp-client-0.18-38 qpid-cpp-server-0.18-38 works as expected. Clients are obtaining heartbeats while big (ttl) message purging occurs, so they are not getting disconnected due to missing heartbeats. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0661.html