Description of problem:
qpid broker sometimes forgets to send connection.heartbeat. It does so on a fairly random basis; the more connections it handles, the (much) higher the probability. I.e. for 100 connections with heartbeat=5 there are a few such "misses" in 30 minutes, while for 500 connections with heartbeat=5 usually all connections are affected within 30 minutes.

The reproducer is trivial: just open sufficiently many connections with heartbeats set.

I noticed this is a regression due to bz1093996 / QPID-5758 / svn commit 1594220, i.e. a regression between MRG-M 3.0 (not present) and MRG-M 3.1 (present). I confirmed the regression by running today's upstream qpid (bug present) and qpid with commit 1594220 rolled back (no bug).

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.30-2

How reproducible:
100% within 30 minutes

Steps to Reproduce:
1. qpidd --queue-purge-interval=20 --auth=no --max-connections=10000 &
2. run the script ./reproduce_ttl_timeout.sh below, which creates 500 auto-delete queues over 500 connections with heartbeat=5. The script checks every 10 seconds whether the number of client connections has decreased.

Actual results:
script terminates soon with:
error: found just 225 connections instead of 500, in iteration 2

Expected results:
script terminates after 30 minutes with "no error"

Additional info:
script:

noDelQueues=500
maxIter=180
mySleep=10

echo "creating connections.."
for i in $(seq 1 $noDelQueues); do
    qpid-receive --connection-options "{'heartbeat':5}" -a "autoDelQueueNoBound_${i}; {create:always, node:{ x-declare:{auto-delete:True, arguments:{'qpid.auto_delete_timeout':1}}}}" -f --print-content=no > /dev/null 2>&1 &
    sleep 0.1
done

iter=0
while true; do
    iter=$((iter + 1))
    conns=$(pgrep qpid-receive | wc -w)
    echo "$(date): iteration:$iter connections:$conns"
    if [ $conns -lt $noDelQueues ]; then
        echo "error: found just $conns connections instead of $noDelQueues, in iteration $iter"
        break
    fi
    if [ $iter -eq $maxIter ]; then
        echo "no error"
        break
    fi
    sleep $mySleep
done
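A broker-side variant of the connection count check can be used to cross-check the client-side pgrep count. This is a minimal sketch, not part of the original reproducer; it assumes qpid-tools is installed and the broker listens on the default localhost:5672:

# count connections as the broker sees them instead of counting client processes;
# the cproc column of 'qpid-stat -c' carries the client process name (qpid-receive)
while true; do
    conns=$(qpid-stat -c 2>/dev/null | grep -c qpid-receive)
    echo "$(date): broker reports $conns qpid-receive connections"
    sleep 10
done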
Quite important notice/observation for developers: I do not send a single message to the broker, so the purge task should complete very quickly. I just create sufficiently many connections with heartbeats.
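To confirm it is really the broker that stops emitting heartbeats (rather than the clients mis-detecting them), broker-side logging can be raised. A hedged sketch, assuming heartbeat/control frames are visible at trace level (the exact log category may differ between versions):

# same broker options as in the reproducer, plus trace logging to a file
qpidd --queue-purge-interval=20 --auth=no --max-connections=10000 \
      --log-enable trace+ --log-to-file /tmp/qpidd.trace &
# after a few heartbeat intervals, count per-connection heartbeat frames seen so far
grep -ci heartbeat /tmp/qpidd.trace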
Fix available upstream: https://svn.apache.org/r1636848
Results from my repetitive testing:

Upstream broker with commit r1594220 reverted (original behaviour):
- problem never appeared. Tried 5 times with 1000 connections for 30 minutes, no connection drop.

Upstream broker before Gordon's patch (i.e. before r1636848):
- problem reproducible very easily; from 300 connections onward it is almost certain that all connections are gone within 10 minutes.

Upstream broker with Gordon's patch (i.e. after r1636848):
- problem reproducible seldom. I.e. for 500 connections there is approx. 50% probability the issue appears within 30 minutes. More connections (1000), higher probability.

Conclusions:
- Gordon's patch (r1636848) fixes the vast majority of bug occurrences
- even with the patch, there is still some regression due to r1594220
From my (less extensive) testing, *without* the most recent fix, I see problems for 500 connections almost immediately, within the first couple of iterations of checking. For 1000 connections I ran it five times and it failed on the first iteration each time. Retesting *with* the fix, all five runs (1000 connections each) ran to the no-error conclusion.
New patch from Chuck that should address the remaining issues: https://svn.apache.org/r1642681
Verified on qpid-cpp-server-0.30-5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html