Bug 1160232 - [regression] broker sometimes forgets to send connection heartbeats
Summary: [regression] broker sometimes forgets to send connection heartbeats
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 3.1
Assignee: Gordon Sim
QA Contact: Michal Toth
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-11-04 11:56 UTC by Pavel Moravec
Modified: 2019-05-20 11:20 UTC (History)

Fixed In Version: qpid-cpp-0.30-5
Doc Type: Bug Fix
Doc Text:
It was discovered that the timer task for periodic queue purging was added to the timer's set of tasks twice. This caused the timer's internal state to become corrupted, which in turn prevented some tasks from being triggered. The logic is now fixed to ensure the task is not included multiple times within the timer.
Clone Of:
Environment:
Last Closed: 2015-04-14 13:48:42 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Apache JIRA | QPID-6213 | 0 | None | None | None | Never
Red Hat Product Errata | RHEA-2015:0805 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging 3.1 Release | 2015-04-14 17:45:54 UTC

Description Pavel Moravec 2014-11-04 11:56:11 UTC
Description of problem:
The qpid broker sometimes fails to send connection.heartbeat. The misses look random, but the more connections the broker handles, the (much) higher the probability of a miss. For example, with 100 connections with heartbeat=5 there are a few such "misses" in 30 minutes; with 500 connections with heartbeat=5, usually all connections are affected within 30 minutes.

The reproducer is trivial: just open a sufficient number of connections with heartbeats set.

This is a regression caused by bz1093996 / QPID-5758 / svn commit 1594220, i.e. the bug is not present in MRG-M 3.0 and is present in MRG-M 3.1. I confirmed the regression by running today's upstream qpid (bug present) and qpid with commit 1594220 rolled back (no bug).


Version-Release number of selected component (if applicable):
qpid-cpp-server-0.30-2


How reproducible:
100% within 30 minutes


Steps to Reproduce:
1. qpidd --queue-purge-interval=20 --auth=no --max-connections=10000 &
2. Run the script below (./reproduce_ttl_timeout.sh), which creates 500 queues on 500 connections with heartbeat=5. Every 10 seconds the script checks that the number of client connections has not decreased.


Actual results:
script terminates soon with:

error: found just 225 connections instead of 500, in iteration 2


Expected results:
script to terminate after 30 minutes with "no error"


Additional info:
script:

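# Number of clients; each one opens its own connection and creates its own auto-delete queue.
noDelQueues=500

# Check the client count up to maxIter times, mySleep seconds apart.
maxIter=180
mySleep=10

echo "creating connections.."
# Start the clients in the background, 0.1 s apart; -f keeps each receiver waiting forever.
for i in $(seq 1 $noDelQueues); do
  qpid-receive --connection-options "{'heartbeat':5}" -a "autoDelQueueNoBound_${i}; {create:always, node:{ x-declare:{auto-delete:True, arguments:{'qpid.auto_delete_timeout':1}}}}" -f --print-content=no > /dev/null 2>&1 &
  sleep 0.1
done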

iter=0
while true; do
  iter=$((iter+1))
  conns=$(pgrep qpid-receive | wc -w)
  echo "$(date): iteration:$iter connections:$conns"
  if [ $conns -lt $noDelQueues ]; then
    echo "error: found just $conns connections instead of $noDelQueues, in iteration $iter"
    break
  fi
  if [ $iter -eq $maxIter ]; then
    echo "no error"
    break
  fi
  sleep $mySleep
done
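
As a rough cross-check of the "30 minutes" in the expected results, the script's own parameters can be plugged into a small helper. This is only a sketch: the function name expected_runtime_seconds is made up for illustration, and the result is a lower bound because it counts only the sleep calls, not the time the commands themselves take.

```shell
# Hypothetical helper (not part of the original script): lower-bound runtime
# of the reproducer in seconds, counting only its sleep calls.
expected_runtime_seconds() {
  local conns=$1      # number of qpid-receive clients started (noDelQueues)
  local max_iter=$2   # number of checks (maxIter)
  local my_sleep=$3   # seconds between checks (mySleep)
  # Startup: one "sleep 0.1" per client, i.e. conns/10 seconds.
  local startup=$((conns / 10))
  # Monitoring: the loop sleeps after every check except the last one.
  local monitoring=$(((max_iter - 1) * my_sleep))
  echo $((startup + monitoring))
}

# Example: expected_runtime_seconds 500 180 10
```

With the values from the script (500 clients, maxIter=180, mySleep=10) this gives 1840 seconds, i.e. a little over 30 minutes for a run that ends with "no error".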

Comment 1 Pavel Moravec 2014-11-04 12:09:50 UTC
An important observation for developers: I do not send a single message to the broker, so the purge task should finish very quickly. I just create sufficiently many connections with heartbeats.

Comment 3 Gordon Sim 2014-11-05 10:35:55 UTC
Fix available upstream: https://svn.apache.org/r1636848

Comment 6 Pavel Moravec 2014-11-05 19:28:37 UTC
Results from my repetitive testing:

Upstream broker with commit r1594220 reverted (original behaviour):
- problem never appeared. Tried 5 times with 1000 connections for 30 minutes each, no connection dropped

Upstream broker before Gordon's patch (i.e. before r1636848):
- problem reproduces very easily; from 300 connections upward it is almost certain that all connections are gone within 10 minutes

Upstream broker with Gordon's patch (i.e. after r1636848):
- problem reproduces seldom. E.g. for 500 connections, there is approx. a 50% probability that the issue appears within 30 minutes. With more connections (1000), the probability is higher

Conclusions:
- Gordon's patch (r1636848) fixes the vast majority of bug occurrences
- even with the patch, there is still some regression due to r1594220
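
Repeated runs like the ones above could be tallied with a small wrapper around the reproducer. This is a sketch under assumptions, not the procedure actually used for these results: the function name count_clean_runs is hypothetical, and it relies only on the fact that the script's last output line is "no error" when no connection was dropped.

```shell
# Hypothetical helper: run a command RUNS times and count how many runs
# ended with the line "no error".
# Usage: count_clean_runs RUNS CMD [ARGS...]
count_clean_runs() {
  local runs=$1
  shift
  local clean=0 i last
  for i in $(seq 1 "$runs"); do
    # Keep only the final line printed by the command.
    last=$("$@" 2>/dev/null | tail -n 1)
    if [ "$last" = "no error" ]; then
      clean=$((clean + 1))
    fi
  done
  echo "$clean/$runs runs finished without a connection drop"
}

# Example: count_clean_runs 5 ./reproduce_ttl_timeout.sh
```

Note that the reproducer leaves its qpid-receive clients running when it exits, so they would have to be stopped between runs (for example with pkill qpid-receive); otherwise the process count of the next run starts too high.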

Comment 8 Gordon Sim 2014-11-05 19:37:27 UTC
From my (less extensive) testing, *without* the most recent fix, I see problems for 500 connections almost immediately within the first couple of iterations of checking. For 1000 connections I ran it five times and it failed on the first iteration each time.

Retesting *with* the fix, each of 5 runs (1000 connections each) ran to the no-error conclusion.

Comment 9 Gordon Sim 2014-12-01 14:37:59 UTC
New patch from Chuck that should address the remaining issues: https://svn.apache.org/r1642681

Comment 17 Michal Toth 2015-01-29 14:16:46 UTC
Verified on qpid-cpp-server-0.30-5

Comment 21 errata-xmlrpc 2015-04-14 13:48:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html

