
Bug 1160232

Summary:	[regression] broker sometimes forgets to send connection heartbeats
Product:	Red Hat Enterprise MRG	Reporter:	Pavel Moravec <pmoravec>
Component:	qpid-cpp	Assignee:	Gordon Sim <gsim>
Status:	CLOSED ERRATA	QA Contact:	Michal Toth <mtoth>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.0	CC:	esammons, freznice, gsim, iboverma, jross, mcressma, mtoth, ngalvin
Target Milestone:	3.1	Keywords:	Regression, TestCaseProvided
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	qpid-cpp-0.30-5	Doc Type:	Bug Fix
Doc Text:	It was discovered that the timer task for periodic queue purging was added to the timer's set of tasks twice. This caused the timer's internal state to become corrupted, which in turn prevented some tasks from being triggered. The logic is now fixed to ensure the task is not included multiple times within the timer.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-04-14 13:48:42 UTC	Type:	Bug

Description Pavel Moravec 2014-11-04 11:56:11 UTC

Description of problem:
The qpid broker sometimes fails to send connection.heartbeat. The misses occur at seemingly random times, and the more connections the broker handles, the (much) higher the probability of a miss. E.g. for 100 connections with heartbeat=5 there are a few such misses in 30 minutes; for 500 connections with heartbeat=5, usually all connections are affected within 30 minutes.

The reproducer is trivial: just open sufficiently many connections with heartbeats set.

I noticed this is a regression caused by bz1093996 / QPID-5758 / svn commit 1594220, i.e. a regression between MRG-M 3.0 (bug not present) and MRG-M 3.1 (bug present). I confirmed the regression by running today's upstream qpid (bug present) and qpid with commit 1594220 rolled back (no bug).


Version-Release number of selected component (if applicable):
qpid-cpp-server-0.30-2


How reproducible:
100% within 30 minutes


Steps to Reproduce:
1. qpidd --queue-purge-interval=20 --auth=no --max-connections=10000 &
2. Run the script below, ./reproduce_ttl_timeout.sh, which creates 500 queues on 500 connections with heartbeat=5. Every 10 seconds the script checks that the number of client connections has not decreased.


Actual results:
script terminates soon with:

error: found just 225 connections instead of 500, in iteration 2


Expected results:
The script terminates after 30 minutes with "no error".


Additional info:
script:

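#!/bin/bash
# Number of auto-delete queues; one qpid-receive connection per queue.
noDelQueues=500

# 180 checks, 10 seconds apart = 30 minutes.
maxIter=180
mySleep=10

echo "creating connections.."
# Each receiver gets its own connection with heartbeat=5 and its own queue.
for i in $(seq 1 "$noDelQueues"); do
  qpid-receive --connection-options "{'heartbeat':5}" \
    -a "autoDelQueueNoBound_${i}; {create:always, node:{ x-declare:{auto-delete:True, arguments:{'qpid.auto_delete_timeout':1}}}}" \
    -f --print-content=no > /dev/null 2>&1 &
  sleep 0.1
done

iter=0
while true; do
  iter=$((iter + 1))
  # Count the qpid-receive processes that are still running.
  conns=$(pgrep qpid-receive | wc -w)
  echo "$(date): iteration:$iter connections:$conns"
  # Fewer receivers than queues means at least one connection was lost.
  if [ "$conns" -lt "$noDelQueues" ]; then
    echo "error: found just $conns connections instead of $noDelQueues, in iteration $iter"
    break
  fi
  if [ "$iter" -eq "$maxIter" ]; then
    echo "no error"
    break
  fi
  sleep "$mySleep"
done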
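
A possible variant of the monitoring loop above, offered only as a sketch: the check is pulled into a function so it can be tried without a broker. The name watch_count and its argument order are hypothetical and not part of the original script; the counting command is passed in by the caller, and the per-iteration status line is left out to keep the output short.

```shell
# watch_count EXPECTED MAX_ITER SLEEP_SECS COUNT_CMD [ARGS...]
# Hypothetical helper: runs COUNT_CMD every SLEEP_SECS seconds, up to
# MAX_ITER times, and fails as soon as it prints a number below EXPECTED.
watch_count() {
  local expected max_iter sleep_secs iter count
  expected=$1
  max_iter=$2
  sleep_secs=$3
  shift 3
  iter=0
  while true; do
    iter=$((iter + 1))
    # Run the caller-supplied command and capture the number it prints.
    count=$("$@")
    if [ "$count" -lt "$expected" ]; then
      echo "error: found just $count connections instead of $expected, in iteration $iter"
      return 1
    fi
    if [ "$iter" -eq "$max_iter" ]; then
      echo "no error"
      return 0
    fi
    sleep "$sleep_secs"
  done
}
```

With the values from the script, the equivalent call would be: watch_count 500 180 10 sh -c 'pgrep qpid-receive | wc -w'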

Comment 1 Pavel Moravec 2014-11-04 12:09:50 UTC

An important observation for developers: I do not send a single message to the broker, so the purge task should finish very quickly. I just create sufficiently many connections with heartbeats.

Comment 3 Gordon Sim 2014-11-05 10:35:55 UTC

Fix available upstream: https://svn.apache.org/r1636848

Comment 6 Pavel Moravec 2014-11-05 19:28:37 UTC

Results from my repetitive testing:

Upstream broker with commit r1594220 reverted (original behaviour):
- problem never appeared. Tried 5 times with 1000 connections for 30 minutes; no connection dropped

Upstream broker before Gordon's patch (i.e. before r1636848):
- problem reproducible very easily; from 300 connections upwards it is almost certain that all connections are gone within 10 minutes

Upstream broker with Gordon's patch (i.e. after r1636848):
- problem seldom reproducible. E.g. for 500 connections, there is approx. a 50% probability that the issue appears within 30 minutes. With more connections (1000), the probability is higher

Conclusions:
- Gordon's patch (r1636848) fixes the vast majority of bug occurrences
- even with the patch, some regression due to r1594220 remains
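
Repeated runs like the ones above could be scripted along these lines. This is only a sketch: repeat_runs is a hypothetical helper, and it assumes the command it is given exits with a non-zero status when a connection drop is detected (add an exit 1 to the error branch of the reproducer if it does not). Restarting the broker and cleaning up leftover qpid-receive processes between runs is left to the caller.

```shell
# repeat_runs N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and reports how many runs failed,
# where "failed" means CMD returned a non-zero exit status.
repeat_runs() {
  local runs failures i
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    # Discard the command's own output; only its exit status is counted.
    if ! "$@" > /dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures of $runs runs failed"
}
```

For example: repeat_runs 5 ./reproduce_ttl_timeout.sh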

Comment 8 Gordon Sim 2014-11-05 19:37:27 UTC

From my (less extensive) testing, *without* the most recent fix, I see problems for 500 connections almost immediately within the first couple of iterations of checking. For 1000 connections I ran it five times and it failed on the first iteration each time.

Retesting *with* the fix, each of 5 runs (1000 connections each) ran to the no-error conclusion.

Comment 9 Gordon Sim 2014-12-01 14:37:59 UTC

New patch from Chuck that should address the remaining issues: https://svn.apache.org/r1642681

Comment 17 Michal Toth 2015-01-29 14:16:46 UTC

Verified on qpid-cpp-server-0.30-5

Comment 21 errata-xmlrpc 2015-04-14 13:48:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html