504590 – qpidd does not use heartbeats to detect loss of clients

Summary: qpidd does not use heartbeats to detect loss of clients

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	1.1.2
Target Release:	---
Assignee:	Andrew Stitcher
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-06-08 12:11 UTC by Gordon Sim
Modified:	2009-06-12 17:39 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-06-12 17:39:02 UTC
Target Upstream Version:
Embargoed:

Attachments
exclusive subscribe example (1.01 KB, text/x-c++src) 2009-06-08 12:16 UTC, Gordon Sim	no flags
heartbeat echo from the java and python clients (1.05 KB, patch) 2009-06-08 14:27 UTC, Rafael H. Schloming	no flags
Exclusive subscribe examples in c++, python and java (2.58 KB, application/x-compressed-tar) 2009-06-08 19:54 UTC, Gordon Sim	no flags
Patch to add client heartbeat/detection (14.66 KB, patch) 2009-06-08 20:15 UTC, Andrew Stitcher	no flags
Fix for issues in previous patch for 1.1.2 (5.03 KB, patch) 2009-06-11 05:52 UTC, Andrew Stitcher	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:1097	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging bug fixing update	2009-06-12 17:38:48 UTC

Description Gordon Sim 2009-06-08 12:11:35 UTC

For example: have a client request an exclusive subscription to a queue on a remote broker, power off the machine on which the client is running, then from another machine try to request an exclusive subscription to that same queue.

At present it can take 14 minutes for the broker to detect that the first client's session has been lost and grant an exclusive subscription to the second client. It should happen much faster.

Comment 1 Gordon Sim 2009-06-08 12:16:25 UTC

Created attachment 346857
exclusive subscribe example

This is a simple client that will request an exclusive subscription to the specified queue. While one instance of this is active, attempts to start further instances on the same queue will fail. If the first process is killed, that will allow another to take the exclusive subscription.

However, without bi-directional heartbeats (and with default TCP settings), if the machine the client is on is powered down (or unplugged from the network), the queue remains 'locked' until TCP times out on retries.
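
As a rough illustration of why the queue can stay locked for so long, the sketch below sums the waits TCP would go through before giving up on an unreachable peer. It assumes Linux defaults that are not stated in this report: a tcp_retries2 value of 15, an initial retransmission timeout of 200 ms that doubles on each retry, and a 120 s cap. The function name tcp_retry_timeout_ms is hypothetical, and real timeouts vary with the measured round-trip time.

```shell
#!/bin/sh
# Hypothetical helper: total time (ms) TCP waits across N retransmissions
# plus the final wait, assuming exponential backoff from 200 ms capped at 120 s.
tcp_retry_timeout_ms() {
    retries=$1
    rto=200          # assumed initial retransmission timeout, in ms
    max_rto=120000   # assumed cap on the retransmission timeout, in ms
    total=0
    i=0
    # One wait per retransmission, plus the wait after the last one.
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt "$max_rto" ]; then rto=$max_rto; fi
        i=$((i + 1))
    done
    echo "$total"
}

ms=$(tcp_retry_timeout_ms 15)
echo "approx. $((ms / 60000)) minutes ($ms ms)"
```

Under these assumptions the total is 924600 ms, roughly 15 minutes, which is the same order as the 14 minutes reported in the description.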

Comment 2 Rafael H. Schloming 2009-06-08 14:27:04 UTC

Created attachment 346886
heartbeat echo from the java and python clients

Comment 3 Gordon Sim 2009-06-08 19:54:01 UTC

Created attachment 346919
Exclusive subscribe examples in c++, python and java

This tarball includes examples equivalent to the one attached earlier, in Java and Python as well as C++.

(For Java, the included ant file will both compile and run the test, e.g. ant run -Dport=6672 -Dhost=mrg-xx)

Comment 4 Andrew Stitcher 2009-06-08 20:15:54 UTC

Created attachment 346927
Patch to add client heartbeat/detection

This patch against the cpp directory of the qpid 1.1.2 tree adds c++ client heartbeat and c++ broker detection of heartbeat timeout.

Comment 5 Andrew Stitcher 2009-06-08 20:16:39 UTC

fixed in 1.2 as well

Comment 6 Carl Trieloff 2009-06-09 00:49:53 UTC

This test should also be run on a node of a cluster

Comment 9 Frantisek Reznicek 2009-06-10 06:36:51 UTC

Putting back to ON_QA.

Comment 10 Andrew Stitcher 2009-06-10 21:04:42 UTC

I've found a problem with the first fix:

Reproducer:

Run 3 clustered brokers:

qpidd --auth no --cluster-name ams --port 21022 --no-data-dir
qpidd --auth no --cluster-name ams --port 21023 --no-data-dir
qpidd --auth no --cluster-name ams --port 21024 --no-data-dir

Run this line against brokers:

while true; do src/tests/perftest --port 21022 --heartbeat 1 & sleep 2 ; kill -STOP %% ; sleep 2 ; kill -CONT %%; done

Comment 11 Andrew Stitcher 2009-06-11 05:42:37 UTC

The above test is a bit too fierce and doesn't leave enough time to be sure that the broker should kill the client connection, as a heartbeat of 1s has a 2s timeout.

This means that BZ505210 can interfere.

Use the following instead:

while true; do src/tests/perftest --port 21022 --heartbeat 1 & sleep 2 ; kill -STOP %% ; sleep 4 ; kill -CONT %%; done
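
The timing argument can be sketched as a small check. It assumes, generalising from the 1s heartbeat and 2s timeout mentioned in this comment, that the timeout is twice the heartbeat interval. The function names are hypothetical and this is not the broker's actual logic.

```shell
#!/bin/sh
# Hypothetical helpers; the 2x factor is an assumption generalised from
# "a heartbeat of 1s has a 2s timeout".
heartbeat_timeout() {
    echo $(( $1 * 2 ))
}

# Prints "yes" if a client stopped for $2 seconds clearly exceeds the
# timeout for a heartbeat interval of $1 seconds, otherwise "no".
stop_exceeds_timeout() {
    if [ "$2" -gt "$(heartbeat_timeout "$1")" ]; then
        echo yes
    else
        echo no
    fi
}

stop_exceeds_timeout 1 2   # stop only equals the timeout: borderline
stop_exceeds_timeout 1 4   # stop is well past the timeout
```

With a 1s heartbeat, a 2s stop only equals the timeout, so whether the broker drops the connection is a race. A 4s stop leaves a clear margin.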

Comment 12 Andrew Stitcher 2009-06-11 05:52:04 UTC

Created attachment 347332
Fix for issues in previous patch for 1.1.2

Patch which fixes the issues in the previous client heartbeat patch.

Comment 13 Martin Kudlej 2009-06-11 14:08:29 UTC

Tested on RHEL 4.7 and 5.3 on i386/x86_64 with qpidd-0.5.752581-16 and it works as expected. After about 3 heartbeats, clients can obtain an exclusive subscription to the queue again. --> VERIFIED

Comment 15 errata-xmlrpc 2009-06-12 17:39:02 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html
