Red Hat Bugzilla – Bug 504590
qpidd does not use heartbeats to detect loss of clients
Last modified: 2009-06-12 13:39:02 EDT
E.g. have a client request an exclusive subscription to a queue on a remote broker, power off the machine on which the client is running, then from another machine try to request an exclusive subscription to that same queue.
At present it can take 14 minutes for the broker to detect that the first client's session has been lost and grant an exclusive subscription to the second client. It should happen much faster.
Created attachment 346857 [details]
exclusive subscribe example
This is a simple client that will request an exclusive subscription to the specified queue. While one instance of this is active, attempts to start further instances on the same queue will fail. If the first process is killed, that will allow another to take the exclusive subscription.
However, without bi-directional heartbeats (and with default TCP settings), if the machine the client is on is powered down (or unplugged from the network), the queue remains 'locked' until TCP times out on retries.
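As a rough illustration of the detection being asked for, here is a minimal Python sketch of a broker-side heartbeat check. The class and method names are hypothetical and this is not qpidd's actual code; the timeout of twice the heartbeat interval is an assumption taken from the later comment that a 1s heartbeat has a 2s timeout. Timestamps are passed in explicitly so the logic can be followed without real sockets.

```python
class HeartbeatMonitor:
    """Hypothetical broker-side tracker for one client connection."""

    def __init__(self, interval, now=0.0, timeout_factor=2):
        # Assumption: the connection is considered lost after
        # timeout_factor * interval seconds of silence.
        self.timeout = interval * timeout_factor
        self.last_seen = now

    def on_traffic(self, now):
        # Any frame from the client, heartbeat or data, resets the timer.
        self.last_seen = now

    def expired(self, now):
        # True once the client has been silent for longer than the timeout.
        return now - self.last_seen > self.timeout


monitor = HeartbeatMonitor(interval=1.0, now=0.0)
monitor.on_traffic(now=1.0)        # last heartbeat arrives at t=1s
print(monitor.expired(now=2.5))    # False: only 1.5s of silence
print(monitor.expired(now=3.5))    # True: 2.5s of silence exceeds 2s
```

Once `expired` reports true, a broker could close the connection and release the exclusive subscription without waiting for TCP retransmission to give up.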
Created attachment 346886 [details]
heartbeat echo from the java and python clients
Created attachment 346919 [details]
Exclusive subscribe examples in c++, python and java
This tarball includes Java and Python equivalents of the example attached earlier, as well as the C++ version.
(For java the ant file included will both compile and run the test e.g. ant run -Dport=6672 -Dhost=mrg-xx)
Created attachment 346927 [details]
Patch to add client heartbeat/detection
This patch against the cpp directory of the qpid 1.1.2 tree adds c++ client heartbeat and c++ broker detection of heartbeat timeout.
fixed in 1.2 as well
This test should also be run on a node of a cluster
Putting back to ON_QA.
I've found a problem with the first fix:
Run 3 clustered brokers:
qpidd --auth no --cluster-name ams --port 21022 --no-data-dir
qpidd --auth no --cluster-name ams --port 21023 --no-data-dir
qpidd --auth no --cluster-name ams --port 21024 --no-data-dir
Run this line against brokers:
while true; do src/tests/perftest --port 21022 --heartbeat 1 & sleep 2 ; kill -STOP %% ; sleep 2 ; kill -CONT %%; done
The above test is a bit too fierce: it doesn't leave enough time to be sure that the broker should kill the client connection, as a heartbeat of 1s has a 2s timeout.
This means that BZ505210 can interfere.
while true; do src/tests/perftest --port 21022 --heartbeat 1 & sleep 2 ; kill -STOP %% ; sleep 4 ; kill -CONT %%; done
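The timing argument behind lengthening the pause can be checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes the timeout is twice the heartbeat interval, as stated above for a 1s heartbeat.

```python
def stop_margin(heartbeat_s, stop_s, timeout_factor=2):
    """Seconds by which a client pause exceeds the heartbeat timeout.

    A result of zero or less means the pause does not clearly exceed
    the timeout, so the broker may or may not drop the connection.
    """
    return stop_s - heartbeat_s * timeout_factor


for stop in (2, 4):
    print(stop, stop_margin(1, stop))
# prints "2 0" then "4 2"
```

With a 2s pause the margin is zero, so the outcome depends on exact timing; with a 4s pause the client is silent for 2s longer than the timeout and the broker should always detect it.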
Created attachment 347332 [details]
Fix for issues in previous patch for 1.1.2
Patch which fixes the issues in the previous client heartbeat patch
Tested on RHEL 4.7 and 5.3 on i386/x86_64 with qpidd-0.5.752581-16 and it works as expected. After about 3 heartbeats, another client can take the exclusive subscription to the queue again. -->VERIFIED
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.