Bug 796765

Summary: heartbeats not reliable as means of detecting loss of network in c++ client
Product: Red Hat Enterprise MRG
Reporter: Gordon Sim <gsim>
Component: qpid-cpp
Assignee: Andrew Stitcher <astitcher>
Status: CLOSED ERRATA
QA Contact: Chuck Rolke <crolke>
Severity: high
Docs Contact:
Priority: high
Version: 2.0
CC: crolke, jross, lzhaldyb, santiago
Target Milestone: 3.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.22-4.el6, qpid-cpp-0.22-4.el5
Doc Type: Bug Fix
Doc Text:
The qpid C++ client did not disconnect from a broker whose heartbeats had timed out if the socket it was using was not writable. This caused problems when sending large messages, because the client did not disconnect when it detected a heartbeat timeout. With the fix, clients that are sending large messages disconnect correctly on heartbeat timeouts.
Story Points: ---
Clone Of:
: 835119
Environment:
Last Closed: 2014-09-24 15:04:03 UTC
Type: ---

Description Gordon Sim 2012-02-23 15:54:38 UTC
Heartbeats are unreliable in particular when sending larger messages (e.g. see https://issues.apache.org/jira/browse/QPID-3828), or when anything else causes the socket to be non-writable (i.e. buffers full) at the point the client detects the lack of heartbeats.
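
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not qpid code: ToyConnector, abortDeferred, abortImmediate and poll are hypothetical names, and the idea that the abort request waits behind a write-ready event is only an assumption about one way a non-writable socket could block the disconnect.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Hypothetical toy model (not the qpid connector): requests queued on the
// connector are only dispatched when the socket is reported writable.
struct ToyConnector {
    bool writable = true;   // false models full send buffers
    bool closed = false;
    std::queue<std::function<void()>> pending;

    // Assumed buggy shape: the abort is queued and waits for a writable socket.
    void abortDeferred() { pending.push([this] { closed = true; }); }

    // Assumed fixed shape: the abort closes at once, writable or not.
    void abortImmediate() { closed = true; }

    // One pass of the toy event loop.
    void poll() {
        if (!writable) return;  // nothing is dispatched while buffers are full
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }
};

// Returns whether the connection ends up closed when the abort is requested
// while the socket is not writable.
bool closesWhileBlocked(bool deferAbort) {
    ToyConnector c;
    c.writable = false;
    if (deferAbort) c.abortDeferred(); else c.abortImmediate();
    c.poll();
    return c.closed;
}

// Small self-check of the two variants.
void selfCheck() {
    assert(!closesWhileBlocked(true));   // deferred abort never runs
    assert(closesWhileBlocked(false));   // immediate abort closes
}
```

In this model the deferred abort stays queued for as long as the buffers remain full, which matches the symptom of a client that detects the timeout but never disconnects.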

Comment 1 Ed Santiago 2012-06-22 22:01:32 UTC
This might not be related, but it's the closest approximation to the symptoms I'm seeing. In short, heartbeat does not detect a dropped connection.

1) Listener client on my laptop, connecting (via VPN) to remote broker. Heartbeat set to 5.

2) Drop VPN. Leave it off for 10 seconds or 2 hours. Client will not disconnect.

3) Reconnect VPN. Client will still not disconnect.

Am I using heartbeat incorrectly? Are my expectations wrong? I expect the heartbeat option to cause a listener client to abort.

This is with qpid-cpp-0.16-1.fc16.1; I also tried rebuilding with the patch from your 11/Feb/12 17:08 comment on the QPID-3828 issue. No joy.

Full command and output:

   $ QPID_SSL_CERT_DB=[...]/certdb QPID_LOG_ENABLE=trace+   \
       ./drain -f  --broker [snip]:5671                     \
       --connection-options '{ sasl-mechanism:GSSAPI,       \
                               transport: ssl,              \
                               heartbeat: 5 }'              \
         'tmp.esm-test; { create:receiver, node: { type: queue, durable: False, x-declare: { exclusive: True, auto-delete: True }, x-bindings: [ { exchange: "standard.topic", queue: "tmp.esm-test", key: "something.#" }]}}'
   [...]
   2012-06-22 15:44:07 trace RECV [[53016 10.16.36.223:5671]]: Frame[BEbe; channel=0; {ConnectionHeartbeatBody: }]
   2012-06-22 15:44:07 trace SENT [[53016 10.16.36.223:5671]]: Frame[BEbe; channel=0; {ConnectionHeartbeatBody: }]
   [***this is where I disconnect VPN***]
   2012-06-22 15:44:17 debug Traffic timeout
   [***that's it. Drain process does not terminate, even after many hours***]

Clearly there's _some_ detection going on. The Traffic timeout message is in src/qpid/client/ConnectionImpl.cpp, and the code proceeds to call idleIn() which in turn calls connector->abort(), but something is getting stuck somewhere.

100% reproducible.
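
The log above shows the detection half working: with heartbeat set to 5, the last frame is received at 15:44:07 and "Traffic timeout" is logged at 15:44:17. A minimal sketch of such a watchdog follows, assuming the timeout is twice the heartbeat interval (consistent with the 10-second gap in the log, but not confirmed by this report). TrafficWatchdog and its members are hypothetical names, not the ConnectionImpl code.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical watchdog (not the qpid implementation): reports a timeout
// when no traffic has been seen for twice the heartbeat interval.
class TrafficWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    TrafficWatchdog(std::chrono::seconds heartbeat, Clock::time_point now)
        : timeout_(2 * heartbeat), lastTraffic_(now) {}

    // Call whenever any frame (including a heartbeat) is received.
    void onTraffic(Clock::time_point now) { lastTraffic_ = now; }

    // True once the silence has lasted at least the timeout.
    bool timedOut(Clock::time_point now) const {
        return now - lastTraffic_ >= timeout_;
    }

private:
    std::chrono::seconds timeout_;
    Clock::time_point lastTraffic_;
};

// Replays the timing from the log: last frame at t0, heartbeat of 5 seconds.
// Returns true if the watchdog is quiet at t0+9s and fires at t0+10s.
bool replayCommentLog() {
    using std::chrono::seconds;
    TrafficWatchdog::Clock::time_point t0{};  // stands in for 15:44:07
    TrafficWatchdog w(seconds(5), t0);
    bool quietAt9 = !w.timedOut(t0 + seconds(9));
    bool firesAt10 = w.timedOut(t0 + seconds(10));
    return quietAt9 && firesAt10;
}
```

Detecting the timeout is only the first step; the report is about what happens afterwards, when the resulting abort does not actually tear down the connection.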

Comment 2 Gordon Sim 2012-06-25 08:44:20 UTC
This does indeed look like a similar issue, although in this case with the SSL transport.

Comment 3 Andrew Stitcher 2012-06-25 15:26:27 UTC
Created new bug for Ed's reported bug which is a different problem.

Comment 4 Andrew Stitcher 2012-06-25 15:27:11 UTC
(In reply to comment #3)
> Created new bug for Ed's reported bug which is a different problem.

Bug 835119

Comment 5 Andrew Stitcher 2013-04-25 14:47:55 UTC
This is now fixed upstream on trunk in r1475803 (on track for 0.24)

Comment 9 errata-xmlrpc 2014-09-24 15:04:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1296.html