Bug 1080165 - [qpid] Changing system time might cause heartbeat-enabled connections to drop
Summary: [qpid] Changing system time might cause heartbeat-enabled connections to drop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.4
Hardware: All
OS: Linux
medium
medium
Target Milestone: 3.1
: ---
Assignee: Andrew Stitcher
QA Contact: Michal Toth
URL: http://comments.gmane.org/gmane.comp....
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-03-24 19:37 UTC by Rodrigo A B Freire
Modified: 2019-05-20 11:09 UTC (History)
10 users (show)

Fixed In Version: qpid-cpp-0.30-2
Doc Type: Enhancement
Doc Text:
Under some circumstances the qpid broker could get confused and drop connections if the system clock was changed. The Qpid broker timekeeping is now changed so that changing the system clock does not confuse internal timekeeping. The qpid broker timekeeping is now separated into two types: Wallclock time (as determined by the system time), which no longer affects the internal workings of the broker except to print timestamps; and Relative time (usually relative to the system boot time), which is used for internal timekeeping purposes.
Clone Of:
: 1200545 (view as bug list)
Environment:
Last Closed: 2015-04-14 13:47:41 UTC
Target Upstream Version:


Attachments (Terms of Use)
Qpid broker debug log (37.79 KB, text/plain)
2015-03-06 10:41 UTC, Michal Toth
no flags Details
Qpid broker trace log (101.59 KB, text/plain)
2015-03-06 10:42 UTC, Michal Toth
no flags Details
Qpid cpp,java clients trace logs (4.74 KB, application/zip)
2015-03-09 09:16 UTC, Michal Toth
no flags Details


Links
System ID Priority Status Summary Last Updated
Apache JIRA QPID-5865 None None None Never
Red Hat Knowledge Base (Solution) 764263 None None None Never
Red Hat Product Errata RHEA-2015:0805 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.1 Release 2015-04-14 17:45:54 UTC

Description Rodrigo A B Freire 2014-03-24 19:37:42 UTC
Description of problem:
* A server was not running NTP service for a while and it got skewed from standard time. After sysadmin starting NTP service, several connections were dropped by heartbeat failure.

Version-Release number of selected component (if applicable):
qpid-0.18

How reproducible:
Always

Steps to Reproduce:
1. Set up a heartbeat-enabled connection to the broker
2. Add some minutes to current system time

Actual results:
* Connections are dropped due to heartbeat timeout

Expected results:
* Connections should not drop.

Additional info:
* It seems that this issue is triggered because the main Timer class in Qpid uses pthread_cond_timedwait, which is fed by CLOCK_REALTIME and is affected by  settime()/settimeofday() calls on Linux.

* Maybe it could be mitigated by using CLOCK_MONOTONIC time source instead, but we are not certain about the side effects that it might cause.

Comment 1 Pavel Moravec 2014-03-25 10:52:47 UTC
-------------------
Trivial reproducer:
1) qpid-receive -a "testQueue; {create:always}" -f --connection-options="{heartbeat:1, reconnect:true}"
2) date -s "<any time in future>"

step 2 bounces connection


-------------------
Trivial workaround:
do OS time changes only when broker is stopped


-------------------
Some investigation:
Per tcpdump, broker - at the time of new keepalive probe to be sent - drops the TCP connection, sometimes preceded by connection.heartbeat being sent, sometimes not.


----------------
Potential patch:
I tested patch on upstream:

Index: cpp/src/qpid/sys/posix/Time.cpp
===================================================================
--- cpp/src/qpid/sys/posix/Time.cpp	(revision 1580817)
+++ cpp/src/qpid/sys/posix/Time.cpp	(working copy)
@@ -53,7 +53,7 @@
 
 AbsTime AbsTime::now() {
     struct timespec ts;
-    ::clock_gettime(CLOCK_REALTIME, &ts);
+    ::clock_gettime(CLOCK_MONOTONIC, &ts);
     AbsTime time_now;
     time_now.timepoint = toTime(ts).nanosecs;
     return time_now;

prevents this flaw, though I dont know what potential impact it can have elsewhere.

Well, one side-effect it has, unit_tests fail (TimerTest.cpp and valgrind reports "Conditional jump or move depends on uninitialised value(s)").

----------------
Raising proper mrg-* flags (as not sure if the customer plans to upgrade to 3.0 or stay on 2.*, I raised boths and drop the not-relevant later on).

Comment 2 Alan Conway 2014-03-25 14:54:29 UTC
(In reply to Pavel Moravec from comment #1)
> -------------------
> Trivial reproducer:
> 1) qpid-receive -a "testQueue; {create:always}" -f
> --connection-options="{heartbeat:1, reconnect:true}"
> 2) date -s "<any time in future>"
> 
> step 2 bounces connection
> 
> 
> -------------------
> Trivial workaround:
> do OS time changes only when broker is stopped
> 
> 
> -------------------
> Some investigation:
> Per tcpdump, broker - at the time of new keepalive probe to be sent - drops
> the TCP connection, sometimes preceded by connection.heartbeat being sent,
> sometimes not.
> 
> 
> ----------------
> Potential patch:
> I tested patch on upstream:
> 
> Index: cpp/src/qpid/sys/posix/Time.cpp
> ===================================================================
> --- cpp/src/qpid/sys/posix/Time.cpp	(revision 1580817)
> +++ cpp/src/qpid/sys/posix/Time.cpp	(working copy)
> @@ -53,7 +53,7 @@
>  
>  AbsTime AbsTime::now() {
>      struct timespec ts;
> -    ::clock_gettime(CLOCK_REALTIME, &ts);
> +    ::clock_gettime(CLOCK_MONOTONIC, &ts);
>      AbsTime time_now;
>      time_now.timepoint = toTime(ts).nanosecs;
>      return time_now;
> 
> prevents this flaw, though I dont know what potential impact it can have
> elsewhere.

I believe to fix this completely you need to have qpid::sys::Condition use CLOCK_MONOTONIC as well. By default pthread_cont_t uses the REALTIME clock so even if you generate AbsTime's with CLOCK_MONOTONIC, pthread_cond_timedwait may compare them with times taken from CLOCK_REALTIME so there may still be a problem.

The other thing to consider is I think that not all systems support CLOCK_MONOTONIC so we may need to check that it's available and fall back to CLOCK_REALTIME if not.

> Well, one side-effect it has, unit_tests fail (TimerTest.cpp and valgrind
> reports "Conditional jump or move depends on uninitialised value(s)").

Its important to fix that but probably not very difficult.

Comment 3 Andrew Stitcher 2014-03-26 16:31:49 UTC
I think the major issue with using CLOCK_MONOTONIC is that we use the AbsTime class to generate date/times in a few places. I think we could no longer do that reliably.

CLOCK_MONOTONIC is defined by POSIX to be present if _POSIX_MONOTONIC_CLOCK is defined so we could easily switch between CLOCK_MONOTONIC and CLOCK_REALTIME. In practice I think that POSIX systems have it nowadays.

Comment 7 Andrew Stitcher 2014-07-01 17:02:33 UTC
A fix for this issue has been checked into the qpid trunk:
 https://svn.apache.org/r1607140

Comment 19 Michal Toth 2015-03-06 10:41:30 UTC
Created attachment 998762 [details]
Qpid broker debug log

Comment 20 Michal Toth 2015-03-06 10:42:40 UTC
Created attachment 998763 [details]
Qpid broker trace log

Comment 23 Michal Toth 2015-03-09 09:16:33 UTC
Created attachment 999460 [details]
Qpid cpp,java clients trace logs

Comment 25 Andrew Stitcher 2015-03-10 20:11:11 UTC
Looking at those logs it seems that the client is closing the connection normally, rather than missing a heartbeat and force closing the connection.

Viz -

2015-03-09 10:18:40 [Network] trace SENT [[[::1]:32871-localhost:5672]]: Frame[BEbe; channel=0; {ConnectionCloseBody: reply-code=200; reply-text=OK; }]

This frame sent by the client is asking for a regular connection close with no error.

Incidentally, the java logs are probably not related to any C++ bug.

I would therefore suggest this may be a bug in the qc2_connector program that makes it think the duration of 20s is finished when the second time change occurs.

Comment 26 Andrew Stitcher 2015-03-10 20:14:10 UTC
As extra support for comment #25:

Whe I use the qpid-receive program I find no problem:

QPID_LOG_ENABLE="debug+" src/tests/qpid-receive --connection-options "{ heartbeat:'1' }" -a "test_heartbeat_connection_on_system_time_change_future;{'node': {'x-declare': {'auto-delete': True}, 'type': 'queue'}, 'create': 'always'}" --timeout 30

This causes the same behaviour in terms of queue creations etc and heartbeats.

Comment 27 Michal Toth 2015-03-11 13:40:38 UTC
Ok, I see your point. Thank you very much.
Marking as verified.

Comment 30 errata-xmlrpc 2015-04-14 13:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html


Note You need to log in before you can comment on or make changes to this bug.