Description of problem: * A server was not running NTP service for a while and it got skewed from standard time. After sysadmin starting NTP service, several connections were dropped by heartbeat failure. Version-Release number of selected component (if applicable): qpid-0.18 How reproducible: Always Steps to Reproduce: 1. Set up a heartbeat-enabled connection to the broker 2. Add some minutes to current system time Actual results: * Connections are dropped due to heartbeat timeout Expected results: * Connections should not drop. Additional info: * It seems that this issue is triggered because the main Timer class in Qpid uses pthread_cond_timedwait, which is fed by CLOCK_REALTIME and is affected by settime()/settimeofday() calls on Linux. * Maybe it could be mitigated by using CLOCK_MONOTONIC time source instead, but we are not certain about the side effects that it might cause.
------------------- Trivial reproducer: 1) qpid-receive -a "testQueue; {create:always}" -f --connection-options="{heartbeat:1, reconnect:true}" 2) date -s "<any time in future>" step 2 bounces connection ------------------- Trivial workaround: do OS time changes only when broker is stopped ------------------- Some investigation: Per tcpdump, broker - at the time of new keepalive probe to be sent - drops the TCP connection, sometimes preceded by connection.heartbeat being sent, sometimes not. ---------------- Potential patch: I tested patch on upstream: Index: cpp/src/qpid/sys/posix/Time.cpp =================================================================== --- cpp/src/qpid/sys/posix/Time.cpp (revision 1580817) +++ cpp/src/qpid/sys/posix/Time.cpp (working copy) @@ -53,7 +53,7 @@ AbsTime AbsTime::now() { struct timespec ts; - ::clock_gettime(CLOCK_REALTIME, &ts); + ::clock_gettime(CLOCK_MONOTONIC, &ts); AbsTime time_now; time_now.timepoint = toTime(ts).nanosecs; return time_now; prevents this flaw, though I dont know what potential impact it can have elsewhere. Well, one side-effect it has, unit_tests fail (TimerTest.cpp and valgrind reports "Conditional jump or move depends on uninitialised value(s)"). ---------------- Raising proper mrg-* flags (as not sure if the customer plans to upgrade to 3.0 or stay on 2.*, I raised boths and drop the not-relevant later on).
(In reply to Pavel Moravec from comment #1) > ------------------- > Trivial reproducer: > 1) qpid-receive -a "testQueue; {create:always}" -f > --connection-options="{heartbeat:1, reconnect:true}" > 2) date -s "<any time in future>" > > step 2 bounces connection > > > ------------------- > Trivial workaround: > do OS time changes only when broker is stopped > > > ------------------- > Some investigation: > Per tcpdump, broker - at the time of new keepalive probe to be sent - drops > the TCP connection, sometimes preceded by connection.heartbeat being sent, > sometimes not. > > > ---------------- > Potential patch: > I tested patch on upstream: > > Index: cpp/src/qpid/sys/posix/Time.cpp > =================================================================== > --- cpp/src/qpid/sys/posix/Time.cpp (revision 1580817) > +++ cpp/src/qpid/sys/posix/Time.cpp (working copy) > @@ -53,7 +53,7 @@ > > AbsTime AbsTime::now() { > struct timespec ts; > - ::clock_gettime(CLOCK_REALTIME, &ts); > + ::clock_gettime(CLOCK_MONOTONIC, &ts); > AbsTime time_now; > time_now.timepoint = toTime(ts).nanosecs; > return time_now; > > prevents this flaw, though I dont know what potential impact it can have > elsewhere. I believe to fix this completely you need to have qpid::sys::Condition use CLOCK_MONOTONIC as well. By default pthread_cont_t uses the REALTIME clock so even if you generate AbsTime's with CLOCK_MONOTONIC, pthread_cond_timedwait may compare them with times taken from CLOCK_REALTIME so there may still be a problem. The other thing to consider is I think that not all systems support CLOCK_MONOTONIC so we may need to check that it's available and fall back to CLOCK_REALTIME if not. > Well, one side-effect it has, unit_tests fail (TimerTest.cpp and valgrind > reports "Conditional jump or move depends on uninitialised value(s)"). Its important to fix that but probably not very difficult.
I think the major issue with using CLOCK_MONOTONIC is that we use the AbsTime class to generate date/times in a few places. I think we could no longer do that reliably. CLOCK_MONOTONIC is defined by POSIX to be present if _POSIX_MONOTONIC_CLOCK is defined so we could easily switch between CLOCK_MONOTONIC and CLOCK_REALTIME. In practice I think that POSIX systems have it nowadays.
A fix for this issue has been checked into the qpid trunk: https://svn.apache.org/r1607140
Created attachment 998762 [details] Qpid broker debug log
Created attachment 998763 [details] Qpid broker trace log
Created attachment 999460 [details] Qpid cpp,java clients trace logs
Looking at those logs it seems that the client is closing the connection normally, rather than missing a heartbeat and force closing the connection. Viz - 2015-03-09 10:18:40 [Network] trace SENT [[[::1]:32871-localhost:5672]]: Frame[BEbe; channel=0; {ConnectionCloseBody: reply-code=200; reply-text=OK; }] This frame sent by the client is asking for a regular connection close with no error. Incidentally, the java logs are probably not related to any C++ bug. I would therefore suggest this may be a bug in the qc2_connector program that makes it think the duration of 20s is finished when the second time change occurs.
As extra support for comment #25: Whe I use the qpid-receive program I find no problem: QPID_LOG_ENABLE="debug+" src/tests/qpid-receive --connection-options "{ heartbeat:'1' }" -a "test_heartbeat_connection_on_system_time_change_future;{'node': {'x-declare': {'auto-delete': True}, 'type': 'queue'}, 'create': 'always'}" --timeout 30 This causes the same behaviour in terms of queue creations etc and heartbeats.
Ok, I see your point. Thank you very much. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0805.html