Bug 1121266

Summary: oo-diagnostics should check that a node's clock is synchronized with the broker's
Product: OpenShift Container Platform
Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: Containers
Assignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED ERRATA
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: high
Version: 2.1.0
CC: adellape, anli, bleanhar, cryan, gpei, jialiu, jokerman, libra-onpremise-devel, mmccomas
Target Milestone: ---
Keywords: Upstream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The oo-diagnostics script did not check whether a node host's clock was in sync with the associated broker host's clock. MCollective ignores a message if the sender's timestamp on it is more than 60 seconds behind the recipient's clock at the time of receipt, so clock skew between broker and node hosts could cause communications to be lost. This bug fix updates the oo-diagnostics script to add the test_node_clock_in_synch_with_broker check, which sends an HTTP request to the broker (as specified by the BROKER_HOST parameter in the /etc/openshift/node.conf file) and compares the time in the "Date:" header of the response with the node host's clock. As a result, the oo-diagnostics script now warns if the clocks are out of sync by five or more seconds, and it fails if the clocks are out of sync by 55 or more seconds.
Story Points: ---
Clone Of:
: 1121267
Environment:
Last Closed: 2014-08-04 13:28:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1121267, 1122194
Bug Blocks:    

Description Miciah Dashiel Butler Masters 2014-07-18 19:52:09 UTC
Description of problem:

MCollective ignores a message when the sender's timestamp on it is more than 60 seconds behind the recipient's clock.  OpenShift broker and node hosts use MCollective for communication.  Consequently, oo-diagnostics should detect when a node's clock is out of synch with its broker's clock.
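
The rule above can be illustrated with a minimal Python sketch. The function name is_ignored and the ttl parameter are hypothetical, chosen for illustration; this is not MCollective's code, only the 60-second rule as stated in this report.

```python
def is_ignored(sender_timestamp, recipient_now, ttl=60.0):
    """Return True if the message looks more than `ttl` seconds old
    when measured against the recipient's clock."""
    age = recipient_now - sender_timestamp
    return age > ttl


# A sender whose clock is 90 seconds behind the recipient's: a message
# stamped 1000 by the sender arrives when the recipient's clock reads 1090.
print(is_ignored(1000.0, 1090.0))

# The same message with only 30 seconds of skew is still accepted.
print(is_ignored(1000.0, 1030.0))
```

Under this rule the host whose clock is ahead is the one that drops messages, so skew in either direction can break one side of the broker/node conversation.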


How reproducible:

Completely.


Steps to Reproduce:

1. Install an OpenShift Enterprise PaaS with 1 node host and 1 distinct broker host.

2. Set the node's clock 30 seconds ahead of the broker's and run oo-diagnostics on the node.

3. Set the node's clock 30 seconds behind the broker's and run oo-diagnostics on the node.

4. Set the node's clock 90 seconds ahead of the broker's and run oo-diagnostics on the node.

5. Set the node's clock 90 seconds behind the broker's and run oo-diagnostics on the node.


Actual results:

oo-diagnostics does not complain about the clock.


Expected results:

At Steps 2 and 3, oo-diagnostics should give a warning because the node's clock is significantly off from the broker's.

At Steps 4 and 5, oo-diagnostics should give an error because the node's clock is sufficiently far off from the broker's to disrupt communications.


Additional info:

In situations where the clocks are so far out of synch as to disrupt communications, the broker has no good way to discover nodes because it uses MCollective, which is disrupted by the problem.  However, a node can identify its broker by the BROKER_HOST setting in its /etc/openshift/node.conf configuration file, so it would be feasible for a node to check that it is in synch with its broker.
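
A node-side check along these lines could be sketched in Python as follows. This is not the oo-diagnostics implementation: the function names, the KEY="value" parsing of node.conf, the plain-HTTP URL, and the host broker.example.com are all assumptions made for illustration. The HTTP "Date:" header has one-second resolution and the request itself takes time, so the computed offset is only approximate.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.request


def parse_broker_host(conf_text):
    """Extract BROKER_HOST from node.conf-style text (assumed KEY="value" lines)."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "BROKER_HOST":
            # Drop any trailing comment, then surrounding whitespace and quotes.
            return value.split("#", 1)[0].strip().strip("\"'")
    return None


def fetch_date_header(host, timeout=10):
    """Return the Date header from an HTTP response sent by the broker.
    Assumption: the broker answers plain HTTP at "/"."""
    with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as response:
        return response.headers["Date"]


def clock_offset_seconds(date_header, local_now):
    """Seconds by which the local clock is ahead of (+) or behind (-) the broker."""
    broker_time = parsedate_to_datetime(date_header)
    return (local_now - broker_time).total_seconds()


# Offline demonstration with a literal header value (no network access needed).
conf = 'BROKER_HOST="broker.example.com"  # hypothetical host name'
print(parse_broker_host(conf))
local_now = datetime(2014, 7, 18, 19, 52, 48, tzinfo=timezone.utc)
print(clock_offset_seconds("Fri, 18 Jul 2014 19:52:09 GMT", local_now))
```

On a real node, the header would come from fetch_date_header(parse_broker_host(...)) and local_now from datetime.now(timezone.utc).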

Comment 1 Miciah Dashiel Butler Masters 2014-07-22 16:09:15 UTC
PR: https://github.com/openshift/enterprise-server/pull/332

Comment 4 Anping Li 2014-07-24 04:02:17 UTC
Verified; passes on puddle-2-1-2014-07-22.


Result:
When the node is 30 to 60 seconds ahead or behind, a warning message is shown.
When the node is more than 60 seconds ahead or behind, an error message is shown.

step 1: No complaint about the clock.

step 2: Ahead > 30, show warning message
INFO: running: test_node_clock_in_synch_with_broker
WARN: test_node_clock_in_synch_with_broker
        The local host's clock is ahead of br200.osegeo-20140724.com.cn's
        by 39 seconds.

step 3: Behind >30, show warning message 
WARN: test_node_clock_in_synch_with_broker
        The local host's clock is behind br200.osegeo-20140724.com.cn's
        by 34 seconds.  Note that a host will drop messages that it receives

step 4: Ahead > 90, show error message 
FAIL: test_node_clock_in_synch_with_broker
        The local host's clock is ahead of br200.osegeo-20140724.com.cn's
        by 90 seconds.

step 5: Behind > 90, show error message
FAIL: test_node_clock_in_synch_with_broker
        The local host's clock is behind br200.osegeo-20140724.com.cn's
        by 92 seconds.

step 6: Ahead > 60, show error message
FAIL: test_node_clock_in_synch_with_broker
        The local host's clock is ahead of br200.osegeo-20140724.com.cn's
        by 64 seconds

step 7: Behind > 60, show error message
FAIL: test_node_clock_in_synch_with_broker
        The local host's clock is behind br200.osegeo-20140724.com.cn's
        by 65 seconds
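
For reference, the offsets observed above can be run through the thresholds given in the Doc Text (warn at five or more seconds, fail at 55 or more). The sketch below is illustrative only: the function name classify_offset and the "PASS" label are hypothetical and do not come from oo-diagnostics.

```python
def classify_offset(offset_seconds, warn_at=5, fail_at=55):
    """Map a clock offset (positive = ahead, negative = behind) to a result."""
    skew = abs(offset_seconds)
    if skew >= fail_at:
        return "FAIL"
    if skew >= warn_at:
        return "WARN"
    return "PASS"


# Offsets reported in steps 2 through 7 (negative means the node is behind).
for offset in (39, -34, 90, -92, 64, -65):
    print(offset, classify_offset(offset))
```

With those thresholds, the 39 and 34 second offsets classify as WARN and the rest as FAIL, which matches the output shown in the steps above.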

Comment 6 errata-xmlrpc 2014-08-04 13:28:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0999.html