Bug 1195072 - replace ntp.conf validation with time-sync check
Summary: replace ntp.conf validation with time-sync check
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhs-hadoop-install
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.0.4
Assignee: Jeff Vance
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-02-22 17:54 UTC by Jeff Vance
Modified: 2015-05-13 17:52 UTC

Fixed In Version: 2.47-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-31 10:18:42 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:0761 0 normal SHIPPED_LIVE Red Hat Storage Server 3 Hadoop plug-in enhancement update 2015-03-31 14:17:20 UTC

Description Jeff Vance 2015-02-22 17:54:44 UTC
Description of problem:
setup_cluster, create_vol, and enable_vol invoke a function which validates the /etc/ntp.conf file and reports an error if that file does not match the test for a "valid" ntp.conf file. The validation requires that there be one or more "server" commands in the file, and that ntpdate can be used on at least one of the servers. A Red Hat customer configured an ntp.conf file using "multicastclient" rather than "server", which causes the scripts above to exit with an error.

There are many sophisticated ntp configs available, including supplying ntpd the config options directly and bypassing ntp.conf altogether. Therefore, our scripts should not attempt to validate ntp.conf, but rather check that the time drift across the cluster is within a reasonable tolerance.
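
To illustrate why a "server"-only test rejects a working config, here is a minimal Python sketch of that kind of check. It is not the install scripts' actual code: the function names and the two sample configs are hypothetical, and the ntpdate step of the real validation is left out.

```python
import re

# Hypothetical re-creation of the old "valid ntp.conf" test: at least one
# uncommented "server" directive. The ntpdate probe is intentionally omitted.
SERVER_LINE = re.compile(r"^\s*server\s+(\S+)")


def server_entries(conf_text):
    """Return the hosts named by uncommented 'server' lines."""
    hosts = []
    for line in conf_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            hosts.append(match.group(1))
    return hosts


def old_style_check(conf_text):
    """True only when the config names at least one 'server'."""
    return len(server_entries(conf_text)) > 0


# Sample configs, made up for this example.
UNICAST_CONF = """\
driftfile /var/lib/ntp/drift
server 0.rhel.pool.ntp.org iburst
# server commented.out.example
"""

MULTICAST_CONF = """\
driftfile /var/lib/ntp/drift
multicastclient 224.0.1.1
"""

print(old_style_check(UNICAST_CONF))    # accepted by the old test
print(old_style_check(MULTICAST_CONF))  # rejected, although ntpd can sync with it
```

The second config has no "server" line at all, so any test built on that directive fails it regardless of whether the node's clock is actually synchronized.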

Version-Release number of selected component (if applicable):
2.43 and earlier

How reproducible:
100%


Comment 1 Jeff Vance 2015-02-22 18:02:08 UTC
version 2.44 replaces the ntp.conf validation function with 2 new functions:
1) ntp_running(), which checks that ntpd is running and is configured to start on reboot.
2) ntp_time_sync_check(), which accepts a list of nodes and checks that the time drift among the nodes is within hard-coded limits. Currently, if the max drift across the cluster is 5 or more seconds then an error is reported and the calling script will exit. If the max drift is less than 1 second the cluster is considered to be in sync. If the max drift is between 1 and 5 seconds a warning is reported and the calling script continues to execute. The values 1 and 5 are constants defined in the ntp_time_sync_check() function.
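
The threshold logic described in 2) can be sketched in Python as follows. This is an illustration only, not the script's code: the function names are hypothetical, and it assumes each node's clock has already been collected as epoch seconds (for example with `date +%s`), which the comment does not specify.

```python
# Limits taken from the comment above: under 1 second is in sync,
# 1 up to (but not including) 5 seconds is a warning, 5 or more is an error.
WARN_SECONDS = 1
ERROR_SECONDS = 5


def max_drift(node_times):
    """Largest difference, in seconds, between any two node clocks.

    node_times maps a node name to its clock reading in epoch seconds
    (an assumed input format).
    """
    values = list(node_times.values())
    if not values:
        return 0
    return max(values) - min(values)


def classify_drift(node_times):
    """Return 'in sync', 'warning', or 'error' for the cluster."""
    drift = max_drift(node_times)
    if drift >= ERROR_SECONDS:
        return "error"
    if drift >= WARN_SECONDS:
        return "warning"
    return "in sync"
```

A caller would exit on "error" and continue on "warning". Readings taken over ssh are not simultaneous, so the measured drift includes connection latency.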

Comment 11 Martin Kudlej 2015-03-11 14:35:55 UTC
We tried an installation with 28 machines and this check failed. I suggest solving this differently than with serial or parallel ssh connections running the date command, or by reimplementing the NTP protocol. I also think this cannot be fixed by increasing the tolerance of the time check as currently implemented, because creating an ssh connection can be quite slow. Another argument is that we support more than 28 machines (I think Gluster can be configured with 128 machines).

Suggestion:
1. State in the documentation that a configured and functional NTP is highly recommended on all machines, and
2. check NTP with the "ntpstat" command. On error, raise only a warning that ntpd is not running and that the time should be the same on all nodes.

Comment 13 Jeff Vance 2015-03-12 06:46:37 UTC
Martin, I agree that ntpstat might be the best solution. What are your thoughts on using ntpstat vs. ntpq -p?

Comment 14 Martin Kudlej 2015-03-12 13:15:52 UTC
I think ntpstat is enough for getting ntp status.

Comment 15 Jeff Vance 2015-03-16 23:47:05 UTC
Martin, re. comment #14: Is ntpstat included in RHS 3.x and RHEL 6.5+? In other words, will ntpstat be available, or do the scripts need to yum install it first? I'd prefer not to install ntpstat if possible.

Comment 16 Divya 2015-03-17 10:51:43 UTC
Jeff,

Could you please provide the doc text in the Doc Text field for this bug?

Comment 21 Jeff Vance 2015-03-20 19:36:47 UTC
Re-fixed in 2.47. No longer are the times on each node in the cluster compared. Instead ntpstat is run on each node and if it exits with 0 on each node then the cluster is considered to be in time sync. The output of ntpstat is recorded in the log file. The code is structured such that if we decide later to compare times across the cluster only the ntp_time_sync_check() function should need to be modified.
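
A rough Python sketch of the per-node ntpstat approach is shown below. It is a stand-in for illustration, not the 2.47 script: the names `run_ntpstat` and `cluster_in_sync` are hypothetical, and it assumes passwordless ssh to each node. ntpstat exits with 0 when the clock is synchronised, 1 when it is not, and 2 when the state cannot be determined (for example, ntpd is not reachable).

```python
import subprocess


def run_ntpstat(node):
    """Run ntpstat on a node over ssh and return (exit_code, output).

    Assumption: passwordless ssh to the node is already configured.
    """
    proc = subprocess.run(
        ["ssh", node, "ntpstat"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def cluster_in_sync(nodes, run=run_ntpstat, log=print):
    """True when ntpstat exits 0 on every node.

    Each node's ntpstat output is passed to `log`, mirroring the
    behaviour of recording it in the log file.
    """
    unsynced = []
    for node in nodes:
        code, output = run(node)
        log("{}: ntpstat exit {}: {}".format(node, code, output.strip()))
        if code != 0:
            unsynced.append(node)
    return len(unsynced) == 0
```

The `run` parameter lets the check be exercised without real hosts, and keeps the comparison strategy in one place if cross-node time comparison is added later.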

Comment 24 Daniel Horák 2015-03-25 15:14:43 UTC
Tested on Red Hat Storage Server 3.0 Update 4 with rhs-hadoop-install-2_47-1.el6rhs.noarch.

Current solution with ntpstat works well.

>> VERIFIED

Comment 26 Jeff Vance 2015-03-27 17:06:14 UTC
I changed the req_doc flag to "-" since this is already included in the existing install doc, Ch 7.6 "Troubleshooting" (though the header has the incorrect chapter number).

Comment 28 errata-xmlrpc 2015-03-31 10:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0761.html

