Description of problem:
setup_cluster, create_vol, and enable_vol invoke a function that validates the /etc/ntp.conf file and reports an error if that file does not pass the test for a "valid" ntp.conf file. The validation checks that the file contains one or more "server" directives and that ntpdate succeeds against at least one of those servers. A Red Hat customer configures an ntp.conf file using "multicastclient" rather than "server", which causes the scripts above to exit with an error. Many sophisticated NTP configurations exist, including supplying ntpd its config options directly and bypassing ntp.conf altogether. Therefore, our scripts should not attempt to validate ntp.conf, but should instead check that the time drift across the cluster is within a reasonable tolerance.

Version-Release number of selected component (if applicable): 2.43 and earlier

How reproducible: 100%
Version 2.44 replaces the ntp.conf validation function with two new functions:
1) ntp_running(), which checks that ntpd is running and is enabled to start on reboot.
2) ntp_time_sync_check(), which accepts a list of nodes and checks that the time drift among them is within hard-coded limits. Currently, if the max drift across the cluster is 5 or more seconds, an error is reported and the calling script exits. If the max drift is less than 1 second, the cluster is considered in sync. If the max drift is between 1 and 5 seconds, a warning is reported and the calling script continues to execute. The thresholds 1 and 5 are constants defined in the ntp_time_sync_check() function.
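The threshold logic described above could be sketched roughly as follows. This is a minimal illustration, not the actual rhs-hadoop-install code: the function names come from the comment, but the variable names and the choice to pass node times in as epoch seconds (rather than gathering them over ssh) are assumptions made to keep the sketch self-contained.

```shell
# Warning and error thresholds (seconds), per the 2.44 description.
WARN_SEC=1   # drift >= 1s: warn, but the caller continues
ERR_SEC=5    # drift >= 5s: error, the caller exits

# Sketch of ntp_running: ntpd is up and enabled on reboot (RHEL 6 style).
ntp_running() {
  service ntpd status >/dev/null 2>&1 && chkconfig ntpd >/dev/null 2>&1
}

# Sketch of ntp_time_sync_check: args are per-node times in epoch seconds.
ntp_time_sync_check() {
  local t min max
  min=$1; max=$1
  for t in "$@"; do
    (( t < min )) && min=$t
    (( t > max )) && max=$t
  done
  local drift=$(( max - min ))
  if (( drift >= ERR_SEC )); then
    echo "ERROR: max drift ${drift}s >= ${ERR_SEC}s"; return 1
  elif (( drift >= WARN_SEC )); then
    echo "WARN: max drift ${drift}s"; return 0
  fi
  echo "OK: max drift ${drift}s"; return 0
}
```

For example, `ntp_time_sync_check 100 103 101` reports a warning (3s drift), while a 6-second spread reports an error and returns nonzero so the calling script can exit.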
We tried an installation with 28 machines and this check failed. I suggest solving this issue differently than with serial/parallel ssh connections running the date command, or by reimplementing the NTP protocol. I also think this cannot be fixed by increasing the tolerance in the time check as it is implemented now, because creating an ssh connection can be quite slow. Another argument is that we support more than 28 machines (I think Gluster can be configured with 128 machines). Suggestion: 1. Document that configured and functional NTP is highly recommended on all machines, and 2. try to check NTP with the "ntpstat" command. On error, raise only a warning that ntpd is not running and that the time should be the same on all nodes.
Martin, I agree that ntpstat might be the best solution. What are your thoughts on using ntpstat vs. ntpq -p?
I think ntpstat is enough for getting ntp status.
Martin, re. comment #14: Is ntpstat included in RHS 3.x and RHEL 6.5+? In other words, will ntpstat be available, or do the scripts need to yum install it first? I'd prefer not to install ntpstat if possible.
Jeff, Could you please provide the doc text in the Doc Text field for this bug?
Re-fixed in 2.47. The times on each node in the cluster are no longer compared. Instead, ntpstat is run on each node, and if it exits with 0 on every node the cluster is considered to be in time sync. The output of ntpstat is recorded in the log file. The code is structured so that if we later decide to compare times across the cluster, only the ntp_time_sync_check() function should need to be modified.
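The per-node ntpstat check could look roughly like the sketch below. This is an assumption-laden illustration, not the shipped 2.47 code: the `RSH` indirection (defaulting to ssh, overridable for local testing) and the exact log/error messages are invented here; only the function name and the "ntpstat exit 0 on every node" rule come from the comment above.

```shell
# RSH defaults to ssh; it can be overridden, e.g. for local testing.
: "${RSH:=ssh}"

# Sketch of ntp_time_sync_check: args are node names. Runs ntpstat on
# each node; ntpstat exits 0 only when the clock is synchronized.
ntp_time_sync_check() {
  local node rc out err=0
  for node in "$@"; do
    rc=0
    out=$($RSH "$node" ntpstat 2>&1) || rc=$?
    echo "$node: $out"                 # record ntpstat output in the log
    if (( rc != 0 )); then
      echo "ERROR: $node is not in time sync (ntpstat rc=$rc)"
      err=1
    fi
  done
  return $err
}
```

Because all the "is the clock in sync?" logic lives in this one function, switching back to comparing times across the cluster later would only require replacing its body.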
Tested on Red Hat Storage Server 3.0 Update 4 with rhs-hadoop-install-2_47-1.el6rhs.noarch. The current solution with ntpstat works well. >> VERIFIED
I changed the req_doc flag to "-" since this is already covered in the existing install doc, Ch 7.6 "Troubleshooting" (though the header has the incorrect chapter number).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0761.html