From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021130

Description of problem:
After upgrading a rh 7.2 dual athlon server to 8.0 ntp won't synchronize with external servers because it detects a high jitter, even with a server next to it on the network. It has been pointed out in comp.protocols.time.ntp that the redhat 8.0 release notes contain:

<blockquote>
HZ=512 on i686 and Athlon means that the system clock ticks 5 times as fast as on other x86 platforms (i386 and i586); HZ=100 has been the Linux default on x86 platforms for the entire history of the Linux kernel. This change provides better interactive response, lower latency response from some programs, and better response from the scheduler. We have adjusted the /proc file system to report numbers as if using the default HZ=100.
</blockquote>

This may be what is causing ntp to detect this high jitter:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        LOCAL(0)        10 l    2   64  377    0.000    0.000   0.008
 otc2.psu.edu    ntp2.usno.navy.  2 u   60   64  377  156.826  10325.1 1415.94
 proxy.cc.vt.edu gps1.tns.its.ps  2 u    1   64  377   13.307  11522.6 1369.69
 p1.selectacast. otc2.psu.edu     3 u   62   64  377    0.189  10237.7 1453.45

The last one is next to it on the network. A similar dual xeon machine that was upgraded from 7.2 to 8.0 does not show the same problem. A really bad hardware clock may contribute to the problem, but it worked fine under 7.2.

Version-Release number of selected component (if applicable):
ntp-4.1.1a-9

How reproducible:
Always

Steps to Reproduce:
1. Try to run ntpd on a dual athlon machine

Actual Results:
ntp chose to sync with itself instead of any of the external machines.

Expected Results:
ntp should have synced with the external machines.

Additional info:
This is a tyan motherboard.
<pre>
# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) MP 1800+
stepping        : 2
cpu MHz         : 1526.422
cache size      : 256 KB
Physical processor ID : 0
Number of siblings : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3038.00

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1526.422
cache size      : 256 KB
Physical processor ID : 0
Number of siblings : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3046.10
</pre>

I'm marking this as high because the bad clock is causing me serious problems.
Do you use -x? What do /etc/ntp.conf and /etc/sysconfig/ntpd look like? Does it help to recompile and use the ntp from 7.2 on 8.0?
No, I'm not using -x. I'm using the standard startup script installed by the rpm. I don't see how -x would change anything: that specifies slewing vs. stepping, and I can't get it to sync in the first place. I haven't tried building old versions; when I do I'll get back to you.
ok, because -x makes things worse in 8.0 :-(
I tried versions of ntp from 7.3 and 7.2. The version from 7.3 was about the same, and the version from 7.2 is worse:

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        LOCAL(0)        10 l   54   64   37    0.000    0.000   0.008
 otc2.psu.edu    gps1.tns.its.ps  2 u    1   64   77   18.226  1634.44 3445.05
 proxy.cc.vt.edu navobs1.gatech.  2 u   60   64   37   10.697  1677.09 2743.05
 p1.selectacast. otc2.psu.edu     3 u   65   64   37    0.173  4380.14 2664.43
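A note for anyone comparing these listings: the reach column is an 8-bit shift register printed in octal, so 377 means all of the last eight polls were answered, while 37 and 77 only mean that five or six polls have completed since ntpd was restarted. The small Python sketch below decodes the values seen so far. The function name decode_reach is made up for illustration and is not part of ntpq.

```python
def decode_reach(reach_octal):
    """Return the outcome of the last eight polls, most recent first.

    reach_octal is the string shown in the ntpq "reach" column.
    (Hypothetical helper for illustration only.)
    """
    value = int(reach_octal, 8)
    # Bit 0 is the most recent poll; bit 7 is the oldest one remembered.
    return [bool((value >> bit) & 1) for bit in range(8)]


for reach in ("37", "77", "377"):
    history = decode_reach(reach)
    print(f"reach {reach:>3}: {sum(history)} of the last 8 polls answered")
```

So a low reach right after a restart says nothing about the server being unreachable; it only says how long ntpd has been running.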
err, as you can see, ntpd chose your LOCAL clock to trust. Why don't you put a reliable server ip in /etc/ntp/step-tickers and restart ntpd?

# service ntpd restart

Also p1.selectacast. seems to be way out of sync... remove it...
also note that ntpd chooses the preferred server after 3 minutes... so you have to wait after a restart until something happens
>err, as you can see, ntpd chose your LOCAL clock to trust.

Exactly, that's what the bug is. otc2.psu.edu and proxy.cc.vt.edu are reliable ntp servers. And p1 is not out of sync. It just shows up that way in that snapshot. It might be a symptom of the problem that makes ntpd think the jitter is so high. Here is a more recent series of snapshots. p1, which has a very low delay, has the same jitter and offset as the stratum 2 servers:

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        LOCAL(0)        10 l   41   64    1    0.000    0.000   0.008
 otc2.psu.edu    ntp2.usno.navy.  2 u   51   64    1   18.401  272.071   0.008
 proxy.cc.vt.edu gps1.tns.its.ps  2 u   51   64    1   12.214  285.203   0.008
 p1.selectacast. otc2.psu.edu     3 u   52   64    1    0.161  250.357   0.008

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        LOCAL(0)        10 l   24   64    3    0.000    0.000   0.008
 otc2.psu.edu    ntp2.usno.navy.  2 u   37   64    3   17.980  1633.11 1361.04
 proxy.cc.vt.edu gps1.tns.its.ps  2 u   36   64    3   12.063  1668.35 1383.15
 p1.selectacast. otc2.psu.edu     3 u   36   64    3    0.153  1656.76 1406.40

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        LOCAL(0)        10 l   10   64   77    0.000    0.000   0.008
 otc2.psu.edu    ntp2.usno.navy.  2 u   22   64   77   18.875  7194.44 1402.61
 proxy.cc.vt.edu gps1.tns.its.ps  2 u   18   64   77   12.240  7290.64 1355.58
 p1.selectacast. otc2.psu.edu     3 u   20   64   77    0.214  7239.38 1422.10
hmm... delay to the stratum 2 servers is very high... what happens if you remove LOCAL?
18 is high? Here is output after I restarted ntp without local. Notice how after reach 1 the jitter is low, but builds over time.

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 otc2.psu.edu    otc1.psu.edu     2 u    4   64    1   18.346  343.673   0.008
 proxy.cc.vt.edu tick.usno.navy.  2 u    1   64    1   12.266  421.555   0.008
 p1.selectacast. otc2.psu.edu     3 u    9   64    1    0.167  237.928   0.008

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 otc2.psu.edu    otc1.psu.edu     2 u   31   64    3   18.812  1719.21 1375.54
 proxy.cc.vt.edu tick.usno.navy.  2 u   25   64    3   10.202  1860.50 1438.95
 p1.selectacast. otc2.psu.edu     3 u   33   64    3    0.200  1675.74 1437.81

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 otc2.psu.edu    otc1.psu.edu     2 u   54   64    7   18.221  3070.49 1351.28
 proxy.cc.vt.edu tick.usno.navy.  2 u   45   64    7   11.533  3279.85 1419.34
 p1.selectacast. otc2.psu.edu     3 u   55   64    7    0.191  3050.96 1375.21
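One thing these three snapshots suggest: for every peer, the reported jitter is almost exactly the difference between that peer's offset and its offset in the previous snapshot. That would mean the "jitter" here is not network noise at all, but the local clock falling steadily behind between polls. The Python sketch below just redoes that subtraction on the numbers copied from the listings. It assumes the samples are one poll interval (64 s) apart, which is an approximation, and the helper names are made up for illustration.

```python
POLL_INTERVAL_S = 64.0  # assumption: samples are one "poll" interval apart

# Offsets (ms) from the three snapshots, in order.
offsets_ms = {
    "otc2.psu.edu":    [343.673, 1719.21, 3070.49],
    "proxy.cc.vt.edu": [421.555, 1860.50, 3279.85],
    "p1.selectacast.": [237.928, 1675.74, 3050.96],
}

# Jitter (ms) reported in the second and third snapshots.
reported_jitter_ms = {
    "otc2.psu.edu":    [1375.54, 1351.28],
    "proxy.cc.vt.edu": [1438.95, 1419.34],
    "p1.selectacast.": [1437.81, 1375.21],
}


def offset_steps(samples):
    """Differences between consecutive offset samples, in ms."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def drift_ppm(step_ms, interval_s=POLL_INTERVAL_S):
    """Clock drift implied by an offset change of step_ms over interval_s."""
    # (ms / s) is parts per thousand; multiply by 1000 to get parts per million.
    return step_ms / interval_s * 1000.0


for peer, samples in offsets_ms.items():
    for step, jitter in zip(offset_steps(samples), reported_jitter_ms[peer]):
        print(f"{peer:16} offset step {step:8.2f} ms   "
              f"reported jitter {jitter:8.2f} ms   ~{drift_ppm(step):6.0f} ppm")
```

Under the 64 s assumption the offset grows by roughly 1.35 to 1.44 seconds per poll, which works out to something on the order of 21,000 to 22,500 ppm of drift against all three servers at once.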
wow, this looks really bad.. can you retry with the latest rawhide version?
I tried ntp-4.1.2-0.rc1.2.i386.rpm from rawhide, same problem. If the problem is in the kernel then changing ntp won't help much. I have a cron script that sets the time via ntp every 5 minutes, and it sets the time forward 6.2 to 6.4 seconds every time, so that's how bad the hardware clock is. The question is how come ntp thinks the jitter is so high?
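To put the cron numbers in perspective, here is a rough Python calculation of the drift they imply. It assumes the 6.2 to 6.4 seconds is purely accumulated drift over each 5-minute interval, and it compares the result with the 500 ppm frequency error that, as far as I know, is the most ntpd will try to correct. The constant and function names are made up for illustration.

```python
NTPD_MAX_FREQ_PPM = 500.0  # ntpd's frequency tolerance, to my understanding
CRON_INTERVAL_S = 5 * 60   # the cron job sets the time every 5 minutes


def drift_ppm(correction_s, interval_s=CRON_INTERVAL_S):
    """Drift in parts per million implied by a correction over an interval."""
    return correction_s / interval_s * 1e6


for correction_s in (6.2, 6.4):
    ppm = drift_ppm(correction_s)
    print(f"{correction_s} s per 5 min = {ppm:.0f} ppm, "
          f"about {ppm / NTPD_MAX_FREQ_PPM:.0f} times the 500 ppm limit")
```

A clock losing more than 20,000 ppm is far outside what ntpd can discipline, so whatever changed between 7.2 and 8.0 seems to have made the system clock itself run slow, not just confused ntp.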
because of your bad hw clock?
That's what ntp is for: compensating for bad hardware clocks. I had the same hardware clock in 7.2 but ntp still worked.
I visited the machine in the datacenter, and there were a lot of messages like this on the screen:

set_rtc_mmss: can't update from 8 to 56

The messages didn't appear in the logs anywhere. After rebooting the machine, ntp worked for 8 1/2 hours and then:

ntpd[740]: synchronisation lost

The peers command shows the same jitter again.
I've seen this kind of problem on 7.3, 8.0 and 9 systems where a gnome-battery applet was running in the panel. Is this machine a laptop by chance?
I don't own any dual athlon laptops, do you?
please retry with the latest rawhide version
I've since upgraded this machine to redhat 9 and it doesn't have the problem anymore. Do you still want me to try the rawhide version?
no, if it works in 9, that proves my patches are good and the version in 9 is all right. Thank you!