Description of problem:
NTP will state that it is running, will show that it is bound to our NTP server, and for the first few seconds after a "service ntpd restart", the time will be correct. However, within a short period of time (varies from system to system), the time will be so far ahead as to be unreasonable for normal operation. The skew is well beyond normally encountered levels: we are talking minutes per hour.

Version-Release number of selected component (if applicable):
ntp-4.2.0.a.20040617-4 [as reported via "rpm -q ntp"]

How reproducible:
Very.

Steps to Reproduce:
1. Build an HP DL385 with RHEL 4 (Nahant).
2. Modify /etc/ntp.conf to point to a valid NTP server.
3. Start (or restart) the NTP service via "service ntpd restart".
4. sleep 600
5. Compare the local time to the time of a known good server. The difference ranges from a few dozen seconds to a few dozen minutes.

Actual results:
Time skew ranging from a few dozen seconds to a few dozen minutes, as stated above.

Expected results:
No skew, or at least none measurable via the "date" command.

Additional info:
This states it best: output from a quick check. For each host, the first time is the true time (from the NTP server), the second is the time on the RHEL4 system, and the last line verifies that NTP is running.

Working on jc1lmust1
============================
Fri Aug 12 12:23:00 EDT 2005
Fri Aug 12 12:50:26 EDT 2005
ntpd (pid 22765) is running...

Working on jc1lmust2
============================
Fri Aug 12 12:23:01 EDT 2005
Fri Aug 12 12:27:29 EDT 2005
ntpd (pid 9945) is running...

Working on jc1lmust3
============================
Fri Aug 12 12:23:01 EDT 2005
Fri Aug 12 12:25:21 EDT 2005
ntpd (pid 28859) is running...

Working on jc1lmust4
============================
Fri Aug 12 12:23:01 EDT 2005
Fri Aug 12 12:29:42 EDT 2005
ntpd (pid 3487) is running...

Working on jc1lnpmd1
============================
Fri Aug 12 12:23:02 EDT 2005
Fri Aug 12 12:23:58 EDT 2005
ntpd (pid 12376) is running...

Working on jc1lnpmd2
============================
Fri Aug 12 12:23:02 EDT 2005
Fri Aug 12 12:25:32 EDT 2005
ntpd (pid 32398) is running...

Working on jc1lnpmd3
============================
Fri Aug 12 12:23:02 EDT 2005
Fri Aug 12 12:25:58 EDT 2005
ntpd (pid 5379) is running...

Working on jc1lnpmd4
============================
Fri Aug 12 12:23:02 EDT 2005
Fri Aug 12 12:24:57 EDT 2005
ntpd (pid 29184) is running...

Working on jc1lnpmddev1
============================
Fri Aug 12 12:23:03 EDT 2005
Fri Aug 12 12:30:42 EDT 2005
ntpd (pid 2170) is running...

The NTP config file is the same for all 9 servers involved:

server jc1time1
server jc1time2
server jc1time3
driftfile /var/lib/ntp/drift

All of our other UNIX and pure x86 (DL360/380) systems sync up correctly, so the issue is not with the NTP server. If I restart NTP, the time will be correct for a short while.
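For anyone wanting to quantify the skew from the paired "date" outputs in the description, here is a minimal Python sketch. The function names are made up for illustration, and it assumes both readings are in the same time zone (as they are in the report), so the zone token is simply dropped.

```python
from datetime import datetime


def parse_date_output(line: str) -> datetime:
    """Parse `date` output such as 'Fri Aug 12 12:23:00 EDT 2005'.

    The time zone token (5th field) is dropped; both readings are
    assumed to be in the same zone.
    """
    fields = line.split()
    without_tz = " ".join(fields[:4] + fields[5:])
    return datetime.strptime(without_tz, "%a %b %d %H:%M:%S %Y")


def skew_seconds(true_time: str, local_time: str) -> int:
    """Seconds by which the local clock is ahead of the reference clock."""
    delta = parse_date_output(local_time) - parse_date_output(true_time)
    return int(delta.total_seconds())


# Two of the host readings from the report: (reference time, local time).
readings = {
    "jc1lmust1": ("Fri Aug 12 12:23:00 EDT 2005", "Fri Aug 12 12:50:26 EDT 2005"),
    "jc1lnpmd1": ("Fri Aug 12 12:23:02 EDT 2005", "Fri Aug 12 12:23:58 EDT 2005"),
}

for host, (reference, local) in readings.items():
    print(host, skew_seconds(reference, local), "seconds ahead")
```

A positive result means the local clock is ahead of the reference, which is the direction reported on every affected host.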
This might shed some light, from the console of the system:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x20/0x23
Another finding that could be related to this: during system reboot, a notice appears complaining that the real CPU frequency is 2200MHz (which is true), but that cpufreq reported it as something less. On the console:

powernow-k8: error - out of sync, fid 0xa 0xe, vid 0xc 0x8
Warning: CPU frequency is 2200000, cpufreq assumed 2000000 kHz.
powernow-k8: error - out of sync, fid 0xa 0xe, vid 0xc 0x8

On some of the affected systems, cpufreq figured the speed to be 1800000, not 2000000. Could this be related?
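As a rough plausibility check on the question above, here is a back-of-the-envelope Python sketch. It assumes a deliberately simplified model, not the actual 2.6.9 timekeeping code: elapsed time is derived from a cycle counter using an assumed frequency while the counter really ticks at a different one. The function name is hypothetical.

```python
def drift_seconds_per_hour(actual_khz: int, assumed_khz: int) -> float:
    """Seconds gained per real hour under the simplified model.

    If cycles are counted at actual_khz but converted to time with
    assumed_khz, the clock runs at actual_khz / assumed_khz of real speed.
    """
    return 3600.0 * (actual_khz / assumed_khz - 1.0)


# Frequencies taken from the console warning quoted above.
for assumed in (2000000, 1800000):
    gained = drift_seconds_per_hour(2200000, assumed)
    print(f"assumed {assumed} kHz: about {gained:.0f} s gained per hour")
```

Under this model a 2200000 vs 2000000 kHz mismatch would gain several minutes per hour, the same order of magnitude as the skew described in the report. That makes the mismatch a plausible suspect, but it does not prove the cause.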
I was wondering if there is some guidance regarding when this bug might be examined. I would say that it is quite important, since NTP is failing to fix time drift - and most likely making it worse. (This problem does not occur on ES3 update 5 for pure x86.)
Has anyone from the Red Hat side even taken a look at this bug submission?
Sigh. Red Hat, come on - this poor guy is struggling. Maybe this is relevant: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=312df5f1a1da780e084b328bcabb02a6dcd044c3 Basically, some timekeeping facilities are not available in Opteron system chipsets, and the appropriate fallback is not available either, which the above kernel patch fixes. I do get the TSC drift errors on DL380s (searching for the cause led me here), but there's a fallback available there and the NTP time appears OK (though it's depressing to think I may have much lower time resolution than I should be getting). Hope this helps.
I'm having this same problem. Running RHEL 4 64bit on a HP DL385.
Reassigning to kernel component.
what kernel version is being used?
2.6.9-5.ELsmp (RHES4 [Nahant])
kernel-smp-2.6.9-5.EL
FYI - the following appears to have fixed the issue on my machine:

1. Add clock=pmtmr to the kernel options.
2. Turn off cpuspeed (chkconfig cpuspeed off).
3. Reboot.

I did this about 6 hours ago and haven't lost a second since. Not sure whether only one of the changes was needed or both, but I'm running with both.

-Piers.
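Step 1 above can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes a GRUB legacy style config where each boot entry has a line starting with "kernel", the helper name is made up, and the sample entry is hypothetical rather than copied from an affected machine. It works on text only and does not touch any file.

```python
def add_kernel_option(grub_conf: str, option: str = "clock=pmtmr") -> str:
    """Append `option` to every `kernel` line that does not already have it."""
    updated = []
    for line in grub_conf.splitlines():
        words = line.split()
        if words[:1] == ["kernel"] and option not in words:
            line = line.rstrip() + " " + option
        updated.append(line)
    return "\n".join(updated) + "\n"


# Hypothetical boot entry, for illustration only.
SAMPLE = """\
title Red Hat Enterprise Linux ES (2.6.9-5.ELsmp)
    root (hd0,0)
    kernel /vmlinuz-2.6.9-5.ELsmp ro root=LABEL=/
    initrd /initrd-2.6.9-5.ELsmp.img
"""

print(add_kernel_option(SAMPLE))
```

The check against the existing words keeps the edit idempotent, so running it twice does not add the option twice. Back up the real config before applying anything like this to it.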
OK, we've defaulted the U2 kernel and higher to use the PM timer because of these types of issues, so just upgrading the kernel will likely solve this issue.
What timesource is being used on the system that won't keep time?

dmesg | grep time.c

HPET, PMTIMER, and PIT are the only timesources that can reliably be used on an Opteron-based system, especially with Powernow. TSC is notorious for causing time skew. If the timesource is "TSC", can the reporter verify that the issue is resolved when "notsc" is used as a boot arg?
-bash-3.00$ ssh jc1uadmin1 date ; date ; uname -a ; dmesg | grep time.c
cspeare@jc1uadmin1's password:
Wed Mar 15 16:30:12 EST 2006
Wed Mar 15 16:30:12 EST 2006
Linux jc1liniom5 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2396.879 MHz processor.
time.c: Using PIT/TSC based timekeeping.
-bash-3.00$

As you can see, the time issue isn't present, and no boot-time options are set. However, we had disabled cpuspeed (as part of the suggestion in comment 13 from Piers Wren).
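For checking many hosts at once, the "time.c" lines can be parsed with a few lines of Python. This is a sketch with made-up function names, and it only assumes the one message format shown in the output above ("time.c: Using ... based timekeeping."); other kernels or timesources may word the message differently.

```python
import re
from typing import Optional


def timekeeping_source(dmesg_text: str) -> Optional[str]:
    """Return the source named in 'time.c: Using X based timekeeping', if any."""
    match = re.search(r"time\.c: Using (\S+) based timekeeping", dmesg_text)
    return match.group(1) if match else None


def uses_tsc(dmesg_text: str) -> bool:
    """True if the reported timekeeping source includes TSC."""
    source = timekeeping_source(dmesg_text)
    return source is not None and "TSC" in source.upper().split("/")


# The three time.c lines from the output above.
SAMPLE = """\
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2396.879 MHz processor.
time.c: Using PIT/TSC based timekeeping.
"""

print(timekeeping_source(SAMPLE), uses_tsc(SAMPLE))
```

A host where `uses_tsc` is true and cpuspeed is still enabled would be a candidate for the "notsc" or clock=pmtmr workarounds discussed in this bug.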
So disabling Powernow solved the problem for you? If so, the problem is almost certainly due to Powernow/TSC interaction. Booting with "notsc" as a boot arg or disabling Powernow should solve this issue. PMTimer is the preferred timesource on Opteron systems using Powernow.
Yes, disabling powernow was the fix we deployed. At this point, what is the plan for new updates on the ES4 line? If an Opteron is detected, would the better solution be to append "notsc" for the boot args (most likely the easiest solution), or to "chkconfig cpuspeed off" (which might yield unexpected results for folks using desktops, where cpuspeed makes more sense)?
A change recently went into RHEL4 that altered the ordering of timesource selection. The latest RHEL4 should pick PMTimer as a timesource instead of TSC so as to avoid this issue. Also, future RHEL4 releases will likely have Powernow disabled by default, leaving the user to enable it if so desired.
*** Bug 190919 has been marked as a duplicate of this bug. ***
We had the same problem. Turning off the cpuspeed service and booting kernel with clock=pmtmr seems to have fixed it.