Bug 165826
Summary: | NTP cannot keep clock in sync on HP DL385-type systems. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Carl Speare <carl> |
Component: | kernel | Assignee: | Brian Maly <bmaly> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | jbaron, john, mlichvar, shillman, ted.grzesik, unixgroup, wrenp |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2006-03-15 22:04:14 UTC | Type: | --- |
Description
Carl Speare
2005-08-12 16:32:04 UTC
This might shed some light, from the console of the system -

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x20/0x23

Another finding that could be related to this - It appears that during system reboot, a notice appears complaining that the CPU real frequency is 2200MHz (which is true), but that cpufreq reported it as something less. On the console:

powernow-k8: error - out of sync, fid 0xa 0xe, vid 0xc 0x8
Warning: CPU frequency is 2200000, cpufreq assumed 2000000 kHz.
powernow-k8: error - out of sync, fid 0xa 0xe, vid 0xc 0x8

On some of the systems affected, cpufreq figured the speed to be 1800000, not 2000000. Could this be related?

I was wondering if there is some guidance regarding when this bug might be examined. I would say that it is quite important, since NTP is failing to fix time drift - and most likely making it worse. (This problem does not occur on ES3 update 5 for pure x86.)

Has anyone from the Red Hat side even taken a look at this bug submission?

Sigh. RedHat, come on - this poor guy is struggling. Maybe this is relevant:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=312df5f1a1da780e084b328bcabb02a6dcd044c3

Basically, some facilities are not available in Opteron system chipsets for timekeeping, and the appropriate fallback is not available either - which the above kernel patch fixes. I do get the TSC drift errors on DL380s (searching for the cause led me here), but there's a fallback available there and the NTP time appears ok (though it's depressing to think I may have much lower time resolution than I should be getting). Hope this helps.

I'm having this same problem. Running RHEL 4 64bit on a HP DL385.

Reassigning to kernel component.

What kernel version is being used?

2.6.9-5.ELsmp (RHES4 [Nahant])
kernel-smp-2.6.9-5.EL

FYI - the following appears to have fixed the issue on my machine:

1. Add clock=pmtmr to the kernel options
2. Turn off cpuspeed (chkconfig cpuspeed off)
3. Reboot

I did this about 6 hours ago and haven't lost a second since. Not sure if only one change was needed or both, but I'm running both. -Piers.

OK, we've defaulted the U2 kernel and higher to use pmtimer because of these types of issues. So likely just upgrading the kernel will solve this issue.

What timesource is being used on the system that won't keep time?

dmesg | grep time.c

HPET, PMTIMER, and PIT are the only timesources that can reliably be used on an Opteron based system, especially with powernow. TSC is notorious for causing time skew. If the timesource is "TSC", can the reporter of this issue verify this issue is resolved if "notsc" is used as a boot arg?

-bash-3.00$ ssh jc1uadmin1 date ; date ; uname -a ; dmesg | grep time.c
cspeare@jc1uadmin1's password:
Wed Mar 15 16:30:12 EST 2006
Wed Mar 15 16:30:12 EST 2006
Linux jc1liniom5 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2396.879 MHz processor.
time.c: Using PIT/TSC based timekeeping.
-bash-3.00$

As you can see, the time issue isn't present; no boot-time options are present. However, we had disabled cpuspeed (as part of the suggestion in comment 13 from Piers Wren).

So disabling Powernow solved the problem for you? If so, the problem is almost certainly due to Powernow/TSC interaction. Booting with "notsc" as a boot arg or disabling powernow should solve this issue. PMTimer is the preferred timesource on Opteron systems using Powernow.

Yes, disabling powernow was the fix we deployed. At this point, what is the plan for new updates on the ES4 line? If an Opteron is detected, would the better solution be to append "notsc" for the boot args (most likely the easiest solution), or to "chkconfig cpuspeed off" (which might yield unexpected results for folks using desktops, where cpuspeed makes more sense)?
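
For anyone repeating the `dmesg | grep time.c` check across several machines, here is a minimal sketch of how it might be scripted. It assumes the boot log uses the `time.c: Using ... based timekeeping.` wording seen in the console output in this thread; other kernels may word the message differently. The function names `timesource_from_log` and `uses_tsc` are hypothetical helpers, not part of any Red Hat tool.

```shell
#!/bin/sh
# Sketch only: parses RHEL4-era "time.c:" boot messages as quoted in this bug.
# Function names are hypothetical; the message format is an assumption taken
# from the console output posted in the thread.

# Read kernel log text on stdin and print the source named in the
# "time.c: Using <source> based timekeeping." line, or "unknown" if the
# line is absent. If several such lines exist, the last one wins.
timesource_from_log() {
    sed -n 's/^.*time\.c: Using \(.*\) based timekeeping\..*$/\1/p' | tail -n 1 | {
        read -r src || src=""
        if [ -n "$src" ]; then
            echo "$src"
        else
            echo "unknown"
        fi
    }
}

# Succeed (exit status 0) when the given source name involves the TSC.
uses_tsc() {
    case "$1" in
        *TSC*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

A possible use is `src=$(dmesg | timesource_from_log); uses_tsc "$src" && echo "TSC in use: $src"`. Fed the console output quoted in this thread, the first function would report PIT/TSC.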
A change recently went in RHEL4 that changed the ordering of timesource selection. The latest RHEL4 should pick PMTimer as a timesource instead of TSC so as to avoid this issue. Also, future RHEL4 releases will likely have Powernow disabled by default, leaving the user to enable it if so desired.

*** Bug 190919 has been marked as a duplicate of this bug. ***

We had the same problem. Turning off the cpuspeed service and booting kernel with clock=pmtmr seems to have fixed it.
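
The workaround reported in this thread (boot with clock=pmtmr, turn off cpuspeed, reboot) could be prepared roughly as follows. This is a sketch under stated assumptions: it assumes a GRUB legacy style config in which boot entries have lines beginning with `kernel`, and the helper name `add_kernel_option` is hypothetical. The function only filters text from stdin to stdout, so nothing on disk changes until the result has been reviewed and copied into place by hand.

```shell
#!/bin/sh
# Sketch only: add_kernel_option is a hypothetical helper, and the assumption
# that kernel options live on lines starting with "kernel" applies to GRUB
# legacy style configs. Check your own bootloader config before using it.

# Append a boot option to every "kernel" line read on stdin, skipping lines
# that already carry the option. Output goes to stdout; no file is modified.
add_kernel_option() {
    opt=$1
    awk -v opt="$opt" '
        $1 == "kernel" {
            present = 0
            for (i = 2; i <= NF; i++) if ($i == opt) present = 1
            if (!present) $0 = $0 " " opt
        }
        { print }
    '
}

# Example session (commented out so that loading this file changes nothing;
# the grub.conf path is an assumption):
#   add_kernel_option clock=pmtmr < /boot/grub/grub.conf > /tmp/grub.conf.new
#   diff /boot/grub/grub.conf /tmp/grub.conf.new   # review before installing
#   chkconfig cpuspeed off                          # step 2 from the thread
#   reboot                                          # step 3 from the thread
```

The same filter could be used with `notsc` instead of `clock=pmtmr` to test the other boot argument suggested in the thread. Running it twice is harmless, since an option that is already present is not appended again.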