Bug 76499
Summary: | Kernel-2.4.18-17.7.x SMP time drift | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Stephen John Smoogen <smooge> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 7.3 | CC: | bje, ccoy, dedourek, djuran, fche, iglesias, lgt |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Errata | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-02-19 22:28:27 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Stephen John Smoogen
2002-10-22 15:13:33 UTC
Ok after email from developer, I am looking at DMA/IRQ changes for cdrom drives between 2.4.17-3 and 2.4.17-17. I have done the following tests to see how much time is lost /etc/init.d/ntp stop hdparm -u1 -d1 /dev/hdc check time; grip 66 minute cd ; check time and the clock lost 0 seconds. To summarize for future people needing answers to this: hdparm -u1 -d1 /dev/hdc had the clock lose 0 seconds. hdparm -u0 -d1 /dev/hdc had the clock lose 257+ seconds. hdparm -u1 -d0 /dev/hdc had the clock lose 0 seconds. hdparm -u0 -d0 /dev/hdc had the clock lose 300+ seconds. So this looks to be some sort of interaction in masked interrupts on cdrom drives. This bug appears more general than many of the comments about it. I am not running SMP. I am running the i686 kernel. Since installing kernel 2.4.18-17.7.x I have seen the kernel clock loose up to about 140 SECONDS. Although the previous kernel (2.4.18-10) had a less stable clock (wandering 100's ms) than the kernels of RedHat 7.1 (wandering 10's of ms), I did not witness the the losses of 100's of SECONDS until installing 2.4.18-17.7.2. (I am doing network measurement research and have been investigating clock stability as it affects measurements for some time. I am a "clock watcher".) Also note that I do not have a CDWriter, I have a CD reader, but am not actively using it. So far, no pattern has evolved to relate the problem exactly to some other event. The problem does appear to be more frequent if I leave the X desktop (running gnome and random screen saver) than if I log out (back to console mode; I use "startx" method for starting X, not X based login). This is a serious problem. For example, two days ago, the clock lost time just before the 4 a.m. cron schedule, and ntp dutifully set it ahead, but this caused the daily cron to fail to run. Consider that this bug is likely related to the just reported 77058 describing large scheduling delays. Sounds like a bug involving locking or disabling interrupts and failing to reenable some place in the new kernel, thus leaving much locked until some other miscellaneous event happens. Has there been any work done with this? I too am seeing these large-scale drifts. Our machine has a CD-ROM, but it's most certainly not being used. The kernel we're using is 2.4.18-18.7.x. The drift is pretty bad. We have realtime systems we're testing that are doing some pretty funky things because of it. # uname -a Linux xxx.xxxx.xx 2.4.18-17.7.x #1 Tue Oct 8 13:33:14 EDT 2002 i686 unknown # tickadj tick = 1953 It should be 10000 I suppose. I hope this may be helpfull. This occurs with the 2.4.18-18.8.0 i686 UP kernel too. While running grip against an ide-scsi device, system time() slows down by a factor of almost two on my machine. On kernel 2.4.28-19.8.0 (RH 8.0) on an i686, I'm also experiencing severe clock drift (losing at least 5s/min). System had been working as a charm until I started ripping (Grip, cdparanoia, lame/ogg) from IDE cdrom. Unmasking IRQ's for device via hdparm seems to reduce (if not eliminate) drift. Buggy HW, I suppose (Compaq Presario 17XL series, 440BX chipset). Hope it helps. Cheers! This bug is still marked as NEW. Has anything been done with it? We just enabled 10Mb/s backups on our now 2.4.18-19.7.x test server, and we're seeing drifts on the order of 1 minute per hour. The two devices being involved with backups would be an Intel Ethernet Pro 100 and a qla2200 FC card. Any updates would be most helpful. Ahah! I have been trying to identify the cause of this problem because of the need for relatively stable timebases for network measurements. I am not an experienced kernel hacker, but I have stumbled upon some source code that might illuminate this problem. It would appear that sometime between kernel update 2.4.18-10 and 2.4.18-18.7.x RedHat "slipped in" a change of HZ. (I don't currently have access to the source for 2.4.18-17.7.x, but this may have introduced the change.) The change appears to change the HZ (the rate of the clock interrupts) from the standard 100 to 512 for the athlon, athlon-smp, i686, i686-bigmem, i686-debug, and i686-smp kernels (leaving the remaining kernels at 100). This means that the clock interrupt is changed from once each 10 ms to approximately once each 1.95 ms. This would seem to reduce the maximum interrupt disable time without lost clock interrupts from 10-20 ms to about 1.95-3.9 ms, thus potentially making lost clock interrupts more probable. Perhaps someone at RedHat could identify the patch which instituted this change. I'm seeing this on a SMP system with dual p3s. The system receives syslog data from a variety of systems and network equipment and also does some authentication against a kerberos server (hence the need for accurate time) It appears to be losing at least .5 seconds in a 5 minute period, usually more. The system is running RH 7.2, and keeps time just fine with kernel 2.4.9-13smp. Installing and running the 2.4.18-24.7.xsmp kernel causes the time to go off wildly, and eventually the kerberos server stops talking to it. Mike Iglesias Univ of Calif Irvine Just to confirm another observation of this. I found a discrepancy of 1.5 HOURS on my system this week, which caused me some embarassement as I did not realise the clock was wrong at first. I have two separate systems: One a dual PIII 550 running RH 7.3 with unmodified 2.4.18-27.7.xsmp kernel, the other (at home) a Celeron 1GHz uniprocessor system running RH 7.3 and unmodified 2.4.18-27.7.x kernel. Both have seen signicant clock drifts since the kernel update (10, 20, 30 minutes). The 1.5 hours difference was so big that I initially put it down to timeserver problems and reset the clock by hand (only the dual processor system uses timeservers). I use both Zip and USB storage regularly, and CD reading. The single processor system has a CD writer, but it's not used much at present. I have to say I'm finding this problem *extremely* inconvenient. Please sort a fix for it as soon as possible! This bug seems to go away with later errata. |