Description of Problem: Ok I know it has been reported X times now, but I can confirm a large time drift with the Red Hat Linux 7.3 update 2.4.18-17.7.x SMP kernel. root:{log}# rpm -q glibc ntp kernel glibc-2.2.5-40 ntp-4.1.1-1 kernel-2.4.18-17.7.x root:{log}# I didnt notice it until yesterday when I couldnt authenticate with the Kerberos servers at work. The problem was that my clock was 43 minutes behind real time. I looked through the logs, and found that my ntp had died a couple times with Oct 21 13:15:48 smoogen1 ntpd[1192]: time reset 675.164428 s Oct 21 13:15:48 smoogen1 ntpd[1192]: synchronisation lost Oct 21 13:45:57 smoogen1 ntpd[1192]: time reset 254.183893 s Oct 21 13:45:57 smoogen1 ntpd[1192]: synchronisation lost Looking through the logs this did not start occuring until I started 'ripping' cds with grip yesterday. The cd player is an IDE cdrom writer so that the ripped data has to go through the SCSI layer. Looking at the clock time I am losing about 8 seconds a minute when things are getting ripped. smoogen:{ogg}$ lsmod Module Size Used by Tainted: PF sg 34628 1 (autoclean) sr_mod 16088 4 (autoclean) i810_audio 24448 1 (autoclean) ac97_codec 12384 0 (autoclean) [i810_audio] soundcore 6756 2 (autoclean) [i810_audio] radeon 97464 1 agpgart 41248 3 lp 8672 0 (autoclean) vmnet 27456 1 parport_pc 17636 1 parport 35616 1 [lp parport_pc] vmmon 24884 0 (unused) autofs 11652 1 (autoclean) 3c59x 29032 1 ipchains 42632 50 ide-scsi 9536 3 ide-cd 30208 0 cdrom 32384 0 [sr_mod ide-cd] mousedev 5152 1 hid 20736 0 (unused) input 6016 0 [mousedev hid] usb-uhci 25316 0 (unused) usbcore 74400 1 [hid usb-uhci] ext3 67424 3 jbd 51528 3 [ext3] raid1 15492 3 raid0 4064 1 aic7xxx 129568 8 sd_mod 12832 16 scsi_mod 108048 5 [sg sr_mod ide-scsi aic7xxx sd_mod] The box is a Dell Precision 530 smoogen:{smoogen}$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 1 model name : Intel(R) Xeon(TM) CPU 1.70GHz stepping : 2 cpu MHz : 1694.846 cache size : 256 KB Physical processor ID : 1100763704 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3365.99 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 1 model name : Intel(R) Xeon(TM) CPU 1.70GHz stepping : 2 cpu MHz : 1694.846 cache size : 256 KB Physical processor ID : 15 Number of siblings : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3382.19 Here are the numbers from doing a ntpdate (to resync the time), a check about 1 minute later, rip a cdrom, and then do a ntpdate again root:{log}# ntpdate 128.165.4.4 22 Oct 08:56:16 ntpdate[18936]: step time server 128.165.4.4 offset 323.022559 sec root:{log}# ntpdate 128.165.4.4 22 Oct 08:56:58 ntpdate[18939]: adjust time server 128.165.4.4 offset 0.000029 sec root:{log}# date ; ntpdate 128.165.4.4 Tue Oct 22 09:06:59 MDT 2002 22 Oct 09:12:12 ntpdate[18990]: step time server 128.165.4.4 offset 312.373567 sec So in 16 minutes I lost 6 minutes.
Ok after email from developer, I am looking at DMA/IRQ changes for cdrom drives between 2.4.17-3 and 2.4.17-17. I have done the following tests to see how much time is lost /etc/init.d/ntp stop hdparm -u1 -d1 /dev/hdc check time; grip 66 minute cd ; check time and the clock lost 0 seconds. To summarize for future people needing answers to this: hdparm -u1 -d1 /dev/hdc had the clock lose 0 seconds. hdparm -u0 -d1 /dev/hdc had the clock lose 257+ seconds. hdparm -u1 -d0 /dev/hdc had the clock lose 0 seconds. hdparm -u0 -d0 /dev/hdc had the clock lose 300+ seconds. So this looks to be some sort of interaction in masked interrupts on cdrom drives.
This bug appears more general than many of the comments about it. I am not running SMP. I am running the i686 kernel. Since installing kernel 2.4.18-17.7.x I have seen the kernel clock loose up to about 140 SECONDS. Although the previous kernel (2.4.18-10) had a less stable clock (wandering 100's ms) than the kernels of RedHat 7.1 (wandering 10's of ms), I did not witness the the losses of 100's of SECONDS until installing 2.4.18-17.7.2. (I am doing network measurement research and have been investigating clock stability as it affects measurements for some time. I am a "clock watcher".) Also note that I do not have a CDWriter, I have a CD reader, but am not actively using it. So far, no pattern has evolved to relate the problem exactly to some other event. The problem does appear to be more frequent if I leave the X desktop (running gnome and random screen saver) than if I log out (back to console mode; I use "startx" method for starting X, not X based login). This is a serious problem. For example, two days ago, the clock lost time just before the 4 a.m. cron schedule, and ntp dutifully set it ahead, but this caused the daily cron to fail to run. Consider that this bug is likely related to the just reported 77058 describing large scheduling delays. Sounds like a bug involving locking or disabling interrupts and failing to reenable some place in the new kernel, thus leaving much locked until some other miscellaneous event happens.
Has there been any work done with this? I too am seeing these large-scale drifts. Our machine has a CD-ROM, but it's most certainly not being used. The kernel we're using is 2.4.18-18.7.x. The drift is pretty bad. We have realtime systems we're testing that are doing some pretty funky things because of it.
# uname -a Linux xxx.xxxx.xx 2.4.18-17.7.x #1 Tue Oct 8 13:33:14 EDT 2002 i686 unknown # tickadj tick = 1953 It should be 10000 I suppose. I hope this may be helpfull.
This occurs with the 2.4.18-18.8.0 i686 UP kernel too. While running grip against an ide-scsi device, system time() slows down by a factor of almost two on my machine.
On kernel 2.4.28-19.8.0 (RH 8.0) on an i686, I'm also experiencing severe clock drift (losing at least 5s/min). System had been working as a charm until I started ripping (Grip, cdparanoia, lame/ogg) from IDE cdrom. Unmasking IRQ's for device via hdparm seems to reduce (if not eliminate) drift. Buggy HW, I suppose (Compaq Presario 17XL series, 440BX chipset). Hope it helps. Cheers!
This bug is still marked as NEW. Has anything been done with it? We just enabled 10Mb/s backups on our now 2.4.18-19.7.x test server, and we're seeing drifts on the order of 1 minute per hour. The two devices being involved with backups would be an Intel Ethernet Pro 100 and a qla2200 FC card. Any updates would be most helpful.
Ahah! I have been trying to identify the cause of this problem because of the need for relatively stable timebases for network measurements. I am not an experienced kernel hacker, but I have stumbled upon some source code that might illuminate this problem. It would appear that sometime between kernel update 2.4.18-10 and 2.4.18-18.7.x RedHat "slipped in" a change of HZ. (I don't currently have access to the source for 2.4.18-17.7.x, but this may have introduced the change.) The change appears to change the HZ (the rate of the clock interrupts) from the standard 100 to 512 for the athlon, athlon-smp, i686, i686-bigmem, i686-debug, and i686-smp kernels (leaving the remaining kernels at 100). This means that the clock interrupt is changed from once each 10 ms to approximately once each 1.95 ms. This would seem to reduce the maximum interrupt disable time without lost clock interrupts from 10-20 ms to about 1.95-3.9 ms, thus potentially making lost clock interrupts more probable. Perhaps someone at RedHat could identify the patch which instituted this change.
I'm seeing this on a SMP system with dual p3s. The system receives syslog data from a variety of systems and network equipment and also does some authentication against a kerberos server (hence the need for accurate time) It appears to be losing at least .5 seconds in a 5 minute period, usually more. The system is running RH 7.2, and keeps time just fine with kernel 2.4.9-13smp. Installing and running the 2.4.18-24.7.xsmp kernel causes the time to go off wildly, and eventually the kerberos server stops talking to it. Mike Iglesias Univ of Calif Irvine
Just to confirm another observation of this. I found a discrepancy of 1.5 HOURS on my system this week, which caused me some embarassement as I did not realise the clock was wrong at first. I have two separate systems: One a dual PIII 550 running RH 7.3 with unmodified 2.4.18-27.7.xsmp kernel, the other (at home) a Celeron 1GHz uniprocessor system running RH 7.3 and unmodified 2.4.18-27.7.x kernel. Both have seen signicant clock drifts since the kernel update (10, 20, 30 minutes). The 1.5 hours difference was so big that I initially put it down to timeserver problems and reset the clock by hand (only the dual processor system uses timeservers). I use both Zip and USB storage regularly, and CD reading. The single processor system has a CD writer, but it's not used much at present. I have to say I'm finding this problem *extremely* inconvenient. Please sort a fix for it as soon as possible!
This bug seems to go away with later errata.