Bug 76499

Summary:	Kernel-2.4.18-17.7.x SMP time drift
Product:	[Retired] Red Hat Linux	Reporter:	Stephen John Smoogen <smooge>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	high
Version:	7.3	CC:	bje, ccoy, dedourek, djuran, fche, iglesias, lgt
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:	Errata	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-02-19 22:28:27 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Stephen John Smoogen 2002-10-22 15:13:33 UTC

Description of Problem:


Ok I know it has been reported X times now, but I can confirm a large 
time drift with the Red Hat Linux 7.3 update 2.4.18-17.7.x SMP kernel.

root:{log}# rpm -q glibc ntp kernel 
glibc-2.2.5-40
ntp-4.1.1-1
kernel-2.4.18-17.7.x
root:{log}# 

I didnt notice it until yesterday when I couldnt authenticate with the 
Kerberos servers at work. The problem was that my clock was 43 minutes 
behind real time. I looked through the logs, and found that my ntp had 
died a couple times with 

Oct 21 13:15:48 smoogen1 ntpd[1192]: time reset 675.164428 s
Oct 21 13:15:48 smoogen1 ntpd[1192]: synchronisation lost
Oct 21 13:45:57 smoogen1 ntpd[1192]: time reset 254.183893 s
Oct 21 13:45:57 smoogen1 ntpd[1192]: synchronisation lost


Looking through the logs this did not start occuring until I started 
'ripping' cds with grip yesterday. The cd player is an IDE cdrom writer 
so that the ripped data  has to go through the SCSI layer. 

Looking at the clock time I am losing about 8 seconds a minute when 
things are getting ripped.

smoogen:{ogg}$ lsmod
Module                  Size  Used by    Tainted: PF 
sg                     34628   1 (autoclean)
sr_mod                 16088   4 (autoclean)
i810_audio             24448   1 (autoclean)
ac97_codec             12384   0 (autoclean) [i810_audio]
soundcore               6756   2 (autoclean) [i810_audio]
radeon                 97464   1
agpgart                41248   3
lp                      8672   0 (autoclean)
vmnet                  27456   1
parport_pc             17636   1
parport                35616   1 [lp parport_pc]
vmmon                  24884   0 (unused)
autofs                 11652   1 (autoclean)
3c59x                  29032   1
ipchains               42632  50
ide-scsi                9536   3
ide-cd                 30208   0
cdrom                  32384   0 [sr_mod ide-cd]
mousedev                5152   1
hid                    20736   0 (unused)
input                   6016   0 [mousedev hid]
usb-uhci               25316   0 (unused)
usbcore                74400   1 [hid usb-uhci]
ext3                   67424   3
jbd                    51528   3 [ext3]
raid1                  15492   3
raid0                   4064   1
aic7xxx               129568   8
sd_mod                 12832  16
scsi_mod              108048   5 [sg sr_mod ide-scsi aic7xxx sd_mod]


The box is a Dell Precision 530 

smoogen:{smoogen}$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Xeon(TM) CPU 1.70GHz
stepping        : 2
cpu MHz         : 1694.846
cache size      : 256 KB
Physical processor ID   : 1100763704
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3365.99

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Xeon(TM) CPU 1.70GHz
stepping        : 2
cpu MHz         : 1694.846
cache size      : 256 KB
Physical processor ID   : 15
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3382.19

Here are the numbers from doing a ntpdate (to resync the time), a check about 1
minute later, rip a cdrom, and then do a ntpdate again

root:{log}# ntpdate 128.165.4.4
22 Oct 08:56:16 ntpdate[18936]: step time server 128.165.4.4 offset 323.022559 sec
root:{log}# ntpdate 128.165.4.4
22 Oct 08:56:58 ntpdate[18939]: adjust time server 128.165.4.4 offset 0.000029 sec
root:{log}# date ; ntpdate 128.165.4.4
Tue Oct 22 09:06:59 MDT 2002
22 Oct 09:12:12 ntpdate[18990]: step time server 128.165.4.4 offset 312.373567 sec

So in 16 minutes I lost 6 minutes.

Comment 1 Stephen John Smoogen 2002-10-25 18:06:53 UTC

Ok after email from developer, I am looking at DMA/IRQ changes for cdrom drives
between 2.4.17-3 and 2.4.17-17. I have done the following tests to see how much
time is lost

/etc/init.d/ntp stop
hdparm -u1 -d1 /dev/hdc
check time; grip 66 minute cd ; check time 

and the clock lost 0 seconds. To summarize for future people needing answers to
this:

hdparm -u1 -d1 /dev/hdc  had the clock lose   0  seconds.
hdparm -u0 -d1 /dev/hdc  had the clock lose 257+ seconds.
hdparm -u1 -d0 /dev/hdc  had the clock lose   0  seconds.
hdparm -u0 -d0 /dev/hdc  had the clock lose 300+ seconds.

So this looks to be some sort of interaction in masked interrupts on cdrom drives.

Comment 2 John DeDourek 2002-11-01 22:21:14 UTC

This bug appears more general than many of the comments about it.  I am not
running SMP.  I am running the i686 kernel.  Since installing kernel
2.4.18-17.7.x I have seen the kernel clock loose up to about 140 SECONDS.
Although the previous kernel (2.4.18-10) had a less stable clock
(wandering 100's ms) than the kernels of RedHat 7.1 (wandering 10's of
ms), I did not witness the the losses of 100's of SECONDS until
installing 2.4.18-17.7.2. (I am doing network measurement research and
have been investigating clock stability as it affects measurements for
some time.  I am a "clock watcher".)

Also note that I do not have a CDWriter, I have a CD reader, but am not
actively using it.

So far, no pattern has evolved to relate the problem exactly to some
other event.  The problem does appear to be more frequent if I leave
the X desktop (running gnome and random screen saver) than if I log
out (back to console mode; I use "startx" method for starting X, not
X based login).

This is a serious problem.  For example, two days ago, the clock lost time
just before the 4 a.m. cron schedule, and ntp dutifully set it ahead, but
this caused the daily cron to fail to run.

Consider that this bug is likely related to the just reported 77058
describing large scheduling delays.

Sounds like a bug involving locking or disabling interrupts and failing
to reenable some place in the new kernel, thus leaving much locked until
some other miscellaneous event happens.

Comment 3 Hrunting Johnson 2002-11-25 22:34:06 UTC

Has there been any work done with this?  I too am seeing these large-scale 
drifts.  Our machine has a CD-ROM, but it's most certainly not being used.  The 
kernel we're using is 2.4.18-18.7.x.  The drift is pretty bad.  We have 
realtime systems we're testing that are doing some pretty funky things because 
of it.

Comment 4 Luca 2002-12-03 08:41:24 UTC

# uname -a
Linux xxx.xxxx.xx 2.4.18-17.7.x #1 Tue Oct 8 13:33:14 EDT 2002 i686 unknown
# tickadj
tick = 1953

It should be 10000 I suppose. I hope this may be helpfull.

Comment 5 Frank Ch. Eigler 2002-12-31 22:22:07 UTC

This occurs with the 2.4.18-18.8.0 i686 UP kernel too.
While running grip against an ide-scsi device, system time()
slows down by a factor of almost two on my machine.

Comment 6 Antonio Pastor 2003-01-02 18:16:13 UTC

On kernel 2.4.28-19.8.0 (RH 8.0) on an i686, I'm also experiencing severe clock 
drift (losing at least 5s/min). System had been working as a charm until I 
started ripping (Grip, cdparanoia, lame/ogg) from IDE cdrom.
Unmasking IRQ's for device via hdparm seems to reduce (if not eliminate) drift.
Buggy HW, I suppose (Compaq Presario 17XL series, 440BX chipset).
Hope it helps. Cheers!

Comment 7 Hrunting Johnson 2003-01-13 17:02:08 UTC

This bug is still marked as NEW.  Has anything been done with it?  We just 
enabled 10Mb/s backups on our now 2.4.18-19.7.x test server, and we're seeing 
drifts on the order of 1 minute per hour.  The two devices being involved with 
backups would be an Intel Ethernet Pro 100 and a qla2200 FC card.

Any updates would be most helpful.

Comment 8 John DeDourek 2003-02-19 18:38:30 UTC

Ahah!  I have been trying to identify the cause of this problem because
of the need for relatively stable timebases for network measurements.
I am not an experienced kernel hacker, but I have stumbled upon some
source code that might illuminate this problem.  It would appear that
sometime between kernel update 2.4.18-10 and 2.4.18-18.7.x RedHat
"slipped in" a change of HZ.  (I don't currently have access to the
source for 2.4.18-17.7.x, but this may have introduced the change.)

The change appears to change the HZ (the rate of the clock interrupts)
from the standard 100 to 512 for the athlon, athlon-smp, i686,
i686-bigmem, i686-debug, and i686-smp kernels (leaving the remaining
kernels at 100).

This means that the clock interrupt is changed from once each 10 ms to
approximately once each 1.95 ms.  This would seem to reduce the
maximum interrupt disable time without lost clock interrupts from
10-20 ms to about 1.95-3.9 ms, thus potentially making lost clock
interrupts more probable.

Perhaps someone at RedHat could identify the patch which instituted
this change.

Comment 9 Mike Iglesias 2003-02-25 01:29:35 UTC

I'm seeing this on a SMP system with dual p3s.  The system receives syslog data
from a variety of systems and network equipment and also does some
authentication against a kerberos server (hence the need for accurate time)  It
appears to be losing at least .5 seconds in a 5 minute period, usually more.

The system is running RH 7.2, and keeps time just fine with kernel 2.4.9-13smp. 
Installing and running the 2.4.18-24.7.xsmp kernel causes the time to go off
wildly, and eventually the kerberos server stops talking to it.


Mike Iglesias
Univ of Calif Irvine

Comment 10 Laurence Tyler 2003-05-02 10:06:20 UTC

Just to confirm another observation of this. I found a discrepancy of 1.5 HOURS
on my system this week, which caused me some embarassement as I did not realise
the clock was wrong at first. I have two separate systems: One a dual PIII 550
running RH 7.3 with unmodified 2.4.18-27.7.xsmp kernel, the other (at home) a
Celeron 1GHz uniprocessor system running RH 7.3 and unmodified 2.4.18-27.7.x
kernel. Both have seen signicant clock drifts since the kernel update (10, 20,
30 minutes). The 1.5 hours difference was so big that I initially put it down to
timeserver problems and reset the clock by hand (only the dual processor system
uses timeservers).

I use both Zip and USB storage regularly, and CD reading. The single processor
system has a CD writer, but it's not used much at present.

I have to say I'm finding this problem *extremely* inconvenient. Please sort a
fix for it as soon as possible!

Comment 11 Stephen John Smoogen 2004-02-19 22:28:27 UTC

This bug seems to go away with later errata.