69906 – SMP kernel with multiple CPU's suffers serious clock skew

Summary: SMP kernel with multiple CPU's suffers serious clock skew

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i686
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-07-26 05:32 UTC by Dan Berger
Modified:	2005-10-31 22:00 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2002-08-10 15:31:23 UTC
Embargoed:

Attachments
output of while `true`; do date ; sleep 1; done for 15 iterations with 2.4.18-3 w/ two procs (493 bytes, text/plain) 2002-07-26 05:59 UTC, Dan Berger	no flags	Details
output of while `true`; do date ; sleep 1; done for 15 iterations on 2.4.18-5 w/ two procs (435 bytes, text/plain) 2002-07-26 06:00 UTC, Dan Berger	no flags	Details
same loop - under 2.4.18-5 w/ two procs and "noapic" passed to the kernel (435 bytes, text/plain) 2002-07-26 06:02 UTC, Dan Berger	no flags	Details
dmesg output, 2.4.18-3smp kernel, as requested (deleted) 2002-07-31 04:31 UTC, Dan Berger	no flags	Details
dmesg output, 2.4.18-3smp kernel, as requested (deleted) 2002-07-31 04:31 UTC, Dan Berger	no flags	Details
dmesg output, 2.4.18-3smp kernel, as requested (deleted) 2002-07-31 04:38 UTC, Dan Berger	no flags	Details

Description Dan Berger 2002-07-26 05:32:30 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020606

Description of problem:
I'm running a Dell Dimension 220 with 2xPIII/600 CPUs.  Upon upgrading to RedHat
7.3, and updating to 2.4.18-5smp, I noticed that the system clock is very
unreliable - specifically, it goes backwards often, which causes all sorts of
strange behavior (things like window managers don't like time to go backwards).


Version-Release number of selected component (if applicable): 2.4.18-5smp


How reproducible:
Always

Steps to Reproduce:
To observe the problem, one need only do this:

% while true; do date ; /sbin/hwclock ; sleep 1; done


Actual Results:  The actual behavior is that while the output of hwclock is
monotonically increasing, the output of date is all over the place - jumping
ahead by as much as 10 seconds, then falling back by just as much.

Expected Results:  You'd expect date and hwclock to be pretty close on each
iteration, and that neither one would travel backwards.  


Additional info:

With a UP kernel this symptom doesn't manifest, nor does it manifest if I
disable the second CPU in the BIOS.  I've tried passing "no-hlt=1" to the
kernel during boot; it has no measurable effect.
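
The check described above (the output of date should never move backwards) can also be automated. The following is a minimal sketch in Python, not part of the original report: it samples the wall clock with time.time() and lists every sample that is earlier than the one before it. The function names and the sampling interval are illustrative assumptions.

```python
import time


def find_backward_steps(samples):
    """Return (index, delta) for every sample earlier than the one before it."""
    steps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            steps.append((i, delta))
    return steps


def sample_wall_clock(count, interval=1.0):
    """Read the wall clock `count` times, sleeping `interval` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(interval)
    return samples
```

Calling find_backward_steps(sample_wall_clock(15)) mirrors the 15-iteration shell loop; an empty list means no backward step was seen during that run, which does not prove the clock is sound.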

Comment 1 Dan Berger 2002-07-26 05:58:03 UTC

ok - more data points.  I installed 2.4.18-3smp - the behavior is the same.  I
tried passing noapic to the kernel - and it makes a difference - the clock
doesn't seem to travel backwards (at least I haven't observed it) but it still
jumps forwards by several seconds (which is more annoying than you might think
since it causes bursts of key repeats as the userlevel code thinks you've been
holding the key down for several seconds).

Comment 2 Dan Berger 2002-07-26 05:59:52 UTC

Created attachment 67145 [details]
output of while `true`; do date ; sleep 1; done for 15 iterations with 2.4.18-3 w/ two procs

Comment 3 Dan Berger 2002-07-26 06:00:47 UTC

Created attachment 67146 [details]
output of while `true`; do date ; sleep 1; done for 15 iterations on 2.4.18-5 w/ two procs

Comment 4 Dan Berger 2002-07-26 06:02:11 UTC

Created attachment 67147 [details]
same loop - under 2.4.18-5 w/ two procs and "noapic" passed to the kernel

Comment 5 Dan Berger 2002-07-26 06:21:01 UTC

Ok - final data point - I just installed the latest 7.2 update kernel (2.4.9-34) and it does *not* seem to suffer from this problem.  The time has been monotonically increasing in one second increments for several minutes now (with the 2.4.18 series it took less than 1 minute to observe the problem).

Comment 6 Dan Berger 2002-07-31 02:18:28 UTC

Spoke too soon - the 2.4.9 kernel does have this problem - it's just less severe
- and harder to reproduce reliably.  I'd love to blame this on hardware - the
problem is that both Win2k and WinXP run just fine on the same hardware - so it
doesn't seem likely that it's a hardware problem.  If anyone at Red Hat is
listening/interested - I can make the machine available via SSH with the kernel
of your choice running.

Comment 7 Matt Domsch 2002-07-31 03:33:58 UTC

The interesting stuff will be in dmesg when you start.  Do the CPUs get their
times synchronized, or not?  Can you post 'dmesg' and/or the relevant bits
of /var/log/messages after the system starts with SMP?

A gettimeofday() call can be serviced by either CPU on an SMP system.  The
reason the clock appears to jump forward and backward is likely that the TSCs
in the CPUs aren't synchronized properly at startup for some reason.
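
To illustrate the hypothesis above, here is a deliberately simplified model in Python. It is a sketch under stated assumptions and not how the 2.4 kernel actually computes the time of day: true time advances by a fixed tick per read, each read lands on a randomly chosen CPU, and CPU 1's counter is offset from CPU 0's by a constant skew. All names and values are hypothetical.

```python
import random


def simulate_reads(n_reads, skew_us, tick_us=1_000_000, seed=0):
    """Simulate timestamps served by two CPUs whose counters differ by skew_us.

    True time advances by tick_us per read. CPU 0 reports true time,
    CPU 1 reports true time plus skew_us (assumed constant).
    """
    rng = random.Random(seed)
    readings = []
    for i in range(n_reads):
        true_time = i * tick_us
        cpu = rng.randrange(2)  # which CPU happens to service this read
        offset = skew_us if cpu == 1 else 0
        readings.append((cpu, true_time + offset))
    return readings


def count_backward_steps(readings):
    """Count reads whose timestamp is lower than the previous read's."""
    return sum(1 for prev, cur in zip(readings, readings[1:]) if cur[1] < prev[1])
```

In this model, whenever the skew exceeds the tick, every read that moves from CPU 1 to CPU 0 appears to go back in time, and every move from CPU 0 to CPU 1 appears as a forward jump. With zero skew the sequence is always monotonic.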

Comment 8 Dan Berger 2002-07-31 04:39:58 UTC

Tried to attach the results to this bug - got a bugzilla error (which has been
reported to the appropriate folks).  I sent the results via email to Matt - with
the following text:


Both were generated off the "stock" RedHat 7.3 smp kernel (2.4.18-3smp) - as
you'd suspect, the "noapic" version had noapic passed to the kernel.  I'm
downloading 2.4.18-5 now - I can provide those results as well, if you think
they'll be enlightening.

Both dumps contain lines like the following:

BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -6217457 usecs TSC skew! FIXED.

Clearly whatever's being done to "fix" the initialization isn't quite working
in this case...

Aside from pelting Dell with a bug report (which I'm not even sure how to do -
and I suspect they'll ignore if I manage to find a way) - any suggestions?
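
For anyone comparing several dmesg dumps, the skew values can be pulled out mechanically. A minimal Python sketch follows; the regular expression is written against the two lines quoted above and is an assumption about the message format, not something taken from the kernel source.

```python
import re

# Pattern modelled on the two "BIOS BUG" lines quoted in this comment.
SKEW_LINE = re.compile(
    r"BIOS BUG: CPU#(?P<cpu>\d+) improperly initialized, "
    r"has (?P<skew>-?\d+) usecs TSC skew!"
)


def parse_tsc_skew(dmesg_text):
    """Map CPU number to the reported TSC skew in microseconds."""
    return {
        int(m.group("cpu")): int(m.group("skew"))
        for m in SKEW_LINE.finditer(dmesg_text)
    }


sample = """\
BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -6217457 usecs TSC skew! FIXED.
"""
skews = parse_tsc_skew(sample)
# Difference between the largest and smallest reported skew, in seconds.
spread_seconds = (max(skews.values()) - min(skews.values())) / 1_000_000
```

For the two lines above the spread works out to roughly 12.4 seconds. The values are symmetric around zero, which is consistent with each being reported relative to a common midpoint, though that reading is an inference from the numbers alone.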

Comment 9 Dan Berger 2002-07-31 05:08:00 UTC

http://www.uwsg.iu.edu/hypermail/linux/kernel/9902.0/0053.html

seems like some work was done back in 2.2.1 - very similar code exists in the
2.4.18 tree (in arch/i386/kernel/smpboot.c).

this same code appears in a pre-19 patch, but AFAICT it's just a line-wrap change.

Comment 10 Arjan van de Ven 2002-07-31 08:29:26 UTC

does "notsc" on the kernel commandline help ?

Comment 11 Dan Berger 2002-07-31 15:18:04 UTC

re: notsc - no - it produces no observable change - the "BIOS BUG" lines are
still present, the hardware clock still advances correctly (as shown by
hwclock), and date still jumps around.

Comment 12 Dan Berger 2002-07-31 19:39:05 UTC

fyi - just for giggles, I physically swapped the CPU positions (i.e. CPU0 <->
CPU1).  No change.  

Another note that may (or may not) be relevant - the delay loop (and
corresponding BogoMips) for the two (identical) CPUs is dramatically different.
CPU0 clocks in at around 1200 BogoMips, while CPU1 is around 800.

According to a bit of the SMP howto by Alan Cox, this "usually indicates that
the data cache is disabled" - but according to the MTRR docs, this is fixed if
MTRR support is enabled (which it is in the RH kernels).  I wonder what else
might cause this sort of timing skew - and if that's the actual underlying cause
of the problem I'm experiencing.
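
The per-CPU comparison can be scripted against /proc/cpuinfo. The Python sketch below assumes the usual "processor" and "bogomips" fields and a hypothetical 5% tolerance; it parses text passed in as a string so it can be tried on a saved copy of the file.

```python
def parse_bogomips(cpuinfo_text):
    """Return {processor_number: bogomips} from /proc/cpuinfo-style text."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor blocks
        key = key.strip().lower()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "bogomips" and current is not None:
            result[current] = float(value)
    return result


def bogomips_mismatch(values, tolerance=0.05):
    """True if the slowest CPU is more than `tolerance` (a fraction) below the fastest."""
    if len(values) < 2:
        return False
    lo, hi = min(values.values()), max(values.values())
    return (hi - lo) / hi > tolerance


# Hypothetical excerpt using the approximate figures quoted above.
sample = "processor\t: 0\nbogomips\t: 1200.00\n\nprocessor\t: 1\nbogomips\t: 800.00\n"
values = parse_bogomips(sample)
```

With the approximate figures from this comment, the ratio between the two CPUs is 1.5, far outside any plausible measurement noise for nominally identical processors.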

Comment 13 Alan Cox 2002-07-31 20:07:30 UTC

Your CPUs showing radically different bogomips is rather concerning. Those
measurements are pretty accurate.

Comment 14 Dan Berger 2002-07-31 20:12:44 UTC

Another note on the topic of bogomips - while swapping the CPU positions, the
bogomips values do *not* follow the CPUs - i.e. whichever physical CPU is in
slot 0 is measured at 1200 bogomips - and whichever CPU is in slot 1 measures
at 800.

Comment 15 Arjan van de Ven 2002-07-31 20:14:39 UTC

makes me wonder if the second slot has its multiplier jumpered differently

Comment 16 Dan Berger 2002-07-31 20:22:35 UTC

A reasonable question - but the board doesn't seem to have any jumpers to
control speed or multiplier (I popped the cover and looked).  

For the folks @ Dell who might still be listening - the service tag on the box
is FE4LL - that should tell you the model (and revision, etc.) of the hardware.

Comment 17 Dan Berger 2002-08-10 15:31:17 UTC

I had the chance to look at another "identical" machine - another Precision 220
w/ identical processors - and it does *not* display this symptom - suggesting
strongly that it's specific to this hardware.

I'm on hold with Dell tech support now - wondering how the devil I'm going to
explain this problem to a first level support engineer...

Wish me luck...

Comment 18 Dan Berger 2002-08-10 18:50:50 UTC

Ok - the current theory is (as was alluded to here) that it's a voltage problem
to the second proc.  Dell is sending a mainboard out and we'll see what happens.

I'm going to mark this closed (NOTABUG) and I'll re-open if this theory doesn't
pan out.
