Red Hat Bugzilla – Bug 69906
SMP kernel with multiple CPU's suffers serious clock skew
Last modified: 2005-10-31 17:00:50 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020606
Description of problem:
I'm running a Dell Dimension 220 with 2xPIII/600 CPUs. Upon upgrading to RedHat
7.3, and updating to 2.4.18-5smp, I noticed that the system clock is very
unreliable - specifically, it goes backwards often, which causes all sorts of
strange behavior (things like window managers don't like time to go backwards).
Version-Release number of selected component (if applicable): 2.4.18-5smp
Steps to Reproduce:
To observe the problem, one need only do this:
% while true; do date; /sbin/hwclock; sleep 1; done
Actual Results: The actual behavior is that while the output of hwclock is
monotonically increasing, the output of date is all over the place - jumping
ahead by as much as 10 seconds, then falling back by just as much.
Expected Results: You'd expect date and hwclock to be pretty close on each
iteration, and that neither one would travel backwards.
With a UP kernel this symptom doesn't manifest, nor does it manifest if I
disable the second CPU in the BIOS. I've tried passing "no-hlt=1" to the
kernel during boot; it has no measurable effect.
OK - more data points. I installed 2.4.18-3smp - the behavior is the same. I
tried passing noapic to the kernel, and it makes a difference: the clock
doesn't seem to travel backwards (at least I haven't observed it), but it still
jumps forwards by several seconds (which is more annoying than you might think,
since it causes bursts of key repeats as the userlevel code thinks you've been
holding the key down for several seconds).
Created attachment 67145 [details]
output of while `true`; do date ; sleep 1; done for 15 iterations with 2.4.18-3 w/ two procs
Created attachment 67146 [details]
output of while `true`; do date ; sleep 1; done for 15 iterations on 2.4.18-5 w/ two procs
Created attachment 67147 [details]
same loop - under 2.4.18-5 w/ two procs and "noapic" passed to the kernel
Ok - final data point - I just installed the latest 7.2 update kernel (2.4.9-34) and it does *not* seem to suffer from this problem. The time has been monotonically increasing in one second increments for several minutes now (with the 2.4.18 series it took less than 1 minute to observe the problem).
Spoke too soon - the 2.4.9 kernel does have this problem - it's just less severe
and shows up less reliably. I'd love to blame this on hardware - the problem is that
both Win2k and WinXP run just fine on the same hardware - so it doesn't seem
likely that it's a hardware problem. If anyone at RedHat is
listening/interested - I can make the machine available via SSH with the kernel
of your choice running.
The interesting stuff will be in dmesg when you start. Do the CPUs get their
times synchronized, or not? Can you post 'dmesg' and/or the relevant bits
of /var/log/messages after the system starts with SMP?
gettimeofday() can be serviced by either CPU on an SMP system. The clock
likely appears to jump forward and backward because the TSCs in the two CPUs
weren't synchronized properly at startup for some reason.
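To make that reasoning concrete, here is a toy Python model (not the kernel's actual timekeeping code) of what happens when successive time reads land on CPUs whose counters disagree. The function name, the per-CPU offset table and the 3 second offset are all made up for illustration.

```python
def simulate_readings(schedule, tsc_offset_us, tick_us=1_000_000):
    """Return the time each read would report, in microseconds.

    schedule      -- which CPU services each successive read, one read per tick
    tsc_offset_us -- hypothetical per-CPU counter offset from true time
    tick_us       -- true time that elapses between reads (1 s by default)
    """
    readings = []
    for i, cpu in enumerate(schedule):
        true_time_us = i * tick_us
        # Each CPU reports true time shifted by its own counter offset.
        readings.append(true_time_us + tsc_offset_us[cpu])
    return readings


# Reads alternate between CPU0 (in sync) and CPU1 (3 s behind): the reported
# time goes backwards on every hop to CPU1 and leaps ahead on every hop back.
skewed = simulate_readings([0, 1, 0, 1], {0: 0, 1: -3_000_000})

# With both offsets equal, the same schedule yields steadily increasing time.
synced = simulate_readings([0, 1, 0, 1], {0: 0, 1: 0})
```

In this model the size of each jump equals the difference between the two offsets, which is why a skew of several seconds would show up as date hopping back and forth by that much.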
Tried to attach the results to this bug - got a bugzilla error (which has been
reported to the appropriate folks). I sent the results via email to Matt - with
the following text:
Both were generated off the "stock" RedHat 7.3 smp kernel (2.4.18-3smp) - as
you'd suspect, the "noapic" version had noapic passed to the kernel. I'm
downloading 2.4.18-5 now - I can provide those results as well, if you think
they'll be enlightening.
Both dumps contain lines like the following:
BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -6217457 usecs TSC skew! FIXED.
Clearly whatever's being done to "fix" the initialization, isn't quite in this
Aside from pelting Dell with a bug report (which I'm not even sure how to do -
and I suspect they'll ignore if I manage to find a way) - any suggestions?
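For comparing dmesg dumps across kernels, a small Python sketch like the following can pull the per-CPU skew out of the "BIOS BUG" lines quoted above. The regular expression is written against those two lines only; other kernel versions may word the message differently, and the helper names are my own.

```python
import re

# Matches lines such as:
# BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
SKEW_RE = re.compile(
    r"CPU#(\d+) improperly initialized, has (-?\d+) usecs TSC skew"
)


def parse_tsc_skew(dmesg_text):
    """Map CPU number -> reported TSC skew in microseconds."""
    return {int(m.group(1)): int(m.group(2)) for m in SKEW_RE.finditer(dmesg_text)}


def skew_spread_us(skews):
    """Difference between the largest and smallest reported skew, in microseconds."""
    if not skews:
        return 0
    return max(skews.values()) - min(skews.values())


dmesg_sample = """\
BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -6217457 usecs TSC skew! FIXED.
"""

skews = parse_tsc_skew(dmesg_sample)
spread = skew_spread_us(skews)  # 12434914 microseconds for the sample above
```

If the reported values are offsets from a common reference, as their equal magnitude and opposite sign suggest, the two counters were roughly 12.4 seconds apart at boot, which is in the same range as the jumps seen in the date output.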
seems like some work was done back in 2.2.1 - very similar code exists in the
2.4.18 tree (in arch/i386/kernel/smpboot.c).
this same code appears in a pre-19 patch, but AFAICT it's just a line-wrap change.
does "notsc" on the kernel commandline help ?
re: notsc - no - it produces no observable change - the "BIOS BUG" lines are
still present, the hardware clock still advances correctly (as shown by
hwclock), and date still jumps around.
fyi - just for giggles, I physically swapped the CPU positions (i.e. CPU0 <->
CPU1). No change.
Another note that may (or may not) be relevant - the delay loop (and
corresponding BogoMips) for the two (identical) CPUs is dramatically different.
CPU0 clocks in at around 1200 BogoMips, while CPU1 is around 800.
According to a bit of the SMP howto by Alan Cox, this "usually indicates that
the data cache is disabled" - but according to the MTRR docs, this is fixed if
MTRR support is enabled (which it is in the RH kernels). I wonder what else
might cause this sort of timing skew - and if that's the actual underlying cause
of the problem I'm experiencing.
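For anyone wanting to check their own SMP box for the same mismatch, here is a rough Python sketch that reads per-CPU BogoMips values out of /proc/cpuinfo-style text and flags a large spread. It assumes the i386 layout with "processor" and "bogomips" fields; the sample values and the 5% tolerance are illustrative only, not measurements from this machine.

```python
def parse_bogomips(cpuinfo_text):
    """Map processor number -> bogomips from /proc/cpuinfo-style text."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU stanzas
        key = key.strip().lower()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "bogomips" and current is not None:
            result[current] = float(value)
    return result


def bogomips_mismatch(values, tolerance=0.05):
    """True when the slowest CPU is more than `tolerance` below the fastest."""
    if len(values) < 2:
        return False
    lowest, highest = min(values.values()), max(values.values())
    return (highest - lowest) / highest > tolerance


# Hypothetical excerpt, roughly matching the 1200 vs. 800 figures described above.
sample_cpuinfo = """\
processor\t: 0
bogomips\t: 1199.30

processor\t: 1
bogomips\t: 799.53
"""

values = parse_bogomips(sample_cpuinfo)
mismatch = bogomips_mismatch(values)
```

On a real system the text would come from `open("/proc/cpuinfo").read()`; identical CPUs would normally be expected to report nearly identical values.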
Your CPUs showing radically different bogomips is rather concerning. Those
measurements are pretty accurate.
Another note on the topic of bogomips - while swapping the CPU positions, the
bogomips values do *not* follow the CPUs - i.e. whichever physical CPU is in
slot 0 is measured at 1200 bogomips - and whichever CPU is in slot 1 measures
makes me wonder if the second slot has its multiplier jumpered differently
A reasonable question - but the board doesn't seem to have any jumpers to
control speed or multiplier (I popped the cover and looked).
For the folks @ Dell who might still be listening - the service tag on the box
is FE4LL - that should tell you the model (and revision, etc.) of the hardware.
I had the chance to look at another "identical" machine - another Precision 220
w/ identical processors - and it does *not* display this symptom - suggesting
strongly that it's specific to this hardware.
I'm on hold with Dell tech support now - wondering how the devil I'm going to
explain this problem to a first level support engineer...
Wish me luck...
Ok - the current theory is (as was alluded to here) that it's a voltage problem
to the second proc. Dell is sending a mainboard out and we'll see what happens.
I'm going to mark this closed (NOTABUG) and I'll re-open if this theory doesn't