From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020606

Description of problem:
I'm running a Dell Dimension 220 with 2xPIII/600 CPUs. Upon upgrading to RedHat 7.3, and updating to 2.4.18-5smp, I noticed that the system clock is very unreliable - specifically, it goes backwards often, which causes all sorts of strange behavior (things like window managers don't like time to go backwards).

Version-Release number of selected component (if applicable):
2.4.18-5smp

How reproducible:
Always

Steps to Reproduce:
To observe the problem, one need only do this:
% while `true`; do date ; /sbin/hwclock ; sleep 1; done

Actual Results:
The actual behavior is that while the output of hwclock is monotonically increasing, the output of date is all over the place - jumping ahead by as much as 10 seconds, then falling back by just as much.

Expected Results:
You'd expect date and hwclock to be pretty close on each iteration, and that neither one would travel backwards.

Additional info:
Running a UP kernel, this symptom doesn't manifest, neither does it manifest if I disable (via BIOS) the second CPU. I've tried passing "no-hlt=1" to the kernel during boot, it has no measurable effect.
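For anyone who wants to quantify the jumps instead of eyeballing the output of date, here is a minimal sketch in Python. It is not part of the original report; the function names, the 1-second interval, and the 0.5-second tolerance are all assumptions chosen for illustration.

```python
import time


def sample_clock(count, interval=1.0):
    """Read the system clock `count` times, sleeping `interval` seconds
    after each read. Returns the readings as seconds since the epoch."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(interval)
    return samples


def find_clock_jumps(samples, interval=1.0, tolerance=0.5):
    """Return (index, delta) pairs where two consecutive samples differ
    from the expected `interval` by more than `tolerance` seconds.
    A negative delta means the clock went backwards."""
    jumps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if abs(delta - interval) > tolerance:
            jumps.append((i, delta))
    return jumps
```

Calling find_clock_jumps(sample_clock(60)) on the affected machine should list every forward or backward jump seen in roughly a minute; on a healthy system the list should normally be empty.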
ok - more data points. I installed 2.4.18-3smp - the behavior is the same. I tried passing noapic to the kernel - and it makes a difference - the clock doesn't seem to travel backwards (at least I haven't observed it) but it still jumps forwards by several seconds (which is more annoying than you might think since it causes bursts of key repeats as the userlevel code thinks you've been holding the key down for several seconds).
Created attachment 67145 [details] output of while `true`; do date ; sleep 1; done for 15 iterations with 2.4.18-3 w/ two procs
Created attachment 67146 [details] output of while `true`; do date ; sleep 1; done for 15 iterations on 2.4.18-5 w/ two procs
Created attachment 67147 [details] same loop - under 2.4.18-5 w/ two procs and "noapic" passed to the kernel
Ok - final data point - I just installed the latest 7.2 update kernel (2.4.9-34) and it does *not* seem to suffer from this problem. The time has been monotonically increasing in one second increments for several minutes now (with the 2.4.18 series it took less than 1 minute to observe the problem).
Spoke too soon - the 2.4.9 kernel does have this problem - it's just less severe - and not as reliable. I'd love to blame this on hardware - the problem is that both Win2k and WinXP run just fine on the same hardware - so it doesn't seem likely that it's a hardware problem. If anyone at RedHat is listening/interested - I can make the machine available via SSH with the kernel of your choice running.
The interesting stuff will be in dmesg when you start. Do the CPUs get their times synchronized, or not? Can you post 'dmesg' and/or the relevant bits of /var/log/messages after the system starts with SMP? gettimeofday() can be sent to either CPU on an SMP system. The reason the clock appears to jump forward and backward is likely that the TSCs in the CPUs aren't synchronized properly at startup for some reason.
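To illustrate the theory in the comment above, here is a toy simulation in Python. It assumes a deliberately simplified model (each read returns the true time plus a fixed per-CPU offset) and is not how the kernel actually computes the time; the function names and the sample schedule are hypothetical.

```python
def simulate_reads(cpu_schedule, skew_usecs, step_usecs=1_000_000):
    """Simulate one clock read per step of true time.

    cpu_schedule: which CPU (0 or 1) services each read.
    skew_usecs:   how far CPU1's view of time runs ahead of CPU0's.
    Returns the list of observed times in microseconds.
    """
    offsets = (0, skew_usecs)
    return [i * step_usecs + offsets[cpu] for i, cpu in enumerate(cpu_schedule)]


def count_backward_steps(readings):
    """Count how often a reading is earlier than the one before it."""
    return sum(1 for a, b in zip(readings, readings[1:]) if b < a)
```

With a schedule of [0, 1, 0, 1] and a 6-second skew, the observed time leaps ahead when CPU1 answers and falls back when CPU0 answers next, which resembles the pattern described in this report. With zero skew the readings stay monotonic regardless of the schedule.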
Tried to attach the results to this bug - got a bugzilla error (which has been reported to the appropriate folks). I sent the results via email to Matt - with the following text: Both were generated off the "stock" RedHat 7.3 smp kernel (2.4.18-3smp) - as you'd suspect, the "noapic" version had noapic passed to the kernel. I'm downloading 2.4.18-5 now - I can provide those results as well, if you think they'll be enlightening. Both dumps contain lines like the following:
BIOS BUG: CPU#0 improperly initialized, has 6217457 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -6217457 usecs TSC skew! FIXED.
Clearly whatever's being done to "fix" the initialization isn't quite working in this case... Aside from pelting Dell with a bug report (which I'm not even sure how to do - and I suspect they'll ignore if I manage to find a way) - any suggestions?
http://www.uwsg.iu.edu/hypermail/linux/kernel/9902.0/0053.html seems like some work was done back in 2.2.1 - very similar code exists in the 2.4.18 tree (in arch/i386/kernel/smpboot.c). this same code appears in a pre-19 patch, but AFAICT it's just a line-wrap change.
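For comparing boots (for example with and without noapic), the skew figures can be pulled out of saved dmesg output. A minimal Python sketch follows; the regular expression is written against the two lines quoted above only, so treat it as an assumption that other kernels print the message in the same form. The function name is hypothetical.

```python
import re

# Pattern based solely on the "BIOS BUG" lines quoted in this report.
SKEW_RE = re.compile(
    r"BIOS BUG: CPU#(\d+) improperly initialized, has (-?\d+) usecs TSC skew!"
)


def parse_tsc_skew(log_text):
    """Return {cpu_number: skew_in_usecs} for every matching line."""
    return {int(cpu): int(usecs) for cpu, usecs in SKEW_RE.findall(log_text)}
```

Applied to the two lines above, this yields +6217457 usecs for CPU#0 and -6217457 usecs for CPU#1, i.e. the two CPUs were reported roughly 12.4 seconds apart before the kernel's correction.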
Does "notsc" on the kernel command line help?
re: notsc - no - it produces no observable change - the "BIOS BUG" lines are still present, the hardware clock still advances correctly (as shown by hwclock), and date still jumps around.
fyi - just for giggles, I physically swapped the CPU positions (i.e. CPU0 <-> CPU1). No change. Another note that may (or may not) be relevant - the delay loop (and corresponding BogoMips) for the two (identical) CPUs is dramatically different. CPU0 clocks in at around 1200 BogoMips, while CPU1 is around 800. According to a bit of the SMP howto by Alan Cox, this "usually indicates that the data cache is disabled" - but according to the MTRR docs, this is fixed if MTRR support is enabled (which it is in the RH kernels). I wonder what else might cause this sort of timing skew - and if that's the actual underlying cause of the problem I'm experiencing.
Your CPUs showing radically different bogomips is rather concerning. Those measurements are pretty accurate.
Another note on the topic of bogomips - while swapping the CPU positions, the bogomips values do *not* follow the CPUs - i.e. whichever physical CPU is in slot 0 is measured at 1200 bogomips - and whichever CPU is in slot 1 measures at 800.
makes me wonder if the second slot has its multiplier jumpered differently
A reasonable question - but the board doesn't seem to have any jumpers to control speed or multiplier (I popped the cover and looked). For the folks @ Dell who might still be listening - the service tag on the box is FE4LL - that should tell you the model (and revision, etc.) of the hardware.
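A quick way to repeat the bogomips comparison after each CPU swap is to parse the per-processor entries from /proc/cpuinfo. The Python sketch below assumes the usual "processor : N" and "bogomips : X" line layout; the function names and the 5% threshold are hypothetical choices, not anything from this thread.

```python
def parse_bogomips(cpuinfo_text):
    """Return {processor_number: bogomips} from /proc/cpuinfo-style text."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip().lower()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "bogomips" and current is not None:
            result[current] = float(value)
    return result


def bogomips_mismatch(bogomips, max_ratio=1.05):
    """True if the fastest CPU exceeds the slowest by more than max_ratio."""
    values = list(bogomips.values())
    if len(values) < 2:
        return False
    return max(values) > max_ratio * min(values)
```

On the live system the input would be the contents of /proc/cpuinfo, e.g. parse_bogomips(open("/proc/cpuinfo").read()). Figures like the roughly 1200 vs. 800 reported here would be flagged as a mismatch.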
I had the chance to look at another "identical" machine - another Precision 220 w/ identical processors - and it does *not* display this symptom - suggesting strongly that it's specific to this hardware. I'm on hold with Dell tech support now - wondering how the devil I'm going to explain this problem to a first level support engineer... Wish me luck...
Ok - the current theory is (as was alluded to here) that it's a voltage problem to the second proc. Dell is sending a mainboard out and we'll see what happens. I'm going to mark this closed (NOTABUG) and I'll re-open if this theory doesn't pan out.