From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
The current SMP kernel speeds up the system clock when data is transferred via LAN. This applies equally to incoming and outgoing data. Using the ordinary non-SMP kernel makes this behaviour disappear. The acceleration factor is about 4 when up-/downloading entire blocks of data via FTP, and goes up to about 30 (!) when, for instance, the command 'ls -R /' is executed on the remote computer and the output is displayed on the local host. This behaviour is independent of X and already occurs in a simple text console at runlevel 3. Furthermore, it does not depend on the type of connection (TELNET vs SSH).

Version-Release number of selected component (if applicable):
2.4.9-7

How reproducible:
Always

Steps to Reproduce:
1. Boot system into SMP mode
2. Log in to a remote computer via TELNET/SSH
3. Type 'ls -R /'

Actual Results:
The system clock runs forward in time like crazy!

Expected Results:
System clock keeps running at regular speed

Additional info:
The current system is an INTEL PR440FX based Dual Pentium Pro workstation with 512 MB of system memory and an integrated INTEL EtherExpress Pro 100B 10/100 MBit/s network adapter. I had already reported the same bug for kernel version 2.4.7-2 (Roswell) as bug #53914. Furthermore, it seems to be related (or identical) to bug #24680.
Created attachment 35241 [details] Actual output of the command 'lspci -vv'
Does it work if you run an SMP kernel with the "noapic" option? This looks like an IRQ routing table problem, where network interrupts are triggering timer interrupts.
As a matter of fact, the "noapic" workaround keeps the system clock running at normal speed. Nevertheless, there seems to be some major bug in the 2.4.x SMP kernels which is absent in 2.2.x SMP kernels, and moreover, it seems to be quite resistant (2.4.0 preview version was already available in Red Hat Linux 7.0 one year ago).
Should be OK in modern kernels. Is it?
No, unfortunately, this issue has still not been settled. The latest Fedora kernel version "kernel-smp-2.4.22-1.2115.nptl" still behaves in the way that I had described 2 years ago. Adding "noapic" cures this flaw, but that's only a workaround of course. I have reported a more recent, possibly related problem hitting the PR440FX dual Pentium Pro platform in bug #107446.
Due to the short amount of time before EOL of RHL 7.2, this is probably better being reassigned as a Fedora bug.
Ts, ts, ts ... . Testing Fedora Core 2 Test 3 and kernel-smp-2.6.5-1.327.i686.rpm *still* showing the bug reported earlier. :-/
bizarre. still a problem under the 2.6.9 based kernel updates ?
Yes, it definitely is. I have tested kernel-smp-2.6.9-1.667 after a fresh install of FC3. The console problem has decreased significantly. Executing a remote "ls -R /" now leads to an increase in the clock speed of a mere 15%. However, the transfer of large chunks of data still speeds up the clock by an enormous factor of 3. Booting with the noapic option restores normal operation. The two attached "dmesg" output files, one with and one without the noapic option, show that when APIC is enabled, some of the assigned IRQs change significantly.
Created attachment 108876 [details] "dmesg" output for APIC-enabled 2.6.9-1.667 SMP kernel on PR440FX system
Created attachment 108877 [details] "dmesg" output for APIC-disabled 2.6.9-1.667 SMP kernel on PR440FX system
Currently, things are going from bad to worse. After upgrading to the 2.6.9-1.724_FC3 SMP kernel, I first thought my PR440FX's RTC had finally broken, because it suddenly advanced at a 10% faster pace than my wristwatch (without any network traffic). Fortunately, the culprit is "only" my old friend, the APIC bug: the "noapic" kernel option suppresses the observed misbehaviour completely. Interrupts got reassigned by the new kernel, so that, e.g., my USR PCI hardware modem got reconfigured by "kudzu". Anyway, compared to my earlier postings, something is screwed up even more severely than it used to be. I cannot tell whether the 440FX chipset itself has a particular flaw. The mobo I use is one of the very last pieces manufactured by INTEL. The BIOS corresponds to the final release. After all, the PR440FX is what I would call the P6 SMP reference platform. So, it is pretty amazing how badly things work.
Created attachment 109668 [details] "dmesg" output for APIC-enabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system "dmesg" output for APIC-enabled 2.6.9-1.667 SMP kernel on PR440FX system
Created attachment 109669 [details] "dmesg" output for APIC-disabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system
Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.
Still not fixed in Fedora Core 3 for current 2.6.11-1.14_FC3 SMP kernel.
Created attachment 113606 [details] "dmesg" output for APIC-enabled 2.6.11-1.14_FC3 SMP kernel on PR440FX system
Still not fixed in Fedora Core 4 for (almost) current 2.6.12-1.1385_FC4 SMP kernel.
[This comment has been added as a mass update for all FC4 kernel bugs. If you have migrated this bug from an FC3 bug today, ignore this comment.]

Please retest your problem with today's 2.6.12-1.1398_FC4 update. If your problem involved being unable to boot, or some hardware not being detected correctly, please make sure your /etc/modprobe.conf is correct *BEFORE* installing any kernel updates. If in doubt, you can recreate this file using:

    mv /etc/sysconfig/hwconf /etc/sysconfig/hwconf.bak
    mv /etc/modprobe.conf /etc/modprobe.conf.bak
    kudzu

Thank you.
Still not fixed in the 2.6.12-1.1398_FC4 SMP kernel. The impact is less severe than for version 2.6.12-1.1385_FC4 reducing to the level of kernels prior to 2.6.9-1.724_FC3. However, transferring large chunks of data over "eth0" still leads to a speed-up of the system clock by a factor of about 3 as reported before.
"Me too", DFI Lanparty nForce 4 board with Athon x2 4400. Only present on SMP kernel. 2.6.12-1.1456_FC4smp exhibits the problem. IIRC noapic causes the kernel to blow chunks with a "nobody cared" about the SATA interrupts very early on in the boot. I will try this again and report if this is not true. Other fun behaviours associated with the bug are keyboard autorepeats getting triggered far too early (presumably side effect of ticks getting advanced) and XP in vmware 5 really being unusable due to multicharacters for every keyboard event. The time is as reported by the others advancing at 10 - 30% or so faster than realtime. Kernel is reporting in dmesg a spew of rtc: lost some interrupts at 1024Hz. errors. I can't say that it is very correlated to network traffic, but it could be.
Mass update to all FC4 bugs: An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream kernel (2.6.13.2). As there were ~3500 changes upstream between this and the previous kernel, it's possible your bug has been fixed already. Please retest with this update, and update this bug if necessary. Thanks.
2.6.13-1.1526_FC4smp does improve the situation... the keyboard is usable again and the clock, with ntpd running, does not get to wander so much. But the core problem is still present, the clock is advancing ahead of realtime and something ugly is going on somewhere. The problem does not occur on the UP kernel. dmesg is chock full of rtc: lost some interrupts at 2048Hz. rtc: lost some interrupts at 2048Hz. (previously this was 1024Hz). Maybe this is normal, but when running vmware, the process vmware-rtc is seeing 1% of CPU on a permanent basis... since this process has no memory footprint according to top I assume this is due to the RTC interrupts. If there is any investigation I can usefully do I will try it.
The clock wander was as bad as ever after it was given some time to get way out of whack. However, since I had to reboot today I decided to try the following commandline:

    ro root=/dev/VolGroup00/LogVol00 quiet noacpi acpi=off

On previous kernels this generated errors early in boot and ended with a panic. Now this gives the following line on booting, and completes the boot otherwise fine:

    Oct 10 09:07:22 siamese kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC

Now the symptoms seem to have almost completely gone... no abnormal RTC drift at all, no messages in dmesg about missed interrupts... the only maybe symptom left is broken audio out of XP in vmware, which is not present on the UP kernel, but this can be something completely different.
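Since `noacpi`/`acpi=off` (ACPI) and `noapic` (IO-APIC) are easy to mix up, here is a minimal sketch for checking which options a boot line actually carries. It only splits the command line quoted above; the helper name `parse_kernel_cmdline` is made up for illustration and is not part of any tool mentioned in this report.

```python
def parse_kernel_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {option: value-or-None}."""
    options = {}
    for token in cmdline.split():
        # "acpi=off" -> ("acpi", "off"); "quiet" -> ("quiet", None)
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


# Command line quoted in the comment above.
CMDLINE = "ro root=/dev/VolGroup00/LogVol00 quiet noacpi acpi=off"

opts = parse_kernel_cmdline(CMDLINE)
print("noapic requested:", "noapic" in opts)
print("acpi setting:", opts.get("acpi"))
```

On a running system the same function could presumably be fed the contents of /proc/cmdline instead of the hard-coded string.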
Still not fixed in the 2.6.13-1.1532_FC4 SMP kernel. Leaving the computer idle for about 10 hours after syncing the clock, it is 3 hours in advance with respect to real time. As the PR440FX is not the very latest model, ACPI gets disabled automatically. APIC however, is up and running. To make the system clock work correctly, the "noapic" kernel option is required as before.
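To put a number on the drift described above, a minimal sketch of the arithmetic, assuming the 10 hours were measured against a trustworthy reference (e.g. a wristwatch). The function name `clock_speedup` is hypothetical.

```python
def clock_speedup(real_elapsed_h: float, clock_gain_h: float) -> float:
    """Ratio of time shown by the system clock to real elapsed time."""
    return (real_elapsed_h + clock_gain_h) / real_elapsed_h


# Figures from the comment above: 10 hours idle, clock 3 hours ahead.
factor = clock_speedup(real_elapsed_h=10.0, clock_gain_h=3.0)
print(f"system clock runs at {factor:.2f}x real time "
      f"({(factor - 1) * 100:.0f}% fast)")
```

The same calculation applies to the factor-of-3 and factor-of-30 figures reported under network load, given a measured interval and clock gain.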
"Me too". I have a Gigabyte K8N Pro SLI, AMD X2 3800. Running FC4 with kernel 2.6.13-1.1532_FC4smp. My symptoms are - clock runs much faster under heavy CPU/disk load - sound in Gnome is horrible. It sounds very scratchy. - occasionally my keypresses multiply to 4 or 5. None of the mentioned kernel boot options help. "noapic", "apic=off" (not sure if this is even valid) have no effect. "acpi=off" hangs very, very early in the boot process. Disabling my AC97 support in the BIOS does seem to fix most of the problems, except of course that then I have no sound. Is there any resolution to this???
The bug was assigned to Dave Jones yesterday, that's a really good sign that the bug is now taken seriously and is under attack.
Hmm, looks like I have a similar problem. Here is my setup: Asus A8N-E mobo, Athlon(tm) 64 X2 Dual Core Processor 4400+, 2GB RAM, 2 x 150GB SATA HD, nVidia 7800 GT graphics and Creative Labs SB Audigy audio.
kernel 2.6.13-1.1532_FC4smp

After a while, I observe following in the dmesg:

    Losing some ticks... checking if CPU frequency changed.
    ...
    warning: many lost ticks.
    Your time source seems to be instable or some driver is hogging interupts
    rip acpi_processor_idle+0x12f/0x37f

The clock runs too fast, but not in a consistent way. Sometimes, after a reboot, the clock will be stable for several hours, and there is no message in dmesg. Sometimes it will gain over ten minutes in a couple hours, and watching a DVD will produce a somewhat choppy sound. Keyboard starts to autorepeat...

** As a side note, a colleague of mine has a Suse distribution on an x86_64
** laptop (opteron). The clock is perfectly stable when booted in Windows XP.
** When booted in Suse, it depends: sometimes it is stable, sometimes not. When
** it is not stable, the problems appear right after booting, like something is
** not quite well setup from the BIOS or early kernel steps...
** However, when the problem hits, his clock runs 2-4 times too fast...

Googling around, I see that many people have clock troubles on athlon 64 machines. Many posts suggest the noacpi, no_timer_check and other options. I tried the no_timer_check with no success. The other options I was a bit more reluctant to try, as they seem to have nasty side effects...

It does seem that rebooting changes the observed behaviour (rate of clock distortion):
1. I see my clock is bad and start googling around (also notice log messages)
2. add no_timer_check option and reboot
3. observe problem is still present
4. google some more, get tired, remove useless option and halt the machine
5. boot again, do some stuff and leave the machine on
6. return after several hours and expect the clock to be wrong... but no, the clock is fine, and there are no log messages
7. somehow a while later, the problem reappears, log messages appear, clock gets bad
8. tried several bugzilla searches until I hit this bug... hope it's the right one

It looks like once the clock starts drifting, it keeps doing so...
Just some followup on my Comment #26. I don't think my sound issue was related. That was bug #140999; the workarounds listed there fixed my sound issue. However, the clock issue remains. It seems to be particularly sensitive to network traffic. Today I'm having significant trouble with repeating keys (to where my machine is almost unusable). I concur with the symptoms listed in Comment #28. One other thing I noticed is that as my clock speeds up, my hardware clock slows down by almost the same amount. So the two diverge from the true time. My current bandaid is to have a cron job do a 'hwclock --hctosys' every 5 minutes. It's a bad solution, but at least it keeps my system clock somewhat close to the correct time.
(In reply to comment #26)
Another me too report: in my case on an Asus A8N-SLI Deluxe motherboard with an AMD 4200 X2 processor. I have been running the x86_64 version of Fedora without problems (up to last weekend). So the problem appears to be specific to the i386 version of the kernel.
Wrt Louis' comment, I am seeing the weirdo rtc behaviours (throbber thing on Firefox is speeding up 2X? 4X? and slowing down to normal) on the SMP x86_64 build of the current FC4 kernel. All of the problems I saw are on the SMP x86_64... but I didn't see ANY problems on *uniprocessor* x86_64. Was the i386 kernel you ran uniprocessor, Louis?
(In reply to comment #31)
> Wrt Louis' comment, I am seeing the weirdo rtc behaviours (throbber thing on
> Firefox is speeding up 2X? 4X? and slowing down to normal) on the SMP x86_64
> build of the current FC4 kernel. All of the problems I saw are on the SMP
> x86_64... but I didn't see ANY problems on *uniprocessor* x86_64. Was the i386
> kernel you ran uniprocessor, Louis?

No, both the x86_64 and my current i386 kernel are SMP. Current kernel:

    [louis@travel ~]$ uname -a
    Linux travel.pheasant 2.6.13-1.1532_FC4smp #1 SMP Thu Oct 20 01:51:51 EDT 2005 i686 athlon i386 GNU/Linux

I am not sure, however, about the cpuspeed daemon: it ran on my x86_64 FC4 installation; on i386 I had the idea (but I am not sure) that things improved quite a bit when I made cpuspeed work (I had to define the driver in /etc/cpuspeed).
Created attachment 120431 [details]
dmesg of DFI Lanparty with Athlon X2

There are some interesting things in the dmesg I didn't notice before:

    Oct 24 13:51:50 siamese kernel: Using IO-APIC 2
    Oct 24 13:51:50 siamese hcid[2485]: Bluetooth HCI daemon
    Oct 24 13:51:50 siamese kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC
    Oct 24 13:51:50 siamese sdpd[2487]: Bluetooth SDP daemon
    Oct 24 13:51:50 siamese hcid[2485]: Unable to get on D-BUS
    Oct 24 13:51:50 siamese kernel: works.
    Oct 24 13:51:50 siamese kernel: Using local APIC timer interrupts.
    Oct 24 13:51:50 siamese kernel: Detected 13.129 MHz APIC timer.

    Oct 24 13:51:54 siamese kernel: pci_hotplug: PCI Hot Plug PCI Core version: 0.5
    Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
    Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
    Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
    Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
    Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
    Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
    Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
    Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability

This PCI device is "nVidia Corporation PCIE bridge".

    Oct 24 13:51:55 siamese kernel: powernow-k8: Found 2 AMD Athlon 64 / Opteron processors (version 1.50.3)
    Oct 24 13:51:55 siamese kernel: powernow-k8: MP systems not supported by PSB BIOS structure
    Oct 24 13:51:55 siamese kernel: powernow-k8: MP systems not supported by PSB BIOS structure

Attempting to use cpuspeed with powernow_k8 gives:

    #service cpuspeed start
    FATAL: Module powernow_k8 not found.

    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: EHCI Host Controller
    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: debug port 1
    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: BIOS handoff failed (160, 01010001)
    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: continuing after BIOS bug...
    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1
    Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: irq 10, io mem 0xfeb00000
More follow-up on Comment #26. Using the kernel option "pci=noacpi" seems to tame the clock problem. However, I sttill sufffer frroom repeeated keypressssses (I'm nnnot even goiiing to try fixing thhhe ones I'm getting right now)... Perhaps I need to add noapic andd others baaaack in? I also haave very high llateeeenccy in my network traffic (bzffflag is unppplayable)))) and sounnndddd is sometimes delayed. Reeegarding kernels, I'm using the x86_64smp. I uninstalled cpusppppeed since that only appeaars to be useful for mobile chips, and mine is a desktop. I tested the nonn-ssmmp kernel briefly, aaaand I don't recaall it having any of these probllems.
Aha! The 2.6.14-1.1633_FC4smp is looking very good indeed. I removed the noacpi and acpi=off crutches from the kernel commandline. The error messages from boot are gone, cpuspeed is up, vmware sound is now flawless and there are no mentions of lost interrupts in dmesg any more! It seems that the problem is solved.... you guys rock!
I have also logged this as 171554 and it is still happening on the 2.6.14-1633 kernel for me. I have to go back to 2.6.13-1526 for it to work properly.
So far 2.6.14-1633smp has appeared to fix my system as well (see comment #26). I have had it running for about 12 hours with the new kernel and with no modifiers other than the default. The clock is tracking right on, and the keyboard doesn't repeat. Thank you! I can use both CPUs now.
Created attachment 120632 [details] Gavins dmesg output
I just tried 2.6.14-1633 for a second time and it hasn't fixed this issue with my Asus A8N-E motherboard. I even tried the NOAPIC option to see if that would help but I had the same result. I have attached my dmesg output for anyone who is interested.
Louis Lagendijk points out here: https://www.redhat.com/archives/fedora-list/2005-November/msg00267.html that the problem is not truly ;and completely resolved. (notice the ; character just then, this is an example of the keyboard issue mentioned below, it is not a fat finger problem :-) ) And indeed I do see very much less, but still present in dmesg: rtc: lost some interrupts at 1024Hz. rtc: lost some interrupts at 1024Hz. rtc: lost some interrupts at 1024Hz. rtc: lost some interrupts at 1024Hz. rtc: lost some interrupts at 1024Hz. rtc: lost some interrupts at 1024Hz. That's it for 14 hours uptime. the clock problem has not reappeared here so far though: at least, not as badly as even 1 minute over 14hrs, where previously it would be out by an hour or more. I have also seen errors from this USB keyboard from time to time, an extra character, different from the pressed one is added, or a fake shift action on a character, once every 300 or more characters, say. Not sure if it is related or a keyboard / hub problem. Anyway the thing is hugely more usable and these remaining issues are very minor compared to before this kernel update.
2.6.14-1.1637_FC4 has been released as an update for FC4. Please retest with this update, as a large amount of code has been changed in this release, which may have fixed your problem. Thank you.
I have just installed 2.6.14-1.1637_FC4smp and the issue is still persisting. The keyboard isn't repeating hallllf aaaas much as it used to. Another good test is the Gnome Monitor. As I mentioned above: I started Gnome-System-Monitor and strangely enough, every time I moved a window, the CPU graph would just slip across the screen instead of just scrolling left once per second. - This iiiis ssssttttiiiilllllllllll hhhhhhhaaaaaaapppppppppppppppppppeeeeeeeeennnnnnnnniiiiiiiinnnnngggggg jjjuuuuussssstttt tttthhhheeeee ssssaaaaaammmeeee aaassssss iiittt wwwwwwaaaaaaaassss bbbbeeeeffffooorrrrrrrrrrrrrrrrrrre.
I just caught this bug myself. I'm running FC4, and earlier this week I did an update to kernel-smp-2.6.13-1.1532_FC4. Prior to that I had 2.6.12-1456, which worked fine. I began to get the crazy keyboard and mouse behavior, the clock racing, the Gnome-System-Monitor racing, etc. Two days ago, when 2.6.14-1.1637 came out, I tried that but the problem persists. I have an Asus A8V, Athlon X2 4400+ Dual Core. The system runs fine if I boot the uni-processor kernel (that's what I'm running now, else I couldn't type an intelligible email). The smp kernel is unusable.
FWIW, both 2.6.14-1.1637 and 1633 fixed this problem on my system (see comment #26).
Still not fixed in the 2.6.14-1.1637_FC4 SMP kernel. On an idle system, the system clock pace is about 30% above normal. There are many dubious postings to this bug report. As the original reporter, I ask everyone to be somewhat more picky about adding useless comments to this record. I especially point out that this bug is about the APIC functionality in the 2.x SMP kernel. If your trouble does not go away after adding "noapic" as a kernel option, then you certainly want to look elsewhere. Thanks.
2.6.14-1.1637_FC4smp behaves for me the same as the .1633 test kernel: problem largely solved but still lurking around and showing itself in broken extra characters from a USB keyboard and, I believe, false clicks on a USB trackball while typing as well. Joachim, the bug sat around since 2001 and after getting poked at with a stick a few times went quiet for four years except for your reminders that it still existed. 'Dubious' or not, at least the input from people suffering what appears to be very similar symptoms to your ill-understood bug has increased the profile of your problem. And unless the exact detail of the bug is understood (in which case I would expect it to be fixed), nobody is in a position to definitively say that these are not all coming from the same underlying problem.
I have also tried 2.6.14-1.1664_FC4smp and I even tried the kernel parameters noapic & notsc but the problem still persists.
I have been running for well over 24 hours with the clock=pmtmr option and I have not had any problems at all with all the scenarios I use (read much earlier in the bug) to test. My dmesg reads very much the same as Danny's.
Fixed in 2.6.14-1.1743_FC5smp of FC5/rawhide. This may also apply to the latest FC4 kernel updates or earlier FC5/rawhide kernels for which I hadn't tested. A happy day > 4 years after reporting the bug :)
After reverting my system to FC4, I have noticed that the bug is still present even in the latest update kernel 2.6.15-1.1831_FC4smp.
The issue has come back some time in 2006. I attach a current "dmesg" output for kernel "2.6.15-1.2009.4.2_FC5smp-apic".
Created attachment 125672 [details] "dmesg" output for APIC-enabled 2.6.15-1.2009.4.2_FC5 SMP kernel on PR440FX system
Issue still present for kernel "2.6.16-1.2069_FC5smp". System clock pace without any system or network load is 30% above real time.
Still not fixed in update kernel "2.6.16-1.2080_FC5smp".
I found these symptoms when I installed 2.6.16-1.2069_FC4 in an Athlon XP2500+ system. Reverting to 2.6.15.1833_FC4 fixed it. Clock was gaining seconds per minute.
*** Bug 184593 has been marked as a duplicate of this bug. ***
On a uniprocessor machine, this was broken for me earlier, worked fine for the first time in a long time on 2.6.16-1.2080_FC5 even without "noapic," but is now broken again on 2.6.16-1.2096_FC5. I had reported it in Bug 184593, which I marked as a duplicate.
you may try, Kernel command line: no_timer_check or notsc and report_lost_ticks
Still not fixed in "rawhide" kernel 2.6.16-1.2196_FC6 (SMP).
Can you try the "rawhide" kernel with the boot options no_timer_check notsc report_lost_ticks, and post the output of cat /proc/interrupts before and after applying the boot options?
Created attachment 128976 [details] content of /proc/interrupts for 2.6.16-1.2202_FC6 w/o APIC
Created attachment 128977 [details] content of /proc/interrupts for 2.6.16-1.2202_FC6 w/options "no_timer_check report_lost_ticks"
(In reply to comment #62)
> Created an attachment (id=128977) [edit]
> content of /proc/interrupts for 2.6.16-1.2202_FC6 w/options "no_timer_check
> report_lost_ticks"
>
Looks fine to me. What do you say?
The clock keeps speeding ahead ..
[root@node5 ~]# dmesg
Losing some ticks... checking if CPU frequency changed.
[root@node5 ~]# uname -a
Linux node5.news.atman.pl 2.6.16-1.2122_FC5 #1 SMP Sun May 21 15:01:10 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@node5 ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 39
model name      : AMD Opteron(tm) Processor 152
stepping        : 1
cpu MHz         : 2600.000
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips        : 5234.66
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
[root@node5 ~]# w
 15:43:09 up 1 day,  3:01,  1 user,  load average: 1.25, 1.23, 1.35
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/2    host-4.noc.atman  Wed16    0.00s  0.22s  0.00s w
[root@node5 ~]# cat /proc/interrupts
           CPU0
  0:   24347234    IO-APIC-edge  timer
  8:          0    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 14:         35    IO-APIC-edge  ide0
 16:    1774149   IO-APIC-level  libata, ehci_hcd:usb2
 17:          0   IO-APIC-level  libata
 18:   48280381   IO-APIC-level  eth0
 20:        278   IO-APIC-level  ohci_hcd:usb1
NMI:       4387
LOC:   24348340
ERR:          0
MIS:          0
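As a rough cross-check of the numbers pasted above, here is a minimal sketch that turns the /proc/interrupts counters into average rates over the uptime reported by `w`. It handles only the single-CPU layout shown here, and the helper name `irq_counts` is made up for illustration. What rate to expect for IRQ 0 depends on the kernel's HZ setting, which the comment does not state.

```python
import re

# Excerpt of the /proc/interrupts output posted above.
INTERRUPTS = """\
           CPU0
  0:   24347234    IO-APIC-edge  timer
  8:          0    IO-APIC-edge  rtc
 18:   48280381   IO-APIC-level  eth0
LOC:   24348340
"""

# Uptime from the `w` output above: "up 1 day, 3:01" (minute resolution).
UPTIME_S = 1 * 86400 + 3 * 3600 + 1 * 60


def irq_counts(text: str) -> dict:
    """Map each IRQ label to its first (CPU0) counter."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s+(\d+)", line)
        if match:  # the "CPU0" header line has no colon and is skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = irq_counts(INTERRUPTS)
timer_rate = counts["0"] / UPTIME_S
eth0_rate = counts["18"] / UPTIME_S
print(f"timer: {timer_rate:.1f}/s  eth0: {eth0_rate:.1f}/s")
```

For these figures the timer works out to about 250 interrupts per second on average. Taking two snapshots a known interval apart and comparing the deltas would give a current rate instead of a since-boot average.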
Still present in FC6T1 kernel "2.6.16-1.2289_FC6". Due to lack of activity by Red Hat kernel maintainers, I have finally posted the bug (which has been around for almost 5 years) upstream: http://bugzilla.kernel.org/show_bug.cgi?id=6748 To whom it may concern: this bug is about "APIC" problems with the "SMP" enabled Fedora kernels. If your system is uni-processor or single-core, or multi-processor -and- the "noapic" option does -not- make the issue go away, then -please- look elsewhere. Thanks!
added self, as I have a reproducing unit dhcp-63.josh.lan to test potential fixes against.
Still present in kernel "2.6.17-1.2339.fc6".
Still broken in kernel "2.6.17-1.2396.fc6".
Clock still 30% ahead of nominal speed in "2.6.17-1.2532.fc6". However, I have spotted some relevant entries in "dmesg" which partially appear to be of recent origin (file attached): "...trying to set up timer (IRQ0) through the 8259A ..." "Time: tsc clocksource has been installed." "TSC appears to be running slowly. Marking it as unstable" "Time: pit clocksource has been installed."
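To make the clocksource switch easier to spot in a long "dmesg" log, a minimal sketch that extracts the order in which clocksources were installed. It assumes the message wording quoted in the comment above; the function name `clocksource_history` is hypothetical.

```python
import re

# Messages quoted in the comment above.
DMESG = """\
Time: tsc clocksource has been installed.
TSC appears to be running slowly. Marking it as unstable
Time: pit clocksource has been installed.
"""


def clocksource_history(dmesg_text: str) -> list:
    """Return clocksource names in the order the kernel installed them."""
    pattern = r"Time: (\w+) clocksource has been installed\."
    return re.findall(pattern, dmesg_text)


history = clocksource_history(DMESG)
print(" -> ".join(history))
```

Run against a full log, a history with more than one entry would show at which point the kernel abandoned its first choice.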
Created attachment 133918 [details] "dmesg" output for APIC-enabled 2.6.17-1.2532.fc6 SMP kernel on PR440FX system
I'm seeing this regularly in 2.6.17-1.2174_FC5 SMP x86_64. This is what I see in my dmesg:

    Losing some ticks... checking if CPU frequency changed.
    warning: many lost ticks.
    Your time source seems to be instable or some driver is hogging interupts
    rip default_idle+0x2b/0x54
    spurious 8259A interrupt: IRQ15.

I have two nodes connected to my box via xdmcp all day, so there's always network traffic going on. My clock starts to run faster and faster just like everyone else says here. Also, when it happens, it seems that all my windows in X lose focus, and I have to alt-tab to get my mouse to work again. This is an almost brand new system with a dual core amd 64. I'm happy to provide more info if it would prove useful.
please try boot kernel with parameters notsc and report_lost_ticks
(In reply to comment #72) As you may have learnt from my initial report and following, this bug is about "APIC" functionality. So, if you reboot your system adding "noapic" to the kernel options and the issue goes away then this is probably the right place. Please check that first, please.
Created attachment 135843 [details] "dmesg" output for APIC-enabled 2.6.17-1.2630.fc6 SMP kernel on PR440FX system w/options "notsc" and "report_lost_ticks"
(In reply to comment #73) Option "notsc" is not enabled for current "Fedora" kernels according to the "dmesg" log file: "notsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC." Moreover, at the end of the log file the kernel reports: "TSC appears to be running slowly. Marking it as unstable Time: pit clocksource has been installed." which seems to indicate that the "tsc" source has been superseded by the "pit" clock source which thus might be responsible. Needless to say that the already habitual 30% clock advance still applies.
Interestingly, *removing* noapic and acpi=off from my kernel boot options appears to have resolved this issue for me with the 2.6.17-1.2174_FC5 kernel.
(In reply to comment #77) By no means surprising as this reduces the probability of "IRQ" conflicts due to limited resources. In my case, "ACPI" simply cannot be enabled at all (mainboard manufactured in 1998) and the current "APIC" handling in the kernel doesn't seem to like this at all! I think that's what this bug report is all about.
Created attachment 140357 [details]
"dmesg" output for APIC-enabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system

No change for kernel "2.6.18-1.2798.fc6". As with the last kernel for which I had submitted the "dmesg" log file, the current kernel recognizes that the "tsc" time source is unreliable and switches to "pit". Nevertheless, the speed-up is still of the order of 30%. I have also noticed that the frame rate delivered by "glxgears" drops from 360 fps to 130 fps [both at 1400x1050@24bpp] when "APIC" is enabled. Any suggestions on how to proceed with this issue, apart from scrapping my good old "PR440FX"?
Created attachment 140358 [details] "dmesg" output for APIC-disabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system
Kernel "2.6.18-1.2849.fc6" finally seems to be stable. Even after several hours of operation with enabled "APIC", no time slip has occurred not even speaking about the usual 30% speed-up of the system clock: excellent! However, I will also try under network load to give a final conclusion. Btw: it might be a good idea to add the revision number in the changelog as is the case for most other packages.
(In reply to comment #81) Things change significantly as soon as the "DRI" interface is used. After launching "glxgears", the expected frame rate of about 360 is written exactly 2x to the console. After that, it drops to about 250 and stays at this level. There is a new message appended to "/var/log/dmesg": "TSC appears to be running slowly. Marking it as unstable Time: pit clocksource has been installed." which was absent before. And of course, the clock goes crazy again from that moment on .. PS: The graphics card is a "PCI" based "Radeon AIW 7200".
Created attachment 144728 [details] "dmesg" output for APIC-enabled 2.6.19-1.2887.fc6 SMP kernel on PR440FX system As in the case of "2.6.18-1.2849.fc6", "glxgears" triggers an instability of the "tsc" time source for "2.6.19-1.2887.fc6". Instead of the "pit" clock source being installed, "dmesg" now contains a message: "TSC appears to be running slowly. Marking it as unstable" "Time: jiffies clocksource has been installed." ^^^^^^^ The result, however, is the same. As soon as the new clock source has been installed, the system clock loses the right pace and advances faster. The interrupts have been remapped to the 16-18 range whereas for previous kernels the "APIC IRQ" range was 145-161.
Regarding clock speeding up. It is common problem with all 2.6 kernels. According to VmWare's knowledge base, the culprit is increase of HZ constant in 2.6 kernels. 2.4 and earlier kernels had HZ set to 100. 2.6 kernels bumped it to 1000. That means 2.6 kernels will request 10 times more timer interruptes from the hardware per CPU than 2.4 and earlier kernels. Other than clock problems, this also introduced the performance issues, especially on multi-CPU virtual machines. As I said previously, VMWare now needs to emulate 10 times more virtual interrupts per virtual CPU per virtual machine. This quickly adds up. Resulting in lost interrupts. Basically what happens is that number of virtual timer intrruptes gets into several thousand, or even tens of thousands range (cummulative on all virtual machines) and system (hardware, host OS, VMWare, virtual machines) is not able to keep the pace, loosing some of them. There's code in 2.6 kernels that's supposed to make adjustments for lost interrupts. However, it usually overdoes it, resulting in clock being too fast. The code behind "clock=pit" seems to make smallest overadjustment (but it still overdoes it, making system clock go too fast). As an example for performance degradations introduced by increase of HZ from 100 to 1000, few virtual machines running 2.6 kernel on one of my ESX servers are consuming an entire CPU when they are completely idle. Now this is very bad. According to one post on Nahant mailing list, there is a kernel patch to turn off timer when nothing is happening on the virtual machine. The patch is for mainframe architecture, but basically tackles the same problem experienced when running 2.6 kernels on i386/x86_64 under VMWare. Mainframe people seem to have hit this problem many years ago when running hundreds of virtual machines on the mainframe. Here's the URL for the post: https://www.redhat.com/archives/nahant-list/2007-January/msg00059.html There's couple of workarounds. 
The first thing VMware suggests is to recompile the kernel with HZ set to 100. Unfortunately, HZ is a define in the source, so you must manually change it in the source files and then recompile the kernel. It would be nice if it were a variable that could be set from the command line during boot. I've looked a bit into the source, and it might be non-trivial to change HZ into a command line option (but I might be wrong). If recompiling the kernel is not an option, the 2.6 kernel should be booted with "clock=pit". In userspace, ntpd should be disabled (the timer is way too unstable for it to work at all anyhow), vmware-tools installed, and its option to sync time with the host OS enabled. This combination seems to be able to keep the system clock more or less stable. I'm using this on some of my virtual machines. It works OK; the clock still wanders around a bit, but it stays accurate to within a second or two of the wall clock. I guess the best solution would be to make HZ a boot-time command line option (instead of having it hardcoded in the source code). "Normal" users could leave it at 1000 and get whatever questionable benefits there are from having it set that high. VMware folks could decrease it back to the old default value of 100. Another interesting approach to solving these issues is implementing that mainframe timer patch on other architectures (mainly i386 and x86_64), probably also introducing a command line option to trigger it ("normal" users probably don't want their timers to get turned off when the machine is idle).
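To put rough numbers on the interrupt load described in this comment, here is a minimal sketch. It assumes, as the comment does, that every virtual CPU asks for HZ timer interrupts per second; the guest count and the CPUs per guest are hypothetical values chosen for illustration, not measurements from any ESX server.

```python
def timer_interrupts_per_second(hz, vcpus_per_vm, vm_count):
    """Timer interrupts per second the host has to emulate for all guests.

    Assumes each virtual CPU requests `hz` timer interrupts per second.
    """
    return hz * vcpus_per_vm * vm_count


if __name__ == "__main__":
    # Hypothetical host: 10 guests with 2 virtual CPUs each.
    for hz in (100, 1000):
        total = timer_interrupts_per_second(hz, vcpus_per_vm=2, vm_count=10)
        print(f"HZ={hz}: {total} timer interrupts per second")
```

With these made-up figures the load goes from 2000 to 20000 interrupts per second, which matches the "tens of thousands" range mentioned above.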
(In reply to comment #84) However, on my system, adding "noapic" to the kernel boot options fully settles the issue, so I am not really sure whether your reasoning applies to my case. Moreover, according to comment #82, the switch from the "tsc" to the "pit" time source [also "jiffies" according to comment #83] is actually the moment when things do really go wrong.
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.
(In reply to comment #85)
> (In reply to comment #84)
> However, on my system, adding "noapic" to the kernel boot options fully
> settles the issue, so I am not really sure whether your reasoning applies
> to my case.
Using noapic means using only one processor; you need the APIC for SMP to work, so it is not a solution.
> Moreover, according to comment #82, the switch from the
> "tsc" to the "pit" time source [also "jiffies" according to comment #83]
> is actually the moment when things do really go wrong.
With kernel 2.6.21-rc5-git4 (I tried 2.6.20-1.3036), for the first time in a vanilla kernel (I think it is an effect of the vanilla kernel), my computer detects 2 more clocksources, acpi_pm and tsc, and the processor works in just one state, C1:
cat /proc/acpi/processor/CPU1/power
active state: C1
max_cstate: C8
bus master activity: 00000000
maximum allowed latency: 2000 usec
states:
*C1: type[C1] promotion[--] demotion[--] latency[000] usage[00000000] duration[00000000000000000000]
Since then my computer has worked 100% correctly. So you may want to try kernel 2.6.21 and report your experience.
(In reply to comment #87) > Use noapic is use only one processador, you need apic to work on SMP, so it is > not a solution. This is plain nonsense. I have been using my "PR440FX" board for years now, and it definitely runs in "SMP" mode when "APIC" is disabled.
(In reply to comment #88)
> (In reply to comment #87)
> > Use noapic is use only one processador, you need apic to work on SMP, so it is
> > not a solution.
> This is plain nonsense. I have been using my "PR440FX" board for years now,
> and it definitely runs in "SMP" mode when "APIC" is disabled.
If you run cat /proc/interrupts, you will see that just one CPU is working. I think
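The /proc/interrupts check suggested in this comment can be scripted rather than read by eye. This is a minimal sketch, assuming the usual layout of a header row of CPU names followed by one row of per-CPU counts for each interrupt source. The function name and the sample data are made up for illustration and do not come from any machine in this report.

```python
def per_cpu_interrupt_totals(text):
    """Sum the interrupt counts in /proc/interrupts-style text per CPU."""
    lines = text.strip().splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row, e.g. ["CPU0", "CPU1"]
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        # ERR and MIS carry a single global counter, not per-CPU counts.
        if not fields or fields[0] in ("ERR:", "MIS:"):
            continue
        counts = fields[1:1 + len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


# Made-up sample, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:       1000          0    IO-APIC-edge  timer
 16:        200         50   IO-APIC-level  eth0
NMI:          0          0
ERR:          0
"""

if __name__ == "__main__":
    print("sample:", per_cpu_interrupt_totals(SAMPLE))
    try:
        with open("/proc/interrupts") as f:
            print("this machine:", per_cpu_interrupt_totals(f.read()))
    except OSError:
        print("/proc/interrupts is not available here")
```

It only shows where interrupts were delivered, which is a separate question from whether both processors are scheduling work.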
After updating to "F7" with "kernel-2.6.21-1.3194.fc7" (2.6.21.2), I haven't observed any of the previous problems anymore.
Does this bug also apply to Red Hat Enterprise Linux ES 4 with kernel 2.6.9-42.ELsmp, or does it refer only to Fedora Core?