Bug 55223 - (APIC IRQ_ROUTING)SMP-kernel speeds up system clock under network load
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Dave Jones
QA Contact: Brock Organ
Keywords: Reopened
Duplicates: 184593
Reported: 2001-10-27 12:23 EDT by Joachim Frieben
Modified: 2015-01-04 17:01 EST
CC List: 17 users

Fixed In Version: kernel-2.6.21-1.3194.fc7
Doc Type: Bug Fix


Attachments
Actual output of the command 'lspci -vv' (3.54 KB, text/plain)
2001-10-27 12:26 EDT, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.9-1.667 SMP kernel on PR440FX system (11.50 KB, text/plain)
2004-12-20 02:54 EST, Joachim Frieben
"dmesg" output for APIC-disabled 2.6.9-1.667 SMP kernel on PR440FX system (11.20 KB, text/plain)
2004-12-20 02:57 EST, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system (11.78 KB, text/plain)
2005-01-12 10:30 EST, Joachim Frieben
"dmesg" output for APIC-disabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system (11.47 KB, text/plain)
2005-01-12 10:31 EST, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.11-1.14_FC3 SMP kernel on PR440FX system (22.98 KB, text/plain)
2005-04-24 12:30 EDT, Joachim Frieben
dmseg of DFI Lanparty with Athlon X2 (29.02 KB, text/plain)
2005-10-26 16:09 EDT, Andy Green
Gavins dmesg output (26.13 KB, text/plain)
2005-11-02 02:46 EST, Gavin Graham
"dmesg" output for APIC-enabled 2.6.15-1.2009.4.2_FC5 SMP kernel on PR440FX system (13.54 KB, text/plain)
2006-03-05 11:15 EST, Joachim Frieben
content of /proc/interrupts for 2.6.16-1.2202_FC6 w/o APIC (663 bytes, text/plain)
2006-05-13 07:56 EDT, Joachim Frieben
content of /proc/interrupts for 2.6.16-1.2202_FC6 w/options "no_timer_check report_lost_ticks" (706 bytes, text/plain)
2006-05-13 07:59 EDT, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.17-1.2532.fc6 SMP kernel on PR440FX system (16.20 KB, text/plain)
2006-08-10 05:19 EDT, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.17-1.2630.fc6 SMP kernel on PR440FX system w/options "notsc" and "report_lost_ticks" (18.40 KB, text/plain)
2006-09-08 08:30 EDT, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system (14.31 KB, text/plain)
2006-11-04 08:27 EST, Joachim Frieben
"dmesg" output for APIC-disabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system (13.38 KB, text/plain)
2006-11-04 08:28 EST, Joachim Frieben
"dmesg" output for APIC-enabled 2.6.19-1.2887.fc6 SMP kernel on PR440FX system (14.89 KB, text/plain)
2007-01-03 14:13 EST, Joachim Frieben


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 6419 None None None Never

Description Joachim Frieben 2001-10-27 12:23:48 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
The current SMP-kernel speeds up the system clock when data is transferred
via LAN. This is the case as much for incoming as for outgoing data. Using
the ordinary non-SMP-kernel makes this behaviour disappear. The
acceleration factor is about 4 for up-/downloading entire blocks of data
via FTP, and goes up to about 30 (!) when for instance the command 'ls -R
/' is executed on the remote computer, and the output is displayed on the
local host. This behaviour is independent of X and occurs already in a
simple text console at runlevel 3. Furthermore, it does not depend on the
type of connection (TELNET vs SSH).

Version-Release number of selected component (if applicable):
2.4.9-7

How reproducible:
Always

Steps to Reproduce:
1. Boot system into SMP mode
2. Log in to a remote computer via TELNET/SSH
3. Type 'ls -R /'
	

Actual Results:  The system clock runs forward in time like crazy!


Expected Results:  System clock keeps running at regular speed


Additional info:

The current system is an INTEL PR440FX based Dual Pentium Pro workstation
with 512 MB of system memory and an integrated INTEL EtherExpress Pro 100B
10/100 MBit/s network adapter.

I had reported the same bug already for kernel version 2.4.7-2 (Roswell) as
bug #53914. It seems furthermore to be related (or identical) to bug #24680.
Comment 1 Joachim Frieben 2001-10-27 12:26:07 EDT
Created attachment 35241 [details]
Actual output of the command 'lspci -vv'
Comment 2 Alan Cox 2001-10-27 12:29:31 EDT
Does it work if you run an SMP kernel with the "noapic" option? This looks like
an IRQ routing table problem so that network interrupts are triggering timer
interrupts.
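
One way to probe this hypothesis is to compare the timer interrupt rate on an idle system with the rate during a network transfer. The following is a minimal sketch, not part of the original report: the helper names timer_irq_count and timer_irq_rate are hypothetical, and it assumes the tick is accounted on the line labelled "timer" in /proc/interrupts.

```shell
# Sum the per-CPU counts on the "timer" line of /proc/interrupts-style
# text read from stdin. (Hypothetical helper, for illustration only.)
timer_irq_count() {
    awk '$NF == "timer" {
        total = 0
        # Fields after the IRQ number are per-CPU counts, then the
        # controller type and device name; add only the numeric ones.
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9]+$/) total += $i
        printf "%.0f\n", total
    }'
}

# Interrupts per second between two samples taken SECONDS apart.
# Usage: timer_irq_rate FIRST_COUNT SECOND_COUNT SECONDS
timer_irq_rate() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# On a live system (run once idle, once during an FTP transfer):
#   before=$(timer_irq_count < /proc/interrupts); sleep 10
#   after=$(timer_irq_count < /proc/interrupts)
#   timer_irq_rate "$before" "$after" 10
```

If the rate under network load is clearly higher than the idle rate, that would be consistent with network interrupts being counted as timer ticks.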
Comment 3 Joachim Frieben 2001-10-27 14:08:08 EDT
As a matter of fact, the "noapic" workaround keeps the system clock running at
normal speed. Nevertheless, there seems to be some major bug in the 2.4.x SMP
kernels which is absent in 2.2.x SMP kernels, and moreover, it seems to be quite
resistant (2.4.0 preview version was already available in Red Hat Linux 7.0 one
year ago).
Comment 4 Alan Cox 2003-06-07 15:13:01 EDT
Should be ok in modern kernels - is it ?
Comment 5 Joachim Frieben 2003-11-14 03:46:30 EST
No, unfortunately, this issue has still not been settled. The latest
Fedora kernel version "kernel-smp-2.4.22-1.2115.nptl" still behaves in
the way that I had described 2 years ago. Adding "noapic" cures this
flaw, but that's only a workaround of course. I have reported a more
recent, possibly related problem hitting the PR440FX dual Pentium Pro
platform in bug #107446.
Comment 6 Dave Jones 2003-12-16 20:19:56 EST
Due to the short amount of time before EOL of RHL 7.2, this is
probably better being reassigned as a Fedora bug.
Comment 7 Joachim Frieben 2004-05-02 07:17:29 EDT
Ts, ts, ts ... . Testing Fedora Core 2 Test 3 and
kernel-smp-2.6.5-1.327.i686.rpm *still* showing the bug reported earlier. :-/
Comment 8 Dave Jones 2004-11-27 17:47:40 EST
bizarre. still a problem under the 2.6.9 based kernel updates ?
Comment 9 Joachim Frieben 2004-12-20 02:49:37 EST
Yes, it definitely is. I have tested kernel-smp-2.6.9-1.667
after a fresh install of FC3. The console problem has decreased
significantly. Executing a remote "ls -R /" now leads to an increase
in the clock speed of a mere 15%. However, the transfer of large
chunks of data still speeds up the clock by an enormous factor of 3.
Booting with the noapic option restores normal operation. The two
attached "dmesg" output files, one with and one without the noapic option,
show that when APIC is enabled, some assigned IRQs change significantly.
Comment 10 Joachim Frieben 2004-12-20 02:54:53 EST
Created attachment 108876 [details]
"dmesg" output for APIC-enabled 2.6.9-1.667 SMP kernel on PR440FX system
Comment 11 Joachim Frieben 2004-12-20 02:57:28 EST
Created attachment 108877 [details]
"dmesg" output for APIC-disabled 2.6.9-1.667 SMP kernel on PR440FX system
Comment 12 Joachim Frieben 2005-01-12 10:27:57 EST
Currently, things are going from bad to worse. After upgrading to the
2.6.9-1.724_FC3 SMP kernel, I first thought my PR440FX's RTC had
finally broken, because it suddenly advanced at a 10% faster pace than
my wristwatch (without any network traffic). Fortunately, the culprit
is "only" my old friend, the APIC bug: the "noapic" kernel option
suppresses the observed misbehaviour completely. Interrupts got
reassigned by the new kernel, so, e.g., my USR PCI hardware modem got
reconfigured by "kudzu". Anyway, compared to my earlier postings,
something is screwed up even more severely than it used to be in the
past. I cannot tell if the 440FX chipset has a particular flaw
itself. The mobo I use is one of the very last pieces
manufactured by INTEL. The BIOS corresponds to the final release.
After all, the PR440FX is what I would call the P6 SMP reference
platform. So, it is pretty amazing how badly things work.
Comment 13 Joachim Frieben 2005-01-12 10:30:25 EST
Created attachment 109668 [details]
"dmesg" output for APIC-enabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system

"dmesg" output for APIC-enabled 2.6.9-1.667 SMP kernel on PR440FX system
Comment 14 Joachim Frieben 2005-01-12 10:31:51 EST
Created attachment 109669 [details]
"dmesg" output for APIC-disabled 2.6.9-1.724_FC3 SMP kernel on PR440FX system
Comment 15 Dave Jones 2005-04-16 00:27:48 EDT
Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.
Comment 16 Joachim Frieben 2005-04-24 12:22:24 EDT
Still not fixed in Fedora Core 3 for current 2.6.11-1.14_FC3 SMP kernel.
Comment 17 Joachim Frieben 2005-04-24 12:30:48 EDT
Created attachment 113606 [details]
"dmesg" output for APIC-enabled 2.6.11-1.14_FC3 SMP kernel on PR440FX system
Comment 18 Joachim Frieben 2005-07-06 14:43:57 EDT
Still not fixed in Fedora Core 4 for (almost) current 2.6.12-1.1385_FC4
SMP kernel.
Comment 19 Dave Jones 2005-07-15 17:47:06 EDT
[This comment has been added as a mass update for all FC4 kernel bugs.
 If you have migrated this bug from an FC3 bug today, ignore this comment.]

Please retest your problem with today's 2.6.12-1.1398_FC4 update.

If your problem involved being unable to boot, or some hardware not being
detected correctly, please make sure your /etc/modprobe.conf is correct *BEFORE*
installing any kernel updates.
If in doubt, you can recreate this file using..

mv /etc/sysconfig/hwconf /etc/sysconfig/hwconf.bak
mv /etc/modprobe.conf /etc/modprobe.conf.bak
kudzu


Thank you.
Comment 20 Joachim Frieben 2005-08-21 15:23:56 EDT
Still not fixed in the 2.6.12-1.1398_FC4 SMP kernel. The impact is less severe
than for version 2.6.12-1.1385_FC4 reducing to the level of kernels prior to
2.6.9-1.724_FC3. However, transferring large chunks of data over "eth0" still
leads to a speed-up of the system clock by a factor of about 3 as reported before.
Comment 21 Andy Green 2005-09-26 06:03:53 EDT
"Me too", DFI Lanparty nForce 4 board with Athlon X2 4400.  Only present on SMP
kernel.  2.6.12-1.1456_FC4smp exhibits the problem.  IIRC noapic causes the
kernel to blow chunks with a "nobody cared" about the SATA interrupts very
early on in the boot.  I will try this again and report if this is not true.
 
Other fun behaviours associated with the bug are keyboard autorepeats getting 
triggered far too early (presumably side effect of ticks getting advanced) and 
XP in vmware 5 really being unusable due to multicharacters for every keyboard 
event.  The time is as reported by the others advancing at 10 - 30% or so 
faster than realtime. 
 
Kernel is reporting in dmesg a spew of rtc: lost some interrupts at 1024Hz. 
errors. 
 
I can't say that it is very correlated to network traffic, but it could be. 
Comment 22 Dave Jones 2005-09-30 02:34:18 EDT
Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.
Comment 23 Andy Green 2005-10-06 04:37:55 EDT
2.6.13-1.1526_FC4smp does improve the situation... the keyboard is usable again
and the clock, with ntpd running, does not get to wander so much.  But the core
problem is still present, the clock is advancing ahead of realtime and something
ugly is going on somewhere.  The problem does not occur on the UP kernel.

dmesg is chock full of

rtc: lost some interrupts at 2048Hz.
rtc: lost some interrupts at 2048Hz.

(previously this was 1024Hz).

Maybe this is normal, but when running vmware, the process vmware-rtc is seeing
1% of CPU on a permanent basis... since this process has no memory footprint
according to top I assume this is due to the RTC interrupts.

If there is any investigation I can usefully do I will try it.
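
To track how often these messages appear, and at which RTC rate, the kernel log can be tallied. This is a minimal sketch: the function name lost_rtc_by_rate is hypothetical, and it assumes the message format quoted above, with the rate as the last word on the line.

```shell
# Tally "rtc: lost some interrupts" messages by the reported rate.
# Reads kernel log text on stdin. (Hypothetical helper.)
lost_rtc_by_rate() {
    awk '/rtc: lost some interrupts at/ {
        rate = $NF
        sub(/\.$/, "", rate)   # drop the trailing full stop
        n[rate]++
    }
    END { for (r in n) print r, n[r] }' | sort
}

# On a live system:
#   dmesg | lost_rtc_by_rate
```

Comparing the tallies before and after a kernel update gives a rough measure of whether the situation has improved.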
Comment 24 Andy Green 2005-10-10 14:43:36 EDT
The clock wander was as bad as ever after it was given some time to get way out
of whack.  However, since I had to reboot today I decided to try the following
commandline

ro root=/dev/VolGroup00/LogVol00 quiet noacpi acpi=off

On previous kernels this generated errors early in boot and ended with a panic.
Now this gives the following line on booting, and completes the boot otherwise fine:

Oct 10 09:07:22 siamese kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC

Now the symptoms seem to have almost completely gone... no abnormal RTC drift at
all, no messages in dmesg about missed interrupts... the only maybe symptom left
is broken audio out of XP in vmware, which is not present on the UP kernel, but
this can be something completely different.
Comment 25 Joachim Frieben 2005-10-22 14:54:36 EDT
Still not fixed in the 2.6.13-1.1532_FC4 SMP kernel. Leaving the computer
idle for about 10 hours after syncing the clock, it is 3 hours in advance
with respect to real time. As the PR440FX is not the very latest model,
ACPI gets disabled automatically. APIC however, is up and running. To make
the system clock work correctly, the "noapic" kernel option is required as
before.
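
As a worked example of the figures in this comment: a clock that ends up 3 hours ahead after 10 hours of real time is running about 30% fast. The sketch below only does that arithmetic; the function name clock_gain_percent is hypothetical.

```shell
# Percentage by which the system clock runs fast, given the real
# elapsed time and the time gained over that span (same unit for both).
# Usage: clock_gain_percent REAL_ELAPSED GAINED
clock_gain_percent() {
    awk -v real="$1" -v gained="$2" \
        'BEGIN { printf "%.0f\n", 100 * gained / real }'
}

clock_gain_percent 10 3   # 3 hours gained over 10 real hours: prints 30
```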
Comment 26 Dan Hensley 2005-10-23 16:46:40 EDT
"Me too".  I have a Gigabyte K8N Pro SLI, AMD X2 3800.  Running FC4 with kernel
2.6.13-1.1532_FC4smp.  My symptoms are
  - clock runs much faster under heavy CPU/disk load
  - sound in Gnome is horrible.  It sounds very scratchy.
  - occasionally my keypresses multiply to 4 or 5.

None of the mentioned kernel boot options help.  "noapic", "apic=off" (not sure
if this is even valid) have no effect.  "acpi=off" hangs very, very early in the
boot process.

Disabling my AC97 support in the BIOS does seem to fix most of the problems,
except of course that then I have no sound.

Is there any resolution to this???
Comment 27 Andy Green 2005-10-23 16:51:57 EDT
The bug was assigned to Dave Jones yesterday, that's a really good sign that the
bug is now taken seriously and is under attack.
Comment 28 Christian Iseli 2005-10-25 17:16:04 EDT
Hmm, looks like I have a similar problem.  Here is my setup:
Asus A8N-E mobo, Athlon(tm) 64 X2 Dual Core Processor 4400+, 2GB RAM, 2 x 150GB
SATA HD, nVidia 7800 GT graphics and Creative Labs SB Audigy audio.
kernel 2.6.13-1.1532_FC4smp

After a while, I observe following in the dmesg:
Losing some ticks... checking if CPU frequency changed.

...

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip acpi_processor_idle+0x12f/0x37f

The clock runs too fast, but not in a consistent way.  Sometimes, after a
reboot, the clock will be stable for several hours, and there is no message in
dmesg.  Sometimes it will gain over ten minutes in a couple hours, and watching
a DVD will produce a somewhat choppy sound.  Keyboard starts to autorepeat...

** As a side note, a colleague of mine has a Suse distribution on an x86_64
** laptop (opteron).  The clock is perfectly stable when booted in Windows XP.
** When booted in Suse, it depends: sometimes it is stable, sometimes not.  When
** it is not stable, the problems appear right after booting, like something is
** not quite well setup from the BIOS or early kernel steps...
** However, when the problem hits, his clock runs 2-4 times too fast...

Googling around, I see that many people have clock troubles on athlon 64
machines.  Many posts suggest the noacpi, no_timer_check and other options.
I tried the no_timer_check with no success.  The other options I was a bit
more reluctant to try, as they seem to have nasty side effects...
It does seem that rebooting changes the observed behaviour (rate of clock
distortion):
 1. I see my clock is bad and start googling around (also notice log messages)
 2. add no_timer_check option and reboot
 3. observe problem is still present
 4. google some more, get tired, remove useless option and halt the machine
 5. boot again, do some stuff and leave the machine on
 6. return after several hours and expect the clock to be wrong... but no, the
    clock is fine, and there are no log messages
 7. somehow a while later, the problem reappears, log messages appear, clock
    gets bad
 8. tried several bugzilla searches until I hit this bug... hope it's the right
    one

It looks like once the clock starts drifting, it keeps doing so...
Comment 29 Dan Hensley 2005-10-25 17:37:23 EDT
Just some followup on my Comment #26.  I don't think my sound issue was related.
 That was bug #140999; the workarounds listed there fixed my sound issue. 
However,  the clock issue remains.  It seems to be particularly sensitive to
network traffic.  Today I'm having significant trouble with repeating keys (to
where my machine is almost unusable).  I concur with the symptoms listed in
Comment #28.

One other thing I noticed is that as my clock speeds up, my hardware clock slows
down by almost the same amount.  So the two diverge from the true time.

My current bandaid is to have a cron job do a 'hwclock --hctosys' every 5
minutes.  It's a bad solution, but at least it keeps my system clock somewhat
close to the correct time.
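
A cron job of the kind described here could look roughly like the following sketch. It is an illustration, not the commenter's actual setup: the function name, the /sbin/hwclock path and the cron.d-style format with a user field are assumptions. Note that hwclock spells the "hardware clock to system clock" option --hctosys.

```shell
# Write a cron.d-style entry that sets the system clock from the
# hardware clock every 5 minutes.
# Assumptions: hwclock lives in /sbin and the job runs as root.
# Usage: write_hwclock_cron OUTPUT_FILE
write_hwclock_cron() {
    printf '%s\n' '*/5 * * * * root /sbin/hwclock --hctosys' > "$1"
}

# As root, for example (the file name is an assumption):
#   write_hwclock_cron /etc/cron.d/hwclock-resync
```

Because the hardware clock is reported to drift as well, this only keeps the system clock roughly in line with it.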
Comment 30 Louis Lagendijk 2005-10-26 15:27:31 EDT
(In reply to comment #26)
Another me too report:
in my case on an Asus A8N-sli de luxe motherboard with an AMD 4200 X2 processor.
I have been running the X86_64 version of Fedora without problems (up to last
weekend). So the problem appears to be specific to the i386 version of the kernel.
Comment 31 Andy Green 2005-10-26 15:34:36 EDT
Wrt Louis' comment, I am seeing the weirdo rtc behaviours (throbber thing on
Firefox is speeding up 2X? 4X? and slowing down to normal) on the SMP x86_64
build of the current FC4 kernel.  All of the problems I saw are on the SMP
x86_64... but I didn't see ANY problems on *uniprocessor* x86_64.  Was the i386
kernel you ran uniprocessor, Louis?
Comment 32 Louis Lagendijk 2005-10-26 15:52:57 EDT
(In reply to comment #31)
> Wrt Louis' comment, I am seeing the weirdo rtc behaviours (throbber thing on
> Firefox is speeding up 2X? 4X? and slowing down to normal) on the SMP x86_64
> build of the current FC4 kernel.  All of the problems I saw are on the SMP
> x86_64... but I didn't see ANY problems on *uniprocessor* x86_64.  Was the i386
> kernel you ran uniprocessor, Louis?

no, both the X86_64 and my current i386 kernel are SMP. Current kernel:
[louis@travel ~]$ uname -a
Linux travel.pheasant 2.6.13-1.1532_FC4smp #1 SMP Thu Oct 20 01:51:51 EDT 2005
i686 athlon i386 GNU/Linux

I am not sure, however, about the cpuspeed daemon: it ran on my X86_64 FC4
installation; on i386 I had the impression (but I am not sure) that things
improved quite a bit when I made cpuspeed work (I had to define the driver in
/etc/cpuspeed).


Comment 33 Andy Green 2005-10-26 16:09:37 EDT
Created attachment 120431 [details]
dmseg of DFI Lanparty with Athlon X2

There are some interesting things in the dmesg I didn't notice before

Oct 24 13:51:50 siamese kernel: Using IO-APIC 2
Oct 24 13:51:50 siamese hcid[2485]: Bluetooth HCI daemon
Oct 24 13:51:50 siamese kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC
Oct 24 13:51:50 siamese sdpd[2487]: Bluetooth SDP daemon 
Oct 24 13:51:50 siamese hcid[2485]: Unable to get on D-BUS
Oct 24 13:51:50 siamese kernel: works.
Oct 24 13:51:50 siamese kernel: Using local APIC timer interrupts.
Oct 24 13:51:50 siamese kernel: Detected 13.129 MHz APIC timer.


Oct 24 13:51:54 siamese kernel: pci_hotplug: PCI Hot Plug PCI Core version: 0.5

Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability
Oct 24 13:51:54 siamese kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS
Oct 24 13:51:54 siamese kernel: assign_interrupt_mode Found MSI capability

This PCI device is "nVidia Corporation PCIE bridge"

Oct 24 13:51:55 siamese kernel: powernow-k8: Found 2 AMD Athlon 64 / Opteron processors (version 1.50.3)
Oct 24 13:51:55 siamese kernel: powernow-k8: MP systems not supported by PSB BIOS structure
Oct 24 13:51:55 siamese kernel: powernow-k8: MP systems not supported by PSB BIOS structure

Attempting to use cpuspeed with powernow_k8 gives:

#service cpuspeed start
FATAL: Module powernow_k8 not found.


Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: EHCI Host Controller
Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: debug port 1
Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: BIOS handoff failed (160, 01010001)
Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: continuing after BIOS bug...
Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1
Oct 24 13:51:56 siamese kernel: ehci_hcd 0000:00:02.1: irq 10, io mem 0xfeb00000
Comment 34 Dan Hensley 2005-10-26 16:25:51 EDT
More follow-up on Comment #26.  Using the kernel option "pci=noacpi" seems to
tame the clock problem.  However, I sttill sufffer frroom repeeated
keypressssses (I'm nnnot even goiiing to  try fixing thhhe ones  I'm getting
right now)...  Perhaps     I need to add noapic andd     others baaaack in?  I
also haave very high llateeeenccy in my network traffic (bzffflag is
unppplayable)))) and sounnndddd is sometimes delayed.

Reeegarding kernels, I'm using the x86_64smp.  I uninstalled cpusppppeed since
that only appeaars  to be useful for mobile chips, and mine is a desktop.  I
tested the nonn-ssmmp kernel briefly, aaaand I don't recaall it having any of
these probllems.
Comment 35 Andy Green 2005-10-31 15:36:21 EST
Aha!  The 2.6.14-1.1633_FC4smp is looking very good indeed.  I removed the
noacpi and acpi=off crutches from the kernel commandline.

The error messages from boot are gone, cpuspeed is up, vmware sound is now
flawless and there are no mentions of lost interrupts in dmesg any more!

It seems that the problem is solved.... you guys rock!
Comment 36 Gavin Graham 2005-10-31 17:08:19 EST
I have also logged this as 171554 and it is still happening on the 2.6.14-1633
kernel for me. I have to go back to 2.6.13-1526 for it to work properly.

Comment 37 Dan Hensley 2005-11-02 00:25:36 EST
So far 2.6.14-1633smp has appeared to fix my system as well (see comment #26). 
I have had it running for about 12 hours with the new kernel and with no
modifiers other than the default.  The clock is tracking right on, and the
keyboard doesn't repeat.  Thank you!  I can use both CPUs now.
Comment 38 Gavin Graham 2005-11-02 02:46:04 EST
Created attachment 120632 [details]
Gavins dmesg output
Comment 39 Gavin Graham 2005-11-02 02:48:52 EST
I just tried 2.6.14-1633 for a second time and it hasn't fixed this issue with
my Asus A8N-E motherboard. I even tried the NOAPIC option to see if that would
help but I had the same result. I have attached my dmesg output for anyone who
is interested.
Comment 40 Andy Green 2005-11-02 07:03:23 EST
Louis Lagendijk points out here:

https://www.redhat.com/archives/fedora-list/2005-November/msg00267.html

that the problem is not truly ;and completely resolved.  (notice the ; character
just then, this is an example of the keyboard issue mentioned below, it is not a
fat finger problem :-) )

And indeed I do see very much less, but still present in dmesg:

rtc: lost some interrupts at 1024Hz.
rtc: lost some interrupts at 1024Hz.
rtc: lost some interrupts at 1024Hz.
rtc: lost some interrupts at 1024Hz.
rtc: lost some interrupts at 1024Hz.
rtc: lost some interrupts at 1024Hz.

That's it for 14 hours uptime.

the clock problem has not reappeared here so far though: at least, not as badly
as even 1 minute over 14hrs, where previously it would be out by an hour or more.

I have also seen errors from this USB keyboard from time to time, an extra
character, different from the pressed one is added, or a fake shift action on a
character, once every 300 or more characters, say.  Not sure if it is related or
a keyboard / hub problem.

Anyway the thing is hugely more usable and these remaining issues are very minor
compared to before this kernel update.
Comment 41 Dave Jones 2005-11-10 14:34:19 EST
2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.
Comment 42 Gavin Graham 2005-11-11 15:01:49 EST
I have just installed 2.6.14-1.1637_FC4smp and the issue is still persisting.
The keyboard isn't repeating hallllf    aaaas   much as it used to. Another good
test is the Gnome Monitor. As I mentioned above: I started Gnome-System-Monitor
and strangely enough, every time I moved a window, the CPU graph would just slip
across the screen instead of just scrolling left once per second. - This iiiis
ssssttttiiiilllllllllll       
hhhhhhhaaaaaaapppppppppppppppppppeeeeeeeeennnnnnnnniiiiiiiinnnnngggggg     
jjjuuuuussssstttt    tttthhhheeeee     ssssaaaaaammmeeee    aaassssss    iiittt
wwwwwwaaaaaaaassss  bbbbeeeeffffooorrrrrrrrrrrrrrrrrrre.
Comment 43 George Howitt 2005-11-12 15:13:44 EST
I just caught this bug myself.
I'm running FC4, and earlier this week I did an update to
kernel-smp-2.6.13-1.1532_FC4. Prior to that I had 2.6.12-1456, which worked fine.

I began to get the crazy keyboard and mouse behavior, and the clock racing, the
Gnome-System-Monitor racing etc.

Two days ago, when 2.6.14-1.1637 came out, I tried that but the problem persists.

I have an Asus A8V, Athlon X2 4400+ Dual Core.

The system runs fine if I boot the uni-processor kernel (that's what I'm running
now, else I couldn't type an intelligible email). The smp kernel is unusable.
Comment 44 Dan Hensley 2005-11-12 16:16:22 EST
FWIW, both 2.6.14-1.1637 and 1633 fixed this problem on my system (see comment #26).
Comment 45 Joachim Frieben 2005-11-13 04:13:14 EST
Still not fixed in the 2.6.14-1.1637_FC4 SMP kernel. On an idle system,
the system clock pace is about 30% above normal.
There are many dubious postings to this bug report. As the original reporter,
I ask to be somewhat more picky about adding useless comments to this
record. I especially point out, that this bug is about the APIC
functionality in the 2.x SMP kernel. If your trouble does not go away
after adding "noapic" as kernel option, then you certainly want to look
elsewhere. Thanks.
Comment 46 Andy Green 2005-11-16 06:42:48 EST
2.6.14-1.1637_FC4smp behaves for me the same as the .1633 test kernel, problem
largely solved but still lurking around and showing itself in broken extra
characters from a USB keyboard and false clicks on a USB trackball while typing
as well I believe.

Joachim, the bug sat around since 2001 and after getting poked at with a stick a
few times went quiet for four years except for your reminders that it still
existed.  'dubious' or not at least the input from people suffering what appears
to be very similar symptoms to your ill-understood bug has increased the profile
of your problem.  And unless the exact detail of the bug is understood (in which
case I would expect it to be fixed), nobody is in a position to definitively say
that these are not all coming from the same underlying problem.
Comment 47 Gavin Graham 2005-11-29 17:23:23 EST
I have also tried 2.6.14-1.1664_FC4smp and I even tried the kernel parameters
noapic & notsc but the problem still persists.

Comment 48 Gavin Graham 2005-12-09 15:14:50 EST
I have been running for well over 24 hours with the clock=pmtmr option and I
have not had any problems at all with all the scenarios I use (read much earlier
in the bug) to test.

My DMESG read very much the same as Dannys.
Comment 49 Joachim Frieben 2005-12-29 09:49:06 EST
Fixed in 2.6.14-1.1743_FC5smp of FC5/rawhide. This may also apply to
the latest FC4 kernel updates or earlier FC5/rawhide kernels for which
I hadn't tested. A happy day > 4 years after reporting the bug :)
Comment 50 Joachim Frieben 2006-02-10 17:12:38 EST
After reverting my system to FC4, I have noticed that the bug is still
present even in the latest update kernel 2.6.15-1.1831_FC4smp.
Comment 51 Joachim Frieben 2006-03-05 11:12:31 EST
The issue has come back some time in 2006. I attach a current "dmesg"
output for kernel "2.6.15-1.2009.4.2_FC5smp-apic".
Comment 52 Joachim Frieben 2006-03-05 11:15:33 EST
Created attachment 125672 [details]
"dmesg" output for APIC-enabled 2.6.15-1.2009.4.2_FC5 SMP kernel on PR440FX system
Comment 53 Joachim Frieben 2006-03-30 12:50:29 EST
Issue still present for kernel "2.6.16-1.2069_FC5smp". System clock
pace without any system or network load is 30% above real time.
Comment 54 Joachim Frieben 2006-04-01 03:10:28 EST
Still not fixed in update kernel "2.6.16-1.2080_FC5smp".
Comment 55 Michael Godfrey 2006-04-01 11:38:15 EST
I found these symptoms when I installed 2.6.16-1.2069_FC4 in an Athlon XP2500+
system. Reverting to 2.6.15.1833_FC4 fixed it.  Clock was gaining seconds per
minute.
Comment 56 John Thacker 2006-04-20 19:49:28 EDT
*** Bug 184593 has been marked as a duplicate of this bug. ***
Comment 57 John Thacker 2006-04-20 19:52:29 EDT
On a uniprocessor machine, this was broken for me earlier, worked fine for the
first time in a long time on 2.6.16-1.2080_FC5 even without "noapic," but is now
broken again on 2.6.16-1.2096_FC5.  I had reported it in Bug 184593, which I
marked as a duplicate.
Comment 58 Sergio Monteiro Basto 2006-05-04 15:28:14 EDT
you may try, Kernel command line:
no_timer_check
or 
notsc
and 
report_lost_ticks
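
When testing options like these, it helps to confirm that the running kernel actually booted with them. The sketch below checks /proc/cmdline for each option named in this thread; the function names has_boot_option and report_boot_options are hypothetical.

```shell
# Succeed if OPTION appears as a whole word on the kernel command line.
# Usage: has_boot_option OPTION [CMDLINE_FILE]
has_boot_option() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Report the state of the options discussed in this bug.
# Usage: report_boot_options [CMDLINE_FILE]
report_boot_options() {
    for opt in no_timer_check notsc report_lost_ticks noapic; do
        if has_boot_option "$opt" "$1"; then
            echo "$opt: active"
        else
            echo "$opt: not set"
        fi
    done
}

# On a live system:
#   report_boot_options
```

Attaching this output alongside dmesg and /proc/interrupts makes it clear which combination of options a given report refers to.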
Comment 59 Joachim Frieben 2006-05-08 03:16:16 EDT
Still not fixed in "rawhide" kernel 2.6.16-1.2196_FC6 (SMP).
Comment 60 Sergio Monteiro Basto 2006-05-08 21:36:43 EDT
Can you try the "rawhide" kernel with the boot options no_timer_check, notsc and
report_lost_ticks, and post the output of cat /proc/interrupts before and after
applying them?
Comment 61 Joachim Frieben 2006-05-13 07:56:04 EDT
Created attachment 128976 [details]
content of /proc/interrupts for 2.6.16-1.2202_FC6 w/o APIC
Comment 62 Joachim Frieben 2006-05-13 07:59:56 EDT
Created attachment 128977 [details]
content of /proc/interrupts for 2.6.16-1.2202_FC6 w/options "no_timer_check report_lost_ticks"
Comment 63 Sergio Monteiro Basto 2006-05-14 16:36:36 EDT
(In reply to comment #62)
> Created an attachment (id=128977) [edit]
> content of /proc/interrupts for 2.6.16-1.2202_FC6 w/options "no_timer_check
> report_lost_ticks"
> 

Looks fine to me. What do you say?
Comment 64 Joachim Frieben 2006-05-14 17:59:46 EDT
The clock keeps speeding ahead ..
Comment 65 Łukasz Trąbiński 2006-05-25 09:40:07 EDT
[root@node5 ~]# dmesg 
Losing some ticks... checking if CPU frequency changed.

[root@node5 ~]# uname -a
Linux node5.news.atman.pl 2.6.16-1.2122_FC5 #1 SMP Sun May 21 15:01:10 EDT 2006
x86_64 x86_64 x86_64 GNU/Linux

[root@node5 ~]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 39
model name      : AMD Opteron(tm) Processor 152
stepping        : 1
cpu MHz         : 2600.000
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
pni lahf_lm
bogomips        : 5234.66
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp


[root@node5 ~]# w
 15:43:09 up 1 day,  3:01,  1 user,  load average: 1.25, 1.23, 1.35
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/2    host-4.noc.atman Wed16    0.00s  0.22s  0.00s w
[root@node5 ~]# cat /proc/interrupts
           CPU0       
  0:   24347234    IO-APIC-edge  timer
  8:          0    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 14:         35    IO-APIC-edge  ide0
 16:    1774149   IO-APIC-level  libata, ehci_hcd:usb2
 17:          0   IO-APIC-level  libata
 18:   48280381   IO-APIC-level  eth0
 20:        278   IO-APIC-level  ohci_hcd:usb1
NMI:       4387 
LOC:   24348340 
ERR:          0
MIS:          0
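
As a rough cross-check, the IRQ 0 count above can be divided by the uptime
reported by "w" to estimate the timer interrupt rate. This is only a sketch:
the two commands were run a few seconds apart, the uptime has minute
resolution, and the HZ value this kernel was built with is not stated in this
report, so any comparison against HZ is an assumption.

```python
def uptime_seconds(days, hours, minutes):
    """Convert an uptime such as 'up 1 day,  3:01' to seconds."""
    return days * 86400 + hours * 3600 + minutes * 60


def tick_rate(timer_interrupts, seconds):
    """Average timer interrupts per second over the whole uptime."""
    return timer_interrupts / seconds


secs = uptime_seconds(1, 3, 1)       # "up 1 day,  3:01" from the w output
rate = tick_rate(24347234, secs)     # IRQ 0 (timer) count from /proc/interrupts
print(f"{secs} s uptime, about {rate:.1f} timer interrupts per second")
```

The result is close to 250 interrupts per second, which would be consistent
with a kernel built with HZ=250, if that is indeed the configured value.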
Comment 66 Joachim Frieben 2006-06-25 13:57:43 EDT
Still present in FC6T1 kernel "2.6.16-1.2289_FC6". Due to lack of activity
by Red Hat kernel maintainers, I have finally posted the bug (which has
been around for almost 5 years) upstream:

  http://bugzilla.kernel.org/show_bug.cgi?id=6748

To whom it may concern: this bug is about "APIC" problems with the "SMP"
enabled Fedora kernels. If your system is uni-processor or single-core,
or multi-processor -and- the "noapic" option does -not- make the issue go
away, then -please- look elsewhere. Thanks!
Comment 67 R P Herrold 2006-06-30 12:44:06 EDT
added self, as I have a reproducing unit dhcp-63.josh.lan to test potential fixes against.
Comment 68 Joachim Frieben 2006-07-04 07:03:21 EDT
Still present in kernel "2.6.17-1.2339.fc6".
Comment 69 Joachim Frieben 2006-07-15 08:17:44 EDT
Still broken in kernel "2.6.17-1.2396.fc6".
Comment 70 Joachim Frieben 2006-08-10 05:17:09 EDT
Clock still 30% ahead of nominal speed in "2.6.17-1.2532.fc6".
However, I have spotted some relevant entries in "dmesg" which
partially appear to be of recent origin (file attached):

  "...trying to set up timer (IRQ0) through the 8259A ..."
  "Time: tsc clocksource has been installed."
  "TSC appears to be running slowly. Marking it as unstable"
  "Time: pit clocksource has been installed."
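
The sequence of clock source switches can be pulled out of a saved "dmesg"
log. A minimal sketch, assuming the message wording quoted above; the helper
names are hypothetical.

```python
import re

# Matches lines such as "Time: pit clocksource has been installed."
INSTALLED = re.compile(r"Time: (\w+) clocksource has been installed")


def clocksource_history(dmesg_text):
    """Return the clock sources in the order the kernel installed them."""
    return INSTALLED.findall(dmesg_text)


def tsc_marked_unstable(dmesg_text):
    """True if the kernel reported that it no longer trusts the TSC."""
    return "Marking it as unstable" in dmesg_text


# The messages quoted in this comment, used as sample input.
SAMPLE = """\
...trying to set up timer (IRQ0) through the 8259A ...
Time: tsc clocksource has been installed.
TSC appears to be running slowly. Marking it as unstable
Time: pit clocksource has been installed.
"""

print(clocksource_history(SAMPLE), tsc_marked_unstable(SAMPLE))
```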
Comment 71 Joachim Frieben 2006-08-10 05:19:54 EDT
Created attachment 133918 [details]
"dmesg" output for APIC-enabled 2.6.17-1.2532.fc6 SMP kernel on PR440FX system
Comment 72 Clint Goudie 2006-09-07 11:11:49 EDT
I'm seeing this regularly in 2.6.17-1.2174_FC5 SMP x86_64

This is what I see in my dmesg

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x2b/0x54
spurious 8259A interrupt: IRQ15.

I have two nodes connected to my box via xdmcp all day, so there's always
network traffic going on. 

My clock starts to run faster and faster just like everyone else says here.

Also, when it happens, it seems that all my windows in X lose focus, and I have
to alt-tab to get my mouse to work again.

This is an almost brand new system with a dual-core AMD64. I'm happy to provide
more info if it would prove useful.
Comment 73 Sergio Monteiro Basto 2006-09-07 11:29:37 EDT
Please try booting the kernel with the parameters notsc and report_lost_ticks.
Comment 74 Joachim Frieben 2006-09-07 13:14:39 EDT
(In reply to comment #72)
As you may have learnt from my initial report and following, this bug is
about "APIC" functionality. So, if you reboot your system adding "noapic"
to the kernel options and the issue goes away then this is probably the
right place. Please check that first.
Comment 75 Joachim Frieben 2006-09-08 08:30:24 EDT
Created attachment 135843 [details]
"dmesg" output for APIC-enabled 2.6.17-1.2630.fc6 SMP kernel on PR440FX system w/options "notsc" and "report_lost_ticks"
Comment 76 Joachim Frieben 2006-09-08 08:37:28 EDT
(In reply to comment #73)
Option "notsc" is not enabled for current "Fedora" kernels according to
the "dmesg" log file:

  "notsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC."

Moreover, at the end of the log file the kernel reports:

   "TSC appears to be running slowly. Marking it as unstable
    Time: pit clocksource has been installed."

which seems to indicate that the "tsc" source has been superseded by
the "pit" clock source which thus might be responsible. Needless to
say that the already habitual 30% clock advance still applies.
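
The size of the clock advance can be put into numbers by noting the system
time and a trusted reference (a wristwatch or another machine) at two moments.
A minimal sketch; the function name and the sample readings are hypothetical.

```python
def drift_percent(ref_start, ref_end, sys_start, sys_end):
    """Percentage by which the system clock gains (+) or loses (-) time.

    All four arguments are timestamps in seconds: two readings of a trusted
    reference clock and the system clock readings taken at the same moments.
    """
    ref_elapsed = ref_end - ref_start
    sys_elapsed = sys_end - sys_start
    return 100.0 * (sys_elapsed - ref_elapsed) / ref_elapsed


# Hypothetical readings: 10 minutes pass on the reference clock while the
# system clock advances by 13 minutes.
print(drift_percent(0, 600, 0, 780))
```

A value near 30 would correspond to the speed-up described in this report.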
Comment 77 Clint Goudie 2006-09-08 15:32:26 EDT
Interestingly, *removing* noapic and acpi=off from my kernel boot options
appears to have resolved this issue for me with the 2.6.17-1.2174_FC5 kernel.
Comment 78 Joachim Frieben 2006-09-09 11:05:48 EDT
(In reply to comment #77)
By no means surprising as this reduces the probability of "IRQ" conflicts
due to limited resources. In my case, "ACPI" simply cannot be enabled at
all (mainboard manufactured in 1998) and the current "APIC" handling in
the kernel doesn't seem to like this at all! I think that's what this
bug report is all about.
Comment 79 Joachim Frieben 2006-11-04 08:27:19 EST
Created attachment 140357 [details]
"dmesg" output for APIC-enabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system

No change for kernel "2.6.18-1.2798.fc6". As for the last kernel
for which I had committed the "dmesg" log file, the current kernel
recognizes that the "tsc" time source is unreliable and switches
to "pit". Nevertheless, the speed-up is still of the order of 30%.
I have also noticed that the frame rate delivered by "glxgears" drops
from 360 fps to 130 fps [both at 1400x1050@24bpp] when "APIC" is
enabled. Any suggestions on how to proceed with this issue apart from
scrapping my good old "PR440FX"?
Comment 80 Joachim Frieben 2006-11-04 08:28:25 EST
Created attachment 140358 [details]
"dmesg" output for APIC-disabled 2.6.18-1.2798.fc6 SMP kernel on PR440FX system
Comment 81 Joachim Frieben 2006-12-16 12:56:43 EST
Kernel "2.6.18-1.2849.fc6" finally seems to be stable. Even after several
hours of operation with "APIC" enabled, no time slip has occurred, let alone
the usual 30% speed-up of the system clock: excellent!
However, I will also test under network load before drawing a final conclusion.
Btw: it might be a good idea to add the revision number to the changelog
as is the case for most other packages.
Comment 82 Joachim Frieben 2007-01-01 11:50:22 EST
(In reply to comment #81)
Things change significantly as soon as the "DRI" interface is used. After
launching "glxgears", the expected frame rate of about 360 is written
exactly 2x to the console. After that, it drops to about 250 and stays
at this level. There is a new message appended to "/var/log/dmesg":

  "TSC appears to be running slowly. Marking it as unstable
   Time: pit clocksource has been installed."

which was absent before. And of course, the clock goes crazy again from
that moment on ..

PS: The graphics card is a "PCI" based "Radeon AIW 7200".
Comment 83 Joachim Frieben 2007-01-03 14:13:10 EST
Created attachment 144728 [details]
"dmesg" output for APIC-enabled 2.6.19-1.2887.fc6 SMP kernel on PR440FX system

As in the case of "2.6.18-1.2849.fc6", "glxgears" triggers an
instability of the "tsc" time source for "2.6.19-1.2887.fc6".
Instead of the "pit" clock source being installed, "dmesg" now
contains a message:

  "TSC appears to be running slowly. Marking it as unstable"
  "Time: jiffies clocksource has been installed."
	 ^^^^^^^

The result, however, is the same. As soon as the new clock source
has been installed, the system clock loses the right pace and
advances faster. The interrupts have been remapped to the 16-18
range whereas for previous kernels the "APIC IRQ" range was
145-161.
Comment 84 Aleksandar Milivojevic 2007-01-09 11:15:22 EST
Regarding clock speeding up.

It is a common problem with all 2.6 kernels.  According to VMware's knowledge
base, the culprit is the increase of the HZ constant in 2.6 kernels.  2.4 and
earlier kernels had HZ set to 100.  2.6 kernels bumped it to 1000.  That means
2.6 kernels will request 10 times more timer interrupts from the hardware per
CPU than 2.4 and earlier kernels.

Other than clock problems, this also introduced performance issues,
especially on multi-CPU virtual machines.  As I said previously, VMware now
needs to emulate 10 times more virtual interrupts per virtual CPU per virtual
machine.  This quickly adds up, resulting in lost interrupts.  Basically what
happens is that the number of virtual timer interrupts gets into the several
thousand, or even tens of thousands, range (cumulative over all virtual
machines) and the system (hardware, host OS, VMware, virtual machines) is not
able to keep pace, losing some of them.  There's code in 2.6 kernels that's
supposed to make adjustments for lost interrupts.  However, it usually overdoes
it, resulting in the clock being too fast.  The code behind "clock=pit" seems
to make the smallest overadjustment (but it still overdoes it, making the
system clock go too fast).
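
To make the arithmetic concrete, here is a minimal sketch of how the virtual
timer interrupt load scales with HZ.  The function name and the host layout
(10 virtual machines with 2 virtual CPUs each) are hypothetical figures chosen
only for illustration.

```python
def timer_interrupts_per_second(hz, vcpus_per_vm, vms):
    """Virtual timer interrupts the host must deliver each second."""
    return hz * vcpus_per_vm * vms


# Hypothetical host: 10 virtual machines with 2 virtual CPUs each.
for hz in (100, 1000):
    load = timer_interrupts_per_second(hz, vcpus_per_vm=2, vms=10)
    print(f"HZ={hz}: {load} virtual timer interrupts per second")
```

With these figures the load rises from 2000 to 20000 interrupts per second,
which is the "tens of thousands" range mentioned above.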

As an example of the performance degradation introduced by the increase of HZ
from 100 to 1000, a few virtual machines running a 2.6 kernel on one of my ESX
servers are consuming an entire CPU when they are completely idle.  Now this is
very bad.

According to one post on the Nahant mailing list, there is a kernel patch to
turn off the timer when nothing is happening on the virtual machine.  The patch
is for the mainframe architecture, but basically tackles the same problem
experienced when running 2.6 kernels on i386/x86_64 under VMware.  Mainframe
people seem to have hit this problem many years ago when running hundreds of
virtual machines on the mainframe.  Here's the URL for the post:

https://www.redhat.com/archives/nahant-list/2007-January/msg00059.html

There are a couple of workarounds.

The first thing VMware suggests is to recompile the kernel with HZ set to 100.
Unfortunately, HZ is a define in the source.  So you must manually change it in
the source files and then recompile the kernel.  It would be nice if it were a
variable that could be set from the command line during boot.  I've looked a bit
into the source, and it might be non-trivial to change HZ into a command line
option (but I might be wrong).

If recompiling the kernel is not an option, the 2.6 kernel should be booted with
"clock=pit".  In userspace, ntpd should be disabled (the timer is way too
unstable for it to work at all anyhow), vmware-tools installed, and its "sync
time with host OS" option enabled.  This combination seems to be able to keep
the system clock more or less stable.  I'm using this on some of my virtual
machines.  It works OK; the clock still wanders around a bit, but it seems to
stay accurate to within a second or two compared to the wall clock.

I guess the best solution would be to make HZ a boot time command line option
(instead of having it hardcoded in the source code).  "Normal" users could leave
it at 1000 and get whatever questionable benefits there are from having it set
that high.  VMware folks could decrease it back to the old default value of 100.

Another interesting approach to solving these issues is implementing that
mainframe timer patch on other architectures (mainly i386 and x86_64), probably
also introducing a command line option to trigger it ("normal" users probably
don't want their timers to get turned off when the machine is idle).
Comment 85 Joachim Frieben 2007-01-09 14:01:25 EST
(In reply to comment #84)
However, on my system, adding "noapic" to the kernel boot options fully
settles the issue, so I am not really sure whether your reasoning applies
to my case. Moreover, according to comment #82, the switch from the
"tsc" to the "pit" time source [also "jiffies" according to comment #83]
is actually the moment when things do really go wrong.
Comment 86 Red Hat Bugzilla 2007-02-05 14:11:27 EST
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.
Comment 87 Sergio Monteiro Basto 2007-05-08 23:02:36 EDT
(In reply to comment #85)
> (In reply to comment #84)
> However, on my system, adding "noapic" to the kernel boot options fully
> settles the issue, so I am not really sure whether your reasoning applies
> to my case. 

Using noapic means using only one processor; you need the APIC for SMP to work,
so it is not a solution.

> Moreover, according to comment #82, the switch from the
> "tsc" to the "pit" time source [also "jiffies" according to comment #83]
> is actually the moment when things do really go wrong.

With kernel 2.6.21-rc5-git4 (I also tried 2.6.20-1.3036), for the first time
with a vanilla kernel (I think it is an effect of the vanilla kernel), my
computer detects two more clocksources, acpi_pm and tsc, and the processor works
in just one state, C1:
cat /proc/acpi/processor/CPU1/power
active state:            C1
max_cstate:              C8
bus master activity:     00000000
maximum allowed latency: 2000 usec
states:
   *C1:                  type[C1] promotion[--] demotion[--] latency[000]
usage[00000000] duration[00000000000000000000]

Since then my computer has worked 100% correctly.
So you may try kernel 2.6.21 and report your experience.
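
Which clock sources a running kernel has detected can also be read from sysfs.
A minimal sketch, assuming the kernel exposes the clocksource directory under
/sys/devices/system/clocksource/clocksource0 (older kernels may not have it);
the function name is hypothetical.

```python
from pathlib import Path

# sysfs directory of the clocksource framework; absent on older kernels.
SYSFS_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base=SYSFS_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # Directory or files missing, or not readable by this user.
        return None, []
    return current, available


current, available = read_clocksources()
print("current:", current, "available:", available)
```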
Comment 88 Joachim Frieben 2007-05-09 11:22:52 EDT
(In reply to comment #87)

> Using noapic means using only one processor; you need the APIC for SMP to work,
> so it is not a solution.

This is plain nonsense. I have been using my "PR440FX" board for years now,
and it definitely runs in "SMP" mode when "APIC" is disabled.
Comment 89 Sergio Monteiro Basto 2007-05-09 20:22:06 EDT
(In reply to comment #88)
> (In reply to comment #87)
> 
> > Using noapic means using only one processor; you need the APIC for SMP to work,
> > so it is not a solution.
> 
> This is plain nonsense. I have been using my "PR440FX" board for years now,
> and it definitely runs in "SMP" mode when "APIC" is disabled.

If you run cat /proc/interrupts, I think you will see that just one CPU is
working.
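
Whether more than one CPU is handling interrupts can be checked by summing the
per-CPU columns of /proc/interrupts. A minimal sketch; the function name and
the sample output are hypothetical and do not come from the system discussed
here.

```python
def per_cpu_totals(text):
    """Sum all interrupt counters of /proc/interrupts per CPU column."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()                 # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.partition(":")[2].split()
        for cpu, field in zip(cpus, fields):
            if not field.isdigit():         # reached controller/device names
                break
            totals[cpu] += int(field)
    return totals


# Hypothetical output: device IRQs arrive on CPU0 only, while the local
# timer (LOC) counts on both CPUs.
SAMPLE = """
           CPU0       CPU1
  0:     500000          0          XT-PIC  timer
 11:      30000          0          XT-PIC  eth0
LOC:     499000     498000
"""

print(per_cpu_totals(SAMPLE))
```

In output like this hypothetical sample, device interrupts land on one CPU
only, while a LOC count that keeps growing on the second CPU would suggest that
CPU is nevertheless running.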
Comment 90 Joachim Frieben 2007-06-12 12:19:32 EDT
After updating to "F7" with "kernel-2.6.21-1.3194.fc7" (2.6.21.2), I haven't 
observed any of the previous problems anymore.
Comment 91 Jim Koukoutsis 2008-11-27 05:31:43 EST
Does this bug also apply to Red Hat Enterprise Linux ES 4 with kernel 2.6.9-42.ELsmp, or does it apply only to Fedora Core?
