Bug 152630 - timer interrupt received twice on ATI chipset motherboard, clock runs at double speed
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: Brian Maly
QA Contact: Brian Brock
Depends On:
Blocks: RHEL3U8CanFix 186960 192915
Reported: 2005-03-30 16:16 EST by wingc
Modified: 2008-08-04 22:13 EDT (History)
Fixed In Version: RHSA-2006-0437
Doc Type: Bug Fix
Last Closed: 2006-07-20 09:21:47 EDT

Attachments
dmesg from 2.4.21-27.0.2.EL (failure case- clock runs at double speed) (12.22 KB, text/plain)
2005-03-30 16:19 EST, wingc
dmesg output from patched 2.4.21-27.0.2.EL (clock runs properly) (12.21 KB, text/plain)
2005-03-30 16:20 EST, wingc
[PATCH 1/4] [ACPI] enhance intr-src-override parsing to handle ES7000 (3.48 KB, patch)
2005-03-30 16:24 EST, wingc
[PATCH 2/4] [ACPI] handle SCI override to nth IOAPIC (895 bytes, patch)
2005-03-30 16:25 EST, wingc
[PATCH 3/4] [PATCH] i386 and x86_64 ACPI mpparse timer bug (1.18 KB, patch)
2005-03-30 16:27 EST, wingc
[PATCH 4/4] revert part of changeset #1 (arch/x86_64/kernel/acpi.c) (716 bytes, patch)
2005-03-30 16:29 EST, wingc
output of 'lspci' on my ATI Radeon Xpress 200 motherboard (1.40 KB, text/plain)
2005-03-30 16:31 EST, wingc
dmesg from 2.4.21-31.EL (RHEL3 U5 beta) (12.49 KB, text/plain)
2005-03-31 11:11 EST, wingc
patch to disable IRQ 0 (3.97 KB, patch)
2006-03-29 16:46 EST, Brian Maly

Description wingc 2005-03-30 16:16:41 EST
Description of problem:

Something is going wrong with interrupt routing on my ATI Radeon Xpress 200
based motherboard. The clock runs at double the expected rate, presumably
because every clock interrupt is being received twice.

I assume that this is an ACPI problem. It sounds like the bug reported in the
thread on linux-kernel with the subject:

"linux-2.6.7-bk2 runs faster than linux-2.6.7 ;)"

(original email from June 2004)

See:

http://marc.theaimsgroup.com/?w=2&r=1&s=linux-2.6.7-bk2+runs+faster&q=t

for more information. The patch that came out of this thread, however, does not
apply to the ACPI code in RHEL3, because RHEL3 is using an older code base.



I ended up cobbling together a patch from several related changes. My patch
seems to fix the problem, but I do not believe that it is fully correct.


Version-Release number of selected component (if applicable):

The problem exists in the current RHEL3 update kernel (2.4.21-27.0.2.EL)

How reproducible:

always

Steps to Reproduce:
1. use the current RHEL3 kernel
  
Actual results:

The clock runs at double the expected rate (twice as many interrupts are
received per second as expected). NTP cannot synchronize.

Expected results:

The clock runs at the proper rate.


Additional info:

I created a patch based on the following ACPI changesets. These changesets all
modify the file arch/x86_64/kernel/mpparse.c in such a way that the RHEL3 code
moves closer to the current linux 2.4 code.

My patch is based on the following 3 changesets:


[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA

[ACPI] handle SCI override to nth IOAPIC
http://linux.bkbits.net:8080/linux-2.4/cset@40d27b61Ia-EhvtZw9wtHiKwJl6krQ

[PATCH] i386 and x86_64 ACPI mpparse timer bug
http://linux.bkbits.net:8080/linux-2.4/cset@40d6d35ddJnSnjsuIq54YJLwar1vhA


However, to get things to work properly, I had to revert the changes that the
first changeset made to the file 'arch/x86_64/kernel/acpi.c'. Thus, I don't
believe that my composite patch (which is essentially the above three changesets
applied only to arch/x86_64/kernel/mpparse.c) is correct.

It does fix the problem, though, and makes my clock run at normal speed.



I don't know the correct way to proceed. If you are planning to import newer
ACPI code into RHEL3, we could just test that. Otherwise, if you only want a
minimal patch to fix the problem, you'll need someone who understands ACPI
interrupt routing and can debug it.

I will attach the relevant dmesg information and the patch I created, which
fixes the problem.


Thanks,

Chris Wing
wingc@engin.umich.edu
Comment 1 wingc 2005-03-30 16:19:30 EST
Created attachment 112479 [details]
dmesg from 2.4.21-27.0.2.EL (failure case- clock runs at double speed)

This is the boot log from an unpatched 2.4.21-27.0.2.EL
(kernel-kernel-2.4.21-27.EL.x86_64.rpm)

The clock runs at double normal speed.
Comment 2 wingc 2005-03-30 16:20:43 EST
Created attachment 112480 [details]
dmesg output from patched 2.4.21-27.0.2.EL (clock runs properly)

This is the boot log from 2.4.21-27.0.2.EL with my patch.

The clock runs at the correct rate, and NTP can synchronize properly.
Comment 3 wingc 2005-03-30 16:24:02 EST
Created attachment 112481 [details]
[PATCH 1/4] [ACPI] enhance intr-src-override parsing to handle ES7000

Apply this patch first to RHEL3 2.4.21-27.0.2.EL kernel.

This is based on the changeset:

[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA
Comment 4 wingc 2005-03-30 16:25:45 EST
Created attachment 112483 [details]
[PATCH 2/4] [ACPI] handle SCI override to nth IOAPIC

Apply this patch second to 2.4.21-27.0.2.EL after applying patch #1.

This is based on the linux-2.4 changeset:

[ACPI] handle SCI override to nth IOAPIC
http://linux.bkbits.net:8080/linux-2.4/cset@40d27b61Ia-EhvtZw9wtHiKwJl6krQ
Comment 5 wingc 2005-03-30 16:27:22 EST
Created attachment 112485 [details]
[PATCH 3/4] [PATCH] i386 and x86_64 ACPI mpparse timer bug

Apply this patch third to 2.4.21-27.0.2.EL, after applying patches #1 and #2.

This patch is based on the linux-2.4 changeset:

[PATCH] i386 and x86_64 ACPI mpparse timer bug
http://linux.bkbits.net:8080/linux-2.4/cset@40d6d35ddJnSnjsuIq54YJLwar1vhA
Comment 6 wingc 2005-03-30 16:29:25 EST
Created attachment 112486 [details]
[PATCH 4/4] revert part of changeset #1 (arch/x86_64/kernel/acpi.c)

Finally, apply this patch to 2.4.21-27.0.2.EL after applying #1, #2, #3.

This reverts the change that the first changeset:

[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA

made to 'arch/x86_64/kernel/acpi.c'.
If this change is not reverted, the clock still runs at the incorrect rate
(double normal speed)
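
For anyone reproducing this, the four attachments have to go on in the order given in Comments 3 to 6. The following is only a sketch of that sequence: the function name `apply_series`, the `PATCH_CMD` override and the `-p1` strip level are assumptions (the thread does not say how the diffs were generated), and the file names in the usage comment are hypothetical.

```shell
# apply_series SRC_DIR PATCH_FILE...
# Applies each patch file to SRC_DIR in the order given and stops at the
# first failure, so a later patch is never applied on top of a broken tree.
# PATCH_CMD may be overridden; it defaults to the standard "patch" utility.
# The -p1 strip level is an assumption about how the diffs were made.
apply_series() {
    src_dir=$1
    shift
    for patch_file in "$@"; do
        echo "applying: $patch_file"
        ${PATCH_CMD:-patch} -d "$src_dir" -p1 < "$patch_file" || {
            echo "failed: $patch_file" >&2
            return 1
        }
    done
}

# Usage (hypothetical file names for attachments 112481, 112483, 112485
# and 112486, saved next to an unpacked 2.4.21-27.0.2.EL source tree):
#   apply_series ./linux-2.4.21 1-es7000.patch 2-sci-override.patch \
#       3-mpparse-timer.patch 4-revert-acpi-c.patch
```

Stopping at the first failure matters here because patch 4 only makes sense once patch 1 has been applied.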
Comment 7 wingc 2005-03-30 16:31:00 EST
Created attachment 112487 [details]
output of 'lspci' on my ATI Radeon Xpress 200 motherboard

PCI devices on my motherboard.
Comment 8 wingc 2005-03-31 11:08:38 EST
The problem still exists in the most recent RHEL3 U5 beta (kernel 2.4.21-31.EL),
not that I'd expect it to be fixed based on the changes in U5 so far.
Comment 9 wingc 2005-03-31 11:11:12 EST
Created attachment 112515 [details]
dmesg from 2.4.21-31.EL (RHEL3 U5 beta)

Boot log from the U5 beta kernel (2.4.21-31.EL).
Nothing significant has changed; the 'Setting APIC routing to flat' message has
appeared, but this shouldn't make a difference on single-CPU, PC-style machines
anyway.

The clock still runs at double speed.
Comment 10 wingc 2005-03-31 11:52:05 EST
More information. It looks like my patch changes the routing of the timer
interrupt, although according to /proc/interrupts the number of timer interrupts
received per second is the same with or without the patch.

Something is different with the local APIC timer interrupt with the patches.


On a bad kernel (2.4.21-27.0.2.EL); clock runs at double speed:

% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      15247    IO-APIC-edge  timer
LOC:       7596

  0:      16251    IO-APIC-edge  timer
LOC:       8097

(100 timer ints/second, but only 50 local APIC timer ints/second?)


On a kernel with my patches (2.4.21-27.0.2.EL):

% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      55115          XT-PIC  timer
LOC:      55067

  0:      56116          XT-PIC  timer
LOC:      56068

(100 timer ints/second, and 100 local APIC timer ints/second)


The U5 beta kernel (2.4.21-31.EL) acts the same way as unpatched
2.4.21-27.0.2.EL (timer interrupt shows up as IO-APIC-edge, 100 timer ints/sec,
50 LOC ints/sec).


My patch causes the timer interrupt to be handled via 'XT-PIC' and the clock
works properly. I don't understand what's going on well enough to know what this
means...
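
The before/after comparison above can be scripted rather than read off by eye. This is a minimal sketch, not part of the original report: the helper names `irq_count` and `irq_delta` are made up for illustration, and it assumes the /proc/interrupts layout shown in this comment (a label ending in a colon, followed by one counter column per CPU).

```shell
# irq_count LABEL
# Reads /proc/interrupts-style text on stdin and prints the counter for the
# row named LABEL (for example "0" for IRQ 0, or "LOC"). If there are
# several per-CPU columns, they are summed.
irq_count() {
    awk -v label="$1:" '
        $1 == label {
            total = 0
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print total
            exit
        }'
}

# irq_delta LABEL BEFORE_TEXT AFTER_TEXT
# Prints how much the counter for LABEL grew between two snapshots.
irq_delta() {
    count_before=$(printf '%s\n' "$2" | irq_count "$1")
    count_after=$(printf '%s\n' "$3" | irq_count "$1")
    echo $((count_after - count_before))
}

# Snapshots copied from the unpatched 2.4.21-27.0.2.EL run above.
before='  0:      15247    IO-APIC-edge  timer
LOC:       7596'
after='  0:      16251    IO-APIC-edge  timer
LOC:       8097'

irq_delta 0 "$before" "$after"     # prints 1004
irq_delta LOC "$before" "$after"   # prints 501
```

On a live system the two snapshots would come from `snap1=$(cat /proc/interrupts); sleep 10; snap2=$(cat /proc/interrupts)`. The result is a raw count for the interval, not a rate.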
Comment 11 wingc 2005-03-31 11:58:30 EST
Also fails with the SMP kernel (2.4.21-27.0.2.ELsmp).

The clock runs twice as fast.

Behavior of /proc/interrupts on 2.4.21-27.0.2.ELsmp:


% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      47872    IO-APIC-edge  timer
LOC:      23913

  0:      48874    IO-APIC-edge  timer
LOC:      24413

(100 timer ints/sec, 50 local APIC timer ints/sec)
Comment 12 wingc 2005-04-01 15:38:18 EST
I just realized that I was an idiot when it comes to interpreting the results of
cat /proc/interrupts.

Since the clock is running at double the normal rate, 'sleep 10' completes in 5
seconds, not 10.


So, the correct analysis should have been:

On a broken kernel:

200 timer ints/second, 100 local APIC timer ints/second

On a working kernel:

100 timer ints/second, 100 local APIC timer ints/second.



So, to confirm, the machine receives twice as many timer interrupts per second
as it should, and this is why the clock runs twice as fast.
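
The corrected arithmetic can be written down as a small helper. This is a sketch under one assumption taken from this comment: the real duration of 'sleep 10' (about 5 seconds on the broken kernel) is measured against something other than the system clock, such as a stopwatch. The function name `rate_per_sec` is made up for illustration; the counter values are the ones posted in Comment 10.

```shell
# rate_per_sec COUNT_BEFORE COUNT_AFTER REAL_SECONDS
# Prints interrupts per real second, truncated to a whole number.
# REAL_SECONDS must come from a reference that does not depend on the
# system clock, because that clock is the thing under test.
rate_per_sec() {
    echo $(( ($2 - $1) / $3 ))
}

# Unpatched kernel: "sleep 10" really lasted about 5 seconds.
rate_per_sec 15247 16251 5    # IRQ 0: prints 200
rate_per_sec 7596 8097 5      # LOC:   prints 100

# Patched kernel: "sleep 10" lasted the full 10 seconds.
rate_per_sec 55115 56116 10   # IRQ 0: prints 100
rate_per_sec 55067 56068 10   # LOC:   prints 100
```

Only the IRQ 0 rate doubles on the unpatched kernel, which matches the conclusion that the timer interrupt is delivered twice.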
Comment 13 wingc 2005-04-01 15:42:18 EST
Confirmed that the bug still exists in RHEL4 (in the initial release kernel
2.6.9-5.EL):

$ cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:    2728975    IO-APIC-edge  timer
LOC:    1364203

  0:    2738985    IO-APIC-edge  timer
LOC:    1369207


(The 'sleep 10' completes in 5 seconds, so there are 2000 timer ints/sec and
1000 local APIC ints/sec.)
Comment 14 John Haxby 2005-07-15 14:42:55 EDT
Bug 163347 describes a similar problem.
Comment 23 Brian Maly 2006-03-15 15:53:29 EST
Could someone with an affected system test this kernel:

http://people.redhat.com/bmaly/linux-2.4.21-ATIfixes.tar.gz

This kernel already has a patch applied, which is a backport of a 2.6 kernel
patch. The 2.6 patch resolves this issue on all AMD64 systems with ATI chipsets
and will hopefully have positive results on a 2.4 kernel. "disable_timer_pin_1"
may need to be passed as a boot argument on some systems.
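
When testing, it helps to confirm that the argument actually reached the running kernel. The following is a minimal sketch: the helper name `has_boot_arg` is made up for illustration, and it assumes the kernel command line is readable from /proc/cmdline. On GRUB-based installs the argument is normally appended to the "kernel" line of the boot entry in grub.conf.

```shell
# has_boot_arg ARG [CMDLINE_FILE]
# Succeeds if ARG appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the optional second argument
# exists only so the check can be tried against a sample file.
has_boot_arg() {
    cmdline_file=${2:-/proc/cmdline}
    # Put each word on its own line, then require an exact line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qx -- "$1"
}

if has_boot_arg disable_timer_pin_1; then
    echo "disable_timer_pin_1 is on the kernel command line"
else
    echo "disable_timer_pin_1 is not on the kernel command line"
fi
```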
Comment 31 Brian Maly 2006-03-29 16:36:44 EST
Is this bug affecting RHEL4 as well?

Bug 173236 is the exact same issue (affecting ATI chipsets) but for RHEL4.
Does the ServerWorks chipset also behave badly on RHEL4? If so, I can add a
check for ServerWorks and re-post the RHEL4 patch as well.
Comment 32 Brian Maly 2006-03-29 16:46:56 EST
Created attachment 127024 [details]
patch to disable IRQ 0
Comment 38 Ernie Petrides 2006-04-06 23:18:21 EDT
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.6.EL).
Comment 39 Bob Johnson 2006-04-11 11:57:54 EDT
This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 3.8 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 3.8 release.
Comment 41 Joshua Giles 2006-05-30 12:20:58 EDT
A kernel has been released that contains a patch for this problem.  Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at rhn.redhat.com and post your results to this bugzilla.
Comment 42 wingc 2006-05-30 12:30:41 EDT
Sorry, I no longer have the hardware in question for which this bug was
originally reported.  I am unable to test.
Comment 44 Ernie Petrides 2006-05-30 16:24:42 EDT
Reverting to ON_QA.
Comment 46 Red Hat Bugzilla 2006-07-20 09:21:47 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html
