Bug 152630

Summary: timer interrupt received twice on ATI chipset motherboard, clock runs at double speed
Product: Red Hat Enterprise Linux 3 Reporter: wingc
Component: kernel Assignee: Brian Maly <bmaly>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: bmaly, jbaron, petrides, riel, starlight, tao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0437 Doc Type: Bug Fix
Last Closed: 2006-07-20 13:21:47 UTC Type: ---
Bug Blocks: 181405, 186960, 192915    
Attachments:
Description Flags
dmesg from 2.4.21-27.0.2.EL (failure case- clock runs at double speed)
none
dmesg output from patched 2.4.21-27.0.2.EL (clock runs properly)
none
[PATCH 1/4] [ACPI] enhance intr-src-override parsing to handle ES7000
none
[PATCH 2/4] [ACPI] handle SCI override to nth IOAPIC
none
[PATCH 3/4] [PATCH] i386 and x86_64 ACPI mpparse timer bug
none
[PATCH 4/4] revert part of changeset #1 (arch/x86_64/kernel/acpi.c)
none
output of 'lspci' on my ATI Radeon Xpress 200 motherboard
none
dmesg from 2.4.21-31.EL (RHEL3 U5 beta)
none
patch to disable IRQ 0
none

Description wingc 2005-03-30 21:16:41 UTC
Description of problem:

Something is going wrong with interrupt routing on my ATI Radeon Xpress 200
based motherboard. The clock runs at double the expected rate, presumably
because every clock interrupt is being received twice.

I assume that this is an ACPI problem. It sounds like the bug reported in the
thread on linux-kernel with the subject:

"linux-2.6.7-bk2 runs faster than linux-2.6.7 ;)"

(original email from June 2004)

See:

http://marc.theaimsgroup.com/?w=2&r=1&s=linux-2.6.7-bk2+runs+faster&q=t

for more information. The patch that came out of this thread, however, does not
apply to the ACPI code in RHEL3, because RHEL3 is using an older code base.



I ended up cobbling together a patch from several related changes. My patch
seems to fix the problem; however, I do not believe that it is fully correct.


Version-Release number of selected component (if applicable):

The problem exists in the current RHEL3 update kernel (2.4.21-27.0.2.EL)

How reproducible:

always

Steps to Reproduce:
1. use the current RHEL3 kernel
  
Actual results:

The clock runs at double the expected rate (twice as many interrupts are
received per second as expected). NTP cannot synchronize.

Expected results:

The clock runs at the proper rate.


Additional info:

I created a patch based on the following ACPI changesets. These changesets all
modify the file arch/x86_64/kernel/mpparse.c in such a way that the RHEL3 code
moves closer to the current linux 2.4 code.

My patch is based on the following 3 changesets:


[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA

[ACPI] handle SCI override to nth IOAPIC
http://linux.bkbits.net:8080/linux-2.4/cset@40d27b61Ia-EhvtZw9wtHiKwJl6krQ

[PATCH] i386 and x86_64 ACPI mpparse timer bug
http://linux.bkbits.net:8080/linux-2.4/cset@40d6d35ddJnSnjsuIq54YJLwar1vhA


However, to get things to work properly, I had to revert the changes that the
first changeset made to the file 'arch/x86_64/kernel/acpi.c'. Thus, I don't
believe that my composite patch (which is essentially the above three changesets
applied only to arch/x86_64/kernel/mpparse.c) is correct.

It does fix the problem, though, and makes my clock run at normal speed.



I don't know what the correct way to proceed is. If you are planning on
importing newer ACPI code into RHEL3 we could just test that. Otherwise if you
only want a minimal patch to fix the problem you'll need to get someone who
understands what the ACPI interrupt routing does and can debug it.

I will attach the relevant dmesg information and the patch I created which fixes
the problem.


Thanks,

Chris Wing
wingc.edu

Comment 1 wingc 2005-03-30 21:19:30 UTC
Created attachment 112479 [details]
dmesg from 2.4.21-27.0.2.EL (failure case- clock runs at double speed)

This is the boot log from an unpatched 2.4.21-27.0.2.EL
(kernel-kernel-2.4.21-27.EL.x86_64.rpm)

The clock runs at double normal speed.

Comment 2 wingc 2005-03-30 21:20:43 UTC
Created attachment 112480 [details]
dmesg output from patched 2.4.21-27.0.2.EL (clock runs properly)

This is the boot log from 2.4.21-27.0.2.EL with my patch.

The clock runs at the correct rate, and NTP can synchronize properly.

Comment 3 wingc 2005-03-30 21:24:02 UTC
Created attachment 112481 [details]
[PATCH 1/4] [ACPI] enhance intr-src-override parsing to handle ES7000

Apply this patch first to RHEL3 2.4.21-27.0.2.EL kernel.

This is based on the changeset:

[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA

Comment 4 wingc 2005-03-30 21:25:45 UTC
Created attachment 112483 [details]
[PATCH 2/4] [ACPI] handle SCI override to nth IOAPIC

Apply this patch second to 2.4.21-27.0.2.EL after applying patch #1.

This is based on the linux-2.4 changeset:

[ACPI] handle SCI override to nth IOAPIC
http://linux.bkbits.net:8080/linux-2.4/cset@40d27b61Ia-EhvtZw9wtHiKwJl6krQ

Comment 5 wingc 2005-03-30 21:27:22 UTC
Created attachment 112485 [details]
[PATCH 3/4] [PATCH] i386 and x86_64 ACPI mpparse timer bug

Apply this patch third to 2.4.21-27.0.2.EL, after applying patches #1 and #2.

This patch is based on the linux-2.4 changeset:

[PATCH] i386 and x86_64 ACPI mpparse timer bug
http://linux.bkbits.net:8080/linux-2.4/cset@40d6d35ddJnSnjsuIq54YJLwar1vhA

Comment 6 wingc 2005-03-30 21:29:25 UTC
Created attachment 112486 [details]
[PATCH 4/4] revert part of changeset #1 (arch/x86_64/kernel/acpi.c)

Finally, apply this patch to 2.4.21-27.0.2.EL after applying #1, #2, #3.

This reverts the change that the first changeset:

[ACPI] enhance intr-src-override parsing to handle ES7000
http://linux.bkbits.net:8080/linux-2.4/cset@4085c7237X3p0GB-qUMTF6YhNnxSTA

made to 'arch/x86_64/kernel/acpi.c'.
If this change is not reverted, the clock still runs at the incorrect rate
(double normal speed)

Comment 7 wingc 2005-03-30 21:31:00 UTC
Created attachment 112487 [details]
output of 'lspci' on my ATI Radeon Xpress 200 motherboard

PCI devices on my motherboard.

Comment 8 wingc 2005-03-31 16:08:38 UTC
problem still exists in most recent RHEL3 U5 beta (kernel 2.4.21-31.EL); not
that I'd expect it to be fixed based on the changes in U5 so far.

Comment 9 wingc 2005-03-31 16:11:12 UTC
Created attachment 112515 [details]
dmesg from 2.4.21-31.EL (RHEL3 U5 beta)

boot log from U5 beta kernel (2.4.21-31.EL).
Nothing significant has changed; the 'Setting APIC routing to flat' message has
appeared but this shouldn't make a difference on single CPU, PC style machines
anyway.

The clock still runs at double speed.

Comment 10 wingc 2005-03-31 16:52:05 UTC
More information. It looks like my patch changes the routing of the timer
interrupt, although according to /proc/interrupts the number of timer interrupts
received per second is the same with or without the patch.

Something is different with the local APIC timer interrupt with the patches.


On a bad kernel (2.4.21-27.0.2.EL); clock runs at double speed:

% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      15247    IO-APIC-edge  timer
LOC:       7596

  0:      16251    IO-APIC-edge  timer
LOC:       8097

(100 timer ints/second, but only 50 local APIC timer ints/second?)


On a kernel with my patches (2.4.21-27.0.2.EL):

% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      55115          XT-PIC  timer
LOC:      55067

  0:      56116          XT-PIC  timer
LOC:      56068

(100 timer ints/second, and 100 local APIC timer ints/second)


The U5 beta kernel (2.4.21-31.EL) acts the same way as unpatched
2.4.21-27.0.2.EL. (timer interrupt shows up as IO-APIC-edge, 100 timer ints/sec,
50 LOC ints/sec)


My patch causes the timer interrupt to be handled via 'XT-PIC' and the clock
works properly. I don't understand what's going on well enough to know what this
means...

Comment 11 wingc 2005-03-31 16:58:30 UTC
Also fails with the SMP kernel. (2.4.21-27.0.2.ELsmp)

The clock runs twice as fast.

Behavior of /proc/interrupts on 2.4.21-27.0.2.ELsmp:


% cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:      47872    IO-APIC-edge  timer
LOC:      23913

  0:      48874    IO-APIC-edge  timer
LOC:      24413

(100 timer ints/sec, 50 local APIC timer ints/sec)

Comment 12 wingc 2005-04-01 20:38:18 UTC
I just realized that I was an idiot when it comes to interpreting the results of
cat /proc/interrupts.

Since the clock is running at double the normal rate, 'sleep 10' completes in 5
seconds, not 10.


So, the correct analysis should have been:

On a broken kernel:

200 timer ints/second, 100 local APIC timer ints/second

On a working kernel:

100 timer ints/second, 100 local APIC timer ints/second.



So, to confirm, the machine receives twice as many timer interrupts per second
as it should, and this is why the clock runs twice as fast.
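
The corrected arithmetic can be reproduced with a short Python sketch. It is illustrative only: the helper names parse_counts and rates are hypothetical, the parser assumes the single-column (one CPU) /proc/interrupts layout shown in Comment 10, and the 5-second figure is the inferred real duration of 'sleep 10' on the broken kernel, not a measured value.

```python
import re

def parse_counts(snapshot):
    """Map each interrupt label to its count (first CPU column only)."""
    counts = {}
    for line in snapshot.splitlines():
        m = re.match(r"\s*(\w+):\s+(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

def rates(before, after, elapsed_seconds):
    """Interrupts per second for every label present in both snapshots."""
    b, a = parse_counts(before), parse_counts(after)
    return {k: (a[k] - b[k]) / elapsed_seconds for k in a if k in b}

# Samples from the unpatched 2.4.21-27.0.2.EL kernel in Comment 10.
BEFORE = """  0:      15247    IO-APIC-edge  timer
LOC:       7596"""
AFTER = """  0:      16251    IO-APIC-edge  timer
LOC:       8097"""

# Dividing by the nominal 10 s repeats the original misreading.
nominal = rates(BEFORE, AFTER, 10.0)
# Dividing by the inferred 5 s of real time gives the corrected figures.
actual = rates(BEFORE, AFTER, 5.0)

print(nominal)
print(actual)
```

With the 5-second divisor the sketch gives roughly 200 timer interrupts and 100 local APIC timer interrupts per second, matching the corrected analysis above.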

Comment 13 wingc 2005-04-01 20:42:18 UTC
Confirmed that the bug still exists in RHEL4. (in the initial release kernel
2.6.9-5.EL):

$ cat /proc/interrupts; sleep 10; cat /proc/interrupts
  0:    2728975    IO-APIC-edge  timer
LOC:    1364203

  0:    2738985    IO-APIC-edge  timer
LOC:    1369207


(the 'sleep 10' completes in 5 seconds, so there are 2000 timer ints/sec, and
1000 local APIC ints/second.)
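
Because 'sleep' is timed by the same clock that is running fast, a check that does not depend on elapsed time may be easier to trust. The sketch below compares the growth of the IRQ 0 counter with the growth of the LOC counter between two samples. It assumes the local APIC timer itself ticks at the correct rate, as the figures in this report suggest, and the function name irq0_to_loc_ratio is hypothetical.

```python
def irq0_to_loc_ratio(irq0_before, irq0_after, loc_before, loc_after):
    """Ratio of timer (IRQ 0) interrupts to local APIC timer interrupts
    between two samples. No wall-clock measurement is involved."""
    loc_delta = loc_after - loc_before
    if loc_delta <= 0:
        raise ValueError("LOC counter did not advance between samples")
    return (irq0_after - irq0_before) / loc_delta

# RHEL4 2.6.9-5.EL samples from this comment.
rhel4 = irq0_to_loc_ratio(2728975, 2738985, 1364203, 1369207)
# Patched 2.4.21-27.0.2.EL samples from Comment 10.
patched = irq0_to_loc_ratio(55115, 56116, 55067, 56068)

print(round(rhel4, 2), round(patched, 2))  # prints "2.0 1.0"
```

A ratio near 2 is consistent with every timer interrupt being delivered twice, while a ratio near 1 matches the behaviour of the patched kernel.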

Comment 14 John Haxby 2005-07-15 18:42:55 UTC
Bug 163347 describes a similar problem.

Comment 23 Brian Maly 2006-03-15 20:53:29 UTC
could someone with an affected system possibly test this kernel:

http://people.redhat.com/bmaly/linux-2.4.21-ATIfixes.tar.gz 

This kernel has a patch already applied, which is a backport of a 2.6 kernel
patch. The 2.6 patch resolves this issue on all AMD64 systems with ATI chipsets
and hopefully will have positive results on a 2.4 kernel. "disable_timer_pin_1"
may need to be passed in as a boot arg on some systems.
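
When reporting test results it may help to confirm whether the boot argument was actually in effect. A minimal Python sketch follows; the helper name has_boot_param and the sample command line are hypothetical, and on a running system the real command line can be read from /proc/cmdline.

```python
def has_boot_param(cmdline, name):
    """True if `name` appears as a bare flag or as name=value."""
    return any(
        tok == name or tok.startswith(name + "=")
        for tok in cmdline.split()
    )

# Hypothetical kernel command line, for illustration only.
example = "ro root=LABEL=/ disable_timer_pin_1"

print(has_boot_param(example, "disable_timer_pin_1"))  # prints "True"
```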

Comment 31 Brian Maly 2006-03-29 21:36:44 UTC
is this bug affecting RHEL4 as well? 

bug 173236 is the exact same issue (affecting ATI chipsets) but for RHEL4.
Does the ServerWorks chipset also behave badly on RHEL4? If so I can add a check
for ServerWorks and re-post the RHEL4 patch as well.

Comment 32 Brian Maly 2006-03-29 21:46:56 UTC
Created attachment 127024 [details]
patch to disable IRQ 0

Comment 38 Ernie Petrides 2006-04-07 03:18:21 UTC
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.6.EL).


Comment 39 Bob Johnson 2006-04-11 15:57:54 UTC
This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 3.8 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 3.8 release.

Comment 41 Joshua Giles 2006-05-30 16:20:58 UTC
A kernel has been released that contains a patch for this problem.  Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at rhn.redhat.com and post your results to this bugzilla.

Comment 42 wingc 2006-05-30 16:30:41 UTC
Sorry, I no longer have the hardware in question for which this bug was
originally reported.  I am unable to test.

Comment 44 Ernie Petrides 2006-05-30 20:24:42 UTC
Reverting to ON_QA.

Comment 46 Red Hat Bugzilla 2006-07-20 13:21:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html