Bug 203423 - ACPI: IRQ routing regression causes dma_timer_expiry hang on 2.6.9-42
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: i686 Linux
Priority: medium  Severity: high
Assigned To: Konrad Rzeszutek
QA Contact: Brian Brock
Keywords: Regression
Reported: 2006-08-21 16:18 EDT by Brian Smith
Modified: 2007-11-16 20:14 EST (History)
Last Closed: 2006-10-04 12:38:10 EDT
Attachments:
output from requested commands (25.01 KB, text/plain), 2006-08-24 12:18 EDT, Jim Waldram

Description Brian Smith 2006-08-21 16:18:57 EDT
Description of problem:

After updating to Update 4 and kernel 2.6.9-42 from 2.6.9-34.0.2.EL, the 
system prints out:
 
hda: dma_timer_expiry: dma status == 0x64
hda: DMA interrupt recovery
hda: lost interrupt

The system then gets very slow, very quickly, culminating in a non-pingable state.

The system works fine when running the old kernel.

This system has an nForce3 chipset and an Athlon 64, but runs the 32-bit
version of RHEL 4.

Version-Release number of selected component (if applicable):

2.6.9-42

How reproducible:

Always.
Comment 1 Jason Baron 2006-08-22 14:06:53 EDT
Hi Brian,

I'd really like to get to the bottom of what is causing this issue. Would you
mind if we binary search the kernels between U3 and U4 to determine which patch
is causing this issue? I.e., can you reboot this 7-8 times?

thanks.
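
The binary search proposed here can be sketched roughly as follows. This is only an illustration, not the procedure Red Hat actually used: the build names and the boots_ok callback are hypothetical stand-ins for the intermediate kernel builds and for a manual "boot it and see whether it hangs" test.

```python
from typing import Callable, Sequence, Tuple


def first_bad(builds: Sequence[str],
              boots_ok: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first failing build and the number of test boots used.

    Assumes builds[0] is known good, builds[-1] is known bad, and every
    build after the first bad one also fails.
    """
    lo, hi = 0, len(builds) - 1  # lo: last known good, hi: first known bad
    boots = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        boots += 1  # one reboot per tested build
        if boots_ok(builds[mid]):
            lo = mid
        else:
            hi = mid
    return builds[hi], boots


# Hypothetical series of 200 intermediate builds; build 137 introduces the bug.
builds = [f"build-{i}" for i in range(200)]
culprit, boots = first_bad(builds, lambda b: int(b.split("-")[1]) < 137)
print(culprit, boots)
```

With a made-up series of 200 builds the search needs seven test boots, which is in line with the 7-8 reboots asked for above.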
Comment 2 Jason Baron 2006-08-22 14:10:46 EDT
Also, which i686 kernel are you using, i.e. UP, SMP or hugemem?
Comment 3 Jason Baron 2006-08-22 15:07:43 EDT
Also, could we get the output of 'cat /proc/interrupts' on the two kernels, the
one that works and the U4 kernel that fails? Thanks.
Comment 4 Alan Cox 2006-08-23 06:16:19 EDT
From the IDE side it would be useful to know

IDE controller in question
Devices attached to it
lspci -vxxx [as root]

dmesg from boot if possible (old or new kernel)

My initial feeling however is that this is an interrupt routing problem
triggered somehow by U4. The fact your short trace shows the timer expiry and
recovery working suggests this strongly.
Comment 5 Jim Waldram 2006-08-24 12:18:16 EDT
Created attachment 134828 [details]
output from requested commands

This is a second reporter for this bug.
Comment 6 Jim Waldram 2006-08-24 12:20:34 EDT
I have a similar problem with an Abit VP6 motherboard, single processor, booting
off the third IDE controller. During boot: 
hde: interrupt lost
is issued several times before the errors: 
hde: dma_timer_expiry: dma status == 0x64
hde: DMA interrupt recovery
hde: lost interrupt
This is a dual-processor-capable motherboard.  There are other systems with
the same motherboard, with 2 processors, running SMP on the -42 and -42.0.2
kernels.  Booting the 2-processor system uniprocessor works fine.
Attached is the info requested of the original poster.
Comment 7 Brian Smith 2006-08-24 12:43:05 EDT
Thank you Jim, I will try to get the info as well.  The machine goes tharn
pretty fast, so I may have to add the commands to rc.local and send their
output to a temp file.

This happens to be my personal desktop, and I have really needed it the past
couple days, so I've been running the old kernel.

I will say it fails on -42.0.2 as well, but complaining about USB, not DMA.
Comment 8 Alan Cox 2006-08-24 12:50:48 EDT
Case #2 appears to be an IRQ routing bug when in uniprocessor mode.  Does U3
work and U4 fail?  If both fail then please file a separate bug as it's probably
unrelated in cause (and add me to the cc line or email me the bug id).

Also try acpi=off in this case and see what happens then. That will help pin it down.
Comment 9 Jim Waldram 2006-08-24 13:14:47 EDT
Case #2 on Via VP6 motherboard.  U3 (2.6.9-34.0.2) works fine.  U4 fails.
Booting with acpi=off of 2.6.9-42.0.2 works!
kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off
Comment 10 Len Brown 2006-08-24 13:33:11 EDT
> Booting with acpi=off of 2.6.9-42.0.2 works

Please try also "pci=noacpi".  Please paste the /proc/interrupts for each case.
Comment 11 Len Brown 2006-08-24 14:18:58 EDT
Jim, the dmesg and interrupts in comment #5 appear to be from
2.6.9-34.0.2.EL booting the uni-processor kernel in PIC mode.
This configuration works properly, yes?

What happens when you boot the multi-processor 2.6.9-34.0.2.EL kernel?
The system has an MADT, so it should come up with 2 CPUs using the IOAPIC.
Can you attach the dmesg and /proc/interrupts for that case?

You mentioned that in the failure case hde is the issue,
and that is on ide2, which uses LNKA for its interrupt,
and in the UP case that ends up on PIC irq10.

Probing IDE interface ide1...
HPT370: IDE controller at PCI slot 0000:00:0e.0
ACPI: PCI interrupt 0000:00:0e.0[A] -> GSI 10 (level, low) -> IRQ 10
HPT370: chipset revision 3
HPT37X: using 33MHz PCI clock
HPT370: 100% native mode on irq 10
    ide2: BM-DMA at 0xe800-0xe807, BIOS settings: hde:DMA, hdf:pio
    ide3: BM-DMA at 0xe808-0xe80f, BIOS settings: hdg:DMA, hdh:pio
Probing IDE interface ide2...
hde: IBM-DTLA-305020, ATA DISK drive
ide2 at 0xd800-0xd807,0xdc02 on irq 10
Probing IDE interface ide3...
hdg: IBM-DTLA-307075, ATA DISK drive
ide3 at 0xe000-0xe007,0xe402 on irq 10
Probing IDE interface ide1...
Probing IDE interface ide4...
Probing IDE interface ide5...

There are a couple of "100% native mode" IDE interrupt failures upstream,
so it would be interesting if you ran the very latest kernel.org kernel
in SMP IOAPIC mode and reported whether that works.

In any case, the IRQ at hand is on somewhat thin ice because
the BIOS gives it to Linux on IRQ9, but at the same time tells
us that IRQ9 is not in the list of legal settings for that link:

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)

We generally try to touch PIC interrupt routing as little as possible,
but illegal is illegal, so we drop it onto IRQ10 -- I'm glad that
is working -- at least in PIC mode.
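
The check described here can be illustrated with a small sketch. It only reads the dmesg format quoted above, in which a "*" marks the current IRQ and a "*N" printed after the closing parenthesis means the current IRQ is not among the legal ones. The function names are made up for this example, and this is not how the kernel itself makes the decision.

```python
import re

# Matches dmesg lines such as:
#   ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
LINK_RE = re.compile(
    r"ACPI: PCI Interrupt Link \[(\w+)\] \(IRQs ([^)]*)\)\s*(\*\d+)?"
)


def parse_link(line):
    """Return (name, legal_irqs, current_irq), or None if the line does not match."""
    match = LINK_RE.search(line)
    if match is None:
        return None
    name, inside, outside = match.groups()
    legal, current = [], None
    for token in inside.split():
        if token.startswith("*"):
            current = int(token[1:])  # current IRQ is one of the legal ones
            legal.append(current)
        else:
            legal.append(int(token))
    if outside is not None:
        current = int(outside[1:])  # current IRQ printed outside the legal list
    return name, legal, current


def illegal_links(dmesg_text):
    """Names of links whose current IRQ is missing from their legal list."""
    bad = []
    for line in dmesg_text.splitlines():
        parsed = parse_link(line)
        if parsed and parsed[2] is not None and parsed[2] not in parsed[1]:
            bad.append(parsed[0])
    return bad


DMESG = """\
ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)
"""

print(illegal_links(DMESG))
```

Run against the four link lines from this box, only LNKA is reported, matching the IRQ9 observation.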

BTW. the acpi interrupt is on IRQ11 on this box, can you confirm
that ACPI interrupts work properly?  (eg /etc/rc.d/acpid stop;
cat /proc/acpi/event, and press the power button a few times
and you should see event strings come out, and should see
the acpi line in /proc/interrupts increment.)  It might be interesting
to see if there is a BIOS setup option on this box regarding the
default location of interrupts, and if you "return to defaults"
in SETUP if it puts ACPI on IRQ9 and puts PCI devices on IRQ10,
which is a much more traditional setup.  It is possible that the
BIOS system is okay in its default setting but that the BIOS mis-behaves
otherwise.
Comment 12 Jim Waldram 2006-08-24 15:56:35 EDT
In answer to comment #10:
kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:    6762281    IO-APIC-edge  timer
  1:         11    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:      15304   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:     399142   IO-APIC-level  mga@pci:0000:01:00.0
 10:       8431   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  aic7xxx, SoundBlaster
 12:        532    IO-APIC-edge  i8042
 14:      60081    IO-APIC-edge  ide0
NMI:          0
LOC:    6763078
ERR:          0
MIS:          2

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off pci=noacpi
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:     968925    IO-APIC-edge  timer
  1:         11    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:       3401   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:      51342   IO-APIC-level  mga@pci:0000:01:00.0
 10:       7895   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  aic7xxx, SoundBlaster
 12:         67    IO-APIC-edge  i8042
 14:       7935    IO-APIC-edge  ide0
NMI:          0
LOC:     968646
ERR:          0
MIS:          0

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 pci=noacpi
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:     114881    IO-APIC-edge  timer
  1:          9    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:        765   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:         92   IO-APIC-level  mga@pci:0000:01:00.0
 10:       7168   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  acpi, aic7xxx, SoundBlaster
 12:         67    IO-APIC-edge  i8042
 14:        249    IO-APIC-edge  ide0
NMI:          0
LOC:     114543
ERR:          0
MIS:          2
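
For comparing listings like the three above, a rough parsing sketch may help. It assumes the single-CPU layout shown here (one count column), skips the NMI/LOC/ERR/MIS summary rows, and uses made-up function names; the sample rows are copied from the acpi=off and pci=noacpi listings in this comment.

```python
def parse_interrupts(text):
    """Map IRQ number -> (count, controller, [devices]) for a one-CPU listing."""
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # CPU header or NMI/LOC/ERR/MIS summary rows
        fields = rest.split(None, 2)
        if len(fields) < 3:
            continue
        count, controller, devices = fields
        table[int(label)] = (
            int(count),
            controller,
            [name.strip() for name in devices.split(",")],
        )
    return table


def irq_of(table, device):
    """Return the IRQ a device is attached to, or None if it is not listed."""
    for irq, (_count, _controller, devices) in table.items():
        if device in devices:
            return irq
    return None


ACPI_OFF = """\
           CPU0
 10:       8431   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  aic7xxx, SoundBlaster
NMI:          0
"""

PCI_NOACPI = """\
           CPU0
 10:       7168   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  acpi, aic7xxx, SoundBlaster
NMI:          0
"""

off = parse_interrupts(ACPI_OFF)
noacpi = parse_interrupts(PCI_NOACPI)
print(irq_of(off, "ide2"), off[10][1])
print(irq_of(off, "acpi"), irq_of(noacpi, "acpi"))
```

The same lookup on a listing from the working -34.0.2 kernel would show whether ide2 sits behind XT-PIC or IO-APIC there.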
Comment 13 Jim Waldram 2006-08-24 16:29:07 EDT
In response to comment #11:
The dmesg and interrupt in comment #5 appears to be
2.6.9-34.0.2.EL booting the uni-processor kernel in PIC mode.
This configuration works properly, yes?           YES this works.

The system is a single processor on a dual processor motherboard.  No smp kernel
is installed.

can you confirm that ACPI interrupts work properly?
kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 pci=noacpi
/etc/rc.d/acpid stop;
cat /proc/acpi/event
button/power PWRF 00000080 00000001
button/power PWRF 00000080 00000002
button/power PWRF 00000080 00000003
Of course with acpi=off no /proc/acpi is created.

A quick check of the BIOS (Award 6.00PG, 01/17/2001!) shows the IRQs are set to
be assigned automatically.  You can switch to manual assignment.  The BIOS ESCD
configuration has been reset (with no effect) while trying to boot the -42.0.2
kernel with no acpi or pci options.  BIOS IRQ 9 is assigned to IRQ 2 cascade,
IRQ10 is reserved, IRQ11 is reserved.
Comment 14 Brian Smith 2006-08-29 10:14:50 EDT
Booting with acpi=off also fixes my problem, in both kernels.
Comment 15 Aristeu Rozanski 2006-08-30 13:25:26 EDT
Is there an updated BIOS version for this board?
Comment 16 Brian Smith 2006-08-30 16:51:38 EDT
Indeed there was one from 7/2005; however, it did not fix the problem.
Comment 17 Len Brown 2006-08-31 00:29:04 EDT
The issue here seems to be that the experiments so far with 2.6.9-42
are with the IOAPIC enabled, when this system was previously used
with the IOAPIC disabled.

>> The dmesg and interrupt in comment #5 appears to be
>> 2.6.9-34.0.2.EL booting the uni-processor kernel in PIC mode.
>> This configuration works properly, yes?
> YES this works.

> The system is a single processor on a dual processor motherboard.
> No smp kernel is installed.

Please verify that the 2.6.9-42 single-processor kernel
works exactly like the 2.6.9-34 single-processor kernel that you used before.
/proc/interrupts should look the same with ide2 on IRQ 10.
If you have only the SMP 2.6.9-42 kernel installed (and if so,
the question should be why the _type_ of kernel changed with the upgrade)
then you should be able to make it act like the UP kernel by using
"maxcpus=1" with "noapic".
Comment 18 Jason Baron 2006-08-31 08:51:14 EDT
We did set:

CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y

in U4 for the UP kernel, where they had previously not been set in U3. The
details of this are in bug 168584. It sounds like these changes are what is
causing the new behavior. So this change should only affect the UP kernel, and
if the APICs are incorrectly reported by the BIOS (which seems to be the case
most often), or set up incorrectly for some other reason, the workaround is to
pass 'noapic' on the command line until we figure out why things have gone awry
(BIOS or kernel problem).
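
To check which way a given kernel was built, the two options can be looked up in its config file. A minimal sketch, assuming the usual kernel config text format ("CONFIG_X=y" for set options, "# CONFIG_X is not set" for unset ones); the two sample fragments are illustrative and are not copied from the real U3 and U4 config files.

```python
def config_value(config_text, option):
    """Return the value of a kernel config option, or None if unset or absent."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


# Illustrative fragments only (not taken from the actual RHEL config files).
U3_STYLE = """\
# CONFIG_X86_UP_APIC is not set
"""

U4_STYLE = """\
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
"""

for name, text in (("U3-style", U3_STYLE), ("U4-style", U4_STYLE)):
    print(name,
          config_value(text, "CONFIG_X86_UP_APIC"),
          config_value(text, "CONFIG_X86_UP_IOAPIC"))
```

A UP kernel that reports "y" for both options will try to use the IOAPIC unless 'noapic' is passed at boot.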
Comment 19 Konrad Rzeszutek 2006-08-31 10:23:58 EDT
Have you tried installing the SMP kernel? I wonder if the IO-APIC problems also
appear on the board if you install a CONFIG_SMP compiled kernel.
Comment 20 Jim Waldram 2006-08-31 17:49:43 EDT
Case #2: Abit VP6 motherboard, single processor in dual processor motherboard. 
Award BIOS WK hangs as above.  BIOS upgrade to YT fixes problem.  BIOS upgrade
to DR (upgrades the RAID controller BIOS as well) also works.  Abit VP6
motherboards with the WK BIOS and two processors exhibit no problems.
Comment 21 Konrad Rzeszutek 2006-10-04 12:38:10 EDT
Closing bug as NOTABUG since the BIOS update fixes the problem.
