Description of problem:
After updating to Update 4 and kernel 2.6.9-42 from 2.6.9-34.0.2.EL, the system prints out:

hda: dma_timer_expiry: dma status == 0x64
hda: DMA interrupt recovery
hda: lost interrupt

It then gets very slow, very quickly, culminating in a non-pingable state. The system works fine when running the old kernel. This system has an nforce3 chipset and an Athlon 64, but runs the 32-bit version of RHEL 4.

Version-Release number of selected component (if applicable):
2.6.9-42

How reproducible:
Always.
Hi Brian, I'd really like to get to the bottom of what is causing this issue. Would you mind if we binary search the kernels between U3 and U4 to determine which patch is responsible? That is, can you reboot this machine 7-8 times? Thanks.
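For anyone following along, the binary search proposed above can be sketched as below. This is only an illustration, assuming an ordered list of test builds in which the first entry is known good, the last is known bad, and every build after the first bad one also fails. The function name `first_bad` and the `is_bad` callback are hypothetical; in practice `is_bad` is a person booting the build and watching for the hda DMA errors.

```python
def first_bad(builds, is_bad):
    """Return the first build for which is_bad(build) is True.

    Assumptions (not taken from the bug report):
      * builds[0] is known good and builds[-1] is known bad,
      * once a build is bad, every later build is bad too.
    """
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # failure already present at mid
        else:
            lo = mid  # still good at mid, look later
    return builds[hi]
```

Each call to `is_bad` corresponds to one reboot, so the number of reboots grows with the logarithm of the number of builds rather than with the number of builds itself.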
Also, which i686 kernel are you using: UP, SMP, or hugemem?
Also, could we get the output of 'cat /proc/interrupts' on the two kernels: the one that works and the U4 kernel that fails? Thanks.
From the IDE side it would be useful to know:

- the IDE controller in question
- the devices attached to it
- lspci -vxxx [as root]
- dmesg from boot if possible (old or new kernel)

My initial feeling, however, is that this is an interrupt routing problem triggered somehow by U4. The fact that your short trace shows the timer expiry and recovery working suggests this strongly.
Created attachment 134828: output from requested commands

This is a second reporter for this bug.
I have a similar problem with an Abit VP6 motherboard, single processor, booting off the third IDE controller. During boot,

hde: interrupt lost

is issued several times before the errors:

hde: dma_timer_expiry: dma status == 0x64
hde: DMA interrupt recovery
hde: lost interrupt

This is a dual-processor-capable motherboard. There are other systems with the same motherboard, with 2 processors, running SMP on the -42 and -42.0.2 kernels. Booting the 2-processor system uniprocessor works fine. Attached is the info requested of the original poster.
Thank you Jim, I will try to get the info as well. The machine goes tharn pretty fast, so I may have to run the commands from rc.local and write the output to a temp file. This happens to be my personal desktop, and I have really needed it the past couple of days, so I've been running the old kernel. I will say it fails on -42.0.2 as well, but there it complains about USB, not DMA.
Case #2 appears to be an IRQ routing bug when in uniprocessor mode. Does U3 work and U4 fail? If both fail, then please file a separate bug, as it's probably unrelated in cause (and add me to the cc line or email me the bug id). Also try acpi=off in this case and see what happens. That will help pin it down.
Case #2 on the Abit VP6 motherboard. U3 (2.6.9-34.0.2) works fine. U4 fails. Booting 2.6.9-42.0.2 with acpi=off works!

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off
> Booting with acpi=off of 2.6.9-42.0.2 works Please try also "pci=noacpi". Please paste the /proc/interrupts for each case.
Jim,

The dmesg and interrupts in comment #5 appear to be from 2.6.9-34.0.2.EL booting the uniprocessor kernel in PIC mode. This configuration works properly, yes?

What happens when you boot the multi-processor 2.6.9-34.0.2.EL kernel? The system has an MADT, so it should come up with 2 CPUs using the IOAPIC. Can you attach the dmesg and /proc/interrupts for that case?

You mentioned that in the failure case hde is the issue. That drive is on ide2, which uses LNKA for its interrupt, and in the UP case that ends up on PIC irq10.

Probing IDE interface ide1...
HPT370: IDE controller at PCI slot 0000:00:0e.0
ACPI: PCI interrupt 0000:00:0e.0[A] -> GSI 10 (level, low) -> IRQ 10
HPT370: chipset revision 3
HPT37X: using 33MHz PCI clock
HPT370: 100% native mode on irq 10
ide2: BM-DMA at 0xe800-0xe807, BIOS settings: hde:DMA, hdf:pio
ide3: BM-DMA at 0xe808-0xe80f, BIOS settings: hdg:DMA, hdh:pio
Probing IDE interface ide2...
hde: IBM-DTLA-305020, ATA DISK drive
ide2 at 0xd800-0xd807,0xdc02 on irq 10
Probing IDE interface ide3...
hdg: IBM-DTLA-307075, ATA DISK drive
ide3 at 0xe000-0xe007,0xe402 on irq 10
Probing IDE interface ide1...
Probing IDE interface ide4...
Probing IDE interface ide5...

There are a couple of "100% native mode" IDE interrupt failures upstream, so it would be interesting if you ran the very latest kernel.org kernel in SMP IOAPIC mode and reported whether that works.
In any case, the IRQ at hand is on somewhat thin ice because the BIOS gives it to Linux on IRQ9, but at the same time tells us that IRQ9 is not in the list of legal settings for that link:

ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *9
ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 *5 6 7 10 11 12 14 15)

We generally try to touch PIC interrupt routing as little as possible, but illegal is illegal, so we drop it onto IRQ10. I'm glad that is working, at least in PIC mode.

BTW, the ACPI interrupt is on IRQ11 on this box. Can you confirm that ACPI interrupts work properly? (E.g. /etc/rc.d/acpid stop; cat /proc/acpi/event, then press the power button a few times. You should see event strings come out, and the acpi line in /proc/interrupts should increment.)

It might be interesting to see if there is a BIOS setup option on this box regarding the default location of interrupts, and whether "return to defaults" in SETUP puts ACPI on IRQ9 and PCI devices on IRQ10, which is a much more traditional setup. It is possible that the BIOS is okay in its default setting but misbehaves otherwise.
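To make the "illegal is illegal" check concrete, here is a rough sketch that reads the "PCI Interrupt Link" lines from a dmesg capture and reports whether each link's current IRQ (the one marked with `*`) is among the legal settings the BIOS lists. It assumes the line format shown in this bug, where a current IRQ outside the list is printed after the closing parenthesis; the function name `check_links` is made up for the example.

```python
import re

# Matches e.g.
#   ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5) *9
#   ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 *11 12)
LINK_RE = re.compile(
    r"ACPI: PCI Interrupt Link \[(\w+)\] \(IRQs ([^)]*)\)(?:\s*\*(\d+))?"
)


def check_links(dmesg_text):
    """Return {link: (current_irq, legal_irqs, current_is_legal)}."""
    result = {}
    for name, inside, outside in LINK_RE.findall(dmesg_text):
        legal = []
        current = None
        for tok in inside.split():
            if tok.startswith("*"):
                # Current setting is one of the listed (legal) IRQs.
                current = int(tok[1:])
                legal.append(current)
            else:
                legal.append(int(tok))
        if outside:
            # Current setting was printed outside the list of legal IRQs.
            current = int(outside)
        result[name] = (current, legal, current in legal)
    return result
```

Fed the four link lines above, this would flag LNKA (currently on IRQ 9, which the list omits) and pass the other three links.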
In answer to comment #10:

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:    6762281    IO-APIC-edge  timer
  1:         11    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:      15304   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:     399142   IO-APIC-level  mga@pci:0000:01:00.0
 10:       8431   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  aic7xxx, SoundBlaster
 12:        532    IO-APIC-edge  i8042
 14:      60081    IO-APIC-edge  ide0
NMI:          0
LOC:    6763078
ERR:          0
MIS:          2

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 acpi=off pci=noacpi
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:     968925    IO-APIC-edge  timer
  1:         11    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:       3401   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:      51342   IO-APIC-level  mga@pci:0000:01:00.0
 10:       7895   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  aic7xxx, SoundBlaster
 12:         67    IO-APIC-edge  i8042
 14:       7935    IO-APIC-edge  ide0
NMI:          0
LOC:     968646
ERR:          0
MIS:          0

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 pci=noacpi
[root@hare ~]# cat /proc/interrupts
           CPU0
  0:     114881    IO-APIC-edge  timer
  1:          9    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  5:        765   IO-APIC-level  uhci_hcd, uhci_hcd, eth0
  8:          1    IO-APIC-edge  rtc
  9:         92   IO-APIC-level  mga@pci:0000:01:00.0
 10:       7168   IO-APIC-level  ide2, ide3
 11:         14   IO-APIC-level  acpi, aic7xxx, SoundBlaster
 12:         67    IO-APIC-edge  i8042
 14:        249    IO-APIC-edge  ide0
NMI:          0
LOC:     114543
ERR:          0
MIS:          2
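Since the counts differ on every boot, it can help to compare only the routing (controller type and devices per IRQ) between two /proc/interrupts captures. The following is a minimal sketch for single-CPU output like the captures in this comment; `parse_interrupts` and `changed_irqs` are names invented for the example, and rows without a controller column (NMI, LOC, ERR, MIS) are deliberately skipped.

```python
def parse_interrupts(text):
    """Map IRQ label -> (controller, devices) from /proc/interrupts text.

    Assumes a single CPU column. Header lines and summary rows such as
    NMI/LOC/ERR/MIS, which carry only a count, are ignored.
    """
    table = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # "CPU0" header, shell prompt, or blank line
        label, rest = line.split(":", 1)
        fields = rest.split(None, 2)  # count, controller, device list
        if len(fields) < 3:
            continue  # summary row with a count only
        _count, controller, devices = fields
        table[label.strip()] = (controller, devices.strip())
    return table


def changed_irqs(first_text, second_text):
    """Return {irq: (first_entry, second_entry)} for IRQs whose routing differs."""
    first = parse_interrupts(first_text)
    second = parse_interrupts(second_text)
    return {
        irq: (first.get(irq), second.get(irq))
        for irq in set(first) | set(second)
        if first.get(irq) != second.get(irq)
    }
```

Run against the acpi=off and pci=noacpi captures above, the only routing difference it should report is IRQ 11, where the acpi handler appears in the second case.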
In response to comment #11:

> The dmesg and interrupt in comment #5 appears to be
> 2.6.9-34.0.2.EL booting the uni-processor kernel in PIC mode.
> This configuration works properly, yes?

Yes, this works. The system is a single processor on a dual-processor motherboard. No SMP kernel is installed.

> can you confirm that ACPI interrupts work properly?

kernel /vmlinuz-2.6.9-42.0.2.EL ro root=/dev/hde2 pci=noacpi

/etc/rc.d/acpid stop; cat /proc/acpi/event
button/power PWRF 00000080 00000001
button/power PWRF 00000080 00000002
button/power PWRF 00000080 00000003

Of course, with acpi=off no /proc/acpi is created.

A quick check of the BIOS (Award 6.00PG, 01/17/2001!) shows the IRQs are set to be assigned automatically. You can switch to manual assignment. The BIOS ESCD configuration has been reset (with no effect) while trying to boot the -42.0.2 kernel with no acpi or pci options. In the BIOS, IRQ 9 is assigned to the IRQ 2 cascade, IRQ10 is reserved, and IRQ11 is reserved.
Booting with acpi=off also fixes my problem, in both kernels.
Is there an updated BIOS version for this board?
Indeed there was one from 7/2005; however, it did not fix the problem.
The issue here seems to be that the experiments so far with 2.6.9-42 are with the IOAPIC enabled, when this system was previously used with the IOAPIC disabled.

>> The dmesg and interrupt in comment #5 appears to be
>> 2.6.9-34.0.2.EL booting the uni-processor kernel in PIC mode.
>> This configuration works properly, yes?
> YES this works.
> The system is a single processor on a dual processor motherboard.
> No smp kernel is installed.

Please verify that the 2.6.9-42 single-processor kernel works exactly like the 2.6.9-34 single-processor kernel that you used before. /proc/interrupts should look the same, with ide2 on IRQ 10.

If you have only the SMP 2.6.9-42 kernel installed (and if so, the question should be why the _type_ of kernel changed with the upgrade), then you should be able to make it act like the UP kernel by using "maxcpus=1" with "noapic".
We did set:

CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y

in U4 for the UP kernel, where they had previously not been set in U3. The details of this are in bug 168584. It sounds like these changes are what is causing the new behavior. So this change should only affect the UP kernel, and if the APICs are incorrectly reported by the BIOS (which seems to be the cause in most cases), or set up incorrectly for some other reason, the workaround is to pass 'noapic' on the kernel command line until we figure out why things have gone awry (BIOS or kernel problem).
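One way to confirm which of the two kernels was built with these options is to look them up in each kernel's build configuration. Below is a small sketch that takes the text of a kernel config file and reports the value of the named options. The helper name `config_options` is hypothetical, and where the config file lives on a given install (often alongside the kernel image in /boot) is an assumption to verify locally.

```python
def config_options(config_text, names):
    """Return {name: value} for the given options in kernel-config-style text.

    A value of None means the option is absent or appears only as a
    comment such as "# CONFIG_X86_UP_APIC is not set".
    """
    found = dict.fromkeys(names)  # every requested option starts as None
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or "is not set" comment
        key, sep, value = line.partition("=")
        if sep and key in found:
            found[key] = value
    return found
```

Calling it with `["CONFIG_X86_UP_APIC", "CONFIG_X86_UP_IOAPIC"]` on the U3 and U4 UP kernel configs should show both options unset for the first and `y` for the second, if the description above is accurate.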
Have you tried installing the SMP kernel? I wonder if the IO-APIC problems also appear on the board if you install a CONFIG_SMP compiled kernel.
Case #2: Abit VP6 motherboard, single processor in a dual-processor motherboard. With Award BIOS version WK, the system hangs as above. A BIOS upgrade to YT fixes the problem. A BIOS upgrade to DR (which upgrades the RAID controller BIOS as well) also works. Abit VP6 motherboards with the WK BIOS and two processors exhibit no problems.
Closing bug as NOTABUG since the BIOS update fixes the problem.