Bug 189829
| Summary: | Network interfaces die with "nobody cared! (screaming interrupt?)" msg | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Norman Elton <normelton> |
| Component: | kernel | Assignee: | John W. Linville <linville> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | alexander.laamanen, jbaron, john.lists, mvoelker, ssnodgra |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | U4 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-08-21 17:16:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Norman Elton
2006-04-24 22:09:14 UTC
Created attachment 128167
Screenshot of full error msg
For clarification, the Sun x4100 is an AMD-based system, not a SPARC system.

Died again overnight. Rolled back to previous kernel (2.6.9-22.0.2). Will report if the problem persists.

Did you try booting w/ "acpi=off"? Does that avoid the issue? Please try that and post the results here...thanks!

Rebooting with acpi=off does not appear to fix the problem. The issue reappeared overnight. I'm going to confirm that the box is running the latest firmware from Sun. Any other ideas? Thanks

Upgrading to the latest firmware from Sun has not addressed the issue, nor has booting with acpi=off. I have noticed that irqbalance fails to start correctly; however, I can't get any debug information out of it. To my knowledge, it's not a daemon that continues to run; however, I get a "failed" error when the box shuts down and tries to stop the service. Look forward to any suggestions. Thanks.

Please attach the output of running "sysreport" on the box in question...thanks!

Sysreport e-mailed to linville

FYI, this happens almost immediately for us, when the standard installation kernel is used (32bit). We are doing a kickstart installation over the network.

...and we are also using the same hardware: Sun x4100.

Update e1000 to 7.0.38-k4 available here: http://people.redhat.com/linville/kernels/rhel4/ I don't know if it has any effect on this problem, but I'd like to hear reports from trying it. Thanks!

I tried to boot the kernel, but it fails to initialize the mptscsi (raid1 configuration):

Linux version 2.6.9-37.EL.jwltest.140smp (bhcompile.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Sat May 27 17:26:43 EDT 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
BIOS-e820: 000000007fff0000 - 000000007ffff000 (ACPI data)
BIOS-e820: 000000007ffff000 - 0000000080000000 (ACPI NVS)
BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
1151MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000ff780
NX (Execute Disable) protection: active
DMI 2.3 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0x5008
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
Enabling APIC mode: Flat. Using 0 I/O APICs
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfe6ff000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfe6ff000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfe6fe000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xfe6fe000, GSI 28-31
ACPI: IOAPIC (id[0x05] address[0xfeaff000] gsi_base[32])
IOAPIC[3]: apic_id 5, version 17, address 0xfeaff000, GSI 32-35
ACPI: IOAPIC (id[0x06] address[0xfeafe000] gsi_base[36])
IOAPIC[4]: apic_id 6, version 17, address 0xfeafe000, GSI 36-39
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7fb00000)
Kernel command line: ro root=LABEL=/ console=tty0 console=ttyS0,38400n8
CPU 0 irqstacks, hard=c03ef000 soft=c03cf000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2593.898 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2073620k/2097088k available (1880k kernel code, 22492k reserved, 760k data, 184k init, 1179584k highmem)
Calibrating delay using timer specific routine.. 5189.93 BogoMIPS (lpj=2594968)
Security Scaffold v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: AMD Opteron(tm) Processor 252 stepping 01
per-CPU timeslice cutoff: 2926.60 usecs.
task migration cache decay timeout: 3 msecs.
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c03f0000 soft=c03d0000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5184.51 BogoMIPS (lpj=2592257)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: AMD Opteron(tm) Processor 252 stepping 01
Total of 2 processors activated (10374.45 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
zapping low mappings.
checking if image is initramfs... it is
Freeing initrd memory: 530k freed
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040816
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Root Bridge [PCIB] (00:04)
PCI: Probing PCI hardware (bus 04)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:07.2[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:01:01.0[A] -> GSI 26 (level, low) -> IRQ 177
ACPI: PCI interrupt 0000:01:01.1[B] -> GSI 27 (level, low) -> IRQ 185
ACPI: PCI interrupt 0000:01:02.0[A] -> GSI 24 (level, low) -> IRQ 193
ACPI: PCI interrupt 0000:01:02.1[B] -> GSI 25 (level, low) -> IRQ 201
ACPI: PCI interrupt 0000:02:03.0[A] -> GSI 28 (level, low) -> IRQ 209
ACPI: PCI interrupt 0000:03:00.0[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:00.1[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:03.0[A] -> GSI 16 (level, low) -> IRQ 217
apm: BIOS not found.
audit: initializing netlink socket (disabled)
audit(1149064141.323:1): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key 31A23B22CAC2A0B3
- User ID: Red Hat, Inc. (Kernel Module GPG key)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Processor [CPU1] (supports C1, 8 throttling states)
ACPI: Processor [CPU2] (supports C1)
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing enabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
hda: DV-28SL, ATAPI CD/DVD-ROM drive
Using cfq io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 24X DVD-ROM drive, 256kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 262144 (order: 9, 3145728 bytes)
TCP: Hash tables configured (established 262144 bind 262144)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S1 S4 S5)
ACPI wakeup devices:
PCI1 USB0 ETHR USB1
Freeing unused kernel memory: 184k freed
Red Hat nash version 4.2.1.6 starting
Mounted /proc filesystem
SCSI subsystem initialized
Creating /dev
Starting udev
LFusion MPT base driver 3.02.62.01rh
oading scsi_mod.Copyright (c) 1999-2005 LSI Logic Corporation
ko module
Loading sd_mod.ko modFusion MPT FC Host driver 3.02.62.01rh
ule
Loading mptFusion MPT SPI Host driver 3.02.62.01rh
base.ko module
Fusion MPT SAS Host driver 3.02.62.01rh
Loading mptscsi.ACPI: PCI interrupt 0000:02:03.0[A] -> GSI 28 (level, low) -> IRQ 209
ko module
Loadimptbase: Initiating ioc0 bringup
ng mptfc.ko module
Loading mptspi.ko module
Loading mptsas.ko module
ioc0: SAS1064: Capabilities={Initiator}
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
scsi0 : ioc0: LSISAS1064, FwRev=01040000h, Ports=1, MaxQ=267, IRQ=209
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
command = Inquiry 00 00 00 24 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: task abort: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
command = Test Unit Ready 00 00 00 00 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: task abort: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting target reset! (sc=c2328c40)
scsi0 : destination target 0, lun 0
command = Inquiry 00 00 00 24 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: target reset: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
command = Test Unit Ready 00 00 00 00 00
...(looping)

Alexander, that looks like it is dying while executing from the initrd. It is hard to guess why this would happen, since I don't have any patches that would obviously touch the mptscsi driver. Are you able to successfully install and boot other recent rhel4 kernels?

I'm experiencing the same issue as Alexander. It gets to the same place, and fails to boot farther.

Yes, the following RHEL4U3 kernels boot up just fine:
2.6.9-34.ELsmp
2.6.9-34.0.1.ELsmp

I'm seeing very similar symptoms on Penguin Computing Altus 1300 systems running RHEL4 AS U3. These machines, however, use Broadcom BCM5721 NICs (wired in via the PCIe bus, 1x lane) rather than Intel boards. Within a couple of minutes after bootup, I see:

irq 7: nobody cared! (screaming interrupt?)
irq 7: Please try booting with acpi=off and report a bug
[<c01074d6>] __report_bad_irq+0x3a/0x77
[<c010774d>] note_interrupt+0xea/0x115
[<c01079f9>] do_IRQ+0x143/0x1ae
[<c02d3014>] common_interrupt+0x18/0x20
[<c0104018>] default_idle+0x0/0x2f
[<c0104041>] default_idle+0x29/0x2f
[<c01040a0>] cpu_idle+0x26/0x3b
handlers:
[<f88caa5d>] (tg3_interrupt_tagged+0x0/0xf4 [tg3]

After this we have no network connectivity, of course. Booting with acpi=off doesn't help. I rolled back to a U2 kernel (kernel-smp-2.6.9-34.EL) and the problem went away. This has been easily replicated on 8 different Altus 1300's.

Still no idea why you would be having mptscsi problems w/ that kernel...but in the meantime... Have you tried booting w/ "irqpoll" on the kernel command line? Please do so and post the results here...thanks!

With "irqpoll" enabled the network installation succeeds every time.

Interesting...now I need to figure out what that means... :-)

If it helps, it appears that my problems are eliminated when I pass the "irqpoll" parameter to the kernel during booting. The system does not exhibit the "nobody cared/screaming interrupt" message. Let me know any other information you need. Thanks, Norman

irqpoll [HW] When an interrupt is not handled search all handlers for it. Also check all handlers each timer interrupt. Intended to get systems with badly broken firmware running.

So, it sounds like your BIOS is assigning interrupts in a bad way. Do you have the latest BIOS upgrade for your motherboard?

I upgraded to the latest firmware from Sun:
- ILOM 1.0.1
- BIOS 31
- LSI MPT SAS firmware 1.08.01 and MPT BIOS 6.04.07
but the installation still fails.

Does "irqpoll" still work for you (as in comment 18)?

Yes, "irqpoll" still works with the upgraded firmware.

I work at Penguin, so Mark Voelker and I debugged comment #16 via email, and it turned out that it was caused by bug 193937. That has a straightforward workaround.

I really doubt this is a NIC driver bug. I think it's something at a lower level. My guess is that mptscsi broke because the new kernel changed the timing or something. What does the output from /proc/interrupts look like? It's weird that the bugs are so similar when the screenshot shows that Norman Elton's system is running a UP kernel and Alexander Laamanen's system is running the SMP kernel. The Sun x4100 is an AMD64 system, I believe. Does this work if you use a 64-bit OS instead of your current 32-bit OS?

I did some further testing on the "irq 10: nobody cared!, Disabling IRQ #10" issue. For us it happens immediately if a 32bit UP kernel is used. When IRQ 10 is disabled, /proc/interrupts shows that it got 100000 interrupts. The 32bit SMP, 64bit UP and 64bit SMP kernels seem to work ok (at least they don't disable the interrupt immediately). I also found out that the e1000_watchdog is able to re-enable the NIC after the IRQ has been disabled.

I just tested the latest EL4 Update 4 kernel (2.6.9-42) and it seems to fix this issue. No problems anymore with the 32bit UP kernel. Is the same true for others? Norman?

FYI, I just saw this same problem trying to kickstart a Sun X4100 using the RHEL4 Update 3 ISO (32-bit). I'm going to try the irqpoll workaround for the moment, but I have another box to install that I will use the Update 4 CD with. I'll report back the results.

OK, I can confirm both items:
1. The "irqpoll" keyword allows me to successfully kickstart when booting with the 32-bit RHEL4 Update 3 ISO.
2. I can kickstart fine when booting with RHEL4 Update 4 without any special keywords.

I can confirm that rebuilding the box with the RHEL 64-bit kernel has fixed our problem.

It sounds like this is CURRENTRELEASE.
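
Editor's illustration, not part of the original thread: the observation that /proc/interrupts shows exactly 100000 interrupts when IRQ 10 gets disabled is consistent with how 2.6-era kernels detect a stuck line in note_interrupt(), the function that appears in the "nobody cared" trace. The Python sketch below is only a user-space model of that bookkeeping. The threshold of 99,900 unhandled interrupts per 100,000 is quoted from memory of the kernel source and should be treated as an assumption, and the names IrqLine and deliver are hypothetical.

```python
from dataclasses import dataclass

# Assumed thresholds (recalled from 2.6-era kernel source, not taken from this bug).
WINDOW = 100_000           # interrupts counted before the line is evaluated
UNHANDLED_LIMIT = 99_900   # more unhandled than this within a window => stuck


@dataclass
class IrqLine:
    """Hypothetical model of the per-IRQ counters used for stuck-line detection."""
    number: int
    irq_count: int = 0
    irqs_unhandled: int = 0
    disabled: bool = False

    def note_interrupt(self, handled: bool) -> None:
        """Record one interrupt; disable the line if almost none were claimed."""
        if self.disabled:
            return
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return
        # A full window has elapsed: evaluate it, then reset both counters.
        self.irq_count = 0
        if self.irqs_unhandled > UNHANDLED_LIMIT:
            print(f"irq {self.number}: nobody cared! (screaming interrupt?)")
            print(f"Disabling IRQ #{self.number}")
            self.disabled = True
        self.irqs_unhandled = 0


def deliver(line: IrqLine, total: int, handled_every: int) -> int:
    """Raise up to `total` interrupts; every `handled_every`-th one is claimed
    by a handler (0 means no handler ever claims one). Returns how many
    interrupts were delivered before the line was disabled."""
    delivered = 0
    for i in range(1, total + 1):
        if line.disabled:
            break
        handled = handled_every > 0 and i % handled_every == 0
        line.note_interrupt(handled)
        delivered += 1
    return delivered


if __name__ == "__main__":
    stuck = IrqLine(number=10)
    print(deliver(stuck, 250_000, handled_every=0), stuck.disabled)
```

In this model, a line whose interrupts are never claimed is shut off after exactly one window of 100,000 interrupts, which matches the count reported above. A shared line on which a handler claims even half of the interrupts is left alone. The sketch does not model irqpoll itself, which per the quoted kernel documentation makes the kernel search all registered handlers when an interrupt goes unhandled.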