Bug 189829

Summary: Network interfaces die with "nobody cared! (screaming interrupt?)" msg
Product: Red Hat Enterprise Linux 4
Reporter: Norman Elton <normelton>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.0
CC: alexander.laamanen, jbaron, john.lists, mvoelker, ssnodgra
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: U4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-08-21 17:16:59 UTC
Type: ---
Attachments:
Description Flags
Screenshot of full error msg none

Description Norman Elton 2006-04-24 22:09:14 UTC
System is a Sun x4100. It's been running RHEL4 since January with no hiccups.
This morning, I upgraded to the latest bugfixes and rebooted. System booted
fine. A few hours later, the following error appeared on the console...

irq 10: nobody cared! (screaming interrupt?)

... and some of my network interfaces died. I've attached a screenshot with the
full error text.

The only thing running on IRQ 10 is one of the gig interfaces, using the e1000
driver. There are three other active gig interfaces (one on IRQ9 & two on IRQ 11).
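
One way to double-check which drivers are registered on a given interrupt line is to read /proc/interrupts. The sketch below is illustrative only: the function name `devices_on_irq` and the sample table are assumptions made for the example, and the sample is not data from the affected machine. It assumes the 2.6-era column layout (IRQ number, per-CPU counters, controller type, device list).

```python
def devices_on_irq(interrupts_text, irq):
    """Return the device names listed for `irq` in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        fields = rest.split()
        # Skip the per-CPU counters (the leading all-digit fields).
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # fields[i] is the controller column (e.g. "XT-PIC"); everything
        # after it is the comma-separated list of registered handlers.
        names = " ".join(fields[i + 1:])
        return [n.strip() for n in names.split(",") if n.strip()]
    return []


# Made-up sample for illustration; NOT taken from the system in this report.
SAMPLE = """\
           CPU0
  0:    1234567          XT-PIC  timer
  9:      20412          XT-PIC  acpi, eth1
 10:     100000          XT-PIC  eth0
 11:      88231          XT-PIC  eth2, eth3
NMI:          0
"""

print(devices_on_irq(SAMPLE, 10))
```

On a live system the text would come from reading /proc/interrupts instead of `SAMPLE`. Newer kernels print extra columns, so the column handling here would need adjusting.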

The box was responsive from the local console. I was able to login and restart
the system.

If the box dies again overnight, I'm going to rollback to the previous kernel.

Comment 1 Norman Elton 2006-04-24 22:09:14 UTC
Created attachment 128167
Screenshot of full error msg

Comment 2 Norman Elton 2006-04-24 22:13:06 UTC
For clarification, the Sun x4100 is an AMD-based system, not a SPARC system.

Comment 3 Norman Elton 2006-04-25 13:14:56 UTC
Died again overnight. Rolled back to previous kernel (2.6.9-22.0.2). Will report
if the problem persists.

Comment 4 John W. Linville 2006-04-25 13:39:01 UTC
Did you try booting w/ "acpi=off"?  Does that avoid the issue? 
 
Please try that and post the results here...thanks! 

Comment 5 Norman Elton 2006-04-26 13:58:04 UTC
Rebooting with acpi=off does not appear to fix the problem. The issue reappeared
overnight. I'm going to confirm that the box is running the latest firmware from
Sun.

Any other ideas? Thanks

Comment 6 Norman Elton 2006-04-27 14:55:09 UTC
Upgrading to the latest firmware from Sun has not addressed the issue, nor has
booting with acpi=off.

I have noticed that irqbalance fails to start correctly; however, I can't get
any debug information out of it. To my knowledge, it's not a daemon that
continues to run; however, I get a "failed" error when the box shuts down and
tries to stop the service.

Look forward to any suggestions. Thanks.

Comment 7 John W. Linville 2006-05-16 20:30:32 UTC
Please attach the output of running "sysreport" on the box in 
question...thanks! 

Comment 8 Norman Elton 2006-05-18 13:34:55 UTC
Sysreport e-mailed to linville

Comment 9 Alexander Laamanen 2006-05-23 07:52:43 UTC
FYI, this happens almost immediately for us when the standard installation
kernel is used (32bit). We are doing a kickstart installation over the network.

Comment 10 Alexander Laamanen 2006-05-23 07:56:30 UTC
...and we are also using the same hardware: Sun x4100.

Comment 11 John W. Linville 2006-05-30 17:55:07 UTC
Update e1000 to 7.0.38-k4 available here:

   http://people.redhat.com/linville/kernels/rhel4/

I don't know if it has any effect on this problem, but I'd like to hear 
reports from anyone who tries it.  Thanks!

Comment 12 Alexander Laamanen 2006-05-31 10:32:23 UTC
I tried to boot the kernel, but it fails to initialize the mptscsi (raid1
configuration):

Linux version 2.6.9-37.EL.jwltest.140smp (bhcompile.redhat.com)
(gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Sat May 27 17:26:43 EDT 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009e000 (usable)
 BIOS-e820: 000000000009e000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 BIOS-e820: 000000007fff0000 - 000000007ffff000 (ACPI data)
 BIOS-e820: 000000007ffff000 - 0000000080000000 (ACPI NVS)
 BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
1151MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000ff780
NX (Execute Disable) protection: active
DMI 2.3 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0x5008
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
Enabling APIC mode:  Flat.  Using 0 I/O APICs
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfe6ff000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfe6ff000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfe6fe000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xfe6fe000, GSI 28-31
ACPI: IOAPIC (id[0x05] address[0xfeaff000] gsi_base[32])
IOAPIC[3]: apic_id 5, version 17, address 0xfeaff000, GSI 32-35
ACPI: IOAPIC (id[0x06] address[0xfeafe000] gsi_base[36])
IOAPIC[4]: apic_id 6, version 17, address 0xfeafe000, GSI 36-39
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7fb00000)
Kernel command line: ro root=LABEL=/ console=tty0 console=ttyS0,38400n8 
CPU 0 irqstacks, hard=c03ef000 soft=c03cf000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2593.898 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2073620k/2097088k available (1880k kernel code, 22492k reserved, 760k
data, 184k init, 1179584k highmem)
Calibrating delay using timer specific routine.. 5189.93 BogoMIPS (lpj=2594968)
Security Scaffold v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: AMD Opteron(tm) Processor 252 stepping 01
per-CPU timeslice cutoff: 2926.60 usecs.
task migration cache decay timeout: 3 msecs.
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c03f0000 soft=c03d0000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5184.51 BogoMIPS (lpj=2592257)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: AMD Opteron(tm) Processor 252 stepping 01
Total of 2 processors activated (10374.45 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
zapping low mappings.
checking if image is initramfs... it is
Freeing initrd memory: 530k freed
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040816
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Root Bridge [PCIB] (00:04)
PCI: Probing PCI hardware (bus 04)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:07.2[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:01:01.0[A] -> GSI 26 (level, low) -> IRQ 177
ACPI: PCI interrupt 0000:01:01.1[B] -> GSI 27 (level, low) -> IRQ 185
ACPI: PCI interrupt 0000:01:02.0[A] -> GSI 24 (level, low) -> IRQ 193
ACPI: PCI interrupt 0000:01:02.1[B] -> GSI 25 (level, low) -> IRQ 201
ACPI: PCI interrupt 0000:02:03.0[A] -> GSI 28 (level, low) -> IRQ 209
ACPI: PCI interrupt 0000:03:00.0[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:00.1[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:03.0[A] -> GSI 16 (level, low) -> IRQ 217
apm: BIOS not found.
audit: initializing netlink socket (disabled)
audit(1149064141.323:1): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key 31A23B22CAC2A0B3
- User ID: Red Hat, Inc. (Kernel Module GPG key)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Processor [CPU1] (supports C1, 8 throttling states)
ACPI: Processor [CPU2] (supports C1)
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing enabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
hda: DV-28SL, ATAPI CD/DVD-ROM drive
Using cfq io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 24X DVD-ROM drive, 256kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 262144 (order: 9, 3145728 bytes)
TCP: Hash tables configured (established 262144 bind 262144)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S1 S4 S5)
ACPI wakeup devices:
PCI1 USB0 ETHR USB1
Freeing unused kernel memory: 184k freed
Red Hat nash version 4.2.1.6 starting
Mounted /proc filesystem
SCSI subsystem initialized

Creating /dev
Starting udev
Loading scsi_mod.ko module
Fusion MPT base driver 3.02.62.01rh
Copyright (c) 1999-2005 LSI Logic Corporation
Loading sd_mod.ko module
Fusion MPT FC Host driver 3.02.62.01rh
Loading mptbase.ko module
Fusion MPT SPI Host driver 3.02.62.01rh
Fusion MPT SAS Host driver 3.02.62.01rh
Loading mptscsi.ko module
ACPI: PCI interrupt 0000:02:03.0[A] -> GSI 28 (level, low) -> IRQ 209
Loading mptfc.ko module
mptbase: Initiating ioc0 bringup
Loading mptspi.ko module
Loading mptsas.ko module
ioc0: SAS1064: Capabilities={Initiator}
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
mptbase: Initiating ioc0 recovery
scsi0 : ioc0: LSISAS1064, FwRev=01040000h, Ports=1, MaxQ=267, IRQ=209
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
        command = Inquiry 00 00 00 24 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: task abort: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
        command = Test Unit Ready 00 00 00 00 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: task abort: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting target reset! (sc=c2328c40)
scsi0 : destination target 0, lun 0
        command = Inquiry 00 00 00 24 00
mptbase: Initiating ioc0 recovery
mptscsi: ioc0: target reset: SUCCESS (sc=c2328c40)
mptscsi: ioc0: attempting task abort! (sc=c2328c40)
scsi0 : destination target 0, lun 0
        command = Test Unit Ready 00 00 00 00 00
...(looping)

Comment 13 John W. Linville 2006-05-31 14:08:47 UTC
Alexander, that looks like it is dying while executing from the initrd.  It is 
hard to guess why this would happen, since I don't have any patches that would 
obviously touch the mptscsi driver. 
 
Are you able to successfully install and boot other recent rhel4 kernels? 

Comment 14 Norman Elton 2006-05-31 20:28:16 UTC
I'm experiencing the same issue as Alexander. It gets to the same place, and
fails to boot farther.

Comment 15 Alexander Laamanen 2006-06-01 12:24:17 UTC
Yes, the following RHEL4U3 kernels boot up just fine:
2.6.9-34.ELsmp
2.6.9-34.0.1.ELsmp


Comment 16 Mark T. Voelker 2006-06-01 13:23:48 UTC
I'm seeing very similar symptoms on Penguin Computing Altus 1300 systems running
RHEL4 AS U3.  These machines, however, use Broadcom BCM5721 NICs (wired in via
the PCIe bus, 1x lane) rather than Intel boards.  Within a couple of minutes
after bootup, I see:

irq 7: nobody cared! (screaming interrupt?)
irq 7: Please try booting with acpi=off and report a bug
 [<c01074d6>] __report_bad_irq+0x3a/0x77
 [<c010774d>] note_interrupt+0xea/0x115
 [<c01079f9>] do_IRQ+0x143/0x1ae
 [<c02d3014>] common_interrupt+0x18/0x20
 [<c0104018>] default_idle+0x0/0x2f
 [<c0104041>] default_idle+0x29/0x2f
 [<c01040a0>] cpu_idle+0x26/0x3b
handlers:
[<f88caa5d>] (tg3_interrupt_tagged+0x0/0xf4 [tg3]

After this we have no network connectivity, of course.  Booting with acpi=off
doesn't help.  I rolled back to a U2 kernel (kernel-smp-2.6.9-34.EL) and the
problem went away.  This has been easily replicated on 8 different Altus 1300's.
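
When comparing reports like this across several machines, it can help to pull the IRQ number and the registered handlers out of the console text automatically. The following is a minimal sketch, assuming the message layout shown in this comment; the function name `parse_bad_irq_report` and the regular expressions are illustrative choices, not part of any kernel tooling.

```python
import re


def parse_bad_irq_report(text):
    """Extract (irq, [(handler_symbol, module), ...]) from a 'nobody cared' report."""
    irq = None
    match = re.search(r"irq (\d+): nobody cared", text)
    if match:
        irq = int(match.group(1))

    handlers = []
    in_handlers = False
    for line in text.splitlines():
        if line.strip() == "handlers:":
            in_handlers = True
            continue
        if in_handlers:
            # Matches e.g. "(tg3_interrupt_tagged+0x0/0xf4 [tg3]"; the module
            # name in brackets is optional (built-in handlers have none).
            m = re.search(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?", line)
            if m:
                handlers.append((m.group(1), m.group(2)))
    return irq, handlers


# Excerpt of the report quoted in this comment.
REPORT = """\
irq 7: nobody cared! (screaming interrupt?)
irq 7: Please try booting with acpi=off and report a bug
 [<c01074d6>] __report_bad_irq+0x3a/0x77
 [<c010774d>] note_interrupt+0xea/0x115
handlers:
[<f88caa5d>] (tg3_interrupt_tagged+0x0/0xf4 [tg3]
"""

print(parse_bad_irq_report(REPORT))
```

Only lines after `handlers:` are treated as handler entries, so the stack-trace frames above it are ignored.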

Comment 17 John W. Linville 2006-06-02 14:21:33 UTC
Still no idea why you would be having mptscsi problems w/ that kernel...but 
in the meantime... 
 
Have you tried booting w/ "irqpoll" on the kernel command line?  Please do so 
and post the results here...thanks! 

Comment 18 Alexander Laamanen 2006-06-05 11:12:32 UTC
With "irqpoll" enabled the network installation succeeds every time.
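
For anyone scripting this workaround, the parameter simply needs to end up on the kernel command line (on RHEL 4 that is normally the `kernel` line in /boot/grub/grub.conf, or the boot prompt for an installer). Below is a small sketch of appending it without duplicating an existing entry. The helper name `add_boot_param` is hypothetical; the sample command line is the one shown in the boot log in comment 12.

```python
def add_boot_param(cmdline, param):
    """Append `param` to a kernel command line unless it is already present."""
    args = cmdline.split()
    if param not in args:
        args.append(param)
    return " ".join(args)


# Kernel command line as printed in the comment 12 boot log.
CMDLINE = "ro root=LABEL=/ console=tty0 console=ttyS0,38400n8"

print(add_boot_param(CMDLINE, "irqpoll"))
```

The helper only edits a string; writing the result back into the boot loader configuration is left out on purpose, since that step depends on the local setup.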

Comment 19 John W. Linville 2006-06-05 16:30:05 UTC
Interesting...now I need to figure out what that means... :-) 

Comment 20 Norman Elton 2006-06-27 16:50:34 UTC
If it helps, it appears that my problems are eliminated when I pass the
"irqpoll" parameter to the kernel during booting. The system does not exhibit
the "nobody cared/screaming interrupt" message.

Let me know any other information you need. Thanks,

Norman

Comment 21 John W. Linville 2006-06-29 18:34:28 UTC
        irqpoll         [HW]
                        When an interrupt is not handled search all handlers
                        for it. Also check all handlers each timer
                        interrupt. Intended to get systems with badly broken
                        firmware running.

So, it sounds like your BIOS is assigning interrupts in a bad way.  Do you 
have the latest BIOS upgrade for your motherboard?
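
The first sentence of the irqpoll description quoted above can be pictured with a toy model. This is a simplified sketch and not kernel code: the `Device` class, the `dispatch` function, and the misrouted-NIC scenario are all assumptions made for illustration, and the periodic check on each timer interrupt is not modelled.

```python
class Device:
    """Hypothetical device with a single 'interrupt pending' flag."""

    def __init__(self, name):
        self.name = name
        self.pending = False

    def handler(self):
        """Return True if this device had raised the interrupt, and clear it."""
        if self.pending:
            self.pending = False
            return True
        return False


def dispatch(irq, handlers, irqpoll=False):
    """Deliver one interrupt on `irq`; return True if any handler claimed it.

    `handlers` maps an IRQ number to the list of handler callables
    registered on that line.
    """
    # Normal path: run every handler registered on the line that fired.
    if any([h() for h in handlers.get(irq, [])]):
        return True
    if irqpoll:
        # irqpoll path: nobody on this line claimed the interrupt, so try
        # the handlers registered on every other line as well.
        for other_irq, registered in handlers.items():
            if other_irq != irq and any([h() for h in registered]):
                return True
    return False


if __name__ == "__main__":
    nic = Device("eth0")
    handlers = {11: [nic.handler]}   # driver registered on IRQ 11

    nic.pending = True               # but the interrupt arrives on IRQ 10
    print(dispatch(10, handlers))                # nobody on IRQ 10 claims it
    print(dispatch(10, handlers, irqpoll=True))  # found by searching other lines
```

In this model, an interrupt that arrives on a line where the responsible driver is not registered goes unclaimed unless the wider search is enabled, which is consistent with irqpoll masking a routing problem rather than fixing it.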

Comment 22 Alexander Laamanen 2006-08-08 08:10:10 UTC
I upgraded to the latest firmware from Sun:
- ILOM 1.0.1
- BIOS 31
- LSI MPT SAS firmware 1.08.01 and MPT BIOS 6.04.07.
but the installation still fails.


Comment 23 John W. Linville 2006-08-08 13:18:43 UTC
Does "irqpoll" still work for you (as in comment 18)?

Comment 24 Alexander Laamanen 2006-08-09 07:29:50 UTC
Yes, "irqpoll" still works with the upgraded firmware.

Comment 25 Dan Carpenter 2006-08-09 08:17:38 UTC
I work at Penguin, so Mark Voelker and I debugged comment #16 via email, and it
turned out to be caused by bug 193937.  That has a straightforward
workaround.

I really doubt this is a NIC driver bug.  I think it's something at a lower
level.  My guess is that mptscsi broke because the new kernel changed the timing
or something.  What does the output from /proc/interrupts look like?

It's weird that the bugs are so similar when the screen shot shows that Norman
Elton's system is running a UP kernel and Alexander Laamanen's system is running
the SMP kernel.

I believe the Sun x4100 is an AMD64 system. Does this work if you use a 64-bit
OS instead of your current 32-bit OS?


Comment 26 Alexander Laamanen 2006-08-09 13:54:06 UTC
I did some further testing on the "irq 10: nobody cared!" / "Disabling IRQ #10"
issue. For us it happens immediately if a 32-bit UP kernel is used. When IRQ 10
is disabled, /proc/interrupts shows that it received 100000 interrupts.
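
The count of 100000 is consistent with how the 2.6 kernel's spurious-interrupt check is usually described: it looks at interrupts in batches of 100,000 per line and disables the line if more than 99,900 of a batch went unhandled. The sketch below models that bookkeeping. The class name `IrqStats` is hypothetical and the two constants are quoted from memory of the kernel sources, so treat them as assumptions to verify against the actual kernel tree.

```python
class IrqStats:
    """Toy model of per-IRQ bookkeeping for spurious-interrupt detection."""

    WINDOW = 100000           # interrupts per evaluation batch (assumed)
    UNHANDLED_LIMIT = 99900   # unhandled count that triggers disabling (assumed)

    def __init__(self):
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt; disable the line if a whole batch was bad."""
        if self.disabled:
            return
        if not handled:
            self.unhandled += 1
        self.count += 1
        if self.count < self.WINDOW:
            return
        # A full batch has been seen: evaluate it, then start a new one.
        if self.unhandled > self.UNHANDLED_LIMIT:
            self.disabled = True   # the "nobody cared!" case
        self.count = 0
        self.unhandled = 0


if __name__ == "__main__":
    stats = IrqStats()
    delivered = 0
    # A "screaming" line: every interrupt goes unhandled.
    while not stats.disabled:
        stats.note_interrupt(handled=False)
        delivered += 1
    print(delivered)
```

Under these assumptions, a line on which no interrupt is ever claimed is shut off after exactly one batch, while a line with occasional unhandled interrupts is left alone.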

The 32-bit SMP, 64-bit UP and 64-bit SMP kernels seem to work OK (at least they
don't disable the interrupt immediately).

I also found out that the e1000_watchdog is able to re-enable the NIC after the
IRQ has been disabled.

Comment 27 Alexander Laamanen 2006-08-11 11:32:56 UTC
I just tested the latest EL4 Update 4 kernel (2.6.9-42) and it seems to fix this
issue. No problems anymore with the 32bit UP kernel.

Comment 28 John W. Linville 2006-08-11 15:05:20 UTC
Is the same true for others?  Norman?

Comment 29 Steve Snodgrass 2006-08-18 17:12:41 UTC
FYI, I just saw this same problem trying to kickstart a Sun X4100 using the
RHEL4 Update 3 ISO (32-bit).  I'm going to try the irqpoll workaround for the
moment, but I have another machine to install, for which I will use the Update 4
CD.  I'll report back the results.

Comment 30 Steve Snodgrass 2006-08-18 19:00:10 UTC
OK I can confirm both items:

1. The "irqpoll" keyword allows me to successfully kickstart when booting with
the 32-bit RHEL4 Update 3 ISO.

2. I can kickstart fine when booting with RHEL4 Update 4 without any special
keywords.

Comment 31 Norman Elton 2006-08-21 13:18:05 UTC
I can confirm that rebuilding the box with the RHEL 64-bit kernel has fixed our
problem.

Comment 32 John W. Linville 2006-08-21 17:16:59 UTC
It sounds like this is CURRENTRELEASE.