Bug 227961 - 2.6.19 and above, hard hang on SMP _x86 (do_IRQ vector)
Summary: 2.6.19 and above, hard hang on SMP _x86 (do_IRQ vector)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard: bzcl34nup
Depends On:
Blocks:
Reported: 2007-02-09 04:07 UTC by Naoki
Modified: 2008-05-07 01:08 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-07 01:08:59 UTC
Type: ---
Embargoed:


Description Naoki 2007-02-09 04:07:41 UTC
Description of problem:
System hang (hard) - Requires power cycle.

System fails with this error message:
"do_IRQ: 0.177 No irq handler for vector"

This happens only on SMP boxes.  I've tested on a four-core IBM x3550, and have
also seen it on a Dell Optiplex with a two-core Intel CPU.  Both were running
x86_64 installs.

This has been reported on fedora-devel and fedora-list.
It is an upstream kernel problem; there is a discussion of the issue here:
http://lkml.org/lkml/2007/1/22/121

Version-Release number of selected component (if applicable):
2.6.19 kernel and above. Seen under both kernel-2.6.20-1.2922.fc7 &
kernel-2.6.19-1.2895.fc6.

How reproducible:
Consistently.

Additional info:
There seems to be a workaround: disabling "irqbalance".  I've done this and am
running 2.6.20-1.2922.fc7, seemingly OK for now, but will keep an eye on it and
run some performance tests.
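
One way to "keep an eye on it" is to compare two snapshots of /proc/interrupts
and see which CPUs are actually servicing each IRQ. The following is a minimal
sketch, not part of the original report; the function names are hypothetical,
and it assumes the usual layout of a header row of CPU names followed by one
row of per-CPU counters per IRQ.

```python
def parse_interrupts(text):
    """Map each IRQ label to its list of per-CPU counts."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:ncpus]
        # Skip rows such as "ERR:" that do not carry one counter per CPU.
        if len(fields) == ncpus and all(f.isdigit() for f in fields):
            table[label.strip()] = [int(f) for f in fields]
    return table


def cpus_servicing(before, after):
    """For each IRQ, list the CPUs whose count grew between two snapshots."""
    result = {}
    for label, new in after.items():
        old = before.get(label, [0] * len(new))
        result[label] = [cpu for cpu, (a, b) in enumerate(zip(old, new)) if b > a]
    return result


# Usage (on a live system): read /proc/interrupts twice, a few seconds apart.
# first = parse_interrupts(open("/proc/interrupts").read())
# second = parse_interrupts(open("/proc/interrupts").read())
# print(cpus_servicing(first, second))
```

With irqbalance stopped, one would expect each IRQ to keep being serviced by
the same CPU(s) from one pair of snapshots to the next.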

Comment 1 Chuck Ebbert 2007-02-09 19:16:23 UTC
Bug 225399 is related to this one.


Comment 2 Naoki 2007-02-15 08:32:34 UTC
Same issue with 2.6.20-1.2925.fc7.

Comment 3 Naoki 2007-02-16 02:55:58 UTC
The 100% effective workaround for this is to disable the IRQBalance service.
Enabling it kills the system (for me) in only a few minutes.

There is an upstream patch (as discussed in the linked thread), and this ticket
will stay open until that makes its way into the FC kernel.

Sadly, 2.6.20-1.2930.fc7 has just failed to boot for me, so I am unable to test it.

Comment 4 Bill Nottingham 2007-03-02 17:42:51 UTC
Moving to 'devel' as discussed on
https://www.redhat.com/archives/fedora-devel-list/2007-March/msg00095.html.

Comment 5 tibyke 2007-06-01 13:48:21 UTC
I had this very issue on an HP DL385g1 box with two dual-core AMD processors,
running Fedora Core 6 with kernel version 2.6.19-1.2895.fc6 (x86_64).

This was the last message on the login prompt:
do_IRQ: 1.57 No irq handler for vector
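
The message is easier to compare across machines once the two numbers are
separated. The sketch below is an illustration, not something from this
report: it assumes the pair is "<cpu>.<vector>", which is how the x86_64
do_IRQ code of that era appears to print it, and the function name is
hypothetical.

```python
import re

# Assumed layout: "do_IRQ: <cpu>.<vector> No irq handler for vector".
# The colon is optional because the first report quotes the message without it.
PATTERN = re.compile(r"do_IRQ:?\s+(\d+)\.(\d+)\s+No irq handler for vector")


def parse_unhandled_vector(line):
    """Return (cpu, vector) from a do_IRQ complaint, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_unhandled_vector("do_IRQ: 1.57 No irq handler for vector"))
```

Under that assumption, the message above would mean CPU 1 received vector 57
with no handler registered for it.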

00:03.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:04.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:04.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:04.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:07.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:07.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:08.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:08.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:02.0 System peripheral: Compaq Computer Corporation Integrated Lights Out Controller (rev 01)
01:02.2 System peripheral: Compaq Computer Corporation Integrated Lights Out  Processor (rev 01)
01:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
02:04.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)
03:06.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
03:06.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
04:09.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
04:09.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
04:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
04:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
06:09.0 RAID bus controller: Hewlett-Packard Company Smart Array P600

t

Comment 6 Michael Nielsen 2007-09-11 13:45:23 UTC
I think this is related to this bug.  I have an ASUS P5ND2 with a quad-core
2.4 GHz Intel processor and 4 GB RAM.  I've been having apparent problems
related to do_IRQ ever since I got this machine.  I've already turned off
irqbalance and tried setting smp_affinity, all to no avail.  I'm running
Fedora 7, not the development version.
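
For readers unfamiliar with smp_affinity: /proc/irq/<n>/smp_affinity takes a
hexadecimal CPU bitmask, where bit N set means CPU N may service that IRQ. The
comment does not say which masks were tried, so the sketch below only shows
how such a mask is built; the function name is hypothetical, and it assumes a
machine with few enough CPUs that the mask fits in a single hex word.

```python
def affinity_mask(cpus):
    """Build the hex CPU bitmask string used by /proc/irq/<n>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # bit N set => CPU N may service the IRQ
    return format(mask, "x")


print(affinity_mask([0]))     # 1: CPU 0 only
print(affinity_mask([2, 3]))  # c: CPUs 2 and 3
```

The resulting string is what gets written (as root) to the smp_affinity file
of the IRQ in question. Note that a running irqbalance may rewrite it.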

I am running two 500 GB disks in a RAID stripe, using the RAID hardware on the
motherboard.

The hardware runs smoothly for hours on end with XP Service Pack 2, which makes
me think it may not be a hardware fault.

However, the system rarely survives heavy network traffic combined with disk
I/O (it acts as a file server) while also being used as a workstation.

The symptoms are generally as follows.  The system starts to slow down, then X
stops responding to events, though the mouse still works.  Network services
keep working but will not spawn any processes: ssh accepts the connection but
never lets you log in.  The system acts as if it were under extreme load (maybe
disk I/O is blocked).  SMB and NFS transfers seem to keep working for quite
some time after the event starts.  Rebooting, either by pressing the power
switch or by issuing the command, is rarely effective, even if the command can
be executed.

It is hard to get a dump of the event, because the first devices that seem to
be affected are the hard drives, so the logs are generally not recorded.  I've
managed to capture some events.  In this particular log, I managed to start a
reboot before the system got into a hard lock.

It's hard to determine whether this event is causing the problem or is just
random noise, so these logs may or may not be directly related to the hard
crash.  The log segment may just be follow-on errors; I am unable to tell.

Feel free to contact me if there is any specific thing you wish me to try.

Boot kernel info:
Sep  3 17:36:34 taipan kernel: Linux version 2.6.21-1.3194.fc7 (kojibuilder.redhat.com) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Wed May 23 22:47:07 EDT 2007



Event:
Sep  3 17:36:44 taipan smartd[3455]: smartd has fork()ed into background mode. New PID=3455.
Sep  3 17:36:45 taipan pcscd: winscard.c:219:SCardConnect() Reader E-Gate 0 0 Not Found
Sep  3 17:36:45 taipan last message repeated 3 times
Sep  3 17:36:57 taipan kernel: BUG: warning at kernel/softirq.c:138/local_bh_enable() (Tainted: P      )
Sep  3 17:36:58 taipan kernel:
Sep  3 17:36:58 taipan kernel: Call Trace:
Sep  3 17:36:58 taipan kernel:  [<ffffffff80229e7b>] local_bh_enable+0x42/0x98
Sep  3 17:36:58 taipan kernel:  [<ffffffff8025c008>] cond_resched_softirq+0x35/0x4b
Sep  3 17:36:58 taipan kernel:  [<ffffffff8022e9f5>] release_sock+0x59/0xaa
Sep  3 17:36:58 taipan kernel:  [<ffffffff8021baab>] tcp_recvmsg+0x3d1/0xadf
Sep  3 17:36:58 taipan kernel:  [<ffffffff8022f84e>] sock_common_recvmsg+0x30/0x45
Sep  3 17:36:58 taipan kernel:  [<ffffffff803e75b9>] sock_aio_read+0x10c/0x124
Sep  3 17:36:58 taipan kernel:  [<ffffffff8020c716>] do_sync_read+0xc9/0x10c
Sep  3 17:36:58 taipan kernel:  [<ffffffff80293107>] autoremove_wake_function+0x0/0x2e
Sep  3 17:36:58 taipan kernel:  [<ffffffff8020af1d>] vfs_read+0xde/0x173
Sep  3 17:36:58 taipan kernel:  [<ffffffff80210606>] sys_read+0x45/0x6e
Sep  3 17:36:58 taipan kernel:  [<ffffffff8025729c>] tracesys+0xdc/0xe1
Sep  3 17:36:58 taipan kernel:
Sep  3 17:38:18 taipan gconfd (mike-3810): starting (version 2.18.0.1), pid 3810 user 'mike'
Sep  3 17:38:18 taipan gconfd (mike-3810): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only configuration source at position 0
Sep  3 17:38:18 taipan gconfd (mike-3810): Resolved address "xml:readwrite:/home/mike/.gconf" to a writable configuration source at position 1
Sep  3 17:38:18 taipan gconfd (mike-3810): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source at position 2
Sep  3 17:40:52 taipan ntpd[3038]: synchronized to 62.75.136.76, stratum 2
Sep  3 17:40:52 taipan ntpd[3038]: kernel time sync status change 0001
Sep  3 18:02:20 taipan ntpd[3038]: synchronized to 32.112.56.88, stratum 2
Sep  3 18:19:24 taipan ntpd[3038]: synchronized to 62.75.136.76, stratum 2
Sep  3 19:00:26 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:26 taipan last message repeated 9 times
Sep  3 19:00:31 taipan kernel: printk: 2293121 messages suppressed.
Sep  3 19:00:31 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:36 taipan kernel: printk: 2359668 messages suppressed.
Sep  3 19:00:36 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:41 taipan kernel: printk: 2241271 messages suppressed.
Sep  3 19:00:41 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:46 taipan kernel: printk: 2386358 messages suppressed.
Sep  3 19:00:46 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:51 taipan kernel: printk: 2536816 messages suppressed.
Sep  3 19:00:51 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:00:56 taipan kernel: printk: 2211906 messages suppressed.
Sep  3 19:00:56 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:01 taipan kernel: printk: 2203052 messages suppressed.
Sep  3 19:01:01 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:06 taipan kernel: printk: 2209702 messages suppressed.
Sep  3 19:01:06 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:10 taipan shutdown[5416]: shutting down for system reboot
Sep  3 19:01:11 taipan kernel: printk: 2192509 messages suppressed.
Sep  3 19:01:14 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:15 taipan gconfd (mike-3810): Received signal 15, shutting down cleanly
Sep  3 19:01:15 taipan gconfd (mike-3810): Exiting
Sep  3 19:01:16 taipan kernel: printk: 2233562 messages suppressed.
Sep  3 19:01:16 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:21 taipan kernel: printk: 2090653 messages suppressed.
Sep  3 19:01:21 taipan kernel: do_IRQ: 0.167 No irq handler for vector
Sep  3 19:01:21 taipan smartd[3455]: smartd received signal 15: Terminated
Sep  3 19:01:21 taipan smartd[3455]: smartd is exiting (exit status 0)
Sep  3 19:01:21 taipan avahi-daemon[3341]: Got SIGTERM, quitting.
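
The "messages suppressed" counters give a rough idea of how fast the unhandled
interrupt was firing. The sketch below is an illustration added for this log,
not part of the original comment. It assumes each counter covers the interval
since the previous report, so the first counter is skipped because its
interval is unknown. The function name and the year default are hypothetical,
and month names are parsed with the default C/English locale.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
# "Sep  3 19:00:36 taipan kernel: printk: 2359668 messages suppressed."
SUPPRESSED = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+printk:\s+(\d+) messages suppressed"
)


def suppressed_rate(lines, year=2007):
    """Estimate dropped kernel messages per second, or None if too few reports."""
    reports = []
    for line in lines:
        match = SUPPRESSED.match(line)
        if match:
            stamp = " ".join(match.group(1).split())  # collapse padding in "Sep  3"
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            reports.append((when, int(match.group(2))))
    if len(reports) < 2:
        return None
    seconds = (reports[-1][0] - reports[0][0]).total_seconds()
    if seconds <= 0:
        return None
    dropped = sum(count for _, count in reports[1:])
    return dropped / seconds


sample = [
    "Sep  3 19:00:31 taipan kernel: printk: 2293121 messages suppressed.",
    "Sep  3 19:00:36 taipan kernel: printk: 2359668 messages suppressed.",
    "Sep  3 19:00:41 taipan kernel: printk: 2241271 messages suppressed.",
]
print(suppressed_rate(sample))
```

For the three sample lines this works out to (2359668 + 2241271) / 10 s, or
roughly 460,000 suppressed messages per second. That is consistent with an
interrupt storm rather than an occasional stray interrupt.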


Comment 7 Chuck Ebbert 2007-09-11 23:19:08 UTC
2.6.22 kernels for Fedora 7 are available now. Do they fix these problems?


Comment 8 Michael Nielsen 2007-09-12 07:46:54 UTC
Though I cannot be entirely certain that the cause is the same, I have had a
hard freeze with 2.6.22.  I upgraded to 2.6.22 immediately after the event that
I recorded in my last message (I have a habit of running yum update on my
machine after a crash to see whether patches are available that may solve the
issue).

Sep  3 19:41:19 taipan kernel: Linux version 2.6.22.4-65.fc7 (kojibuilder.redhat.com) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Tue Aug 21 21:50:50 EDT 2007

[mike@taipan mike] > uname -a
Linux taipan 2.6.22.4-65.fc7 #1 SMP Tue Aug 21 21:50:50 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately I had a hard hang on the 11th of September (ironically) while
running that kernel, and I was not able to capture a log of the event, due to
the aforementioned issues.  If someone can tell me a way of getting a log
of such an event, I'm willing to try.
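
One option for logging when the disks stop first, not mentioned anywhere in
this report, is the kernel's netconsole module. It sends kernel messages over
UDP to another machine, and 6666 is its default target port. The sketch below
is a minimal receiver for that second machine. Whether netconsole is available
in these Fedora kernels and works with this NIC is an assumption, and the
function names and log path are hypothetical.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on the port netconsole is configured to send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_netconsole(sock, log_path, max_packets=None):
    """Append each UDP datagram received on sock to log_path."""
    received = 0
    with open(log_path, "ab") as log:
        while max_packets is None or received < max_packets:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current in case the sender hangs next
            received += 1
    return received


# Usage on the receiving machine (runs until interrupted):
# receive_netconsole(open_listener(), "netconsole.log")
```

Because the messages leave the hanging machine over the network, they do not
depend on its disks still accepting writes.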

However, the frequency of the event seems to have decreased with the new
kernel.  I'm able to run for much longer without a hang (up to several days),
whereas with the older .21 kernel the event tended to happen multiple times a
day.  Of course, if the event depends on system load, it is entirely possible
that this is because the system has not been heavily loaded during this time.



Comment 9 Bug Zapper 2008-04-03 19:03:57 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 10 Bug Zapper 2008-05-07 01:08:58 UTC
This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

