Bug 242836 - kernel panic in handle_edge_irq
Summary: kernel panic in handle_edge_irq
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 7
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-06-06 03:15 UTC by D. Hugh Redelmeier
Modified: 2008-01-09 06:52 UTC

Fixed In Version: 2.6.23.8-34.fc7.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-01-09 06:52:34 UTC
Type: ---
Embargoed:


Attachments
screen capture of second panic (2.07 MB, image/jpeg)
2007-06-06 03:15 UTC, D. Hugh Redelmeier
screen capture of third panic (2.09 MB, image/jpeg)
2007-06-06 03:16 UTC, D. Hugh Redelmeier
right portion of screen dump of third panic (1.96 MB, image/jpeg)
2007-06-06 03:19 UTC, D. Hugh Redelmeier
dmesg output (24.96 KB, text/plain)
2007-06-06 03:24 UTC, D. Hugh Redelmeier
result of kdump/crash(8) from an instance of this panic (41.21 KB, text/plain)
2007-06-17 04:00 UTC, D. Hugh Redelmeier
another kdump/crash(8) result (39.98 KB, text/plain)
2007-06-17 04:01 UTC, D. Hugh Redelmeier
Another panic screen (95.71 KB, image/jpeg)
2007-06-21 04:51 UTC, Philip Walden
Yet another panic screen (88.88 KB, image/jpeg)
2007-06-21 04:52 UTC, Philip Walden
another kdump/crash(8) result (47.84 KB, text/plain)
2007-07-02 06:02 UTC, D. Hugh Redelmeier
kdump/crash(8) output; irqbalance was disabled (47.54 KB, text/plain)
2007-07-03 05:30 UTC, D. Hugh Redelmeier
dmesg after boot with apic=debug (25.23 KB, text/plain)
2007-07-05 18:03 UTC, D. Hugh Redelmeier
kdump/crash(8) results for vanilla 2.6.22-rc7 (1.44 MB, text/plain)
2007-07-07 18:42 UTC, D. Hugh Redelmeier
kdump/crash(8) output from newest F7 kernel, kernel-2.6.22.1-33.fc7 (70.70 KB, text/plain)
2007-07-30 04:43 UTC, D. Hugh Redelmeier

Description D. Hugh Redelmeier 2007-06-06 03:15:03 UTC
Description of problem:

My system freezes once in a while.  When I manage to catch the panic message,
handle_edge_irq seems to be implicated.  (Normally I use X so I cannot normally
see the panic message.  To observe the message, I've intentionally run with a
text console and used another machine as an X server to access the applications.)

This is an Athlon X2 system with an ATI chipset.  It has an nVidia video
card using the open-source nv driver.

First captured panic: during installation, progress stalled during "checking
dependencies".  I switched to the console screen.  Shortly later, a panic
appeared.  Note that the console had only 25 lines so some of the text was
probably lost.  The report is here:
https://www.redhat.com/archives/fedora-list/2007-June/msg01076.html
Top of call stack: handle_edge_irq+0x5c/0x128

I managed to install by using "maxcpus=1".

Second panic: after installation, during "normal" use (web browsing on X
server).  The console had only 25 lines.  I think that I saw "unable to handle
null pointer deref" scroll off the screen.  I will attach a picture of the
console, "cimg0254.jpg".
Top of call stack 0:
  _raw_spin_lock+0xc5/0xeb
  _spin_lock+0x2d/0x31
  handle_edge_irq+0x10d/0x135
Top of call stack 1:
  __trigger_all_cpu_backtrace+0x71/0x92
  _raw_spin_lock+0xca/0xeb
  handle_edge_irq+0x10d/0x135

Third panic: using kernel-debug, during normal use.  More lines in console, but
still not enough.  The stack seems similar; see picture cimg0256.jpg (and
cimg0257.jpg, which captures the right columns of the screen, including part of
the list of loaded modules).


Version-Release number of selected component (if applicable):
kernel-2.6.21-1.3194.fc7.x86_64
kernel-debug-2.6.21-1.3194.fc7.x86_64

How reproducible:
Takes time, but happens too often.

Additional info:
I'm using the system to type this.  maxcpus=1 seems to prevent the problem
(can't be sure).

Comment 1 D. Hugh Redelmeier 2007-06-06 03:15:05 UTC
Created attachment 156305 [details]
screen capture of second panic

Comment 2 D. Hugh Redelmeier 2007-06-06 03:16:53 UTC
Created attachment 156306 [details]
screen capture of third panic

Comment 3 D. Hugh Redelmeier 2007-06-06 03:19:14 UTC
Created attachment 156307 [details]
right portion of screen dump of third panic

Comment 4 D. Hugh Redelmeier 2007-06-06 03:24:56 UTC
Created attachment 156308 [details]
dmesg output

Note the BUG in the dmesg output.  This seems to be an example of
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=240982

Comment 5 D. Hugh Redelmeier 2007-06-17 04:00:47 UTC
Created attachment 157212 [details]
result of kdump/crash(8) from an instance of this panic

I fed this script to crash(8):
bt -a
bt -a -l -f
log -m
irq
q

Comment 6 D. Hugh Redelmeier 2007-06-17 04:01:32 UTC
Created attachment 157213 [details]
another kdump/crash(8) result

Comment 7 Chuck Ebbert 2007-06-19 21:59:48 UTC
What does /proc/interrupts look like after the system has been running for a while?

Comment 8 Philip Walden 2007-06-21 04:51:43 UTC
Created attachment 157515 [details]
Another panic screen

Happened during boot

Comment 9 Philip Walden 2007-06-21 04:52:52 UTC
Created attachment 157516 [details]
Yet another panic screen

Happened during boot

Comment 10 Philip Walden 2007-06-21 04:58:18 UTC
The symptoms of this problem sound just like the problem I am having. Normally
the system just freezes and no kernel messages are logged. I have to do a hard
power reset to recover.

They always seem to happen either during or shortly after a yum update or
install. Sometimes, when one occurs, I have many repeats shortly after I
boot up. The above two panics appeared during boot after an update-triggered
freeze event, which is why I was able to get a picture.

Comment 11 D. Hugh Redelmeier 2007-06-21 06:59:19 UTC
Philip:

Your crashes look different from mine (but I'm not an expert).

Your first crash seems to be in ACPI code.  Have you tried booting with the
kernel parameter acpi=off?  ACPI code seems to get into trouble often enough
that there is a rich literature on avoiding it with kernel parameters.  acpi=off
is the most blunt.

I'm not sure what code crashed in your second case.  The stack dump is short and
the few routines on it seem to be for dumping -- a recursive failure?

In both cases it looks like the kernel code tried to dereference an invalid (but
not NULL) pointer.

You don't mention what hardware you are using.  My problems are in X86_64 and I
think that yours must be i386.  My problem is only with a dual-core CPU (I can
cure it with maxcpus=1) but I think that you are using a single-core CPU.

If you get far enough along, consider using kdump and crash(8). 
https://www.redhat.com/archives/fedora-list/2007-June/msg03592.html

You might want to look at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242369

Good luck!

Comment 12 Chuck Ebbert 2007-06-21 19:48:39 UTC
(In reply to comment #6)
> Created an attachment (id=157213) [edit]
> another kdump/crash(8) result


Hugh, what is in /proc/interrupts after the system has been running for a while?

Try disabling irqbalance (service irqbalance stop) and see if it helps.




Comment 13 D. Hugh Redelmeier 2007-06-28 21:52:08 UTC
[It took a while to crash again.]  [I'm at OLS this week -- are you?]

Thu Jun 28 13:37:47 EDT 2007
           CPU0       CPU1       
  0:  295747891    2707466    <NULL>-edge      timer
  1:       9271      11788   IO-APIC-edge      i8042
  7:    2704331  295728662   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:      43985     345016   IO-APIC-edge      i8042
 14:     997473     341068   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
 17:          1          0   IO-APIC-fasteoi   ATI IXP
 19:     316108    3364718   IO-APIC-fasteoi   ohci_hcd:usb1, ohci_hcd:usb2, ehci_hcd:usb3
 20:         71     887867   IO-APIC-fasteoi   eth0
 21:    6732120   66707796   IO-APIC-fasteoi   fw_ohci, bttv0
 22:      77780     188100   IO-APIC-fasteoi   libata
NMI:          0          0 
LOC:  298418709  298418523 
ERR:       2342

I find it very interesting that the counts for 0 and 7 on CPU0 are very close to
the counts for 7 and 0 on CPU1 (i.e. reversed).  Note that I have nothing on the
parallel port.  I have not disabled irqbalance.  I will try to attach the
corresponding dump.
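
The "reversed counts" observation can be checked mechanically.  What follows is
only a rough sketch, not anything used in this investigation: it parses a few
lines copied from the table above and looks for pairs of IRQs whose per-CPU
counts mirror each other.  The function names and the 1% tolerance are
assumptions made for illustration, and the parser assumes exactly two CPU
columns.

```python
import re

# Three lines copied from the /proc/interrupts output above.
SAMPLE = """\
  0:  295747891    2707466    <NULL>-edge      timer
  7:    2704331  295728662   IO-APIC-edge      parport0
 12:      43985     345016   IO-APIC-edge      i8042
"""


def parse_counts(text):
    """Return {irq: (cpu0_count, cpu1_count)} for numbered IRQ lines only."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+(\d+)\s+(\d+)\s", line)
        if m:
            counts[int(m.group(1))] = (int(m.group(2)), int(m.group(3)))
    return counts


def mirrored_pairs(counts, tolerance=0.01):
    """Find IRQ pairs (a, b) where a's CPU0 count is close to b's CPU1 count
    and a's CPU1 count is close to b's CPU0 count.  The tolerance is a
    hypothetical threshold, relative to the larger of the two counts."""

    def close(x, y):
        biggest = max(x, y)
        return biggest > 0 and abs(x - y) <= tolerance * biggest

    irqs = sorted(counts)
    pairs = []
    for i, a in enumerate(irqs):
        for b in irqs[i + 1:]:
            if close(counts[a][0], counts[b][1]) and close(counts[a][1], counts[b][0]):
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(mirrored_pairs(parse_counts(SAMPLE)))
```

On the sample lines this reports only the pair (0, 7), which matches what I
noticed by eye.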

Comment 14 D. Hugh Redelmeier 2007-07-02 06:02:23 UTC
Created attachment 158321 [details]
another kdump/crash(8) result

Fresh crash, fresh crash output.
This one was very quick (uptime about 8 minutes).  I was using mplayer.
At the request of Eric Biederman, I included a disassembly of the
handle_edge_irq routine (the one that is faulting).

Comment 15 D. Hugh Redelmeier 2007-07-02 06:08:05 UTC
Notice the APIC errors in the kernel log.  These seem suspicious, but I don't
think that they are the problem:

(1) The APIC errors have happened on my machine since I got it (2006 January). 
This kernel problem started perhaps 2007 March (under FC6); certainly not for
the first year I had it.

(2) Dave Jones told me that he sees these messages fairly often, especially on
systems with ATI chipsets, and they seem to be harmless.  This machine has an
ATI chipset.

Comment 16 Chuck Ebbert 2007-07-02 22:40:20 UTC
With a complete crash dump available, we should be able to dump the entire
kernel stack at the time of the crash. Can we get that information?

Comment 17 D. Hugh Redelmeier 2007-07-03 02:18:05 UTC
re #16: I have retained several of the most recent crash dumps.  What crash
command would you like me to issue?  These are the commands I'm issuing now:
  bt -a
  bt -a -l -f
  log -m
  irq
  q


Comment 18 D. Hugh Redelmeier 2007-07-03 05:30:05 UTC
Created attachment 158404 [details]
kdump/crash(8) output; irqbalance was disabled

Another crash, this one with irqbalance disabled.  I added "task" and "sys"
crash commands.

Comment 19 Chuck Ebbert 2007-07-03 22:03:15 UTC
(In reply to comment #13)
 
> Thu Jun 28 13:37:47 EDT 2007
>            CPU0       CPU1       
>   0:  295747891    2707466    <NULL>-edge      timer

Well that's strange.            ^^^^^^
It's the timer interrupt and it has no recognized handler type.
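
For anyone who wants to check their own machine for the same symptom, here is a
minimal sketch that scans /proc/interrupts-style text for IRQs whose handler
type column starts with "<NULL>".  It is an illustration only: the function
name is made up, and it assumes the column layout quoted above (IRQ number,
one count per CPU, then the handler type).

```python
# Two lines in the same layout as the /proc/interrupts output quoted above.
SAMPLE = """\
  0:  295747891    2707466    <NULL>-edge      timer
  1:       9271      11788   IO-APIC-edge      i8042
"""


def null_handler_irqs(text, ncpus=2):
    """Return the numbered IRQs whose handler-type field starts with '<NULL>'.

    Assumes each numbered line is: 'N:', then ncpus count columns, then the
    handler type.  Lines such as NMI:/LOC:/ERR: are skipped.
    """
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= 1 + ncpus:
            continue
        irq = fields[0].rstrip(":")
        if irq.isdigit() and fields[1 + ncpus].startswith("<NULL>"):
            flagged.append(int(irq))
    return flagged


if __name__ == "__main__":
    print(null_handler_irqs(SAMPLE))
```

To run it against a live system, pass the contents of /proc/interrupts and the
real CPU count instead of SAMPLE.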

And dmesg says:

    <3>..MP-BIOS bug: 8254 timer not connected to IO-APIC

Can you boot with kernel option "apic=debug" and post the boot messages?

Also some of these kernel options might change the behavior:

disable_8254_timer/enable_8254_timer

disable_timer_pin_1/enable_timer_pin_1



Comment 20 D. Hugh Redelmeier 2007-07-04 14:41:17 UTC
[This is from Eric Biederman.  Bugzilla isn't cooperating with him so I'm
transcribing this from email, with his permission]

It looks like it never completed the irq_chip restructuring, and so
something is getting confused and we are walking off a NULL pointer in
mask_ack_irq. 

Although the fact that we decide to mask the timer interrupt is odd in
and of itself.

So it just should be a matter of cleaning up the lapic_irq_type in
arch/x86_64/kernel/ioapic.c to fix this. 


There has been a little work done on this in the most recent kernels,
but I suspect the problem still persists.

Could you verify that the problem is still present in 2.6.22-rc7?
If so I will see if I can cook up a trivial patch to sort this out.


Comment 21 D. Hugh Redelmeier 2007-07-05 18:03:57 UTC
Created attachment 158606 [details]
dmesg after boot with apic=debug

as requested in #19

Comment 22 Chuck Ebbert 2007-07-05 21:25:37 UTC
(In reply to comment #21)
> Created an attachment (id=158606) [edit]
> dmesg after boot with apic=debug
> 
> as requested in #19

Well, I'm in over my head now.
Did Eric look at this, and the <NULL> IRQ handler type I pointed out in
comment #19 ?



Comment 23 D. Hugh Redelmeier 2007-07-07 18:42:15 UTC
Created attachment 158724 [details]
kdump/crash(8) results for vanilla 2.6.22-rc7

Eric asked me to test vanilla 2.6.22-rc7.  This is a crash(8) analysis of an
oops from the vanilla kernel -- looks the same to me.

I've modified crash(8) so that the irq command works.  This log has the (long!)
output from that command.

I still have most of these kdumps so I can do further analysis as directed.

Comment 24 D. Hugh Redelmeier 2007-07-30 04:43:42 UTC
Created attachment 160214 [details]
kdump/crash(8) output from newest F7 kernel, kernel-2.6.22.1-33.fc7

The IRQ dump is shorter (I used the new -u flag in crash's irq command).
Still the same problem with the new kernel.

No word from Eric.

Comment 25 han pingtian 2007-08-08 00:07:44 UTC
I also meet this problem in kernel-2.6.22.1-41.fc7.x86_64.

Comment 26 Chuck Ebbert 2007-08-08 21:18:46 UTC
(In reply to comment #25)
> I also meet this problem in kernel-2.6.22.1-41.fc7.x86_64.

Try disabling irqbalance.

Also, try forcing IRQ 0 to CPU 0:

# echo "1" >/proc/irq/0/smp_affinity
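
The value written to smp_affinity is a hexadecimal bitmask with one bit per
CPU, so "1" selects CPU 0 only.  As a small sketch (the helper name is made up
for illustration), the mask for any set of CPUs can be computed like this:

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a collection of CPU numbers.

    Bit N of the mask corresponds to CPU N, so CPU 0 -> "1", CPU 1 -> "2",
    and CPUs 0 and 1 together -> "3".
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    # The mask used in the echo command above: CPU 0 only.
    print(affinity_mask([0]))
```

The helper only computes the string; writing it to /proc/irq/0/smp_affinity
still has to be done as root.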


Comment 27 han pingtian 2007-08-09 10:50:14 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > I also meet this problem in kernel-2.6.22.1-41.fc7.x86_64.
> 
> Try disabling irqbalance.
> 
> Also, try forcing IRQ 0 to CPU 0:
> 
> # echo "1" >/proc/irq/0/smp_affinity
> 

The write failed:
# echo "1" >/proc/irq/0/smp_affinity
-bash: echo: write error: Input/output error
#



Comment 28 Christopher Brown 2008-01-09 01:08:58 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Comment 29 D. Hugh Redelmeier 2008-01-09 06:45:05 UTC
I don't seem to experience this in 2.6.23.8-34.fc7.x86_64.  I've been running it
for almost a month without a problem.

I don't know exactly when the problem was fixed.  For some time kdump was not
working so I didn't bother running in the vulnerable mode (i.e. I ran with
maxcpus=1).

I managed to get kdump fixed https://bugzilla.redhat.com/show_bug.cgi?id=399731#c7

Once that happened, I eliminated the maxcpus=1, fully expecting a crash.  I'm
still waiting.

Summary: the problem must have been fixed, but I don't know how.

