Description of problem:
My system freezes once in a while. When I manage to catch the panic message, handle_edge_irq seems to be implicated. (I normally use X, so I cannot normally see the panic message. To observe it, I have intentionally run with a text console and used another machine as an X server to access the applications.)

This is an Athlon X2 system with an ATI chipset. It has an nVidia video card using the open-source nv driver.

First captured panic: during installation, progress stalled during "checking dependencies". I switched to the console screen. Shortly afterwards, a panic appeared. Note that the console had only 25 lines, so some of the text was probably lost. The report is here:
https://www.redhat.com/archives/fedora-list/2007-June/msg01076.html
Top of call stack:
  handle_edge_irq+0x5c/0x128
I managed to install by using "maxcpus=1".

Second panic: after installation, during "normal" use (web browsing on X server). Only a 25-line console. I think that I saw "unable to handle null pointer deref" scroll off the screen. I will attach a picture of the console, "cimg0254.jpg".
Top of call stack 0:
  _raw_spin_lock+0xc5/0xeb
  _spin_lock+0x2d/0x31
  handle_edge_irq+0x10d/0x135
Top of call stack 1:
  __trigger_all_cpu_backtrace+0x71/0x92
  _raw_spin_lock+0xca/0xeb
  handle_edge_irq+0x10d/0x135

Third panic: using kernel-debug, during normal use. More lines in the console, but still not enough. The stack seems similar; see picture cimg0256.jpg (and cimg0257.jpg, which captures the right columns of the screen, including part of the list of loaded modules).

Version-Release number of selected component (if applicable):
kernel-2.6.21-1.3194.fc7.x86_64
kernel-debug-2.6.21-1.3194.fc7.x86_64

How reproducible:
Takes time, but happens too often.

Additional info:
I'm using the system to type this. maxcpus=1 seems to prevent the problem (can't be sure).
Created attachment 156305 [details] screen capture of second panic
Created attachment 156306 [details] screen capture of third panic
Created attachment 156307 [details] right portion of screen dump of third panic
Created attachment 156308 [details]
dmesg output

Note the BUG in the dmesg output. This seems to be an example of https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=240982
Created attachment 157212 [details]
result of kdump/crash(8) from an instance of this panic

I fed this script to crash(8):
  bt -a
  bt -a -l -f
  log -m
  irq
  q
Created attachment 157213 [details] another kdump/crash(8) result
What does /proc/interrupts look like after the system has been running for a while?
Created attachment 157515 [details] Another panic screen Happened during boot
Created attachment 157516 [details] Yet another panic screen Happened during boot
The symptoms of this problem sound just like the problem I am having. Normally the system just freezes and no kernel messages are logged. I have to do a hard power reset to recover. The freezes always seem to happen either during or shortly after a yum update or install. Sometimes when one occurs, I have many repeats shortly after I boot up. The above two panics appeared during boot after a freeze triggered by an update, which is why I was able to get a picture.
Philip: Your crashes look different from mine (but I'm not an expert).

Your first crash seems to be in ACPI code. Have you tried booting with the kernel parameter acpi=off? ACPI code seems to get into trouble often enough that there is a rich literature on avoiding it with kernel parameters. acpi=off is the most blunt.

I'm not sure what code crashed in your second case. The stack dump is short and the few routines on it seem to be for dumping -- a recursive failure?

In both cases it looks like the kernel code tried to dereference an invalid (but not NULL) pointer.

You don't mention what hardware you are using. My problems are in X86_64 and I think that yours must be i386. My problem is only with a dual-core CPU (I can cure it with maxcpus=1) but I think that you are using a single-core CPU.

If you get far enough along, consider using kdump and crash(8).
https://www.redhat.com/archives/fedora-list/2007-June/msg03592.html

You might want to look at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242369

Good luck!
(In reply to comment #6)
> Created an attachment (id=157213) [edit]
> another kdump/crash(8) result

Hugh, what is in /proc/interrupts after the system has been running for a while?

Try disabling irqbalance (service irqbalance stop) and see if it helps.
[It took a while to crash again.]
[I'm at OLS this week -- are you?]

Thu Jun 28 13:37:47 EDT 2007
           CPU0       CPU1
  0:  295747891    2707466   <NULL>-edge      timer
  1:       9271      11788   IO-APIC-edge     i8042
  7:    2704331  295728662   IO-APIC-edge     parport0
  8:          0          0   IO-APIC-edge     rtc
  9:          0          0   IO-APIC-fasteoi  acpi
 12:      43985     345016   IO-APIC-edge     i8042
 14:     997473     341068   IO-APIC-edge     libata
 15:          0          0   IO-APIC-edge     libata
 17:          1          0   IO-APIC-fasteoi  ATI IXP
 19:     316108    3364718   IO-APIC-fasteoi  ohci_hcd:usb1, ohci_hcd:usb2, ehci_hcd:usb3
 20:         71     887867   IO-APIC-fasteoi  eth0
 21:    6732120   66707796   IO-APIC-fasteoi  fw_ohci, bttv0
 22:      77780     188100   IO-APIC-fasteoi  libata
NMI:          0          0
LOC:  298418709  298418523
ERR:       2342

I find it very interesting that the counts for IRQs 0 and 7 on CPU0 are very close to the counts for IRQs 7 and 0 on CPU1 (i.e. reversed). Note that I have nothing on the parallel port.

I have not disabled irqbalance.

I will try to attach the corresponding dump.
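For anyone who wants to repeat the comparison of the IRQ 0 and IRQ 7 counters on their own machine, here is a minimal sketch in Python. It is only an illustration: the helper names (parse_counts, relative_gap) are made up for this example, and the sample text holds just the two relevant rows copied from the listing in this comment. On a live system the text would presumably come from reading /proc/interrupts instead.

```python
# Two rows copied from the /proc/interrupts listing in this comment.
SAMPLE = """\
           CPU0       CPU1
  0:  295747891    2707466   <NULL>-edge      timer
  7:    2704331  295728662   IO-APIC-edge     parport0
"""


def parse_counts(text):
    """Return {irq_label: [count_on_cpu0, count_on_cpu1, ...]}.

    Hypothetical helper: the header row gives the number of CPUs, and
    rows with fewer numeric fields than CPUs (such as ERR) are skipped.
    """
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        per_cpu = rest.split()[:ncpus]
        if len(per_cpu) == ncpus and all(f.isdigit() for f in per_cpu):
            counts[label.strip()] = [int(f) for f in per_cpu]
    return counts


def relative_gap(a, b):
    """Difference between two counters as a fraction of the larger one."""
    largest = max(a, b)
    return abs(a - b) / largest if largest else 0.0


counts = parse_counts(SAMPLE)
# Compare IRQ 0 on CPU0 with IRQ 7 on CPU1, and the reverse pairing.
gap_a = relative_gap(counts["0"][0], counts["7"][1])
gap_b = relative_gap(counts["7"][0], counts["0"][1])
print(f"IRQ0@CPU0 vs IRQ7@CPU1 differ by {gap_a:.4%}")
print(f"IRQ7@CPU0 vs IRQ0@CPU1 differ by {gap_b:.4%}")
```

A small relative gap for both pairings is what "reversed" means here; the sketch does not say anything about why the counters mirror each other.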
Created attachment 158321 [details] another kdump/crash(8) result Fresh crash, fresh crash output. This one was very quick (uptime about 8 minutes). I was using mplayer. At the request of Eric Biederman, I included a disassembly of the handle_edge_irq routine (the one that is faulting).
Notice the APIC errors in the kernel log. These seem suspicious, but I don't think that they are the problem:

(1) The APIC errors have happened on my machine since I got it (2006 January). This kernel problem started perhaps 2007 March (under FC6); certainly not for the first year I had it.

(2) Dave Jones told me that he sees these messages fairly often, especially on systems with ATI chipsets, and they seem to be harmless. This machine has an ATI chipset.
With a complete crash dump available, we should be able to dump the entire kernel stack at the time of the crash. Can we get that information?
re #16: I have retained several of the most recent crash dumps. What crash command would you like me to issue?

These are the commands I'm issuing now:
  bt -a
  bt -a -l -f
  log -m
  irq
  q
Created attachment 158404 [details] kdump/crash(8) output; irqbalance was disabled Another crash. This one with irqbalance disabled. I added "task" and "sys" crash commands.
(In reply to comment #13)
> Thu Jun 28 13:37:47 EDT 2007
>            CPU0       CPU1
>   0:  295747891    2707466   <NULL>-edge      timer
                               ^^^^^^
Well that's strange. It's the timer interrupt and it has no recognized handler type.

And dmesg says:

<3>..MP-BIOS bug: 8254 timer not connected to IO-APIC

Can you boot with kernel option "apic=debug" and post the boot messages?

Also some of these kernel options might change the behavior:

  disable_8254_timer/enable_8254_timer
  disable_timer_pin_1/enable_timer_pin_1
[This is from Eric Biederman. Bugzilla isn't cooperating with him so I'm transcribing this from email, with his permission]

It looks like it never completed the irq_chip restructuring, and so something is getting confused and we are walking off a NULL pointer in mask_ack_irq. Although the fact that we decide to mask the timer interrupt is odd in and of itself.

So it just should be a matter of cleaning up the lapic_irq_type in arch/x86_64/kernel/ioapic.c to fix this.

There has been a little work done on this in the most recent kernels, but I suspect the problem still persists. Could you verify that the problem is still present in 2.6.22-rc7?

If so I will see if I can cook up a trivial patch to sort this out.
Created attachment 158606 [details] dmesg after boot with apic=debug as requested in #19
(In reply to comment #21)
> Created an attachment (id=158606) [edit]
> dmesg after boot with apic=debug
>
> as requested in #19

Well, I'm in over my head now. Did Eric look at this, and the <NULL> IRQ handler type I pointed out in comment #19?
Created attachment 158724 [details] kdump/crash(8) results for vanilla 2.6.22-rc7 Eric asked me to test vanilla 2.6.22-rc7. This is a crash(8) analysis of an oops from the vanilla kernel -- looks the same to me. I've modified crash(8) so that the irq command works. This log has the (long!) output from that command. I still have most of these kdumps so I can do further analysis as directed.
Created attachment 160214 [details]
kdump/crash(8) output from newest F7 kernel, kernel-2.6.22.1-33.fc7

The IRQ dump is shorter (I used the new -u flag in crash's irq command). Still the same problem with the new kernel. No word from Eric.
I am also seeing this problem with kernel-2.6.22.1-41.fc7.x86_64.
(In reply to comment #25)
> I also meet this problem in kernel-2.6.22.1-41.fc7.x86_64.

Try disabling irqbalance.

Also, try forcing IRQ 0 to CPU 0:

# echo "1" >/proc/irq/0/smp_affinity
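For readers unfamiliar with the value being written: /proc/irq/N/smp_affinity takes a hexadecimal bitmask of CPUs, where bit 0 stands for CPU 0, bit 1 for CPU 1, and so on, so "1" selects CPU 0 only. A rough Python sketch of that encoding follows. The function names are invented for illustration and the code only does the arithmetic; it does not touch /proc.

```python
def affinity_mask(cpus):
    """Build the hex string for an smp_affinity write from CPU numbers.

    Hypothetical helper for illustration: sets one bit per CPU number.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def cpus_in_mask(mask_hex):
    """Decode a hex affinity mask back into a sorted list of CPU numbers.

    Commas are dropped because the kernel prints wide masks as
    comma-separated 32-bit groups.
    """
    mask = int(mask_hex.replace(",", ""), 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(affinity_mask([0]))      # mask for CPU 0 only
print(affinity_mask([0, 1]))   # mask for both cores of a dual-core CPU
print(cpus_in_mask("2"))       # CPUs selected by mask 2
```

On the dual-core machine discussed in this bug, the only meaningful masks would be 1 (CPU 0), 2 (CPU 1) and 3 (both).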
(In reply to comment #26)
> (In reply to comment #25)
> > I also meet this problem in kernel-2.6.22.1-41.fc7.x86_64.
>
> Try disabling irqbalance.
>
> Also, try forcing IRQ 0 to CPU 0:
>
> # echo "1" >/proc/irq/0/smp_affinity

The change failed:

# echo "1" >/proc/irq/0/smp_affinity
-bash: echo: write error: Input/output error
#
Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the Fedora kernel. http://fedoraproject.org/wiki/KernelBugTriage I am CC'ing myself to this bug and will try and assist you in resolving it if I can. There hasn't been much activity on this bug for a while. Could you tell me if you are still having problems with the latest kernel? If the problem no longer exists then please close this bug or I'll do so in a few days if there is no additional information lodged.
I don't seem to experience this in 2.6.23.8-34.fc7.x86_64. I've been running it for almost a month without a problem.

I don't know exactly when the problem was fixed. For some time kdump was not working, so I didn't bother running in the vulnerable mode (i.e. I ran with maxcpus=1). I managed to get kdump fixed (see https://bugzilla.redhat.com/show_bug.cgi?id=399731#c7). Once that happened, I removed maxcpus=1, fully expecting a crash. I'm still waiting.

Summary: the problem must have been fixed, but I don't know how.