Created attachment 314317 [details]
Full panic obtained through netconsole

Description of problem:
When trying to boot into the kdump kernel on a HP Proliant BL680c G5 (by issuing SysRq-c) I'm getting the following panic (full panic obtained through netconsole attached):

Aug 14 16:06:12 Call Trace:
Aug 14 16:06:12 <NMI>
Aug 14 16:06:12 [<ffffffff8810702b>] ? :hpwdt:asminline_call+0x2b/0x56
Aug 14 16:06:12 [<ffffffff88107298>] :hpwdt:hpwdt_pretimeout+0x44/0x8f
Aug 14 16:06:12 [<ffffffff8128dc98>] notifier_call_chain+0x33/0x5b
Aug 14 16:06:12 [<ffffffff8128dce2>] atomic_notifier_call_chain+0x13/0x15
Aug 14 16:06:12 [<ffffffff8104a2c0>] notify_die+0x2e/0x30
Aug 14 16:06:12 [<ffffffff8128bfa0>] default_do_nmi+0x53/0x1a1
Aug 14 16:06:12 [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Aug 14 16:06:12 [<ffffffff8128c5a2>] do_nmi+0x2e/0x43
Aug 14 16:06:12 [<ffffffff8128bb9f>] nmi+0x7f/0x90
Aug 14 16:06:12 [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Aug 14 16:06:12 [<ffffffff8100a053>] ? mwait_idle+0x0/0x45
Aug 14 16:06:12 [<ffffffff8100a093>] ? mwait_idle+0x40/0x45
Aug 14 16:06:12 <<EOE>>
Aug 14 16:06:12 [<ffffffff8100afa8>] cpu_idle+0x78/0xc0
Aug 14 16:06:12 [<ffffffff81286527>] start_secondary+0x3fc/0x40b

Version-Release number of selected component (if applicable):
kernel-2.6.25.14-108.fc9

How reproducible:
Every time

Steps to Reproduce:
1. update to kexec-tools-1.102pre-12.fc9 (Due to Bz 443878)
2. system-config-kdump
3. reboot
4. echo c > /proc/sysrq-trigger

Actual results:
panic

Expected results:
Uhm, panic.... But the kexec environment shortly thereafter (-:

Looks like the NMI watchdog tripped during boot. Can you add nmi_watchdog=0 to the kdump kernel command line and see if the problem clears up?
No, adding nmi_watchdog=0 to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump didn't make any difference. Adding it to the main kernel command line on the GRUB boot screen didn't help either )-:
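For anyone following along, the kdump-side change tried here amounts to a one-line edit in /etc/sysconfig/kdump. A minimal sketch, assuming the variable already held "irqpoll maxcpus=1" (a placeholder for whatever options were actually present on this system):

```shell
# /etc/sysconfig/kdump (fragment; the file is sourced as shell by the kdump service)
# Assumption: "irqpoll maxcpus=1" stands in for the options already configured.
# nmi_watchdog=0 is the option suggested in the comment above.
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 nmi_watchdog=0"
```

The kdump service has to be restarted (service kdump restart) so that the crash kernel is reloaded with the new command line.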
Does this system have some sort of RAC card in it (something that I assume the OS interfaces to via the hpwdt module)? Is it possible to remove this module before we start the kexec service and panic the box?
Yes, it has an iLO2. And after I rmmod hpwdt, at least it no longer panics. When I now do SysRq-c it just prints "SysRq : Trigger a crashdump" and that's it. It just hangs there.
Does it hang, or does it just stop responding through the remote console? If you're able, can you attach a real serial console to the box and verify that it's hung? If it is hung, can you record a sysrq-t from the system while it's hung?
I hooked up a real VGA console to the blade and it really is hung. And it no longer reacts to SysRq-t.
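The test sequence discussed in these two comments can be sketched as below. This is only an illustration: the run helper and the DRY_RUN switch are hypothetical additions so that pasting the block prints the steps instead of crashing the machine, and the service restart step is an assumption based on the request to unload the module before the kexec service is started.

```shell
# Sketch of the test sequence. Needs root, and with DRY_RUN=0 it crashes
# the machine on purpose. DRY_RUN defaults to 1, which only prints the steps.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

run "rmmod hpwdt"                    # unload the HP watchdog driver first
run "service kdump restart"          # (re)load the crash kernel via kexec
run "echo c > /proc/sysrq-trigger"   # same effect as SysRq-c: force a crash
```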
Can you start tracking down exactly where you are hanging? Let's start by adding early_printk=vga or earlyprintk=<serial console spec> to the kdump kernel command line. That should at least tell us if we are hanging in the second kernel or during the shutdown of the boot kernel. If that doesn't give you any information, can you start instrumenting machine_crash_shutdown? Or shall I write a patch for that?
I first tried to just add early_printk=vga but that didn't give me anything. So I proceeded to add a few printk:s to machine_crash_shutdown, and it turns out it's hanging somewhere inside lapic_shutdown().

I then tried adding "nolapic" to the kdump command line, and then it goes all the way through machine_crash_shutdown(), but after that nothing happens, i.e. it hangs again.

I also tried adding "nolapic" to the regular kernel command line, but it then hangs during bootup. The last thing that's printed is

usb-6.2: New USB device found, idVendor=93f0, idProduct=1327
usb 6-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 6-2: Product: Virtual Hub
usb 6-2: Manufacturer: HP

And that's it, there it hangs.

Well, let's see what we need to turn off to get it to boot. Perhaps we can determine what is wrong with this system if we know what we need to turn off to make it work. You've already disabled the lapic. Is it possible to disable USB as well? Either via the BIOS or the nousb option on the kdump kernel command line?
No, adding nousb (in addition to nolapic) to the kexec command line makes no difference, i.e. it goes just as far as with only nolapic.

Adding nousb (in addition to nolapic) to the main kernel command line in GRUB also makes the machine hang during startup. This time the last things on the console are

Loading cciss module
HP CISS (v 3.6.20)
ACPI: PCI Interrupt 000:08:00[A] -> Link [LNKA] -> GSI 5 (level, low) -> IRQ 5
cciss: MSI-X init failed -22
cciss0: <0x3230> at PCI 000:08:00.0 IRQ 5 using DAC
blocks= 143305920 block_size= 512
heads=255, sectors=32, cylinders=17562

ugh, cciss just underwent some major changes to support the use of cciss, and I'm not sure if they're upstream yet. Can you try a rawhide kernel?
With kernel-2.6.27-0.290.rc5.fc10 the crashkernel memory reservation fails, see bug 461001 for details. But at least the main kernel boots fine when the parameters nolapic and nousb are given.
Ok, I'll set this to waiting on you to try with the latest kernel. I've grabbed 461001, and you can try it once I have that fixed.
OK, so with kernel-2.6.27-0.314.rc5.git9.fc10 the original crash (in hpwdt) is still there... Another observation is that the initrd kdump creates is missing the cciss modules. Could this be related to bug 442811 or is mkdumprd completely unrelated to mkinitrd?
Probably related to bz 442811, but in comment 10 it looks like you are trying to load cciss in kdump. Can you explain the discrepancy?
The printouts about cciss in comment 10 (and USB in comment 8) are from booting the main kernel (i.e. from grub, not kexec) with nolapic and nousb. Booting the F9 kdump kernel (from kexec) with nolapic and nousb gets me through machine_crash_shutdown(), but after that nothing happens and nothing more is printed.
Created attachment 316617 [details]
patch to correct exactmap parsing

We just found that the e820 map parsing on all of our x86 kernels is bad. It's been causing lots of problems lately (don't know how we didn't see it before). Can you give this patch a try?

Patch is in 2.6.26.5-36.fc9
kernel-2.6.26.5-39.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.26.5-39.fc9
kernel-2.6.26.5-39.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8089
Created attachment 317076 [details]
excerpt from dmesg

kernel-2.6.26.5-39.fc9 fails to allocate crashkernel memory during boot, see the attached excerpt from dmesg. Not sure why though, afaik the memory should be available...

(In reply to comment #21)
> Created an attachment (id=317076) [details]
> excerpt from dmesg
>
> kernel-2.6.26.5-39.fc9 fails to allocate crashkernel memory during boot, see
> the attached excerpt from dmesg. Not sure why though, afaik the memory should
> be available...

That's strange. Did any earlier 2.6.26 kernel fail that way? The change that went in doesn't look like it could have caused the failure.

It's entirely possible on some systems that the memory you specified is already allocated (or at least partially allocated, since you need a contiguous region). Try using the newer syntax (in which you just omit the @location portion). This allows the kernel to select an appropriate region for you, so you aren't bound to a specific memory location.
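To illustrate the syntax change suggested above, here is a small sketch. The 128M size matches what is reported later in this bug; the @16M offset is an assumed example value, not taken from this system.

```shell
# Old form: size@offset pins the reservation to a fixed physical address.
old='crashkernel=128M@16M'   # the 16M offset is an assumed example

# Newer form: drop everything from '@' onward so the kernel chooses
# a suitable contiguous region itself.
new="${old%%@*}"
echo "$new"

# After a reboot, check which form the running kernel was booted with.
grep -o 'crashkernel=[^ ]*' /proc/cmdline 2>/dev/null \
    || echo "no crashkernel= parameter found on the running kernel's command line"
```

The parameter itself goes on the kernel line of the boot loader configuration; the shell variables here only demonstrate the difference between the two forms.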
kernel-2.6.26.5-44.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.26.5-44.fc9
kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8283
kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
Apologies for the long silence from my side... Anyway, today I got the chance to revisit this issue, this time with kernel-2.6.27.4-79.fc10.

The original problem, i.e. that it panics in hpwdt, is still there. And if I unload the hpwdt module before hitting SysRq-c it now prints

Kernel panic - not syncing: Out of memory and no killable processes...

It prints that prior to booting the kdump kernel, or while the kdump kernel is booting? I can't imagine a sysrq-c produces that directly.
Sorry, I should have been more specific. It certainly looks like this happens during boot of the kdump kernel. I've added early_printk=vga to KDUMP_COMMANDLINE_APPEND, and when I hit SysRq-c the system starts to boot into the new kernel. The last lines on the screen read:

NetLabel: Initializing
NetLabel: domain hash size = 128
NetLabel: protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
PCI-GART: No AMD northbridge found.
hpet0: at MMIO 0xfed00000, IRQs 2,8,0
hpet0: 3 64-bit timers, 14318180 Hz
Kernel panic - not syncing: Out of memory and no killable processes...

Disabling the hpet timer makes no difference, except that the hpet lines are of course not printed.

Wow, how much RAM did you reserve for kdump with the crashkernel parameter?
128M. I tried upping it to 256M, but still the same problem.
That's unreal, something in the kernel must be pre-allocating a huge amount of RAM for this to be happening. Is this a system I can get access to, to poke around on?
No, unfortunately the machine is on the customer's premises and not accessible. But it's an HP BL680c G5 model with four quad-core CPUs and 16 GB of memory, so you could either try to find one in our HW lab or maybe get access to one directly through HP. And if there is any poking I could do for you, just let me know (-:
Yeah, an lsmod of the system and output of /proc/slabinfo would be a good start.
[root@lunkyzard ~]# uname -a
Linux lunkyzard.netact.noklab.net 2.6.27.5-94.fc10.x86_64 #1 SMP Mon Nov 10 15:19:36 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@lunkyzard ~]# lsmod
Module                  Size  Used by
sunrpc                191208  3
ipv6                  287272  90
dm_multipath           23704  0
uinput                 16128  0
iTCO_wdt               20176  0
iTCO_vendor_support    11652  1 iTCO_wdt
qla2xxx               185956  0
ipmi_si                47564  0
serio_raw              14084  0
pcspkr                 11008  0
tg3                   122500  0
ipmi_msghandler        39288  1 ipmi_si
bnx2                  180232  0
hpwdt                  15856  0
libphy                 25600  1 tg3
scsi_transport_fc      49540  1 qla2xxx
scsi_tgt               20528  1 scsi_transport_fc
shpchp                 38044  0
cciss                  66312  3
radeon                270216  0
drm                   200048  1 radeon
i2c_algo_bit           13956  1 radeon
i2c_core               29088  3 radeon,drm,i2c_algo_bit

/proc/slabinfo attached

Created attachment 323443 [details] /proc/slabinfo
Hmm, nothing looks out of place there. Could you please send me dmesg logs from both the production kernel and the kdump kernel boot? Thanks
[root@lunkyzard ~]# uname -a
Linux lunkyzard.netact.noklab.net 2.6.27.5-117.fc10.x86_64 #1 SMP Tue Nov 18 11:58:53 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Attached is the output from dmesg of the production kernel.

I've added "early_printk=serial console=ttyS0,9600" to KDUMP_COMMANDLINE_APPEND, but still, the output I get on the serial console is very short and concise:

Kernel panic - not syncing: Out of memory and no killable processes...

And that's it )-:

Created attachment 324183 [details] dmesg from production kernel
I understand that it's short, but could you please attach the serial console log from the kdump boot as well? Knowing where it gets that message is sometimes telling about what's running the kernel out of memory.
I'm sorry, but that really is all that comes out on the serial console after SysRq-c has been issued. Log attached...
Created attachment 324506 [details] serial console log
That's way more than what you showed me in comment #38. Why are you using the default configuration? That's what's going wrong here. You're supposed to configure kdump so that it captures the vmcore from the initramfs. What you're doing is mounting the root filesystem and running /sbin/init, which starts all your services and eats up RAM until you simply OOM-kill yourself. Modify /etc/kdump.conf to specify your root filesystem and partition, so that the initramfs can capture your vmcore for you directly. That will fix your problem.
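As a rough illustration of the kind of /etc/kdump.conf the comment above asks for, the sketch below writes an example file to a scratch path rather than touching the real configuration. The filesystem type (ext3), the device (/dev/cciss/c0d0p1), and the scratch file name are assumptions chosen for illustration; they are not taken from the reporter's system.

```shell
# Write an example kdump.conf to a scratch path (hypothetical file name).
KDUMP_CONF_SKETCH="${KDUMP_CONF_SKETCH:-./kdump.conf.sketch}"

cat > "$KDUMP_CONF_SKETCH" <<'EOF'
# <fs type> <partition>: dump target mounted from within the kdump initramfs
# (assumed values; use the real filesystem type and partition)
ext3 /dev/cciss/c0d0p1
# directory on that partition where the vmcore is saved
path /var/crash
# compress the dump while it is written
core_collector makedumpfile -c
EOF

cat "$KDUMP_CONF_SKETCH"
```

With a dump target configured like this, the vmcore is captured from the initramfs, so the kdump kernel never needs to start the full set of services inside the small crashkernel reservation. After editing the real file, the kdump service needs a restart so that the kdump initrd is rebuilt.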
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle. Changing version to '10'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
ping, any update?
closing due to lack of response.
Neil, apologies for my non-responsiveness. I lost access to the original hardware and I couldn't reproduce the issue on a somewhat similar machine I had. But let's hope everything works fine now, otherwise I'll get in touch.