Bug 459103 - kexec/kdump panics on HP Proliant
Summary: kexec/kdump panics on HP Proliant
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 10
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-08-14 13:56 UTC by David Juran
Modified: 2009-04-24 12:52 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-17 10:57:51 UTC
Type: ---
Embargoed:


Attachments
Full panic obtained through netconsole (5.91 KB, text/plain)
2008-08-14 13:56 UTC, David Juran
patch to correct exactmap parsing (1.09 KB, patch)
2008-09-12 19:53 UTC, Neil Horman
excerpt from dmesg (4.74 KB, text/plain)
2008-09-18 13:16 UTC, David Juran
/proc/slabinfo (6.90 KB, text/plain)
2008-11-13 12:17 UTC, David Juran
dmesg from production kernel (60.83 KB, text/plain)
2008-11-20 15:27 UTC, David Juran
serial console log (1.81 KB, text/plain)
2008-11-24 17:06 UTC, David Juran

Description David Juran 2008-08-14 13:56:36 UTC
Created attachment 314317 [details]
Full panic obtained through netconsole

Description of problem:
When trying to boot into the kdump kernel on an HP Proliant BL680c G5 (by issuing SysRq-c), I'm getting the following panic (the full panic, obtained through netconsole, is attached):

Aug 14 16:06:12 Call Trace:
Aug 14 16:06:12  <NMI> 
Aug 14 16:06:12  [<ffffffff8810702b>] ? :hpwdt:asminline_call+0x2b/0x56
Aug 14 16:06:12  [<ffffffff88107298>] :hpwdt:hpwdt_pretimeout+0x44/0x8f
Aug 14 16:06:12  [<ffffffff8128dc98>] notifier_call_chain+0x33/0x5b
Aug 14 16:06:12  [<ffffffff8128dce2>] atomic_notifier_call_chain+0x13/0x15
Aug 14 16:06:12  [<ffffffff8104a2c0>] notify_die+0x2e/0x30
Aug 14 16:06:12  [<ffffffff8128bfa0>] default_do_nmi+0x53/0x1a1
Aug 14 16:06:12  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Aug 14 16:06:12  [<ffffffff8128c5a2>] do_nmi+0x2e/0x43
Aug 14 16:06:12  [<ffffffff8128bb9f>] nmi+0x7f/0x90
Aug 14 16:06:12  [<ffffffff8100aff0>] ? default_idle+0x0/0x5f
Aug 14 16:06:12  [<ffffffff8100a053>] ? mwait_idle+0x0/0x45
Aug 14 16:06:12  [<ffffffff8100a093>] ? mwait_idle+0x40/0x45
Aug 14 16:06:12  <<EOE>> 
Aug 14 16:06:12  [<ffffffff8100afa8>] cpu_idle+0x78/0xc0
Aug 14 16:06:12  [<ffffffff81286527>] start_secondary+0x3fc/0x40b



Version-Release number of selected component (if applicable):
kernel-2.6.25.14-108.fc9

How reproducible:
Every time

Steps to Reproduce:
1. update to kexec-tools-1.102pre-12.fc9 (Due to Bz 443878)
2. system-config-kdump
3. reboot
4. echo c > /proc/sysrq-trigger
  
Actual results:
panic

Expected results:
Uhm, panic.... But the kexec environment shortly thereafter (-:
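
A quick pre-flight check before step 4 can rule out the trivial failure modes. The following is a minimal Python sketch, not part of the original report. It assumes the running kernel exposes /sys/kernel/kexec_crash_loaded and that the reservation shows up as crashkernel= in /proc/cmdline; the function name kdump_ready is made up for illustration.

```python
import re


def kdump_ready(cmdline, crash_loaded):
    """Return a list of problems; an empty list means a crash kernel looks armed.

    cmdline      -- contents of /proc/cmdline
    crash_loaded -- contents of /sys/kernel/kexec_crash_loaded (assumed to exist)
    """
    problems = []
    # The first kernel must have reserved memory for the capture kernel.
    if not re.search(r"(^|\s)crashkernel=\S+", cmdline):
        problems.append("no crashkernel= reservation on the kernel command line")
    # The kdump service must have loaded the capture kernel via kexec -p.
    if crash_loaded.strip() != "1":
        problems.append("no crash kernel loaded (kexec_crash_loaded != 1)")
    return problems


# Typical use on a live system (read the two files, then inspect the result):
#   problems = kdump_ready(open("/proc/cmdline").read(),
#                          open("/sys/kernel/kexec_crash_loaded").read())
```

An empty list only means that a crash kernel is reserved and loaded; it says nothing about whether that kernel will actually boot on this hardware.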

Comment 1 Neil Horman 2008-08-14 15:19:24 UTC
Looks like the NMI watchdog tripped during boot.  Can you add nmi_watchdog=0 to the kdump kernel command line and see if the problem clears up?

Comment 2 David Juran 2008-08-15 12:17:39 UTC
No, adding nmi_watchdog=0 to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump didn't make any difference. Adding it to the main kernel command line at the GRUB boot screen didn't help either )-:
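
For anyone repeating this kind of experiment with several options, editing the variable by hand gets error-prone. Here is a small Python sketch of the edit, under the assumption that /etc/sysconfig/kdump holds a single double-quoted KDUMP_COMMANDLINE_APPEND="..." line; the helper name append_kdump_option is hypothetical.

```python
import re


def append_kdump_option(sysconfig_text, option):
    """Return sysconfig_text with `option` present in KDUMP_COMMANDLINE_APPEND.

    Assumes the variable is written as KDUMP_COMMANDLINE_APPEND="..." on one
    line. If the variable is missing, it is appended at the end of the text.
    """
    pattern = re.compile(
        r'^(KDUMP_COMMANDLINE_APPEND=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def add(match):
        current = match.group(2).split()
        if option not in current:  # keep the edit idempotent
            current.append(option)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = pattern.subn(add, sysconfig_text)
    if count == 0:
        new_text = sysconfig_text.rstrip("\n") + '\nKDUMP_COMMANDLINE_APPEND="%s"\n' % option
    return new_text
```

The kdump service has to be restarted after such a change so that the capture kernel is reloaded with the new command line.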

Comment 3 Neil Horman 2008-08-15 19:50:34 UTC
Does this system have some sort of RAC card in it (something that I assume the OS interfaces to via the hpwdt module)? Is it possible to remove this module before we start the kexec service and panic the box?

Comment 4 David Juran 2008-08-21 15:52:14 UTC
Yes, it has an iLO2. And after I rmmod hpwdt, at least it no longer panics.
When I now do SysRq-c it just prints
SysRq : Trigger a crashdump 
and that's it. It just hangs there.
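
Since unloading hpwdt changes the outcome, it may be worth checking for it before each test crash. A minimal Python sketch, assuming the usual /proc/modules layout (module name in the first column); the default suspect list contains only the one module identified in this bug, and the function names are made up.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def crash_blockers(proc_modules_text, suspects=("hpwdt",)):
    """Return the suspect modules that are still loaded, sorted by name."""
    loaded = loaded_modules(proc_modules_text)
    return sorted(name for name in suspects if name in loaded)


# Typical use on a live system:
#   crash_blockers(open("/proc/modules").read())
```

A non-empty result means the module should be removed (rmmod) before triggering SysRq-c for this test.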

Comment 5 Neil Horman 2008-08-21 17:22:55 UTC
Does it hang, or does it just stop responding through the remote console?  If you're able, can you attach a real serial console to the box and verify that it's hung?  If it is hung, can you record a sysrq-t from the system while it's hung?

Comment 6 David Juran 2008-08-22 13:44:53 UTC
I hooked up a real VGA console to the blade and it really is hung. And it no longer reacts to SysRq-t

Comment 7 Neil Horman 2008-08-22 14:43:02 UTC
Can you start tracking down exactly where you are hanging?  Let's start by adding early_printk=vga or earlyprintk=<serial console spec> to the kdump kernel command line.  That should at least tell us whether we are hanging in the second kernel or during the shutdown of the boot kernel.  If that doesn't give you any information, can you start instrumenting machine_crash_shutdown?  Or shall I write a patch for that?
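
A note on spelling: the kernel's parameter documentation lists this option as earlyprintk, without an underscore, and an unrecognised name on the command line generally does not produce an error. The Python sketch below flags the underscore variant in a command line string. The lookup table is an illustration covering only this one case, and the function name is hypothetical.

```python
# Misspelling -> documented kernel parameter name (illustrative, not exhaustive).
KNOWN_MISSPELLINGS = {"early_printk": "earlyprintk"}


def suspicious_params(cmdline):
    """Map each suspicious token in `cmdline` to its suggested replacement."""
    suggestions = {}
    for token in cmdline.split():
        name = token.split("=", 1)[0]
        if name in KNOWN_MISSPELLINGS:
            # Keep the "=value" part, swap only the parameter name.
            suggestions[token] = KNOWN_MISSPELLINGS[name] + token[len(name):]
    return suggestions
```

Running it over the value of KDUMP_COMMANDLINE_APPEND before restarting kdump is one way to catch the slip early.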

Comment 8 David Juran 2008-08-26 11:56:51 UTC
I first tried just adding early_printk=vga, but that didn't give me anything. So I proceeded to add a few printk()s to machine_crash_shutdown(), and it turns out it's hanging somewhere inside lapic_shutdown().
I then tried to add "nolapic" to the kdump command line and then it goes all the way through machine_crash_shutdown() but after that nothing happens, i.e. it hangs again.

I also tried adding "nolapic" to the regular kernel command line, but it then hangs during bootup. The last thing that's printed is 
usb-6.2: New USB device found, idVendor=93f0, idProduct=1327
usb 6-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 6-2: Product: Virtual Hub
usb 6-2: Manufacturer: HP

And that's it, there it hangs.

Comment 9 Neil Horman 2008-08-26 14:21:48 UTC
Well, let's see what we need to turn off to get it to boot.  Perhaps we can determine what all is wrong with this system if we know what we need to turn off to make it work.  You've already disabled the lapic.  Is it possible to disable USB as well?  Either via BIOS or the nousb option on the kdump kernel command line?

Comment 10 David Juran 2008-09-01 13:07:58 UTC
No, adding nousb (in addition to nolapic) to the kexec command line makes no difference, i.e. it goes just as far as with only nolapic.

Adding nousb (in addition to nolapic) to the main kernel command line in GRUB also makes the machine hang during startup. This time the last things on the console are
Loading cciss module
HP CISS (v 3.6.20)
ACPI: PCI Interrupt 000:08:00[A] -> Link [LNKA] -> GSI 5 (level, low) -> IRQ 5
cciss: MSI-X init failed -22
cciss0: <0x3230> at PCI 000:08:00.0 IRQ 5 using DAC
      blocks= 143305920 block_size= 512
      heads=255, sectors=32, cylinders=17562

Comment 11 Neil Horman 2008-09-02 11:11:02 UTC
Ugh, cciss just underwent some major changes to support the use of kdump, and I'm not sure if they're upstream yet. Can you try a rawhide kernel?

Comment 12 David Juran 2008-09-03 10:08:41 UTC
With kernel-2.6.27-0.290.rc5.fc10, crashkernel memory reservation fails, see bug 461001 for details.
But at least the main kernel boots fine when the parameters nolapic and nousb are given.

Comment 13 Neil Horman 2008-09-03 12:30:19 UTC
Ok, I'll set this to waiting on you to try with the latest kernel.  I've grabbed 461001, and you can try it once I have that fixed.

Comment 14 David Juran 2008-09-09 15:04:56 UTC
OK, so with kernel-2.6.27-0.314.rc5.git9.fc10 the original crash (in hpwdt) is still there... 
Another observation is that the initrd kdump creates is missing the cciss modules. Could this be related to bug 442811 or is mkdumprd completely unrelated to mkinitrd?

Comment 15 Neil Horman 2008-09-09 16:41:56 UTC
Probably related to bz 442811, but in comment 10 it looks like you are trying to load cciss in kdump.  Can you explain the discrepancy?

Comment 16 David Juran 2008-09-12 14:08:04 UTC
The printouts about cciss in comment 10 (and USB in comment 8) are from booting the main kernel (i.e. from GRUB, not kexec) with nolapic and nousb.

Booting the F9 kdump kernel (from kexec) with nolapic and nousb gets me through machine_crash_shutdown() but after that nothing happens and nothing more is printed.

Comment 17 Neil Horman 2008-09-12 19:53:29 UTC
Created attachment 316617 [details]
patch to correct exactmap parsing

We just found that the e820 map parsing on all of our x86 kernels is bad.  It's been causing lots of problems lately (don't know how we didn't see it before).

Can you give this patch a try?

Comment 18 Chuck Ebbert 2008-09-14 01:02:04 UTC
Patch is in 2.6.26.5-36.fc9

Comment 19 Fedora Update System 2008-09-14 06:14:13 UTC
kernel-2.6.26.5-39.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.5-39.fc9

Comment 20 Fedora Update System 2008-09-16 23:21:15 UTC
kernel-2.6.26.5-39.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8089

Comment 21 David Juran 2008-09-18 13:16:32 UTC
Created attachment 317076 [details]
excerpt from dmesg

kernel-2.6.26.5-39.fc9 fails to allocate crashkernel memory during boot, see the attached excerpt from dmesg. Not sure why though, afaik the memory should be available...

Comment 22 Chuck Ebbert 2008-09-19 18:48:14 UTC
(In reply to comment #21)
> Created an attachment (id=317076) [details]
> excerpt from dmesg
> 
> kernel-2.6.26.5-39.fc9 fails to allocate crashkernel memory during boot, see
> the attached excerpt from dmesg. Not sure why though, afaik the memory should
> be available...

That's strange. Did any earlier 2.6.26 kernel fail that way? The change that went in doesn't look like it could have caused the failure.

Comment 23 Neil Horman 2008-09-19 19:25:10 UTC
It's entirely possible on some systems that the memory you specified is already allocated (or at least partially allocated, since you need a contiguous region).  Try using the newer syntax (in which you just omit the @location portion).  This allows the kernel to select an appropriate region for you, so you aren't bound to a specific memory location.
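
To make the two forms concrete, here is a small Python sketch that parses crashkernel=size[@offset] and reports whether a fixed location was requested. It is an illustration only: it handles the K/M/G suffixes and ignores the extended range syntax, and the function names are made up.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def _to_bytes(value):
    """Convert a string such as '128M' or '16384' to a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.upper())
    if not match:
        raise ValueError("unsupported size: %r" % value)
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_crashkernel(arg):
    """Parse 'crashkernel=size[@offset]' into (size_bytes, offset_bytes).

    offset_bytes is None when the @offset part is omitted, i.e. when the
    kernel is left to choose the region itself.
    """
    prefix = "crashkernel="
    if not arg.startswith(prefix):
        raise ValueError("not a crashkernel parameter: %r" % arg)
    size, separator, offset = arg[len(prefix):].partition("@")
    return _to_bytes(size), (_to_bytes(offset) if separator else None)
```

With crashkernel=128M@16M the second element is the requested start address; with crashkernel=128M it is None and placement is up to the kernel.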

Comment 24 Fedora Update System 2008-09-20 06:29:57 UTC
kernel-2.6.26.5-44.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.5-44.fc9

Comment 25 Fedora Update System 2008-09-25 00:16:48 UTC
kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8283

Comment 26 Fedora Update System 2008-10-01 06:36:40 UTC
kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 27 David Juran 2008-11-10 11:26:16 UTC
Apologies for the long silence from my side... 
Anyway, today I got the chance to revisit this issue, this time with kernel-2.6.27.4-79.fc10 
The original problem, i.e. the panic in hpwdt, is still there. And if I unload the hpwdt module before hitting SysRq-c it now prints
Kernel panic - not syncing: Out of memory and no killable processes...

Comment 28 Neil Horman 2008-11-10 12:10:25 UTC
It prints that prior to booting the kdump kernel, or while the kdump kernel is booting?  I can't imagine a sysrq-c produces that directly.

Comment 29 David Juran 2008-11-10 13:01:46 UTC
Sorry, I should have been more specific, it certainly looks like this happens during boot of the kdump kernel.

I've added early_printk=vga to KDUMP_COMMANDLINE_APPEND and when I hit SysRq-c the system starts to boot into the new kernel. The last lines of the screen read:

NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel: unlabeled traffic allowed by default
PCI-GART: No AMD northbridge found.
hpet0: at MMIO 0xfed00000, IRQs 2,8,0
hpet0: 3 64-bit timers, 14318180 Hz
Kernel panic - not syncing: Out of memory and no killable processes...

Disabling the hpet timer makes no difference, except that the hpet lines are then of course not printed.

Comment 30 Neil Horman 2008-11-10 15:24:55 UTC
Wow, how much RAM did you reserve for kdump with the crashkernel parameter?

Comment 31 David Juran 2008-11-11 12:57:16 UTC
128M. I tried upping it to 256M, but still the same problem.

Comment 32 Neil Horman 2008-11-11 13:48:51 UTC
That's unreal, something in the kernel must be pre-allocating a huge amount of RAM for this to be happening.  Is this a system I can get access to, to poke around on?

Comment 33 David Juran 2008-11-12 12:49:29 UTC
No, unfortunately the machine is on the customer's premises and not accessible. But it's an HP BL680c G5 model with four quad-core CPUs and 16 GB of memory, so you could either try to find one in our HW lab or maybe get access to one directly through HP.
And if there is any poking I could do for you, just let me know (-:

Comment 34 Neil Horman 2008-11-12 16:49:22 UTC
Yeah, an lsmod of the system and output of /proc/slabinfo would be a good start.

Comment 35 David Juran 2008-11-13 12:16:27 UTC
[root@lunkyzard ~]# uname -a
Linux lunkyzard.netact.noklab.net 2.6.27.5-94.fc10.x86_64 #1 SMP Mon Nov 10 15:19:36 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@lunkyzard ~]# lsmod
Module                  Size  Used by
sunrpc                191208  3 
ipv6                  287272  90 
dm_multipath           23704  0 
uinput                 16128  0 
iTCO_wdt               20176  0 
iTCO_vendor_support    11652  1 iTCO_wdt
qla2xxx               185956  0 
ipmi_si                47564  0 
serio_raw              14084  0 
pcspkr                 11008  0 
tg3                   122500  0 
ipmi_msghandler        39288  1 ipmi_si
bnx2                  180232  0 
hpwdt                  15856  0 
libphy                 25600  1 tg3
scsi_transport_fc      49540  1 qla2xxx
scsi_tgt               20528  1 scsi_transport_fc
shpchp                 38044  0 
cciss                  66312  3 
radeon                270216  0 
drm                   200048  1 radeon
i2c_algo_bit           13956  1 radeon
i2c_core               29088  3 radeon,drm,i2c_algo_bit

/proc/slabinfo attached
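
For reading a slabinfo dump like the attached one, ranking caches by estimated footprint makes a runaway allocation easier to spot. A minimal Python sketch, assuming the slabinfo 2.x column layout (pagesperslab in the sixth column, num_slabs as the second value after "slabdata") and a 4 KiB page size; the function name is hypothetical.

```python
def slab_cache_usage(slabinfo_text, page_size=4096):
    """Return [(cache_name, estimated_bytes)] sorted largest first.

    The estimate is num_slabs * pagesperslab * page_size, which counts whole
    slabs, including objects that are allocated to the cache but unused.
    """
    usage = []
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skip the version banner, the '#' column header and blank lines.
        if line.startswith("#") or "slabdata" not in fields:
            continue
        name = fields[0]
        pages_per_slab = int(fields[5])
        num_slabs = int(fields[fields.index("slabdata") + 2])
        usage.append((name, num_slabs * pages_per_slab * page_size))
    return sorted(usage, key=lambda item: item[1], reverse=True)


# Typical use on a live system (reading the file usually requires root):
#   for name, size in slab_cache_usage(open("/proc/slabinfo").read())[:10]:
#       print(name, size)
```

Slab usage in the production kernel does not necessarily reflect what the capture kernel allocates, so this only helps to rule out obvious outliers.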

Comment 36 David Juran 2008-11-13 12:17:32 UTC
Created attachment 323443 [details]
/proc/slabinfo

Comment 37 Neil Horman 2008-11-13 15:57:03 UTC
Hmm, nothing looks out of place there.  Could you please send me dmesg logs from both the production kernel and the kdump kernel boot?  Thanks

Comment 38 David Juran 2008-11-20 15:26:21 UTC
[root@lunkyzard ~]# uname -a
Linux lunkyzard.netact.noklab.net 2.6.27.5-117.fc10.x86_64 #1 SMP Tue Nov 18 11:58:53 EST 2008 x86_64 x86_64 x86_64 GNU/Linux


Attached is the output from dmesg of the production kernel

I've added "early_printk=serial console=ttyS0,9600" to KDUMP_COMMANDLINE_APPEND but still, the output I get on the serial console is very short and concise:

Kernel panic - not syncing: Out of memory and no killable processes...

And that's it )-:

Comment 39 David Juran 2008-11-20 15:27:50 UTC
Created attachment 324183 [details]
dmesg from production kernel

Comment 40 Neil Horman 2008-11-20 18:36:11 UTC
I understand that it's short, but could you please attach the serial console log from the kdump boot as well?  Knowing where it gets that message is sometimes telling about what's running the kernel out of memory.

Comment 41 David Juran 2008-11-24 17:05:11 UTC
I'm sorry, but that really is all that comes out on the serial console after SysRq-c has been issued. Log attached...

Comment 42 David Juran 2008-11-24 17:06:03 UTC
Created attachment 324506 [details]
serial console log

Comment 43 Neil Horman 2008-11-24 21:16:33 UTC
That's way more than what you showed me in comment #38.  Why are you using the default configuration?  That's what's going wrong here.  You're supposed to configure kdump so that it captures the vmcore from the initramfs.  What you're doing is mounting the root filesystem and running /sbin/init, which starts all your services and eats up RAM until you simply OOM-kill yourself.

Modify /etc/kdump.conf to specify your root filesystem and partition, so that the initramfs can capture your vmcore for you directly.  That will fix your problem.
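
One way to tell whether a given kdump.conf still amounts to the default configuration is to look for an active (uncommented) dump target. The Python sketch below does that. The set of directive names is an assumption based on the kdump.conf format shipped with kexec-tools of this era and should be checked against the local man page; the device path in the usage comment is purely hypothetical.

```python
# Directives assumed to name a dump target (verify against kdump.conf(5)).
TARGET_DIRECTIVES = {"raw", "net", "ext2", "ext3", "ext4"}


def dump_targets(kdump_conf_text):
    """Return [(directive, argument)] for every active dump target line."""
    targets = []
    for line in kdump_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)
        if parts[0] in TARGET_DIRECTIVES and len(parts) == 2:
            targets.append((parts[0], parts[1].strip()))
    return targets


# Example: a file containing the line "ext3 /dev/cciss/c0d0p3" (hypothetical
# partition) yields [("ext3", "/dev/cciss/c0d0p3")]; a file with only
# comments yields [], which corresponds to the default behaviour described
# in the comment above.
```

An empty result suggests the capture kernel will fall through to mounting the root filesystem and running init, which is the situation that led to the out-of-memory panic here.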

Comment 44 Bug Zapper 2008-11-26 02:46:37 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 45 Neil Horman 2009-03-23 14:19:43 UTC
ping, any update?

Comment 46 Neil Horman 2009-04-17 10:57:51 UTC
closing due to lack of response.

Comment 47 David Juran 2009-04-24 12:52:37 UTC
Neil, apologies for my non-responsiveness. I lost access to the original hardware and I couldn't reproduce the issue on a somewhat similar machine I had. But let's hope everything works fine now, otherwise I'll get in touch.

