Bug 458013 - Kdump on Dom0 Failed with CCISS
Summary: Kdump on Dom0 Failed with CCISS
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Neil Horman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-08-06 05:43 UTC by Qian Cai
Modified: 2009-09-09 05:11 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-09-23 10:42:01 UTC
Target Upstream Version:
Embargoed:


Attachments
capture Kernel boot logs from Kdump on bare metal Kernel (13.84 KB, text/plain)
2008-08-06 05:43 UTC, Qian Cai
Kdump on Dom0 worked correctly with .88.el5 Kernel (19.08 KB, text/plain)
2008-08-06 16:47 UTC, Qian Cai
Kdump on Dom0 failed with exe SIGSEGV. (26.17 KB, text/plain)
2008-08-12 11:27 UTC, Qian Cai
Kdump on Dom0 failed with init SIGSEGV. (12.38 KB, text/plain)
2008-08-12 11:28 UTC, Qian Cai
Kdump on Dom0 failed with dropping to rootfs (16.01 KB, text/plain)
2008-08-12 11:30 UTC, Qian Cai
Kdump on Dom0 failed with capture Kernel panic. (15.21 KB, text/plain)
2008-08-12 11:31 UTC, Qian Cai

Description Qian Cai 2008-08-06 05:43:26 UTC
Created attachment 313527 [details]
capture Kernel boot logs from Kdump on bare metal Kernel

Description of problem:
The Kdump capture kernel panicked with an NMI watchdog lockup when Kdump was triggered from Dom0. Kdump from the bare-metal kernel worked fine. The machine in question was hp-dl360g5-01.rhts.bos.redhat.com.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3776694

Loading cciss.ko module
HP CISS Driver (v 3.6.20-RH1)
cciss: using PCI PM to reset controller
cciss: controller message 03:00 succeeded
ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 16 (level, low) -> IRQ 169
cciss0: <0x3230> at PCI 0000:06:00.0 IRQ 98 using DAC
      blocks= 286677120 block_size= 512
      heads= 255, sectors= 32, cylinders= 35132

      blocks= 286677120 block_size= 512
      heads= 255, sectors= 32, cylinders= 35132

 cciss/c0d0: p1 p2
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: cciss sd_mod scsi_mod
Pid: 432, comm: khelper Not tainted 2.6.18-92.el5 #1
RIP: 0010:[<ffffffff80146861>]  [<ffffffff80146861>] list_del+0x70/0x71
RSP: 0000:ffff810009763c88  EFLAGS: 00000046
RAX: ffff8100029f8cc0 RBX: 000000000000000d RCX: 0000000000000000
RDX: ffff810009477040 RSI: ffff8100029f8cc0 RDI: ffff810009cd2000
RBP: ffff810009cd2000 R08: ffff8100029fd400 R09: ffff810009c01000
R10: 0000000000000000 R11: 000000d000000000 R12: ffff8100029fd400
R13: ffff8100029f8cc0 R14: 0000000000000003 R15: ffff810009c00080
FS:  0000000000000000(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b6748cf8000 CR3: 0000000009645000 CR4: 00000000000006e0
Process khelper (pid: 432, threadinfo ffff810009762000, task ffff8100096a67a0)
Stack:  ffffffff8005be5b 000000d000000001 0000000000000246 00000000000000d0
 ffff810009c00080 0000000000000011 0000000000000000 ffff810009da2000
 ffffffff8000a91a 0000000000000001 0000000000000001 00000000ffffffe9
Call Trace:
 [<ffffffff8005be5b>] cache_alloc_refill+0xf1/0x186
 [<ffffffff8000a91a>] kmem_cache_alloc+0x6c/0x76
 [<ffffffff80012390>] get_empty_filp+0x57/0x14e
 [<ffffffff800234b0>] __path_lookup_intent_open+0x2b/0x97
 [<ffffffff8003bca7>] open_exec+0x24/0xc0
 [<ffffffff8005be70>] cache_alloc_refill+0x106/0x186
 [<ffffffff800d359e>] alternate_node_alloc+0x70/0x8c
 [<ffffffff8003e80a>] do_execve+0x46/0x243
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff80054760>] sys_execve+0x36/0x4c
 [<ffffffff8005e01c>] execve+0x64/0xc8
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8009aa42>] ____call_usermodehelper+0x57/0x61
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8009a9eb>] ____call_usermodehelper+0x0/0x61
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: c3 41 54 49 89 fc 55 48 89 f5 53 48 89 d3 48 8b 52 08 48 39 
Kernel panic - not syncing: nmi watchdog
 BUG: warning at kernel/panic.c:137/panic() (Not tainted)

Call Trace:
 <NMI>  [<ffffffff8008f599>] panic+0x1da/0x1eb
 [<ffffffff8006b4b8>] _show_stack+0xdb/0xea
 [<ffffffff8006b5ab>] show_registers+0xe4/0x100
 [<ffffffff8006521d>] die_nmi+0x66/0xa3
 [<ffffffff800658a1>] nmi_watchdog_tick+0x107/0x1fb
 [<ffffffff80065586>] default_do_nmi+0x86/0x214
 [<ffffffff800659d8>] do_nmi+0x43/0x61
 [<ffffffff80064e47>] nmi+0x7f/0x88
 [<ffffffff80146861>] list_del+0x70/0x71
 <<EOE>>  [<ffffffff8005be5b>] cache_alloc_refill+0xf1/0x186
 [<ffffffff8000a91a>] kmem_cache_alloc+0x6c/0x76
 [<ffffffff80012390>] get_empty_filp+0x57/0x14e
 [<ffffffff800234b0>] __path_lookup_intent_open+0x2b/0x97
 [<ffffffff8003bca7>] open_exec+0x24/0xc0
 [<ffffffff8005be70>] cache_alloc_refill+0x106/0x186
 [<ffffffff800d359e>] alternate_node_alloc+0x70/0x8c
 [<ffffffff8003e80a>] do_execve+0x46/0x243
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff80054760>] sys_execve+0x36/0x4c
 [<ffffffff8005e01c>] execve+0x64/0xc8
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8009aa42>] ____call_usermodehelper+0x57/0x61
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009a827>] __call_usermodehelper+0x0/0x4f
 [<ffffffff8009a9eb>] ____call_usermodehelper+0x0/0x61
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

I noticed that Kdump on the bare-metal kernel prints slightly different CCISS driver initialization messages:

HP CISS Driver (v 3.6.20-RH1)
cciss: using PCI PM to reset controller
cciss: resetting MSI-X             --> this did not show with Kdump on Dom0.
cciss: controller message 03:00 succeeded

I have attached the capture kernel boot logs from Kdump on the bare-metal kernel.

Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-21.el5
kernel-2.6.18-92.el5
kernel-xen-2.6.18-92.el5

How reproducible:
always

Steps to Reproduce:
1. Configure the Dom0 kernel with crashkernel=128M@32M
2. Trigger a crash with SysRq-C
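
For reference, crashkernel=128M@32M in step 1 asks for 128 MiB to be reserved for the capture kernel, starting at physical address 32 MiB. The following is a minimal sketch of how a size@offset value of that one form decomposes. The names crash_region, parse_size and parse_crashkernel are hypothetical, and the real kernel parser accepts more syntax than this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct crash_region {
    unsigned long long size;   /* bytes reserved for the capture kernel */
    unsigned long long start;  /* physical start address in bytes */
};

/* Parse a decimal number with an optional K/M/G suffix.
 * On return *end points past the number and suffix, or equals s
 * if no digits were found. */
static unsigned long long parse_size(const char *s, char **end)
{
    unsigned long long v = strtoull(s, end, 10);

    if (*end == s)
        return 0;               /* no digits at all */
    switch (**end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Parse "crashkernel=<size>@<offset>"; returns 0 on success, -1 on error. */
int parse_crashkernel(const char *arg, struct crash_region *out)
{
    static const char prefix[] = "crashkernel=";
    char *end;

    if (strncmp(arg, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    arg += sizeof(prefix) - 1;

    out->size = parse_size(arg, &end);
    if (end == arg || *end != '@')
        return -1;              /* this sketch requires an explicit offset */
    arg = end + 1;

    out->start = parse_size(arg, &end);
    if (end == arg || *end != '\0')
        return -1;
    return 0;
}
```

Calling parse_crashkernel("crashkernel=128M@32M", &r) fills r.size with 134217728 and r.start with 33554432.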
  
Actual results:
The capture kernel panicked.

Expected results:
Successfully captured a vmcore.

Additional info:
System information can be found at:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3776355

Comment 1 Neil Horman 2008-08-06 15:27:57 UTC
I wonder if this has something to do with bz 230717.  Cai, can you try this with kernel 2.6.18-90.el5?  That should not have the patch for 230717 in place.  Thanks!

Comment 2 Qian Cai 2008-08-06 16:47:00 UTC
Created attachment 313610 [details]
Kdump on Dom0 worked correctly with .88.el5 Kernel

Previous test logs show that it worked fine with the .88.el5 kernel.

Comment 3 Neil Horman 2008-08-06 18:01:37 UTC
Tomas, given that you were the Red Hat technical contact on bz 230717, can you shed any light on what's going on here?

Comment 4 Tomas Henzl 2008-08-07 11:32:04 UTC
(In reply to comment #3)
> Tomas, given that you were the Red Hat technical contact on bz 230717, can you
> shed any light on whats going on here?

I don't know where the difference could be. I'm going to set up my test box with a Xen kernel to see whether I can reproduce it, and then add some more debug messages.

Comment 5 Neil Horman 2008-08-07 12:13:20 UTC
Thank you, let me know what you find.

Comment 6 Tomas Henzl 2008-08-07 15:57:11 UTC
On my local machine, which is not MSI-X capable, it works but is unreliable: once the vmcore was not created even though the kdump kernel booted. I'll continue tomorrow.

Comment 7 Neil Horman 2008-08-08 13:07:04 UTC
Thanks, let us know what you find.

Comment 8 Tomas Henzl 2008-08-08 13:35:06 UTC
Neil,
the message "cciss: resetting MSI-X" is not shown because the test (control & PCI_MSIX_FLAGS_ENABLE) evaluates to zero when the kdump kernel is started from the Xen kernel (see the code below).
Could this be because the Xen kernel/hypervisor is not using MSI-X even though it is available?

On this box almost every test passed; only once was the vmcore created under the name "vmcore-incomplete" (with the right size).

I'm using kernel 2.6.18-92.1.10.el5xen, kdump kernel is 2.6.18-92.1.10.el5,
kexec-tools 1.102pre.

	pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
	if (pos) {
		pci_read_config_word(pdev, msi_control_reg(pos), &control);
		if (control & PCI_MSIX_FLAGS_ENABLE) {
			printk(KERN_INFO "cciss: resetting MSI-X\n");
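
The check quoted above comes down to a single bit test on the MSI-X Message Control word. Below is a minimal userspace sketch of that test, not driver code. It assumes the standard Linux value of PCI_MSIX_FLAGS_ENABLE (0x8000, bit 15); the macro SKETCH_MSIX_FLAGS_ENABLE and the helper msix_enabled are hypothetical names.

```c
#include <stdint.h>

/* Assumption: same value as PCI_MSIX_FLAGS_ENABLE in the Linux kernel
 * headers, i.e. bit 15 of the MSI-X Message Control word. */
#define SKETCH_MSIX_FLAGS_ENABLE 0x8000u

/* Hypothetical helper: returns 1 if the MSI-X enable bit is set in the
 * control word read from PCI config space, 0 otherwise. */
int msix_enabled(uint16_t control)
{
    return (control & SKETCH_MSIX_FLAGS_ENABLE) != 0;
}
```

If the enable bit is clear in the word the capture kernel reads, msix_enabled() returns 0, which corresponds to the case where the "cciss: resetting MSI-X" message is not printed.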

Comment 9 Neil Horman 2008-08-08 15:34:53 UTC
I honestly don't know.  If you look earlier in the logs, assign_interrupt_mode seems to indicate that we can use MSI features in the dom0 kernel.  As to why cciss isn't detecting MSI capabilities in the production Xen kernel, I'm not sure, but should that really matter?  If we reboot into a non-Xen kernel and all of a sudden the cciss driver detects that it can use MSI, it should be able to reset it without oopsing or deadlocking, shouldn't it?  Or am I missing something?

Also, and thank you for pointing this out, but Cai, why were you using the kdump kernel in this system?  IIRC with 5.2 you should be able to use the Xen kernel itself during kdump operations.  Tomas, does the problem persist if you use the Xen kernel to boot kdump?

Comment 10 Tomas Henzl 2008-08-11 15:33:58 UTC
(In reply to comment #9)
> I honestly don't know.  If you look earlier in the logs, assign_interrupt_mode
> seems to indicate that we can use MSI features in the dom0 kernel.  As to why
> cciss isn't detecting MSI capabilities in the production xen kernel, I'm not
> sure, but should that really matter?  if we reboot into a non xen kernel and
> all of a sudden the cciss driver detects that it can use MSI, it should be able
> to reset it without oopsing or deadlocking, shouldn't it?  Or am I missing
> something?
Sure it should.
> 
> Also, and thank you for pointing this out, but cai, why were you using the
> kdump kernel in this system?  IIRC with 5.2 you should be able to use the xen
> kernel itself during kdump operations.  Tomas, does the problem persist if you
> use the xen kernel to boot kdump?
I don't know whether it is possible to use the Xen kernel for this. On a box with only a Xen kernel, the creation of the initial kdump image fails, so I was forced to install a non-Xen kernel.

Comment 11 Tomas Henzl 2008-08-11 15:40:34 UTC
Cai,
I tried this today on hp-dl360g5-01:
[root@hp-dl360g5-01 ~]# uname -a
Linux hp-dl360g5-01.rhts.bos.redhat.com 2.6.18-92.el5xen #1 SMP Tue Apr 29 13:31:30 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
I triggered a crash three times (echo c > /proc/sysrq-trigger), and every time the vmcore was created without a deadlock.
So I'm probably doing something differently from you. Please tell me what I should do to see the problem.
The box is reserved now, feel free to use it (I'm leaving now).

Comment 12 Mike Miller (OS Dev) 2008-08-11 17:42:28 UTC
> I honestly don't know.  If you look earlier in the logs, assign_interrupt_mode
> seems to indicate that we can use MSI features in the dom0 kernel.  As to why
> cciss isn't detecting MSI capabilities in the production xen kernel, I'm not

It's not that cciss doesn't detect the MSI/X capabilities. The problem is the vector is already allocated. The code to reset the MSI/X stuff is not getting called.

Comment 13 Qian Cai 2008-08-12 11:26:24 UTC
I have tried on two machines,
hp-dl360g5-01.rhts.bos.redhat.com
hp-dl360g5-02.rhts.bos.redhat.com

It looks like Kdump on Dom0 works some of the time. However, when I ran the following cron job,

@reboot echo a >>/root/log; sleep 120; rm -rf /var/crash/*; sync; echo c >/proc/sysrq-trigger

I saw problems before long, with lots of strange failures. Please see the following attachments.

I have also run the cron job for Kdump on bare metal, and it worked fine without any problems.

I have the machine hp-dl360g5-02.rhts.bos.redhat.com reserved, feel free to grab it.

Comment 14 Qian Cai 2008-08-12 11:27:57 UTC
Created attachment 314083 [details]
Kdump on Dom0 failed with exe SIGSEGV.

Comment 15 Qian Cai 2008-08-12 11:28:39 UTC
Created attachment 314085 [details]
Kdump on Dom0 failed with init SIGSEGV.

Comment 16 Qian Cai 2008-08-12 11:30:41 UTC
Created attachment 314086 [details]
Kdump on Dom0 failed with dropping to rootfs

Comment 17 Qian Cai 2008-08-12 11:31:21 UTC
Created attachment 314088 [details]
Kdump on Dom0 failed with capture Kernel panic.

Comment 19 Tomas Henzl 2008-08-15 13:34:53 UTC
Hi,
I tried some changes to the reset code in the driver's initialization, without success. Sometimes I got 20 successful kdumps, but then it failed again at random places.
Today rhts is not working for me, so I'll continue next week.

Comment 20 Neil Horman 2008-08-15 19:49:09 UTC
Thanks Tomas, let us know what you find out.

Comment 21 Tomas Henzl 2008-09-23 08:50:48 UTC
Neil,
after some problems with previous kernels, which weren't working (you know, the vmcore zero issue), I'm now testing kernel-2.6.18-115.el5.src.rpm on my box, and it has been working flawlessly for several days (continuous kdumping). On the RHTS machine hp-dl360g5-01.rhts.bos.redhat.com, which was more vulnerable to the problem, I have now stopped the tests after 892 successful kdumps.
I also tested upstream kernel 2.6.26 for inspiration, but it failed after 514 tests and then after 50 tests.
So at the moment our kernel is better than the upstream kernel in this area.

Comment 22 Neil Horman 2008-09-23 10:27:06 UTC
OK, it sounds like the cciss maintainer has some work to do upstream then.

Where does that leave us with this bug? Shall we close it?

Comment 23 Qian Cai 2008-09-23 10:36:30 UTC
Please close it. Thanks!

Comment 24 Neil Horman 2008-09-23 10:42:01 UTC
Copy that. Thanks Cai!

