239520 – kdump 2nd kernel failed to load cciss driver

Summary: kdump 2nd kernel failed to load cciss driver

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Red Hat Kernel Manager
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-09 03:01 UTC by masanari iida
Modified:	2007-11-30 22:07 UTC (History)
CC List:	5 users (show)
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 19:48:47 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot (14.50 MB, application/x-rpm) 2007-05-10 15:41 UTC, Neil Horman
2.6.18-8 kernel dmesg & iomem output (27.17 KB, text/plain) 2007-05-11 10:04 UTC, masanari iida
2.6.18-18 kernel dmesg & iomem output (26.89 KB, text/plain) 2007-05-11 10:16 UTC, masanari iida

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description masanari iida 2007-05-09 03:01:30 UTC

Description of problem:
kdump (local disk dump) on a Smart Array controller fails while the
2nd kernel initializes the cciss driver.

Version-Release number of selected component (if applicable):
RHEL5 (2.6.18-8)
cciss driver within the kernel.

How reproducible:
Proliant ML370G5  (SmartArray P400)   The symptom happens about once in 2-3 tests.
Proliant ML350G4p (SmartArray 642)    The symptom does NOT happen.

Steps to Reproduce:
1. Setup kdump other than local disk dump.
2. Modify config to local disk dump.
3. Restart kdump service.
4. Crash the system. ( # echo c > /proc/sysrq-trigger )
  
Actual results:
The kernel panics while the 2nd kernel loads the cciss driver.
This symptom happens intermittently (about once in 2-3 tests).

Expected results:
The 2nd kernel loads the cciss driver successfully.

Additional info:
The following panic signature suggests that this is a known
regression that was discussed on the Linux Kernel Mailing List.
http://www.ussg.iu.edu/hypermail/linux/kernel/0606.2/3055.html


Kernel BUG at drivers/block/cciss.c:2232
invalid opcode: 0000 [1] SMP 
last sysfs file: 
Modules linked in: cciss sd_mod scsi_mod
Pid: 421, comm: exe Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff88042040>]  [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
RSP: 0000:ffff810008507b38  EFLAGS: 00010286
RAX: 0000000000000044 RBX: ffff810008580000 RCX: ffffffff8042d520
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802da65c
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000004e20
R13: ffff8100084ca800 R14: 0000000000000000 R15: 0000000000000000
FS:  000000000aa0c8b0(0063) GS:ffffffff8038a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaac8000 CR3: 00000000013cd000 CR4: 00000000000006e0

Process exe (pid: 421, threadinfo ffff810008506000, task ffff81000842a100)
Stack:  0000000000000001 00000000084d11c0 0000000000000024 ffff8100084d1180
 00000012ffffffff ffff8100084ca800 ffff81000847e780 ffff8100084d1180
 0000000000000000 ffff810008ad0800 0000000000000000 ffffffff88042353

Call Trace:
 [<ffffffff88042353>] :cciss:cciss_getgeometry+0xc7/0x2bd
 [<ffffffff88044c6d>] :cciss:cciss_init_one+0x75c/0xbcb
 [<ffffffff80060c1b>] wait_for_completion+0x1f/0xa2
 [<ffffffff800862c5>] task_rq_lock+0x3d/0x6f
 [<ffffffff80145d93>] pci_device_probe+0x100/0x180
 [<ffffffff801a3692>] driver_probe_device+0x52/0xa2
 [<ffffffff801a37b9>] __driver_attach+0x65/0xb6
 [<ffffffff801a3754>] __driver_attach+0x0/0xb6
 [<ffffffff801a30df>] bus_for_each_dev+0x43/0x6e
 [<ffffffff801a2d25>] bus_add_driver+0x7e/0x130
 [<ffffffff80145f77>] __pci_register_driver+0x57/0x7e
 [<ffffffff800a2bbf>] sys_init_module+0x16aa/0x185f
 [<ffffffff8005b14e>] system_call+0x7e/0x83

Code: 0f 0b 68 2f 61 04 88 c2 b8 08 e9 2b fe ff ff 8b 43 38 8b 73 

RIP  [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
 RSP <ffff810008507b38>
 <0>Kernel panic - not syncing: Fatal exception

Comment 1 masanari iida 2007-05-09 05:33:42 UTC

As far as I can tell, this problem is related to disk I/O
activity just before the system crash.
To reproduce the panic at a higher rate, here is
another reproduction scenario.

Reproduction environment
/    /dev/cciss/c0d0p1
swap /dev/cciss/c0d0p2

kdump partition is set to /dev/cciss/c0d0p1. (/var/crash)

Steps to Reproduce:

(1) Start kdump (local disk dump).
(2) Force the system to access the Smart Array that
    holds the dump partition.

    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
  
(3) Enable sysrq and crash.
 
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger
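
The steps above can be collected into one small helper. This is only a sketch: the function name `repro_commands` and the `sleep 5` pause (to let the read load build up) are assumptions, not part of the original report. It uses `/proc/sys/kernel/sysrq` as the sysrq control file. The helper only prints the commands, so running it does not crash anything.

```shell
# Hypothetical helper: print the reproduction commands for a given device.
# It prints only; nothing is executed until the output is piped to a shell.
repro_commands() {
    dev="${1:-/dev/cciss/c0d0p1}"   # device from the report; override as needed
    cat <<EOF
dd if=$dev of=/dev/null &
sleep 5
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
EOF
}

repro_commands
```

To actually trigger the crash on a test machine with kdump running, the output would be piped to a root shell (`repro_commands | sh`). Doing so panics the system immediately.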

For your information, the following configuration prevents
the panic, even though a Smart Array is still used in the system.

/     /dev/sda1
swap  /dev/sda2
/dump /dev/cciss/c0d0p1 ( only for kdump use)

This is because the system doesn't use /dump during
normal operation, so the cciss driver in the 2nd kernel can
initialize the card without errors.

Comment 2 masanari iida 2007-05-09 11:39:40 UTC

Even if I configure kdump (scp), the following steps reproduce the same symptom.

(1) Start kdump (scp).
(2) Generate a large amount of I/O to the system disk.

    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
  
(3) Enable sysrq and crash.
 
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

Even when kdump (scp) is configured, initrd-kdump.img still contains
the cciss driver, so it can crash in the same way as the local disk dump case.

Comment 3 RHEL Program Management 2007-05-09 15:44:25 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Neil Horman 2007-05-10 15:42:02 UTC

Created attachment 154479
test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot

Please test with this kernel and let me know if your problem is solved. 
Thanks!

Comment 5 Neil Horman 2007-05-10 15:54:41 UTC

Note that to make this kernel work properly, you will need to add the
reset_devices option to the kernel command line.  You can do this in
/etc/sysconfig/kdump, using the KDUMP_COMMANDLINE_APPEND variable.  You will
also need to restart the kdump service after adding the option.
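
The change described above can be sketched as a small shell helper. This is an illustration under stated assumptions, not an official procedure: the function name `add_reset_devices` is hypothetical, the helper assumes GNU sed (for `sed -i`), and it assumes the existing KDUMP_COMMANDLINE_APPEND value, if any, is enclosed in double quotes.

```shell
# Hypothetical helper: add reset_devices to KDUMP_COMMANDLINE_APPEND in a
# sysconfig-style file. Does nothing if the option is already present.
add_reset_devices() {
    conf="$1"
    if grep -q '^KDUMP_COMMANDLINE_APPEND=.*reset_devices' "$conf"; then
        return 0
    fi
    if grep -q '^KDUMP_COMMANDLINE_APPEND=' "$conf"; then
        # Insert the option just before the closing double quote.
        sed -i 's/^\(KDUMP_COMMANDLINE_APPEND="[^"]*\)"/\1 reset_devices"/' "$conf"
    else
        # Variable not set yet: append a new assignment.
        echo 'KDUMP_COMMANDLINE_APPEND="reset_devices"' >> "$conf"
    fi
}
```

On the affected system this would be run as root as `add_reset_devices /etc/sysconfig/kdump`, followed by `service kdump restart` so that the new command line takes effect.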

Comment 6 masanari iida 2007-05-11 09:54:12 UTC

I have good news and bad news.

In short, the bug that I reported is fixed by your patch.
However, I have now found another one.

Bad one.
If I boot my x86_64 box with the test kernel, the kernel fails to
reserve 128MB of RAM for the 2nd kernel, even though I specify
crashkernel=128M@16M. So the test kernel cannot be used as the 1st kernel with kdump.
The error message that I found in dmesg is:
"Memory of crash kernel(0x1000000 to 0x8ffffff) not within permissible range"

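For reference, the address range in that message matches the crashkernel=128M@16M parameter: a 128 MiB region starting at the 16 MiB mark. A minimal sketch of the arithmetic follows. The function name `crash_range` is hypothetical and the output format merely imitates the range as quoted above.

```shell
# Hypothetical helper: compute the physical address range requested by
# crashkernel=<size>M@<start>M, printed as inclusive hex bounds.
crash_range() {
    size_mb="$1"
    start_mb="$2"
    start=$(( start_mb * 1024 * 1024 ))
    end=$(( start + size_mb * 1024 * 1024 - 1 ))
    printf '0x%x to 0x%x\n' "$start" "$end"
}

crash_range 128 16   # prints: 0x1000000 to 0x8ffffff
```

So the kernel parsed the requested region correctly; it rejected the reservation itself.
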
Good one.
I used 2.6.18-8 as the 1st (boot) kernel, because the original kernel
can reserve 128MB of RAM for kdump.
Then I set KDUMP_KERNELVER="2.6.18-18.el5bz239520" in
/etc/sysconfig/kdump and regenerated initrd-kdump.img.
With this configuration, I can use the test kernel as the 2nd kernel.
I tested the failure scenario that I provided in comment #2,
and I confirmed that the cciss panic no longer occurs.

Please check the dmesg and /proc/iomem output for the good and bad cases.

Thanks.

Comment 7 masanari iida 2007-05-11 10:04:25 UTC

Created attachment 154525
2.6.18-8 kernel dmesg & iomem output

This is a good dmesg and iomem.

Comment 8 masanari iida 2007-05-11 10:16:31 UTC

Created attachment 154526
2.6.18-18 kernel dmesg & iomem output

This is a bad dmesg and iomem.

Comment 9 Neil Horman 2007-05-11 14:26:54 UTC

That's not bad news, that's expected.  The kernel I provided to you was the x86_64
kdump kernel. It is meant to be used solely as the kernel that is booted via the
kexec boot method.  It's not meant to be booted from a cold start.  It should
boot just fine, but you shouldn't be able to use the crashkernel parameter on
its kernel command line.  Note that x86_64 and ppc64 are the only
kernels that still require a separate kdump kernel.  The other arches can use a
unified kernel for both normal and kexec operation.

I'll propose this for inclusion ASAP.

Comment 10 RHEL Program Management 2007-05-11 15:22:23 UTC

This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 11 Don Zickus 2007-06-16 00:37:03 UTC

Included in 2.6.18-27.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 masanari iida 2007-06-19 09:56:03 UTC

I have tested 2.6.18-27.el5 (both x86 and x86_64).
Both kernels work as expected.

I was told that there is a similar case, bz #230717.
It discusses exactly the same panic, but comment #18 there says the
root cause of the problem is the Smart Array card firmware.
So my question is: do we need to apply this errata kernel _AND_
the firmware update, or either the errata kernel _OR_ the firmware update?

Since your name "Don Zickus" also appears in #230717 (comment #13),
I believe you know something about that case, too.

Regards,
Masanari

Comment 13 Mike Miller (OS Dev) 2007-07-03 15:19:44 UTC

I Nacked this fix upstream. It's a pretty poor workaround. Sorry, Neil. I
completed the code for this prior to leaving for OLS. I did not have time to
complete testing. That's why it's not upstream.

Comment 14 Mike Miller (OS Dev) 2007-07-11 20:12:33 UTC

I am currently testing this new code. Several hiccups so far.

Comment 17 errata-xmlrpc 2007-11-07 19:48:47 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
