Bug 239520 - kdump 2nd kernel failed to load cciss driver
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel   
Version: 5.0
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Red Hat Kernel Manager
QA Contact: Martin Jenner
Depends On:
Reported: 2007-05-09 03:01 UTC by masanari iida
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 19:48:47 UTC

Attachments
test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot (14.50 MB, application/x-rpm)
2007-05-10 15:41 UTC, Neil Horman
2.6.18-8 kernel dmesg & iomem output (27.17 KB, text/plain)
2007-05-11 10:04 UTC, masanari iida
2.6.18-18 kernel dmesg & iomem output (26.89 KB, text/plain)
2007-05-11 10:16 UTC, masanari iida

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0959 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5 Update 1 2007-11-08 00:47:37 UTC

Description masanari iida 2007-05-09 03:01:30 UTC
Description of problem:
kdump (local disk dump) on a SmartArray controller fails while the
2nd kernel initializes the cciss driver.

Version-Release number of selected component (if applicable):
RHEL5 (2.6.18-8)
cciss driver within the kernel.

How reproducible:
Proliant ML370G5  (SmartArray P400)   The symptom occurs about once in 2-3 tests.
Proliant ML350G4p (SmartArray 642)    The symptom does NOT occur.

Steps to Reproduce:
1. Set up kdump with a target other than local disk dump.
2. Change the configuration to local disk dump.
3. Restart the kdump service.
4. Crash the system. ( # echo c > /proc/sysrq-trigger )

Actual results:
Kernel panic while the 2nd kernel loads the cciss driver.
This symptom happens intermittently (about once in 2-3 tests).

Expected results:
The 2nd kernel loads the cciss driver successfully.

Additional info:
The following panic signature suggests that this is a known
regression which was discussed on the Linux Kernel Mailing List.

Kernel BUG at drivers/block/cciss.c:2232
invalid opcode: 0000 [1] SMP 
last sysfs file: 
Modules linked in: cciss sd_mod scsi_mod
Pid: 421, comm: exe Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff88042040>]  [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
RSP: 0000:ffff810008507b38  EFLAGS: 00010286
RAX: 0000000000000044 RBX: ffff810008580000 RCX: ffffffff8042d520
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802da65c
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000004e20
R13: ffff8100084ca800 R14: 0000000000000000 R15: 0000000000000000
FS:  000000000aa0c8b0(0063) GS:ffffffff8038a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaac8000 CR3: 00000000013cd000 CR4: 00000000000006e0

Process exe (pid: 421, threadinfo ffff810008506000, task ffff81000842a100)
Stack:  0000000000000001 00000000084d11c0 0000000000000024 ffff8100084d1180
 00000012ffffffff ffff8100084ca800 ffff81000847e780 ffff8100084d1180
 0000000000000000 ffff810008ad0800 0000000000000000 ffffffff88042353

Call Trace:
 [<ffffffff88042353>] :cciss:cciss_getgeometry+0xc7/0x2bd
 [<ffffffff88044c6d>] :cciss:cciss_init_one+0x75c/0xbcb
 [<ffffffff80060c1b>] wait_for_completion+0x1f/0xa2
 [<ffffffff800862c5>] task_rq_lock+0x3d/0x6f
 [<ffffffff80145d93>] pci_device_probe+0x100/0x180
 [<ffffffff801a3692>] driver_probe_device+0x52/0xa2
 [<ffffffff801a37b9>] __driver_attach+0x65/0xb6
 [<ffffffff801a3754>] __driver_attach+0x0/0xb6
 [<ffffffff801a30df>] bus_for_each_dev+0x43/0x6e
 [<ffffffff801a2d25>] bus_add_driver+0x7e/0x130
 [<ffffffff80145f77>] __pci_register_driver+0x57/0x7e
 [<ffffffff800a2bbf>] sys_init_module+0x16aa/0x185f
 [<ffffffff8005b14e>] system_call+0x7e/0x83

Code: 0f 0b 68 2f 61 04 88 c2 b8 08 e9 2b fe ff ff 8b 43 38 8b 73 

RIP  [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
 RSP <ffff810008507b38>
 <0>Kernel panic - not syncing: Fatal exception

Comment 1 masanari iida 2007-05-09 05:33:42 UTC
As far as I can tell, this problem is related to disk I/O
activity just before the system crash.
To reproduce the panic at a higher rate, here is
another reproduction scenario.

Reproduce environment
/    /dev/cciss/c0d0p1
swap /dev/cciss/c0d0p2

kdump partition is set to /dev/cciss/c0d0p1. (/var/crash)

Steps to Reproduce:

(1) Start kdump (local disk dump).
(2) Force the system to access to the Smart Array which
    has a dump partition.

    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

For your information, the following configuration prevents
the panic, even though a SmartArray is used in the system.

/     /dev/sda1
swap  /dev/sda2
/dump /dev/cciss/c0d0p1 ( only for kdump use)

This is because the system doesn't use /dump during
normal operation, so the cciss driver in the 2nd kernel can
initialize the card without errors.

Comment 2 masanari iida 2007-05-09 11:39:40 UTC
Even if I configure kdump (scp), the following steps reproduce the same symptom.

(1) Start kdump (scp).
(2) Generate a large amount of I/O to the system disk.

    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

Even if I configure kdump (scp), initrd-kdump.img still contains the
cciss driver, so it may crash in the same way as the kdump (disk dump) case.
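
One way to confirm that the kdump initrd carries the cciss module is to list the archive members and look for cciss.ko. This is only a sketch: it assumes the image is a gzip-compressed cpio archive, and both the helper name listing_has_module and the image path in the usage comment are hypothetical.

```shell
# Hypothetical helper: read archive member names on stdin and succeed
# if a kernel module named <module>.ko is among them.
listing_has_module() {
    module="$1"
    # Match "<module>.ko" either at the archive root or after a directory.
    grep -Eq "(^|/)${module}\.ko\$"
}

# Usage (the image path is an assumption; adjust it to the kdump initrd
# that actually exists under /boot on the system being checked):
#   zcat /boot/initrd-2.6.18-8.el5kdump.img | cpio -it | listing_has_module cciss \
#       && echo "cciss is in the kdump initrd"
```

The helper only inspects a file listing, so it can be tried on any text listing without touching the real image.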

Comment 3 RHEL Product and Program Management 2007-05-09 15:44:25 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Neil Horman 2007-05-10 15:42:02 UTC
Created attachment 154479
test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot

Please test with this kernel and let me know if your problem is solved. 

Comment 5 Neil Horman 2007-05-10 15:54:41 UTC
Note that to make this kernel work properly, you will need to add the option
reset_devices to the kernel command line.  You can do this in
/etc/sysconfig/kdump, using the KDUMP_COMMANDLINE_APPEND variable.  You will
also need to restart the kdump service after adding the option.
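
The change described above could be scripted roughly as follows. This is a sketch, not the official procedure: the helper name add_kdump_option is hypothetical, and it assumes GNU sed (for -i) and that KDUMP_COMMANDLINE_APPEND is written on one line with a double-quoted value.

```shell
# Hypothetical helper: make sure <option> is listed in
# KDUMP_COMMANDLINE_APPEND inside the given kdump config file.
add_kdump_option() {
    conf="$1"   # config file, e.g. /etc/sysconfig/kdump
    opt="$2"    # option to add, e.g. reset_devices

    if grep -q '^KDUMP_COMMANDLINE_APPEND=' "$conf"; then
        # Leave the file alone if the option is already present.
        if grep '^KDUMP_COMMANDLINE_APPEND=' "$conf" | grep -qw -- "$opt"; then
            return 0
        fi
        # Insert the option just before the closing double quote.
        sed -i "s/^\(KDUMP_COMMANDLINE_APPEND=\"[^\"]*\)\"/\1 ${opt}\"/" "$conf"
    else
        # The variable is not set yet, so append a new line.
        printf 'KDUMP_COMMANDLINE_APPEND="%s"\n' "$opt" >> "$conf"
    fi
}

# Usage (run as root, then restart kdump so the change takes effect):
#   add_kdump_option /etc/sysconfig/kdump reset_devices
#   service kdump restart
```

Calling the helper twice leaves the file unchanged the second time, so it is safe to re-run.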

Comment 6 masanari iida 2007-05-11 09:54:12 UTC
I have good news and bad news.

In short, the bug that I reported is fixed by your patch,
but I have now found another problem.

Bad one.
If I boot my x86_64 box with the test kernel, the kernel fails to reserve
128MB of RAM for the 2nd kernel, even though I specify
crashkernel=128M@16M. So the test kernel cannot be used as the 1st kernel
with kdump. The error message that I found in dmesg is:
"Memory of crash kernel(0x1000000 to 0x8ffffff) notwithin  permissible range"

Good one.
I used 2.6.18-8 as the 1st (boot) kernel, because the original kernel
can reserve 128MB of RAM for kdump.
Then I set KDUMP_KERNELVER="2.6.18-18.el5bz239520" in
/etc/sysconfig/kdump and regenerated initrd-kdump.img.
With this configuration, I can use the test kernel as the 2nd kernel.
I tested the failure scenario that I provided in comment #2
and confirmed that the cciss panic no longer occurs.
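
Pointing kdump at a different 2nd kernel, as described above, amounts to rewriting one variable in /etc/sysconfig/kdump. A minimal sketch follows; the helper name set_sysconfig_var is hypothetical, GNU sed is assumed, and the value must not contain the characters | or &, which this simple sed expression does not escape.

```shell
# Hypothetical helper: set NAME="VALUE" in a sysconfig-style file,
# replacing an existing assignment or appending a new one.
set_sysconfig_var() {
    conf="$1"; name="$2"; value="$3"

    if grep -q "^${name}=" "$conf"; then
        # Rewrite the existing assignment in place.
        sed -i "s|^${name}=.*|${name}=\"${value}\"|" "$conf"
    else
        # No assignment yet, so add one at the end of the file.
        printf '%s="%s"\n' "$name" "$value" >> "$conf"
    fi
}

# Usage (run as root; the kdump initrd must be regenerated afterwards):
#   set_sysconfig_var /etc/sysconfig/kdump KDUMP_KERNELVER 2.6.18-18.el5bz239520
```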

Please check the dmesg and /proc/iomem output for the good and bad cases.
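
A quick way to see whether the 1st kernel actually reserved the crash kernel region is to look for the "Crash kernel" entry in /proc/iomem and compute its size. The sketch below is illustrative only; the helper name crash_kernel_size_mb is hypothetical, and it assumes the entry has the usual "start-end : Crash kernel" form with hexadecimal addresses.

```shell
# Hypothetical helper: print the size in MB of the "Crash kernel"
# region listed in an iomem-style file (default: /proc/iomem).
# Returns non-zero if no such region is listed.
crash_kernel_size_mb() {
    iomem="${1:-/proc/iomem}"

    # First field of the matching line looks like "01000000-08ffffff".
    range=$(awk 'tolower($0) ~ /crash kernel/ { print $1; exit }' "$iomem")
    [ -n "$range" ] || return 1

    start=${range%-*}
    end=${range#*-}
    # The range is inclusive, hence the +1 before converting to MB.
    echo $(( (0x$end - 0x$start + 1) / 1048576 ))
}

# Usage:
#   crash_kernel_size_mb || echo "no crash kernel region reserved"
```

For the range 0x1000000 to 0x8ffffff quoted in the error message above, a successful reservation would be reported as 128 MB.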


Comment 7 masanari iida 2007-05-11 10:04:25 UTC
Created attachment 154525
2.6.18-8 kernel dmesg & iomem output

This is a good dmesg and iomem.

Comment 8 masanari iida 2007-05-11 10:16:31 UTC
Created attachment 154526
2.6.18-18 kernel dmesg & iomem output

This is a bad dmesg and iomem.

Comment 9 Neil Horman 2007-05-11 14:26:54 UTC
That's not bad news, that's expected.  The kernel I provided to you was the x86_64
kdump kernel. It is meant to be used solely as the kernel that is booted via the
kexec boot method.  It's not meant to be booted from a cold start.  It should
boot just fine, but you shouldn't be able to use the crashkernel parameter on
the kernel command line with it.  Note that x86_64 and ppc64 are the only
kernels that still require a separate kdump kernel.  The other arches can use a
unified kernel for both normal and kexec operation.

I'll propose this for inclusion ASAP.

Comment 10 RHEL Product and Program Management 2007-05-11 15:22:23 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 11 Don Zickus 2007-06-16 00:37:03 UTC
in 2.6.18-27.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 masanari iida 2007-06-19 09:56:03 UTC
I have tested 2.6.18-27.el5 (both x86 and x86_64).
Both type of kernel work as expected.

I was told that there is a similar case, bz #230717.
It discusses exactly the same panic, but comment #18 there says that the
root cause of the problem is the Smart Array card firmware.
So my question is: do we need to apply this errata kernel _AND_ a
firmware update, or either the errata kernel _OR_ the firmware update?

Since your name, "Don Zickus", also appears in #230717 (comment #13),
I believe you know something about the call, too.



Comment 13 Mike Miller (OS Dev) 2007-07-03 15:19:44 UTC
I NACKed this fix upstream. It's a pretty poor workaround. Sorry, Neil. I
completed the code for this prior to leaving for OLS. I did not have time to
complete testing. That's why it's not upstream.

Comment 14 Mike Miller (OS Dev) 2007-07-11 20:12:33 UTC
I am currently testing this new code. Several hiccups so far.

Comment 17 errata-xmlrpc 2007-11-07 19:48:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

