Description of problem:
kdump (local disk dump) with SmartArray fails while the cciss driver is initializing in the 2nd kernel.

Version-Release number of selected component (if applicable):
RHEL5 (2.6.18-8), cciss driver within the kernel.

How reproducible:
Proliant ML370G5 (SmartArray P400): the symptom happens about once in 2-3 tests.
Proliant ML350G4p (SmartArray 642): the symptom does NOT happen.

Steps to Reproduce:
1. Set up kdump with a target other than local disk dump.
2. Modify the config to local disk dump.
3. Restart the kdump service.
4. Crash the system. ( # echo c > /proc/sysrq-trigger )

Actual results:
Kernel panic while the 2nd kernel loads the cciss driver. This symptom happens intermittently (once in 2-3 tests).

Expected results:
The 2nd kernel loads cciss successfully.

Additional info:
The following panic signature suggests that this is a known regression which was discussed on the Linux Kernel Mailing List.
http://www.ussg.iu.edu/hypermail/linux/kernel/0606.2/3055.html

Kernel BUG at drivers/block/cciss.c:2232
invalid opcode: 0000 [1] SMP
last sysfs file:
Modules linked in: cciss sd_mod scsi_mod
Pid: 421, comm: exe Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff88042040>] [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
RSP: 0000:ffff810008507b38 EFLAGS: 00010286
RAX: 0000000000000044 RBX: ffff810008580000 RCX: ffffffff8042d520
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802da65c
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000004e20
R13: ffff8100084ca800 R14: 0000000000000000 R15: 0000000000000000
FS: 000000000aa0c8b0(0063) GS:ffffffff8038a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaac8000 CR3: 00000000013cd000 CR4: 00000000000006e0
Process exe (pid: 421, threadinfo ffff810008506000, task ffff81000842a100)
Stack: 0000000000000001 00000000084d11c0 0000000000000024 ffff8100084d1180
 00000012ffffffff ffff8100084ca800 ffff81000847e780 ffff8100084d1180
 0000000000000000 ffff810008ad0800 0000000000000000 ffffffff88042353
Call Trace:
 [<ffffffff88042353>] :cciss:cciss_getgeometry+0xc7/0x2bd
 [<ffffffff88044c6d>] :cciss:cciss_init_one+0x75c/0xbcb
 [<ffffffff80060c1b>] wait_for_completion+0x1f/0xa2
 [<ffffffff800862c5>] task_rq_lock+0x3d/0x6f
 [<ffffffff80145d93>] pci_device_probe+0x100/0x180
 [<ffffffff801a3692>] driver_probe_device+0x52/0xa2
 [<ffffffff801a37b9>] __driver_attach+0x65/0xb6
 [<ffffffff801a3754>] __driver_attach+0x0/0xb6
 [<ffffffff801a30df>] bus_for_each_dev+0x43/0x6e
 [<ffffffff801a2d25>] bus_add_driver+0x7e/0x130
 [<ffffffff80145f77>] __pci_register_driver+0x57/0x7e
 [<ffffffff800a2bbf>] sys_init_module+0x16aa/0x185f
 [<ffffffff8005b14e>] system_call+0x7e/0x83
Code: 0f 0b 68 2f 61 04 88 c2 b8 08 e9 2b fe ff ff 8b 43 38 8b 73
RIP [<ffffffff88042040>] :cciss:sendcmd+0x2ca/0x33e
 RSP <ffff810008507b38>
 <0>Kernel panic - not syncing: Fatal exception
As far as I can tell, this problem is related to disk I/O activity just before the system crash. To reproduce the panic at a higher rate, here is another reproduction scenario.

Reproduce environment:
/      /dev/cciss/c0d0p1
swap   /dev/cciss/c0d0p2
The kdump partition is set to /dev/cciss/c0d0p1 (/var/crash).

Steps to Reproduce:
(1) Start kdump (local disk dump).
(2) Force the system to access the Smart Array that holds the dump partition.
    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

For your information, the following configuration prevents the panic, even though a SmartArray is used in the system.
/      /dev/sda1
swap   /dev/sda2
/dump  /dev/cciss/c0d0p1 (only for kdump use)
This is because the system doesn't use /dump during normal operation, so the cciss driver in the 2nd kernel may be able to initialize the card without errors.
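For convenience, the reproduction steps can be wrapped in a small script. This is only a sketch: the function name crash_under_io, the DRY_RUN switch and the 5 second delay are assumptions made for this example and are not part of kdump or the original test procedure. By default it only prints the commands, because the real sequence crashes the machine.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Run for real only on a disposable test box.
# crash_under_io and DRY_RUN are hypothetical names introduced for this example.
crash_under_io() {
    dev="${1:-/dev/cciss/c0d0p1}"    # device that holds the dump partition

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print what would be executed.
        echo "dd if=$dev of=/dev/null &"
        echo "echo 1 > /proc/sys/kernel/sysrq"
        echo "echo c > /proc/sysrq-trigger"
        return 0
    fi

    # Keep the controller busy with reads from the dump device.
    dd if="$dev" of=/dev/null &
    sleep 5                            # assumed delay so that I/O is in flight
    echo 1 > /proc/sys/kernel/sysrq    # enable SysRq
    echo c > /proc/sysrq-trigger       # force the crash; kexec boots the 2nd kernel
}
```

Calling "DRY_RUN=0 crash_under_io /dev/cciss/c0d0p1" as root performs the actual crash; leaving DRY_RUN unset only lists the steps.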
Even if I configure kdump (scp), the following steps reproduce the same symptom.

(1) Start kdump (scp).
(2) Generate a large amount of I/O to the system disk.
    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

Even when kdump (scp) is configured, initrd-kdump.img still contains the cciss driver, so it may crash in the same way as the kdump (diskdump) case.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 154479 [details] test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot Please test with this kernel and let me know if your problem is solved. Thanks!
Note that to make this kernel work properly, you will need to add the option reset_devices to the kernel command line. You can do this in /etc/sysconfig/kdump, using the KDUMP_COMMANDLINE_APPEND variable. You will also need to restart the kdump service after adding the option.
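As an illustration of that config change, here is a minimal sketch of a helper that appends an option to KDUMP_COMMANDLINE_APPEND. The function name add_kdump_option is hypothetical and not part of kexec-tools, and the sketch assumes the variable's value is written in double quotes and that GNU sed is available.

```shell
#!/bin/sh
# Sketch: append an option to KDUMP_COMMANDLINE_APPEND in a kdump config file.
# add_kdump_option is a hypothetical helper, not shipped with kexec-tools.
# Assumes the value is double-quoted, e.g. KDUMP_COMMANDLINE_APPEND="irqpoll".
add_kdump_option() {
    conf="$1"    # e.g. /etc/sysconfig/kdump
    opt="$2"     # e.g. reset_devices

    if grep -q '^KDUMP_COMMANDLINE_APPEND=' "$conf"; then
        # Leave the file alone if the option is already on that line.
        if grep '^KDUMP_COMMANDLINE_APPEND=' "$conf" | grep -qw -- "$opt"; then
            return 0
        fi
        # Insert the option just before the closing quote (GNU sed -i).
        sed -i "s/^KDUMP_COMMANDLINE_APPEND=\"\(.*\)\"\$/KDUMP_COMMANDLINE_APPEND=\"\1 ${opt}\"/" "$conf"
    else
        # Variable not set yet: add it at the end of the file.
        echo "KDUMP_COMMANDLINE_APPEND=\"${opt}\"" >> "$conf"
    fi
}
```

A typical use would be "add_kdump_option /etc/sysconfig/kdump reset_devices" followed by "service kdump restart".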
I have good news and bad news. In short, the bug that I reported is fixed by your patch, and I have now found another one.

Bad news: If I boot my x86_64 box with the test kernel, the kernel fails to reserve 128MB of RAM for the 2nd kernel even though I specify crashkernel=128M@16M. So the test kernel cannot be used as the 1st kernel with kdump. The error message that I found in dmesg is:
"Memory of crash kernel(0x1000000 to 0x8ffffff) notwithin permissible range"

Good news: I use 2.6.18-8 as the 1st (boot) kernel, because the original kernel can reserve 128MB of RAM for kdump. Then I set KDUMP_KERNELVER="2.6.18-18.el5bz239520" in /etc/sysconfig/kdump and regenerated initrd-kdump.img. With this configuration, I can use the test kernel as the 2nd kernel. I tested the failure scenario that I provided in comment #2 and confirmed that the cciss panic no longer occurs.

Please check the dmesg and /proc/iomem output of the good and bad cases. Thanks.
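A quick way to tell the good and bad cases apart is to look for a crash kernel reservation in /proc/iomem. The sketch below assumes the reserved region is labelled "Crash kernel" there, as on kdump-capable kernels I am aware of; the function name has_crash_reservation is made up for this example.

```shell
#!/bin/sh
# Sketch: report whether a crash kernel region has been reserved.
# has_crash_reservation is a hypothetical helper name.
# Assumption: the reservation appears in /proc/iomem as "Crash kernel".
has_crash_reservation() {
    iomem="${1:-/proc/iomem}"    # optional path, so saved output can be checked too
    grep -q 'Crash kernel' "$iomem"
}
```

For example, "has_crash_reservation && echo reserved || echo 'not reserved'" on the running system, or pass the path of an iomem dump saved from one of the attachments.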
Created attachment 154525 [details] 2.6.18-8 kernel dmesg & iomem output This is a good dmesg and iomem.
Created attachment 154526 [details] 2.6.18-18 kernel dmesg & iomem output This is a bad dmesg and iomem.
That's not bad news, that's expected. The kernel I provided to you was the x86_64 kdump kernel. It is meant to be used solely as the kernel that is booted via the kexec boot method. It's not meant to be booted from a cold start. It should boot just fine, but you shouldn't be able to use the crashkernel parameter on its kernel command line. Note that x86_64 and ppc64 are the only kernels that still require a separate kdump kernel. The other arches can use a unified kernel for both normal and kexec operation. I'll propose this for inclusion ASAP.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
in 2.6.18-27.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
I have tested 2.6.18-27.el5 (both x86 and x86_64). Both kernels work as expected. I was told that there is a similar case, bz #230717. It discusses exactly the same panic, but comment #18 there says the root cause of the problem is the Smart Array card firmware. So my question is: do we need to apply this errata kernel _AND_ the firmware update, or either the errata kernel _OR_ the firmware patch? Since your name "Don Zickus" also appears in #230717 (comment #13), I believe you know something about that case, too. Regards, Masanari
I Nacked this fix upstream. It's a pretty poor workaround. Sorry, Neil. I completed the code for this prior to leaving for OLS, but I did not have time to complete testing. That's why it's not upstream.
I am currently testing this new code. Several hiccups so far.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html