Bug 239520
Summary: | kdump 2nd kernel failed to load cciss driver | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | masanari iida <masanari_iida> |
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | low | Docs Contact: | |
Priority: | medium | ||
Version: | 5.0 | CC: | coughlan, maarten, mike.miller, nhorman, rick.beldin |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Fixed In Version: | RHBA-2007-0959 | Doc Type: | Bug Fix |
Last Closed: | 2007-11-07 19:48:47 UTC | Type: | --- |
Description
masanari iida
2007-05-09 03:01:30 UTC
As far as I can tell, this problem is related to disk I/O activity just before the system crashes. To reproduce the panic more reliably, here is another reproduction scenario.

Reproduction environment:
    /     /dev/cciss/c0d0p1
    swap  /dev/cciss/c0d0p2
The kdump partition is set to /dev/cciss/c0d0p1 (/var/crash).

Steps to Reproduce:
(1) Start kdump (local disk dump).
(2) Force the system to access the Smart Array that holds the dump partition.
    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger

For your information, the following configuration prevents the panic, even though the system uses a Smart Array:
    /      /dev/sda1
    swap   /dev/sda2
    /dump  /dev/cciss/c0d0p1 (only for kdump use)
This is because the system doesn't use /dump during normal operation, so cciss in the 2nd kernel may initialize the card without errors.

Even if I configure kdump (scp), the following steps reproduce the same symptom.
(1) Start kdump (scp).
(2) Generate a large amount of I/O to the system disk.
    # dd if=/dev/cciss/c0d0p1 of=/dev/null &
(3) Enable sysrq and crash.
    # echo 1 > /proc/sys/kernel/sysrq
    # echo c > /proc/sysrq-trigger
Even when kdump is configured for scp, initrd-kdump.img still contains the cciss driver, so it may crash in the same way as the kdump (disk dump) case.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 154479 [details]
test kernel with upstream patch in place to ignore unsent scsi requests in cciss on kexec boot
Please test with this kernel and let me know if your problem is solved.
Thanks!
Note that to make this kernel work properly, you will need to add the reset_devices option to the kernel command line. You can do this in /etc/sysconfig/kdump, using the KDUMP_COMMANDLINE_APPEND variable. You will also need to restart the kdump service after adding the option.

I have good news and bad news. In short, the bug that I reported is fixed by your patch, but I have now found another one.

Bad news: If I boot my x86_64 box with the test kernel, the kernel fails to reserve 128MB of RAM for the 2nd kernel, even though I specify crashkernel=128M@16M. So the test kernel cannot be used as the 1st kernel with kdump. The error message that I found in dmesg is:
"Memory of crash kernel(0x1000000 to 0x8ffffff) notwithin permissible range"

Good news: I use 2.6.18-8 as the 1st (boot) kernel, because the original kernel can reserve 128MB of RAM for kdump. Then I set KDUMP_KERNELVER="2.6.18-18.el5bz239520" in /etc/sysconfig/kdump and regenerated initrd-kdump.img. With this configuration, I can use the test kernel as the 2nd kernel. I tested the failure scenario that I provided in comment #2, and I have confirmed that the cciss panic no longer occurs. Please check the dmesg and /proc/iomem output of the good and bad cases. Thanks.

Created attachment 154525 [details]
2.6.18-8 kernel dmesg & iomem output
This is a good dmesg and iomem.
Created attachment 154526 [details]
2.6.18-18 kernel dmesg & iomem output
This is a bad dmesg and iomem.
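For reference, the address range in the dmesg error quoted earlier follows directly from the crashkernel=128M@16M parameter (size 128MB at offset 16MB). Below is a minimal shell sketch of that arithmetic; crash_range is a hypothetical helper name used only for illustration and is not part of kexec-tools or the kernel.

```shell
# Hypothetical helper (illustration only): compute the physical address range
# requested by crashkernel=<size>M@<offset>M, both arguments in megabytes.
crash_range() {
    size_mb=$1
    offset_mb=$2
    start=$((offset_mb * 1024 * 1024))
    # The end address is inclusive, hence the -1.
    end=$((start + size_mb * 1024 * 1024 - 1))
    printf '0x%x to 0x%x\n' "$start" "$end"
}

# crashkernel=128M@16M
crash_range 128 16
```

The printed range matches the one in the error message, so the parameter itself was parsed as intended and only the reservation was refused. On a kernel where the reservation succeeds, /proc/iomem would normally be expected to show a "Crash kernel" entry for this range.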
That's not bad news, that's expected. The kernel I provided to you was the x86_64 kdump kernel. It is meant to be used solely as the kernel that is booted via the kexec boot method. It's not meant to be booted from a cold start. It should boot just fine, but you shouldn't be able to issue the crashkernel parameter on the kernel command line from it. Note that x86_64 and ppc64 are the only kernels that still require a separate kdump kernel. The other arches can use a unified kernel for both normal and kexec operation. I'll propose this for inclusion ASAP.

This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.

in 2.6.18-27.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

I have tested 2.6.18-27.el5 (both x86 and x86_64). Both kernels work as expected. I was told that there is a similar case, bz #230717. It discusses exactly the same panic, but comment #18 there says the root cause of the problem is the Smart Array card firmware. So my question is: do we need to apply this errata kernel _AND_ the firmware update, or either the errata kernel _OR_ the firmware patch? Since your name "Don Zickus" is also in #230717 (comment #13), I believe you know something about the call, too.
Regards, Masanari

I Nacked this fix upstream. It's a pretty poor workaround.

Sorry, Neil. I completed the code for this prior to leaving for OLS. I did not have time to complete testing. That's why it's not upstream. I am currently testing this new code. Several hiccups so far.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0959.html
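The reset_devices note earlier in this report asks for the option to be added through KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump, followed by a restart of the kdump service. One possible way to script that edit is sketched below. append_kdump_arg is a hypothetical helper, not a tool shipped with kexec-tools. The sketch assumes the variable, if present, appears once, is double-quoted on a single line, and contains no "|" or "&" characters. It also relies on GNU sed for the -i option.

```shell
# Hypothetical helper (illustration only): add one argument to
# KDUMP_COMMANDLINE_APPEND in a sysconfig-style file, keeping existing values.
append_kdump_arg() {
    file=$1
    arg=$2
    # Read the current double-quoted value (empty if the variable is unset).
    current=$(sed -n 's/^KDUMP_COMMANDLINE_APPEND="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $arg "*) return 0 ;;   # argument already present, nothing to do
    esac
    if grep -q '^KDUMP_COMMANDLINE_APPEND=' "$file"; then
        # Rewrite the line with the argument appended to the old value.
        sed -i "s|^KDUMP_COMMANDLINE_APPEND=.*|KDUMP_COMMANDLINE_APPEND=\"${current:+$current }$arg\"|" "$file"
    else
        printf 'KDUMP_COMMANDLINE_APPEND="%s"\n' "$arg" >> "$file"
    fi
}

# Intended use on the affected system (run as root, left commented out here):
#   append_kdump_arg /etc/sysconfig/kdump reset_devices
#   service kdump restart
```

The helper appends rather than overwrites, so any arguments already set in the variable are kept, and running it twice leaves the file unchanged.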