Bug 500741
Summary: | [Stratus 5.5 bug] "critical_disks" makes kdump unreliable
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Robert N. Evans <robert.evans>
Component: | kexec-tools
Assignee: | Neil Horman <nhorman>
Status: | CLOSED ERRATA
QA Contact: | Red Hat Kernel QE team <kernel-qe>
Severity: | high
Priority: | high
Version: | 5.3
CC: | andriusb, balkov, bzeranski, charlotte.richardson, chas.horvath, cward, jparadis, phan, qcai, richard.johnson
Target Milestone: | rc
Keywords: | OtherQA
Target Release: | 5.5
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Doc Text: |
kdump waits for all devices in its critical_disks list to be available
before it performs a dump. Previously, there was no limit to the time that
kdump would wait for a device to respond. Therefore, the dump might never
be performed. kexec-tools now has a disk_timeout parameter that limits how
long kdump will wait for storage to respond. This ensures that the dump will
take place.
|
Clone Of: |
: | 600583
Last Closed: | 2010-03-30 07:47:10 UTC
Bug Blocks: | 533941
Description
Robert N. Evans
2009-05-13 21:16:51 UTC
I understand your argument in principle, but I just can't do it. We need to be able to wait for all the disks to be present so that we can guarantee that the dump is captured at all. We can't ignore sd drives, since hard disk drives are the entire reason that the critical disk list was created. In the converse situation, if we only have SCSI disks, the above change renders the critical disk list useless, and if a drive is broken, all sorts of unexpected errors can occur. The option to create a timeout sounds a bit better. I'll attach a patch for you to try shortly.
Created attachment 343898
patch to add disk_timeout config option
Here it is. This lets you set a disk_timeout option to limit how long we wait for critical disks. I've not tested it yet, but let me know how it works for you.
Created attachment 344194
Revised patch
I have revised the patch to be compatible with msh. I also added handling for "disk_timeout" not being configured; in this case there is no limit to the wait for critical disks.
I verified this worked as expected with these test cases:
- missing critical disks and disk_timeout configured to 0
- missing critical disks and disk_timeout configured to 7
- missing critical disks and disk_timeout not configured
- critical disks all present and disk_timeout not configured
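The four cases above can be read as the semantics of a bounded wait loop. The following is a minimal Python sketch of that behavior only, not the actual mkdumprd patch (which is not reproduced in this report). The function name `wait_for_critical_disks`, the `is_present` callback, and the one-second `poll_interval` are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Iterable, List, Optional


def wait_for_critical_disks(
    disks: Iterable[str],
    disk_timeout: Optional[float],
    is_present: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
    poll_interval: float = 1.0,
) -> List[str]:
    """Wait for every disk to appear; return the disks still missing.

    disk_timeout=None models "disk_timeout not configured": wait with no limit.
    disk_timeout=0 models "do not wait at all".
    """
    waited = 0.0
    missing = [d for d in disks if not is_present(d)]
    while missing:
        # Give up once the configured limit is reached (never when unset).
        if disk_timeout is not None and waited >= disk_timeout:
            return missing
        sleep(poll_interval)
        waited += poll_interval
        # Only re-check the disks that have not shown up yet.
        missing = [d for d in missing if not is_present(d)]
    return []
```

With disks that never appear, a timeout of 0 returns immediately and a timeout of 7 returns after seven one-second polls. With no timeout configured, the loop keeps polling until the disks appear, which is the unbounded wait this bug reports.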
Please consider taking this change for kexec-tools.
Yeah, that looks good. I'll commit this to whichever release it gets approved for. Thanks!
Stratus: Would this cause any heartburn if this was proposed for RHEL 5.5?
We can work around this problem. Although it would be nice to have a fix earlier, it is great to get this fix in RHEL 5.5.
OK - will do. Proposing officially for RHEL 5.5. Thanks for the feedback.
Fixed in -79.el5.
Neil - Can you make the new kexec-tools RPM available for me to test? I'd like to make sure the avoidance Stratus is using is compatible with the new version of kexec-tools.
It's in brew.
Neil - I don't believe Stratus can get packages in Brew (unless Jim can grab them) since they are external. Would it be possible to have them on a people page in the meantime? If not, I'll see if Jim can bring them down for Stratus.
Fix verified on Stratus hardware with both kexec-tools-1.102pre-79.el5.x86_64 and kexec-tools-1.102pre-83.el5.x86_64. Using kdump.conf disk_timeout=0, successfully collected dumps when an incomplete RAID1 was present, with no delays waiting for disks. Also verified the mkdumprd script by comparison with the version from comment 4 that was thoroughly tested at Stratus. Verified that the Stratus work-around for this problem properly accommodates the situation when this fix is present. So a new version of the Stratus lsb-ft-cstools RPM is not needed to use the fix from Red Hat.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents: kdump waits for all devices in its critical_disks list to be available before it performs a dump. Previously, there was no limit to the time that kdump would wait for a device to respond. Therefore, the dump might never be performed. kexec-tools now has a disk_timeout parameter that limits how long kdump will wait for storage to respond. This ensures that the dump will take place.
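The verification above sets disk_timeout in kdump.conf and relies on "not configured" meaning no limit. A minimal Python sketch of reading that option follows. It assumes whitespace-separated `option value` lines with `#` comments, and it also tolerates the `disk_timeout=0` spelling used in the comment above. The helper name `read_disk_timeout` is hypothetical and this is not the parsing code from the patch.

```python
from typing import Optional


def read_disk_timeout(conf_text: str) -> Optional[int]:
    """Return the configured disk_timeout in seconds, or None if it is unset."""
    timeout = None
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Accept both "disk_timeout 7" and "disk_timeout=7" (assumed forms).
        parts = line.replace("=", " ", 1).split()
        if len(parts) >= 2 and parts[0] == "disk_timeout":
            timeout = int(parts[1])  # the last occurrence wins in this sketch
    return timeout
```

Returning `None` rather than a default number keeps "not configured" distinct from an explicit 0, matching the two behaviors described in this report.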
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0179.html