Bug 254163
Summary: | kdump under kernel-xen fails with I/O load on aic7xxx or megaraid_mbox | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Bryn M. Reeves <bmr> | ||||||
Component: | kexec-tools | Assignee: | Jarod Wilson <jarod> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 5.3 | CC: | anderson, andriusb, coughlan, cward, ddomingo, dmair, martin.wilck, nhorman, qcai, syeghiay, tao, vgoyal, xen-maint | ||||||
Target Milestone: | rc | Keywords: | Reopened | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: |
A bug in the IDE/ATA driver stack that could prevent a system using kernel-xen from booting into the kdump environment is now fixed. In previous releases, this occurred if the system encountered a kernel panic while an IDE device was performing I/O and the IDE device was being controlled by a device driver other than libata.
|
Story Points: | --- | ||||||
Clone Of: | |||||||||
: | 600600 (view as bug list) | Environment: | |||||||
Last Closed: | 2009-01-20 20:58:07 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 391221, 391501, 409971, 454962 | ||||||||
Attachments: |
|
Description
Bryn M. Reeves
2007-08-24 15:15:45 UTC
Original reporter tested with the following patch.

diff -uNrp xen.orig/arch/x86/machine_kexec.c xen/arch/x86/machine_kexec.c
--- xen.orig/arch/x86/machine_kexec.c 2007-05-03 16:40:19.000000000 +0900
+++ xen/arch/x86/machine_kexec.c 2007-08-22 04:17:19.000000000 +0900
@@ -82,6 +82,7 @@ static void __machine_reboot_kexec(void
     smp_send_stop();
+    lapic_shutdown();
 #ifdef CONFIG_X86_IO_APIC
     disable_IO_APIC();
 #endif

The results are below.

- aic7xxx
  CTL+ALT+SysRq+c               1 FAIL/1 TRY)
  echo c >/proc/sysrq-trigger   1 FAIL/1 TRY)
  NMI                           1 FAIL/1 TRY)
- megaraid_mbox
  CTL+ALT+SysRq+c               1 SUCCESS/1 TRY)
  echo c >/proc/sysrq-trigger   1 SUCCESS/1 TRY)
  NMI

Is this really specific to Xen? It is quite possible that the problem is caused by kexec booting into the kdump kernel while the HW is in a state that the kernel is not ready to reinitialise from. And I'd expect such a condition to be reproducible on baremetal too; does kdump always work if you are doing all the IO load from a single baremetal kernel?

I'll find out - it was reported to me as only affecting xen. I've not verified this myself yet as I don't have the hardware & was asked to get this into BZ asap as it's been raised as a 5.1 blocker. Will pass the query over to the reporter and find out if the TAM has gotten hardware to reproduce on yet.

This is reported as easily reproducible on xen kernels but applying the same test to the non-xen builds has not yet triggered the problem. It may just be that it's easier to reproduce on xen - the partner is continuing to test & we're looking for hardware in-house to reproduce on. Also confirmed as only triggered via NMI/keyboard sysrq - the reporter has not been able to trigger a failed dump via the proc sysrq interface.

The old IDE code has no support for recovering from this kind of mess. IFF we are not in the middle of an I/O then forcing the control registers to PIO 0 might help (current libata does this). If you abort mid transaction then you will need to perform an SRST and initialisation sequence on the devices which old IDE only partially understands. If there is data pending then some devices/controllers won't recover, or will recover only if you drain it off.

Created attachment 294963 [details]
uploaded patch fixing kdump initscript to pass hd?=noprobe,cdrom
Either I don't fully understand what the patch is trying to do, or it's a bit broken... At the first sign of a $DRIVE not already in KDUMP_COMMANDLINE, the routine exits, which doesn't seem to be what we're after. Automagic configuration would want to keep going if the drive isn't already handled, no?

I've got a slightly reworked version I'll attach in a sec that does what I thought it was we were trying to accomplish here... (tested successfully on an x86_64 system w/a SATA HD and an IDE CD-ROM) Not sure what would happen if the boot volume was IDE though... This definitely needs a bit more thought/explanation/testing before I'm comfortable putting it into kexec-tools.

Created attachment 299953 [details]
Different approach, assuming we want to set any hdX=* options not already set
thanks Jarod, added the following to "Known Issues" of RHEL5.2 release notes:

<quote>
If a system configured for kdump encounters a kernel panic while an IDE device is performing I/O, the system may be unable to successfully boot into the kdump environment. This occurs if the IDE device is controlled by a device driver other than libata, and is caused by a bug in the IDE/ATA driver stack.

To work around this, use the kdump command-line argument hdX=noprobe for storage devices and hdX=cdrom for optical drives.
</quote>

please advise (before April 15) if any revisions are required. thanks!

> use the kdump command-line argument hdX=noprobe for storage
> devices and hdX=cdrom for optical drives.
Did this mean adding the argument to KDUMP_COMMANDLINE_APPEND of
/etc/sysconfig/kdump?
This event sent from IssueTracker by mmatsuya
issue 130241
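For illustration, here is a minimal sketch of what the release-note workaround could look like if the arguments do go into KDUMP_COMMANDLINE_APPEND, as the question above supposes. The device letters are assumptions made for the example (hda as an optical drive, hdb as an IDE disk that is not the dump target), and "irqpoll maxcpus=1" only stands in for whatever options are already set on the system.

```shell
# /etc/sysconfig/kdump (fragment, illustrative only)
# Assumption: hda is an optical drive and hdb is an IDE disk that is NOT
# needed to store the vmcore. "irqpoll maxcpus=1" stands in for whatever
# options are already present; keep the existing ones and append hdX=.
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 hda=cdrom hdb=noprobe"
```

The capture kernel is loaded when the kdump service starts, so a change like this would presumably only take effect after the service is restarted.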
Yes, these should be added to the KDUMP_COMMANDLINE_APPEND option in /etc/sysconfig/kdump.

Another thing that perhaps needs to be noted/tested/clarified... What if the non-libata IDE device that was busy is the hard disk you've set up to capture your vmcores? (I don't actually know the answer, though I swear I've captured a vmcore to an IDE disk before...)

revising 2nd paragraph of release note:

<quote>
To work around this, use the kdump command-line argument hd[X]=noprobe for storage devices and hd[X]=cdrom for optical drives, where [X] is the device identifier. Either command-line argument should be added to KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump.
</quote>

please advise before April 15 if any further revisions are required. thanks!

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Have we received any feedback for the proposed patches attached to this bug? In particular, I'd like to know if the patch in comment #31 satisfies the need here. The patch in comment #22 seemed incorrect to me when I last looked at it in detail, and I never heard back whether I was just missing something, or if it was indeed not correct...

Bug 254163 (kdump under kernel-xen fails with I/O load on aic7xxx or megaraid_mbox) does not look testable here. Masahiro, Andrius, Flavio or anyone, could you ask Fujitsu to try any version of kexec-tools equal to or later than 1.102pre-39.el5 to see if it fixes their problem?

I am currently testing the kexec-tools advisory for RHEL 5.3, but I do not have the hardware to reproduce this problem. If I understand correctly, it needs a CDROM with media inserted, an IDE disk, and a SysRq-C issued via keyboard (not via the /proc interface). If so, it is totally impossible for me to test it using remote RHTS machines.

From FJ:
--
Hi,
Yes, if you want to reproduce this issue, we need to insert a media to DVD-ROM drive. And we need to access it busily before dumping. Thus some DVD drives don't cause this issue.
Best Regards,
Akio Takebe
--
This event sent from IssueTracker by moshiro
issue 130241

What is the rate of failure? I have tried several times with the following steps, but have not been able to reproduce the problem.

# rpm -q kexec-tools
kexec-tools-1.102pre-21.el5
# dmesg
...
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ESB2: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 21 (level, low) -> IRQ 49
ESB2: chipset revision 9
ESB2: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x2080-0x2087, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0x2088-0x208f, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: DV-28E-N, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 34
Probing IDE interface ide1...
Probing IDE interface ide1...
Probing IDE interface ide2...
Probing IDE interface ide3...
...

I inserted a DVD into the drive, and ran the following command in two terminals,

while :; do dd if=/dev/cdrom of=/dev/null; done

Then, I issued Alt-SysRq-C via keyboard. However, VMCores were generated without any issue.

# ls -l /var/crash/127.0.0.1-2008-11-27-04\:*
/var/crash/127.0.0.1-2008-11-27-04:14:50:
total 83880
-rw------- 1 root root 85824162 Nov 27 04:15 vmcore

/var/crash/127.0.0.1-2008-11-27-04:21:13:
total 81804
-rw------- 1 root root 83704021 Nov 27 04:21 vmcore

In addition, the fix in the latest kexec-tools is doing this right now.

function avoid_cdrom_drive()
{
    local DRIVE=""
    local MEDIA=""
    local IDE_DRIVES=(`echo hd{a,b,c,d}`)
    local COUNTER="0"

    for DRIVE in ${IDE_DRIVES[@]}
    do
        if ! $(echo "$KDUMP_COMMANDLINE" |grep -q "$DRIVE=");then
            if [ -f /proc/ide/$DRIVE/media ];then
                MEDIA=$(cat /proc/ide/$DRIVE/media)
                if [ x"$MEDIA" == x"cdrom" ]; then
                    KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=cdrom"
                    COUNTER=$(($COUNTER+1))
                fi
            fi
        fi
    done
    # We don't find cdrom drive.
    if [ $COUNTER -eq 0 ]; then
        KDUMP_IDE_NOPROBE_COMMANDLINE=""
    fi
}

The final kexec arguments will be something like,

/sbin/kexec -p '--command-line=BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-124.el5 rhgb quiet root=LABEL=/ ro irqpoll maxcpus=1 reset_devices hda=cdrom' --initrd=/boot/efi/efi/redhat/initrd-2.6.18-124.el5kdump.img /boot/efi/efi/redhat/vmlinuz-2.6.18-124.el5

Is that what you need? I ask because it differs from the one mentioned in the release note, i.e. "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe". Since I can't test it, I would like to check with you about it.

I have just come across this from the issue tracker.

-----------------------------------------
Event posted 11-27-2008 05:59am EST by asakai

Hi,
Thank you for your testing. The rate that I could reproduce this issue is 100%. To avoid this issue, we need "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe" options. But this option may cause other problem. For example, some machines which use ide disks at kdumping cannot do kdump to the disks. I just worry about that. Fujitsu doesn't have such a server, but other vendor may have such a server. What do you think about it?
Best Regards,
Akio Takebe
----------------------------------------

To answer the above question, I am not a developer, but the one who verifies whether the agreed fix has already been included in RHEL 5.3. So, I am not the best person to tell you if the fix is the CORRECT solution. In other words, I care more about whether Fujitsu agrees with the already committed fix or not.
If so, I'll mark this bug as verified, and it will be in the RHEL 5.3 release as is. Otherwise, if you don't think it is the RIGHT solution for you at this point, we'll need to ask the developer to re-work it.

From FJ:
---
Hi,
If other vendors don't complain, automatically adding "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe" options is the correct solution, I think. But if not, just adding the article to avoid this bug in release notes is OK, I think.
Best Regards,
Akio Takebe
---
This event sent from IssueTracker by moshiro
issue 130241

OK, so the current patch is doing this, automatically adding "hda=cdrom" only, and no more release note. Does it sound like the right solution for Fujitsu?

From the issue tracker:
---
Hi,
The patch of BZ#254163 Comment #31 is right solution. hdX=noprobe options are also needed. e.g. "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe"
Best Regards,
Akio Takebe
---

Thanks, I'll ping the developer about this.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A bug in the IDE/ATA driver stack that could prevent a system using kernel-xen from booting into the kdump environment is now fixed. In previous releases, this occurred if the system encountered a kernel panic while an IDE device was performing I/O and the IDE device was being controlled by a device driver other than libata.

*** Bug 473852 has been marked as a duplicate of this bug. ***

From FJ:
---
I cannot edit the BZ directly. This issue occurred on both native and xen environments. And the current patch adds only hda=cdrom option, not add hdX=noprobe option. So this issue is not fixed yet.
Best Regards,
Akio Takebe
---
This event sent from IssueTracker by moshiro
issue 130241

Fujitsu, the new re-work patch is doing the following.
Adding "hdX=cdrom" to the existing CDROM devices, and "hdX=noprobe" to the existing storage devices. For example, if the system has hda (CDROM) and hdb (IDE disk), it will add "hda=cdrom hdb=noprobe". If the system has hda (CDROM) only, it will add "hda=cdrom" only. Here is the patch,

+function avoid_cdrom_drive()
+{
+    local DRIVE=""
+    local MEDIA=""
+    local IDE_DRIVES=(`echo hd{a,b,c,d}`)
+    local COUNTER="0"
+
+    for DRIVE in ${IDE_DRIVES[@]}
+    do
+        if ! $(echo "$KDUMP_COMMANDLINE" |grep -q "$DRIVE=");then
+            if [ -f /proc/ide/$DRIVE/media ];then
+                MEDIA=$(cat /proc/ide/$DRIVE/media)
+                if [ x"$MEDIA" == x"cdrom" ]; then
+                    KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=cdrom"
+                    COUNTER=$(($COUNTER+1))
+                else
+                    KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=noprobe"
+                fi
+            fi
+        fi
+    done
+    # We don't find cdrom drive.
+    if [ $COUNTER -eq 0 ]; then
+        KDUMP_IDE_NOPROBE_COMMANDLINE=""
+    fi
+}
+
 # Load the kdump kerel specified in /etc/sysconfig/kdump
 # If none is specified, try to load a kdump kernel with the same version
 # as the currently running kernel.
@@ -226,6 +267,8 @@
     KDUMP_COMMANDLINE=`echo $KDUMP_COMMANDLINE | sed -e 's/crashkernel=[0-9]\+[MmKkGg]@[0-9]\+[MmGgKk]//'`
     KDUMP_COMMANDLINE="${KDUMP_COMMANDLINE} ${KDUMP_COMMANDLINE_APPEND}"
+    avoid_cdrom_drive
+    KDUMP_COMMANDLINE="${KDUMP_COMMANDLINE} ${KDUMP_IDE_NOPROBE_COMMANDLINE}"
     KEXEC_OUTPUT=`$KEXEC $KEXEC_ARGS $standard_kexec_args \
         --command-line="$KDUMP_COMMANDLINE" \

Do you agree with this fix? Can you try the latest kexec-tools-1.102pre-52.el5 to see if it fixes your problem?

From FJ:
---
Hi,
I haven't tried it yet. But I suspect kdump would not work with it because it doesn't add hdX=noprobe option for not-existing hdX. When I checked whether kdump works with only hda=noprobe, the system had a CDROM device and didn't have any other IDE device. At that time, kdump could not work with hda=cdrom. But kdump could work with hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe.
Best Regards,
Akio Takebe
---
This event sent from IssueTracker by moshiro
issue 130241

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html

*** Bug 469608 has been marked as a duplicate of this bug. ***
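
For readers following the thread, the variant Fujitsu asked for (hdX=cdrom for optical drives and hdX=noprobe for every other legacy slot, including slots with no device) can be sketched as below. This is not the code shipped in kexec-tools. The function name build_ide_options and its two parameters are hypothetical, and the /proc/ide root is passed in only so that the logic can be exercised against a scratch directory.

```shell
# Hypothetical helper, NOT part of kexec-tools: build hdX= options for all
# four legacy IDE slots (hda..hdd).
#   $1: root of the IDE proc tree (assumed default: /proc/ide)
#   $2: kernel command line assembled so far
build_ide_options() {
    local proc_root="${1:-/proc/ide}"
    local cmdline="${2:-}"
    local drive media opts=""

    for drive in hda hdb hdc hdd; do
        # Leave alone any slot the administrator has already configured.
        case " $cmdline " in
            *" $drive="*) continue ;;
        esac

        media=""
        if [ -r "$proc_root/$drive/media" ]; then
            media=$(cat "$proc_root/$drive/media")
        fi

        if [ "$media" = "cdrom" ]; then
            opts="$opts $drive=cdrom"
        else
            # Covers IDE disks and empty slots alike.
            opts="$opts $drive=noprobe"
        fi
    done

    # Strip the leading space left by the first append.
    echo "${opts# }"
}
```

As raised earlier in the thread, marking every non-optical slot noprobe would also hide an IDE disk that is meant to receive the vmcore, so a real implementation would presumably need to exclude the dump target. A caller could append the result with something like KDUMP_COMMANDLINE="$KDUMP_COMMANDLINE $(build_ide_options /proc/ide "$KDUMP_COMMANDLINE")".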