Bug 254163

Summary:

kdump under kernel-xen fails with I/O load on aic7xxx or megaraid_mbox

Product:

Red Hat Enterprise Linux 5

Reporter:

Bryn M. Reeves <bmr>

Component:

kexec-tools

Assignee:

Jarod Wilson <jarod>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

5.3

CC:

anderson, andriusb, coughlan, cward, ddomingo, dmair, martin.wilck, nhorman, qcai, syeghiay, tao, vgoyal, xen-maint

Target Milestone:

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

A bug in the IDE/ATA driver stack that could prevent a system using kernel-xen from booting into the kdump environment is now fixed. In previous releases, this occurred if the system encountered a kernel panic while an IDE device was performing I/O and the IDE device was being controlled by a device driver other than libata.

Story Points:

---

Clone Of:

Clones:

600600

Environment:

Last Closed:

2009-01-20 20:58:07 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

391221, 391501, 409971, 454962

Attachments:

Description	Flags
uploaded patch fixing kdump initscript to pass hd?=noprobe,cdrom	none
Different approach, assuming we want to set any hdX=* options not already set	none

Description Bryn M. Reeves 2007-08-24 15:15:45 UTC

Description of problem:
We tested kdump on a system with aic7xxx and megaraid_mbox.
If we put heavy I/O load on the guests, kdump under Xen did not work properly.
Without any load, kdump under Xen works.
With NMI or the SysRq key, this issue occurs.
But with "echo c >/proc/sysrq-trigger", it has not occurred yet.


Version-Release number of selected component (if applicable):
   Red Hat Enterprise Linux Version Number: RHEL5.1 beta
   Release Number: public beta
   Architecture: i686
   Kernel Version: kernel-xen-2.6.18-36.el5 + patch from bug 251341

How reproducible:
- aic7xxx
      CTL+ALT+SysRq+c             (5 FAIL    / 5 TRY)
      echo c >/proc/sysrq-trigger (5 SUCCESS / 5 TRY)
      NMI                         (5 FAIL    / 5 TRY)

- megaraid_mbox
      CTL+ALT+SysRq+c             (5 FAIL    / 5 TRY)
      echo c >/proc/sysrq-trigger (5 SUCCESS / 5 TRY)
      NMI                         (5 FAIL    / 5 TRY)


Steps to Reproduce:
1. Set up kdump for Xen (/etc/kdump.conf, /etc/sysconfig/kdump)
2. Add I/O load on two guests (by using many "dd" commands and so on)
3. Press CTL+ALT+SysRq+c, or the NMI button.
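
Step 2 can be sketched as a pair of shell helpers run inside each guest. This is only an illustration of "many dd commands": the function names, the default writer count and the per-pass size are assumptions, not part of the original test setup, and the scratch directory should sit on the disk behind the controller under test.

```shell
# Hypothetical helpers; not part of kexec-tools or the original report.
IO_LOAD_PIDS=""

# start_io_load DIR [WRITERS] [MB_PER_PASS]
# Starts WRITERS background loops, each rewriting its own scratch file.
start_io_load() {
    dir=$1
    writers=${2:-4}
    mb=${3:-64}
    i=0
    while [ "$i" -lt "$writers" ]; do
        (
            # Each writer keeps rewriting its file until it is killed.
            while :; do
                dd if=/dev/zero of="$dir/load.$i" bs=1024k count="$mb" 2>/dev/null
            done
        ) >/dev/null &
        IO_LOAD_PIDS="$IO_LOAD_PIDS $!"
        i=$((i + 1))
    done
}

# stop_io_load: kill every writer loop started by start_io_load.
stop_io_load() {
    for pid in $IO_LOAD_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    wait
    IO_LOAD_PIDS=""
}
```

For example, `start_io_load /var/tmp 8 256` in each guest, followed by the keyboard SysRq or NMI trigger from step 3 on the host. The crash trigger itself is deliberately left out of the sketch.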
  
Actual results:
When kdump fails with aic7xxx, the following messages are shown:
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: +Dsodr"
           =8G H, ATA DISK drive
hda: IRQ probe failed (0xfffffcfe)
hdb: IRQ probe failed (0xfffffcfe)
hdb: IRQ probe failed (0xfffffcfe)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: max request size: 128KiB
hda: 178746828 sectors (91518 MB) w/10067KiB Cache, CHS=24379/78/94
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: 0xef
hda: drive not ready for command
hda: cache flushes supported
hda: INVALID GEOMETRY: 78 PHYSICAL HEADS?
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: 0xde
hda: drive not ready for command
hda:hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hda: drive not ready for command
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hda: drive not ready for command
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hda: drive not ready for command
hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hda: drive not ready for command
ide0: reset timed-out, status=0x95
hda: status timeout: status=0x95 { Busy }
ide: failed opcode was: unknown

Expected results:
Correct vmcore generated

Additional info:
So far this has only produced a failure testing via the physical NMI button or a
keyboard invoked sysrq-c. Triggering an oops via /proc/sysrq-trigger has NOT
resulted in a kdump failure.

Is it possible the device state isn't being reset properly when the oops/panic
takes place in interrupt context?

Comment 2 Bryn M. Reeves 2007-08-24 15:18:30 UTC

Original reporter tested with the following patch.
  
   diff -uNrp xen.orig/arch/x86/machine_kexec.c xen/arch/x86/machine_kexec.c
   --- xen.orig/arch/x86/machine_kexec.c   2007-05-03 16:40:19.000000000 +0900
   +++ xen/arch/x86/machine_kexec.c        2007-08-22 04:17:19.000000000 +0900
   @@ -82,6 +82,7 @@ static void __machine_reboot_kexec(void
  
        smp_send_stop();
  
   +    lapic_shutdown();
   #ifdef CONFIG_X86_IO_APIC
        disable_IO_APIC();
   #endif

   The results are below.
    - aic7xxx
      CTL+ALT+SysRq+c             (1 FAIL / 1 TRY)
      echo c >/proc/sysrq-trigger (1 FAIL / 1 TRY)
      NMI                         (1 FAIL / 1 TRY)

    - megaraid_mbox
      CTL+ALT+SysRq+c             (1 SUCCESS / 1 TRY)
      echo c >/proc/sysrq-trigger (1 SUCCESS / 1 TRY)
      NMI

Comment 3 Stephen Tweedie 2007-08-24 15:31:58 UTC

Is this really specific to Xen?  It is quite possible that the problem is caused
by kexec booting into the kdump kernel while the HW is in a state that the
kernel is not ready to reinitialise from.  And I'd expect such a condition to be
reproducible on baremetal too; does kdump always work if you are doing all the
IO load from a single baremetal kernel?

Comment 4 Bryn M. Reeves 2007-08-24 17:09:58 UTC

I'll find out - it was reported to me as only affecting xen. I've not verified
this myself yet as I don't have the hardware & was asked to get this into BZ
asap as it's been raised as a 5.1 blocker. Will pass the query over to the
reporter and find out if the TAM has gotten hardware to reproduce on yet.

Comment 5 Bryn M. Reeves 2007-08-28 08:58:54 UTC

This is reported as easily reproducible on xen kernels but applying the same
test to the non-xen builds has not yet triggered the problem. It may just be
that it's easier to reproduce on xen - the partner is continuing to test & we're
looking for hardware in-house to reproduce on.

Comment 6 Bryn M. Reeves 2007-08-28 09:02:03 UTC

Also confirmed as only triggered via NMI/keyboard sysrq - the reporter has not
been able to trigger a failed dump via the proc sysrq interface.

Comment 17 Alan Cox 2008-01-16 21:42:00 UTC

The old IDE code has no support for recovering from this kind of mess. 

IFF we are not in the middle of an I/O then forcing the control registers to PIO
0 might help (current libata does this). If you abort mid transaction then you
will need to perform an SRST and initialisation sequence on the devices which
old IDE only partially understands. If there is data pending then some
devices/controllers won't recover, or will recover only if you drain it off.

Comment 22 Flavio Leitner 2008-02-15 02:01:35 UTC

Created attachment 294963
uploaded patch fixing kdump initscript to pass hd?=noprobe,cdrom

Comment 30 Jarod Wilson 2008-04-01 19:37:43 UTC

Either I don't fully understand what the patch is trying to do, or it's a bit
broken... At the first sign of a $DRIVE not already in KDUMP_COMMANDLINE, the
routine exits, which doesn't seem to be what we're after. Automagic
configuration would want to keep going if the drive isn't already handled, no?

I've got a slightly reworked version I'll attach in a sec that does what I
thought it was we were trying to accomplish here... (tested successfully on an
x86_64 system w/a SATA HD and an IDE CD-ROM) Not sure what would happen if the
boot volume was IDE though... This definitely needs a bit more
thought/explanation/testing before I'm comfortable putting it into kexec-tools.

Comment 31 Jarod Wilson 2008-04-01 19:39:27 UTC

Created attachment 299953
Different approach, assuming we want to set any hdX=* options not already set

Comment 39 Don Domingo 2008-04-07 22:47:55 UTC

thanks Jarod, added the following to "Known Issues" of RHEL5.2 release notes:

<quote>
If a system configured for kdump encounters a kernel panic while an IDE device
is performing I/O, the system may be unable to successfully boot into the kdump
environment. This occurs if the IDE device is controlled by a device driver
other than libata, and is caused by a bug in the IDE/ATA driver stack.

To work around this, use the kdump command-line argument hdX=noprobe for storage
devices and hdX=cdrom for optical drives.
</quote>

please advise (before April 15) if any revisions are required. thanks!

Comment 40 Issue Tracker 2008-04-08 03:11:28 UTC

> use the kdump command-line argument hdX=noprobe for storage
> devices and hdX=cdrom for optical drives.

Did this mean adding the argument to KDUMP_COMMANDLINE_APPEND of
/etc/sysconfig/kdump?


This event sent from IssueTracker by mmatsuya 
 issue 130241

Comment 41 Jarod Wilson 2008-04-08 03:57:25 UTC

Yes, these should be added to the KDUMP_COMMANDLINE_APPEND option in
/etc/sysconfig/kdump.
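
As an illustration only, the workaround might look like the fragment below. The drive layout (hda as the optical drive, no other IDE device used for dumping) is an assumption, and the leading "irqpoll maxcpus=1 reset_devices" stands in for whatever options the file already carries; keep the existing value and append the hdX entries to it.

```shell
# /etc/sysconfig/kdump (fragment)
# Assumed layout: hda is the CD/DVD drive; hdb..hdd must not be probed.
# Keep the options already present and append the hdX entries.
KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 reset_devices hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe"
```

The kdump service has to reload the crash kernel before the new command line takes effect, for example with "service kdump restart".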

Another thing that perhaps needs to be noted/tested/clarified... What if the
busy non-libata IDE device that was busy is the hard disk you've set up to
capture your vmcores? (I don't actually know the answer, though I swear I've
captured a vmcore to an IDE disk before...)

Comment 42 Don Domingo 2008-04-08 23:16:41 UTC

revising 2nd paragraph of release note:

<quote>
To work around this, use the kdump command-line argument hd[X]=noprobe for
storage devices and hd[X]=cdrom for optical drives, where [X] is the device
identifier. Either command-line argument should be added to
KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump.
</quote>

please advise before April 15 if any further revisions are required. thanks!

Comment 44 RHEL Program Management 2008-06-02 20:33:11 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 45 Jarod Wilson 2008-07-09 20:00:15 UTC

Have we received any feedback for the proposed patches attached to this bug? In
particular, I'd like to know if the patch in comment #31 satisfies the need
here. The patch in comment #22 seemed incorrect to me when I last looked at it
in detail, and I never heard back whether I was just missing something, or if it
was indeed not correct...

Comment 54 Qian Cai 2008-11-27 03:36:14 UTC

Bug 254163 (kdump under kernel-xen fails with I/O load on aic7xxx or megaraid_mbox) does not look testable here.

Masahiro, Andrius, Flavio or anyone, could you ask Fujitsu to try any version of kexec-tools equal to or later than 1.102pre-39.el5 to see if it fixes their problem? I am currently testing the kexec-tools advisory for RHEL 5.3, but I do not have the hardware to reproduce this problem. If I understand correctly, it needs a CD-ROM drive with media inserted, an IDE disk, and a SysRq-C issued via the keyboard (not via the /proc interface). If so, it is impossible for me to test it using remote RHTS machines.

Comment 55 Issue Tracker 2008-11-27 07:19:57 UTC

From FJ:
--
Hi,

Yes, to reproduce this issue,
media must be inserted in the DVD-ROM drive,
and the drive must be accessed heavily before dumping.
Thus some DVD drives don't cause this issue.

Best Regards,

Akio Takebe 
--


This event sent from IssueTracker by moshiro 
 issue 130241

Comment 56 Qian Cai 2008-11-27 10:05:42 UTC

What is the failure rate? I have tried several times with the following steps, but have not been able to reproduce the problem.

# rpm -q kexec-tools
kexec-tools-1.102pre-21.el5

# dmesg
...
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ESB2: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 21 (level, low) -> IRQ 49
ESB2: chipset revision 9
ESB2: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x2080-0x2087, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0x2088-0x208f, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: DV-28E-N, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 34
Probing IDE interface ide1...
Probing IDE interface ide1...
Probing IDE interface ide2...
Probing IDE interface ide3...
...

I inserted a DVD into the drive and ran the following command in two terminals:

while :; do dd if=/dev/cdrom of=/dev/null; done

Then, I issued Alt-SysRq-C via keyboard. However, VMCores were generated without any issue.

# ls -l /var/crash/127.0.0.1-2008-11-27-04\:*
/var/crash/127.0.0.1-2008-11-27-04:14:50:
total 83880
-rw------- 1 root root 85824162 Nov 27 04:15 vmcore

/var/crash/127.0.0.1-2008-11-27-04:21:13:
total 81804
-rw------- 1 root root 83704021 Nov 27 04:21 vmcore

Comment 57 Qian Cai 2008-11-27 10:34:05 UTC

In addition, the fix in the latest kexec-tools currently does this:

function avoid_cdrom_drive()
{
        local DRIVE=""
        local MEDIA=""
        local IDE_DRIVES=(`echo hd{a,b,c,d}`)
        local COUNTER="0"

        for DRIVE in ${IDE_DRIVES[@]}
        do
                if ! $(echo "$KDUMP_COMMANDLINE" |grep -q "$DRIVE=");then
                        if [ -f /proc/ide/$DRIVE/media ];then
                                MEDIA=$(cat /proc/ide/$DRIVE/media)
                                if [ x"$MEDIA" == x"cdrom" ]; then
                                        KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=cdrom"
                                        COUNTER=$(($COUNTER+1))
                                fi
                        fi
                fi
        done
        # We don't find cdrom drive.
        if [ $COUNTER -eq 0 ]; then
                KDUMP_IDE_NOPROBE_COMMANDLINE=""
        fi
}

The final kexec arguments will be something like:

/sbin/kexec -p '--command-line=BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-124.el5 rhgb quiet root=LABEL=/  ro irqpoll maxcpus=1 reset_devices  hda=cdrom' --initrd=/boot/efi/efi/redhat/initrd-2.6.18-124.el5kdump.img /boot/efi/efi/redhat/vmlinuz-2.6.18-124.el5

Is that what you need? I ask because it differs from the one mentioned in the release note, i.e. "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe". Since I can't test it, I would like to check with you about it.

Comment 58 Qian Cai 2008-11-28 04:22:31 UTC

I have just come across this from the issue tracker.

-----------------------------------------
Event posted 11-27-2008 05:59am EST by asakai
 	
Hi,

Thank you for your testing.
The rate that I could reproduce this issue is 100%.
To avoid this issue, we need "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe" options.

But these options may cause other problems.
For example, machines that dump to IDE disks may be unable to write the dump to those disks.
That is my only worry. Fujitsu doesn't have such a
server, but other vendors may have one.
What do you think about it?

Best Regards,

Akio Takebe
----------------------------------------

To answer the above question: I am not a developer. My role is to verify that the agreed fix is included in RHEL 5.3, so I am not the best person to say whether the fix is the CORRECT solution. What I need to know is whether Fujitsu agrees with the already committed fix. If so, I'll mark this bug as verified, and it will ship in the RHEL 5.3 release as is. Otherwise, if you don't think it is the RIGHT solution for you at this point, we'll need to ask the developer to rework it.

Comment 59 Issue Tracker 2008-11-28 04:52:12 UTC

From FJ:
---
Hi,

If other vendors don't complain, automatically
adding "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe"
options is the correct solution, I think.
But if not, just adding the article to avoid this bug
in release notes is OK, I think.

Best Regards,

Akio Takebe 
---


This event sent from IssueTracker by moshiro 
 issue 130241

Comment 60 Qian Cai 2008-11-28 05:05:14 UTC

OK, so the current patch does the following:

it automatically adds "hda=cdrom" only, and there is no longer a release note.

Does that sound like the right solution for Fujitsu?

Comment 61 Qian Cai 2008-11-28 05:26:56 UTC

From the issue tracker:
---
Hi,

The patch in BZ#254163 Comment #31 is the right solution.
The hdX=noprobe options are also needed.
e.g. "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe"

Best Regards,

Akio Takebe
---

Thanks, I'll ping the developer about this.

Comment 65 Don Domingo 2008-12-01 00:32:00 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
A bug in the IDE/ATA driver stack that could prevent a system using kernel-xen from booting into the kdump environment is now fixed. In previous releases, this occurred if the system encountered a kernel panic while an IDE device was performing I/O and the IDE device was being controlled by a device driver other than libata.

Comment 67 Neil Horman 2008-12-02 16:14:40 UTC

*** Bug 473852 has been marked as a duplicate of this bug. ***

Comment 72 Issue Tracker 2008-12-03 00:40:54 UTC

From FJ:
---
I cannot edit the BZ directly.
This issue occurred on both native and xen environments.
The current patch adds only the hda=cdrom option; it does not add the
hdX=noprobe options. So this issue is not fixed yet.

Best Regards,

Akio Takebe 
---


This event sent from IssueTracker by moshiro 
 issue 130241

Comment 74 Qian Cai 2008-12-03 08:11:46 UTC

Fujitsu, the reworked patch does the following.

It adds "hdX=cdrom" for existing CD-ROM devices and "hdX=noprobe" for existing storage devices. For example,

If the system has hda (CDROM), hdb (IDE disk), it will add "hda=cdrom hdb=noprobe".

If the system has hda (CDROM) only, it will add "hda=cdrom" only.

Here is the patch,

+function avoid_cdrom_drive()
+{
+ local DRIVE=""
+ local MEDIA=""
+ local IDE_DRIVES=(`echo hd{a,b,c,d}`)
+ local COUNTER="0"
+
+ for DRIVE in ${IDE_DRIVES[@]}
+ do
+  if ! $(echo "$KDUMP_COMMANDLINE" |grep -q "$DRIVE=");then
+   if [ -f /proc/ide/$DRIVE/media ];then
+    MEDIA=$(cat /proc/ide/$DRIVE/media)
+    if [ x"$MEDIA" == x"cdrom" ]; then
+     KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=cdrom"
+     COUNTER=$(($COUNTER+1))
+    else
+     KDUMP_IDE_NOPROBE_COMMANDLINE="$KDUMP_IDE_NOPROBE_COMMANDLINE $DRIVE=noprobe"
+    fi
+   fi
+  fi
+ done
+ # We don't find cdrom drive.
+ if [ $COUNTER -eq 0 ]; then
+  KDUMP_IDE_NOPROBE_COMMANDLINE=""
+ fi
+}
+
 # Load the kdump kerel specified in /etc/sysconfig/kdump
 # If none is specified, try to load a kdump kernel with the same version
 # as the currently running kernel.
@@ -226,6 +267,8 @@

  KDUMP_COMMANDLINE=`echo $KDUMP_COMMANDLINE | sed -e 's/crashkernel=[0-9]\+[MmKkGg]@[0-9]\+[MmGgKk]//'`
  KDUMP_COMMANDLINE="${KDUMP_COMMANDLINE} ${KDUMP_COMMANDLINE_APPEND}"
+ avoid_cdrom_drive
+ KDUMP_COMMANDLINE="${KDUMP_COMMANDLINE} ${KDUMP_IDE_NOPROBE_COMMANDLINE}"

  KEXEC_OUTPUT=`$KEXEC $KEXEC_ARGS $standard_kexec_args \
   --command-line="$KDUMP_COMMANDLINE" \


Do you agree with this fix? Can you try the latest kexec-tools-1.102pre-52.el5 to see if it fixes your problem?
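
For anyone without IDE hardware, the selection logic of the patch in this comment can be exercised against a scratch directory. The sketch below is a port of that logic, not the shipped initscript: the function name and the second argument (a stand-in for /proc/ide) are hypothetical, and it prints the options instead of setting KDUMP_IDE_NOPROBE_COMMANDLINE.

```shell
# Hypothetical port of the avoid_cdrom_drive logic posted above.
# $1: existing kdump command line
# $2: directory laid out like /proc/ide (assumed parameter, for testing)
ide_options_present_only() {
    cmdline=$1
    root=${2:-/proc/ide}
    opts=""
    cdroms=0
    for drive in hda hdb hdc hdd; do
        # Skip drives that already have an hdX= option.
        case "$cmdline" in
            *"$drive="*) continue ;;
        esac
        # Skip drives the running kernel does not list.
        [ -f "$root/$drive/media" ] || continue
        if [ "$(cat "$root/$drive/media")" = "cdrom" ]; then
            opts="$opts $drive=cdrom"
            cdroms=$((cdroms + 1))
        else
            opts="$opts $drive=noprobe"
        fi
    done
    # As in the posted patch, nothing is added when no CD-ROM was found.
    if [ "$cdroms" -eq 0 ]; then
        opts=""
    fi
    echo "${opts# }"
}
```

With a scratch tree containing hda/media set to "cdrom" and hdb/media set to "disk", the function prints "hda=cdrom hdb=noprobe". Because of the final reset, a system with IDE disks but no CD-ROM drive gets no options at all.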

Comment 75 Issue Tracker 2008-12-06 03:08:37 UTC

From FJ:
---
Hi,

I haven't tried it yet.
But I suspect kdump would not work with it,
because it doesn't add the hdX=noprobe option for non-existent hdX.
When I checked whether kdump works with only hda=noprobe,
the system had a CDROM device and didn't have any other IDE device. At
that time, kdump could not work with hda=cdrom.
But kdump could work with hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe.

Best Regards,

Akio Takebe 
---


This event sent from IssueTracker by moshiro 
 issue 130241
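
The behaviour requested in this comment, an option for every hdX slot whether or not a device is present, could be sketched as follows. This is a hypothetical illustration, not the contents of either attachment: the function name and the second argument (a stand-in for /proc/ide) are assumptions.

```shell
# Hypothetical sketch: emit an option for all four legacy IDE slots.
# $1: existing kdump command line
# $2: directory laid out like /proc/ide (assumed parameter, for testing)
ide_options_all_slots() {
    cmdline=$1
    root=${2:-/proc/ide}
    opts=""
    for drive in hda hdb hdc hdd; do
        # Leave alone any drive the administrator already configured.
        case "$cmdline" in
            *"$drive="*) continue ;;
        esac
        media=""
        if [ -f "$root/$drive/media" ]; then
            media=$(cat "$root/$drive/media")
        fi
        if [ "$media" = "cdrom" ]; then
            opts="$opts $drive=cdrom"
        else
            # Absent slots and non-optical drives alike are not probed.
            opts="$opts $drive=noprobe"
        fi
    done
    echo "${opts# }"
}
```

On a system with only a CD-ROM at hda this prints "hda=cdrom hdb=noprobe hdc=noprobe hdd=noprobe", the combination reported to work. The concern raised in Comment 58 still applies: an IDE disk that is the dump target would also receive noprobe unless its hdX= option is already set on the command line.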

Comment 84 errata-xmlrpc 2009-01-20 20:58:07 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html

Comment 86 Prarit Bhargava 2009-07-20 13:55:15 UTC

*** Bug 469608 has been marked as a duplicate of this bug. ***