Bug 503960 - System freezes when removing ipr driver after injecting EEH errors
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: ppc64 All
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ameet Paranjape
QA Contact: Red Hat Kernel QE team
Docs Contact:
Depends On:
Blocks:
Reported: 2009-06-03 10:51 EDT by IBM Bug Proxy
Modified: 2013-03-07 20:06 EST
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 04:53:36 EDT


Attachments
Backport of the patch sent to mainstream (530 bytes, text/plain)
2009-06-03 10:51 EDT, IBM Bug Proxy


External Trackers
Tracker: IBM Linux Technology Center
ID: 53080
Priority: None
Status: None
Summary: None
Last Updated: Never

Description IBM Bug Proxy 2009-06-03 10:51:08 EDT
=Comment: #0=================================================
Kleber Sacilotto De Souza <klebers@linux.vnet.ibm.com> - 
---Problem Description---
The system hangs when trying to remove the ipr module after the sixth EEH error has been injected.
 
Contact Information = Kleber Sacilotto de Souza <klebers@br.ibm.com> 
 
---Additional Hardware Info---
One Cadet-X adapter (572A), two Squib-E adapters (574E) and six HUS151473VLS300
HDD connected to one of the Squib-E adapters.  

 
---uname output---
Linux devl4e-hickory-lp2 2.6.27.19-5-ppc64 #1 SMP 2009-02-28 04:40:21 +0100 ppc64 ppc64 ppc64 GNU/Linux
 
Machine Type = 9117-MMA 
 
---System Hang---
The system doesn't respond on the HMC console and doesn't accept new SSH connections. It must
be rebooted to recover it.
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
1) Generate some I/O requests to a disk connected to an IOA:
devl4e-hickory-lp2:~ # dd if=/dev/sde of=foo bs=1M count=1K

2) Inject an EEH error on the bus while the I/O is in progress:
devl4e-hickory-lp2:~ # errinjct eeh -v -f 0 -p U789D.001.DQDTTPP-P1-C1 -a 0xffec0000 -m 0xfffc0000

3) Repeat steps 1 and 2 six times. The system will then disable the device.

4) Try to remove the ipr module. The module will be in use, so only 'rmmod -f' can be used:

devl4e-hickory-lp2:~ # lsmod
Module                  Size  Used by
<snipped>
ipr                   116488  1
<snipped>

devl4e-hickory-lp2:~ # rmmod ipr
ERROR: Module ipr is in use

devl4e-hickory-lp2:~ # rmmod -f ipr
<HANGS HERE>

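The reproduction steps above could be scripted roughly as follows. This is only a sketch and not part of the original report: the disk, location code and error-injection arguments are the ones quoted in the steps, while the DRY_RUN switch and the function names run and inject_eeh_errors are made up for illustration. It defaults to printing the commands instead of running them, because on an affected kernel the real sequence followed by 'rmmod -f ipr' is expected to hang the system.

```shell
#!/bin/sh
# Sketch of steps 1-3. Assumptions (not from the report): DRY_RUN, run(),
# inject_eeh_errors(). DISK and LOC default to the values in the report.
DISK=${DISK:-/dev/sde}
LOC=${LOC:-U789D.001.DQDTTPP-P1-C1}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (safe default)

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Repeat "start read I/O, inject one EEH error" COUNT times (default 6).
inject_eeh_errors() {
    count=${1:-6}
    i=1
    while [ "$i" -le "$count" ]; do
        # Step 1: start the read in the background so the error
        # can be injected while the I/O is still in flight.
        run dd if="$DISK" of=foo bs=1M count=1K &
        io_pid=$!
        # Step 2: inject the EEH error on the adapter's bus.
        run errinjct eeh -v -f 0 -p "$LOC" -a 0xffec0000 -m 0xfffc0000
        wait "$io_pid"
        i=$((i + 1))
    done
}
```

Calling inject_eeh_errors 6 with DRY_RUN=1 lists the twelve commands that would be issued. Step 4 (rmmod) is deliberately left as a manual step, and the sketch does not wait for EEH recovery to finish between injections.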
Log messages:
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: Call Trace:
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3ba0] [c000000000010f4c] .show_stack+0x6c/0x16c (unreliable)
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3c50] [c00000000005582c] .eeh_dn_check_failure+0x30c/0x35c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3d00] [c000000000055958] .eeh_check_failure+0xdc/0x104
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3d80] [d0000000000c3e74] .ipr_isr+0xe4/0x474 [ipr]
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3e50] [c0000000000e2754] .handle_IRQ_event+0xd0/0x190
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3ef0] [c0000000000e4c1c] .handle_fasteoi_irq+0x118/0x1d0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3f90] [c00000000002ab78] .call_handle_irq+0x1c/0x2c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abab0] [c00000000000d57c] .do_IRQ+0x100/0x1c4
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abb50] [c000000000004d18] hardware_interrupt_entry+0x18/0x1c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: --- Exception: 501 at .raw_local_irq_restore+0x70/0x80
Apr  2 14:56:56 devl4e-hickory-lp2 kernel:     LR = .cpu_idle+0x108/0x1a4
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abed0] [c0000000005215a8] .start_secondary+0x358/0x398
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abf90] [c0000000000083c0] .start_secondary_prolog+0xc/0x10
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: Detected PCI bus error on device 0002:01:00.0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: PCI device at location=U789D.001.DQDTTPP-P1-C1-T1 driver=ipr pci addr=0002:01:00.0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: has failed 6 times in the last hour and has been permanently disabled.
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: Please try reseating this device or replacing it.
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: Unexpected state change 2, err=-7 dn=/pci@800000020000204/pci1014,0339@0
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: of node=/pci@800000020000204/pci1014,0339@0
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: PCI device/vendor: ffffffff
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: PCI cmd/status register: ffffffff
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: RTAS: event: 272, Type: Platform Error, Severity: 2
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: ipr 0002:01:00.0: IOA taken offline - error recovery failed
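
To confirm from a saved log that EEH has permanently disabled a device (the precondition for the hang), the messages above can be scanned with a small filter. This is a sketch, not something taken from the report: the function name eeh_disabled_devices is hypothetical, and the patterns are based only on the "pci addr=", "has failed N times" and "permanently disabled" wording visible in the log excerpt.

```shell
# Print "<pci address> <failure count>" for every device that the log
# reports as permanently disabled by EEH. Reads the files given as
# arguments, or standard input when none are given.
# eeh_disabled_devices is a hypothetical helper name.
eeh_disabled_devices() {
    awk '
        # Remember the most recent PCI address mentioned by EEH.
        match($0, /pci addr=[0-9a-fA-F:.]+/) {
            addr = substr($0, RSTART + 9, RLENGTH - 9)
        }
        # Remember the most recent failure count.
        match($0, /has failed [0-9]+ times/) {
            n = substr($0, RSTART + 11, RLENGTH - 17)
        }
        # The disable notice may be on the same or a later line.
        /permanently disabled/ { print addr, n }
    ' "$@"
}
```

For example, 'eeh_disabled_devices /var/log/messages' could be run before attempting the rmmod; the log path is an assumption and depends on the syslog configuration.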

 
---Kernel Component Data--- 
Stack trace output:
 no
 
Oops output:
 no
 
System Dump Info:
  The system is not configured to capture a system dump.
 
=Comment: #5=================================================
Kleber Sacilotto De Souza <klebers@linux.vnet.ibm.com> - 

Backport of the patch sent to mainstream

This patch has been added to the upstream SCSI tree. It can be found here:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=6e145ad73987cfc8375e5396073dd2692e07bd15

This patch is scheduled to be pushed when the merge window opens for 2.6.31.


Thanks,
Kleber
Comment 1 IBM Bug Proxy 2009-06-03 10:51:14 EDT
Created attachment 346405
Backport of the patch sent to mainstream
Comment 2 IBM Bug Proxy 2009-06-08 09:11:26 EDT
------- Comment From ameet@austin.ibm.com 2009-06-08 09:04 EDT-------
Business Justification:
Without this patch, native EEH recovery does not work properly on the PCI-E SAS
adapters when an EEH error occurs. The device still goes offline and can only be
restarted with user intervention. This is a RAS issue on Power systems.

The patch in this bug also requires the fix in RIT 304637.

Red Hat,

Please consider this under the exception process for RHEL 5.4.
Comment 5 RHEL Product and Program Management 2009-06-08 17:31:12 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 Don Zickus 2009-06-11 11:37:41 EDT
The fix is in kernel-2.6.18-153.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 8 IBM Bug Proxy 2009-06-12 10:30:48 EDT
------- Comment From ameet@austin.ibm.com 2009-06-12 10:21 EDT-------
Bug successfully tested on kernel 2.6.18-153.el5.
Comment 9 IBM Bug Proxy 2009-07-01 16:30:40 EDT
------- Comment From klebers@linux.vnet.ibm.com 2009-07-01 16:26 EDT-------
Confirmed as fixed on RHEL5.4 Beta.
Comment 11 errata-xmlrpc 2009-09-02 04:53:36 EDT
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
