Bug 503960

Summary: System freezes when removing ipr driver after injecting EEH errors
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: ppc64
OS: All
Status: CLOSED ERRATA
Severity: medium
Priority: low
Target Milestone: rc
Reporter: IBM Bug Proxy <bugproxy>
Assignee: Ameet Paranjape <aparanja>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: aparanja, dzickus, peterm
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:53:36 UTC

Attachments:
Backport of the patch sent to mainstream (flags: none)

Description IBM Bug Proxy 2009-06-03 14:51:08 UTC
=Comment: #0=================================================
Kleber Sacilotto De Souza <klebers.ibm.com> - 
---Problem Description---
The system hangs when trying to remove the ipr module after the sixth EEH error has been injected.
 
Contact Information = Kleber Sacilotto de Souza <klebers.com> 
 
---Additional Hardware Info---
One Cadet-X adapter (572A), two Squib-E adapters (574E), and six HUS151473VLS300 HDDs connected to one of the Squib-E adapters.

 
---uname output---
Linux devl4e-hickory-lp2 2.6.27.19-5-ppc64 #1 SMP 2009-02-28 04:40:21 +0100 ppc64 ppc64 ppc64 GNU/Linux
 
Machine Type = 9117-MMA 
 
---System Hang---
The system doesn't respond to the HMC console and doesn't accept new SSH connections. The system
must be rebooted to recover it.
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
1) Generate I/O requests to a disk connected to an IOA:
devl4e-hickory-lp2:~ # dd if=/dev/sde of=foo bs=1M count=1K

2) Inject EEH errors on the bus while the I/O is being performed:
devl4e-hickory-lp2:~ # errinjct eeh -v -f 0 -p U789D.001.DQDTTPP-P1-C1 -a 0xffec0000 -m 0xfffc0000

3) Repeat steps 1 and 2 six times. The system will disable the device.

4) Try to remove the ipr module. The module will be in use, so only 'rmmod -f' can be used:

devl4e-hickory-lp2:~ # lsmod
Module                  Size  Used by
<snipped>
ipr                   116488  1
<snipped>

devl4e-hickory-lp2:~ # rmmod ipr
ERROR: Module ipr is in use

devl4e-hickory-lp2:~ # rmmod -f ipr
<HANGS HERE>
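
The steps above can be sketched as a small script. This is only an illustration of the sequence, not a tested reproducer: the DRY_RUN switch, the variable names and the repro function are assumptions added here, while the dd, errinjct and rmmod invocations are copied from the steps. By default it only prints the commands; with DRY_RUN=0 it starts the read in the background so that the injection overlaps the I/O.

```shell
#!/bin/sh
# Sketch of the reproduction sequence. DISK, LOC, COUNT and DRY_RUN are
# illustrative names (assumptions), defaulting to the values in the report.
DISK=${DISK:-/dev/sde}
LOC=${LOC:-U789D.001.DQDTTPP-P1-C1}
COUNT=${COUNT:-6}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands only, 0 = execute them

repro() {
    i=1
    while [ "$i" -le "$COUNT" ]; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "dd if=$DISK of=foo bs=1M count=1K &"
            echo "errinjct eeh -v -f 0 -p $LOC -a 0xffec0000 -m 0xfffc0000"
            echo "wait"
        else
            # Step 1: start the read in the background.
            dd if="$DISK" of=foo bs=1M count=1K &
            # Step 2: inject the EEH error while the read is in flight.
            errinjct eeh -v -f 0 -p "$LOC" -a 0xffec0000 -m 0xfffc0000
            # Let dd finish (or fail) before the next round.
            wait
        fi
        i=$((i + 1))
    done
    # Step 4: forced removal; this is where the hang was observed.
    if [ "$DRY_RUN" = 1 ]; then
        echo "rmmod -f ipr"
    else
        rmmod -f ipr
    fi
}

repro
```

Running it with DRY_RUN=0 on an affected kernel is expected to end in the hang described here, so it should only be used on a test partition.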

Log messages:
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: Call Trace:
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3ba0] [c000000000010f4c] .show_stack+0x6c/0x16c (unreliable)
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3c50] [c00000000005582c] .eeh_dn_check_failure+0x30c/0x35c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3d00] [c000000000055958] .eeh_check_failure+0xdc/0x104
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3d80] [d0000000000c3e74] .ipr_isr+0xe4/0x474 [ipr]
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3e50] [c0000000000e2754] .handle_IRQ_event+0xd0/0x190
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3ef0] [c0000000000e4c1c] .handle_fasteoi_irq+0x118/0x1d0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c000000001fa3f90] [c00000000002ab78] .call_handle_irq+0x1c/0x2c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abab0] [c00000000000d57c] .do_IRQ+0x100/0x1c4
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abb50] [c000000000004d18] hardware_interrupt_entry+0x18/0x1c
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: --- Exception: 501 at .raw_local_irq_restore+0x70/0x80
Apr  2 14:56:56 devl4e-hickory-lp2 kernel:     LR = .cpu_idle+0x108/0x1a4
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abed0] [c0000000005215a8] .start_secondary+0x358/0x398
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: [c0000000c61abf90] [c0000000000083c0] .start_secondary_prolog+0xc/0x10
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: Detected PCI bus error on device 0002:01:00.0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: PCI device at location=U789D.001.DQDTTPP-P1-C1-T1 driver=ipr pci addr=0002:01:00.0
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: has failed 6 times in the last hour and has been permanently disabled.
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: Please try reseating this device or replacing it.
Apr  2 14:56:56 devl4e-hickory-lp2 kernel: EEH: Unexpected state change 2, err=-7 dn=/pci@800000020000204/pci1014,0339@0
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: of node=/pci@800000020000204/pci1014,0339@0
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: PCI device/vendor: ffffffff
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: EEH: PCI cmd/status register: ffffffff
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: RTAS: event: 272, Type: Platform Error, Severity: 2
Apr  2 14:57:04 devl4e-hickory-lp2 kernel: ipr 0002:01:00.0: IOA taken offline - error recovery failed

 
---Kernel Component Data--- 
Stack trace output:
 no
 
Oops output:
 no
 
System Dump Info:
  The system is not configured to capture a system dump.
 
=Comment: #5=================================================
Kleber Sacilotto De Souza <klebers.ibm.com> - 

Backport of the patch sent to mainstream

This patch has been added to the upstream SCSI tree. It can be found here:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=6e145ad73987cfc8375e5396073dd2692e07bd15

This patch is scheduled to be pushed when the merge window opens for 2.6.31.


Thanks,
Kleber

Comment 1 IBM Bug Proxy 2009-06-03 14:51:14 UTC
Created attachment 346405 [details]
Backport of the patch sent to mainstream

Comment 2 IBM Bug Proxy 2009-06-08 13:11:26 UTC
------- Comment From ameet.com 2009-06-08 09:04 EDT-------
Business Justification:
Without this patch, when an EEH error occurs, native EEH recovery does not work properly
on the PCI-E SAS adapters. The device would still have to go offline and could
only be restarted through user intervention. This is a RAS issue on Power
systems.

The patch in this bug also requires the fix in RIT 304637.

Red Hat,

Please consider this under the exception process for RHEL 5.4.

Comment 5 RHEL Program Management 2009-06-08 21:31:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2009-06-11 15:37:41 UTC
The fix is included in kernel-2.6.18-153.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 8 IBM Bug Proxy 2009-06-12 14:30:48 UTC
------- Comment From ameet.com 2009-06-12 10:21 EDT-------
Bug successfully tested on kernel 2.6.18-153.el5.

Comment 9 IBM Bug Proxy 2009-07-01 20:30:40 UTC
------- Comment From klebers.ibm.com 2009-07-01 16:26 EDT-------
Confirmed as fixed on RHEL5.4 Beta.

Comment 11 errata-xmlrpc 2009-09-02 08:53:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html