Bug 511141 - qla2xxx - Provide fundamental reset capability for EEH
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: ppc64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Marcus Barrow
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA
Depends On:
Blocks: 460170
Reported: 2009-07-13 16:57 EDT by Marcus Barrow
Modified: 2009-09-03 09:44 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:29:41 EDT

Attachments
Add EEH callback to query device reset type (6.65 KB, patch)
2009-07-13 17:45 EDT, Richard A Lary
Add EEH callback to query device reset type to qla2xxx driver (968 bytes, application/octet-stream)
2009-07-13 17:48 EDT, Richard A Lary
Add EEH callback to query device reset type (6.66 KB, application/octet-stream)
2009-07-14 13:20 EDT, Richard A Lary
Add fndmntl_rst_rqd bitfield to arch/powerpc/kernel/pci_64.c (1.63 KB, application/octet-stream)
2009-07-15 01:15 EDT, Richard A Lary
Allows qla2xxx driver to set fndmntl_rst_rqd for specific device types (521 bytes, application/octet-stream)
2009-07-15 01:17 EDT, Richard A Lary
Allows qlge driver to set fndmntl_rst_rqd bit (435 bytes, application/octet-stream)
2009-07-15 01:18 EDT, Richard A Lary
rhel5 fundamental reset for EEH (5.61 KB, patch)
2009-07-15 16:39 EDT, Marcus Barrow
include the 4 Gb/S HBA's (5.84 KB, patch)
2009-07-17 08:44 EDT, Marcus Barrow


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2009:1243
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 04:53:34 EDT

Description Marcus Barrow 2009-07-13 16:57:07 EDT
The EEH code needs to be able to provide a fundamental reset to recover from some errors.

This ability to tolerate and recover from hardware errors is an important reliability improvement for customers. It can prevent systems from becoming unusable and protects customer access to their data.
Comment 1 Richard A Lary 2009-07-13 17:42:43 EDT
Native device driver support for "Extended Error Handling" (EEH) allows I/O device drivers to recover from intermittent PCI bus errors by resetting and then restarting the device that experienced the PCI bus error.

IBM supports this feature for most I/O devices on all IBM platform models starting with Power 5.

IBM/QLogic have been working together to understand the root cause of an issue seen during recovery of EEH errors on QLogic PCIe adapters. QLogic has determined the QLogic PCIe adapters require
a fundamental reset following an EEH error in order to be fully recovered.

This reliability feature is a key differentiator for IBM PowerPC platforms.

IBM has developed two patches which will resolve the issue by providing a method for the device
driver to request that the kernel Power PC EEH driver issue a fundamental reset instead of a hot reset for devices which require one.  One patch is to the EEH kernel driver and EEH header files; the second patch is
for the qla2xxx driver, to report which reset type is required for devices which require a fundamental reset.

The kernel patch will be submitted to the ppc64-dev list for comments.  The qla2xxx patch has been reviewed
by QLogic and will be pushed upstream pending favorable comments on the kernel patch.

For reference I have posted the patches to this bug.  These patches may not be in final form pending comments from kernel community.
Comment 2 Richard A Lary 2009-07-13 17:45:47 EDT
Created attachment 351528 [details]
Add EEH callback to query device reset type

Proposed patch for reference pending comment from ppc64-dev
Comment 3 Richard A Lary 2009-07-13 17:48:25 EDT
Created attachment 351529 [details]
Add EEH callback to query device reset type to qla2xxx driver

Proposed patch to add reset_type EEH callback to qla2xxx driver to allow driver to request fundamental reset for device types which require it.
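
As a rough illustration of the callback idea described in this comment, the sketch below shows recovery code asking a driver which reset type it needs and falling back to a hot reset when the driver gives no answer. Every name here (demo_reset_type, demo_err_handlers, demo_choose_reset and so on) is hypothetical; the attached patches are not reproduced and may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reset kinds; the attached patch may name these differently. */
enum demo_reset_type { DEMO_RESET_HOT, DEMO_RESET_FUNDAMENTAL };

struct demo_dev;

/* Hypothetical per-driver error handlers with an optional reset-type query. */
struct demo_err_handlers {
    enum demo_reset_type (*reset_type)(const struct demo_dev *dev);
};

struct demo_dev {
    const char *name;
    const struct demo_err_handlers *handlers; /* NULL if the driver has none */
};

/* Driver side: a device that needs a fundamental reset says so. */
static enum demo_reset_type demo_fc_reset_type(const struct demo_dev *dev)
{
    (void)dev;
    return DEMO_RESET_FUNDAMENTAL;
}

static const struct demo_err_handlers demo_fc_handlers = {
    .reset_type = demo_fc_reset_type,
};

/* Recovery side: fall back to a hot reset when the driver gives no answer. */
static enum demo_reset_type demo_choose_reset(const struct demo_dev *dev)
{
    assert(dev != NULL);
    if (dev->handlers != NULL && dev->handlers->reset_type != NULL)
        return dev->handlers->reset_type(dev);
    return DEMO_RESET_HOT;
}
```

With this shape, a driver that does not implement the query keeps the existing hot-reset behavior, which is the main reason to make the callback optional in a sketch like this.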
Comment 4 Richard A Lary 2009-07-14 13:20:13 EDT
Created attachment 351639 [details]
Add EEH callback to query device reset type

Original attachment was incorrect copy of the patch, replacing with correct copy.
Comment 5 RHEL Product and Program Management 2009-07-14 18:03:26 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 Richard A Lary 2009-07-15 01:15:16 EDT
Created attachment 353775 [details]
Add fndmntl_rst_rqd bitfield to arch/powerpc/kernel/pci_64.c

Review of the original patch set raised concerns about kABI breakage. This patch set should resolve those issues, as the added bit field does not change the size of struct pci_dev.  The patch set includes this patch to define and initialize fndmntl_rst_rqd.  rh54_qla2xxx_set_reset_type.patch allows the qla2xxx driver to set the bit for specific device types; rh54_qlge_set_reset_type.patch allows the qlge driver to set the bit.  Both drivers must set the bit, as both are PCI functions on the same PCI slot.

Please consider this alternate solution.
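
The size argument above can be checked with a small stand-alone sketch. The two structures below are simplified stand-ins, not the real struct pci_dev; only the name fndmntl_rst_rqd comes from this bug, and the other members are invented for illustration. The point is that a one-bit flag added next to existing bit fields lands in spare bits of the same storage unit, so neither the structure size nor the offset of later members moves.

```c
#include <stddef.h>

/* Simplified stand-in for a structure that already packs flags into bit fields. */
struct demo_dev_before {
    unsigned int devfn;
    unsigned int flag_a:1;
    unsigned int flag_b:1;
    unsigned int flag_c:1;
    void *sysdata;
};

/* Same structure with one more flag placed in the spare bits. */
struct demo_dev_after {
    unsigned int devfn;
    unsigned int flag_a:1;
    unsigned int flag_b:1;
    unsigned int flag_c:1;
    unsigned int fndmntl_rst_rqd:1;
    void *sysdata;
};

/* Returns 1 when adding the flag left the overall size and the offset of
 * the member that follows the bit fields unchanged. */
static int demo_layout_unchanged(void)
{
    return sizeof(struct demo_dev_before) == sizeof(struct demo_dev_after)
        && offsetof(struct demo_dev_before, sysdata)
           == offsetof(struct demo_dev_after, sysdata);
}
```

This only holds while spare bits remain in the existing storage unit; once they run out, a new flag would grow the structure.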
Comment 7 Richard A Lary 2009-07-15 01:17:13 EDT
Created attachment 353777 [details]
Allows qla2xxx driver to set fndmntl_rst_rqd for specific device types

See comment with previous attachment
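
A minimal sketch of how a driver might set the bit for specific device types, assuming a lookup against a list of device IDs at probe time. The structure, the function name demo_set_reset_type, and the device IDs are placeholders invented for this example; the actual list of affected adapters is in the attached patch and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the device structure; not the real struct pci_dev. */
struct demo_pci_dev {
    unsigned short device;           /* PCI device ID */
    unsigned int fndmntl_rst_rqd:1;  /* set when a fundamental reset is required */
};

/* Placeholder device IDs, invented for this sketch. */
static const unsigned short demo_fundamental_ids[] = { 0x0101, 0x0102 };

/* Probe-time step: mark the device if its ID is on the list. */
static void demo_set_reset_type(struct demo_pci_dev *dev)
{
    size_t i;
    size_t n = sizeof(demo_fundamental_ids) / sizeof(demo_fundamental_ids[0]);

    assert(dev != NULL);
    dev->fndmntl_rst_rqd = 0;
    for (i = 0; i < n; i++) {
        if (dev->device == demo_fundamental_ids[i]) {
            dev->fndmntl_rst_rqd = 1;
            break;
        }
    }
}
```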
Comment 8 Richard A Lary 2009-07-15 01:18:15 EDT
Created attachment 353778 [details]
Allows qlge driver to set fndmntl_rst_rqd bit

See comments with previous attachment
Comment 9 John Jarvis 2009-07-15 09:38:10 EDT
IBM is signed up to test and provide feedback.
Comment 10 Richard A Lary 2009-07-15 10:17:39 EDT
Agreed, IBM will test and provide feedback in a timely manner to support this request.
Comment 11 Marcus Barrow 2009-07-15 16:39:06 EDT
Created attachment 353905 [details]
rhel5 fundamental reset for EEH


Patches for kernel, qla2xxx and qlge to provide fundamental reset for EEH. In addition to the work in the previous three patches, this includes version number updates for the drivers.
Comment 12 Richard A Lary 2009-07-15 17:54:11 EDT
Applied rhel5 fundamental reset for EEH patch to -158 kernel on Power PC
server.  Verified successful recovery from injected EEH errors on both qlge
driver and qla2xxx driver. A total of 11 injected EEH errors were recovered: 8
qlge errors, 3 qla2xxx errors.  Upon one qlge device reaching 6 EEH errors
within one hour, both qlge devices and qla2xxx devices were successfully
shut down.

Patch is working as expected.
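
The shutdown behavior observed in this test (a device is taken offline once it reaches 6 errors within one hour) can be pictured as a windowed error counter. The sketch below is only an illustration under that assumption: the names, the fixed-window logic, and the constants are chosen to match the numbers reported above, and the kernel's actual EEH accounting may work differently.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the test report above; treated as assumptions here. */
#define DEMO_MAX_ERRORS  6
#define DEMO_WINDOW_SECS 3600L

struct demo_err_state {
    long window_start; /* time of the first error in the current window */
    int count;         /* errors seen in the current window */
};

/* Record one error at time `now` (in seconds). Returns 1 when the device
 * has reached the limit within the window and should be shut down. */
static int demo_record_error(struct demo_err_state *s, long now)
{
    assert(s != NULL);
    if (s->count == 0 || now - s->window_start >= DEMO_WINDOW_SECS) {
        s->window_start = now; /* start a fresh window */
        s->count = 0;
    }
    s->count++;
    return s->count >= DEMO_MAX_ERRORS;
}
```

Under this model, five errors followed by a sixth more than an hour after the first would start a new window and not trigger a shutdown.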
Comment 13 IBM Bug Proxy 2009-07-15 18:01:42 EDT
------- Comment From rlary@us.ibm.com 2009-07-15 17:52 EDT-------
Applied rhel5 fundamental reset for EEH patch to -158 kernel on Power PC server.  Verified successful recovery from injected EEH errors on both qlge driver and qla2xxx driver. A total of 11 injected EEH errors were recovered: 8 qlge errors, 3 qla2xxx errors.  Upon one qlge device reaching 6 EEH errors within one hour, both qlge devices and qla2xxx devices were successfully shut down.
Comment 14 Richard A Lary 2009-07-16 00:19:00 EDT
It is important to note that this patch fixes an issue which can lead to loss of access to data storage, requiring a system reboot to restore storage access.  With this patch in place, should a PCI bus error be detected by the IBM PCI bridge, the system will fully recover from this otherwise potentially unrecoverable event.
Comment 16 Marcus Barrow 2009-07-17 08:44:42 EDT
Created attachment 354134 [details]
include the 4 Gb/S HBA's
Comment 17 Don Zickus 2009-07-21 15:37:25 EDT
in kernel-2.6.18-159.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 19 Richard A Lary 2009-07-22 18:54:01 EDT
Downloaded, installed -159 kernel. Verified patch fixes the EEH fundamental reset issue.  Both qlge and qla2xxx drivers recover as expected from injected pci bus errors.
Comment 20 IBM Bug Proxy 2009-07-22 22:11:16 EDT
------- Comment From rlary@us.ibm.com 2009-07-22 22:08 EDT-------
Marking as closed on IBM side, expect patches in Snap4
Comment 22 Chris Ward 2009-08-04 10:12:18 EDT
Marcus, any update for this request from QLogic?
Comment 23 Marcus Barrow 2009-08-04 10:23:41 EDT
The -159 kernel was downloaded and passed testing at IBM's site. This should be set to VERIFIED now. My browser is not letting me update the "Verified By" field without removing the other fields though (any, IBM). Could you do that please?
Comment 25 errata-xmlrpc 2009-09-02 04:29:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
