Red Hat Bugzilla – Bug 511141
qla2xxx - Provide fundamental reset capability for EEH
Last modified: 2009-09-03 09:44:02 EDT
The EEH code needs to be able to provide a fundamental reset to recover from some errors.
This ability to tolerate and recover from hardware errors is an important reliability improvement for customers. It can prevent systems becoming unusable and protects customer access to their data.
Native device driver support for "Extended Error Handling" (EEH) allows I/O device drivers to recover from intermittent PCI bus errors by resetting and then restarting the device which experienced the PCI bus error.
IBM has support for this feature on all IBM platform models starting with Power 5 on most I/O devices.
IBM/QLogic have been working together to understand the root cause of an issue seen during recovery of EEH errors on QLogic PCIe adapters. QLogic has determined the QLogic PCIe adapters require
a fundamental reset following an EEH error in order to be fully recovered.
This reliability feature is a key differentiator for IBM PowerPC platforms.
IBM has developed two patches which will resolve the issue by providing a method for the device
driver to request that the kernel PowerPC EEH driver issue a fundamental reset instead of a hot reset for devices which require one. One patch is to the EEH kernel driver and EEH header files; the second patch is
for the qla2xxx driver, to report which reset type is required for devices which require a fundamental reset.
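For illustration only, the callback approach described above might be sketched in user-space C as follows. This is not the code from the attached patches: the enum, struct, function names, and device ID are all hypothetical. The only assumptions taken from this report are that the driver reports a reset type per device and that a hot reset remains the default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reset kinds; illustrative names, not the kernel's. */
enum reset_type { RESET_HOT, RESET_FUNDAMENTAL };

struct device;

/* Hypothetical error-handler table with an optional reset-type query. */
struct err_handlers {
    enum reset_type (*reset_type)(const struct device *dev);
};

struct device {
    unsigned short device_id;
    const struct err_handlers *handlers; /* may be NULL */
};

/* Made-up device ID standing in for an adapter that needs a fundamental reset. */
#define EXAMPLE_ID_NEEDS_FUNDAMENTAL 0x0001

/* Driver side: request a fundamental reset only for device types that need one. */
enum reset_type example_driver_reset_type(const struct device *dev)
{
    assert(dev != NULL);
    return dev->device_id == EXAMPLE_ID_NEEDS_FUNDAMENTAL
               ? RESET_FUNDAMENTAL
               : RESET_HOT;
}

const struct err_handlers example_handlers = {
    .reset_type = example_driver_reset_type,
};

/* EEH-core side: fall back to a hot reset when the driver has no callback. */
enum reset_type choose_reset(const struct device *dev)
{
    assert(dev != NULL);
    if (dev->handlers != NULL && dev->handlers->reset_type != NULL)
        return dev->handlers->reset_type(dev);
    return RESET_HOT;
}
```

In this sketch, a driver that does not fill in the callback keeps the existing hot-reset behavior, so only drivers that opt in are affected.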
The kernel patch will be submitted to the ppc64-dev list for comments. The qla2xxx patch has been reviewed
by QLogic and will be pushed upstream pending favorable comments on the kernel patch.
For reference I have posted the patches to this bug. These patches may not be in final form pending comments from the kernel community.
Created attachment 351528 [details]
Add EEH callback to query device reset type
Proposed patch for reference pending comment from ppc64-dev
Created attachment 351529 [details]
Add EEH callback to query device reset type to qla2xxx driver
Proposed patch to add reset_type EEH callback to qla2xxx driver to allow driver to request fundamental reset for device types which require it.
Created attachment 351639 [details]
Add EEH callback to query device reset type
The original attachment was an incorrect copy of the patch; replacing it with the correct copy.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 353775 [details]
Add fndmntl_rst_rqd bitfield to arch/powerpc/kernel/pci_64.c
Review of the original patch set raised concerns about kABI breakage. This patch set should resolve those issues, as the added bit field does not change the size of struct pci_dev. The patch set includes this patch to define and initialize fndmntl_rst_rqd. rh54_qla2xxx_set_reset_type.patch allows the qla2xxx driver to set the bit for specific device types; rh54_qlge_set_reset_type.patch allows the qlge driver to set the bit. Both drivers must set the bit, as both are PCI functions on the same PCI slot.
Please consider this alternate solution.
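As a rough illustration of the kABI argument above, the following user-space C sketch shows a one-bit flag being added next to existing bit fields without changing the size of the structure. The structs are simplified stand-ins, not the real struct pci_dev, and every member and helper other than fndmntl_rst_rqd is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a device structure before the change. */
struct dev_before {
    void *driver_data;
    unsigned int is_enabled:1;
    unsigned int is_busmaster:1;
};

/* Same structure with the new flag packed into the spare bits. */
struct dev_after {
    void *driver_data;
    unsigned int is_enabled:1;
    unsigned int is_busmaster:1;
    unsigned int fndmntl_rst_rqd:1; /* set by drivers that need a fundamental reset */
};

/* The new bit reuses the existing storage unit, so size and layout are unchanged. */
static_assert(sizeof(struct dev_before) == sizeof(struct dev_after),
              "adding the bit field must not grow the struct");
static_assert(offsetof(struct dev_before, driver_data) ==
                  offsetof(struct dev_after, driver_data),
              "existing members must keep their offsets");

/* Hypothetical driver-side helper: mark the device at probe time. */
void set_fundamental_reset(struct dev_after *dev)
{
    assert(dev != NULL);
    dev->fndmntl_rst_rqd = 1;
}

/* Hypothetical EEH-side helper: check the flag before choosing a reset. */
int needs_fundamental_reset(const struct dev_after *dev)
{
    assert(dev != NULL);
    return dev->fndmntl_rst_rqd != 0;
}
```

The compile-time checks only hold while spare bits remain in the existing bit-field storage unit, which is the assumption the comment above relies on.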
Created attachment 353777 [details]
Allows qla2xxx driver to set fndmntl_rst_rqd for specific device types
See comment with previous attachment
Created attachment 353778 [details]
Allows qlge driver to set fndmntl_rst_rqd bit
See comments with previous attachment
IBM is signed up to test and provide feedback.
Agreed, IBM will test and provide feedback in a timely manner to support this request.
Created attachment 353905 [details]
rhel5 fundamental reset for EEH
Patches for kernel, qla2xxx and qlge to provide fundamental reset for EEH. In addition to the work in the previous three patches, this includes version number updates for the drivers.
Applied rhel5 fundamental reset for EEH patch to -158 kernel on PowerPC
server. Verified successful recovery from injected EEH errors on both qlge
driver and qla2xxx driver. A total of 11 injected EEH errors were recovered: 8
qlge errors, 3 qla2xxx errors. Upon one qlge device reaching 6 EEH errors
within one hour, both qlge devices and qla2xxx devices were successfully shut down.
Patch is working as expected.
------- Comment From firstname.lastname@example.org 2009-07-15 17:52 EDT-------
Applied rhel5 fundamental reset for EEH patch to -158 kernel on PowerPC server. Verified successful recovery from injected EEH errors on both qlge driver and qla2xxx driver. A total of 11 injected EEH errors were recovered: 8 qlge errors, 3 qla2xxx errors. Upon one qlge device reaching 6 EEH errors within one hour, both qlge devices and qla2xxx devices were successfully shut down.
It is important to note that this patch fixes an issue which can lead to loss of access to data storage, requiring a system reboot to restore storage access. With this patch in place, should a PCI bus error be detected by the IBM PCI bridge, the system will fully recover from this otherwise potentially unrecoverable event.
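The shutdown behavior seen in the test above (devices taken offline once one device reaches 6 EEH errors within an hour) can be pictured with a small counter sketch. This is a simplified user-space model under stated assumptions, not the kernel's EEH implementation: the limit of 5 recoverable errors per hour is inferred from the test result, the fixed one-hour window is a simplification, and all names are hypothetical. The counter is kept per slot because, as noted earlier in this bug, qlge and qla2xxx are PCI functions on the same slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FREEZES_PER_HOUR 5    /* assumed limit, for illustration only */
#define WINDOW_SECONDS       3600 /* one hour */

/* Hypothetical per-slot state shared by every function on the slot. */
struct slot {
    long window_start; /* start of the current counting window, in seconds */
    int  freeze_count; /* errors seen in the current window */
    bool disabled;     /* true once the slot has been shut down */
};

/*
 * Record one EEH error at time `now` (seconds).
 * Returns true if recovery should be attempted, false if the slot is
 * (or has just been) shut down because the limit was exceeded.
 */
bool slot_record_error(struct slot *s, long now)
{
    assert(s != NULL);
    if (s->disabled)
        return false;

    /* Start a new window on the first error or after the hour has passed. */
    if (s->freeze_count == 0 || now - s->window_start >= WINDOW_SECONDS) {
        s->window_start = now;
        s->freeze_count = 0;
    }

    s->freeze_count++;
    if (s->freeze_count > MAX_FREEZES_PER_HOUR) {
        s->disabled = true; /* every function on the slot goes offline */
        return false;
    }
    return true;
}
```

With these assumed values, the first five errors in an hour each lead to a recovery attempt and the sixth disables the slot, which is consistent with the result reported above.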
Created attachment 354134 [details]
Include the 4 Gb/s HBAs
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
Downloaded and installed the -159 kernel. Verified that the patch fixes the EEH fundamental reset issue. Both qlge and qla2xxx drivers recover as expected from injected PCI bus errors.
------- Comment From email@example.com 2009-07-22 22:08 EDT-------
Marking as closed on IBM side, expect patches in Snap4
Marcus, any update for this request from QLogic?
The -159 kernel was downloaded and passed testing at IBM's site. This should be set to VERIFIED now. My browser is not letting me update the "Verified By" field without removing the other fields though (any, IBM). Could you do that please?
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.