Description of problem:
The Linux SCSI mid-layer, for kernels >= 2.6.18, does not handle SCSI disks that incorrectly report hardware or medium errors.

Version-Release number of selected component (if applicable):
2.6.18-53.el5

How reproducible:
Always

Actual results:
The specific scenario I recently chased involved MD raid1 with 2 SCSI disks connected via an Adaptec aacraid controller. Physically pulling one of the disks from the enclosure did _not_ result in the block layer (MD raid1) getting an IO error. The reason was that the SCSI mid-layer was dropping the error on the floor because it calculated 'good_bytes' to be non-zero for this HARDWARE_ERROR condition.

Expected results:
Even with "out-of-spec" SCSI disks, the SCSI mid-layer can take steps to handle the HARDWARE_ERROR and should keep 'good_bytes' set to 0. As a result, scsi_end_request propagates uptodate=0 to the block layer.

Additional info:
This regression in the SCSI mid-layer affects all kernels >= 2.6.18; this includes all RHEL5 kernels AFAIK (definitely RHEL5U1). It is my hope that a fix can be included in >= RHEL5U2. To that end, upstream has identified an appropriate fix (this is also the candidate fix for all "stable" kernels), available here:
http://marc.info/?l=linux-scsi&m=120199000703534&w=4
The fix is queued in the scsi-misc-2.6 tree for Linus: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=b42abb39ca8a5414f039839866f0357725a53618
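For readers who want a concrete picture of the 'good_bytes' behaviour described above, here is a minimal user-space sketch in C. It is not the kernel code and not the upstream patch. The struct, the function name model_good_bytes, and the assumption that the out-of-spec disk reports a bad LBA lying outside the failed request are all hypothetical and chosen for illustration only. The sense key values (0x03 MEDIUM ERROR, 0x04 HARDWARE ERROR) are the standard SCSI ones.

```c
#include <stdint.h>

/* Standard SCSI sense key values. */
#define SK_NO_SENSE       0x00
#define SK_MEDIUM_ERROR   0x03
#define SK_HARDWARE_ERROR 0x04

/* Hypothetical, simplified stand-in for a request; not a kernel struct. */
struct model_request {
    uint64_t start_lba;   /* first sector of the request */
    uint32_t nr_sectors;  /* number of sectors requested */
    uint32_t sector_size; /* bytes per sector */
};

/*
 * Model of how many bytes may be reported as good for a completed command.
 * Assumption: for MEDIUM/HARDWARE errors the device supplies a bad LBA, and
 * an out-of-spec device may supply one that is not inside the request.
 * In that case the value cannot be trusted, so good_bytes stays 0 and the
 * whole request is treated as failed.
 */
static uint64_t model_good_bytes(const struct model_request *rq,
                                 uint8_t sense_key, uint64_t bad_lba)
{
    uint64_t xfer_bytes = (uint64_t)rq->nr_sectors * rq->sector_size;

    if (sense_key == SK_NO_SENSE)
        return xfer_bytes; /* no error: everything transferred */

    if (sense_key != SK_MEDIUM_ERROR && sense_key != SK_HARDWARE_ERROR)
        return 0; /* other errors: nothing is claimed good in this model */

    /* Guard: the reported bad LBA must lie within the failed request. */
    if (bad_lba < rq->start_lba ||
        bad_lba >= rq->start_lba + rq->nr_sectors)
        return 0;

    /* Only the sectors before the bad LBA count as good. */
    return (bad_lba - rq->start_lba) * rq->sector_size;
}
```

In this model, a request for 8 sectors of 512 bytes starting at LBA 1000 that fails with HARDWARE_ERROR and a bogus bad LBA of 0 yields 0 good bytes, so the caller has nothing to complete successfully and the error reaches the upper layer (MD raid1 in the scenario above).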
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
We considered putting this fix in 5.2, even though it arrived after beta code-freeze. Code review brought up another nearby, but different, issue. This resulted in some delay. In the end, we decided that the risk associated with changing I/O completion code without a full beta test (and plenty of soak time upstream) is not worth the benefit for this issue in 5.2. We intend to fix this in 5.3.
in kernel-2.6.18-115.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html