Description of problem:
On a RHEL 5.5 FC host, SCSI devices are regularly marked offline during I/O with fabric faults. This is seen on both Emulex and QLogic FC hosts.

A snippet of /var/log/messages for the offline scenario on one such Emulex host (with lpfc log verbose set to 0x1004) shows the following:

Sep 1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000
Sep 1 10:23:01 IBMx346-200-114 kernel: end_request: I/O error, dev sdbx, sector 5285304
Sep 1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0721 Device Reset rport failure: rdata xffff81007e0a5ca8
Sep 1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0714 SCSI layer issued Bus Reset Data: x2002
........
Sep 1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x1000 x16 x0 x0
Sep 1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x8000 x16 x0 x0
.......
Sep 1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: scsi: Device offlined - not ready after error recovery
Sep 1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x07010000
.......
Sep 1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000

Version-Release number of selected component (if applicable):
RHEL 5.5 Errata (2.6.18-194.11.1.el5)
Emulex - LPe12002 FW: 2.00A3 (U3D2.00A3) DVR: v8.2.0.63.3p
QLogic - QLE2562 FW: v5.03.02 DVR: v8.03.01.04.05.05-k

How reproducible:
Frequent.
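For readers decoding the "SCSI error: return code" values in the log above, here is a minimal sketch that splits the 32-bit result word into its four bytes. It assumes the result layout and the host/driver byte names used by 2.6-era kernels (include/scsi/scsi.h); the function name decode_scsi_result and the lookup tables are illustrative only and list just a subset of codes.

```python
# Sketch: split a SCSI midlayer result word into its component bytes.
# Assumed layout (2.6-era kernels): bits 0-7 status, 8-15 msg,
# 16-23 host byte, 24-31 driver byte. Tables are a partial subset.
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}

DRIVER_BYTE = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Return the four bytes of a result word plus symbolic names."""
    status = result & 0xFF
    msg = (result >> 8) & 0xFF
    host = (result >> 16) & 0xFF
    driver = (result >> 24) & 0xFF
    return {
        "status": status,
        "msg": msg,
        "host": HOST_BYTE.get(host, hex(host)),
        # Low nibble is the driver status; the high nibble carried
        # "suggest" bits in these kernels and is ignored here.
        "driver": DRIVER_BYTE.get(driver & 0x0F, hex(driver)),
    }


if __name__ == "__main__":
    for code in (0x00010000, 0x07010000):
        print(hex(code), decode_scsi_result(code))
```

Under these assumptions, both codes in the log carry a host byte of DID_NO_CONNECT, and 0x07010000 additionally carries a driver byte of DRIVER_HARD.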
Created attachment 446233
/var/log/messages for the SCSI offline scenario

The above logs were taken with lpfc log verbose set to 0x1004.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
After discussions with Mike, we came to the following conclusions:

1) The SCSI offline is hit due to a regression introduced in RHEL 5.5, where offlined SCSI devices are prevented from moving back to the running state (in certain scenarios, such as the one described in comment #0). This issue is also seen in RHEL6 and is tracked in bug 643237.

2) There is also a race bug in the timeout code for FC drivers which may trigger the SCSI offline issue as well. It is present in all RHEL4/RHEL5/RHEL6 kernels, but it is a separate issue and is not tracked here.
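To check which devices are currently stuck in the offline state on a test host, the per-device state attribute in sysfs can be read. The following is a minimal sketch, assuming the /sys/class/scsi_device/<H:C:T:L>/device/state layout; the function name offline_scsi_devices and its sysfs_root parameter are hypothetical and exist only for illustration.

```python
import glob
import os


def offline_scsi_devices(sysfs_root="/sys"):
    """Return the H:C:T:L names of SCSI devices whose state is 'offline'.

    Assumes each device exposes its state at
    <sysfs_root>/class/scsi_device/<H:C:T:L>/device/state.
    """
    pattern = os.path.join(
        sysfs_root, "class", "scsi_device", "*", "device", "state"
    )
    offline = []
    for state_path in sorted(glob.glob(pattern)):
        with open(state_path) as f:
            state = f.read().strip()
        if state == "offline":
            # Path ends in .../<H:C:T:L>/device/state
            offline.append(state_path.split(os.sep)[-3])
    return offline


if __name__ == "__main__":
    for hctl in offline_scsi_devices():
        print(hctl)
```

Writing "running" to the same state file is the usual manual way to bring a device back. This only helps with inspection and manual recovery; it does not address the regression itself.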
Created attachment 454776
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression
(In reply to comment #5)
> After discussions with Mike, we came to the following conclusions:
>
> 1) Firstly the SCSI offline is hit due to a regression introduced in RHEL 5.5
> where the offlined SCSI devices were prevented from moving back to running
> state (in certain scenarios like the one mentioned above in comment #0).

This is addressed by Mike's patch in comment #6.
Mike,

Could you also attach the actual patch here (the non-debug one)?
(In reply to comment #8)
> Mike,
>
> Could you also attach the actual patch here (the non debug one)?

It is actually in a kernel you can test already:
https://bugzilla.redhat.com/show_bug.cgi?id=641193#c3
*** This bug has been marked as a duplicate of bug 641193 ***