Bug 632195

Summary: [NetApp 5.6 bug] SCSI devices offlined on 5.5 FC host during IO with fabric faults
Product: Red Hat Enterprise Linux 5
Reporter: Martin George <marting>
Component: kernel
Assignee: Mike Christie <mchristi>
Status: CLOSED DUPLICATE
QA Contact: Storage QE <storage-qe>
Severity: urgent
Priority: high
Version: 5.5.z
CC: andriusb, bdonahue, coughlan, mchristi, xdl-redhat-bugzilla
Target Milestone: rc
Keywords: OtherQA, Regression
Target Release: 5.6
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-11-09 15:28:51 UTC
Bug Blocks: 557597
Attachments:
/var/log/messages for the SCSI offline scenario (flags: none)
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression (flags: none)

Description Martin George 2010-09-09 11:34:00 UTC
Description of problem:
On a RHEL 5.5 FC host, SCSI devices are regularly marked offline during IO with fabric faults. This is seen on both Emulex and QLogic FC hosts.

A snippet of /var/log/messages from one such Emulex host (with lpfc log verbose set to 0x1004) shows the offline scenario:

Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000 
Sep  1 10:23:01 IBMx346-200-114 kernel: end_request: I/O error, dev sdbx, sector 5285304 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0721 Device Reset rport failure: rdata xffff81007e0a5ca8 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0714 SCSI layer issued Bus Reset Data: x2002 
........
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x1000 x16 x0 x0 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x8000 x16 x0 x0 
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: scsi: Device offlined - not ready after error recovery 
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x07010000
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000
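
For readers decoding the "return code" values in the log above, here is a minimal sketch. It assumes the usual Linux SCSI result layout (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the host/driver byte names from the kernel's scsi.h of that era. The function name decode_scsi_result is hypothetical and is not part of any kernel or tool interface.

```python
# Sketch only: decode a SCSI mid-layer result word into its four bytes.
# Name tables are assumed from the 2.6-era include/scsi/scsi.h.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}

DRIVER_BYTE_NAMES = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Split a 32-bit result word (hypothetical helper) into named fields."""
    status = result & 0xFF
    message = (result >> 8) & 0xFF
    host = (result >> 16) & 0xFF
    # Only the low nibble of the driver byte holds the driver code;
    # the high nibble carries "suggest" bits, which are ignored here.
    driver = (result >> 24) & 0x0F
    return {
        "status": status,
        "message": message,
        "host": HOST_BYTE_NAMES.get(host, hex(host)),
        "driver": DRIVER_BYTE_NAMES.get(driver, hex(driver)),
    }


if __name__ == "__main__":
    for code in (0x00010000, 0x07010000):
        print(hex(code), decode_scsi_result(code))
```

Under these assumptions, both codes in the log carry host byte DID_NO_CONNECT, and 0x07010000 additionally carries driver byte DRIVER_HARD.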


Version-Release number of selected component (if applicable):
RHEL 5.5 Errata (2.6.18-194.11.1.el5)
Emulex - LPe12002 FW: 2.00A3 (U3D2.00A3) DVR: v8.2.0.63.3p
QLogic - QLE2562 FW:v5.03.02 DVR: v8.03.01.04.05.05-k

How reproducible:
Frequent.

Comment 1 Martin George 2010-09-09 11:35:55 UTC
Created attachment 446233 [details]
/var/log/messages for the SCSI offline scenario

Above logs taken with lpfc log verbose set to 0x1004
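
When going through a large /var/log/messages attachment like this one, a small filter can list which devices were offlined. The sketch below assumes the message format shown in comment #0 ("sd H:C:T:L: scsi: Device offlined - not ready after error recovery"). The function name find_offlined_devices is hypothetical.

```python
import re

# Matches the kernel message format seen in comment #0 (assumed stable
# for this kernel version; other kernels may word it differently).
OFFLINE_RE = re.compile(
    r"kernel: sd (?P<hctl>\d+:\d+:\d+:\d+): scsi: Device offlined"
)


def find_offlined_devices(lines):
    """Return the H:C:T:L addresses of offlined devices, in first-seen order."""
    seen = []
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match and match.group("hctl") not in seen:
            seen.append(match.group("hctl"))
    return seen


if __name__ == "__main__":
    # Path is an assumption; point it at the saved attachment if needed.
    with open("/var/log/messages", errors="replace") as log:
        for hctl in find_offlined_devices(log):
            print(hctl)
```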

Comment 3 RHEL Program Management 2010-09-09 16:40:00 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Martin George 2010-10-21 10:16:11 UTC
After discussions with Mike, we came to the following conclusions:

1) Firstly, the SCSI offline is hit due to a regression introduced in RHEL 5.5 that prevents offlined SCSI devices from moving back to the running state (in certain scenarios like the one described above in comment #0). This issue is also seen in RHEL6 and is tracked in bug 643237.

2) There is also a race bug in the timeout code for FC drivers which may trigger the SCSI offline issue. This is present in all RHEL4/RHEL5/RHEL6 kernels, but that is a different issue altogether and is not tracked here.
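
To see whether devices remain stuck offline after the fabric recovers, the per-device state can be read from sysfs. The sketch below assumes the standard /sys/class/scsi_device/H:C:T:L/device/state attribute and a Python 3 interpreter. The function names and the sysfs_root parameter are hypothetical and exist only to make the example testable.

```python
import glob
import os


def scsi_device_states(sysfs_root="/sys"):
    """Map each H:C:T:L address to the contents of its sysfs 'state' file."""
    states = {}
    pattern = os.path.join(
        sysfs_root, "class", "scsi_device", "*", "device", "state"
    )
    for path in sorted(glob.glob(pattern)):
        # Path ends in .../<H:C:T:L>/device/state
        hctl = path.split(os.sep)[-3]
        try:
            with open(path) as state_file:
                states[hctl] = state_file.read().strip()
        except OSError:
            # The device may disappear while we are scanning; skip it.
            continue
    return states


def offline_devices(sysfs_root="/sys"):
    """Return the addresses whose state reads 'offline'."""
    return [
        hctl
        for hctl, state in scsi_device_states(sysfs_root).items()
        if state == "offline"
    ]


if __name__ == "__main__":
    for hctl in offline_devices():
        print(hctl, "is offline")
```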

Comment 6 Martin George 2010-10-21 10:22:35 UTC
Created attachment 454776 [details]
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression

Comment 7 Martin George 2010-10-21 10:24:43 UTC
(In reply to comment #5)
> After discussions with Mike, we came to the following conclusions:
> 
> 1) Firstly the SCSI offline is hit due to a regression introduced in RHEL 5.5
> where the offlined SCSI devices were prevented from moving back to running
> state (in certain scenarios like the one mentioned above in comment #0). 

This is addressed by Mike's patch in comment #6.

Comment 8 Martin George 2010-10-25 16:56:46 UTC
Mike,

Could you also attach the actual patch here (the non debug one)?

Comment 9 Mike Christie 2010-10-25 18:16:40 UTC
(In reply to comment #8)
> Mike,
> 
> Could you also attach the actual patch here (the non debug one)?

It is actually in a kernel you can test already:
https://bugzilla.redhat.com/show_bug.cgi?id=641193#c3

Comment 10 Martin George 2010-11-09 15:28:51 UTC

*** This bug has been marked as a duplicate of bug 641193 ***