Bug 632195 - [NetApp 5.6 bug] SCSI devices offlined on 5.5 FC host during IO with fabric faults
Summary: [NetApp 5.6 bug] SCSI devices offlined on 5.5 FC host during IO with fabric f...
Keywords:
Status: CLOSED DUPLICATE of bug 641193
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5.z
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Mike Christie
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 557597
 
Reported: 2010-09-09 11:34 UTC by Martin George
Modified: 2010-11-09 15:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-09 15:28:51 UTC
Target Upstream Version:
Embargoed:


Attachments
/var/log/messages for the SCSI offline scenario (1.27 MB, application/x-zip-compressed)
2010-09-09 11:35 UTC, Martin George
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression (5.25 KB, application/octet-stream)
2010-10-21 10:22 UTC, Martin George

Description Martin George 2010-09-09 11:34:00 UTC
Description of problem:
On a RHEL 5.5 FC host, SCSI devices are regularly marked offline during I/O with fabric faults. This is seen on both Emulex and QLogic FC hosts.

A snippet of /var/log/messages for the offline scenario on one such Emulex host (with lpfc log verbose set to 0x1004) shows the following:

Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000 
Sep  1 10:23:01 IBMx346-200-114 kernel: end_request: I/O error, dev sdbx, sector 5285304 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0721 Device Reset rport failure: rdata xffff81007e0a5ca8 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0714 SCSI layer issued Bus Reset Data: x2002 
........
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x1000 x16 x0 x0 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x8000 x16 x0 x0 
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: scsi: Device offlined - not ready after error recovery 
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x07010000
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000
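
For readers unfamiliar with the "return code" values in the log above, here is a minimal sketch of how such a value splits into its four bytes. It assumes the usual Linux SCSI midlayer layout (status in bits 0-7, message in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31) and the host byte names from include/scsi/scsi.h. The function name decode_scsi_result is hypothetical and is not part of any kernel or tool. The driver byte is left numeric on purpose, since the log does not say how the kernel composed it.

```python
# Host byte names as defined in the kernel's include/scsi/scsi.h.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
    0x0A: "DID_PASSTHROUGH",
    0x0B: "DID_SOFT_ERROR",
    0x0C: "DID_IMM_RETRY",
    0x0D: "DID_REQUEUE",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result into status, msg, host and driver bytes.

    Hypothetical helper for reading log lines; the host byte is mapped to
    its symbolic name when known, the other fields stay numeric.
    """
    host = (result >> 16) & 0xFF
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": HOST_BYTE_NAMES.get(host, hex(host)),
        "driver": (result >> 24) & 0xFF,
    }


if __name__ == "__main__":
    # The two return codes that appear in the log snippet.
    for code in (0x00010000, 0x07010000):
        print(hex(code), decode_scsi_result(code))
```

Under these assumptions, both codes in the log carry a host byte of 0x01 (DID_NO_CONNECT); the second one additionally has a non-zero driver byte (0x07).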


Version-Release number of selected component (if applicable):
RHEL 5.5 Errata (2.6.18-194.11.1.el5)
Emulex - LPe12002 FW: 2.00A3 (U3D2.00A3) DVR: v8.2.0.63.3p
QLogic - QLE2562 FW:v5.03.02 DVR: v8.03.01.04.05.05-k

How reproducible:
Frequent.

Comment 1 Martin George 2010-09-09 11:35:55 UTC
Created attachment 446233
/var/log/messages for the SCSI offline scenario

The above logs were taken with lpfc log verbose set to 0x1004.

Comment 3 RHEL Program Management 2010-09-09 16:40:00 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Martin George 2010-10-21 10:16:11 UTC
After discussions with Mike, we came to the following conclusions:

1) First, the SCSI offline is hit due to a regression introduced in RHEL 5.5, where offlined SCSI devices were prevented from moving back to the running state (in certain scenarios, like the one described in comment #0). This issue is also seen in RHEL6 and is tracked in bug 643237.

2) There is also a race bug in the timeout code for FC drivers that may trigger the SCSI offline issue as well. It is present in all RHEL4/RHEL5/RHEL6 kernels, but it is a separate issue and is not tracked here.
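
To check whether a host is affected by point 1 (devices staying offline rather than returning to running), a minimal sketch like the one below can list SCSI devices by their sysfs state. It assumes the standard /sys/class/scsi_device/<H:C:T:L>/device/state attribute. The function name find_devices_in_state and its sysfs_root parameter are hypothetical, added only so the sketch can be exercised against a test directory.

```python
import glob
import os


def find_devices_in_state(wanted="offline", sysfs_root="/sys"):
    """Return the H:C:T:L names of SCSI devices whose state equals `wanted`.

    Hypothetical helper: reads <sysfs_root>/class/scsi_device/*/device/state.
    """
    pattern = os.path.join(
        sysfs_root, "class", "scsi_device", "*", "device", "state"
    )
    matches = []
    for state_path in sorted(glob.glob(pattern)):
        try:
            with open(state_path) as f:
                state = f.read().strip()
        except OSError:
            # The device may disappear while we scan; skip it.
            continue
        if state == wanted:
            # .../scsi_device/<H:C:T:L>/device/state -> <H:C:T:L>
            hctl = os.path.basename(
                os.path.dirname(os.path.dirname(state_path))
            )
            matches.append(hctl)
    return matches


if __name__ == "__main__":
    for hctl in find_devices_in_state("offline"):
        print("offline:", hctl)
```

On an affected host one would expect entries such as 0:0:1:34 to show up here after a fabric fault. Writing "running" to the same state file is the usual manual way to bring a device back, though it only works around the symptom and is no substitute for the fixed kernel.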

Comment 6 Martin George 2010-10-21 10:22:35 UTC
Created attachment 454776
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression

Comment 7 Martin George 2010-10-21 10:24:43 UTC
(In reply to comment #5)
> After discussions with Mike, we came to the following conclusions:
> 
> 1) Firstly the SCSI offline is hit due to a regression introduced in RHEL 5.5
> where the offlined SCSI devices were prevented from moving back to running
> state (in certain scenarios like the one mentioned above in comment #0). 

This is addressed by Mike's patch in comment #6.

Comment 8 Martin George 2010-10-25 16:56:46 UTC
Mike,

Could you also attach the actual patch here (the non debug one)?

Comment 9 Mike Christie 2010-10-25 18:16:40 UTC
(In reply to comment #8)
> Mike,
> 
> Could you also attach the actual patch here (the non debug one)?

It is actually in a kernel you can test already:
https://bugzilla.redhat.com/show_bug.cgi?id=641193#c3

Comment 10 Martin George 2010-11-09 15:28:51 UTC

*** This bug has been marked as a duplicate of bug 641193 ***

