Bug 632195

Summary:

[NetApp 5.6 bug] SCSI devices offlined on 5.5 FC host during IO with fabric faults

Product:

Red Hat Enterprise Linux 5

Reporter:

Martin George <marting>

Component:

kernel

Assignee:

Mike Christie <mchristi>

Status:

CLOSED DUPLICATE

QA Contact:

Storage QE <storage-qe>

Severity:

urgent

Priority:

high

Version:

5.5.z

CC:

andriusb, bdonahue, coughlan, mchristi, xdl-redhat-bugzilla

Keywords:

OtherQA, Regression

Target Release:

5.6

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2010-11-09 15:28:51 UTC

Bug Blocks:

557597

Attachments:

Description	Flags
/var/log/messages for the SCSI offline scenario	none
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression	none

Description Martin George 2010-09-09 11:34:00 UTC

Description of problem:
On a RHEL 5.5 FC host, SCSI devices are regularly marked offline during I/O with fabric faults. This is seen on both Emulex and QLogic FC hosts.

A snippet of /var/log/messages for the offline scenario on one such Emulex host (with lpfc log verbose set to 0x1004) shows the following:

Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000 
Sep  1 10:23:01 IBMx346-200-114 kernel: end_request: I/O error, dev sdbx, sector 5285304 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0721 Device Reset rport failure: rdata xffff81007e0a5ca8 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.0: 0:(0):0714 SCSI layer issued Bus Reset Data: x2002 
........
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x1000 x16 x0 x0 
Sep  1 10:23:01 IBMx346-200-114 kernel: lpfc 0000:03:00.1: 1:(0):0730 FCP command x2a failed: x2 SNS xf0000600 x29000000 Data: xa x8000 x16 x0 x0 
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: scsi: Device offlined - not ready after error recovery 
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x07010000
.......
Sep  1 10:23:01 IBMx346-200-114 kernel: sd 0:0:1:34: SCSI error: return code = 0x00010000
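
For anyone reading the return codes above: the Linux SCSI midlayer packs four bytes into the result word (status, message, host and driver byte, from low to high). The following is a minimal sketch of a decoder, assuming the 2.6-era constants from include/scsi/scsi.h. The function name decode_scsi_result is hypothetical and the lookup tables are deliberately partial.

```python
# Partial lookup tables, assumed from the 2.6-era include/scsi/scsi.h.
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}

DRIVER_BYTE = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four byte fields.

    Hypothetical helper for illustration; unknown values are reported as hex.
    """
    status = result & 0xFF            # SCSI status byte from the target
    msg = (result >> 8) & 0xFF        # message byte
    host = (result >> 16) & 0xFF      # host (LLD/midlayer) byte
    driver = (result >> 24) & 0xFF    # driver byte; high nibble holds "suggest" bits
    return {
        "status": status,
        "msg": msg,
        "host": HOST_BYTE.get(host, hex(host)),
        "driver": DRIVER_BYTE.get(driver & 0x0F, hex(driver & 0x0F)),
        "suggest": driver & 0xF0,
    }


if __name__ == "__main__":
    for code in (0x00010000, 0x07010000):
        print(hex(code), decode_scsi_result(code))
```

Under these tables, both codes in the log carry host byte DID_NO_CONNECT, and 0x07010000 additionally carries driver byte DRIVER_HARD.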


Version-Release number of selected component (if applicable):
RHEL 5.5 Errata (2.6.18-194.11.1.el5)
Emulex - LPe12002 FW: 2.00A3 (U3D2.00A3) DVR: v8.2.0.63.3p
QLogic - QLE2562 FW:v5.03.02 DVR: v8.03.01.04.05.05-k

How reproducible:
Frequent.

Comment 1 Martin George 2010-09-09 11:35:55 UTC

Created attachment 446233
/var/log/messages for the SCSI offline scenario

The above logs were taken with lpfc log verbose set to 0x1004.

Comment 3 RHEL Program Management 2010-09-09 16:40:00 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Martin George 2010-10-21 10:16:11 UTC

After discussions with Mike, we came to the following conclusions:

1) Firstly the SCSI offline is hit due to a regression introduced in RHEL 5.5 where the offlined SCSI devices were prevented from moving back to running state (in certain scenarios like the one mentioned above in comment #0). This issue is also seen in RHEL6 - tracked in bug 643237.

2) There is also a race in the timeout code for FC drivers that may trigger the SCSI offline issue. It is present in all RHEL4/RHEL5/RHEL6 kernels, but it is a separate issue and is not tracked here.
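
To check whether a host currently has devices stuck in the offline state, the per-device state can be read from sysfs. This is a minimal sketch, assuming the usual /sys/class/scsi_device/<H:C:T:L>/device/state attribute. The function name find_offline_scsi_devices and its sysfs_root parameter are hypothetical, and the code targets a current Python 3 rather than the interpreter shipped with RHEL 5.

```python
import glob
import os


def find_offline_scsi_devices(sysfs_root="/sys"):
    """Return the H:C:T:L names of SCSI devices whose state reads "offline".

    Hypothetical helper; sysfs_root is a parameter only so that the function
    can be pointed at a test directory instead of the live /sys.
    """
    pattern = os.path.join(
        sysfs_root, "class", "scsi_device", "*", "device", "state"
    )
    offline = []
    for state_path in sorted(glob.glob(pattern)):
        try:
            with open(state_path) as f:
                state = f.read().strip()
        except OSError:
            # The device may have been removed between glob() and open().
            continue
        if state == "offline":
            # Path layout: .../scsi_device/<H:C:T:L>/device/state
            offline.append(state_path.split(os.sep)[-3])
    return offline
```

Calling find_offline_scsi_devices() with no arguments scans the live /sys tree. It only reports state and does not attempt to bring a device back to running.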

Comment 6 Martin George 2010-10-21 10:22:35 UTC

Created attachment 454776
Mike Christie's reverted block state debug patch addressing the RHEL 5.5 regression

Comment 7 Martin George 2010-10-21 10:24:43 UTC

(In reply to comment #5)
> After discussions with Mike, we came to the following conclusions:
> 
> 1) Firstly the SCSI offline is hit due to a regression introduced in RHEL 5.5
> where the offlined SCSI devices were prevented from moving back to running
> state (in certain scenarios like the one mentioned above in comment #0). 

This is addressed by Mike's patch in comment #6.

Comment 8 Martin George 2010-10-25 16:56:46 UTC

Mike,

Could you also attach the actual patch here (the non debug one)?

Comment 9 Mike Christie 2010-10-25 18:16:40 UTC

(In reply to comment #8)
> Mike,
> 
> Could you also attach the actual patch here (the non debug one)?

It is actually in a kernel you can test already:
https://bugzilla.redhat.com/show_bug.cgi?id=641193#c3

Comment 10 Martin George 2010-11-09 15:28:51 UTC


*** This bug has been marked as a duplicate of bug 641193 ***