Bug 490744

Summary:

UNDERRUN and TIMEOUT status with qla2xxx

Product:

Red Hat Enterprise Linux 4

Reporter:

Bryn M. Reeves <bmr>

Component:

kernel

Assignee:

Marcus Barrow <mbarrow>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

4.7

CC:

andrew.vasquez, andriusb, bdonahue, bubrown, coughlan, cward, james.brown, lalit.chandivade, mchristi, mgahagan, pep, qlogic-redhat-ext, ravi.anand, syeghiay, tao

Target Milestone:

Keywords:

OtherQA

Target Release:

4.8

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-05-18 19:20:49 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

450896

Attachments:

Description	Flags
dont failfast did_error	none
scsi,qla2xxx - remove some DID_BUSY	none

Description Bryn M. Reeves 2009-03-17 19:28:57 UTC

Description of problem:
We have several reports of HBAs abruptly starting to report huge numbers of I/Os failing with UNDERRUN and TIMEOUT status using the qla2400 driver. Typically this looks like:

Feb  6 21:30:36 manic kernel: scsi(0:0:134) UNDERRUN status detected 0x15-0x0.
Feb  6 21:30:36 manic kernel: scsi(0:0:134) UNDERRUN status detected 0x15-0x0.
Feb  6 21:30:36 manic kernel: scsi(0:0:0:134) Dropped frame(s) detected (0 of 20000 bytes)...retrying command.
Feb  6 21:30:36 manic kernel: scsi(0:0:0:134) Dropped frame(s) detected (0 of 20000 bytes)...retrying command.
Feb  6 21:30:36 manic kernel: scsi(0:0:134) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=11482283.
Feb  6 21:30:36 manic kernel: scsi(0:0:134) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=11482283.
Feb  6 21:30:36 manic kernel: SCSI error : <0 0 0 134> return code = 0x20000
Feb  6 21:30:36 manic kernel: SCSI error : <0 0 0 134> return code = 0x20000

Version-Release number of selected component (if applicable):
2.6.9-67.0.7.EL (8.01.07-d4 / 4.04.00)
2.6.9-78.29.EL  (8.02.09.00.04.08-d / 4.06.01)

How reproducible:
Variable. Happens at random, sometimes giving just a handful of errors other times spitting hundreds of similar errors into the logs.

Steps to Reproduce:
Difficult to say; does not appear to depend on load or other "obvious" triggers
  
Actual results:
Devices attached to the QLA fail with above errors

Expected results:
No errors

Additional info:

Comment 3 Bryn M. Reeves 2009-03-17 19:42:40 UTC

Similar set of errors on another host:

Feb 15 10:06:46 kid-a kernel: scsi(1): LOOP READY
Feb 15 10:06:46 kid-a kernel: qla2x00_restart_queues(1): callback 0 commands.
Feb 15 10:06:46 kid-a kernel: qla2x00_restart_queues(1): active=0, retry=0, pending=0, done=0, scsi retry=0 commands.
Feb 15 10:06:46 kid-a kernel: scsi(1): qla2x00_loop_resync - end
Feb 15 10:06:47 kid-a kernel: scsi(1): Asynchronous PORT UPDATE ignored 0000/0004/0600.
Feb 15 10:06:47 kid-a kernel: scsi(1): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
Feb 16 06:07:43 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 06:07:43 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1736093.
Feb 16 06:33:58 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 06:33:58 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1759091.
Feb 16 07:39:18 kid-a kernel: scsi(0:0:0:3): TIMEOUT status detected 0x6-0x0
Feb 16 07:39:18 kid-a kernel: scsi(0:0:3) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1820449.
Feb 16 07:41:39 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:41:39 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1822461.
Feb 16 07:43:13 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 07:43:13 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1823786.
Feb 16 07:46:27 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:46:27 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1826535.
Feb 16 07:50:08 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:50:08 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1829630.
Feb 16 08:09:24 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 08:09:24 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1851692.
Feb 16 23:41:39 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 23:41:39 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=3134479.
Feb 17 01:09:00 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 17 01:09:00 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=3237238.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:5) UNDERRUN status detected 0x15-0x0.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:0:5) Dropped frame(s) detected (0 of 1e000 bytes)...retrying command.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=3783475.
Feb 17 10:12:23 kid-a kernel: scsi(0:0:6) UNDERRUN status detected 0x15-0x0.

Driver load messages:
Feb  6 18:17:30 kid-a kernel: SCSI subsystem initialized
Feb  6 18:17:30 kid-a kernel: QLogic Fibre Channel HBA Driver
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Found an ISP2422, irq 209, iobase 0xffffff0000002000
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Configuring PCI space...
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Configure NVRAM parameters...
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Verifying loaded RISC code...
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Allocated (1061 KB) for firmware dump...
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Waiting for LIP to complete...
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: LOOP UP detected (4 Gbps).
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Topology - (F_Port), Host Loop address 0x0
Feb  6 18:17:30 kid-a kernel: scsi0 : qla2xxx
Feb  6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: 
Feb  6 18:17:30 kid-a kernel:  QLogic Fibre Channel HBA Driver: 8.01.07-d4

Comment 7 Bryn M. Reeves 2009-03-18 18:05:17 UTC

So, this looks very similar to the errors in bug 244967:

    scsi(2:1:16) UNDERRUN status detected 0x15-0x0. 
    resid=0x7fff8fff fw_resid=0x7fff8fff cdb=0x28 
    os_underflow=0xf400 srb_flags=0x2
    scsi(2:0:1:16) Dropped frame(s) detected 
    (7fff8fff of f400 bytes)...retrying command.
    scsi(2:1:16) qla2x00_done: did_error = 2,
    comp-scsi= 0x15-0x0 pid=102056310.
    SCSI error : <2 0 1 16> return code = 0x20000
    end_request: I/O error, dev sdbm, sector 4192702
    end_request: I/O error, dev sdbm, sector 4192708
    device-mapper: dm-multipath: Failing path 68:0.

Only difference is the resid/fw_resid/cdb/os_underflow/srb_flags line as this doesn't exist in the RHEL4 version of the driver.

The patch there changes RHEL5's qla2xxx to use DID_ERROR for dropped frames instead of DID_BUS_BUSY and the midlayer was changed in 5.3 to treat DID_ERROR as a retryable error.

Comment 10 Mike Christie 2009-03-18 18:50:47 UTC

Marcus,

Multiple issues:

- For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?

- What do you get CS_TIMEOUT for? It was in a email but I cannot find it now
and I just want to make sure I got it right upstream. I think I would hit it
when I pulled cables during testing and that is why I had converted it.

- For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use
DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail
fires or if that is not set then IO is failed when dev loss tmo fires.

For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use
DID_TRANSPORT_DISRUPTED for this and the other problems that got converted
converted upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
(I think you might want to convert some other errors to use
DID_TRANSPORT_DISRUPTED too).

The only problem may be the change in behavior. Your users now have to set fast
io fail for this to fail quickly if that is what they wanted (some do like it).
You guys might have had your users setting dev loss tmo to a high value and if
they do not set fast io fail then these IO errors will not get failed until the
dev loss fires.

- For RHEL4 and CS_TIMEOUT, are you guys going to use the DID_IMM_RETRY trick
for that too?

Comment 18 Andrew Vasquez 2009-03-23 19:59:39 UTC

> - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?

If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems
like a reasonable alternative.  I recall some earlier emails
indicating such...

> - What do you get CS_TIMEOUT for? It was in a email but I cannot find it now
> and I just want to make sure I got it right upstream. I think I would hit it
> when I pulled cables during testing and that is why I had converted it.

Storage failed to respond in the request timeout (typically 30 or 60
seconds).  RHEL4 qla2xxx drivers perform 'internal' queueing and
timing of requests (usually 2-5 seconds less than the midlayer's
per-command timeout).

> - For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use
> DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail
> fires or if that is not set then IO is failed when dev loss tmo fires.

Yes, RHEL5 driver's now use the 'upstream' variant driver which
contains no internal queuing nor timing.  Firmware can still
(unlikely) return CS_TIMEOUT.  I'd suggest the semantics listed above
still be used.

> For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use
> DID_TRANSPORT_DISRUPTED for this and the other problems that got converted
> converted upstream:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
> (I think you might want to convert some other errors to use
> DID_TRANSPORT_DISRUPTED too).

Yes, we'll get thise changes posted for 5.4.  Lalit?

> The only problem may be the change in behavior. Your users now have to set fast
> io fail for this to fail quickly if that is what they wanted (some do like it).
> You guys might have had your users setting dev loss tmo to a high value and if
> they do not set fast io fail then these IO errors will not get failed until the
> dev loss fires.
> 
> - For RHEL4 and CS_TIMEOUT, are you guys going to use the DID_IMM_RETRY trick
> for that too?  

Add CCs of the bugzilla to the appropriate folks.

Comment 19 Mike Christie 2009-03-23 20:34:10 UTC

(In reply to comment #18)
> > - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?
> 
> If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems
> like a reasonable alternative.  I recall some earlier emails
> indicating such...
> 

We were going to add it for another driver problem, but I just checked scsi_softirq_done and we never did. I guess we went a different way.

I can add it for RHEL 4 if you need it. I thought because you guys did internal queueing you did not need it and could do some magic in your driver. Let me know, if you need me to port the infinite retry check to RHEL4 and I will try to send a patch by the end of the week.

Comment 20 Andrew Vasquez 2009-03-23 22:38:05 UTC

Re: #19

No, we don't really have any magic we can employ in the qla2xxx
RHEL4.  If you could add the timing logic in the midlayer to reduce
the potential for an infinite retry on RHEL4, that would be great.

Comment 23 Lalit Chandivade 2009-03-25 12:17:28 UTC

(In reply to comment #18)
> > - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?
> If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems
> like a reasonable alternative.  I recall some earlier emails
> indicating such...
> > - What do you get CS_TIMEOUT for? It was in a email but I cannot find it now
> > and I just want to make sure I got it right upstream. I think I would hit it
> > when I pulled cables during testing and that is why I had converted it.
> Storage failed to respond in the request timeout (typically 30 or 60
> seconds).  RHEL4 qla2xxx drivers perform 'internal' queueing and
> timing of requests (usually 2-5 seconds less than the midlayer's
> per-command timeout).
> > - For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use
> > DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail
> > fires or if that is not set then IO is failed when dev loss tmo fires.
> Yes, RHEL5 driver's now use the 'upstream' variant driver which
> contains no internal queuing nor timing.  Firmware can still
> (unlikely) return CS_TIMEOUT.  I'd suggest the semantics listed above
> still be used.
> > For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use
> > DID_TRANSPORT_DISRUPTED for this and the other problems that got converted
> > converted upstream:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
> > (I think you might want to convert some other errors to use
> > DID_TRANSPORT_DISRUPTED too).
> Yes, we'll get thise changes posted for 5.4.  Lalit?
Patch sent to Marcus B. for submission.

Comment 24 Andrius Benokraitis 2009-03-25 15:51:56 UTC

Marcus, please post the patch as an attachment here and have it POSTed internally ASAP... thanks!

Comment 25 Mike Christie 2009-03-26 16:07:19 UTC

Created attachment 336844 [details]
dont failfast did_error

Andrius,

I did a patch for the scsi layer. I am discussing it with qlogic (forgot to cc Marcus on it) and emulex and the zfcp guys.

I have not tested the patch yet.

Comment 27 Andrius Benokraitis 2009-03-27 15:20:27 UTC

Mike - is there any work you are expecting from QLogic on this besides the patch in Comment #25?

Comment 28 Marcus Barrow 2009-03-27 15:48:39 UTC

I am just editing the comments on the patch submission now it should be in very soon...

Comment 29 Marcus Barrow 2009-03-27 16:02:11 UTC

Created attachment 337030 [details]
scsi,qla2xxx - remove some DID_BUSY

Comment 30 James G. Brown III 2009-03-27 17:15:44 UTC

Thanks Marcus!

- James

Comment 32 Andrius Benokraitis 2009-03-27 18:17:03 UTC

Setting testing to OtherQA - looks like Cisco IT can do the testing for us.

Comment 33 RHEL Program Management 2009-03-27 18:17:56 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 38 Vivek Goyal 2009-03-31 15:43:55 UTC

Committed in 86.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 40 Chris Ward 2009-04-09 07:43:42 UTC

~~ Attention Partners! Snap 3 Released ~~
RHEL 4.8 Snapshot 3 has been released on partners.redhat.com. There
should be a fix present that resolves this bug.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix. Please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.

Comment 43 Chris Ward 2009-04-16 13:14:03 UTC

~~ Attention! Snap 4 Released ~~
RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There
should be a fix present that resolves this bug. There's not much more time to test. Please report back results ASAP.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix. Please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.

Comment 45 Chris Ward 2009-04-20 10:54:58 UTC

QLogic, any updates?

Comment 48 Marcus Barrow 2009-04-20 20:49:45 UTC

CLosed after verifying the code present and hearing that testing in the filed with the patch was successful.

Sorry I was slow, I can't read those customer comments...

Comment 50 errata-xmlrpc 2009-05-18 19:20:49 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html