Bug 490744
Summary: | UNDERRUN and TIMEOUT status with qla2xxx
---|---
Product: | Red Hat Enterprise Linux 4
Component: | kernel
Version: | 4.7
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | high
Target Milestone: | rc
Target Release: | 4.8
Reporter: | Bryn M. Reeves <bmr>
Assignee: | Marcus Barrow <mbarrow>
QA Contact: | Red Hat Kernel QE team <kernel-qe>
CC: | andrew.vasquez, andriusb, bdonahue, bubrown, coughlan, cward, james.brown, lalit.chandivade, mchristi, mgahagan, pep, qlogic-redhat-ext, ravi.anand, syeghiay, tao
Keywords: | OtherQA
Doc Type: | Bug Fix
Last Closed: | 2009-05-18 19:20:49 UTC
Bug Blocks: | 450896
Description
Bryn M. Reeves
2009-03-17 19:28:57 UTC
Similar set of errors on another host:

Feb 15 10:06:46 kid-a kernel: scsi(1): LOOP READY
Feb 15 10:06:46 kid-a kernel: qla2x00_restart_queues(1): callback 0 commands.
Feb 15 10:06:46 kid-a kernel: qla2x00_restart_queues(1): active=0, retry=0, pending=0, done=0, scsi retry=0 commands.
Feb 15 10:06:46 kid-a kernel: scsi(1): qla2x00_loop_resync - end
Feb 15 10:06:47 kid-a kernel: scsi(1): Asynchronous PORT UPDATE ignored 0000/0004/0600.
Feb 15 10:06:47 kid-a kernel: scsi(1): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
Feb 16 06:07:43 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 06:07:43 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1736093.
Feb 16 06:33:58 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 06:33:58 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1759091.
Feb 16 07:39:18 kid-a kernel: scsi(0:0:0:3): TIMEOUT status detected 0x6-0x0
Feb 16 07:39:18 kid-a kernel: scsi(0:0:3) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1820449.
Feb 16 07:41:39 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:41:39 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1822461.
Feb 16 07:43:13 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 07:43:13 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1823786.
Feb 16 07:46:27 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:46:27 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1826535.
Feb 16 07:50:08 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 16 07:50:08 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1829630.
Feb 16 08:09:24 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 08:09:24 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=1851692.
Feb 16 23:41:39 kid-a kernel: scsi(0:0:0:5): TIMEOUT status detected 0x6-0x0
Feb 16 23:41:39 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=3134479.
Feb 17 01:09:00 kid-a kernel: scsi(0:0:0:6): TIMEOUT status detected 0x6-0x0
Feb 17 01:09:00 kid-a kernel: scsi(0:0:6) qla2x00_done: did_error = 2, comp-scsi= 0x6-0x0 pid=3237238.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:5) UNDERRUN status detected 0x15-0x0.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:0:5) Dropped frame(s) detected (0 of 1e000 bytes)...retrying command.
Feb 17 10:12:19 kid-a kernel: scsi(0:0:5) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=3783475.
Feb 17 10:12:23 kid-a kernel: scsi(0:0:6) UNDERRUN status detected 0x15-0x0.

Driver load messages:

Feb 6 18:17:30 kid-a kernel: SCSI subsystem initialized
Feb 6 18:17:30 kid-a kernel: QLogic Fibre Channel HBA Driver
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Found an ISP2422, irq 209, iobase 0xffffff0000002000
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Configuring PCI space...
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Configure NVRAM parameters...
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Verifying loaded RISC code...
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Allocated (1061 KB) for firmware dump...
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Waiting for LIP to complete...
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: LOOP UP detected (4 Gbps).
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0: Topology - (F_Port), Host Loop address 0x0
Feb 6 18:17:30 kid-a kernel: scsi0 : qla2xxx
Feb 6 18:17:30 kid-a kernel: qla2400 0000:03:01.0:
Feb 6 18:17:30 kid-a kernel: QLogic Fibre Channel HBA Driver: 8.01.07-d4

So, this looks very similar to the errors in bug 244967:

scsi(2:1:16) UNDERRUN status detected 0x15-0x0. resid=0x7fff8fff fw_resid=0x7fff8fff cdb=0x28 os_underflow=0xf400 srb_flags=0x2
scsi(2:0:1:16) Dropped frame(s) detected (7fff8fff of f400 bytes)...retrying command.
scsi(2:1:16) qla2x00_done: did_error = 2, comp-scsi= 0x15-0x0 pid=102056310.
SCSI error : <2 0 1 16> return code = 0x20000
end_request: I/O error, dev sdbm, sector 4192702
end_request: I/O error, dev sdbm, sector 4192708
device-mapper: dm-multipath: Failing path 68:0.

Only difference is the resid/fw_resid/cdb/os_underflow/srb_flags line as this doesn't exist in the RHEL4 version of the driver. The patch there changes RHEL5's qla2xxx to use DID_ERROR for dropped frames instead of DID_BUS_BUSY and the midlayer was changed in 5.3 to treat DID_ERROR as a retryable error.

Marcus,

Multiple issues:

- For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?

- What do you get CS_TIMEOUT for? It was in an email but I cannot find it now and I just want to make sure I got it right upstream. I think I would hit it when I pulled cables during testing and that is why I had converted it.

- For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail fires or if that is not set then IO is failed when dev loss tmo fires.

For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use DID_TRANSPORT_DISRUPTED for this and the other problems that got converted upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
(I think you might want to convert some other errors to use DID_TRANSPORT_DISRUPTED too).

The only problem may be the change in behavior. Your users now have to set fast io fail for this to fail quickly if that is what they wanted (some do like it). You guys might have had your users setting dev loss tmo to a high value and if they do not set fast io fail then these IO errors will not get failed until the dev loss fires.

- For RHEL4 and CS_TIMEOUT, are you guys going to use the DID_IMM_RETRY trick for that too?
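For anyone triaging similar logs, here is a minimal sketch of how the numbers in the messages above can be decoded. It is a reading aid only, not part of the driver or of any patch attached to this bug. Assumptions: the host byte values are the upstream ones from include/scsi/scsi.h (DID_TRANSPORT_DISRUPTED is an upstream code and is not assumed to exist in RHEL4), the completion status names are the upstream qla_def.h ones, the `did_error` field is assumed to be printed in decimal, and the helper names `host_byte` and `parse_done` are hypothetical.

```python
import re

# Host byte values as defined upstream in include/scsi/scsi.h (subset).
# DID_TRANSPORT_DISRUPTED is listed for reference; it is an upstream code
# and is not assumed to be available in the RHEL4 kernel.
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
    0x0C: "DID_IMM_RETRY",
    0x0E: "DID_TRANSPORT_DISRUPTED",
}

# qla2xxx completion statuses that appear in this report
# (names as in upstream qla_def.h).
COMP_STATUS = {
    0x06: "CS_TIMEOUT",
    0x15: "CS_DATA_UNDERRUN",
}


def host_byte(result):
    """Extract the host byte (bits 16-23) from a SCSI result word."""
    return (result >> 16) & 0xFF


# Matches the "qla2x00_done" lines quoted in this bug.
_DONE_RE = re.compile(
    r"qla2x00_done: did_error = (\d+), "
    r"comp-scsi= 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+) pid=(\d+)"
)


def parse_done(line):
    """Decode one qla2x00_done log line; return None if it does not match."""
    m = _DONE_RE.search(line)
    if m is None:
        return None
    did_error = int(m.group(1))  # assumption: printed in decimal
    comp = int(m.group(2), 16)
    return {
        "host": HOST_BYTES.get(did_error, hex(did_error)),
        "comp": COMP_STATUS.get(comp, hex(comp)),
        "scsi_status": int(m.group(3), 16),
        "pid": int(m.group(4)),
    }
```

With these tables, `host_byte(0x20000)` is 0x02, so the "return code = 0x20000" in the bug 244967 excerpt and the "did_error = 2" in the RHEL4 logs both correspond to DID_BUS_BUSY, the host byte the comment above says the RHEL5 patch replaced with DID_ERROR for dropped frames.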
> - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?

If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems like a reasonable alternative. I recall some earlier emails indicating such...

> - What do you get CS_TIMEOUT for? It was in an email but I cannot find it now
> and I just want to make sure I got it right upstream. I think I would hit it
> when I pulled cables during testing and that is why I had converted it.

Storage failed to respond in the request timeout (typically 30 or 60 seconds). RHEL4 qla2xxx drivers perform 'internal' queueing and timing of requests (usually 2-5 seconds less than the midlayer's per-command timeout).

> - For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use
> DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail
> fires or if that is not set then IO is failed when dev loss tmo fires.

Yes, RHEL5 drivers now use the 'upstream' variant driver which contains no internal queuing nor timing. Firmware can still (unlikely) return CS_TIMEOUT. I'd suggest the semantics listed above still be used.

> For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use
> DID_TRANSPORT_DISRUPTED for this and the other problems that got converted
> upstream:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
> (I think you might want to convert some other errors to use
> DID_TRANSPORT_DISRUPTED too).

Yes, we'll get these changes posted for 5.4. Lalit?

> The only problem may be the change in behavior. Your users now have to set fast
> io fail for this to fail quickly if that is what they wanted (some do like it).
> You guys might have had your users setting dev loss tmo to a high value and if
> they do not set fast io fail then these IO errors will not get failed until the
> dev loss fires.
>
> - For RHEL4 and CS_TIMEOUT, are you guys going to use the DID_IMM_RETRY trick
> for that too?

Add CCs of the bugzilla to the appropriate folks.

(In reply to comment #18)
> > - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?
>
> If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems
> like a reasonable alternative. I recall some earlier emails
> indicating such...
>

We were going to add it for another driver problem, but I just checked scsi_softirq_done and we never did. I guess we went a different way. I can add it for RHEL 4 if you need it. I thought because you guys did internal queueing you did not need it and could do some magic in your driver. Let me know if you need me to port the infinite retry check to RHEL4 and I will try to send a patch by the end of the week.

Re: #19

No, we don't really have any magic we can employ in the qla2xxx RHEL4. If you could add the timing logic in the midlayer to reduce the potential for an infinite retry on RHEL4, that would be great.

(In reply to comment #18)
> > - For the RHEL4 underrun issue, did you guys want to handle with DID_IMM_RETRY?
> If the midlayer (in 2.6.5-xxx) will not try indefinitely, that seems
> like a reasonable alternative. I recall some earlier emails
> indicating such...
> > - What do you get CS_TIMEOUT for? It was in an email but I cannot find it now
> > and I just want to make sure I got it right upstream. I think I would hit it
> > when I pulled cables during testing and that is why I had converted it.
> Storage failed to respond in the request timeout (typically 30 or 60
> seconds). RHEL4 qla2xxx drivers perform 'internal' queueing and
> timing of requests (usually 2-5 seconds less than the midlayer's
> per-command timeout).
> > - For RHEL5 and the CS_TIMEOUT issue, upstream, I converted this to use
> > DID_TRANSPORT_DISRUPTED so the IO will not be failed until the fast io fail
> > fires or if that is not set then IO is failed when dev loss tmo fires.
> Yes, RHEL5 drivers now use the 'upstream' variant driver which
> contains no internal queuing nor timing. Firmware can still
> (unlikely) return CS_TIMEOUT. I'd suggest the semantics listed above
> still be used.
> > For RHEL 5 you guys added fast io fail support in 5.3 right? You can then use
> > DID_TRANSPORT_DISRUPTED for this and the other problems that got converted
> > upstream:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=056a44834950ffa51fafa6c76a720fa32e86851a
> > (I think you might want to convert some other errors to use
> > DID_TRANSPORT_DISRUPTED too).
> Yes, we'll get these changes posted for 5.4. Lalit?

Patch sent to Marcus B. for submission.

Marcus, please post the patch as an attachment here and have it POSTed internally ASAP... thanks!

Created attachment 336844 [details]
dont failfast did_error
Andrius,
I did a patch for the scsi layer. I am discussing it with qlogic (forgot to cc Marcus on it) and emulex and the zfcp guys.
I have not tested the patch yet.
Mike - is there any work you are expecting from QLogic on this besides the patch in Comment #25?

I am just editing the comments on the patch submission now; it should be in very soon...

Created attachment 337030 [details]
scsi,qla2xxx - remove some DID_BUSY
Thanks Marcus! - James

Setting testing to OtherQA - looks like Cisco IT can do the testing for us.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Committed in 86.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

~~ Attention Partners! Snap 3 Released ~~

RHEL 4.8 Snapshot 3 has been released on partners.redhat.com. There should be a fix present that resolves this bug.

If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with your test results details. Include which arches tested, package version and any applicable logs.

~~ Attention! Snap 4 Released ~~

RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should be a fix present that resolves this bug. There's not much more time to test. Please report back results ASAP.

If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above. Please leave a comment with your test results details. Include which arches tested, package version and any applicable logs.

QLogic, any updates?
Closed after verifying the code is present and hearing that testing in the field with the patch was successful. Sorry I was slow, I can't read those customer comments...

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html