I'm putting this in bugzilla because we have _many_ customers who hit this issue, and it's never completely resolved because of the inability to debug and other such roadblocks. I recently got a customer who reproduced the problem with extended_error_logging enabled, which has narrowed down the issue. I'm going to do my best to explain the situation, and I apologize if I misspeak or use the wrong terminology.

The qlogic driver in RHEL4 maintains its own SCSI command queue, so it knows all of the SCSI commands that it is currently in charge of. When the qlogic driver gets an RSCN update, it sets LOOP_RESYNC_NEEDED in the host's flags. This in turn means that qla2x00_loop_resync gets run, and subsequently qla2x00_restart_queues gets run. That function runs through the HA's pending queue and retry queue, marks all of the SCSI commands with DID_BUS_BUSY, and kicks them back up to the SCSI midlayer.

Generally this isn't a problem, since the SCSI midlayer will just retry the command and everything goes along its merry way. But when using DM multipath or EMC PowerPath, REQ_FAILFAST gets set on the request, which means the command is never retried and is sent all the way back up to the multipathing layer. DM multipath only handles things at the BIO level, so the only error it sees is -EIO. It has no way to differentiate what happened, so it cannot check the error and possibly retry; having some sort of check in DM multipath isn't an option.

Upstream, Andrew posted this patch,

commit f4f051ebb40e74ad0ba02d2cb3a6c16b0393472b
Author: <andrew.vasquez>
Date: Sun Apr 17 15:02:26 2005 -0500

[PATCH] qla2xxx: remove internal queuing...

which removes all of this internal queueing crap. Unfortunately it also depends on this patch

commit 8482e118afa0cb4321ab3d30b1100d27d63130c0
Author: <andrew.vasquez>
Date: Sun Apr 17 15:04:54 2005 -0500

[PATCH] qla2xxx: add remote port codes...

which in turn relies on a few other patches.

So RSCN updates will cause the current path to fail, and if all of your paths are hooked into switches that receive RSCN updates at the same time, all of your paths will be failed and your filesystem will be remounted. In DM multipath there are things you can do to get around this, i.e. the queue_if_no_path option, but AFAIK there is no such thing for EMC. Mike Christie has offered up a patch that will retry any BIO that comes back with an error for a limited amount of time. The problem with this is that there is no way to distinguish what kind of error occurred, so if a true error happens, you get the retry delay instead of failing over. And again, this leaves customers with EMC PowerPath (who are numerous and large) without a solution.

So this bugzilla is meant to facilitate some sort of permanent solution to this problem. I believe the best solution is to keep the qlogic driver from setting DID_BUS_BUSY in these kinds of scenarios. Hopefully through this bugzilla we can determine the best course of action.
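The failure chain described above can be sketched as a toy model in Python. This is not kernel code: `Command`, `midlayer_disposition`, `rscn_on_all_paths` and `MAX_RETRIES` are hypothetical names, and the retry rule is a simplified assumption based on the description in this report (DID_BUS_BUSY is retried for ordinary requests but not for requests marked failfast).

```python
from dataclasses import dataclass

DID_OK = "DID_OK"
DID_BUS_BUSY = "DID_BUS_BUSY"
MAX_RETRIES = 5  # hypothetical retry budget for this model


@dataclass
class Command:
    failfast: bool   # models REQ_FAILFAST being set by a multipath layer
    retries: int = 0


def midlayer_disposition(cmd, host_status):
    """Return 'done', 'retry', or 'fail' for a completed command (toy model)."""
    if host_status == DID_OK:
        return "done"
    if host_status == DID_BUS_BUSY:
        # Assumption: a failfast request is never retried by the midlayer.
        if cmd.failfast or cmd.retries >= MAX_RETRIES:
            return "fail"  # reaches the multipath layer as a plain -EIO
        cmd.retries += 1
        return "retry"
    return "fail"


def rscn_on_all_paths(paths, failfast):
    """Model an RSCN hitting every path at once; return the paths that fail."""
    failed = []
    for path in paths:
        cmd = Command(failfast=failfast)
        # The driver flushes its internal queues with DID_BUS_BUSY.
        if midlayer_disposition(cmd, DID_BUS_BUSY) == "fail":
            failed.append(path)
    return failed
```

With `failfast=True` every path in the list is reported as failed after a single RSCN, which is the all-paths-down case described above; with `failfast=False` the commands are simply retried.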
Created attachment 149474
time based failover for dm multipath

This is the time-based failover patch that Mike Christie suggested on RHKL in reference to this problem.
(In reply to comment #0)
> Now in DM multipath there are things that you can do to get around this, ie
> the queue if no path option, but AFAIK there is no such thing for EMC.
>

Just my 2 cents on this. If EMC does not have the exact same thing, it is because how you handle errors is implementation specific. If we can get some traces from EMC, they have their own path testing and failback scheme. DM decided to handle the problem partially in userspace.

Also, the problem of errors being propagated back to the FS layer when there are no paths is not limited to the qla2xxx RSCN problem. It occurs with any driver and any transport if there is a single point of failure and the multipath layer decides to fail IO to the FS layer instead of retrying it. For iscsi we have the same problem. If you put all your cables through one switch and reboot the switch, you will get errors on all paths, and then if no_path_retry is set to fail, the IO will be failed when there are no paths.
Created attachment 149734
use did imm retry instead of did bus busy

Here is the patch from Andrew Vasquez. From Andrew:

Essentially it's a backport of changes done in our standard driver which swap DID_BUS_BUSY statuses for DID_IMM_RETRY statuses in 'select' logic paths -- those where the driver uses command recycling during topology disruptions. Of course the usage of DID_IMM_RETRY implies some care, so as to avoid infinite retries. But, given the use of qla2xxx's own internal dev-loss-tmo timers, command recycling will not proceed ad infinitum. I'd suggest RH seriously consider this for their RHEL4 qla2xxx driver.
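Andrew's point that recycling with DID_IMM_RETRY stays bounded can be illustrated with a small Python model. It is only a sketch under stated assumptions: `driver_status`, `run_command`, the `FAILED` placeholder status and the 30-second `DEV_LOSS_TMO` value are all hypothetical and are not taken from the patch. It also assumes that a command returned with DID_IMM_RETRY is retried even when the request is failfast.

```python
DID_OK = "DID_OK"
DID_IMM_RETRY = "DID_IMM_RETRY"
FAILED = "FAILED"      # placeholder for whatever terminal error the driver returns
DEV_LOSS_TMO = 30      # seconds; assumed value for this model only


def driver_status(now, port_back_at):
    """Status the modelled driver returns for a command issued at time `now`.

    `port_back_at` is the time the port becomes usable again (None = never).
    """
    if port_back_at is not None and now >= port_back_at:
        return DID_OK
    if now < DEV_LOSS_TMO:
        return DID_IMM_RETRY   # recycle the command while the timer is running
    return FAILED              # timer expired: stop recycling


def run_command(port_back_at, step=1):
    """Reissue the command every `step` seconds until it completes or fails."""
    now = 0
    while True:
        status = driver_status(now, port_back_at)
        if status != DID_IMM_RETRY:
            return status, now
        now += step
```

In this model a port that returns after 5 seconds lets the command complete at t=5, while a port that never returns causes the command to fail once the timer expires at t=30, so the retry loop always terminates.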
I think this patch should be fine because, as Andrew pointed out, the driver has timers so the command is not retried forever, and he stated that: RSCN processing is typically very fast. The worst-case fabric timeout one must worry about for any type of extended-link-service fabric command is 2 * R_A_TOV, where R_A_TOV is typically 10 seconds. So commands would not sit too long.
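As a worked check of the figure quoted above, the bound can be computed directly. The function name `worst_case_fabric_timeout` is hypothetical; the 10-second default is the "typical" R_A_TOV value from the comment, not a value read from any particular fabric.

```python
R_A_TOV_SECONDS = 10  # "typically 10 seconds" per the comment above


def worst_case_fabric_timeout(r_a_tov=R_A_TOV_SECONDS):
    """Worst-case wait for an extended-link-service fabric command: 2 * R_A_TOV."""
    return 2 * r_a_tov
```

With the typical value this gives 2 * 10 = 20 seconds as the longest a recycled command would wait on a single fabric command.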
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Marcus, Is this in your queue for 4.6? If not, please consider it a high priority. Tom
I will put it in the queue. QLogic was not sure whether opinion had fallen in favor of including this.
Internal Status set to 'Resolved' Status set to: Closed by Client Resolution set to: 'Closed by Client' This event sent from IssueTracker by robert.wehner issue 119734
The "use imm retry" patch was submitted for RHEL 4.6.
A patch addressing this issue has been included in kernel-2.6.9-55.19.EL.
That kernel package isn't signed because it is an unofficial build on the way to RHEL 4.6 Beta. If you require an officially supported kernel with this fix prior to RHEL 4.6, please request a hotfix.
*** Bug 180212 has been marked as a duplicate of this bug. ***
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
thanks for your update
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html