Red Hat Bugzilla – Bug 231319
[QLogic 4.6 bug] Qlogic driver handles RSCN updates in a problematic way
Last modified: 2010-10-22 09:35:57 EDT
I'm putting this in bugzilla because we have _many_ customers who hit this
issue and it's never completely resolved because of an inability to debug and
other such roadblocks. I have recently gotten a customer who reproduced the
problem with extended_error_logging enabled, which has narrowed down the issue.
I'm going to do my best to explain the situation, and I apologize if I
misspeak or use the wrong terminology.
The qlogic driver in RHEL4 maintains its own SCSI command queue, where it
tracks all SCSI commands that it is currently in charge of. When the qlogic
driver gets an RSCN update, it sets LOOP_RESYNC_NEEDED on the host's flags.
This in turn means that qla2x00_loop_resync gets run, and subsequently
qla2x00_restart_queues gets run. This runs through the HA's pending queue and
retry queue, marks all of the SCSI commands with DID_BUS_BUSY, and kicks them
back up to the SCSI midlayer. Generally this isn't a problem, since the SCSI
midlayer will just retry the command and everything goes along its merry way.
But when using DM multipathing or EMC PowerPath, REQ_FAILFAST gets set on
that particular request, which means the command is never retried; instead,
the error is sent all the way back up to the multipathing layer. In the case
of DM multipath, it only handles things at the BIO level, so the only error
that it sees is -EIO. It has no way to differentiate what happened, so there
is no way for it to check the error and possibly retry; having some sort of
check in DM multipath isn't an option. Upstream, Andrew posted this
Date: Sun Apr 17 15:02:26 2005 -0500
[PATCH] qla2xxx: remove internal queuing...
which removes all of this internal queueing crap. Unfortunately it also
depends on this patch
Date: Sun Apr 17 15:04:54 2005 -0500
[PATCH] qla2xxx: add remote port codes...
which in turn relies on a few other patches.
So RSCN updates will cause the current path to fail, and if all of your
paths are hooked into switches that also receive RSCN updates at the same
time, all of your paths will be failed and your filesystem will be remounted.
In DM multipath there are things you can do to work around this, i.e. the
queue_if_no_path option, but AFAIK there is no such thing for EMC PowerPath.
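For reference, the DM-multipath workaround mentioned above looks roughly like
this in /etc/multipath.conf (a hedged sketch; section placement and the other
settings you need depend on your array and your multipath-tools version):

```
defaults {
        # queue I/O rather than failing it when every path is lost,
        # so a transient fabric-wide RSCN storm does not surface -EIO
        # to the filesystem
        features "1 queue_if_no_path"
}
```

With this set, I/O hangs until a path comes back instead of erroring out;
that trade-off (hung I/O versus -EIO) is exactly the choice PowerPath users
don't get to make.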
Mike Christie has offered up a patch that will retry any BIO that comes back
with an error for a limited amount of time, but the problem with this is that
there is no way to distinguish what kind of error occurred, so if a true error
happens, you get the retry delay instead of failing over. And again, this
leaves customers with EMC PowerPath (which are numerous and large) without a
solution.
So this bugzilla exists to facilitate some sort of permanent solution to
this problem. I believe the best solution is to keep the qlogic driver from
setting DID_BUS_BUSY for these kinds of scenarios. Hopefully through this
bugzilla we can determine the best course of action.
Created attachment 149474 [details]
time based failover for dm multipath
This is the time-based failover patch that Mike Christie suggested on RHKL in
reference to this problem.
(In reply to comment #0)
> Now in DM multipath there are things that you can do to get around this, ie
> the queue if no path option, but AFAIK there is no such thing for EMC.
Just my 2 cents on this. If EMC does not have the exact same thing, it is
because how you handle errors is implementation specific. It would help if we
could get some traces from EMC; they have their own path testing and failback
scheme. DM decided to handle the problem partially in userspace.
Also, the problem of errors being propagated back to the FS layer when there
are no paths is not limited to the qla2xxx RSCN problem. It occurs with any
driver and any transport if there is a single point of failure and the
multipath layer decides to fail IO to the FS layer instead of retrying it.
For iSCSI we have the same problem. If you put all your cables through one
switch and reboot the switch, you will get errors on all paths, and if
no_path_retry is set to fail the IO, it will fail the IO when there are no
paths.
Created attachment 149734 [details]
use did imm retry instead of did bus busy
Here is the patch from Andrew Vasquez.
Essentially it's a backport of changes done in our standard driver which swap
DID_BUS_BUSY statuses for DID_IMM_RETRY statuses in 'select' logic
paths -- those where the driver uses command recycling during topology
changes.
Of course the usage of DID_IMM_RETRY implies some care, so as to avoid
infinite retries. But, given the use of qla2xxx's own internal dev-loss-tmo
timers, command recycling will not proceed ad infinitum.
I'd suggest RH seriously consider this for their RHEL4 qla2xxx driver.
I think this patch should be fine because, as Andrew pointed out, the driver
has timers so the command is not retried forever, and he stated that:
RSCN processing is typically very fast. The worst-case fabric timeout
one must worry about for any type of extended-link-service fabric
command is 2 * R_A_TOV, where R_A_TOV is typically 10 seconds.
So commands would not sit too long.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Is this in your queue for 4.6? If not, please consider it a high priority.
I will put it in the queue. QLogic was not sure if the opinion fell in favor of
Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'Closed by Client'
This event sent from IssueTracker by firstname.lastname@example.org
The "use DID_IMM_RETRY instead of DID_BUS_BUSY" patch was submitted for RHEL4.6.
A patch addressing this issue has been included in kernel-2.6.9-55.19.EL.
The reason that kernel package isn't signed is because it is an unofficial build
on the way to RHEL 4.6 Beta. If you require an officially supported kernel with
this fix prior to RHEL 4.6, please request a hotfix.
*** Bug 180212 has been marked as a duplicate of this bug. ***
A fix for this issue should have been included in the packages contained in the
RHEL4.6 Beta released on RHN (also available at partners.redhat.com).
Requested action: Please verify that your issue is fixed to ensure that it is
included in this update release.
After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)
If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.
If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you. If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.
thanks for your update
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.