Under RHEL 5.2, running DM over FC, with a fabric that occasionally congests and
drops a frame, the driver returns a DID_ERROR to communicate a SCSI i/o that
failed to complete due to a communication failure. The i/o, being on a disk, is
valid to be retried, and would likely succeed. THERE HAS BEEN NO LOSS OF
CONNECTIVITY TO THE DISK. However, the policies within DM treat this single i/o
failure as a hard communication loss and fails over to a new path.
The OEM we are testing with is complaining about these spurious failovers for
what should be a retryable scenario.
We would like DM to be corrected in the 5.3 timeframe.
Emulex tracking # is CR 074864
For RHEL5, is there a hardware handler involved? Could one handle this?
Or should core device-mapper have a configurable retry option for all failing I/O?
(We added similar for PG initialisation.)
Created attachment 306307 [details]
wait to fail path
Do you think it would be better to return a different (maybe new) SCSI error
code so the scsi layer does not fast fail it, or does DID_ERROR really make the
most sense here and if so should DID_ERROR get fast failed?
If we think that the multipath layer should be getting this error and we do not
want to switch paths right away, then I did this patch so we do not fail a path
on the first error. We will fail it if we get errors for some time. It needs
some love from the userspace guys though. Right now the time to wait is
hardcoded and we want to be able to set that from multipath.conf.
What do you guys think?
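
The "do not fail a path on the first error" idea above can be sketched as a small userspace model. This is not the dm-multipath patch itself: `PathState`, `record_io_result`, and the 30-second `FAIL_PATH_GRACE_SECONDS` value are hypothetical names and numbers chosen for illustration, and the real wait time is whatever the patch hardcodes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical grace period; the thread says the real patch hardcodes its
# value and that it should eventually be settable from multipath.conf.
FAIL_PATH_GRACE_SECONDS = 30.0


@dataclass
class PathState:
    name: str
    first_error_at: Optional[float] = None  # start of the current error streak
    failed: bool = False


def record_io_result(path: PathState, ok: bool, now: float,
                     grace: float = FAIL_PATH_GRACE_SECONDS) -> bool:
    """Update path state after one I/O completion; return True if the path is failed."""
    if ok:
        # Any success ends the error streak, so the timer starts over.
        path.first_error_at = None
        return path.failed
    if path.first_error_at is None:
        # First error in a streak: remember when it started, keep the path.
        path.first_error_at = now
    elif now - path.first_error_at >= grace:
        # Errors have persisted for the whole grace period: fail the path.
        path.failed = True
    return path.failed
```

In this model a single dropped frame only starts a timer, and a later successful I/O clears it, so the path is failed only when errors keep arriving for the full grace period.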
Oh yeah, if we are saying the problem is that we do not want to mark the path
failed and want to retry IO from the multipath layer then this could be a dup of
https://bugzilla.redhat.com/show_bug.cgi?id=244967. With that bug we get
DID_BUS_BUSY, but the problem that the BZ submitter wanted fixed was that they did
not want the path marked failed by the multipath layer right away.
Oh yeah, but I was just wondering if we end up having to throw the error to
the multipath layer what is so bad about switching paths? multipathd will figure
things out and online the path. Since we are already going through the error
paths performance is going to be affected. Is the problem with this that people
are not setting queue_if_no_path or no_path_retry to queue IO when all paths are
affected or is it that people do not like the path failure log message or is it
just that we really should not switch paths if we do not have to?
RE: comment #2
Another error code or not -- depends on how it's implemented, which I haven't
looked into. The scsi header says DID_ERROR is an "internal error" which gives
no clue as to its severity or retry ability. And there's another
scsi_mid_low_api.txt reference which says this return code is to be used if
there's a transmission error where the underflow isn't accurately reported. I'd
rather not be ambiguous, so yes, a new error code makes sense - something that
reports a transient transmission error. You could add to the definition (and is
retryable), but really that's a choice of the class driver (sd vs st, etc).
If you get this error - I'd avoid the failover and do N retries, and if the
retries all failed, then do the failover. The retry should succeed, and quickly.
I thought about saying if the retries took longer than the fastfail time, but
you have to be careful if the normal i/o completion is longer than the fastfail
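
The "do N retries, then fail over" suggestion above can be sketched as follows. This is a userspace illustration, not kernel code: `submit_with_retries`, the `TRANSIENT` set, and the default of 3 retries are assumptions made for the example. The host byte values match the upstream Linux `include/scsi/scsi.h`, but the RHEL 5 tree may differ, so treat them as illustrative.

```python
from typing import Callable

# Host byte values as in upstream include/scsi/scsi.h (illustrative here).
DID_OK = 0x00
DID_NO_CONNECT = 0x01
DID_BUS_BUSY = 0x02
DID_ERROR = 0x07

# Assumption: which host bytes count as transient is a policy choice.
TRANSIENT = {DID_BUS_BUSY, DID_ERROR}


def host_byte(result: int) -> int:
    """Extract the host byte (bits 16-23) from a SCSI result word."""
    return (result >> 16) & 0xFF


def submit_with_retries(submit: Callable[[], int], max_retries: int = 3) -> str:
    """Return 'ok' on success, or 'failover' when the same path cannot be used."""
    for _attempt in range(1 + max_retries):
        hb = host_byte(submit())
        if hb == DID_OK:
            return "ok"
        if hb not in TRANSIENT:
            # Connectivity-style errors are not worth retrying on this path.
            return "failover"
    # Every retry hit a transient error: give up on this path.
    return "failover"
```

A count-based limit like this avoids the timing concern raised above, since it does not compare retry duration against the fastfail time.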
Re: comment #3
I agree with the BZ submitter - you should be able to return BUS_BUSY (or
TARGET_BUSY) and not force the failover. There are lots of reasons for that.
What's the cost - well, if in a scenario where the paths are just different
fabric routes to the same device port, or if they are different ports on a
device where all paths are equally performant and available for access
simultaneously, and there's no impacts of caching behaviors between the ports -
then the cost is little (other than the explanation to the user as to why the
"failover" message occurred - which is usually a hint that something is bad).
But - most devices suffer impacts when they fail over. Lun state/data may have
to be migrated between access ports, or controllers - and some of these
migration windows have taken 30-90 seconds, even with Active-Active, as there's
still a controller preference. And this migration impact can affect groups or
all luns, not just the one failing over. Real failovers can be costly. So I
agree with your last sentence - don't failover unless you must - which is
usually a connectivity or a lack of response failure.
Mike, I see you posted a patch:
From: Mike Christie <firstname.lastname@example.org>
To: device-mapper development <email@example.com>, SCSI Mailing List
Subject: [dm-devel] mpath: don't fail paths on first error
Date: Thu, 05 Jun 2008 12:25:31 -0500 (13:25 EDT)
and Hannes approved. What is the status? Can this make RHEL 5.3? (I assume Ben
is the proper one to make that happen.)
(In reply to comment #5)
> Mike, I see you posted a patch:
> From: Mike Christie <firstname.lastname@example.org>
> To: device-mapper development <email@example.com>, SCSI Mailing List
> Subject: [dm-devel] mpath: don't fail paths on first error
> Date: Thu, 05 Jun 2008 12:25:31 -0500 (13:25 EDT)
> and Hannes approved. What is the status? Can this make RHEL 5.3? (I assume Ben
> is the proper one to make that happen.)
Yeah there are two problems.
1. Alasdair needs to ACK it and merge it and push it. Hannes just reviewed it
for me and is not the maintainer.
2. The timeout is hard coded. I can add a way to configure it from userspace
like other dm-multipath table values, but Ben will have to modify
multipath-tools for me (we probably want a separate bz for that), if we want
this done quickly (I have not touched multipath-tools for 5 years or something).
3. Another related issue is IO getting stuck in blocked queues and drivers not
implementing the fast io fail timeout.
In this patchset:
I fixed all the issues (although we need more testing and work on lpfc). Plus in
the last couple patches I did a fix for the problem where we fail IO too soon in
the scsi layer (BZ 244967). I am not sure how long it will take all those
patches to get upstream, so I cut out the critical part that is just needed to
make the patch you pointed to above work correctly here
We should be able to get that one in soon. I was waiting for JamesS to review
everything. He has been busy with other stuff though. Maybe if JamesS could just
review the last patch I just posted
and we do the dm-multipath patch for 5.3 that will solve several issues.
For 5.4, we can shoot for the scsi fail fast bits problems that are fixed in some
of those other patches (bz 244967).
James Smart said he was ok with the kernel parts, so I am going to bug the block and scsi maintainer to push the kernel code.
Removing need info from JamesS.
I heard in BZ 441746 that this set of patches won't make 5.3. This is going to be a major headache for OEM, HP. Any way this can be changed?
I am not getting any feedback from James and Jens on the block layer fail fast changes, so I conditionally NAKed my patches for RHEL until I get them upstream. Hey, we even reject our own code because it is not upstream and we are not just jerks to vendors :).
Oh yeah on a book keeping note. Do these patches end up fixing this specific bug? I may have misunderstood and thought they did not. If they do not then I am going to clone this bugzilla and split the two issues.
Oh yeah, "conditionally NAKed" means if I get it upstream then we can put it in. If not then I have to wait. I will try to push them again today and see what happens.
From what our testing showed us, the patches largely solve the initial problem: DM ignored that an error was retryable and immediately did a path failover. There are a couple of cases in the driver where we don't return the TRANSPORT_DISRUPTED status, so we can still cause the condition to exist, but it isn't the general case. The place that is still broken is the user-space side, which probes the paths to determine when they are live again. The user-space daemon doesn't look at the error either, doesn't do retries, etc., so the failback seems to fail in some cases. But again, that wasn't the general complaint.
Do you have a copy of the patches, cut against the 5.3 kernel? HP would like to do testing. We're trying to ensure they have the right kernel with the right patches from you, with the right sources for our driver.
(In reply to comment #15)
> Do you have a copy of the patches, cut against the 5.3 kernel ?
Just to make sure we are on the same page, because I have sent so many patches for different bzs: the patches are the ones I just re-sent to the list, right:
I do not have a version for RHEL, but will make one.
Just to keep this up to date.
Mike Anderson had some comments about the last patch not being implemented so nicely and I agree with him. He is going to send a patch to linux-scsi for discussion.
Any update on this one, how's it looking for inclusion in RHEL5.3 at this juncture?
It looks like Mike Anderson is going to post the patches for his idea today.
Patches posted to linux-scsi
Can we proceed with getting this fixed in RHEL5.3? Without this change we are very exposed to constant failovers especially on arrays that are easily overloaded.
Created attachment 316556 [details]
port failfast changes from upstream to RHEL 5.3 kernel
I ported the fail fast changes from upstream to the RHEL 5.3 kernel. This also includes the patches from MikeA that JamesB was going to merge, but had conflicts with other patches. The patch in this bz can be applied to the kernel here
(note you might get some errors about the scsi_dh modules if they are not merged yet and if they are not merged it can be ignored).
I only ported the fc class and scsi and block changes. I did not port the iscsi and qla2xxx and other drivers, because I do not think we have the resources to test everything.
I was only able to check for regressions and that the new failfast behavior works as expected by running iscsi and qla2xxx with and without multipath. I could not test lpfc or mptfc because the boxes with lpfc cards will not boot the current kernels (my fc target died today too, and I have to go into the lab to move cards around so I can test lpfc) and I cannot find our mptfc card (I think it is in westford).
Created attachment 316558 [details]
port failfast changes from upstream to RHEL 5.3 kernel (take 2)
The last patch had KABI issues.
Thank you! We'll grab the latest patch/kernel and test it out on lpfc.
I guess there are some other conflicts with the scsi_dh modules. Here are the patches for the scsi_dh modules
If your kernel does not have them yet, you should apply them first then apply the failfast port patch.
Created attachment 316600 [details]
port failfast changes from upstream to RHEL 5.3 kernel (take 3)
This fixes three issues:
If 3rd party drivers were checking for REQ_FAILFAST with multipath, then it was not getting set and they would retry IO longer than they should. This patch sets both the failfast and transport failfast bits. The scsi layer will check for the transport failfast bit first, and if it is set it will ignore the failfast bit and do the new code paths.
This patch also fixes a potential compilation problem found during review by Tomas, and it removes the KABI proofing that was not needed.
iSCSI was getting bitten by the failfast problem, especially during boot, so this patch converts it to use the new host byte errors like upstream.
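
The flag precedence described in the first fix can be sketched like this. It is a model only: the bit positions are made up, and `REQ_FAILFAST_TRANSPORT`, `multipath_request_flags`, and `scsi_failfast_mode` are hypothetical names for illustration; only `REQ_FAILFAST` is named in this thread.

```python
# Illustrative bit positions only; the real flag values live in the kernel's
# block layer headers and are not given in this thread.
REQ_FAILFAST = 1 << 0            # legacy bit that third-party drivers check
REQ_FAILFAST_TRANSPORT = 1 << 1  # hypothetical name for the transport failfast bit


def multipath_request_flags() -> int:
    """Flags set on multipath I/O: both bits, so older drivers still see failfast."""
    return REQ_FAILFAST | REQ_FAILFAST_TRANSPORT


def scsi_failfast_mode(flags: int) -> str:
    """Choose the error-handling path; the transport bit is checked first."""
    if flags & REQ_FAILFAST_TRANSPORT:
        return "transport"  # new code paths; the legacy bit is ignored
    if flags & REQ_FAILFAST:
        return "legacy"
    return "normal"
```

Setting both bits keeps third-party drivers that only test the legacy bit from retrying too long, while the scsi layer still takes the new path because it tests the transport bit first.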
Created attachment 316605 [details]
merge all scsi changes
Hey this is just the failfast patch with the scsi dh changes merged in to make it easier to test. Just apply and run this - no patch dependency problems. If you are reviewing the patch, then review the other one.
Created attachment 316608 [details]
merge all scsi changes (take2)
Forgot to do git-add on some files, so they were missing. This should compile now.
Does that last attachment have the fastfail changes in it as well? After a quick glance it looks like it does.
The last attachment, in comment #33, has the failfast changes.
Emulex has been running tests for the past 2 days and things look good.
We're seeing path failover after 360 seconds for host error codes DID_ERROR, DID_BUS_BUSY, and DID_TRANSPORT_DISRUPTED.
You can download this test kernel from http://people.redhat.com/dzickus/el5
~~~ Attention Partners! ~~~
Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in the Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible.
If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered.
All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or other future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked.
For any additional questions, speak with your Partner Manager.
~~ Snapshot 3 is now available ~~
Snapshot 3 is now available for Partner Testing, which should contain a fix that resolves this bug. ISOs available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add PartnerVerified Keyword).
Ping your Partner Manager with any additional questions. Thanks!
Guys, we're all set with this. Emulex has already verified this fix. OK to close.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 505123 has been marked as a duplicate of this bug. ***