Bug 627836 - retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Mike Christie
QA Contact: Gris Ge
Duplicates: 516303
Depends On:
Blocks:
Reported: 2010-08-27 01:19 EDT by Mark Goodwin
Modified: 2012-08-14 15:18 EDT
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-01-13 16:13:20 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
don't failfast scsi eh commands (1018 bytes, patch)
2010-09-15 19:16 EDT, Mike Christie

Description Mark Goodwin 2010-08-27 01:19:49 EDT
Description of problem:

DID_REQUEUE errors returned by a SCSI LLD are fast-failed in the mid-layer (ML)
for dm-multipath when they could be retried. For cases where this is caused by a
short-term I/O glitch (and an immediate retry would succeed), this causes
unnecessary and excessive dm-multipath failovers, logs full of nasty-looking
messages, and fencing with GFS quorum devices. Retry is the default behaviour
on RHEL6 and upstream, but not on RHEL5 or earlier.

Version-Release number of selected component (if applicable): RHEL5

How reproducible: widely seen in the wild, causing many support issues

Steps to Reproduce:
1. set up a dm device
2. lower timeouts to 5 seconds for the sd paths to the dm device
3. run a heavy workload on the dm device until intermittent timeouts occur
  
Actual results: dm-multipath failover

Expected results: no failover until retries are exhausted

Additional info: this is an offshoot from BZ 516303, specifically to
deal with the DID_REQUEUE behaviour on RHEL5. BZ 516303 remains open
to address the root cause of the HBA-specific I/O timeouts.
Comment 1 RHEL Product and Program Management 2010-08-30 11:49:13 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 Mark Goodwin 2010-09-14 21:26:00 EDT
Mike, you mentioned the desired retry behaviour is already the default
on RHEL6 and upstream - is there an existing back-portable patch for that?
(need to get something out to the affected customers pretty soon).

Would it be something like the following, or would you rather
change the DID_REQUEUE error in the lpfc driver itself to something
that would be retried rather than fastfailed?

--- linux-2.6.18-128.1.16.el5/include/scsi/scsi.h.orig  2010-09-13 11:14:40.000000000 +1000
+++ linux-2.6.18-128.1.16.el5/include/scsi/scsi.h       2010-09-13 11:24:11.000000000 +1000
@@ -395,8 +395,7 @@ enum {
                SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
                SCSI_IGN_ALLOWED,
        SCSI_MLQUEUE_DEVICE_BUSY2       = (0x02 << SCSI_MLQUEUE_BUSY_SHIFT) |
-               SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
-               SCSI_IGN_ALLOWED,
+               SCSI_MLQUEUE_DIS_RETRY,
        SCSI_MLQUEUE_TARGET_BUSY2       = (0x04 << SCSI_MLQUEUE_BUSY_SHIFT) |
                SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
                SCSI_IGN_ALLOWED,
Comment 3 Mike Christie 2010-09-15 18:00:18 EDT
(In reply to comment #2)
> Mike, you mentioned the desired retry behaviour is already the default
> on RHEL6 and upstream - is there an existing back-portable patch for that?
> (need to get something out to the affected customers pretty soon).
> 

I have a patch; it is about two lines. Where are we on figuring out why the command is timing out?
Comment 4 Mike Christie 2010-09-15 18:00:59 EDT
(In reply to comment #3)
> (In reply to comment #2)
> > Mike, you mentioned the desired retry behaviour is already the default
> > on RHEL6 and upstream - is there an existing back-portable patch for that?
> > (need to get something out to the affected customers pretty soon).
> > 
> 
> I have a patch it is like 2 lines. Where are we on figuring out why the command
> is timing out?

I mean have you run the debug kernel and got the debug info yet?
Comment 5 Mike Christie 2010-09-15 19:16:00 EDT
Created attachment 447588 [details]
don't failfast scsi eh commands

This patch makes the SCSI EH not fast-fail commands.

If we have people who can test the patch and are also hitting the DID_REQUEUE problem, then also have them test with https://bugzilla.redhat.com/attachment.cgi?id=436375, because we must find the root problem, or else this is just going to make it look like we solved the problem.
Comment 6 Mark Goodwin 2010-09-16 02:39:18 EDT

(In reply to comment #5)
> This patch has the scsi eh not fail fast commands.

Thanks Mike - reading the patch, it looks like that will
turn off fastfail for all scsi errs, not just DID_REQUEUE.
Is that correct?

> If we have people that can test the patch that are also hitting the DID_REQUEUE
> problem then also have them test with
> https://bugzilla.redhat.com/attachment.cgi?id=436375, because we must find the
> root problem or else this is just going to make it look like we solved a
> problem.

Yep I added that extra debug stuff to the patch and built & tested it.
Kernel and patch uploaded to http://people.redhat.com/mgoodwin/bz627836/

Seems to work OK. dm-multipath still fails over correctly if I manually
offline a device/path, so it seems like a suitable workaround. I've updated
one of the affected sites, so hopefully we'll get some testing. If that
works out, I'll build a few more kernel versions with this patch for some
of the other sites.

Regarding the root cause - let's deal with that in bz516303 and keep the
workaround in this bz.

Thanks
-- Mark Goodwin
Comment 7 Mike Christie 2010-09-16 14:32:16 EDT
(In reply to comment #6)
> 
> (In reply to comment #5)
> > This patch has the scsi eh not fail fast commands.
> 
> Thanks Mike - reading the patch, it looks like that will
> turn off fastfail for all scsi errs, not just DID_REQUEUE.
> Is that correct?

Yeah. This is done on purpose, because upstream we are not really converting drivers to set proper error codes in the scsi eh path. We used to not even evaluate the codes in the scsi eh path, so drivers sometimes set nothing, or whatever they wanted. So right now for RHEL I think it is safest to just not fast-fail anything. AFAIK, there are no drivers intentionally setting a cmd result to make the cmd fast-fail in the scsi eh path.
Comment 14 Jarod Wilson 2010-11-16 11:57:36 EST
in kernel-2.6.18-232.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Comment 16 Mark Goodwin 2010-11-22 18:39:22 EST
This could affect timeout tuning for applications using multipath devices
(e.g. Oracle voting disks, gfs quorum disks, etc), so maybe we should include
some tech notes or release notes? Setting requires_release_note=?
Comment 17 Gris Ge 2010-12-05 21:02:44 EST
Mike,

I created a multipath device using iSCSI.
On both the 2.6.18-194.el5 and 2.6.18-233.el5 kernels, when I generate big I/O on the multipath device, the link checker reports the path down after changing the timeout value to 1 second.

The tool I am using to generate large I/O is:
dt of=/mnt/testfile flags=direct bs=32k limit=1000m  procs=100 passes=1 log=/root/dt.log

Big I/O might not cause a DID_REQUEUE error.
I am going to investigate with SystemTap by injecting a DID_REQUEUE error into the SCSI layer. If you have a better test suggestion or SystemTap script, please let me know.

Thank you.
Comment 18 Mike Christie 2010-12-06 19:49:03 EST
(In reply to comment #17)
> Mike,
> 
> I create a multipath using iscsi.
> On both 2.6.18-194.el5 and 2.6.18-233.el5 kernel, when I generate big I/O on
> multipath, link checker will report path down after change timeout value into 1
> second.

What is the error code that it is failing with?

> 
> The tool I am using to generate large I/O is:
> dt of=/mnt/testfile flags=direct bs=32k limit=1000m  procs=100 passes=1
> log=/root/dt.log
> 
> Big I/O might not cause DID_REQUEUE error.
> I am going to investigate on SystemTap by injecting a DID_REQUEUE error into
> scsi layer. If you have better test suggestion or SystemTap script, please
> kindly inform me.
> 

You are using lpfc, right?

If you are using SystemTap, make sure scsi_eh_flush_done_q is run and check that scsi_cmnd->result is set to DID_REQUEUE << 16.
Comment 20 Gris Ge 2010-12-21 04:09:17 EST
(In reply to comment #18)
> What is the error code that it is failing with?
> 

> 
> You are using lpfc, right?
Sorry, I didn't notice that we need lpfc. The previous test was based on iSCSI over a normal NIC.

With lpfc, I changed the multipath checker_timeout to 1 second.
Even with 300 dt threads, no error came out.

> 
> If you are using system tap make sure scsi_eh_flush_done_q is run and check
> that the scsi_cmnd->result is set to DID_REQUEUE << 16.

Mike Christie,
It will take a while for me to dig into the SCSI code needed for SystemTap.

Do you have any SystemTap code for this bug, or other reproduction steps?
Comment 21 Mike Christie 2010-12-21 19:28:52 EST
(In reply to comment #20)
> with lpfc, I changed the checker_timeout of multipath into 1 sec.
> Even with 300 thread of dt, no error came out.
>

Yeah, with a fast FC link it is possible that IO is going to complete within a second and not time out.
 
> > 
> > If you are using system tap make sure scsi_eh_flush_done_q is run and check
> > that the scsi_cmnd->result is set to DID_REQUEUE << 16.
> 
> Mike Christie,
> It take a while for me to deep into the SCSI of code which is need for System
> Tap.
> 
> Do you have any system tap code for this bug or other reproduce step?

You do not even need SystemTap. Just enable SCSI EH logging.

echo 0x3F > /sys/module/scsi_mod/parameters/scsi_logging_level

Run your test and monitor /var/log/messages for

"Error handler scsi_eh_$N waking up"

Here N is the host number of the lpfc host you are sending IO to. And then you should see:

"flush attempt retry cmd"
Comment 22 Gris Ge 2010-12-23 02:54:57 EST
Enabling the 0x3F logging level for an FC HBA is a disaster.
Creating timeouts in a test FC environment is also a tough job.

As this change is in the SCSI layer, I assembled a dm-multipath device with an iSCSI disk via scsi-target-utils (aka tgt).

The timeouts were caused by the limited network speed and packet loss.
I used "mkfs.ext3 /dev/mapper/mpath3" for I/O testing.

In kernel-2.6.18-236.el5, 
==========================================================
Dec 23 11:47:09 VMC2 kernel: Waking error handler thread
Dec 23 11:47:09 VMC2 kernel: Error handler scsi_eh_4 waking up
Dec 23 11:47:09 VMC2 kernel: sd 4:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 29
Dec 23 11:47:09 VMC2 kernel: Total of 29 commands on 1 devices require eh work
Dec 23 11:47:09 VMC2 kernel: scsi_eh_4: aborting cmd:0xffff81000519f080


After several scsi_eh_tur and scsi_eh_done scmd log lines, we got:
Dec 23 10:11:13 VMC2 kernel: Buffer I/O error on device sda, logical block 262155
Dec 23 10:11:13 VMC2 kernel: lost page write due to I/O error on sda

Dec 23 11:47:50 VMC2 kernel: scsi_eh_4: flushattempt retry cmd: ffff81000a23b9c0
Dec 23 11:47:50 VMC2 kernel: scsi_restart_operations: waking up host to restart
Dec 23 11:47:50 VMC2 kernel: scsi_add_timer: scmd: ffff81000a23bb40, time: 30000, (ffffffff880a7012)

Then we got:
Dec 23 11:47:50 VMC2 kernel: Error handler scsi_eh_4 sleeping
====================================

During this time, multipath did not fast-fail.

In the RHEL 5.5 GA kernel-2.6.18-194.el5,

multipath performed a failover after "Waking error handler thread"; these are the complete logs for one failover:
==========================================
Dec 23 12:55:34 VMC2 kernel: Waking error handler thread
Dec 23 12:55:34 VMC2 kernel: Error handler scsi_eh_0 waking up
Dec 23 12:55:34 VMC2 kernel: sd 0:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
Dec 23 12:55:34 VMC2 kernel: Total of 1 commands on 1 devices require eh work
Dec 23 12:55:34 VMC2 kernel: scsi_eh_0: aborting cmd:0xffff81000f2029c0
Dec 23 12:55:35 VMC2 kernel: scsi_eh_done scmd: ffff81000f2029c0 result: 0
Dec 23 12:55:35 VMC2 kernel: scsi_send_eh_cmnd: scmd: ffff81000f2029c0, timeleft: 9704
Dec 23 12:55:35 VMC2 kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 10
Dec 23 12:55:35 VMC2 kernel: scsi_eh_tur: scmd ffff81000f2029c0 rtn 2002
Dec 23 12:55:35 VMC2 kernel: scsi_eh_0: flushattempt retry cmd: ffff81000f2029c0
Dec 23 12:55:35 VMC2 kernel: sd 0:0:0:1: timing out command, waited 7s
Dec 23 12:55:35 VMC2 kernel: scsi_restart_operations: waking up host to restart
Dec 23 12:55:35 VMC2 kernel: Error handler scsi_eh_0 sleeping
Dec 23 12:55:35 VMC2 kernel: device-mapper: multipath: Failing path 8:0.
Dec 23 12:55:35 VMC2 kernel: scsi_block_when_processing_errors: rtn: 1
Dec 23 12:55:35 VMC2 kernel: scsi_block_when_processing_errors: rtn: 1
Dec 23 12:55:35 VMC2 kernel: scsi_add_timer: scmd: ffff81000f2029c0, time: 7000, (ffffffff880a6e31)
Dec 23 12:55:35 VMC2 multipathd: sda: readsector0 checker reports path is down 
=====================================

Mike,
Do these logs indicate that the DID_REQUEUE error was generated and that multipath handled it properly?
Comment 23 Mike Christie 2010-12-26 23:06:49 EST
What did you set the disk command timeout to?
Comment 24 Gris Ge 2010-12-27 00:49:40 EST
The multipath checker timeout was set to 1000 ms.
The timeout of the iSCSI disk,
/sys/devices/platform/host2/session3/target2:0:0/2:0:0:1/timeout,
is 30 s.
I tried to change that sysfs file to 10 s, but it did not work.

The network speed between the initiator (in China) and the target (in US) is limited and causes timeouts.
Comment 25 Gris Ge 2010-12-29 01:07:35 EST
Code reviewed; set as Sanity Only.

Once Mike Christie finishes his investigation, I will test this bug again.
Comment 27 errata-xmlrpc 2011-01-13 16:13:20 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
Comment 28 Mark Goodwin 2011-11-14 17:55:33 EST
*** Bug 516303 has been marked as a duplicate of this bug. ***
