Red Hat Bugzilla – Bug 627836
retry rather than fastfail DID_REQUEUE scsi errors with dm-multipath
Last modified: 2018-10-27 13:38:20 EDT
Description of problem:
DID_REQUEUE errors returned by a SCSI LLD are fast-failed for dm-multipath in the midlayer when they could be retried. For cases where this is caused by a short-term I/O glitch (and an immediate retry would succeed), this causes unnecessary and excessive dm-multipath failovers, logs full of nasty-looking messages, and fencing with gfs quorum devices. Retry is the default behaviour on RHEL6 and upstream, but not for RHEL5 or earlier.

Version-Release number of selected component (if applicable):
RHEL5

How reproducible:
Widely seen in the wild, causing many support issues.

Steps to Reproduce:
1. Set up a dm device.
2. Lower timeouts to 5 seconds for the sd paths to the dm device.
3. Run a heavy workload on the dm device until intermittent timeouts occur.

Actual results:
dm-multipath failover.

Expected results:
No failover until retries are exhausted.

Additional info:
This is an offshoot from BZ 516303, specifically to deal with the DID_REQUEUE behaviour on RHEL5. BZ 516303 remains open to address the root cause of the HBA-specific I/O timeouts.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Mike, you mentioned the desired retry behaviour is already the default on RHEL6 and upstream - is there an existing back-portable patch for that? (need to get something out to the affected customers pretty soon). Would it be something like the following, or would you rather change the DID_REQUEUE error in the lpfc driver itself to something that would be retried rather than fastfailed?

--- linux-2.6.18-128.1.16.el5/include/scsi/scsi.h.orig	2010-09-13 11:14:40.000000000 +1000
+++ linux-2.6.18-128.1.16.el5/include/scsi/scsi.h	2010-09-13 11:24:11.000000000 +1000
@@ -395,8 +395,7 @@ enum {
 		SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
 		SCSI_IGN_ALLOWED,
 	SCSI_MLQUEUE_DEVICE_BUSY2 = (0x02 << SCSI_MLQUEUE_BUSY_SHIFT) |
-		SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
-		SCSI_IGN_ALLOWED,
+		SCSI_MLQUEUE_DIS_RETRY,
 	SCSI_MLQUEUE_TARGET_BUSY2 = (0x04 << SCSI_MLQUEUE_BUSY_SHIFT) |
 		SCSI_MLQUEUE_DIS_RETRY | SCSI_IGN_BLK_FAILFAST |
 		SCSI_IGN_ALLOWED,
(In reply to comment #2)
> Mike, you mentioned the desired retry behaviour is already the default
> on RHEL6 and upstream - is there an existing back-portable patch for that?
> (need to get something out to the affected customers pretty soon).

I have a patch; it is like 2 lines. Where are we on figuring out why the command is timing out?
(In reply to comment #3)
> (In reply to comment #2)
> > Mike, you mentioned the desired retry behaviour is already the default
> > on RHEL6 and upstream - is there an existing back-portable patch for that?
> > (need to get something out to the affected customers pretty soon).
>
> I have a patch it is like 2 lines. Where are we on figuring out why the command
> is timing out?

I mean: have you run the debug kernel and got the debug info yet?
Created attachment 447588 [details]
don't failfast scsi eh commands

This patch makes the scsi eh not fast-fail commands.

If we have people who can test the patch and are also hitting the DID_REQUEUE problem, then also have them test with https://bugzilla.redhat.com/attachment.cgi?id=436375, because we must find the root problem, or else this is just going to make it look like we solved a problem.
(In reply to comment #5)
> This patch has the scsi eh not fail fast commands.

Thanks Mike - reading the patch, it looks like that will turn off fastfail for all scsi errors, not just DID_REQUEUE. Is that correct?

> If we have people that can test the patch that are also hitting the DID_REQUEUE
> problem then also have them test with
> https://bugzilla.redhat.com/attachment.cgi?id=436375, because we must find the
> root problem or else this is just going to make it look like we solved a
> problem.

Yep, I added that extra debug stuff to the patch and built & tested it. Kernel and patch uploaded to http://people.redhat.com/mgoodwin/bz627836/

Seems to work OK. dm-multipath still fails over correctly if I manually offline a device/path, so it seems like a suitable workaround. I've updated one of the affected sites, so hopefully we'll get some testing. If that works out, I'll build a few more kernel versions with this patch for some of the other sites.

Regarding the root cause - let's deal with that in bz516303 and keep the workaround in this bz.

Thanks -- Mark Goodwin
(In reply to comment #6)
> (In reply to comment #5)
> > This patch has the scsi eh not fail fast commands.
>
> Thanks Mike - reading the patch, it looks like that will
> turn off fastfail for all scsi errs, not just DID_REQUEUE.
> Is that correct?

Yeah. This is done on purpose, because upstream we are not really converting drivers to set proper error codes in the scsi eh path. We used to not even evaluate the codes in the scsi eh path, so drivers sometimes set nothing, or whatever they want. So right now for RHEL I think it is safest to just not fast-fail anything. AFAIK, there are no drivers intentionally setting a cmd result to make the cmd fast-fail in the scsi eh path.
in kernel-2.6.18-232.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
This could affect timeout tuning for applications using multipath devices (e.g. Oracle voting disks, gfs quorum disks, etc), so maybe we should include some tech notes or release notes? Setting requires_release_note=?
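For such a tech note, one knob worth mentioning on the multipath side is dm-multipath's own queueing behaviour, which interacts with the SCSI retry change discussed here. A hedged illustration of the relevant /etc/multipath.conf settings (the values shown are examples only, not recommendations; appropriate values depend on the array and the application's failover budget):

```
# /etc/multipath.conf fragment (illustrative values only)
defaults {
	polling_interval 5      # path checker interval, seconds
	no_path_retry    12     # queue I/O for ~12 checker intervals
	                        # before failing when all paths are down
}
```

With settings like these, transient all-paths-down windows are ridden out by queueing rather than surfaced immediately to the application, which matters for timeout-sensitive consumers such as quorum disks.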
Mike,

I created a multipath device using iscsi. On both the 2.6.18-194.el5 and 2.6.18-233.el5 kernels, when I generate big I/O on the multipath device, the link checker reports the path down after I change the timeout value to 1 second.

The tool I am using to generate large I/O is:
dt of=/mnt/testfile flags=direct bs=32k limit=1000m procs=100 passes=1 log=/root/dt.log

Big I/O might not cause a DID_REQUEUE error. I am going to investigate with SystemTap, by injecting a DID_REQUEUE error into the scsi layer. If you have a better test suggestion or a SystemTap script, please kindly let me know. Thank you.
(In reply to comment #17)
> Mike,
>
> I create a multipath using iscsi.
> On both 2.6.18-194.el5 and 2.6.18-233.el5 kernel, when I generate big I/O on
> multipath, link checker will report path down after change timeout value into 1
> second.

What is the error code that it is failing with?

> The tool I am using to generate large I/O is:
> dt of=/mnt/testfile flags=direct bs=32k limit=1000m procs=100 passes=1
> log=/root/dt.log
>
> Big I/O might not cause DID_REQUEUE error.
> I am going to investigate on SystemTap by injecting a DID_REQUEUE error into
> scsi layer. If you have better test suggestion or SystemTap script, please
> kindly inform me.

You are using lpfc, right?

If you are using system tap, make sure scsi_eh_flush_done_q is run and check that the scsi_cmnd->result is set to DID_REQUEUE << 16.
(In reply to comment #18)
> What is the error code that it is failing with?
>
> You are using lpfc, right?

Sorry, I didn't notice that we need lpfc. The previous test was based on iscsi over a normal NIC.

With lpfc, I changed the checker_timeout of multipath to 1 sec. Even with 300 threads of dt, no error came out.

> If you are using system tap make sure scsi_eh_flush_done_q is run and check
> that the scsi_cmnd->result is set to DID_REQUEUE << 16.

Mike Christie,
It will take me a while to dig into the SCSI code needed for SystemTap. Do you have any SystemTap code for this bug, or other reproduction steps?
(In reply to comment #20)
> with lpfc, I changed the checker_timeout of multipath into 1 sec.
> Even with 300 thread of dt, no error came out.

Yeah, with a fast FC link it is possible that IO is going to complete within a second and not time out.

> > If you are using system tap make sure scsi_eh_flush_done_q is run and check
> > that the scsi_cmnd->result is set to DID_REQUEUE << 16.
>
> Mike Christie,
> It take a while for me to deep into the SCSI of code which is need for System
> Tap.
>
> Do you have any system tap code for this bug or other reproduce step?

You do not even need system tap. Just enable scsi eh logging:

echo 0x3F > /sys/module/scsi_mod/parameters/scsi_logging_level

Run your test and monitor /var/log/messages for

"Error handler scsi_eh_$N waking up"

Here N is the host number of the lpfc host you are sending IO to. And then you should see:

"flush attempt retry cmd"
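At that logging level the log gets noisy, so it can help to filter for just the error-handler lines. A sketch of a suitable grep, run here against a made-up sample of the messages seen later in this bug (the sample lines are illustrative, not real captured output):

```shell
# Write a small sample of eh log lines (fabricated for illustration)
cat > /tmp/eh.sample <<'EOF'
kernel: Error handler scsi_eh_4 waking up
kernel: scsi_eh_4: flushattempt retry cmd: ffff81000a23b9c0
kernel: Error handler scsi_eh_4 sleeping
EOF

# Keep only the error-handler wake/retry/sleep lines; against a real
# system you would grep /var/log/messages instead of the sample file
grep -E 'scsi_eh_[0-9]+ (waking up|sleeping)|retry cmd' /tmp/eh.sample
```

All three sample lines match, so all three are printed; on a real log the same pattern drops the surrounding per-command debug chatter.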
Enabling the 0x3F logging level for an FC HBA is a disaster, and creating timeouts in the test FC environment is also a tough job. As this change is in the scsi layer, I assembled a dm-multipath device with an iscsi disk via scsi-target-utils (aka tgt). The timeouts were caused by the limited network speed and packet loss. I used "mkfs.ext3 /dev/mapper/mpath3" for I/O testing.

In kernel-2.6.18-236.el5:
==========================================================
Dec 23 11:47:09 VMC2 kernel: Waking error handler thread
Dec 23 11:47:09 VMC2 kernel: Error handler scsi_eh_4 waking up
Dec 23 11:47:09 VMC2 kernel: sd 4:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 29
Dec 23 11:47:09 VMC2 kernel: Total of 29 commands on 1 devices require eh work
Dec 23 11:47:09 VMC2 kernel: scsi_eh_4: aborting cmd:0xffff81000519f080

After several scsi_eh_tur and scsi_eh_done scmd log lines, we got:

Dec 23 10:11:13 VMC2 kernel: Buffer I/O error on device sda, logical block 262155
Dec 23 10:11:13 VMC2 kernel: lost page write due to I/O error on sda
Dec 23 11:47:50 VMC2 kernel: scsi_eh_4: flushattempt retry cmd: ffff81000a23b9c0
Dec 23 11:47:50 VMC2 kernel: scsi_restart_operations: waking up host to restart
Dec 23 11:47:50 VMC2 kernel: scsi_add_timer: scmd: ffff81000a23bb40, time: 30000, (ffffffff880a7012)

Then we got:

Dec 23 11:47:50 VMC2 kernel: Error handler scsi_eh_4 sleeping
====================================

During this time, multipath did not perform a fastfail.
In the RHEL 5.5 GA kernel-2.6.18-194.el5, multipath performs a failover after "Waking error handler thread". These are the whole logs for one failover:
==========================================
Dec 23 12:55:34 VMC2 kernel: Waking error handler thread
Dec 23 12:55:34 VMC2 kernel: Error handler scsi_eh_0 waking up
Dec 23 12:55:34 VMC2 kernel: sd 0:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
Dec 23 12:55:34 VMC2 kernel: Total of 1 commands on 1 devices require eh work
Dec 23 12:55:34 VMC2 kernel: scsi_eh_0: aborting cmd:0xffff81000f2029c0
Dec 23 12:55:35 VMC2 kernel: scsi_eh_done scmd: ffff81000f2029c0 result: 0
Dec 23 12:55:35 VMC2 kernel: scsi_send_eh_cmnd: scmd: ffff81000f2029c0, timeleft: 9704
Dec 23 12:55:35 VMC2 kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 10
Dec 23 12:55:35 VMC2 kernel: scsi_eh_tur: scmd ffff81000f2029c0 rtn 2002
Dec 23 12:55:35 VMC2 kernel: scsi_eh_0: flushattempt retry cmd: ffff81000f2029c0
Dec 23 12:55:35 VMC2 kernel: sd 0:0:0:1: timing out command, waited 7s
Dec 23 12:55:35 VMC2 kernel: scsi_restart_operations: waking up host to restart
Dec 23 12:55:35 VMC2 kernel: Error handler scsi_eh_0 sleeping
Dec 23 12:55:35 VMC2 kernel: device-mapper: multipath: Failing path 8:0.
Dec 23 12:55:35 VMC2 kernel: scsi_block_when_processing_errors: rtn: 1
Dec 23 12:55:35 VMC2 kernel: scsi_block_when_processing_errors: rtn: 1
Dec 23 12:55:35 VMC2 kernel: scsi_add_timer: scmd: ffff81000f2029c0, time: 7000, (ffffffff880a6e31)
Dec 23 12:55:35 VMC2 multipathd: sda: readsector0 checker reports path is down
=====================================

Mike,
Do these logs indicate that the "DID_REQUEUE" error was generated and that multipath handled it properly?
What did you set the disk command timeout to?
The multipath checker timeout was set to 1000 ms.

The timeout of the iscsi disk, /sys/devices/platform/host2/session3/target2:0:0/2:0:0:1/timeout, is 30. I tried to change that sysfs file to 10 s, but it did not work.

The network speed between the initiator (in China) and the target (in the US) was limited, and causes timeouts.
Code reviewed, set as Sanity Only. Once Mike Christie finishes his investigation, I will test this bug again.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html
*** Bug 516303 has been marked as a duplicate of this bug. ***