Bug 1370212 - IBM zfcp driver recovery after fabric events ends up in "Medium access timeout failure. Offlining disk!"
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assignee: Ewan D. Milne
QA Contact: guazhang@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1374441 1461138
 
Reported: 2016-08-25 14:26 UTC by loberman
Modified: 2021-03-11 14:40 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-08 14:44:46 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2575901 0 None None None 2016-08-26 05:47:57 UTC

Description loberman 2016-08-25 14:26:21 UTC
Description of problem:
We have at least 3 customers now having devices taken offline due to
sdkp->medium_access_timed_out being incremented in sd_eh_action() when the zfcp driver goes through error handling.

The result of this is that after recovery, multipath paths are not re-enabled, and we are exposed to losing device access entirely if the surviving path fails.

Version-Release number of selected component (if applicable):
Seen on RHEL 6.6+ kernels, likely also an issue in earlier kernels.

How reproducible:
During fabric events and recovery, the sequence of recovery events leads to medium_access_timed_out being incremented; once it reaches the default max_medium_access_timeouts threshold of 2, the devices are taken offline.

We had a similar issue with the fnic driver, fixed by Cisco in BZ 1341298, and I was concerned that the zfcp driver may be going through the same sequence, so I reached out to IBM.

Steps to Reproduce:
1. System is running
2. Fabric events happen
3. We lose the device

Actual results:
Devices are taken offline even though they are actually still accessible

Expected results:
We do not take devices offline

Additional info:
This is a tough problem to solve with zfcp changes as detailed below in the response from Benjamin Block at IBM.

For now we are going to suggest setting the max_medium_access_timeouts to a high value to avoid false disconnects.

From customers log

root@xxxxxx:PROD:~> zcat /var/log/messages-20160815.gz | grep sdy

Aug 15 00:32:42 xxxxxx kernel: sd 0:0:1:4: [sdy]  Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Aug 15 00:32:42 xxxxxx kernel: sd 0:0:1:4: [sdy] CDB: Read(10): 28 00 03 78 ea c0 00 00 20 00
Aug 15 00:32:42 xxxxxx kernel: end_request: I/O error, dev sdy, sector 58256064
Aug 15 00:52:17 xxxxxx multipathd: mpathm: sdy - tur checker reports path is up
root@nzvmds728:PROD:~>

The key here is that we receive DID_TIME_OUT, so we offline the disk:

	1360 static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
	1361 {
	1362         struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
	1363 
	1364         if (!scsi_device_online(scmd->device) ||
	1365             !scsi_medium_access_command(scmd) ||
	1366             host_byte(scmd->result) != DID_TIME_OUT ||
	1367             eh_disp != SUCCESS)
	1368                 return eh_disp;
	1369 
	1370         /*
	1371          * The device has timed out executing a medium access command.
	1372          * However, the TEST UNIT READY command sent during error
	1373          * handling completed successfully. Either the device is in the
	1374          * process of recovering or has it suffered an internal failure
	1375          * that prevents access to the storage medium.
	1376          */
	1377         sdkp->medium_access_timed_out++;        
	1378 
	1379         /*
	1380          * If the device keeps failing read/write commands but TEST UNIT
	1381          * READY always completes successfully we assume that medium
	1382          * access is no longer possible and take the device offline.
	1383          */
	1384         if (sdkp->medium_access_timed_out >= sdkp->max_medium_access_timeouts) {
	1385                 scmd_printk(KERN_ERR, scmd,
	1386                             "Medium access timeout failure. Offlining disk!\n");		<<----------
	1387                 scsi_device_set_state(scmd->device, SDEV_OFFLINE);
	1388 
	1389                 return FAILED;
	1390         }
	1391 
	1392         return eh_disp;
	1393 }


Response from Benjamin Block @IBM

Hello Laurence,

some small update for the problems your customers see. I removed some
parts of the history so it doesn't get too long.

On 00:21 Wed 24 Aug     , Laurence Oberman wrote:
> ----- Original Message -----
> > From: "Steffen Maier" <maier.ibm.com>
> > To: "Laurence Oberman" <loberman>
> > Cc: "Benjamin Block" <bblock.ibm.com>
> > Sent: Wednesday, June 22, 2016 10:20:07 AM
> > Subject: Re: Issue seen with zfcp seems to match the known issue with the fnic driver which was not returning
> > DID_ABORT
> >
> > Hi Laurence,
> >
> > On 06/20/2016 09:13 PM, Laurence Oberman wrote:
> > > I have a customer using the zfcp driver in RHEL7 and they are seeing
[:snip:]
> >
> > We are aware of a few zfcp bugs regarding recovery and we're almost done
> > fixing them, so stay tuned:
> >
> > 1)
> > Race in blocking fc_rport on fabric RSCN unnecessarily causing and
> > potentially escalating scsi_eh.
> > I suspect this to also erroneously trigger sd's medium access control
> > because a TUR might succeed but a subsequent I/O command might again
> > fail (with DID_TIME_OUT) due to the race (fooling fc_timed_out()).
> > This might be related to (where we did not yet know that it's in zfcp)
> > LTC bug 129581 / RH bug 1258680
> > "RHEL6.7 - I/O lockup on FS or dm layer after a few target port cable
> > pull iterations"
> >

We have since fixed the bug Steffen mentioned above; Martin
K. Petersen accepted those fixes for 4.9. This should ease the situation
with the medium access timeout.

But with recent discoveries in our code, we are not sure whether this
fixes all the problems we currently see. The medium access timeout
handling itself is still buggy in quite a few other ways; see below.

> >
> > 2)
> > Use-after-free for lun and target reset TMF causing kernel panic in
> > response handler path.
> >

We have a working fix for this too, but haven't yet had time to fully
review it.

> >
> > > We had a similar issue with the fnic driver and recently Cisco has
> > > addressed this with a patch.
> > > See below.
> > >
> > > I am wondering if we need to also address this in the zfcp driver.
[:snip:]
> > >         }
> > > }
> >
> > So this looks like DID_TRANSPORT_DISRUPTED is also a case which does not
> > lead to an error (handling) result, so maybe zfcp is good (enough) here.
> >
> > What do you think?
> >

Like Steffen said in his last mail, we still think that returning
commands with DID_TRANSPORT_DISRUPTED doesn't cause the issues you/your
customers see here.

>
> We have another 2 customers seeing this now where fabric aborts on zfcp lead to the
> "Medium access timeout failure.Offlining disk!" issue.
>
> This makes me wonder if we should look into changing the response in the zfcp driver.
>
> Aug 15 00:32:42 xxxxxxx kernel: sd 0:0:1:4: [sdy]  Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
> Aug 15 00:32:42 xxxxxxx kernel: sd 0:0:1:4: [sdy] CDB: Read(10): 28 00 03 78 ea c0 00 00 20 00
> Aug 15 00:32:42 xxxxxxx kernel: end_request: I/O error, dev sdy, sector 58256064
> Aug 15 00:52:17 xxxxxxx multipathd: mpathm: sdy - tur checker reports path is up
> root@nzvmds728:PROD:~>
>
> The key here is that we receive DID_TIME_OUT so we offline the disk
>
[:snip:]
>

So receiving DID_TIME_OUT for all commands that in fact did time out is
really not the problem here, but rather how the SD code counts this.
While it is not yet clear whether we sometimes cause those timeouts
ourselves - by causing race conditions like the one Steffen described in
Bug (1) - the handling in SD is still bad. Let's have a short look
(the source code shown is from a RHEL7.2 kernel):

Let's assume SCSI EH starts with 2 or more commands for a single sdev
that timed out - for whatever reason. In the case of zFCP - where we
implement the different eh-hooks for aborts, resets and the like
manually and don't use the general eh_strategy_handler() hook - EH will
call scsi_unjam_host() for the host in question:

2162 int scsi_error_handler(void *data)
2163 {
....
2205                 if (shost->transportt->eh_strategy_handler)
2206                         shost->transportt->eh_strategy_handler(shost);
2207                 else
2208                         scsi_unjam_host(shost);
....
2229 }

In scsi_unjam_host() it will put all pending commands into its own
eh_work_q. Then it will get sense data for each of those commands that
contains a Check Condition. With commands that timed out this is
not the case, so getting sense will be a no-op here:

2131 static void scsi_unjam_host(struct Scsi_Host *shost)
2132 {
....
2143         if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
2144                 if (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q))
2145                         scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
....
2152 }

EH will next try to abort all these commands in scsi_eh_abort_cmds(). This
will cause zFCP's abort hook zfcp_scsi_eh_abort_handler() to
be called for each command, in an effort to abort them. For this
example, let's assume that all those aborts succeeded - it might be that
the commands that timed out got lost or something, but the storage is
fine otherwise, so new SCSI commands succeed as they should.
    This will lead to all those commands being put into the local
check_list, which is then given to scsi_eh_test_devices() to be tested.

1315 static int scsi_eh_abort_cmds(struct list_head *work_q,
1316                               struct list_head *done_q)
1317 {
1318         struct scsi_cmnd *scmd, *next;
1319         LIST_HEAD(check_list);
....
1323         list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
....
1338                 rtn = scsi_try_to_abort_cmd(shost->hostt, scmd);
....
1347                 scmd->eh_eflags &= ~SCSI_EH_CANCEL_CMD;
1348                 if (rtn == FAST_IO_FAIL)
1349                         scsi_eh_finish_cmd(scmd, done_q);
1350                 else
1351                         list_move_tail(&scmd->eh_entry, &check_list);
1352         }
1353
1354         return scsi_eh_test_devices(&check_list, work_q, done_q, 0);
1355 }

Because all commands in work_q got successfully aborted, work_q will be
empty, and check_list contains all the commands that were previously in
work_q.

scsi_eh_test_devices() is now the crux of this medium access timeout
behaviour:

1260 static int scsi_eh_test_devices(struct list_head *cmd_list,
1261                                 struct list_head *work_q,
1262                                 struct list_head *done_q, int try_stu)
1263 {
1264         struct scsi_cmnd *scmd, *next;
1265         struct scsi_device *sdev;
1266         int finish_cmds;
1267
1268         while (!list_empty(cmd_list)) {
1269                 scmd = list_entry(cmd_list->next, struct scsi_cmnd, eh_entry);
1270                 sdev = scmd->device;
1271
1272                 if (!try_stu) {
1273                         if (scsi_host_eh_past_deadline(sdev->host)) {
1274                                 /* Push items back onto work_q */
1275                                 list_splice_init(cmd_list, work_q);
1276                                 SCSI_LOG_ERROR_RECOVERY(3,
1277                                         sdev_printk(KERN_INFO, sdev,
1278                                                     "%s: skip test device, past eh deadline",
1279                                                     current->comm));
1280                                 break;
1281                         }
1282                 }
1283
1284                 finish_cmds = !scsi_device_online(scmd->device) ||
1285                         (try_stu && !scsi_eh_try_stu(scmd) &&
1286                          !scsi_eh_tur(scmd)) ||
1287                         !scsi_eh_tur(scmd);
1288
1289                 list_for_each_entry_safe(scmd, next, cmd_list, eh_entry)
1290                         if (scmd->device == sdev) {
1291                                 if (finish_cmds &&
1292                                     (try_stu ||
1293                                      scsi_eh_action(scmd, SUCCESS) == SUCCESS))
1294                                         scsi_eh_finish_cmd(scmd, done_q);
1295                                 else
1296                                         list_move_tail(&scmd->eh_entry, work_q);
1297                         }
1298         }
1299         return list_empty(work_q);
1300 }

So, we iterate over all SCSI commands for that host that were previously
successfully aborted and, before that, timed out.
    In line 1284 finish_cmds will become 1 because scsi_eh_tur() will be
successful - like I said before, let's just assume the storage healed
itself via recovery or something else happened, and it's now working just
fine.
    The loop in line 1289 will go over all remaining commands in
cmd_list, and if those commands are for the same sdev as the one for
which the TUR just succeeded (a typical sign that the sdev is
working fine), it will call scsi_eh_action().

This is now the point where we reach into the scsi-disk driver. For
SD, scsi_eh_action() will call the function sd_eh_action():

1574 static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
1575 {
1576         struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
1577
1578         if (!scsi_device_online(scmd->device) ||
1579             !scsi_medium_access_command(scmd) ||
1580             host_byte(scmd->result) != DID_TIME_OUT ||
1581             eh_disp != SUCCESS)
1582                 return eh_disp;
1583
1584         /*
1585          * The device has timed out executing a medium access command.
1586          * However, the TEST UNIT READY command sent during error
1587          * handling completed successfully. Either the device is in the
1588          * process of recovering or has it suffered an internal failure
1589          * that prevents access to the storage medium.
1590          */
1591         sdkp->medium_access_timed_out++;
1592
1593         /*
1594          * If the device keeps failing read/write commands but TEST UNIT
1595          * READY always completes successfully we assume that medium
1596          * access is no longer possible and take the device offline.
1597          */
1598         if (sdkp->medium_access_timed_out >= sdkp->max_medium_access_timeouts) {
1599                 scmd_printk(KERN_ERR, scmd,
1600                             "Medium access timeout failure. Offlining disk!\n");
1601                 scsi_device_set_state(scmd->device, SDEV_OFFLINE);
1602
1603                 return FAILED;
1604         }
1605
1606         return eh_disp;
1607 }

This is the one function you already looked at. The argument eh_disp is
SUCCESS; and let's assume that medium_access_timed_out is zero at the
start of this overall SCSI EH run.
    In line 1578 we have some tests: the disk is online (TUR worked
before), the command in question is an I/O command, it did run into a
timeout, and eh_disp is SUCCESS. So none of these conditions hold and we
do not take the early-out path.
    So it will increase the medium access timeout counter for each (!!!) of
the SCSI commands that are currently in EH. Which of course will lead to
the error you see in the kernel message buffer.
    And more: because it will return FAILED for at least one command,
this command will be put back into the work_q in the calling function
scsi_eh_test_devices(), and thus cause SCSI EH to escalate to more
severe steps - like device reset. And this even though EH just healed the
state of its devices and everything is working fine (and with how
strangely some storage servers react to device and/or bus reset TMFs, this
situation can escalate into even more commands running into timeouts and
bad responses - I have seen this live already).

So this semantic here is plain wrong, if you ask me. We are in a single
EH run because something in the path towards the storage had a hiccup,
and (at least) 2 I/O commands for a single sdev ran into a timeout.
This is really nothing special; it's annoying for the
workload, and it should not happen in an ideal world, but we cannot
guarantee that.
    But with the code flow I described above, this will immediately lead
SD to disable that disk with no chance of it ever recovering
automatically (the operator has to intervene).

So yeah, we are still not sure whether we as zFCP can do more to avoid
commands from even running into this timeout situation - there might be,
even with the patch for problem (1) above, still situations where we
make the midlayer run into a timeout, although we know that the cable is
pulled or just got re-plugged - but even then this behaviour here would
still be bad.

Like I said before, with the patch for problem (1), this should get
better.

> I am wondering if we should increase the medium_access_timed_out count
> to a high number as a workaround here until we hear back from IBM.
>
> The default is 2
>
> 12:0:0:1]# cat max_medium_access_timeouts
> 2
>
> #!/bin/bash
> cd /sys/block
> for i in  sd*/device/scsi_disk/*
>   do

You can just iterate over /sys/class/scsi_disk/* for all SCSI-disks.

>     cat $i/max_medium_access_timeouts
>     echo 5 > $i/max_medium_access_timeouts
>     cat $i/max_medium_access_timeouts
>   done
>

If the code stays as it is right now, increasing the timeout can help,
but doesn't fix the overall problem. After finding this behaviour, we
also already recommended increasing the timeout to a customer that
ran into a similar problem. But 5 is still low for what we are talking
about: it only means that the disk is offlined if 5 commands time out
instead of 2, as in my example above. Also, if you want to make such
changes persistent, you are better off using a udev rule to adapt the
value of max_medium_access_timeouts; maybe something like this:

ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{max_medium_access_timeouts}:="4294967295"

But please make sure that the customers understand that this
is only a work-around. It will also not prevent SCSI EH from happening
in these scenarios and thus degrading the running workload. But it should
prevent SD from permanently disabling disks and SCSI EH from escalating
when this is really not necessary.

I hope this helps you a bit.

                                                    Beste Grüße / Best regards,
                                                      - Benjamin Block

>
> Perhaps IBM can attempt to reproduce in-house by causing fabric
> events, as I don't have access to zfcp and S390 here for this sort of
> reproducer.
>

-- 
Linux on z Systems Development         /         IBM Systems & Technology Group
                  IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz     /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294

Comment 2 Mark Goodwin 2016-08-26 05:03:28 UTC
This is very similar to BZ #1182838, where a switch firmware rolling update caused path flapping (see also linked customer case 01321891). In that BZ/case the scsi TUR path checker was also in use, and during the firmware update TURs issued by multipathd were succeeding but subsequent I/O was failing; i.e. we had failover/failback flapping every checker_interval seconds. The bug is that TUR should NOT have been succeeding (because the device is not ready, even though technically it is accessible). The workaround/solution was to change to the directio path_checker, and the case was resolved.

In the case attached to this BZ, if TURs had not been succeeding during the firmware update, then the paths would have remained failed until they returned after the upgrade. Hence we would not have tripped max_medium_access_timeouts, and the scsi error handler would not have eventually offlined the devices ... which is when everything went downhill, with manual intervention required to recover.

So it seems an alternative solution/workaround here would be to change to path_checker = directio in multipath.conf to avoid the TUR issue.

Comment 17 Ewan D. Milne 2016-09-23 16:25:15 UTC
Test kernel with patch to SCSI error handling available for testing at:

http://people.redhat.com/emilne/RPMS/.bz1370212/

Contains the following change:

commit 8ab5d0046f69034fb7f74abc25f209262d2098c1
Author: Ewan D. Milne <emilne>
Date:   Fri Sep 23 09:50:12 2016 -0400

    scsi_error: count medium access timeout only once per EH run
    
    The current medium access timeout counter will be increased for
    each command, so if there are enough failed commands we'll hit
    the medium access timeout for even a single failure.
    Fix this by making the timeout per EH run, ie the counter will
    only be increased once per device and EH run.
    
    Signed-off-by: Hannes Reinecke <hare>
    
    (Modified for RHEL6 -- KABI changes, also changed to add argument
     to scsi_eh_action, scsi_driver.eh_action, and sd_eh_action instead
     of overloading the existing eh_disp argument with a reset flag.)
    
    Signed-off-by: Ewan D. Milne <emilne>

---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 047cc20..d3e4550 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -50,6 +50,7 @@
 #define HOST_RESET_SETTLE_TIME  (10)
 
 static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
+static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn, bool reset);
 
 /* called with shost->host_lock held */
 void scsi_eh_wakeup(struct Scsi_Host *shost)
@@ -130,6 +131,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
 
        ret = 1;
        scmd->eh_eflags |= eh_flag;
+       scsi_eh_action(scmd, 0, 1);
        list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
        shost->host_failed++;
        scsi_eh_wakeup(shost);
@@ -975,12 +977,12 @@ static int scsi_request_sense(struct scsi_cmnd *scmd)
        return scsi_send_eh_cmnd(scmd, NULL, 0, scmd->device->eh_timeout, ~0);
 }
 
-static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn)
+static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn, bool reset)
 {
        if (scmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
                struct scsi_driver *sdrv = scsi_cmd_to_driver(scmd);
                if (sdrv->eh_action)
-                       rtn = sdrv->eh_action(scmd, rtn);
+                       rtn = sdrv->eh_action(scmd, rtn, reset);
        }
        return rtn;
 }
@@ -1155,7 +1157,7 @@ static int scsi_eh_test_devices(struct list_head *cmd_list,
                        if (scmd->device == sdev) {
                                if (finish_cmds &&
                                    (try_stu ||
-                                    scsi_eh_action(scmd, SUCCESS) == SUCCESS))
+                                    scsi_eh_action(scmd, SUCCESS, 0) == SUCCESS))
                                        scsi_eh_finish_cmd(scmd, done_q);
                                else
                                        list_move_tail(&scmd->eh_entry, work_q);
@@ -1289,7 +1291,7 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
                                list_for_each_entry_safe(scmd, next,
                                                          work_q, eh_entry) {
                                        if (scmd->device == sdev &&
-                                           scsi_eh_action(scmd, SUCCESS) == SUCCESS)
+                                           scsi_eh_action(scmd, SUCCESS, 0) == SUCCESS)
                                                scsi_eh_finish_cmd(scmd, done_q);
                                }
                        }
@@ -1353,7 +1355,7 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
                                list_for_each_entry_safe(scmd, next,
                                                         work_q, eh_entry) {
                                        if (scmd->device == sdev &&
-                                           scsi_eh_action(scmd, rtn) != FAILED)
+                                           scsi_eh_action(scmd, rtn, 0) != FAILED)
                                                scsi_eh_finish_cmd(scmd,
                                                                   done_q);
                                }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f812367..d6ed528 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -105,7 +105,7 @@ static int sd_suspend(struct device *, pm_message_t state);
 static int sd_resume(struct device *);
 static void sd_rescan(struct device *);
 static int sd_done(struct scsi_cmnd *);
-static int sd_eh_action(struct scsi_cmnd *, int);
+static int sd_eh_action(struct scsi_cmnd *, int, bool);
 static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer);
 static void scsi_disk_release(struct device *cdev);
 static void sd_print_sense_hdr(struct scsi_disk *, struct scsi_sense_hdr *);
@@ -1349,6 +1349,7 @@ static const struct block_device_operations sd_fops = {
  *     sd_eh_action - error handling callback
  *     @scmd:          sd-issued command that has failed
  *     @eh_disp:       The recovery disposition suggested by the midlayer
+ *     @reset:         Reset the medium access timed out increment flag
  *
  *     This function is called by the SCSI midlayer upon completion of an
  *     error test command (currently TEST UNIT READY). The result of sending
@@ -1357,10 +1358,14 @@ static const struct block_device_operations sd_fops = {
  *     test unit ready (so wrongly see the device as having a successful
  *     recovery)
  **/
-static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
+static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp, bool reset)
 {
        struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
 
+       if (reset) {
+               sdkp->medium_access_reset = 0;
+               return eh_disp;
+       }
        if (!scsi_device_online(scmd->device) ||
            !scsi_medium_access_command(scmd) ||
            host_byte(scmd->result) != DID_TIME_OUT ||
@@ -1374,7 +1379,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
         * process of recovering or has it suffered an internal failure
         * that prevents access to the storage medium.
         */
-       sdkp->medium_access_timed_out++;
+       if (!sdkp->medium_access_reset) {
+               sdkp->medium_access_timed_out++;
+               sdkp->medium_access_reset = 1;
+       }
 
        /*
         * If the device keeps failing read/write commands but TEST UNIT
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index ebf68e3..c7c7434 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -90,6 +90,7 @@ struct scsi_disk {
        unsigned        lbpvpd : 1;
 #ifndef __GENKSYMS__
        unsigned        cache_override : 1; /* temp override of WCE,RCD */
+       unsigned        medium_access_reset : 1;
 #endif
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h
index 20fdfc2..e1dd47a 100644
--- a/include/scsi/scsi_driver.h
+++ b/include/scsi/scsi_driver.h
@@ -16,7 +16,7 @@ struct scsi_driver {
 
        void (*rescan)(struct device *);
        int (*done)(struct scsi_cmnd *);
-       int (*eh_action)(struct scsi_cmnd *, int);
+       int (*eh_action)(struct scsi_cmnd *, int, bool);
 };
 #define to_scsi_driver(drv) \
        container_of((drv), struct scsi_driver, gendrv)

Comment 18 loberman 2016-09-23 16:33:50 UTC
Many Thanks Ewan,
I have offered this to the customer to test it for us.

Comment 19 Ewan D. Milne 2016-09-23 16:39:31 UTC
Thanks.  Let me know how it works out.  We have had a bit of discussion, and
I think there are other issues that this does not fix, but I would rather fix
95% of the problem for the customer now and worry about other aspects later.

This is a -660 kernel and as far as I can tell it does not have any recent
zfcp fixes; we are going to need a separate BZ to track those if we don't have
one already.

Comment 20 loberman 2016-09-23 16:48:05 UTC
Understood

I will open a new BZ and track the coming zfcp changes when they show up upstream.
Steffen did not say when they would be coming out; he just said to keep an eye out.

When they show up I will open the BZ to get them back-ported, as it would appear they are important for the general stability of the zfcp driver.

Tracking the zfcp changes will actually have to be 2 BZs, one for 6.9 and one for 7.2+.

Thanks!!

Comment 26 loberman 2016-10-13 00:35:29 UTC
Patch was missing a line

diff -Nurp linux-2.6.32-573.18.1.el6.orig/drivers/scsi/libfc/fc_exch.c linux-2.6.32-573.18.1.el6/drivers/scsi/libfc/fc_exch.c
--- linux-2.6.32-573.18.1.el6.orig/drivers/scsi/libfc/fc_exch.c	2016-01-06 10:15:32.000000000 -0500
+++ linux-2.6.32-573.18.1.el6/drivers/scsi/libfc/fc_exch.c	2016-10-12 20:33:54.558469871 -0400
@@ -815,14 +815,19 @@ err:
  * EM is selected when a NULL match function pointer is encountered
  * or when a call to a match function returns true.
  */
-static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
-					    struct fc_frame *fp)
+static struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
+                                    struct fc_frame *fp)
 {
 	struct fc_exch_mgr_anchor *ema;
+	struct fc_exch *ep;
 
-	list_for_each_entry(ema, &lport->ema_list, ema_list)
-		if (!ema->match || ema->match(fp))
-			return fc_exch_em_alloc(lport, ema->mp);
+	list_for_each_entry(ema, &lport->ema_list, ema_list) {
+		if (!ema->match || ema->match(fp)) {
+			ep = fc_exch_em_alloc(lport, ema->mp);
+		if (ep)
+			return ep;
+       	        }
+	}
 	return NULL;
 }
 
diff -Nurp linux-2.6.32-573.18.1.el6.orig/include/linux/netdevice.h linux-2.6.32-573.18.1.el6/include/linux/netdevice.h
--- linux-2.6.32-573.18.1.el6.orig/include/linux/netdevice.h	2016-01-06 10:15:59.000000000 -0500
+++ linux-2.6.32-573.18.1.el6/include/linux/netdevice.h	2016-10-12 20:08:52.828043196 -0400
@@ -1103,6 +1103,10 @@ struct net_device
 
 #define NETIF_F_ALL_TSO 	(NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_TSO_ECN)
 
+#define NETIF_F_ALL_FCOE	(NETIF_F_FCOE_CRC | NETIF_F_FCOE_MTU | \
+				 NETIF_F_FSO)
+
+
 	/*
 	 * If one device supports one of these features, then enable them
 	 * for all in netdev_increment_features.
diff -Nurp linux-2.6.32-573.18.1.el6.orig/include/scsi/fc_frame.h linux-2.6.32-573.18.1.el6/include/scsi/fc_frame.h
--- linux-2.6.32-573.18.1.el6.orig/include/scsi/fc_frame.h	2016-01-06 10:15:10.000000000 -0500
+++ linux-2.6.32-573.18.1.el6/include/scsi/fc_frame.h	2016-10-12 20:08:52.829043197 -0400
@@ -137,6 +137,8 @@ static inline struct fc_frame *fc_frame_
 		fp = fc_frame_alloc_fill(dev, len);
 	else
 		fp = _fc_frame_alloc(len);
+		if(!fp)
+			printk("RHDEBUG: In fcp_frame_alloc, we returned fp = %p\n",fp);
 	return fp;
 }
 
diff -Nurp linux-2.6.32-573.18.1.el6.orig/net/8021q/vlan_dev.c linux-2.6.32-573.18.1.el6/net/8021q/vlan_dev.c
--- linux-2.6.32-573.18.1.el6.orig/net/8021q/vlan_dev.c	2016-01-06 10:15:58.000000000 -0500
+++ linux-2.6.32-573.18.1.el6/net/8021q/vlan_dev.c	2016-10-12 20:08:52.829043197 -0400
@@ -522,7 +522,9 @@ static int vlan_dev_init(struct net_devi
 
 	netdev_extended(dev)->hw_features = NETIF_F_ALL_CSUM | NETIF_F_SG |
 					    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
-					    NETIF_F_HIGHDMA | NETIF_F_SCTP_CSUM;
+ 					    NETIF_F_HIGHDMA | NETIF_F_SCTP_CSUM |
+ 					    NETIF_F_ALL_FCOE;
+
 
 	dev->features |= real_dev->vlan_features | NETIF_F_LLTX;
 	dev->gso_max_size = real_dev->gso_max_size;
[loberman@dhcp-33-21 SOURCES]$

Comment 29 loberman 2016-10-13 09:34:10 UTC
Yes mistake. 26 is for another BZ.
Thanks for catching that.

Yes build from source.

Ignore comment 26

Comment 30 Milan P. Gandhi 2016-10-13 09:35:59 UTC
(In reply to loberman from comment #29)
> Yes mistake. 26 is for another BZ.
> Thanks for catching that.
> 
> Yes build from source.
> 
> Ignore comnent 26

Thanks Laurence!
Let me build the test kernel for all arch with patch in comment 17.

Comment 33 Ewan D. Milne 2016-10-13 16:00:29 UTC
I have put an s390x version of the test kernel with patch to SCSI error handling in:

http://people.redhat.com/emilne/RPMS/.bz1370212/

Comment 37 Ewan D. Milne 2016-12-13 13:48:53 UTC
>Hi Ewan, I have got an update from the customer that they could test a kernel for
>an s390 system. The link in comment#17 shows a test kernel for the x86_64 arch only.
>Could you please let me know if there is a test kernel available for the s390 arch
>that I can share with the customer.
>
>Thanks,
>Milan.
>
>(In reply to Ewan D. Milne from comment #33)
> I have put an s390x version of the test kernel with patch to SCSI error
> handling in:
> 
> http://people.redhat.com/emilne/RPMS/.bz1370212/

Is there any update on whether the customer was able to test this?
We are past the 6.9 deadline at this point.

Comment 43 loberman 2017-03-05 12:54:58 UTC
Upstream has a fix forthcoming:

https://marc.info/?l=linux-scsi&m=148827743226480&w=2

Regards
Laurence

Comment 49 Ewan D. Milne 2017-11-08 14:44:46 UTC
Per discussion w/support, closing as WONTFIX.  There is a workaround available,
which is to increase max_medium_access_timeouts to a large value (see KB article
https://access.redhat.com/site/solutions/2575901).  It appears as if the problem
may no longer be occurring due to fixes on the array side; we are also no
longer receiving reports of this problem.

