Bug 1273421 - [Need to add deps on kernel] vdsm iscsi failover taking too long during controller maintenance
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64 Linux
Priority: high, Severity: high
Target Milestone: ovirt-3.5.6
Target Release: 3.5.6
Assigned To: Nir Soffer
QA Contact: Aharon Canan
storage
Keywords: ZStream
Depends On: 980139
Blocks:
Reported: 2015-10-20 08:19 EDT by rhev-integ
Modified: 2016-02-10 14:21 EST (History)
CC: 43 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the multipath device timeout configuration was overwritten by the iSCSI daemon with its default timeout of 120 seconds, which resulted in long delays before VDSM attempted to try the next path. With this update, using an updated kernel, the iSCSI daemon now uses the multipath device timeout instead of overwriting it with a longer one. There is now minimal delay during failover.
Story Points: ---
Clone Of: 980139
Environment:
Last Closed: 2015-12-01 15:41:12 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 413503 None None None Never
oVirt gerrit 47078 master MERGED spec: Require newer kernel version for RHEL Never
oVirt gerrit 47500 ovirt-3.6 MERGED spec: Require newer kernel version for RHEL Never
oVirt gerrit 47502 ovirt-3.5 MERGED spec: Require newer kernel version for RHEL Never

Comment 1 Allon Mureinik 2015-10-26 09:42:52 EDT
Nir, can you please add some doctext on the impact this has on the customer?
Comment 2 Nir Soffer 2015-10-26 10:49:54 EDT
Added
Comment 4 Aharon Canan 2015-11-05 07:30:03 EST
Following my discussion with Nir, the only thing to verify here is that we require kernel 3.10.0-229.17.1.el7

Verified using vt18.1

[root@camel-vdsb ~]# rpm -q --requires vdsm-4.16.28-1.el7ev.x86_64 |grep kernel 
kernel >= 3.10.0-229.17.1.el7
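
For reference, the spec change from the gerrit patches listed above boils down to something like this (a sketch of the idea, not the literal patch text):

# vdsm.spec (sketch, RHEL builds): require the kernel that carries the timeout fix
Requires: kernel >= 3.10.0-229.17.1.el7

It is also worth checking that the host is actually running (not just carrying) a new enough kernel:

uname -r    # should report 3.10.0-229.17.1.el7 or later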
Comment 5 Julie 2015-11-10 23:03:25 EST
Hi Nir and Aharon,
   I'm not sure if I fully understand this bug. Please have a look at my updated text and let me know if you have any feedback. 

Cheers,
Julie
Comment 6 Nir Soffer 2015-11-12 02:40:47 EST
(In reply to Julie from comment #5)
> Hi Nir and Aharon,
>    I'm not sure if I fully understand this bug. Please have a look at my
> updated text and let me know if you have any feedback. 

Previously, the multipath device configuration was overwritten by the iSCSI daemon using the default timeout 120 seconds, and resulted in long delays before VDSM attempts to re-establish connections with the Manager. 

re-establishing connections with the Manager is very creative but
completely bogus, I wonder how it sneaked into this text :-)

What happens is this:

1. vdsm or another process run by vdsm (e.g. lvm, multipath) tries
   to write to or read from storage
2. the request should have timed out after 5 seconds (according to
   multipath configuration), but the timeout was overridden to 120
   seconds
3. The request times out after 120 seconds
4. Multipath tries the next path (we may have several paths to the
   same storage device)
5. This request also times out after 120 seconds

This causes delays of minutes during various operations, instead of
seconds when devices are configured properly.
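
To see the knobs involved on a host (a sketch; these are the standard
open-iscsi and device-mapper-multipath locations, and the exact values
depend on local configuration):

# iscsid default that does the overriding (120 seconds out of the box):
grep replacement_timeout /etc/iscsi/iscsid.conf

# effective per-session value, if there are active iSCSI sessions:
cat /sys/class/iscsi_session/session*/recovery_tmo

# the short multipath-side timeout we expect (the "5 seconds" above), if set:
grep fast_io_fail_tmo /etc/multipath.conf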

With this update, using an updated kernel, the iSCSI daemon now uses the multipath device timeout instead of overwriting it with a longer one. There is now minimal delay in re-establishing connections during failover.

I would remove the "re-establishing connections" part; I don't know
if it is technically correct regarding the scsi layer.

The important part is "failover" - when one path to storage is failing,
multipath detects the failure and tries the next path, or fails the operation
because all paths are faulty.
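
The failover itself can be watched from the host while a path goes down
(a sketch; needs device-mapper-multipath installed):

multipath -ll               # topology with per-path status (active/failed, ready/faulty)
multipathd -k'show paths'   # path checker state as multipathd sees it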

Ben, would you like to correct my description if needed?
Comment 7 Ben Marzinski 2015-11-12 11:39:31 EST
The only clarification I have is that step 5 is hopefully not happening serially. Multipathd will be running path checks (by default, every 20 seconds on working paths). Most of the path checker functions (but not all) run asynchronously, which means that the path checker thread is free to check other paths while one is still waiting for a reply. This means that within 20 seconds after you lose your connection to your storage, multipathd should have tried to check your path. Once this check times out after 120 seconds, multipathd will mark the path as failed, and multipath won't try to fail over to it.

So, assuming your device uses an asynchronous checker, the maximum time until multipath notices all the paths are down should be somewhere around 140 seconds in this case. If your device needs to use a synchronous checker, then it will work as you have outlined above.
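
Spelled out, the 140-second worst case is just the two defaults added together:

  up to 20 s   until the next periodic path check starts (polling interval)
  +   120 s    for that check to time out (the overridden timeout)
  =  ~140 s    before multipathd marks the last path as failed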
Comment 8 Julie 2015-11-16 23:01:24 EST
Thanks for your feedback, Ben & Nir.
Doc text updated.

Cheers,
Julie
Comment 10 errata-xmlrpc 2015-12-01 15:41:12 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2530.html
