Bug 1845909 - [RFE] Support sanlock io_timeout modification with respect to IBM 2145 storage at DC level.
Summary: [RFE] Support sanlock io_timeout modification with respect to IBM 2145 storag...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: sanlock
Version: 4.3.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.9
Assignee: Nir Soffer
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 1705289 (view as bug list)
Depends On: 1508098 1902468
Blocks: 1417161
 
Reported: 2020-06-10 12:10 UTC by Siddhant Rao
Modified: 2024-12-20 19:07 UTC
CC List: 24 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
In the current release, the sanlock io_timeout is configurable. Before configuring the sanlock io_timeout, it is recommended that you contact Red Hat support; refer to https://access.redhat.com/solutions/6338821. Red Hat does not test timeout values other than the defaults, and Red Hat support will only provide guidance on how to change those values consistently across the RHV setup.
Clone Of:
Environment:
Last Closed: 2021-11-16 15:12:47 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
wangxiao.bj: needinfo-




Links
Red Hat Bugzilla 1705289 (CLOSED): [RFE] Support NetApp MetroCluster planned switchover, sanlock_io timeout config at DC level (last updated 2022-02-01 19:24:20 UTC)
Red Hat Knowledge Base (Solution) 5152311 (last updated 2020-06-11 16:51:01 UTC)
Red Hat Knowledge Base (Solution) 6338821 (last updated 2021-09-17 17:08:20 UTC)
Red Hat Product Errata RHBA-2021:4704 (last updated 2021-11-16 15:12:59 UTC)

Description Siddhant Rao 2020-06-10 12:10:25 UTC
Description of problem:

Allow users to configure the sanlock timeouts, which are currently fixed at 80 seconds.

The IBM storage system is a stretched storage with a hyperscale implementation, with controllers in different geographical regions.

In the event of an IO failover, when one controller is down or rebooted, the controller that needs to take over can be in a different geographical region altogether. This failover can therefore often take much more than 80 seconds, during which all attempts to renew the delta leases fail and sanlock starts killing processes.

Eventually it resets the host.

  After the failure of a few SCSI sub-paths, the SCSI error handling process is triggered and IO failover is started, but it is often observed that, depending on the storage setup, the IO failover may take more than 80 seconds to complete. By then the sanlock timeout of 80 seconds has already expired, so sanlock eventually invokes wdmd and resets the host.

  In this case, it is very likely that the IO failover would have completed in 85-90 or 100 seconds, and the whole RHEV host reboot could have been avoided. Using the rigid sanlock timeout therefore does not help avert the above fence event.


 We are trying to apply all of the SCSI error handler and multipath tuning options, but it has been observed in the current case that a rigid timeout of 80 seconds is not always sufficient for the IO failover to complete. The failover could have completed if given another 5-10 or 15 seconds, but the customer suffered a complete host reboot, solely due to the fixed sanlock timeout.
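
For context, the 80 seconds above is the sanlock delta lease expiry with the default io_timeout of 10 seconds (the expiry is 8 * io_timeout, as noted later in comment 24). A rough illustration of what raising the io_timeout would buy, using that formula:

    8 * 10s (default io_timeout) =  80s  -> an 85-100s failover exceeds the expiry, the host is reset
    8 * 15s                      = 120s  -> an 85-100s failover completes before the expiry
    8 * 20s                      = 160s  -> also covers the 120s SCSI command timeout cited in comment 14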

Version-Release number of selected component (if applicable):
vdsm-4.30.44-1.el7ev.x86_64
sanlock-3.7.3-1.el7.x86_64
device-mapper-multipath-0.4.9-131.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

- Sanlock sends the IO request and gets no response; after 80 seconds it starts killing processes and resets the host.

Expected results:

- The sanlock io_timeout should be configurable so that the sanlock timeouts can be increased, preventing sanlock from killing processes and resetting the host.

Additional info:

Comment 4 Marina Kalinin 2020-06-16 15:12:31 UTC
Related BZ#1705289:
[RFE] Support NetApp MetroCluster planned switchover, sanlock_io timeout config at DC level

Comment 12 Marina Kalinin 2020-11-19 21:37:32 UTC
*** Bug 1705289 has been marked as a duplicate of this bug. ***

Comment 14 Razvan Corneliu C.R. VILT 2020-11-20 19:58:17 UTC
As already mentioned in #1705289:

> We're facing similar issues on all of our 15 oVirt clusters with a
> multitude of storage engines. While multipathd recovers, the
> oVirt hosts go crazy for any operation:
> 
> * online firmware upgrades for Brocade switches (roughly once
> every 20 switches upgraded, we take an entire cluster down),
> even if we have two independent and redundant fabrics. The
> path failures are usually shorter than 30 seconds, but if one
> is 31 seconds, everything goes haywire.
> * graceful takedown of a storage node (with NPIV, and ALUA)
> for any legitimate reason such as firmware updates, hardware
> or software maintenance. It applies to all IBM SVC style
> storages (v5030, v5100, v7200, v9000). We've also had incidents
> with EMC VMAX250f, multiple 3Par 8000 series and PureStorage
> FlashArray X50G2/G3. I'm not sure about our NetApp, but
> I assume that it behaves identically if stressed.
> 
> We believe that the I/O timeouts for Sanlock should be a
> multiple of the underlying block device I/O timeout (most
> vendors go with 30s, but IBM requires 120s in their
> documentation). Since the DMMP block devices inherit the
> timeouts of the underlying SCSI block devices (configured
> via UDev scripts straight from the vendor documentation),
> Sanlock should first look at the value declared on the
> block device. We also believe that we should be able to
> disable the qemu suspend on timeout functionality. The VM OS
> has internal methods of dealing with timeouts. VMware doesn't
> suspend VMs on very long I/Os and leaves the OS to treat the
> error. I've seen journal aborts and R/O remounts on I/O that
> took longer than 5 minutes due to a Denial of Service situation.
> 
> Fencing during a stress period for the FC Fabric leads to
> denial of service. It's the worst possible idea. You can
> get duplicate VMs even with oVirt 4.4.
> 
> There are also some issues with DMMP which sometimes fails
> to reinstate paths, DMMP doesn't actually use the RSCN and
> PLOGI information from the FC Fabric and depends almost
> exclusively on ALUA, TUR and timeouts. This means that it's
> very slow to react to changes in the fabric.
> 
> In our experience, FC Multipathing in oVirt works as long as
> you don't need it for redundancy and you only use the round-robin
> I/O distribution. If you have paths failing (even gracefully
> with ALUA notifications), it's undependable.

Note that the IBM documentation requires a SCSI Inquiry timeout of 70s and a SCSI command timeout of 120s, as documented in https://www.ibm.com/support/knowledgecenter/en/ST3FR7_8.4.0/com.ibm.fs7200_840.doc/svc_linrequiremnts_21u99v.html.
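
For reference, the SCSI command timeout that the multipath path devices inherit can be read directly from sysfs; the 120s value from the IBM documentation would normally be applied there by vendor-provided udev rules. An illustrative check only (device names differ per host):

    # print the SCSI command timeout (in seconds) of every sd device
    for t in /sys/block/sd*/device/timeout; do
        echo "$t: $(cat $t)"
    done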

I've created two clusters, RHEV 4.3 and 4.4, that I will test this behavior on. Do you have any recommendations for tracing this? I want to have 60 VMs producing about 30k IOPS of 64KB I/O, running on 16 blades connected to different IBM storages (V9000, FS900), while those go through upgrades from older firmware versions to the newest, a graceful node takedown (service mode), firmware upgrades on the fabrics, etc. All this to find out where the issue is. I will compare with a PowerMax 2000 going through the same process.

Comment 15 Germano Veit Michel 2020-11-22 21:31:55 UTC
It seems needinfo was set by mistake on me, clearing it.

Comment 23 Nir Soffer 2020-11-29 11:42:37 UTC
Testing with a 20-second io timeout (which seems to be what is needed
for this bug) shows a large delay when activating hosts, up to 30
minutes with 50 storage domains. Sanlock bug 1902468 tracks this issue.

Comment 24 Nir Soffer 2020-11-29 12:17:09 UTC
(In reply to Razvan Corneliu C.R. VILT from comment #14)
...
> Note that the IBM documentation requires a SCSI Inquiry timeout of 70s and a
> SCSI command timeout of 120s. As documented in
> https://www.ibm.com/support/knowledgecenter/en/ST3FR7_8.4.0/com.ibm.
> fs7200_840.doc/svc_linrequiremnts_21u99v.html .

Looking at "multipathd show config" shows:

        device {
                vendor "IBM"
                product "^2145"
                path_grouping_policy "group_by_prio"
                prio "alua"
                failback "immediate"
                no_path_retry "queue"
        }

RHV cannot work with "no_path_retry queue". A host using this setting will
just hang for an unlimited time when storage is not responsive, causing timeouts
in API calls, or causing VMs to become unresponsive, even VMs that do not
use this storage. The host is also likely to become non-operational, and then
RHV may try to migrate all the VMs to other hosts.

RHV currently enforces "no_path_retry 4":

$ cat /etc/multipath.conf
...
overrides {
    no_path_retry   4
}

It is not clear what value is needed to support a "SCSI command
timeout of 120s". You may try "no_path_retry 24" to allow
120 seconds of queuing if all paths have failed.
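
If someone wants to experiment with that for this storage only, one possible approach (a sketch, not a tested recommendation) is a per-WWID entry in a drop-in file under /etc/multipath/conf.d/, since the multipaths section takes precedence over the overrides section quoted above; the file name and WWID below are placeholders:

    # /etc/multipath/conf.d/ibm-2145.conf (hypothetical drop-in, created on each host)
    multipaths {
        multipath {
            wwid            "36005076801234567890123456789abcd"  # WWID of the IBM 2145 LUN
            no_path_retry   24   # 24 checks * 5s polling_interval = ~120s of queuing
        }
    }

Apply it with "multipathd reconfigure" (or by restarting multipathd); the 5-second polling_interval is the usual default and should be verified on the host.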

If we need to wait 120 seconds before failing a path, this is unlikely
to work with RHV.

Also it is not clear what is the expectation from the system during
failover/upgrade scenarios.

With the current settings, VMs are likely to pause after 20 seconds of
downtime, regardless of the sanlock timeout. When VMs pause they release
the storage lease, so they are not killed by sanlock.

The SPM host may be rebooted if sanlock cannot terminate vdsm after
the lease expires. Sanlock reboots the host only if a process holding
a lease cannot be terminated after its lease expires. If this happens,
it means that multipath is not configured correctly for vdsm, maybe
using "no_path_retry queue"?

Do we expect that VMs will continue to run (possibly hanging on I/O)
during an upgrade/failover on the storage side? For this, the multipath
no_path_retry must be equal to or larger than the sanlock renewal timeout
(8 * sanlock io timeout), and both timeouts should be larger than
the time required to complete the failover.
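
As a worked example of that relationship, assuming the common multipath polling_interval of 5 seconds (verify the actual value on the host):

    sanlock renewal timeout = 8 * io_timeout
      io_timeout = 10s (default) ->  80s of renewal budget -> no_path_retry >= 80/5  = 16
      io_timeout = 20s           -> 160s of renewal budget -> no_path_retry >= 160/5 = 32
    and both timeouts must exceed the longest expected storage failover.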

Comment 26 Nir Soffer 2020-11-30 19:12:15 UTC
Using larger timeouts significantly increases the delay when activating
a host after an unclean shutdown. Setting the sanlock host name can minimize
this delay. Adding bug 1508098 as a dependency.

Comment 27 Nir Soffer 2020-12-10 21:48:32 UTC
In 4.4.4, vdsm supports configuring sanlock io timeout. See this document
for more info:
https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md
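
For readers who do not open the link: the mechanism described there is a vdsm drop-in configuration file created on every host, roughly as sketched below (the exact option name, file name, and the required maintenance/reactivation steps should be taken from io-timeouts.md for the installed vdsm version):

    # /etc/vdsm/vdsm.conf.d/99-local.conf (example drop-in)
    [sanlock]
    # sanlock I/O timeout in seconds; the lease expiry becomes 8 * io_timeout
    io_timeout = 20

The same value has to be used consistently on all hosts in the setup (see the Doc Text and KCS solution 6338821 linked above).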

QE did not test non-default timeouts, so we cannot guarantee that the system
works with a non-default timeout and is compatible with IBM 2145 storage or
any other storage that we cannot test.

Storage vendors can use the new configuration for testing RHV and provide
recommendations to users on how to configure the system for their specific storage.

More changes may be needed in other parts of the system (hosted engine, engine)
to work better with longer io timeouts.

Comment 47 Peter Lauterbach 2021-10-06 14:58:15 UTC
As discussed, this is implemented and documented in the attached KCS.
Targeted for the RHV 4.4.9 errata and release notes.

Comment 66 errata-xmlrpc 2021-11-16 15:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.9]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4704

Comment 67 Red Hat Bugzilla 2023-09-18 00:21:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

