Description of problem:

Allow users to configure the sanlock timeouts, which are currently fixed at 80 seconds. The IBM storage system is a stretched, hyperscale implementation with controllers in several different geographical regions. During an IO failover, when one controller is down or rebooted, the controller that takes over can be in a different geographical region altogether, so the failover often takes much longer than 80 seconds. During that time all attempts to renew the delta leases fail, sanlock starts killing processes, and eventually it resets the host.

After a few SCSI sub-paths fail, the SCSI error handling process is triggered and IO failover is started, but depending on the storage setup the failover is often observed to take more than 80 seconds to complete. The 80-second sanlock timeout then expires, sanlock invokes wdmd, and the host is reset. In this situation it is very likely that the IO failover would have completed in 85-90 or 100 seconds, and the whole RHEV host reboot could have been avoided. Using a rigid sanlock timeout therefore does not help avert this fence event.

We are applying all the SCSI error handler and multipath tuning options, but in the current case a rigid timeout of 80 seconds is not always sufficient for an IO failover to complete. The failover could have completed given 5, 10, or 15 more seconds, but the customer suffered a complete host reboot only due to the fixed sanlock timeout.

Version-Release number of selected component (if applicable):
vdsm-4.30.44-1.el7ev.x86_64
sanlock-3.7.3-1.el7.x86_64
device-mapper-multipath-0.4.9-131.el7.x86_64

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
- Sanlock sends the IO request and does not get a reply; after 80 seconds sanlock starts killing processes and resets the host.

Expected results:
- The sanlock io_timeout should be configurable so that the sanlock timeouts can be increased, avoiding sanlock killing processes and resetting the host.

Additional info:
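For context, the 80-second figure is not an independent setting but is derived from sanlock's per-I/O timeout. The sketch below shows the arithmetic, assuming sanlock's renewal-failure window of 8 * io_timeout and the default io_timeout of 10 seconds; the 20-second value is only an illustration, not a recommendation.

    # Renewal-failure window before sanlock starts killing lease holders:
    io_timeout=10                  # sanlock default, in seconds
    echo $(( 8 * io_timeout ))     # 80  -> the rigid window described above
    io_timeout=20                  # hypothetical larger value
    echo $(( 8 * io_timeout ))     # 160 -> would cover an 85-120 second IO failover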
Related BZ#1705289: [RFE] Support NetApp MetroCluster planned switchover, sanlock_io timeout config at DC level
*** Bug 1705289 has been marked as a duplicate of this bug. ***
As already mentioned in #1705289:

> We're facing similar issues on all our 15 oVirt clusters with a
> multitude of storage engines. While multipathd recovers, the
> oVirt hosts go crazy for any operation:
>
> * online firmware upgrades for Brocade switches (once every
>   about 20 switches upgraded, we take an entire cluster down),
>   even if we have two independent and redundant fabrics. The
>   path fails are usually shorter than 30 seconds, but if one
>   is 31 seconds, everything goes haywire.
> * graceful takedown of a storage node (with NPIV and ALUA)
>   for any legitimate reason such as firmware updates, hardware
>   or software maintenance. This applies to all IBM SVC-style
>   storages (v5030, v5100, v7200, v9000). We've also had incidents
>   with EMC VMAX250f, multiple 3Par 8000 series and PureStorage
>   FlashArray X50G2/G3. I'm not sure about our NetApp, but
>   I assume that it behaves identically if stressed.
>
> We believe that the I/O timeouts for Sanlock should be a
> multiple of the underlying block device I/O timeout (most
> vendors go with 30s, but IBM requires 120s in their
> documentation). Since the DMMP block devices inherit the
> timeouts of the underlying SCSI block devices (configured
> via udev scripts straight from the vendor documentation),
> Sanlock should first look at the value declared on the
> block device. We also believe that we should be able to
> disable the qemu suspend-on-timeout functionality. The VM OS
> has internal methods of dealing with timeouts. VMware doesn't
> suspend VMs on very long I/Os and leaves the OS to handle the
> error. I've seen journal aborts and R/O remounts on I/O that
> took longer than 5 minutes due to a denial-of-service situation.
>
> Fencing during a stress period for the FC fabric leads to
> denial of service. It's the worst possible idea. You can
> get duplicate VMs even with oVirt 4.4.
>
> There are also some issues with DMMP, which sometimes fails
> to reinstate paths; DMMP doesn't actually use the RSCN and
> PLOGI information from the FC fabric and depends almost
> exclusively on ALUA, TUR and timeouts. This means that it's
> very slow to react to changes in the fabric.
>
> In our experience, FC multipathing in oVirt works as long as
> you don't need it for redundancy and you only use the round-robin
> I/O distribution. If you have paths failing (even gracefully
> with ALUA notifications), it's undependable.

Note that the IBM documentation requires a SCSI Inquiry timeout of 70s and a SCSI command timeout of 120s, as documented in https://www.ibm.com/support/knowledgecenter/en/ST3FR7_8.4.0/com.ibm.fs7200_840.doc/svc_linrequiremnts_21u99v.html .

I've created two clusters of RHEV 4.3 and 4.4 that I will test this behavior on. Do you have any recommendations for tracing this? I want to have 60 VMs producing about 30k IOPS of 64 KB I/O running on 16 blades connecting to different IBM storages (V9000, FS900) going through upgrades from older firmware versions to the newest, going through a graceful node takedown (service mode), doing firmware upgrades on the fabrics, etc. All this to find out where the issue is. I will compare with a PowerMax 2000 going through the same process.
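For reference, the SCSI command timeout from the IBM documentation is usually applied by writing to the disk's sysfs timeout attribute through a udev rule. The snippet below is only a sketch of that approach; the rule file name is made up, and the exact match strings and values should be taken from the vendor documentation linked above.

    # Hypothetical rule file: /etc/udev/rules.d/99-ibm-scsi-timeout.rules
    # Sets the SCSI command timeout to 120 s for IBM 2145 (SVC/Storwize) LUNs.
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTRS{vendor}=="IBM*", ATTRS{model}=="2145*", ATTR{device/timeout}="120"

    # The effective value can be verified per disk with:
    #   cat /sys/block/sdX/device/timeout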
It seems needinfo was set by mistake on me, clearing it.
Testing with a 20-second io timeout (which seems to be what is needed for this bug) shows a large delay when activating hosts, up to 30 minutes with 50 storage domains. Sanlock bug 1902468 tracks this issue.
(In reply to Razvan Corneliu C.R. VILT from comment #14)
...
> Note that the IBM documentation requires a SCSI Inquiry timeout of 70s and a
> SCSI command timeout of 120s. As documented in
> https://www.ibm.com/support/knowledgecenter/en/ST3FR7_8.4.0/com.ibm.
> fs7200_840.doc/svc_linrequiremnts_21u99v.html .

Looking at "multipathd show config" output shows:

    device {
        vendor "IBM"
        product "^2145"
        path_grouping_policy "group_by_prio"
        prio "alua"
        failback "immediate"
        no_path_retry "queue"
    }

RHV cannot work with "no_path_retry queue". A host using this setting will simply hang for an unlimited time when storage is not responsive, causing timeouts in API calls or causing VMs to become unresponsive, even VMs that do not use this storage. The host is also likely to become non-operational, and RHV may then try to migrate all its VMs to other hosts.

RHV currently enforces "no_path_retry 4":

    $ cat /etc/multipath.conf
    ...
    overrides {
        no_path_retry 4
    }

It is not clear what value is needed to support a "SCSI command timeout of 120s". You may try "no_path_retry 24" to allow 120 seconds of queuing once all paths have failed. If we need to wait 120 seconds before failing a path, this is unlikely to work with RHV.

It is also not clear what is expected from the system during failover/upgrade scenarios. With the current settings, VMs are likely to pause after 20 seconds of downtime, regardless of the sanlock timeout. When VMs pause they release their storage lease, so they are not killed by sanlock. The SPM host may be rebooted if sanlock cannot terminate vdsm after the lease expires. Sanlock reboots the host only if a process holding a lease cannot be terminated after its lease expires. If this happens it means that multipath is not configured correctly for vdsm, maybe using "no_path_retry queue"?

Do we expect that VMs will continue to run (possibly hanging on I/O) during an upgrade/failover on the storage side? For that, the multipath no_path_retry timeout must be equal to or larger than the sanlock renewal timeout (8 * sanlock io_timeout), and both timeouts must be larger than the time required to complete the failover.
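To make the relationship above concrete, here is a sketch of a multipath override, assuming the default polling_interval of 5 seconds (queuing time after all paths fail is roughly no_path_retry * polling_interval). The drop-in file name is hypothetical, and any such change should be validated against the RHV multipath configuration guidance rather than edited into the vdsm-managed /etc/multipath.conf.

    # Hypothetical drop-in: /etc/multipath/conf.d/ibm-2145.conf
    # 24 * 5 s = 120 s of queuing, which would match a sanlock renewal
    # timeout of 8 * io_timeout with io_timeout = 15 s.
    overrides {
        no_path_retry 24
    }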
Using larger timeouts significantly increases the delay when activating a host after an unclean shutdown. Setting the sanlock host name can minimize this delay. Adding bug 1508098 as a dependency.
In 4.4.4, vdsm supports configuring the sanlock io timeout. See this document for more info:
https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md

QE did not test non-default timeouts, so we cannot guarantee that the system works with a non-default timeout or that it is compatible with IBM 2145 storage or any other storage that we cannot test. Storage vendors can use the new configuration to test RHV and provide recommendations to users on how to configure the system for specific storage. More changes may be needed in other parts of the system (hosted engine, engine) to work better with longer io timeouts.
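For anyone trying this, the linked io-timeouts.md document is the authoritative reference; the snippet below is only a sketch of the kind of vdsm drop-in it describes, with an arbitrary file name and an illustrative value. The same timeout should be used on all hosts sharing the storage, and it must stay consistent with the multipath no_path_retry setting discussed earlier.

    # Hypothetical drop-in: /etc/vdsm/vdsm.conf.d/99-io-timeout.conf
    [sanlock]
    # Per-I/O timeout in seconds; the lease renewal window is 8 * io_timeout.
    io_timeout = 20

The change takes effect only after vdsm is restarted on the host; see the linked document for the exact procedure and caveats.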
As discussed, this is implemented and documented in the attached KCS. Targeted for the RHV 4.4.9 errata and release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.9]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4704
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days