Currently, due to sanlock timeouts, a planned failover of a NetApp MetroCluster ends up killing VMs with leases and causing outages, because of how long the storage can be unavailable. According to NetApp documentation, the outage can last up to 120 s (RTO, Recovery Time Objective). Our current (default) sanlock timeouts are too strict for this to work, and outages are caused by leases expiring. Our customer asks us to implement and test a data-center-level option that allows the storage attached to that DC to fully support this failover. Planned failovers are required when doing upgrades without downtime, so this request is about enabling seamless upgrades of the storage software without shutting down the data center. I understand this would be a DC-level sanlock io_timeout configuration option, managed by RHV. The customer does not wish to set and verify this setting manually on all hosts, as it is dangerous to get wrong.
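For reference, the rough arithmetic behind the request (a sketch only; it assumes sanlock's lease renewal failure window is about 8 * io_timeout, which is consistent with the default io_timeout of 10 s expiring leases after ~80 s, and the helper below is purely for illustration):

# Rough sanity check of sanlock timing vs. storage failover time.
# Assumption (not taken from this bug): sanlock's lease renewal failure
# window is about 8 * io_timeout, which matches the default io_timeout
# of 10 s expiring leases after ~80 s.

SANLOCK_EXPIRY_FACTOR = 8  # assumed multiplier, see note above

def min_io_timeout(storage_rto_seconds, margin_seconds=10):
    """Smallest io_timeout (seconds) whose lease expiry outlasts the
    storage outage plus a safety margin."""
    needed = storage_rto_seconds + margin_seconds
    # round up to the next whole second
    return -(-needed // SANLOCK_EXPIRY_FACTOR)

if __name__ == "__main__":
    # NetApp MetroCluster planned failover: up to 120 s of unavailable storage.
    print(min_io_timeout(120))  # -> 17, vs. the default io_timeout of 10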
Hi, is there any news about this problem? Have you been in contact with NetApp about it? Greetings Klaas
Hi, is there any news on this? A year later this is still a problem that makes RHV, properly configured with storage leases, a high-risk component in our setup. Just to make this clear again: during a metro cluster failure that leads to a takeover, a sanlock timeout can also be triggered, which reboots every VM that has a storage lease. For other people hitting the same issue, here is our workaround (usable for planned downtime during upgrades, not for emergencies during hardware failure): 1) we remove all storage leases from all HA VMs via the Python SDK, 2) we upgrade the metro cluster, 3) we re-add all storage leases to all HA VMs via the Python SDK (a sketch of steps 1 and 3 follows below). Greetings Klaas
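A minimal sketch of steps 1 and 3 with ovirt-engine-sdk4; the connection details are placeholders, and clearing the lease by updating the VM with a StorageDomainLease that has no storage domain follows common SDK examples but should be verified against your SDK version before use:

# Sketch only: remove / re-add storage leases on HA VMs with ovirt-engine-sdk4.
# Assumptions: connection details are placeholders; clearing a lease by
# updating the VM with a StorageDomainLease whose storage_domain is None
# follows the usual SDK examples, but verify on your version first.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()

def clear_lease(vm):
    """Remove the lease before the planned failover (step 1)."""
    vm_service = vms_service.vm_service(vm.id)
    vm_service.update(types.Vm(
        lease=types.StorageDomainLease(storage_domain=None)))

def set_lease(vm, storage_domain_id):
    """Re-add the lease on the given storage domain afterwards (step 3)."""
    vm_service = vms_service.vm_service(vm.id)
    vm_service.update(types.Vm(
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=storage_domain_id))))

# Example: clear leases on all HA VMs. Remember which storage domain each
# VM used so the lease can be restored after the metro cluster upgrade.
for vm in vms_service.list():
    if vm.high_availability and vm.high_availability.enabled:
        clear_lease(vm)

connection.close()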
Hi,

We have another customer who is facing problems due to the sanlock timeout being fixed at 80 seconds. In this case the customer is using a multi-site stretched IBM 2175 storage in their RHV environment. Such a setup provisions the LUNs to the same RHEV host through storage controllers located in more than one site. When one of the storage controllers in such a multi-site setup fails, the RHEV host reports the failure of a few SCSI sub-paths, as expected. After those sub-paths fail, the SCSI error handling process is triggered and an IO failover is started, but it is often observed that, depending on the storage setup, the failover can take more than 80 seconds to complete. The RHEV host, however, triggers a fence event as soon as the 80-second sanlock timeout expires. In this situation it is very likely that the IO failover would have completed within 85-90 or 100 seconds and the whole host reboot could have been avoided, so the rigid sanlock timeout does nothing to avert the fence event. We are applying all the SCSI error handler and multipath tuning options, but in the current case a fixed 80-second timeout is simply not always enough for the IO failover to complete. The failover would have finished given another 5-15 seconds, yet the customer suffered a complete host reboot purely due to the fixed sanlock timeout. Even with Oracle and Red Hat clusters there are options to fine-tune the Qdisk and Oracle ASM voting disk timeouts to accommodate the time required for IO failover, so we really need an option to tune the sanlock timeout in RHEV.

--------

The bug was reported against NetApp MetroCluster; in this case the customer is using IBM 2175 stretched storage, so let me know whether we can continue in this bug or need to open a new one.
(In reply to Siddhant Rao from comment #18)
> Hi,
>
> We have another customer who is facing problems due to the sanlock timeout
> being fixed at 80 seconds.
> [...]
> The bug was reported against NetApp MetroCluster; in this case the customer
> is using IBM 2175 stretched storage, so let me know whether we can continue
> in this bug or need to open a new one.

I think this depends on the direction this bug takes. From my point of view we have two options:
1) Generalize this bz into a sanlock issue (not sure that is actually the direction the ovirt/RHV people want to go).
2) Create a new bz "RHV support for IBM stretched storage", create a generalized bz as in 1), and then make the NetApp/IBM storage bugs "depend on" the new one.
I have no preference; at the end of the day I just want my metro cluster to be fully supported :) But given the timeline of this issue (more than a year without any progress), I wouldn't get the customer's hopes up.
Recently we had another case in which the failover was sometimes just quick enough to avoid the sanlock timeouts, and it uncovered more problems. It looks like even if RHV allows tuning sanlock to grant more time, this alone won't work, because the NFS locks (qemu's OFD locks on the image file) expire and qemu gets EIO on the first IO after the failover. There is a recover_lost_locks option, but it appears to be unsafe. More info about it here: https://access.redhat.com/solutions/1179643
We're facing similar issues on all 15 of our oVirt clusters, with a multitude of storage engines. While multipathd recovers, the oVirt hosts go crazy during any of these operations:
* Online firmware upgrades for Brocade switches: roughly once in every 20 switches upgraded we take an entire cluster down, even though we have two independent and redundant fabrics. The path failures are usually shorter than 30 seconds, but if one takes 31 seconds, everything goes haywire.
* Graceful takedown of a storage node (with NPIV and ALUA) for any legitimate reason such as firmware updates or hardware/software maintenance. This applies to all IBM SVC-style storages (v5030, v5100, v7200, v9000). We've also had incidents with EMC VMAX250f, multiple 3Par 8000-series arrays and PureStorage FlashArray X50 G2/G3. I'm not sure about our NetApp, but I assume it behaves identically under stress.

We believe the I/O timeout for sanlock should be a multiple of the underlying block device I/O timeout (most vendors go with 30 s, but IBM requires 120 s in their documentation). Since the DMMP block devices inherit the timeouts of the underlying SCSI block devices (configured via udev rules taken straight from the vendor documentation), sanlock should first look at the value declared on the block device (see the sketch below). We also believe we should be able to disable the qemu suspend-on-timeout behaviour: the guest OS has its own methods of dealing with I/O timeouts. VMware doesn't suspend VMs on very long I/Os and leaves the OS to handle the error; I've seen journal aborts and read-only remounts on I/O that took longer than 5 minutes during a denial-of-service situation.

Fencing during a stress period for the FC fabric leads to denial of service. It's the worst possible idea. You can get duplicate VMs even with oVirt 4.4.

There are also some issues with DMMP, which sometimes fails to reinstate paths: it doesn't actually use the RSCN and PLOGI information from the FC fabric and depends almost exclusively on ALUA, TUR and timeouts, which makes it very slow to react to changes in the fabric.

In our experience, FC multipathing in oVirt works as long as you don't need it for redundancy and only use it for round-robin I/O distribution. If you have paths failing (even gracefully, with ALUA notifications), it's undependable.
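To illustrate where that per-path value lives, a quick sketch (the sysfs layout is the standard kernel interface; having sanlock consume it is only the proposal above, not an existing feature):

# Sketch: read the SCSI command timeout of the paths backing a multipath
# device, i.e. the per-device value the comment above suggests sanlock
# should honour. The sysfs paths are the standard kernel layout; feeding
# the result into sanlock is the proposal here, not an existing feature.
import os

def scsi_path_timeouts(dm_name):
    """Return {sd_name: timeout_seconds} for the SCSI devices backing
    a device-mapper multipath device such as 'dm-3'."""
    timeouts = {}
    slaves_dir = f'/sys/block/{dm_name}/slaves'
    for sd in os.listdir(slaves_dir):
        timeout_file = os.path.join(slaves_dir, sd, 'device', 'timeout')
        try:
            with open(timeout_file) as f:
                timeouts[sd] = int(f.read().strip())
        except OSError:
            pass  # not a SCSI device, or the path went away
    return timeouts

if __name__ == '__main__':
    print(scsi_path_timeouts('dm-3'))  # e.g. {'sda': 30, 'sdb': 30}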
I am closing this bug as a duplicate of bz#1845909, to make sure we concentrate all the conversations in one bug and avoid confusion. It has been mentioned multiple times that these two bugs are about the same request.

*** This bug has been marked as a duplicate of bug 1845909 ***
(In reply to Razvan Corneliu C.R. VILT from comment #29)
> We're facing similar issues on all 15 of our oVirt clusters, with a
> multitude of storage engines.
> [...]

Can you please put your comment in bz#1845909? I can do it, but I would prefer the comment to come from the original poster.
Done on the original bug.