1084611 – [RFE] RHEV-M networking went down, 90% of hosts were fenced causing a massive outage

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.3.0
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Martin Perina
QA Contact:	Pavol Brilla
Docs Contact:
URL:
Whiteboard:	infra
Depends On:	1090799 1118879 1119922 1120829 1120858 1188504 1190653
Blocks:	rhev3.5beta 1156165

Reported:	2014-04-04 20:25 UTC by Robert McSwain
Modified:	2019-09-12 07:51 UTC
CC List:	14 users
Fixed In Version:	vt2.2
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-11 18:00:22 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2015:0158	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Virtualization Manager 3.5.0	2015-02-11 22:38:50 UTC

Description Robert McSwain 2014-04-04 20:25:21 UTC

Description of problem:
The switch that connects their RHEV-M had hardware issues and has since been replaced. While it was failing, the NIC on the RHEV-M flapped between up and down, so the RHEV-M believed it had lost all connection to the hosts; the hosts went into Non-Responsive mode, as did the Data Center. As a result, the RHEV-M sent fence commands to a majority of the hosts. This ultimately caused an outage of "~90% of the virtual environment", as we understand it.

The storage is connected via fibre, so the switch shouldn't have caused issues there directly.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.33.el6ev.noarch

How reproducible:
Unknown how frequently

Steps to Reproduce:
1. Cause the switch that connects the RHEV-M to the hosts to flap up/down
2. Make sure power management for the hosts is configured
3. Watch for the hosts to be set to Non-Responsive
4. Observe if the hosts are fenced

Comment 9 Eli Mesika 2014-04-13 09:34:42 UTC

Barak, how can we solve that? I see no way except adding handling for the management network uptime and persisting it to the database.

Comment 10 Barak 2014-04-13 10:35:09 UTC

(In reply to Eli Mesika from comment #9)
> Barak, how can we solve that? I see no way except adding handling for the
> management network uptime and persisting it to the database.

correct

The plan is:
- Have an external daemon that continuously checks the status of the specific network (the one used to communicate with the hypervisors); this will be done by the same daemon that the fence_kdump feature introduces.
- That daemon will update the DB.
- Every time we enter a fencing flow, a preliminary check will verify that this network has been up for the last X seconds (X configurable).
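
To make the plan above concrete, here is a minimal sketch of the preliminary check. It is not ovirt-engine code: the class NetworkStatusLog stands in for the DB record the daemon would update, and the names record, up_for_at_least and may_start_fencing are hypothetical. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from typing import Optional


class NetworkStatusLog:
    """Hypothetical stand-in for the DB record the monitoring daemon updates."""

    def __init__(self) -> None:
        # Time at which the current uninterrupted "up" period began,
        # or None while the network is down or not yet observed.
        self._up_since: Optional[float] = None

    def record(self, is_up: bool, now: float) -> None:
        """Store one observation made by the daemon at time `now` (seconds)."""
        if not is_up:
            self._up_since = None
        elif self._up_since is None:
            self._up_since = now

    def up_for_at_least(self, seconds: float, now: float) -> bool:
        """True if the network has been continuously up for `seconds`."""
        return self._up_since is not None and now - self._up_since >= seconds


def may_start_fencing(log: NetworkStatusLog,
                      required_uptime_seconds: float,
                      now: float) -> bool:
    """Preliminary check run on entering a fencing flow (X is configurable)."""
    return log.up_for_at_least(required_uptime_seconds, now)
```

With X set to 30 seconds, a network that flapped 10 seconds ago fails the check, so the fencing flow would not proceed until the link has stayed up for the full interval.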

Comment 17 Scott Herold 2014-08-19 17:19:41 UTC

We have 4 related BZs that have been created to alleviate this particular issue starting in RHEV 3.5.

BZ 1119922 - This will determine whether a host targeted to be fenced is maintaining its connectivity to its storage domains, indicating that VMs are still running, and the fence request should be disrupted.

BZ 1120829 - This will integrate some logic to determine that if a certain % of hosts appear to be in a non-responsive state that fencing should be discontinued due to a risk of potential fencing storms.

BZ 1120858 - This will provide an option to globally enable/disable fencing for a cluster.  This will be useful for periods of known or scheduled downtime such as network switch maintenance.

BZ 1118879 - This is a configuration screen for a cluster that enables a user to enable or disable the previously described policies.
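
As a rough illustration of how the policies described in these four BZs could combine into a single decision, here is a sketch under stated assumptions. It is not the actual ovirt-engine implementation: the FencingPolicy and Host types, their field names, and the default threshold of 50% are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FencingPolicy:
    """Hypothetical per-cluster settings (the kind of options BZ 1118879 exposes)."""
    fencing_enabled: bool = True               # BZ 1120858: global on/off switch
    skip_if_storage_connected: bool = True     # BZ 1119922
    skip_on_connectivity_issues: bool = True   # BZ 1120829
    non_responsive_threshold_percent: float = 50.0  # assumed default


@dataclass
class Host:
    name: str
    responsive: bool
    storage_connected: bool


def should_fence(target: Host,
                 cluster_hosts: Sequence[Host],
                 policy: FencingPolicy) -> bool:
    """Return True only if no policy tells us to skip fencing `target`."""
    if not policy.fencing_enabled:
        return False
    # A host that still reaches its storage domains is probably running VMs.
    if policy.skip_if_storage_connected and target.storage_connected:
        return False
    # Many non-responsive hosts at once suggests a manager-side network
    # problem, so fencing is skipped to avoid a fencing storm.
    if policy.skip_on_connectivity_issues and cluster_hosts:
        non_responsive = sum(1 for h in cluster_hosts if not h.responsive)
        percent = 100.0 * non_responsive / len(cluster_hosts)
        if percent >= policy.non_responsive_threshold_percent:
            return False
    return True
```

In the scenario from this report, roughly 90% of hosts appeared non-responsive at once, so under this sketch the threshold check alone would have prevented the fence commands.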

Comment 19 Martin Perina 2014-11-26 19:01:54 UTC

(In reply to Scott Herold from comment #17)
> We have 4 related BZs that have been created to alleviate this particular
> issue starting in RHEV 3.5.
> 
> BZ 1119922 - This will determine whether a host targeted to be fenced is
> maintaining its connectivity to its storage domains, indicating that VMs are
> still running, and the fence request should be disrupted.
> 
> BZ 1120829 - This will integrate some logic to determine that if a certain %
> of hosts appear to be in a non-responsive state that fencing should be
> discontinued due to a risk of potential fencing storms.
> 
> BZ 1120858 - This will provide an option to globally enable/disable fencing
> for a cluster.  This will be useful for periods of known or scheduled
> downtime such as network switch maintenance.
> 
> BZ 1118879 - This is a configuration screen for a cluster that enables a
> user to enable or disable the previously described policies.

There's no need to add further description to the documentation, because everything is described in the above-mentioned bugs.

Comment 20 Pavol Brilla 2014-11-27 14:11:36 UTC

Tested on 3.5 vt11; hosts behaved according to the cluster's setup:
Cluster -> Edit -> Fencing Policy -> Skip fencing on cluster connectivity issues

Comment 22 errata-xmlrpc 2015-02-11 18:00:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
