
Bug 1084611

Summary: [RFE] RHEV-M networking went down, 90% of hosts were fenced causing a massive outage
Product: Red Hat Enterprise Virtualization Manager
Reporter: Robert McSwain <rmcswain>
Component: ovirt-engine
Assignee: Martin Perina <mperina>
Status: CLOSED ERRATA
QA Contact: Pavol Brilla <pbrilla>
Severity: high
Docs Contact:
Priority: urgent
Version: 3.3.0
CC: bazulay, howey.vernon, iheim, lpeer, mperina, nyechiel, oourfali, pdwyer, pstehlik, rbalakri, Rhev-m-bugs, sherold, slitmano, yeylon
Target Milestone: ---
Keywords: FutureFeature
Target Release: 3.5.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: vt2.2
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-11 18:00:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1090799, 1118879, 1119922, 1120829, 1120858, 1188504, 1190653
Bug Blocks: 1142923, 1156165

Description Robert McSwain 2014-04-04 20:25:21 UTC
Description of problem:
The switch connecting the customer's RHEV-M to the hosts had hardware issues (it has since been replaced). The fault caused the RHEV-M's NIC to flap between up and down, so the RHEV-M believed it had lost all connectivity to the hosts: the hosts went Non-Responsive, as did the Data Center. Because of this, the RHEV-M sent fence commands to a majority of the hosts, ultimately causing an outage of "~90% of the virtual environment", as we understand it.

The storage is connected via fibre, so the failing switch should not have directly affected storage connectivity.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.33.el6ev.noarch

How reproducible:
Unknown.

Steps to Reproduce:
1. Cause the switch through which the RHEV-M reaches the hosts to flap up/down
2. Make sure power management for the hosts is configured
3. Watch for the hosts to be set to Non-Responsive
4. Observe if the hosts are fenced

Comment 9 Eli Mesika 2014-04-13 09:34:42 UTC
Barak, how can we solve that? I see no way except adding handling for the management network uptime and persisting it to the database.

Comment 10 Barak 2014-04-13 10:35:09 UTC
(In reply to Eli Mesika from comment #9)
> Barak, how can we solve that? I see no way except adding handling for the
> management network uptime and persisting it to the database.

correct

The plan is:
- Have an external daemon that continuously checks the status of the specific network (the one used to communicate with the hypervisors); this will be handled by the same daemon to be introduced by the fence_kdump feature.
- That daemon will update the DB.
- Every time we enter a fencing flow, a preliminary check will verify that this network has been up for the last X seconds (X configurable); a rough sketch follows below.
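
A minimal sketch of what that preliminary check could look like, assuming the daemon writes the management network's last "up since" timestamp into a database table. All names here (FencingPreCheck, mgmt_network_status, up_since) are hypothetical, not actual ovirt-engine code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

// Hypothetical pre-check run at the start of the fencing flow. The external
// daemon is assumed to keep the mgmt_network_status row current: up_since is
// set when the management network comes up and NULLed when it goes down.
public class FencingPreCheck {

    // Returns true only if the management network has been continuously up
    // for at least minUptime, according to the daemon's database records.
    public static boolean managementNetworkStableFor(Connection db, Duration minUptime)
            throws Exception {
        String sql = "SELECT up_since FROM mgmt_network_status WHERE id = 1";
        try (PreparedStatement ps = db.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) {
                return false;                 // no record yet: not proven stable
            }
            Timestamp upSince = rs.getTimestamp("up_since");
            if (upSince == null) {
                return false;                 // network currently marked down
            }
            Duration up = Duration.between(upSince.toInstant(), Instant.now());
            return up.compareTo(minUptime) >= 0;
        }
    }

    // "X seconds" is configurable per the plan above; 300 is an arbitrary example.
    public static boolean mayProceedWithFencing(Connection db) throws Exception {
        return managementNetworkStableFor(db, Duration.ofSeconds(300));
    }
}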

Comment 17 Scott Herold 2014-08-19 17:19:41 UTC
Four related BZs have been created to alleviate this particular issue, starting in RHEV 3.5.

BZ 1119922 - This will determine whether a host targeted to be fenced still maintains connectivity to its storage domains, which indicates that its VMs are still running and that the fence request should be aborted.

BZ 1120829 - This will add logic so that if a certain percentage of hosts appear to be in a non-responsive state, fencing is discontinued due to the risk of a fencing storm.

BZ 1120858 - This will provide an option to globally enable/disable fencing for a cluster.  This will be useful for periods of known or scheduled downtime such as network switch maintenance.

BZ 1118879 - This adds a cluster configuration screen that lets a user enable or disable the previously described policies.
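
Taken together, the four policies roughly amount to the decision sketched below. This is illustrative Java only: the names and the exact threshold semantics are inferred from the BZ summaries above, not taken from the ovirt-engine implementation.

import java.util.List;

// Illustrative combination of the four policies described above; every
// identifier here is invented for this sketch.
public class FencingPolicyCheck {

    // Per-host view as the engine might see it.
    public record HostState(boolean responsive, boolean storageConnected) {}

    // Decide whether a non-responsive host may be fenced.
    //   fencingEnabled        - cluster-wide enable/disable switch (BZ 1120858)
    //   skipOnStorageAlive    - skip if the host still reaches its storage,
    //                           meaning its VMs are likely alive (BZ 1119922)
    //   stormThresholdPercent - abort if this percentage of hosts (or more)
    //                           looks down at once (BZ 1120829)
    public static boolean mayFence(HostState target,
                                   List<HostState> clusterHosts,
                                   boolean fencingEnabled,
                                   boolean skipOnStorageAlive,
                                   int stormThresholdPercent) {
        if (!fencingEnabled) {
            return false;  // fencing globally disabled for the cluster
        }
        if (skipOnStorageAlive && target.storageConnected()) {
            return false;  // host still writes to storage: VMs likely running
        }
        if (clusterHosts.isEmpty()) {
            return false;  // defensive: nothing to reason about
        }
        long nonResponsive = clusterHosts.stream()
                .filter(h -> !h.responsive())
                .count();
        long percentDown = 100 * nonResponsive / clusterHosts.size();
        if (percentDown >= stormThresholdPercent) {
            return false;  // too many hosts "down" at once: suspect the network
        }
        return true;
    }
}

Note that each policy can only veto fencing; none of them can force a fence on its own, which is what limits the blast radius of a management-network outage like the one reported here.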

Comment 19 Martin Perina 2014-11-26 19:01:54 UTC
(In reply to Scott Herold from comment #17)
> Four related BZs have been created to alleviate this particular issue,
> starting in RHEV 3.5.
> 
> BZ 1119922 - This will determine whether a host targeted to be fenced still
> maintains connectivity to its storage domains, which indicates that its VMs
> are still running and that the fence request should be aborted.
> 
> BZ 1120829 - This will add logic so that if a certain percentage of hosts
> appear to be in a non-responsive state, fencing is discontinued due to the
> risk of a fencing storm.
> 
> BZ 1120858 - This will provide an option to globally enable/disable fencing
> for a cluster.  This will be useful for periods of known or scheduled
> downtime such as network switch maintenance.
> 
> BZ 1118879 - This adds a cluster configuration screen that lets a user
> enable or disable the previously described policies.

There's no need to add an additional description to the documentation, because everything is described in the above-mentioned bugs.

Comment 20 Pavol Brilla 2014-11-27 14:11:36 UTC
Tested on 3.5 vt11; hosts behaved according to the cluster setting:
Cluster -> Edit -> Fencing Policy -> Skip fencing on cluster connectivity issues
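
For reference, the same policy ought to be settable outside the UI as well. Below is a hedged sketch of doing it through the REST API; the fencing_policy / skip_if_connectivity_broken element names and the /api path are best-effort assumptions for the 3.5-era API, so verify them against the engine's RSDL before relying on this.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Assumed sketch: PUT an updated fencing policy onto a cluster. Replace the
// host name, cluster UUID, and credentials with real values; also assumes the
// engine's TLS certificate is trusted by this JVM.
public class EnableSkipFencing {
    public static void main(String[] args) throws Exception {
        String clusterUrl = "https://rhevm.example.com/api/clusters/CLUSTER_UUID";
        String body =
            "<cluster>"
          + "<fencing_policy>"
          + "<skip_if_connectivity_broken>"
          + "<enabled>true</enabled>"
          + "<threshold>50</threshold>"  // percentage threshold; check the API docs for exact semantics
          + "</skip_if_connectivity_broken>"
          + "</fencing_policy>"
          + "</cluster>";

        HttpURLConnection conn = (HttpURLConnection) new URL(clusterUrl).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/xml");
        String auth = Base64.getEncoder().encodeToString(
                "admin@internal:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}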

Comment 22 errata-xmlrpc 2015-02-11 18:00:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html