Description of problem:
The switch that connects their RHEV-M had hardware issues. The switch has since been replaced. However, while it was failing, the NIC on the RHEV-M flapped between up and down, and the RHEV-M believed it had lost all connection to the hosts. The hosts went into a Non-Responsive state, as did the Data Center. As a result, the RHEV-M sent fence commands to a majority of the hosts. This ultimately caused an outage of "~90% of the virtual environment", as we understand it. The storage is connected via fibre, so the switch should not have caused issues there directly.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.33.el6ev.noarch

How reproducible:
Unknown how frequently

Steps to Reproduce:
1. Cause the switch that the RHEV-M uses to reach the hosts to flap up/down
2. Make sure power management for the hosts is configured
3. Watch for the hosts to be set to Non-Responsive
4. Observe if the hosts are fenced
Barak, how can we solve that? I see no way except adding handling for the management network uptime and persisting it to the database.
(In reply to Eli Mesika from comment #9)
> Barak, how can we solve that, I see no way except adding handling for the
> management network uptime and persisting it to the database

Correct.

The plan is:
- Have an external daemon that always checks the status of the specific network (the one used to communicate with the hypervisors). This will be done by the same daemon that is to be introduced by the fence_kdump feature.
- That daemon will update the DB.
- Every time we enter a fencing flow, a preliminary check will verify that this network has been up for the last X seconds (X configurable).
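The preliminary check in the plan above could look roughly like the following. This is only a minimal sketch, not engine code: `NetworkStatusLog` is a hypothetical in-memory stand-in for the DB table the daemon would update, and `may_start_fencing` and `window_seconds` (the "X" above) are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NetworkStatusLog:
    """Hypothetical stand-in for the DB rows written by the monitoring daemon."""
    # (timestamp, is_up) samples, assumed to be appended in chronological order
    samples: List[Tuple[float, bool]] = field(default_factory=list)

    def record(self, timestamp: float, is_up: bool) -> None:
        self.samples.append((timestamp, is_up))

    def up_for_last(self, now: float, window_seconds: float) -> bool:
        """Return True only if the network is known to have been up for the whole window."""
        window_start = now - window_seconds
        # Without a sample at or before the window start, the state at the
        # beginning of the window is unknown, so be conservative.
        if not self.samples or self.samples[0][0] > window_start:
            return False
        # The last sample at or before window_start gives the state at the window start.
        start_index = max(
            i for i, (ts, _) in enumerate(self.samples) if ts <= window_start
        )
        # Every sample from that point on must report the network as up.
        return all(is_up for _, is_up in self.samples[start_index:])


def may_start_fencing(log: NetworkStatusLog, now: float, window_seconds: float) -> bool:
    """Preliminary check before entering a fencing flow (assumed policy)."""
    return log.up_for_last(now, window_seconds)


if __name__ == "__main__":
    log = NetworkStatusLog()
    for t in (0, 10, 20, 30):
        log.record(t, True)
    log.record(40, False)  # the management network flaps down
    log.record(50, True)   # and comes back up
    print(may_start_fencing(log, now=55, window_seconds=30))
```

A real daemon would presumably also need to treat stale data (no recent sample) as "unknown" rather than "up"; that case is left out of the sketch.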
We have 4 related BZs that have been created to alleviate this particular issue starting in RHEV 3.5.

BZ 1119922 - This will determine whether a host targeted to be fenced is maintaining its connectivity to its storage domains, indicating that VMs are still running, and the fence request should be disrupted.

BZ 1120829 - This will integrate some logic to determine that if a certain % of hosts appear to be in a non-responsive state that fencing should be discontinued due to a risk of potential fencing storms.

BZ 1120858 - This will provide an option to globally enable/disable fencing for a cluster. This will be useful for periods of known or scheduled downtime such as network switch maintenance.

BZ 1118879 - This is a configuration screen for a cluster that enables a user to enable or disable the previously described policies.
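To make the interaction of these policies concrete, here is a minimal sketch of how such a decision could be combined. It is an illustration only, not the engine's implementation: the `FencingPolicy` fields, the `should_fence` function, and the default threshold of 50% are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FencingPolicy:
    """Hypothetical per-cluster policy; field names are illustrative only."""
    fencing_enabled: bool = True                 # global on/off switch (BZ 1120858)
    skip_if_storage_alive: bool = True           # storage connectivity check (BZ 1119922)
    skip_on_connectivity_issues: bool = True     # fencing storm guard (BZ 1120829)
    non_responsive_threshold_percent: int = 50   # assumed default, not from the source


def should_fence(
    policy: FencingPolicy,
    host_storage_alive: bool,
    non_responsive_hosts: int,
    total_hosts: int,
) -> Tuple[bool, str]:
    """Return (decision, reason) for a single host that is a fencing candidate."""
    if not policy.fencing_enabled:
        return False, "fencing is disabled for the cluster"
    if policy.skip_if_storage_alive and host_storage_alive:
        return False, "host still maintains its storage connectivity"
    if policy.skip_on_connectivity_issues and total_hosts > 0:
        percent = 100.0 * non_responsive_hosts / total_hosts
        if percent >= policy.non_responsive_threshold_percent:
            return False, "too many hosts are non-responsive; possible fencing storm"
    return True, "no policy prevents fencing"


if __name__ == "__main__":
    # Scenario resembling this report: most hosts look non-responsive at once.
    print(should_fence(FencingPolicy(), host_storage_alive=False,
                       non_responsive_hosts=9, total_hosts=10))
```

Checking the global switch first and the per-host storage check second keeps the cheapest and most specific reasons ahead of the cluster-wide percentage check; the ordering is a design choice of this sketch.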
(In reply to Scott Herold from comment #17)
> We have 4 related BZs that have been created to alleviate this particular
> issue starting in RHEV 3.5.
>
> BZ 1119922 - This will determine whether a host targeted to be fenced is
> maintaining its connectivity to its storage domains, indicating that VMs are
> still running, and the fence request should be disrupted.
>
> BZ 1120829 - This will integrate some logic to determine that if a certain %
> of hosts appear to be in a non-responsive state that fencing should be
> discontinued due to a risk of potential fencing storms.
>
> BZ 1120858 - This will provide an option to globally enable/disable fencing
> for a cluster. This will be useful for periods of known or scheduled
> downtime such as network switch maintenance.
>
> BZ 1118879 - This is a configuration screen for a cluster that enables a
> user to enable or disable the previously described policies.

There is no need to add further description to the documentation, because everything is described in the above-mentioned bugs.
Tested on 3.5 vt11. Hosts behaved according to the cluster setup: Cluster -> Edit -> Fencing Policy -> Skip fencing on cluster connectivity issues
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html