Bug 1084611
| Summary: | [RFE] RHEV-M networking went down, 90% of hosts were fenced causing a massive outage | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Robert McSwain <rmcswain> |
| Component: | ovirt-engine | Assignee: | Martin Perina <mperina> |
| Status: | CLOSED ERRATA | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.3.0 | CC: | bazulay, howey.vernon, iheim, lpeer, mperina, nyechiel, oourfali, pdwyer, pstehlik, rbalakri, Rhev-m-bugs, sherold, slitmano, yeylon |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | 3.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | infra | | |
| Fixed In Version: | vt2.2 | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-11 18:00:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1090799, 1118879, 1119922, 1120829, 1120858, 1188504, 1190653 | | |
| Bug Blocks: | 1142923, 1156165 | | |
Description
Robert McSwain
2014-04-04 20:25:21 UTC
Barak, how can we solve that? I see no way except adding handling for the management network uptime and persisting it to the database.

(In reply to Eli Mesika from comment #9)
> Barak, how can we solve that? I see no way except adding handling for the
> management network uptime and persisting it to the database.

Correct. The plan is:
- Have an external daemon that always checks the status of the specific network (the one used to communicate with the hypervisor). This will be done by the same daemon that is to be introduced by the fence_kdump feature.
- That daemon will update the DB.
- Every time we enter a fencing flow, a preliminary check will be performed to verify that the network has been up for the last X seconds (X configurable).

We have 4 related BZs that have been created to alleviate this particular issue starting in RHEV 3.5.

BZ 1119922 - This will determine whether a host targeted to be fenced is maintaining its connectivity to its storage domains, indicating that VMs are still running and that the fence request should be disrupted.

BZ 1120829 - This will integrate some logic to determine that if a certain % of hosts appear to be in a non-responsive state, fencing should be discontinued due to the risk of potential fencing storms.

BZ 1120858 - This will provide an option to globally enable/disable fencing for a cluster. This will be useful for periods of known or scheduled downtime such as network switch maintenance.

BZ 1118879 - This is a configuration screen for a cluster that enables a user to enable or disable the previously described policies.

(In reply to Scott Herold from comment #17)
> We have 4 related BZs that have been created to alleviate this particular
> issue starting in RHEV 3.5.
>
> BZ 1119922 - This will determine whether a host targeted to be fenced is
> maintaining its connectivity to its storage domains, indicating that VMs are
> still running and that the fence request should be disrupted.
>
> BZ 1120829 - This will integrate some logic to determine that if a certain %
> of hosts appear to be in a non-responsive state, fencing should be
> discontinued due to the risk of potential fencing storms.
>
> BZ 1120858 - This will provide an option to globally enable/disable fencing
> for a cluster. This will be useful for periods of known or scheduled
> downtime such as network switch maintenance.
>
> BZ 1118879 - This is a configuration screen for a cluster that enables a
> user to enable or disable the previously described policies.

There's no need to add further description to the documentation, because everything is described in the above-mentioned bugs.

Tested on 3.5 vt11; hosts behaved according to the cluster setup:
Cluster -> Edit -> Fencing Policy -> Skip fencing on cluster connectivity issues

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
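
For readers who want to see how the checks discussed in this thread could combine into a single decision, here is a minimal Python sketch. It is not the ovirt-engine implementation. Every name in it (`FencingPolicy`, `should_fence`, the field names, the 50% threshold and the 60-second uptime window) is a hypothetical chosen for illustration; only the ideas (cluster-wide fencing switch, management network uptime check, storage connectivity check, percentage of non-responsive hosts) come from the comments above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FencingPolicy:
    """Hypothetical per-cluster policy; field names are illustrative only."""
    fencing_enabled: bool = True              # global switch (idea from BZ 1120858)
    skip_if_storage_connected: bool = True    # idea from BZ 1119922
    skip_if_connectivity_broken: bool = True  # idea from BZ 1120829
    non_responsive_threshold_pct: float = 50.0  # assumed default, not from the bug
    min_network_uptime_s: float = 60.0          # the "X seconds" from the plan, assumed value


def should_fence(policy: FencingPolicy,
                 host_has_storage_connectivity: bool,
                 non_responsive_hosts: int,
                 total_hosts: int,
                 mgmt_network_uptime_s: float) -> Tuple[bool, str]:
    """Return (decision, reason) for one host that looks non-responsive."""
    if not policy.fencing_enabled:
        return False, "fencing disabled for cluster"
    # Preliminary check: the management network must have been up long enough,
    # otherwise the engine itself may be the one that lost connectivity.
    if mgmt_network_uptime_s < policy.min_network_uptime_s:
        return False, "management network not up long enough"
    # A host that still reaches its storage domains probably still runs its VMs.
    if policy.skip_if_storage_connected and host_has_storage_connectivity:
        return False, "host still connected to storage"
    # Too many non-responsive hosts at once suggests a network problem,
    # so skip fencing to avoid a fencing storm.
    if policy.skip_if_connectivity_broken and total_hosts > 0:
        pct = 100.0 * non_responsive_hosts / total_hosts
        if pct >= policy.non_responsive_threshold_pct:
            return False, "too many non-responsive hosts"
    return True, "all checks passed"


if __name__ == "__main__":
    # Scenario resembling the reported outage: 9 of 10 hosts look non-responsive.
    print(should_fence(FencingPolicy(), False, 9, 10, 600.0))
```

In a scenario like the one in the summary, where most hosts appear non-responsive at once, the percentage check in this sketch declines to fence. The order of the checks is a design choice of the sketch, not something stated in the bug.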