Bug 1097923
Summary: | Network outage causes RHEV-M to fence all hypervisors in environment | | |
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jake Hunsaker <jhunsaker> |
Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 3.3.0 | CC: | bazulay, iheim, jhunsaker, lpeer, nyechiel, oourfali, rbalakri, Rhev-m-bugs, sherold, yeylon |
Target Milestone: | --- | | |
Target Release: | 3.6.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | infra | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2015-02-26 14:37:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1119932 | | |
Bug Blocks: | | | |
The work that has gone into 3.5 addresses this item. This has been discussed directly with the customer and validated. Closing BZ as CURRENTRELEASE.
Created attachment 895642 [details]
engine.log

Description of problem:

A customer had a failing switch that needed to be taken down. During this time, RHEV-M lost all routes to the hypervisors and IPMI devices (although the hypervisors still had full network connectivity, with the exception of communication to the manager). When connectivity was restored, RHEV-M fenced every hypervisor in the environment.

In the attached engine.log it appears that the engine used each hypervisor to fence every other hypervisor, and it continued generating these fence events even after it detected a failure to fence the hypervisors (because the IPMI devices were also inaccessible).

Version-Release number of selected component (if applicable):

rhevm-3.3.2
vdsm-4.13.2-0.13

Additional info:

First, only the manager lost network connectivity. The hypervisors and IPMI devices are on separate subnets from the manager and did not lose connectivity with each other or with the rest of the infrastructure.

Also, I am not sure whether this is a race condition or something else where the engine continually queues fence commands that then get fired off once networking is restored. Perhaps a final "is the host REALLY down" check before retrying the fence operation would help here?
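To illustrate the suggestion above, here is a minimal sketch (not actual ovirt-engine code; the function and parameter names are hypothetical) of a guard that re-probes host liveness immediately before each fence attempt, so that stale or queued fence commands are dropped once connectivity returns:

```python
import time


def should_fence(host, probe, max_attempts=3, delay=0.0):
    """Hypothetical 'is the host REALLY down' guard.

    `probe` is any callable taking the host and returning True when the
    host responds (e.g. a fresh management-network check). Fencing is
    approved only if every fresh probe fails; a single successful probe
    cancels the fence, even if earlier checks had marked the host down.
    """
    for _ in range(max_attempts):
        if probe(host):
            # The host answered: connectivity was restored, so any
            # previously queued fence command should be discarded.
            return False
        time.sleep(delay)  # brief pause before re-probing
    # All fresh probes failed; the host really appears to be down.
    return True
```

In the scenario from this report, a guard like this would have returned False for every hypervisor once the switch came back, since only the manager's routes (not the hosts themselves) had been lost.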