Bug 1097923

Summary: Network outage causes RHEV-M to fence all hypervisors in environment
Product: Red Hat Enterprise Virtualization Manager Reporter: Jake Hunsaker <jhunsaker>
Component: ovirt-engine Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 3.3.0 CC: bazulay, iheim, jhunsaker, lpeer, nyechiel, oourfali, rbalakri, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---   
Target Release: 3.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-26 14:37:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1119932    
Bug Blocks:    

Description Jake Hunsaker 2014-05-14 21:19:56 UTC
Created attachment 895642
engine.log

Description of problem:

The customer had a failing switch that needed to be taken down. During this time, RHEV-M lost all routes to the hypervisors and IPMI devices (though the hypervisors still had full network connectivity, with the exception of communication with the manager). When connectivity was restored, RHEV-M fenced every hypervisor in the environment.

In the attached engine.log, it seems that the engine used each hypervisor to fence every other hypervisor, and that it continued generating these fence events even after it detected a failure to fence the hypervisors (due to the IPMI devices also being inaccessible).

Version-Release number of selected component (if applicable):

rhevm-3.3.2
vdsm-4.13.2-0.13


Additional info:

First, it was only the manager that lost network connectivity. The hypervisors and IPMI devices are on separate subnets from the manager and did not lose connectivity with each other or with the rest of the infrastructure.

Also, I am not sure whether this is a race condition or something else, where the engine continually queues fence commands that get fired off once networking is restored.
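
To illustrate the queuing concern, here is a minimal Python sketch of one possible mitigation: stamping each queued fence request with its creation time and discarding requests that have gone stale by the time they are dequeued. This is not taken from ovirt-engine; `FenceRequest`, `drain_fresh`, and the 60-second default are hypothetical names and values chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FenceRequest:
    """Hypothetical queued fence request (not an ovirt-engine type)."""
    host: str
    created_at: float  # seconds on whatever clock the caller uses


def drain_fresh(queue, now, max_age_seconds=60.0):
    """Empty the queue and return only requests young enough to act on.

    Requests older than max_age_seconds are dropped, so a backlog built
    up during an outage is not replayed once networking returns.
    """
    fresh = []
    while queue:
        request = queue.popleft()
        if now - request.created_at <= max_age_seconds:
            fresh.append(request)
    return fresh
```

With such an expiry, fence commands queued while the manager was cut off would be dropped rather than fired off on reconnect. Whether the engine actually queues commands this way is not confirmed by the log.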

Perhaps a final "is the host REALLY down?" check before each retry of fencing the hypervisor would help here?
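
A rough Python sketch of what such a final check could look like follows. It is only an illustration of the idea, not the engine's actual logic: `should_fence`, the `is_reachable` callback, and the 50% threshold are all assumptions. The second test in the sketch reflects the scenario in this report: if most hosts appear down at the same moment, the manager's own connectivity is the more likely problem, so fencing is skipped.

```python
def should_fence(host, is_reachable, all_hosts, max_unreachable_fraction=0.5):
    """Decide, immediately before issuing a fence command, whether to proceed.

    is_reachable is a caller-supplied function (hypothetical) that returns
    True when the manager can currently talk to the given host.
    """
    # Final check: the host may have come back since the request was made.
    if is_reachable(host):
        return False

    if not all_hosts:
        return False

    # If too many hosts look down at once, suspect the manager's own
    # network rather than the hosts, and do not fence.
    unreachable = [h for h in all_hosts if not is_reachable(h)]
    if len(unreachable) / len(all_hosts) > max_unreachable_fraction:
        return False

    return True
```

The threshold is a judgment call: set too low, a genuinely failed host in a small cluster might never be fenced; set too high, a manager-side outage could still trigger mass fencing.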

Comment 8 Scott Herold 2015-02-26 14:37:10 UTC
The work that has gone into 3.5 addresses this item.  This has been discussed directly with the customer and validated.  Closing BZ as CURRENTRELEASE.