Bug 1097923 - Network outage causes RHEV-M to fence all hypervisors in environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.6.0
Assignee: Eli Mesika
QA Contact:
URL:
Whiteboard: infra
Depends On: 1119932
Blocks:
Reported: 2014-05-14 21:19 UTC by Jake Hunsaker
Modified: 2019-04-28 09:26 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-26 14:37:10 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Description Jake Hunsaker 2014-05-14 21:19:56 UTC
Created attachment 895642 [details]
engine.log

Description of problem:

Customer had a failing switch that needed to be taken down. During this time RHEV-M lost all routes to the hypervisors and IPMI devices (though the hypervisors still had full network connectivity, with the exception of communication to the manager). When connectivity was restored, RHEV-M fenced every hypervisor in the environment.

In the attached engine.log it seems that the engine used each hypervisor to fence every other hypervisor - and continued generating these fence events even after it detected a failure to fence the hypervisors (due to the IPMI devices also being inaccessible).

Version-Release number of selected component (if applicable):

rhevm-3.3.2
vdsm-4.13.2-0.13


Additional info:

First, it was only the manager that lost network connectivity. The hypervisors and IPMI devices are on different subnets from the manager and did not lose connectivity with each other or the rest of the infrastructure.

Also, I am not sure if this is a race condition, or something else where the engine continually queues fence commands that then get fired off once networking is restored.

Perhaps a final "is the host REALLY down?" check before each fencing retry would help here?
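
To make the suggestion concrete, here is a rough sketch of what such a final check could look like. This is not how ovirt-engine implements fencing; the names `should_fence`, `FenceDecision`, `engine_can_reach` and `peer_checks` are all hypothetical and only illustrate the idea of asking for a second opinion before a (re)try.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FenceDecision:
    fence: bool
    reason: str


def should_fence(
    host: str,
    engine_can_reach: Callable[[str], bool],
    peer_checks: Iterable[Callable[[str], bool]],
) -> FenceDecision:
    """Hypothetical last-moment check run before every fence attempt or retry."""
    # If the engine can talk to the host again, the outage is over and any
    # fence request queued during the outage is stale.
    if engine_can_reach(host):
        return FenceDecision(False, "engine can reach host again")

    # Ask other hosts (assumed to be on an unaffected network) for a second opinion.
    results = [check(host) for check in peer_checks]
    if not results:
        # Nobody can confirm the host state, so do not power-cycle it blindly.
        return FenceDecision(False, "no peer available to confirm host state")
    if any(results):
        # At least one peer still sees the host: the problem is likely on the engine side.
        return FenceDecision(False, "a peer can still reach host; engine-side outage suspected")

    return FenceDecision(True, "host unreachable from engine and all peers")
```

In the outage described above, the peers could still see each other, so a check along these lines would have returned `fence=False` for every hypervisor instead of letting the queued fence commands go through.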

Comment 8 Scott Herold 2015-02-26 14:37:10 UTC
The work that has gone into 3.5 addresses this item.  This has been discussed directly with the customer and validated.  Closing BZ as CURRENTRELEASE.
