1097923 – Network outage causes RHEV-M to fence all hypervisors in environment

Summary: Network outage causes RHEV-M to fence all hypervisors in environment

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.3.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.6.0
Assignee:	Eli Mesika
QA Contact:
Docs Contact:
URL:
Whiteboard:	infra
Depends On:	1119932
Blocks:

Reported:	2014-05-14 21:19 UTC by Jake Hunsaker
Modified:	2019-04-28 09:26 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-26 14:37:10 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Description Jake Hunsaker 2014-05-14 21:19:56 UTC

Created attachment 895642
engine.log

Description of problem:

The customer had a failing switch that needed to be taken down. During this time RHEV-M lost all routes to the hypervisors and IPMI devices (though the hypervisors still had full network connectivity, with the exception of communication to the manager). When connectivity was restored, RHEV-M fenced every hypervisor in the environment.

In the attached engine.log it seems that the engine used each hypervisor to fence every other hypervisor, and that it continued generating these fence events even after it detected a failure to fence the hypervisors (due to the IPMI devices also being inaccessible from the manager).

Version-Release number of selected component (if applicable):

rhevm-3.3.2
vdsm-4.13.2-0.13


Additional info:

First, it was only the manager that lost network connectivity. The hypervisors and IPMI devices are on different subnets from the manager and did not lose connectivity with each other or with the rest of the infrastructure.

Also, I am not sure whether this is a race condition or something else, such as the engine continually queuing fence commands that then get fired off once networking is restored.

Perhaps a final "is the host REALLY down" check before each retry of fencing a hypervisor would help here?
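
To illustrate the kind of check suggested above, here is a minimal sketch in Python. It is not the engine's actual logic, and it is not the fix that shipped in 3.5. The names fence_with_recheck, is_host_responsive and fence are hypothetical and stand in for whatever liveness probe and fence call the engine really uses. The only idea shown is that the host's state is re-checked immediately before every fence attempt, so a retry queued during an outage is dropped once the host answers again.

```python
from typing import Callable


def fence_with_recheck(
    host: str,
    is_host_responsive: Callable[[str], bool],  # hypothetical liveness probe
    fence: Callable[[str], bool],               # hypothetical fence call, True on success
    max_attempts: int = 3,
) -> str:
    """Try to fence `host`, re-checking its liveness before every attempt.

    Returns "skipped" if the host answered before a fence attempt,
    "fenced" if a fence attempt succeeded, and "failed" otherwise.
    """
    for _ in range(max_attempts):
        # Re-check right before acting: connectivity may have returned
        # since this fence request was first scheduled.
        if is_host_responsive(host):
            return "skipped"
        if fence(host):
            return "fenced"
    return "failed"
```

With a check like this, a fence request that failed while the IPMI devices were unreachable would be abandoned on its next retry if the hypervisor had become reachable in the meantime, rather than being executed after the network came back.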

Comment 8 Scott Herold 2015-02-26 14:37:10 UTC

The work that has gone into 3.5 addresses this item.  This has been discussed directly with the customer and validated.  Closing BZ as CURRENTRELEASE.
