Bug 1072282

Summary: VM split brain caused by network outage
Product: Red Hat Enterprise Virtualization Manager Reporter: Roman Hodain <rhodain>
Component: ovirt-engine Assignee: Roy Golan <rgolan>
Status: CLOSED ERRATA QA Contact: Artyom <alukiano>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.3.0 CC: acathrow, alukiano, iheim, jentrena, lpeer, mavital, michal.skrivanek, mkalinin, ofrenkel, pbandark, rgolan, Rhev-m-bugs, rhodain, sherold, tpoitras, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.4.0   
Hardware: All   
OS: Linux   
Whiteboard: virt
Fixed In Version: av3 Doc Type: Bug Fix
Doc Text:
Previously, the GetVmStats request failed. This caused high-availability virtual machines to be incorrectly listed as "down". They were then rescheduled on another host, causing two instances of the same virtual machine to be running at once. Now, the system ignores virtual machine updates until the next monitoring cycle, allowing the system to retrieve the virtual machine stats.
Story Points: ---
Clone Of:
: 1074578 (view as bug list) Environment:
Last Closed: 2014-06-09 15:05:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1074578, 1078909, 1142926    

Description Roman Hodain 2014-03-04 09:53:50 UTC
Description of problem:
	After a network outage, some VMs are started on more than one hypervisor
because they are incorrectly considered to be in the Down state.

Version-Release number of selected component (if applicable):
	rhevm-3.3.0-0.46.el6ev.noarch

How reproducible:
	Not clear yet

Steps to Reproduce:
	Not clear yet, but the scenario could be:
		1. Install more than one hypervisor
		2. Prevent RHEV-M from connecting to those VMs (power management
		   is also defunct due to the network outage)
		3. Leave one hypervisor reachable by RHEV-M

Actual results:
	Some VMs are considered as down and are started on another hypervisor

Expected results:
	VMs are marked as being in the Unknown state

Comment 7 Roy Golan 2014-03-06 12:32:00 UTC
Roman - can you get the logs from the host bl460-282 with timestamps after the ones in the attached log, i.e. 2014-02-15 15:01:01 and onward?

I want to see what this host reported to backend as its internal vm list
and this might explain this.

If the host reported that VM svcz0plgfa50 is not currently in its list,
then it's only natural to see this:

2014-02-15 15:09:59,229 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-61) [5989b7e9] Highly Available VM went down. Attempting to restart. VM Name: svcz0plgfa50-bwd00, VM Id:1f91022e-ef39-4120-877a-05d15432dfac
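
To find every VM the engine tried to restart during the outage window, the message quoted above can be searched for in engine.log. The following is a minimal sketch, not part of the product: the message wording is taken from the log line in this comment, while the regular expression and the function name find_ha_restarts are illustrative assumptions.

```python
import re

# Regex written for the single log line quoted in this bug; the exact layout of
# other engine.log lines is an assumption.
HA_RESTART = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+INFO\s+"
    r"\[org\.ovirt\.engine\.core\.bll\.VdsEventListener\].*"
    r"Highly Available VM went down\. Attempting to restart\. "
    r"VM Name: (?P<name>\S+), VM Id:\s*(?P<vm_id>[0-9a-f-]{36})"
)


def find_ha_restarts(lines):
    """Yield (timestamp, vm_name, vm_id) for each HA restart attempt found."""
    for line in lines:
        match = HA_RESTART.search(line)
        if match:
            yield match.group("timestamp"), match.group("name"), match.group("vm_id")


# The log line from this comment, with the VM Id rejoined onto one line.
sample = (
    "2014-02-15 15:09:59,229 INFO  [org.ovirt.engine.core.bll.VdsEventListener] "
    "(DefaultQuartzScheduler_Worker-61) [5989b7e9] Highly Available VM went down. "
    "Attempting to restart. VM Name: svcz0plgfa50-bwd00, "
    "VM Id:1f91022e-ef39-4120-877a-05d15432dfac"
)

for timestamp, name, vm_id in find_ha_restarts([sample]):
    print(timestamp, name, vm_id)
```

Running the same function over the lines of the full engine log, and comparing the timestamps with the outage window, would show which HA VMs may have ended up with a second instance.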

Comment 9 Michal Skrivanek 2014-03-09 12:30:12 UTC
Roy, I'm all for option 1

Comment 11 Artyom 2014-03-18 17:19:30 UTC
Verified on av3
Until the host that the VMs run on changes its status to Up, the VMs stay in Unknown status
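
The verified behavior can be illustrated with a small model of one monitoring cycle. This is only a sketch under stated assumptions, not the ovirt-engine implementation: the names Vm, VmStatus, StatsFetchError and monitoring_cycle are hypothetical, and marking VMs as Unknown on a failed stats request is inferred from the verification note above.

```python
from dataclasses import dataclass
from enum import Enum


class VmStatus(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"


class StatsFetchError(Exception):
    """Raised by the (hypothetical) fetch callback when the host cannot be queried."""


@dataclass
class Vm:
    name: str
    highly_available: bool
    status: VmStatus = VmStatus.UP


def monitoring_cycle(vms, fetch_running_vm_names):
    """Run one cycle and return the names of HA VMs that should be restarted."""
    try:
        running = fetch_running_vm_names()
    except StatsFetchError:
        # No trustworthy data from the host: skip the VM update for this cycle.
        # Missing data is NOT interpreted as "VM is down", so nothing is restarted.
        for vm in vms:
            vm.status = VmStatus.UNKNOWN
        return []

    to_restart = []
    for vm in vms:
        if vm.name in running:
            vm.status = VmStatus.UP
        else:
            # The host answered and really does not list the VM.
            vm.status = VmStatus.DOWN
            if vm.highly_available:
                to_restart.append(vm.name)
    return to_restart


def unreachable_host():
    raise StatsFetchError("host not reachable")


vm = Vm("svcz0plgfa50-bwd00", highly_available=True)
print(monitoring_cycle([vm], unreachable_host), vm.status)
print(monitoring_cycle([vm], lambda: {"svcz0plgfa50-bwd00"}), vm.status)
```

The key design point in this sketch is that a restart decision is made only from a successful host report; a failed request defers the decision to the next cycle, which avoids starting a second instance of a VM that is still running.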

Comment 18 Michal Skrivanek 2014-04-24 14:39:27 UTC
*** Bug 1090536 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2014-06-09 15:05:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html