Bug 1090536
Summary: | VM started twice by HA leading to data corruption | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Julio Entrena Perez <jentrena>
Component: | ovirt-engine | Assignee: | Michal Skrivanek <michal.skrivanek>
Status: | CLOSED DUPLICATE | QA Contact: |
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 3.3.0 | CC: | acathrow, fromani, iheim, jentrena, lpeer, michal.skrivanek, rgolan, Rhev-m-bugs, sputhenp, yeylon
Target Milestone: | --- | |
Target Release: | 3.4.0 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | virt | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-04-24 14:39:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Julio Entrena Perez
2014-04-23 14:42:27 UTC
There is a hole in engine.log between 2014-04-18 03:24:03 and 2014-04-22 10:32:29 where the engine is restarting. This is probably why (server.log):

2014-04-18 05:00:19,681 ERROR [stderr] (Timer-1) java.io.IOException: No space left on device

I don't know how long this "no space" situation continued. Julio, can you shed light on the disk condition between the 18th and the 22nd? Looking at the VDSM log I'm not sure the engine is sending the list command, so this is also weird. Julio, is there a case where this happens again that is not in the scope of the engine running out of disk space?

It's the same as bug 1072282:

2014-04-10 05:09:43,299 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method

Suggesting this as a duplicate of the bug mentioned above. We can only assume the same happened during the lost period of the 18th-22nd.

(In reply to Roy Golan from comment #5)
> It's the same as bug 1072282:
>
> 2014-04-10 05:09:43,299 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand]
> (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method
>
> Suggesting this as a duplicate of the bug mentioned above. We can only
> assume the same happened during the lost period of the 18th-22nd.

Are you sure? That event happened _after_ the second instance of the VM was started at 05:09:37 (six seconds earlier).

(In reply to Julio Entrena Perez from comment #6)
> Are you sure? That event happened _after_ the second instance of the VM was
> started at 05:09:37 (six seconds earlier).

egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1

It will give you a sort of breakdown.

(In reply to Roy Golan from comment #7)
> egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1
>
> It will give you a sort of breakdown.

Indeed, thanks Roy:

2014-04-10 05:07:18,412 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-52) [43abb01a] Command GetVmStatsVDS execution failed. Exception: VDSNetworkException: java.net.SocketTimeoutException: connect timed out
2014-04-10 05:07:18,611 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-52) [43abb01a] Highly Available VM went down. Attempting to restart. VM Name: yystorm07, VM Id:caae70cd-978b-456e-ae70-19c6b0ab82e6

Please consider this a duplicate of bug 1072282.
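For reference, the breakdown Roy's egrep pipeline produces can also be sketched in a few lines of Python. The log file name and the two message fragments come from the comments above; everything else (variable names, output format) is illustrative only and not part of the product.

```python
#!/usr/bin/env python
# Illustrative sketch only: pair each "Highly Available VM went down" event
# with the GetVmStatsVDS failure that immediately precedes it, mimicking
#   egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1

LOG_FILE = "engine.log-20140411"        # assumed path, taken from the comment above
FAILURE = "GetVmStatsVDS execution"     # stats poll failure (e.g. SocketTimeoutException)
WENT_DOWN = "went down"                 # HA restart trigger message

last_failure = None
with open(LOG_FILE) as log:
    for line in log:
        if FAILURE in line:
            last_failure = line.rstrip()
        elif WENT_DOWN in line:
            print(last_failure or "<no preceding GetVmStatsVDS failure>")
            print(line.rstrip())
            print("--")
            last_failure = None
```

In the excerpt Julio pasted, the "went down" event follows the failed stats poll by roughly 200 ms, which is exactly the pattern this breakdown is meant to surface.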
For now I want to keep this open while it's still fresh, to check the VDSM behavior in these cases. Worth noting there is an inherent race in the check on the destination host of whether the same VM is already there or not… and if we are at the beginning of createVM on the destination at that time, we're screwed. In such situations only the engine can serve as a synchronization element; we just need to diligently keep the engine bug-free :-)

Anyway - I'd suggest closing this bug as a duplicate.

And the disk space problem needs to be carefully examined… it must not happen. If some logrotate settings or values are incorrect, or we're just thrashing logs, we need to do something.

(In reply to Michal Skrivanek from comment #11)
> Anyway - I'd suggest closing this bug as a duplicate.

Agreed, thanks Michal.

> And the disk space problem needs to be carefully examined… it must not
> happen. If some logrotate settings or values are incorrect, or we're just
> thrashing logs, we need to do something.

That should be under control, thank you.

*** This bug has been marked as a duplicate of bug 1072282 ***
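To illustrate the race Michal describes above (a destination host checking whether the VM is already present and then creating it, which is not atomic), here is a minimal sketch. The names (running_vms, host_start_vm, vm_locks) are made up for the example; this is not VDSM or engine code, only a picture of why the engine has to act as the single synchronization point for HA restarts.

```python
import threading

# Made-up names for illustration only; not actual VDSM or engine APIs.
running_vms = set()       # VMs the cluster believes are running
vm_locks = {}             # engine-side per-VM locks

def host_start_vm(vm_id):
    # Host-side check-then-create: between the membership check and the
    # creation, another start request for the same VM can slip in. That
    # window is how an HA VM can end up started twice.
    if vm_id not in running_vms:      # check
        running_vms.add(vm_id)        # create
        return True
    return False

def engine_start_ha_vm(vm_id):
    # Only the engine can serialize competing start attempts: take a
    # per-VM lock and re-check state before asking any host to create it.
    lock = vm_locks.setdefault(vm_id, threading.Lock())
    with lock:
        if vm_id in running_vms:
            return False              # already running somewhere; do nothing
        return host_start_vm(vm_id)
```

Under that model the racy host-side check is tolerable only because the engine never issues two concurrent starts for the same HA VM, which is also why a failed stats poll alone (as in bug 1072282) must not be treated as proof that the VM is down.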