Description of problem:

VM ends up being started by RHEV-M HA on two hosts at a time, leading to data corruption.

Version-Release number of selected component (if applicable):

rhevm-3.3.1-0.48.el6ev

How reproducible:

Frequently.

Steps to Reproduce:
1. Set cluster policy to "Evenly_Distributed" and enable HA for the VMs.

Actual results:

A (false) failed migration is eventually detected by RHEV-M and the VM is started on a host while already running on another host, leading to data corruption in the VM. The live migration is actually successful.

Expected results:

VMs are started once only.

Additional info:
There is a hole in the engine.log between 2014-04-18 03:24:03 and 2014-04-22 10:32:29, where the engine is restarting. This is probably why - from server.log:

2014-04-18 05:00:19,681 ERROR [stderr] (Timer-1) java.io.IOException: No space left on device

I don't know how long this "no space" situation continued. Julio, can you shed light on the disk condition between the 18th and the 22nd?

Also, looking at the VDSM log I'm not sure the engine is sending the list command, so this is also weird. Julio, is there a place where this happens again, outside the scope of the engine running out of disk space?
It's the same as bug 1072282:

2014-04-10 05:09:43,299 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method

suggesting this is a duplicate of the above mentioned bug. We can only assume the same happened during the lost period of the 18th-22nd.
(In reply to Roy Golan from comment #5)
> its the same as bug 1072282
>
> 2014-04-10 05:09:43,299 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand]
> (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method
>
> suggesting this as a duplicate of the above mentioned. we can only assume
> this was the same for the lost period of 18-22nd

Are you sure? That event happened _after_ the second instance of the VM was started at 05:09:37 (six seconds earlier).
(In reply to Julio Entrena Perez from comment #6)
> (In reply to Roy Golan from comment #5)
> > its the same as bug 1072282
> >
> > 2014-04-10 05:09:43,299 ERROR
> > [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand]
> > (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method
> >
> > suggesting this as a duplicate of the above mentioned. we can only assume
> > this was the same for the lost period of 18-22nd
>
> Are you sure? That event happened _after_ the second instance of the VM was
> started at 05:09:37 (six seconds earlier).

egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1

It will give you a sort of breakdown.
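For anyone repeating this triage: the pipeline works because the first egrep narrows the log to only the two event types, so `grep went -B 1` then prints each restart attempt together with the failure line immediately preceding it. A minimal sketch against a fabricated engine.log sample (the log lines below are shortened illustrations, not the real entries):

```shell
# Build a tiny fake engine.log to show how the pipeline pairs the lines.
cat > /tmp/engine.log.sample <<'EOF'
2014-04-10 05:07:18,412 ERROR Command GetVmStatsVDS execution failed. Exception: VDSNetworkException
2014-04-10 05:07:18,611 INFO Highly Available VM went down. Attempting to restart. VM Name: yystorm07
2014-04-10 05:08:00,000 INFO some unrelated line
EOF

# First filter keeps only the two interesting event types; second prints each
# "went down" event plus the line right before it in that filtered stream.
egrep "went down|GetVmStatsVDS execution" /tmp/engine.log.sample | grep went -B 1
```

On the real host you would of course point it at engine.log-20140411 as Roy suggests.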
(In reply to Roy Golan from comment #7)
> egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went
> -B 1
>
> it will give you a sort of breakdown.

Indeed, thanks Roy:

2014-04-10 05:07:18,412 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-52) [43abb01a] Command GetVmStatsVDS execution failed. Exception: VDSNetworkException: java.net.SocketTimeoutException: connect timed out
2014-04-10 05:07:18,611 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-52) [43abb01a] Highly Available VM went down. Attempting to restart. VM Name: yystorm07, VM Id: caae70cd-978b-456e-ae70-19c6b0ab82e6
Please consider this a duplicate of bug 1072282. For now I want to keep this open, while it's still fresh, to check the VDSM behavior in these cases.
Worth noting there is an inherent race in the check on the destination host of whether the same VM is already there or not. If we are at the beginning of createVM on the destination at that moment, we're screwed. In such situations only the engine can serve as a synchronization element; we just need to diligently keep the engine bug-free :-)

Anyway, I'd suggest closing this bug as a duplicate.

And the disk space problem needs to be carefully examined; it must not happen. If logrotate or some of its values are incorrect, or we're just thrashing logs, we need to do something.
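To make the race concrete: the destination-host check and the VM creation are two separate steps, so two starters can both pass the "is it already there?" check before either one registers the VM. A minimal in-process sketch (function and variable names are illustrative, not actual VDSM code; a Barrier stands in for the timing window, and the lock stands in for the cross-host synchronization only the engine can provide):

```python
import threading

# --- Racy check-then-create: both "hosts" pass the check, VM starts twice ---
racy_registry = {}                       # vm_id -> host that started it
racy_starts = []                         # records every successful start
past_check = threading.Barrier(2)        # forces both threads into the window

def start_vm_racy(vm_id, host):
    if vm_id in racy_registry:           # step 1: "is the VM already here?"
        return
    past_check.wait()                    # both hosts are now past the check
    racy_registry[vm_id] = host          # step 2: create -- too late to notice
    racy_starts.append(host)

threads = [threading.Thread(target=start_vm_racy, args=("vm1", h))
           for h in ("hostA", "hostB")]
for t in threads: t.start()
for t in threads: t.join()
print(len(racy_starts))                  # 2 -- the double start

# --- Synchronized check-and-create: one lock makes the two steps atomic ---
safe_registry = {}
safe_starts = []
start_lock = threading.Lock()

def start_vm_safe(vm_id, host):
    with start_lock:                     # check and create under one lock
        if vm_id in safe_registry:
            return
        safe_registry[vm_id] = host
        safe_starts.append(host)

threads = [threading.Thread(target=start_vm_safe, args=("vm1", h))
           for h in ("hostA", "hostB")]
for t in threads: t.start()
for t in threads: t.join()
print(len(safe_starts))                  # 1 -- single start
```

The point of the sketch is Michal's: there is no lock shared between two hypervisors, so the atomic check-and-create has to live in the one place that sees both of them, the engine.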
(In reply to Michal Skrivanek from comment #11)
> Anyway - I'd suggest to close this bug as a duplicate

Agreed, thanks Michal.

> And the disk space problem needs to be carefully examined…it must not
> happen. If some logrotate or values are incorrect or we're just thrashing
> logs - we need to do something

That should be under control, thank you.
*** This bug has been marked as a duplicate of bug 1072282 ***