Bug 1090536
Summary: | VM started twice by HA leading to data corruption | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Julio Entrena Perez <jentrena>
Component: | ovirt-engine | Assignee: | Michal Skrivanek <michal.skrivanek>
Status: | CLOSED DUPLICATE | QA Contact: |
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 3.3.0 | CC: | acathrow, fromani, iheim, jentrena, lpeer, michal.skrivanek, rgolan, Rhev-m-bugs, sputhenp, yeylon
Target Milestone: | --- | |
Target Release: | 3.4.0 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | virt | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-04-24 14:39:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Julio Entrena Perez
2014-04-23 14:42:27 UTC
There is a hole in engine.log between 2014-04-18 03:24:03 and 2014-04-22 10:32:29 where the engine is restarting. This is probably why (server.log):

2014-04-18 05:00:19,681 ERROR [stderr] (Timer-1) java.io.IOException: No space left on device

I don't know how long this "no space" situation continued. Julio, can you shed light on the disk condition between the 18th and the 22nd? Looking at the VDSM log I'm not sure the engine is sending the list command, so this is also weird. Julio, is there a case where this happens again that is not in the scope of the engine running out of disk space?

It's the same as bug 1072282:

2014-04-10 05:09:43,299 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method

Suggesting this as a duplicate of the bug mentioned above. We can only assume the same happened during the lost period of the 18th-22nd.

(In reply to Roy Golan from comment #5)
> It's the same as bug 1072282:
>
> 2014-04-10 05:09:43,299 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand]
> (DefaultQuartzScheduler_Worker-74) [be56dd4] Failed in GetVmStatsVDS method
>
> Suggesting this as a duplicate of the bug mentioned above. We can only
> assume the same happened during the lost period of the 18th-22nd.

Are you sure? That event happened _after_ the second instance of the VM was started at 05:09:37 (six seconds earlier).

(In reply to Julio Entrena Perez from comment #6)
> Are you sure? That event happened _after_ the second instance of the VM was
> started at 05:09:37 (six seconds earlier).

egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1

It will give you a sort of breakdown.

(In reply to Roy Golan from comment #7)
> egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1
>
> It will give you a sort of breakdown.

Indeed, thanks Roy:

2014-04-10 05:07:18,412 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-52) [43abb01a] Command GetVmStatsVDS execution failed. Exception: VDSNetworkException: java.net.SocketTimeoutException: connect timed out
2014-04-10 05:07:18,611 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-52) [43abb01a] Highly Available VM went down. Attempting to restart. VM Name: yystorm07, VM Id:caae70cd-978b-456e-ae70-19c6b0ab82e6

Please consider this a duplicate of bug 1072282.
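For reference, the breakdown Roy's egrep pipeline produces can also be sketched in a few lines of Python. The log file name and the two message fragments come from the comments above; everything else (variable names, output format) is illustrative only and not part of the product.

```python
#!/usr/bin/env python
# Illustrative sketch only: pair each "Highly Available VM went down" event
# with the GetVmStatsVDS failure that immediately precedes it, mimicking
#   egrep "went down|GetVmStatsVDS execution" engine.log-20140411 | grep went -B 1

LOG_FILE = "engine.log-20140411"        # assumed path, taken from the comment above
FAILURE = "GetVmStatsVDS execution"     # stats poll failure (e.g. SocketTimeoutException)
WENT_DOWN = "went down"                 # HA restart trigger message

last_failure = None
with open(LOG_FILE) as log:
    for line in log:
        if FAILURE in line:
            last_failure = line.rstrip()
        elif WENT_DOWN in line:
            print(last_failure or "<no preceding GetVmStatsVDS failure>")
            print(line.rstrip())
            print("--")
            last_failure = None
```

In the excerpt Julio pasted, the "went down" event follows the failed stats poll by roughly 200 ms, which is exactly the pattern this breakdown is meant to surface.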
For now I want to keep this open while it's still fresh, to check the VDSM behavior in these cases. Worth noting there is an inherent race in the check on the destination host of whether the same VM is already there or not… and if we are at the beginning of createVM on the destination at that time, we're screwed. In such situations only the engine can serve as a synchronization element; we just need to diligently keep the engine bug-free :-)

Anyway - I'd suggest closing this bug as a duplicate.

And the disk space problem needs to be carefully examined… it must not happen. If some logrotate settings or values are incorrect, or we're just thrashing logs, we need to do something.

(In reply to Michal Skrivanek from comment #11)
> Anyway - I'd suggest closing this bug as a duplicate.

Agreed, thanks Michal.

> And the disk space problem needs to be carefully examined… it must not
> happen. If some logrotate settings or values are incorrect, or we're just
> thrashing logs, we need to do something.

That should be under control, thank you.

*** This bug has been marked as a duplicate of bug 1072282 ***
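To illustrate the race Michal describes above (a destination host checking whether the VM is already present and then creating it, which is not atomic), here is a minimal sketch. The names (running_vms, host_start_vm, vm_locks) are made up for the example; this is not VDSM or engine code, only a picture of why the engine has to act as the single synchronization point for HA restarts.

```python
import threading

# Made-up names for illustration only; not actual VDSM or engine APIs.
running_vms = set()       # VMs the cluster believes are running
vm_locks = {}             # engine-side per-VM locks

def host_start_vm(vm_id):
    # Host-side check-then-create: between the membership check and the
    # creation, another start request for the same VM can slip in. That
    # window is how an HA VM can end up started twice.
    if vm_id not in running_vms:      # check
        running_vms.add(vm_id)        # create
        return True
    return False

def engine_start_ha_vm(vm_id):
    # Only the engine can serialize competing start attempts: take a
    # per-VM lock and re-check state before asking any host to create it.
    lock = vm_locks.setdefault(vm_id, threading.Lock())
    with lock:
        if vm_id in running_vms:
            return False              # already running somewhere; do nothing
        return host_start_vm(vm_id)
```

Under that model the racy host-side check is tolerable only because the engine never issues two concurrent starts for the same HA VM, which is also why a failed stats poll alone (as in bug 1072282) must not be treated as proof that the VM is down.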