Bug 1052024

Summary: After a power outage, two VMs marked as HA failed to start automatically and had to be started manually.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Aval <avyadav>
Component: ovirt-engine
Assignee: Gilad Chaplik <gchaplik>
Status: CLOSED ERRATA
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: urgent
Version: 3.2.0
CC: acathrow, avyadav, dfediuck, gchaplik, iheim, lpeer, mavital, pmukhedk, Rhev-m-bugs, sherold, sputhenp, tpoitras, vgaikwad, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version: av3
Doc Type: Bug Fix
Doc Text:
Previously, some virtual machines did not automatically restart after a power failure. As a result, they would have to be manually restarted. Now, the issue has been corrected and all virtual machines restart as expected.
Story Points: ---
Clone Of:
: 1074478
Environment:
Last Closed: 2014-06-09 15:08:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1074478, 1078909, 1142926

Comment 2 Doron Fediuck 2014-01-13 15:25:58 UTC
Hi, can you please provide the exact rhev version, and the relevant engine log files?

Comment 3 Aval 2014-01-13 23:05:07 UTC
Created attachment 849684
engine.log

Comment 4 Aval 2014-01-13 23:22:05 UTC
(In reply to Doron Fediuck from comment #2)
> Hi, can you please provide the exact rhev version, and the relevant engine
> log files?

- Version-Release number of selected component (if applicable):

rhevm-3.2.2-0.41.el6ev.noarch

- Attached engine.log file

Comment 8 Itamar Heim 2014-03-07 12:13:10 UTC
Description of problem: After a power outage, two of the 8 VMs marked as HA failed to start automatically. The customer had to start them manually after waiting a few hours for RHEV-M to handle this automatically.

Environment details:
- 2 hypervisors with 24 GB RAM each
- 10 VMs

- Example of an HA VM, "mastro03srv", failing to start (two VMs had already started successfully before this one). The customer later started it manually without any error.

~~~
2013-12-29 07:10:25,456 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (QuartzScheduler_Worker-13) [33217069] Failed to start Highly Available VM. Attempting to restart. VM Name: mastro03srv, VM Id:c52b7bdb-9c3a-4d76-9f06-42cbb7687a17
2013-12-29 07:10:25,468 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Lock Acquired to object EngineLock [exclusiveLocks= key: c52b7bdb-9c3a-4d76-9f06-42cbb7687a17 value: VM
, sharedLocks= ]
2013-12-29 07:10:25,476 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (QuartzScheduler_Worker-13) [33217069] START, IsVmDuringInitiatingVDSCommand( vmId = c52b7bdb-9c3a-4d76-9f06-42cbb7687a17), log id: 583aa7de
2013-12-29 07:10:25,477 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (QuartzScheduler_Worker-13) [33217069] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 583aa7de
2013-12-29 07:10:25,490 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Running command: RunVmCommand internal: true. Entities affected :  ID: c52b7bdb-9c3a-4d76-9f06-42cbb7687a17 Type: VM
2013-12-29 07:10:25,514 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Lock freed to object EngineLock [exclusiveLocks= key: c52b7bdb-9c3a-4d76-9f06-42cbb7687a17 value: VM
, sharedLocks= ]
2013-12-29 07:10:25,514 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Failed to run desktop mastro03srv, rerun 
2013-12-29 07:10:25,519 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-13) [33217069] START, UpdateVdsDynamicDataVDSCommand(HostName = rhev-hv01.xxxxxx.com, HostId = 0a4f8d16-ed7e-4d54-8199-ccb3f5e31baf, vdsDynamic=org.ovirt.engine.core.common.businessentities.VdsDynamic@5bdabd1b), log id: 4e7d3ce3
2013-12-29 07:10:25,521 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-13) [33217069] FINISH, UpdateVdsDynamicDataVDSCommand, log id: 4e7d3ce3

[...]

2013-12-29 07:10:25,549 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Lock Acquired to object EngineLock [exclusiveLocks= key: c52b7bdb-9c3a-4d76-9f06-42cbb7687a17 value: VM
, sharedLocks= ]
2013-12-29 07:10:25,566 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (QuartzScheduler_Worker-13) [33217069] START, IsVmDuringInitiatingVDSCommand( vmId = c52b7bdb-9c3a-4d76-9f06-42cbb7687a17), log id: 474630fa
2013-12-29 07:10:25,566 INFO  [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (QuartzScheduler_Worker-13) [33217069] FINISH, IsVmDuringInitiatingVDSCommand, return: false, log id: 474630fa
2013-12-29 07:10:25,576 INFO  [org.ovirt.engine.core.bll.VdsSelector] (QuartzScheduler_Worker-13) [33217069]  VDS rhev-hv01.xxxxxx.com 0a4f8d16-ed7e-4d54-8199-ccb3f5e31baf have failed running this VM in the current selection cycle VDS rhev-hv02.xxxxxx.com 4e942526-ac3a-4a46-b969-4bbe139c67d5 is not in up status or belongs to the VM's cluster
2013-12-29 07:10:25,577 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_CLUSTER  

2013-12-29 07:10:25,577 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] Lock freed to object EngineLock [exclusiveLocks= key: c52b7bdb-9c3a-4d76-9f06-42cbb7687a17 value: VM
, sharedLocks= ]
~~~

After the above failure, a second HA VM, "mx", failed with this error:
~~~
2013-12-29 07:10:36,839 INFO  [org.ovirt.engine.core.bll.VdsSelector] (QuartzScheduler_Worker-13) [33217069]  VDS rhev-hv01.xxxxxx.com 0a4f8d16-ed7e-4d54-8199-ccb3f5e31baf has insufficient memory to run the VM VDS rhev-hv02.xxxxxx.com 4e942526-ac3a-4a46-b969-4bbe139c67d5 is not in up status or belongs to the VM's cluster
2013-12-29 07:10:36,839 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (QuartzScheduler_Worker-13) [33217069] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY
~~~

The customer disputes the insufficient-memory error, as the total memory required by all the VMs is 15 GB and they have two hosts with 24 GB each.
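
Both failures above surface in engine.log as `CanDoAction of action RunVm failed` warnings. As a triage aid, the sketch below pulls the timestamp and failure reason out of such lines. The pattern is inferred only from the two excerpts in this comment, so other engine versions may log a different layout; `run_vm_failures` and `SAMPLE` are illustrative names, not part of any RHEV tooling.

```python
import re

# Layout of the WARN line, inferred from the excerpts in this bug report.
CAN_DO_ACTION_FAILED = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+"
    r"\[org\.ovirt\.engine\.core\.bll\.RunVmCommand\].*?"
    r"CanDoAction of action RunVm failed\. Reasons:(?P<reasons>\S+)"
)


def run_vm_failures(lines):
    """Yield (timestamp, failure reasons) for each failed RunVm validation."""
    for line in lines:
        match = CAN_DO_ACTION_FAILED.match(line)
        if match is None:
            continue
        # Keep only the ACTION_TYPE_FAILED_* entries and drop the VAR__*
        # entries, which are repeated in the first excerpt.
        reasons = [
            reason
            for reason in match.group("reasons").split(",")
            if reason.startswith("ACTION_TYPE_FAILED")
        ]
        yield match.group("timestamp"), reasons


# The two WARN lines quoted above, used here as sample input.
SAMPLE = [
    "2013-12-29 07:10:25,577 WARN  [org.ovirt.engine.core.bll.RunVmCommand] "
    "(QuartzScheduler_Worker-13) [33217069] CanDoAction of action RunVm failed. "
    "Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,VAR__ACTION__RUN,VAR__TYPE__VM,"
    "ACTION_TYPE_FAILED_VDS_VM_CLUSTER",
    "2013-12-29 07:10:36,839 WARN  [org.ovirt.engine.core.bll.RunVmCommand] "
    "(QuartzScheduler_Worker-13) [33217069] CanDoAction of action RunVm failed. "
    "Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY",
]

for timestamp, reasons in run_vm_failures(SAMPLE):
    print(timestamp, reasons)
```

To scan a whole log, pass an open file object instead of `SAMPLE`. Multi-line entries such as the `EngineLock` messages are skipped because they do not match the pattern.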

Version-Release number of selected component (if applicable):
rhevm-3.2.2-0.41.el6ev.noarch

How reproducible:
Not consistently reproducible; it happened after a power outage on one of the customer's setups.

Steps to Reproduce:
1.
2.
3.

Actual results:
2 out of 8 VMs marked as HA failed to automatically start after power outage

Expected results:
All VMs marked HA should be started automatically by RHEV-M

Additional info:

Comment 10 Artyom 2014-03-17 17:07:20 UTC
Verified on av3.
Added a host with 16 GB and ran four HA VMs on it: three with 4096 MB and one with 2048 MB. Powered off the host, waited 5 minutes, and powered it back on; all VMs started fine.
Also tested under the 'None' cluster policy.
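
For reference, the memory figures in this verification add up as follows. This is plain arithmetic on the numbers above, not the engine's own memory check. It treats "16G" as 16 GiB (an assumption) and ignores any memory the host reserves for itself; the variable names are illustrative.

```python
# VM memory sizes from the verification run, in MB.
HA_VM_MEMORY_MB = [4096, 4096, 4096, 2048]

# Host memory: "16G" is taken here as 16 GiB, which is an assumption.
HOST_MEMORY_MB = 16 * 1024

total_vm_memory_mb = sum(HA_VM_MEMORY_MB)
headroom_mb = HOST_MEMORY_MB - total_vm_memory_mb

print(f"VMs need {total_vm_memory_mb} MB of {HOST_MEMORY_MB} MB; "
      f"headroom {headroom_mb} MB")
```

Under these assumptions the four VMs fill most of the host, so the test exercises a restart with little spare memory.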

Comment 11 errata-xmlrpc 2014-06-09 15:08:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html