Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1149135

Summary: Prestarted VMs disappear from UI after failure to restore snapshot once VM turns from Unknown status to Down
Product: Red Hat Enterprise Virtualization Manager
Reporter: Tomas Dosek <tdosek>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: Ilanit Stein <istein>
Severity: medium
Docs Contact:
Priority: high
Version: 3.4.0
CC: adahms, ahoness, bkorren, ecohen, eedri, iheim, lpeer, lsurette, mavital, ofrenkel, pablo.iranzo, rbalakri, Rhev-m-bugs, scohen, tdosek, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version: org.ovirt.engine-root-3.5.0-19
Doc Type: Bug Fix
Doc Text:
Previously, updating a virtual machine from a pool that was set to use the latest version of the template on which the pool is based would sometimes fail. This resulted in virtual machines that could not be updated to the latest version of the template being removed from the pool. Now, the version of the template for virtual machines in pools has been corrected so that virtual machines are no longer removed from pools under these circumstances.
Story Points: ---
Clone Of:
: 1155557 (view as bug list)
Environment:
Last Closed: 2015-02-11 18:09:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1133612, 1155557    

Description Tomas Dosek 2014-10-03 10:14:56 UTC
Description of problem:
Prestarted VMs disappear from the UI after a failure to restore their snapshot once the VM turns from Unknown status to Down.

These VMs get deleted by the non-interactive internal user.

Version-Release number of selected component (if applicable):
3.4.2

How reproducible:
Specific scenario

Steps to Reproduce:
1. Have a pool of prestarted VMs with latest template versioning policy
2. Cause them to go to Unknown status (e.g. cut off the storage they run on)
3. Reboot hypervisor they run on
4. Start the storage again (you might need to update the VM status from 7 to 0 in the database; a sketch of this follows below)
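
In case it helps with step 4, here is a minimal sketch of the kind of status reset meant by "update vm status from 7 to 0". It assumes the usual engine schema (vm_dynamic.status and vm_guid, with 7 = Unknown and 0 = Down) and uses placeholder connection details - double-check against your environment and back up the engine database first:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class ResetVmStatusSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials; the PostgreSQL JDBC driver must be on the classpath.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "changeme");
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE vm_dynamic SET status = 0 WHERE vm_guid = ? AND status = 7")) {
            ps.setObject(1, UUID.fromString("8c8ecd90-f5b1-4950-a773-91cf3016c28a")); // the stuck VM
            System.out.println("rows updated: " + ps.executeUpdate());
        }
    }
}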

Actual results:
VM gets deleted after the next attempt to run it.

Expected results:
The VM should restore its snapshot and run.

Additional info:
Full log collector output, etc. coming in the next comment - engine events:
engine.log:82010:2014-10-02 11:47:10,995 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (org.ovirt.thread.pool-4-thread-27) [56c2bdf2] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = Unknown, exit status = Normal), log id: 67d3e891
engine.log:84702:2014-10-02 11:52:36,096 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-93) [2bdc9616] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) is running in db and not running in VDS RHVH-8_ucs3b3
engine.log:100314:2014-10-02 12:22:14,675 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 43c6b8d1
engine.log:100349:2014-10-02 12:22:17,662 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Lock Acquired to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100358:2014-10-02 12:22:18,288 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 47d61ab2
engine.log:100393:2014-10-02 12:22:20,569 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Running command: RunVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100405:2014-10-02 12:22:21,073 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) already contains stateless snapshot, removing it
engine.log:100412:2014-10-02 12:22:21,201 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (DefaultQuartzScheduler_Worker-70) [1d40ac73] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100414:2014-10-02 12:22:21,365 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (DefaultQuartzScheduler_Worker-70) [33cb1992] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100415:2014-10-02 12:22:21,467 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (DefaultQuartzScheduler_Worker-70) [15482126] Running command: UpdateVmVersionCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100420:2014-10-02 12:22:21,872 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-4-thread-51) [48c849b2] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100423:2014-10-02 12:22:22,103 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [9b33044] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100425:2014-10-02 12:22:22,320 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100429:2014-10-02 12:22:22,321 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Couldnt update VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) version from its template, continue with restoring stateless snapshot.
engine.log:100430:2014-10-02 12:22:22,544 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-4-thread-51) [4f7f9ee3] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100435:2014-10-02 12:22:22,590 INFO [org.ovirt.engine.core.bll.RemoveVmFromPoolCommand] (DefaultQuartzScheduler_Worker-70) [77b4cc4] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100455:2014-10-02 12:22:23,759 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Running command: RemoveVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100456:2014-10-02 12:22:23,760 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = ImageLocked, exit status = Normal), log id: b5b180e
engine.log:100462:2014-10-02 12:22:23,992 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Trying to release exclusive lock which does not exist, lock key: 8c8ecd90-f5b1-4950-a773-91cf3016c28aVM
engine.log:100463:2014-10-02 12:22:23,992 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100469:2014-10-02 12:22:24,472 INFO [org.ovirt.engine.core.bll.RemoveAllVmImagesCommand] (DefaultQuartzScheduler_Worker-70) [51d6ba4] Running command: RemoveAllVmImagesCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:101215:2014-10-02 12:23:33,611 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [71c47f8

Comment 2 Tomas Dosek 2014-10-08 07:02:52 UTC
A simple test was made; this looks like some kind of automatic behavior. If the VMs are marked as protected against deletion, they end up being detached from their pools instead.

Comment 4 Arik 2014-10-12 13:11:50 UTC
First, I'll describe the flow:
1. The host that contained the prestarted VM was rebooted.

2. When the host was up again, the monitoring detected that none of the VMs that had been running on it exist anymore.

3. The mechanism for 'cleaning up' VMs that went down can be invoked from two places:

a. from the monitoring: the monitoring detects that the VM crashed and invokes ProcessVmOnDownCommand
b. from the run command: when trying to run a VM, we invoke ProcessVmOnDownCommand if the VM already has a stateless snapshot

In this case ProcessVmOnDownCommand was called via the second flow (b).

4. There is a bug: when ProcessVmOnDownCommand is called from within RunVmCommand, the version is updated even when it is not supposed to be (as in this case, where the template the pool is based on has no newer sub-version). As part of the version update (UpdateVmVersionCommand), the VM is removed and added again - but when ProcessVmOnDownCommand is called from RunVmCommand, the VM is re-added based on the BLANK template instead of the correct template (another bug). See the sketch after point 5.

5. The removal of the VM succeeds, but the add command (AddVmAndAttachToPoolCommand) fails because of an invalid timezone in the BLANK template:
2014-10-02 12:24:39,176 WARN  [org.ovirt.engine.core.bll.AddVmAndAttachToPoolCommand] (org.ovirt.thread.pool-4-thread-55) [2d0d06c5] CanDoAction of action AddVmAndAttachToPool failed. Reasons:VAR__ACTION__ADD,VAR__TYPE__VM,ACTION_TYPE_FAILED_INVALID_TIMEZONE
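
To make points 3 and 4 easier to follow, here is a simplified, illustrative Java sketch of the two entry points and of the faulty path; the class and method names are placeholders, not the actual engine code:

import java.util.UUID;

public class PrestartedVmCleanupSketch {

    // (a) monitoring path: the host comes back up and the VM no longer exists on it
    public void onVmMissingFromHost(UUID vmId) {
        processVmOnDown(vmId, false);
    }

    // (b) run path: RunVmCommand finds a leftover stateless snapshot before starting the VM
    public void runVm(UUID vmId) {
        if (hasLeftoverStatelessSnapshot(vmId)) {
            processVmOnDown(vmId, true); // the path taken in this bug
        }
        // ... continue with the normal run flow
    }

    private void processVmOnDown(UUID vmId, boolean calledFromRunVm) {
        // Point 4: when reached via (b), the version update runs even though the pool's
        // template has no newer sub-version, and the re-added VM ends up based on the
        // BLANK template instead of the pool's template (the two bugs described above).
        boolean newerTemplateSubVersionExists = false; // the situation in this bug
        if (calledFromRunVm || newerTemplateSubVersionExists) { // faulty: should depend on the template check only
            removeVmAndReAddFromTemplate(vmId);
        }
        restoreStatelessSnapshot(vmId);
    }

    private boolean hasLeftoverStatelessSnapshot(UUID vmId) { return true; } // stub for the sketch
    private void removeVmAndReAddFromTemplate(UUID vmId) { /* UpdateVmVersionCommand-style remove + add */ }
    private void restoreStatelessSnapshot(UUID vmId) { /* RestoreStatelessVmCommand-style restore */ }
}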

---------------------------------------------------------------------------

* The problem with the invalid timezone that prevents adding the updated VM was already solved by bz 1087917, so in 3.5 the VM won't disappear.

* Part of the problem is that ProcessVmOnDownCommand was called by RunVmCommand instead of by the monitoring. That happens because of the frequency at which the prestarted-VM monitoring (VmPoolMonitor) is invoked: the default setting is 5 minutes, but I see in the logs that it is sometimes invoked 3-4 times in a minute for several pools.
Tomas, I don't understand - VmPoolMonitorIntervalInMinutes is in minutes, so how is the frequency so high? If it is set to something like 0.3, then it is like setting it to 0, which means it is invoked in a loop - and that's bad. Maybe we should change the setting to milliseconds if that's the case. (A tiny sketch of the truncation I mean follows below.)
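
For illustration, a minimal sketch of that truncation, under the assumption that the fractional value ends up being read as a whole number of minutes (placeholder code, not the actual engine scheduler):

public class PoolMonitorIntervalSketch {
    public static void main(String[] args) {
        String configuredValue = "0.3"; // hypothetical VmPoolMonitorIntervalInMinutes value
        long intervalMinutes = (long) Double.parseDouble(configuredValue); // truncates to 0
        System.out.println("effective interval: " + intervalMinutes + " minute(s)");
        // A zero-minute period means the pool monitor is rescheduled with effectively no
        // delay, i.e. it runs in a tight loop - consistent with 3-4 invocations per minute.
    }
}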

I'll make a fix for the case where ProcessVmOnDown is called from RunVmCommand and that should solve the problem.

Comment 6 Tomas Dosek 2014-10-13 07:18:13 UTC
With regard to the first issue, we have a patch so I could build a test package, but I prefer to wait for the second one to be sure we cover both scenarios.

Comment 9 Ilanit Stein 2014-11-13 14:17:37 UTC
Tested on vt9:

Steps to reproduce:
1. Have a pool of 10 pre-started VMs
2. Reboot hypervisor they run on
These steps were run 3 times.

Results:
Following the host reboot, the VMs in the pool went to Unknown status. After the host came back up, all VMs became Up again.

Comment 11 errata-xmlrpc 2015-02-11 18:09:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html