Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1149135

Summary: Prestarted VMs disappear from UI after failure to restore snapshot once VM turns from Unknown status to Down
Product: Red Hat Enterprise Virtualization Manager
Reporter: Tomas Dosek <tdosek>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: Ilanit Stein <istein>
Severity: medium
Docs Contact:
Priority: high
Version: 3.4.0
CC: adahms, ahoness, bkorren, ecohen, eedri, iheim, lpeer, lsurette, mavital, ofrenkel, pablo.iranzo, rbalakri, Rhev-m-bugs, scohen, tdosek, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version: org.ovirt.engine-root-3.5.0-19
Doc Type: Bug Fix
Doc Text:
Previously, updating a virtual machine from a pool that was set to use the latest version of the template on which the pool is based would sometimes fail. This resulted in virtual machines that could not be updated to the latest version of the template being removed from the pool. Now, the version of the template for virtual machines in pools has been corrected so that virtual machines are no longer removed from pools under these circumstances.
Story Points: ---
Clone Of:
: 1155557 (view as bug list)
Environment:
Last Closed: 2015-02-11 18:09:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1133612, 1155557    

Description Tomas Dosek 2014-10-03 10:14:56 UTC
Description of problem:
Prestarted VMs disappear from the UI after a failure to restore their snapshot once the VM turns from Unknown status to Down.

These VMs get deleted by the non-interactive internal user.

Version-Release number of selected component (if applicable):
3.4.2

How reproducible:
Specific scenario

Steps to Reproduce:
1. Have a pool of prestarted VMs with latest template versioning policy
2. Cause them to go to Unknown status (e.g. cut off the storage they run on)
3. Reboot hypervisor they run on
4. Start the storage again (you might need to update the VM status from 7 to 0 in the database; a sketch of this follows below)
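
In case it helps with step 4, here is a minimal sketch of the kind of status reset meant by "update vm status from 7 to 0". It assumes the usual engine schema (vm_dynamic.status and vm_guid, with 7 = Unknown and 0 = Down) and uses placeholder connection details - double-check against your environment and back up the engine database first:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class ResetVmStatusSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials; the PostgreSQL JDBC driver must be on the classpath.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "changeme");
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE vm_dynamic SET status = 0 WHERE vm_guid = ? AND status = 7")) {
            ps.setObject(1, UUID.fromString("8c8ecd90-f5b1-4950-a773-91cf3016c28a")); // the stuck VM
            System.out.println("rows updated: " + ps.executeUpdate());
        }
    }
}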

Actual results:
VM gets deleted after the next attempt to run it.

Expected results:
The VM should restore its snapshot and run.

Additional info:
Full log collector output, etc. coming in the next comment - engine events:
engine.log:82010:2014-10-02 11:47:10,995 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (org.ovirt.thread.pool-4-thread-27) [56c2bdf2] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = Unknown, exit status = Normal), log id: 67d3e891
engine.log:84702:2014-10-02 11:52:36,096 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-93) [2bdc9616] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) is running in db and not running in VDS RHVH-8_ucs3b3
engine.log:100314:2014-10-02 12:22:14,675 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 43c6b8d1
engine.log:100349:2014-10-02 12:22:17,662 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Lock Acquired to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100358:2014-10-02 12:22:18,288 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 47d61ab2
engine.log:100393:2014-10-02 12:22:20,569 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Running command: RunVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100405:2014-10-02 12:22:21,073 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) already contains stateless snapshot, removing it
engine.log:100412:2014-10-02 12:22:21,201 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (DefaultQuartzScheduler_Worker-70) [1d40ac73] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100414:2014-10-02 12:22:21,365 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (DefaultQuartzScheduler_Worker-70) [33cb1992] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100415:2014-10-02 12:22:21,467 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (DefaultQuartzScheduler_Worker-70) [15482126] Running command: UpdateVmVersionCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100420:2014-10-02 12:22:21,872 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-4-thread-51) [48c849b2] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100423:2014-10-02 12:22:22,103 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [9b33044] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100425:2014-10-02 12:22:22,320 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100429:2014-10-02 12:22:22,321 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Couldnt update VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) version from its template, continue with restoring stateless snapshot.
engine.log:100430:2014-10-02 12:22:22,544 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-4-thread-51) [4f7f9ee3] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100435:2014-10-02 12:22:22,590 INFO [org.ovirt.engine.core.bll.RemoveVmFromPoolCommand] (DefaultQuartzScheduler_Worker-70) [77b4cc4] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100455:2014-10-02 12:22:23,759 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Running command: RemoveVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100456:2014-10-02 12:22:23,760 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = ImageLocked, exit status = Normal), log id: b5b180e
engine.log:100462:2014-10-02 12:22:23,992 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Trying to release exclusive lock which does not exist, lock key: 8c8ecd90-f5b1-4950-a773-91cf3016c28aVM
engine.log:100463:2014-10-02 12:22:23,992 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100469:2014-10-02 12:22:24,472 INFO [org.ovirt.engine.core.bll.RemoveAllVmImagesCommand] (DefaultQuartzScheduler_Worker-70) [51d6ba4] Running command: RemoveAllVmImagesCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:101215:2014-10-02 12:23:33,611 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [71c47f8

Comment 2 Tomas Dosek 2014-10-08 07:02:52 UTC
A simple test was made; this looks like some kind of automatic behavior. If the VMs are marked as protected against deletion, they end up being detached from their pools instead.

Comment 4 Arik 2014-10-12 13:11:50 UTC
First, I'll describe the flow:
1. The host that contained the prestarted VM was rebooted.

2. When the host was up again, the monitoring detected that none of the VMs that had been running on it exist anymore.

3. The mechanism for 'cleaning up' VMs that went down can be invoked from two places:

a. from the monitoring: the monitoring detects that the VM crashed and invokes ProcessVmOnDownCommand
b. from the run command: when trying to run a VM, we invoke ProcessVmOnDownCommand if the VM already has a stateless snapshot

In this case ProcessVmOnDownCommand was called via the second flow (b).

4. There is a bug: when ProcessVmOnDownCommand is called from within RunVmCommand, the version is updated even when it is not supposed to be (as in this case, where the template the pool is based on has no newer sub-version). As part of the version update (UpdateVmVersionCommand), the VM is removed and added again - but when ProcessVmOnDownCommand is called from RunVmCommand, the VM is re-added based on the BLANK template instead of the correct template (another bug). See the sketch after point 5.

5. The removal of the VM succeeds, but the add command (AddVmAndAttachToPoolCommand) fails because of an invalid timezone in the BLANK template:
2014-10-02 12:24:39,176 WARN  [org.ovirt.engine.core.bll.AddVmAndAttachToPoolCommand] (org.ovirt.thread.pool-4-thread-55) [2d0d06c5] CanDoAction of action AddVmAndAttachToPool failed. Reasons:VAR__ACTION__ADD,VAR__TYPE__VM,ACTION_TYPE_FAILED_INVALID_TIMEZONE
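
To make points 3 and 4 easier to follow, here is a simplified, illustrative Java sketch of the two entry points and of the faulty path; the class and method names are placeholders, not the actual engine code:

import java.util.UUID;

public class PrestartedVmCleanupSketch {

    // (a) monitoring path: the host comes back up and the VM no longer exists on it
    public void onVmMissingFromHost(UUID vmId) {
        processVmOnDown(vmId, false);
    }

    // (b) run path: RunVmCommand finds a leftover stateless snapshot before starting the VM
    public void runVm(UUID vmId) {
        if (hasLeftoverStatelessSnapshot(vmId)) {
            processVmOnDown(vmId, true); // the path taken in this bug
        }
        // ... continue with the normal run flow
    }

    private void processVmOnDown(UUID vmId, boolean calledFromRunVm) {
        // Point 4: when reached via (b), the version update runs even though the pool's
        // template has no newer sub-version, and the re-added VM ends up based on the
        // BLANK template instead of the pool's template (the two bugs described above).
        boolean newerTemplateSubVersionExists = false; // the situation in this bug
        if (calledFromRunVm || newerTemplateSubVersionExists) { // faulty: should depend on the template check only
            removeVmAndReAddFromTemplate(vmId);
        }
        restoreStatelessSnapshot(vmId);
    }

    private boolean hasLeftoverStatelessSnapshot(UUID vmId) { return true; } // stub for the sketch
    private void removeVmAndReAddFromTemplate(UUID vmId) { /* UpdateVmVersionCommand-style remove + add */ }
    private void restoreStatelessSnapshot(UUID vmId) { /* RestoreStatelessVmCommand-style restore */ }
}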

---------------------------------------------------------------------------

* The problem with the invalid timezone that prevents adding the updated VM was already solved by bz 1087917, so in 3.5 the VM won't disappear.

* Part of the problem is that ProcessVmOnDownCommand was called by RunVmCommand instead of by the monitoring. That happens because of the frequency at which the prestarted-VM monitoring (VmPoolMonitor) is invoked: the default setting is 5 minutes, but I see in the logs that it is sometimes invoked 3-4 times in a minute for several pools.
Tomas, I don't understand - VmPoolMonitorIntervalInMinutes is in minutes, so how is the frequency so high? If it is set to something like 0.3, then it is like setting it to 0, which means it is invoked in a loop - and that's bad. Maybe we should change the setting to milliseconds if that's the case. (A tiny sketch of the truncation I mean follows below.)
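
For illustration, a minimal sketch of that truncation, under the assumption that the fractional value ends up being read as a whole number of minutes (placeholder code, not the actual engine scheduler):

public class PoolMonitorIntervalSketch {
    public static void main(String[] args) {
        String configuredValue = "0.3"; // hypothetical VmPoolMonitorIntervalInMinutes value
        long intervalMinutes = (long) Double.parseDouble(configuredValue); // truncates to 0
        System.out.println("effective interval: " + intervalMinutes + " minute(s)");
        // A zero-minute period means the pool monitor is rescheduled with effectively no
        // delay, i.e. it runs in a tight loop - consistent with 3-4 invocations per minute.
    }
}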

I'll make a fix for the case where ProcessVmOnDown is called from RunVmCommand and that should solve the problem.

Comment 6 Tomas Dosek 2014-10-13 07:18:13 UTC
With regard to the first issue, we have a patch so I could build a test package, but I prefer to wait for the second one to be sure we cover both scenarios.

Comment 9 Ilanit Stein 2014-11-13 14:17:37 UTC
Tested on vt9:

Steps to reproduce:
1. Have a pool of 10 pre-started VMs
2. Reboot hypervisor they run on
These steps were run 3 times.

Results:
Following the host reboot, the VMs in the pool went to Unknown status. After the host came back up, all VMs became Up again.

Comment 11 errata-xmlrpc 2015-02-11 18:09:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html