Description of problem:
Prestarted VMs disappear from the UI after a failure to restore the snapshot once the VM turns from Unknown status to Down. These VMs get deleted by the Noninteractive internal user.

Version-Release number of selected component (if applicable):
3.4.2

How reproducible:
Specific scenario

Steps to Reproduce:
1. Have a pool of prestarted VMs with the "latest" template versioning policy
2. Cause them to go to Unknown status (e.g. cut off the storage they run on)
3. Reboot the hypervisor they run on
4. Start the storage again (you might need to update the VM status from 7 to 0 in the engine database; see the sketch after the log excerpt below)

Actual results:
The VM gets deleted after the next attempt to run it.

Expected results:
The VM should restore its snapshot and run.

Additional info:
Full logcollector, etc. coming in the next comment.

Engine events:

engine.log:82010:2014-10-02 11:47:10,995 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (org.ovirt.thread.pool-4-thread-27) [56c2bdf2] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = Unknown, exit status = Normal), log id: 67d3e891
engine.log:84702:2014-10-02 11:52:36,096 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-93) [2bdc9616] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) is running in db and not running in VDS RHVH-8_ucs3b3
engine.log:100314:2014-10-02 12:22:14,675 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 43c6b8d1
engine.log:100349:2014-10-02 12:22:17,662 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Lock Acquired to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100358:2014-10-02 12:22:18,288 INFO [org.ovirt.engine.core.vdsbroker.IsVmDuringInitiatingVDSCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] START, IsVmDuringInitiatingVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a), log id: 47d61ab2
engine.log:100393:2014-10-02 12:22:20,569 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] Running command: RunVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100405:2014-10-02 12:22:21,073 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-70) [18f18e94] VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) already contains stateless snapshot, removing it
engine.log:100412:2014-10-02 12:22:21,201 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (DefaultQuartzScheduler_Worker-70) [1d40ac73] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100414:2014-10-02 12:22:21,365 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (DefaultQuartzScheduler_Worker-70) [33cb1992] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100415:2014-10-02 12:22:21,467 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (DefaultQuartzScheduler_Worker-70) [15482126] Running command: UpdateVmVersionCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100420:2014-10-02 12:22:21,872 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (org.ovirt.thread.pool-4-thread-51) [48c849b2] Deleting snapshot for stateless vm 8c8ecd90-f5b1-4950-a773-91cf3016c28a
engine.log:100423:2014-10-02 12:22:22,103 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [9b33044] Running command: RestoreStatelessVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100425:2014-10-02 12:22:22,320 INFO [org.ovirt.engine.core.bll.UpdateVmVersionCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100429:2014-10-02 12:22:22,321 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (org.ovirt.thread.pool-4-thread-51) [72006e4d] Couldnt update VM 34TASUR401-32 (8c8ecd90-f5b1-4950-a773-91cf3016c28a) version from its template, continue with restoring stateless snapshot.
engine.log:100430:2014-10-02 12:22:22,544 INFO [org.ovirt.engine.core.bll.RestoreAllSnapshotsCommand] (org.ovirt.thread.pool-4-thread-51) [4f7f9ee3] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100435:2014-10-02 12:22:22,590 INFO [org.ovirt.engine.core.bll.RemoveVmFromPoolCommand] (DefaultQuartzScheduler_Worker-70) [77b4cc4] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100455:2014-10-02 12:22:23,759 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Running command: RemoveVmCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:100456:2014-10-02 12:22:23,760 INFO [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] START, SetVmStatusVDSCommand( vmId = 8c8ecd90-f5b1-4950-a773-91cf3016c28a, status = ImageLocked, exit status = Normal), log id: b5b180e
engine.log:100462:2014-10-02 12:22:23,992 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Trying to release exclusive lock which does not exist, lock key: 8c8ecd90-f5b1-4950-a773-91cf3016c28aVM
engine.log:100463:2014-10-02 12:22:23,992 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (DefaultQuartzScheduler_Worker-70) [1b2e3bce] Lock freed to object EngineLock [exclusiveLocks= key: 8c8ecd90-f5b1-4950-a773-91cf3016c28a value: VM
engine.log:100469:2014-10-02 12:22:24,472 INFO [org.ovirt.engine.core.bll.RemoveAllVmImagesCommand] (DefaultQuartzScheduler_Worker-70) [51d6ba4] Running command: RemoveAllVmImagesCommand internal: true. Entities affected : ID: 8c8ecd90-f5b1-4950-a773-91cf3016c28a Type: VM
engine.log:101215:2014-10-02 12:23:33,611 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-70) [71c47f8
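Regarding step 4 above: the status reset can be done directly against the engine database. The following is a minimal sketch, not an official procedure: the connection URL, credentials, and the VM id (taken from the log excerpt) are placeholders, and it assumes the PostgreSQL JDBC driver is on the classpath. Per the report, vm_dynamic.status 7 corresponds to Unknown and 0 to Down.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class ResetUnknownVmStatus {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the engine database; adjust to your setup.
        String url = "jdbc:postgresql://localhost:5432/engine";
        String user = "engine";
        String password = "changeme";

        // Example VM id from the log excerpt above.
        UUID vmId = UUID.fromString("8c8ecd90-f5b1-4950-a773-91cf3016c28a");

        // vm_dynamic.status holds the runtime state of the VM;
        // the bug report maps 7 to Unknown and 0 to Down.
        String sql = "UPDATE vm_dynamic SET status = 0 WHERE vm_guid = ? AND status = 7";

        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setObject(1, vmId);
            int updated = ps.executeUpdate();
            System.out.println("rows updated: " + updated);
        }
    }
}

Stopping ovirt-engine and taking a database backup before applying such a change would be the prudent approach.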
A simple test was made. This looks like some kind of automatic behavior: if the VMs are marked as protected against deletion, they end up being detached from their pools instead of being deleted.
First, I'll describe the flow:
1. The host that contained the prestarted VM was rebooted.
2. When the host was up again, the monitoring detected that none of the VMs that had run on it exist anymore.
3. The mechanism for 'cleaning up' VMs that went down can be invoked from 2 places:
   a. from the monitoring: the monitoring detects that the VM crashed and invokes ProcessVmOnDownCommand
   b. from the run command: when trying to run a VM, we invoke ProcessVmOnDownCommand if the VM already has a stateless snapshot
   In this case ProcessVmOnDownCommand was called by the second flow (b).
4. There is a bug: when ProcessVmOnDownCommand is called from within RunVmCommand, the version is updated even when it is not supposed to be updated (as in this case, where there was no sub-version for the template the pool is based on). As part of the version update (UpdateVmVersionCommand), the VM is removed and added again - but when ProcessVmOnDownCommand is called from RunVmCommand, the VM is added based on the BLANK template instead of the correct template (another bug).
5. The removal of the VM succeeds, but the add command (AddVmAndAttachToPoolCommand) fails because of an invalid timezone in the BLANK template:

2014-10-02 12:24:39,176 WARN [org.ovirt.engine.core.bll.AddVmAndAttachToPoolCommand] (org.ovirt.thread.pool-4-thread-55) [2d0d06c5] CanDoAction of action AddVmAndAttachToPool failed. Reasons:VAR__ACTION__ADD,VAR__TYPE__VM,ACTION_TYPE_FAILED_INVALID_TIMEZONE

---------------------------------------------------------------------------

* The problem with the invalid timezone which prevents adding the updated VM was already solved by bz 1087917, so in 3.5 the VM won't disappear.
* Part of the problem is that ProcessVmOnDownCommand was called by RunVmCommand instead of by the monitoring. That happens because of the frequency at which the prestarted-VM monitoring (VmPoolMonitor) is invoked: the default setting is 5 minutes, but I see in the logs that it is sometimes invoked 3-4 times a minute for several pools. Tomas, I don't understand - VmPoolMonitorIntervalInMinutes is in minutes, so how is the frequency so high? If it is set to something like 0.3, that is like setting it to 0, which means it is invoked in a loop - and that's bad. Maybe we should express it in milliseconds if that's the case (see the sketch below).

I'll make a fix for the case where ProcessVmOnDownCommand is called from RunVmCommand, and that should solve the problem.
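To illustrate the interval question above, here is a minimal standalone sketch, not engine code: it only shows how a hypothetical sub-minute value such as 0.3 behaves when the interval is treated as a whole number of minutes versus converted to milliseconds.

import java.util.concurrent.TimeUnit;

public class PoolMonitorIntervalSketch {
    public static void main(String[] args) {
        // Hypothetical admin-supplied value for VmPoolMonitorIntervalInMinutes.
        double configuredMinutes = 0.3;

        // Treated as a whole number of minutes, the value truncates to 0,
        // which would reschedule the monitor with no delay, i.e. a tight loop.
        long wholeMinutes = (long) configuredMinutes;
        System.out.println("interval in whole minutes: " + wholeMinutes);   // 0

        // Converting the same value to milliseconds preserves the intent
        // of a sub-minute interval (0.3 minutes = 18000 ms).
        long millis = (long) (configuredMinutes * TimeUnit.MINUTES.toMillis(1));
        System.out.println("interval in milliseconds: " + millis);          // 18000
    }
}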
With regard to the first issue, we have a patch, so I could build a test package, but I prefer to wait for the second one to be sure we cover both scenarios.
Tested on vt9.

Steps to reproduce:
1. Have a pool of 10 pre-started VMs
2. Reboot the hypervisor they run on

These steps were run 3 times.

Results: Following the host reboot, the VMs in the pool go to Unknown status. After the host came back up, all VMs became Up.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html