Bug 1018229 - Starting some of VMs to meet prestarted value of a pool fails because of snapshot issue
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Moti Asayag
QA Contact: Meni Yakove
URL:
Whiteboard: network
Depends On:
Blocks: 3.3snap2
 
Reported: 2013-10-11 13:59 UTC by Jiri Belka
Modified: 2016-02-10 19:55 UTC (History)

Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
part 00 of engine-log-collector (5.93 MB, application/x-xz)
2013-10-11 13:59 UTC, Jiri Belka
part 01 of engine-log-collector (12.32 MB, application/x-xz)
2013-10-11 14:01 UTC, Jiri Belka


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 20718 0 None None None Never
oVirt gerrit 20774 0 None None None Never

Description Jiri Belka 2013-10-11 13:59:06 UTC
Created attachment 811136 [details]
part 00 of engine-log-collector

Description of problem:
In an automatic pool with prestarted VMs, when some of the VMs are assigned to (taken by) a user, the engine tries to satisfy the pool's prestarted VMs value: it starts new VMs so that the number of non-assigned (non-taken) VMs that are ready equals the pool's prestarted VMs value.

Example:

* pool with 6 VMs, 3 prestarted
* 3 VMs are taken
* engine sees prestarted VMs value is not met, thus starting 3 VMs
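
The arithmetic behind this example can be sketched as follows. This is only a minimal illustration in Java, not the engine's actual pool monitoring code; the `PoolVm` class, the `missingPrestarted` method, and the names of the three taken VMs are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a pool VM; not an ovirt-engine class.
final class PoolVm {
    final String name;
    final boolean running;
    final boolean assignedToUser;

    PoolVm(String name, boolean running, boolean assignedToUser) {
        this.name = name;
        this.running = running;
        this.assignedToUser = assignedToUser;
    }
}

final class PrestartSketch {
    // How many more VMs must be started so that `prestarted`
    // non-assigned VMs are running.
    static int missingPrestarted(List<PoolVm> pool, int prestarted) {
        int ready = 0;      // running and not taken by any user
        int startable = 0;  // down and not taken, so available for prestarting
        for (PoolVm vm : pool) {
            if (vm.assignedToUser) {
                continue;   // taken VMs do not count toward the prestarted value
            }
            if (vm.running) {
                ready++;
            } else {
                startable++;
            }
        }
        int deficit = Math.max(0, prestarted - ready);
        return Math.min(deficit, startable);
    }

    public static void main(String[] args) {
        // 6 VMs, 3 prestarted; the 3 running VMs were taken by the user.
        // "taken-a/b/c" are placeholder names for the taken VMs.
        List<PoolVm> pool = Arrays.asList(
            new PoolVm("taken-a", true, true),
            new PoolVm("taken-b", true, true),
            new PoolVm("taken-c", true, true),
            new PoolVm("test-1", false, false),
            new PoolVm("test-2", false, false),
            new PoolVm("test-4", false, false));
        System.out.println(missingPrestarted(pool, 3)); // prints 3
    }
}
```

With all three running VMs taken, no non-assigned VM is ready, so the deficit equals the full prestarted value of 3.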

This logic works, but some of these VMs (not all of them) fail to start.
The failure is caused by an issue with snapshots. As you can see, test-2 and test-4 were started OK, but test-1 failed.

-%-
2013-Oct-11, 15:42
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:37
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:34
User admin@internal logged in.
	
2013-Oct-11, 15:32
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:27
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.

2013-Oct-11, 15:23
VM test-2 was restarted on Host dell-r210ii-03 as stateless
	
2013-Oct-11, 15:23
VM test-4 was restarted on Host dell-r210ii-03 as stateless
	
2013-Oct-11, 15:22
Starting VM test-4 as stateless was initiated.
	
2013-Oct-11, 15:22
Starting VM test-2 as stateless was initiated.
	
2013-Oct-11, 15:22
Starting VM test-1 as stateless was initiated.
-%-

The number of tasks in the Admin Portal keeps increasing:

-%-
engine=# select action_type,description,status,start_time from job where status <> 'FINISHED';
    action_type     |     description     | status  |         start_time         
--------------------+---------------------+---------+----------------------------
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:42:30.586+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:47:31.084+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:27:29.038+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:32:29.313+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:47:31.267+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:32:29.399+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:37:29.891+02
 RunVm              | Launching VM test-1 | FAILED  | 2013-10-11 15:22:27.066+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:37:30.076+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:42:30.493+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:27:28.92+02
(11 rows)
-%-

Version-Release number of selected component (if applicable):
is18
vdsm-4.13.0-0.2.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have an automatic pool with 6 VMs, 3 prestarted, and a maximum of 6 VMs per user.
2. Assign this pool to a user (UserRole).
3. Wait until all 3 VMs are prestarted.
4. Log in as the user to the User Portal.
5. Take/assign a VM (click the '>' icon) on the pool's main icon. Do that 3 times!
6. Now the user has 3 VMs assigned.
7. Wait a while until the engine starts 3 new VMs (to fulfill the prestarted VMs value).

Actual results:
Some of the new prestarted VMs fail to start.

Expected results:
All new prestarted VMs start successfully.

Additional info:

Comment 1 Jiri Belka 2013-10-11 14:01:19 UTC
Created attachment 811137 [details]
part 01 of engine-log-collector

Comment 2 Arik 2013-10-30 09:04:20 UTC
The problem here is that we assume the VM is attached to a user when restoring a snapshot (this was added recently to retrieve network-related information), but when a prestarted VM fails to run, we restore the previous active snapshot while the VM is not attached to any user, and that causes an NPE.
So we need to address this case as well to prevent the NPE.

The scenario is as follows:
1. The VmPoolMonitor job tries to run a VM
2. Stateless snapshot is created for the VM
3. Then we try to run the VM
4. We can't find a suitable host to run the VM on
5. As part of the roll-back we try to restore the previous active snapshot
6. The restore snapshot operation fails due to NPE on "getCurrentUser().getId()" (at RestoreAllSnapshotsCommand#restoreConfiguration)
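
The failure in step 6 and one possible way to guard against it can be sketched as below. This is a minimal, self-contained illustration and not the actual engine code or the merged patch; `SketchUser` and `RestoreSketch` are hypothetical stand-ins, and only the `getCurrentUser().getId()` call chain is taken from the description above.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical stand-in for the engine's user type.
final class SketchUser {
    private final UUID id;

    SketchUser(UUID id) {
        this.id = id;
    }

    UUID getId() {
        return id;
    }
}

// Hypothetical stand-in for a restore command; not RestoreAllSnapshotsCommand.
final class RestoreSketch {
    // null when no user is attached, as in the prestarted VM case above
    private final SketchUser currentUser;

    RestoreSketch(SketchUser currentUser) {
        this.currentUser = currentUser;
    }

    SketchUser getCurrentUser() {
        return currentUser;
    }

    // Unguarded variant: throws NullPointerException when no user is attached.
    UUID userIdUnsafe() {
        return getCurrentUser().getId();
    }

    // Guarded variant: returns an empty Optional instead of throwing.
    Optional<UUID> userIdSafe() {
        return Optional.ofNullable(getCurrentUser()).map(SketchUser::getId);
    }

    public static void main(String[] args) {
        RestoreSketch rollback = new RestoreSketch(null);
        System.out.println(rollback.userIdSafe().isPresent()); // prints false
        try {
            rollback.userIdUnsafe();
        } catch (NullPointerException e) {
            System.out.println("NPE"); // prints NPE
        }
    }
}
```

Whether the real fix skips the user-dependent lookup or substitutes a default is not stated here; the sketch only shows that the restore path has to tolerate a missing user.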

Comment 3 Meni Yakove 2013-11-10 12:18:13 UTC
rhevm-3.3.0-0.32.beta1.el6ev.noarch

Comment 4 Itamar Heim 2014-01-21 22:26:50 UTC
Closing - RHEV 3.3 Released

Comment 5 Itamar Heim 2014-01-21 22:30:07 UTC
Closing - RHEV 3.3 Released

