Bug 1018229 - Starting some of VMs to meet prestarted value of a pool fails because of snapshot issue
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Moti Asayag
QA Contact: Meni Yakove
Whiteboard: network
Depends On:
Blocks: 3.3snap2
Reported: 2013-10-11 09:59 EDT by Jiri Belka
Modified: 2016-02-10 14:55 EST (History)
CC List: 10 users

See Also:
Fixed In Version: is22
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
part 00 of engine-log-collector (5.93 MB, application/x-xz), 2013-10-11 09:59 EDT, Jiri Belka
part 01 of engine-log-collector (12.32 MB, application/x-xz), 2013-10-11 10:01 EDT, Jiri Belka


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 20718 None None None Never
oVirt gerrit 20774 None None None Never

Description Jiri Belka 2013-10-11 09:59:06 EDT
Created attachment 811136 [details]
part 00 of engine-log-collector

Description of problem:
With an automatic pool that has prestarted VMs, when some of the VMs are assigned to (taken by) a user, the engine tries to satisfy the pool's prestarted value: it starts new VMs so that the number of unassigned, ready VMs again equals the pool's prestarted value.

Example:

* pool with 6 VMs, 3 prestarted
* 3 VMs are taken
* engine sees prestarted VMs value is not met, thus starting 3 VMs

This logic works, but starting some of these VMs fails (not all of them). The failures are caused by a snapshot issue. As the events below show, test-2 and test-4 started fine, but test-1 failed.

-%-
2013-Oct-11, 15:42
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:37
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:34
User admin@internal logged in.
	
2013-Oct-11, 15:32
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
	
2013-Oct-11, 15:27
Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.

2013-Oct-11, 15:23
VM test-2 was restarted on Host dell-r210ii-03 as stateless
	
2013-Oct-11, 15:23
VM test-4 was restarted on Host dell-r210ii-03 as stateless
	
2013-Oct-11, 15:22
Starting VM test-4 as stateless was initiated.
	
2013-Oct-11, 15:22
Starting VM test-2 as stateless was initiated.
	
2013-Oct-11, 15:22
Starting VM test-1 as stateless was initiated.
-%-

The number of unfinished tasks in the Admin Portal keeps increasing:

-%-
engine=# select action_type,description,status,start_time from job where status <> 'FINISHED';
    action_type     |     description     | status  |         start_time         
--------------------+---------------------+---------+----------------------------
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:42:30.586+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:47:31.084+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:27:29.038+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:32:29.313+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:47:31.267+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:32:29.399+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:37:29.891+02
 RunVm              | Launching VM test-1 | FAILED  | 2013-10-11 15:22:27.066+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:37:30.076+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:42:30.493+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:27:28.92+02
(11 rows)
-%-

Version-Release number of selected component (if applicable):
is18
vdsm-4.13.0-0.2.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have an automatic pool with 6 VMs, 3 prestarted, and a per-user limit of 6 VMs.
2. Assign this pool to a user (UserRole).
3. Wait until all 3 prestarted VMs are running.
4. Log in as the user to the User Portal.
5. Take a VM from the pool (click the '>' icon on the pool's main icon) - repeat 3 times.
6. The user now has 3 VMs assigned.
7. Wait a while and observe the engine initiating the start of 3 new VMs (to fulfill the prestarted VMs value).

Actual results:
Some of the newly prestarted VMs fail to start.

Expected results:
All newly prestarted VMs start successfully.

Additional info:
Comment 1 Jiri Belka 2013-10-11 10:01:19 EDT
Created attachment 811137 [details]
part 01 of engine-log-collector
Comment 2 Arik 2013-10-30 05:04:20 EDT
The problem here is that when restoring a snapshot we assume the VM is attached to a user (this assumption was added recently, in order to retrieve network-related information). But when a prestarted VM fails to run, we restore the previous active snapshot while the VM is not attached to any user, and that causes an NPE.
So we need to handle this case as well to prevent the NPE.

The scenario is as follows:
1. The VmPoolMonitor job tries to run a VM
2. Stateless snapshot is created for the VM
3. Then we try to run the VM
4. We can't find a suitable host to run the VM on
5. As part of the roll-back we try to restore the previous active snapshot
6. The restore snapshot operation fails due to NPE on "getCurrentUser().getId()" (at RestoreAllSnapshotsCommand#restoreConfiguration)
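The failure mode in step 6 can be sketched in a few lines. This is a hypothetical, simplified illustration (not the actual ovirt-engine code; the class and method names other than RestoreAllSnapshotsCommand#restoreConfiguration are invented for the example): the rollback path dereferences the current user, but a run triggered by the VmPoolMonitor has no user session, so the user reference is null.

```java
import java.util.UUID;

public class RestoreSketch {

    // Stand-in for the engine's user object; names are illustrative only.
    static class DbUser {
        private final UUID id;
        DbUser(UUID id) { this.id = id; }
        UUID getId() { return id; }
    }

    // Buggy variant: assumes a logged-in user is always present,
    // mirroring the NPE on getCurrentUser().getId() described above.
    static UUID restoreConfigurationBuggy(DbUser currentUser) {
        return currentUser.getId(); // NPE when the VmPoolMonitor triggers the rollback
    }

    // Guarded variant: tolerates a missing user, which is the kind of
    // handling comment 2 says the engine needs for this case.
    static UUID restoreConfigurationFixed(DbUser currentUser) {
        return currentUser != null ? currentUser.getId() : null;
    }

    public static void main(String[] args) {
        // User-initiated restore: both variants return the user's id.
        DbUser user = new DbUser(UUID.randomUUID());
        assert restoreConfigurationBuggy(user).equals(user.getId());
        assert restoreConfigurationFixed(user).equals(user.getId());

        // Pool-monitor-initiated restore: no user is attached.
        boolean npeThrown = false;
        try {
            restoreConfigurationBuggy(null);
        } catch (NullPointerException e) {
            npeThrown = true;
        }
        assert npeThrown;                               // reproduces the reported NPE
        assert restoreConfigurationFixed(null) == null; // guarded path survives
        System.out.println("ok");
    }
}
```

Run with assertions enabled (`java -ea RestoreSketch`); the null-user call reproduces the NPE while the guarded variant completes, letting the snapshot restore proceed.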
Comment 3 Meni Yakove 2013-11-10 07:18:13 EST
rhevm-3.3.0-0.32.beta1.el6ev.noarch
Comment 4 Itamar Heim 2014-01-21 17:26:50 EST
Closing - RHEV 3.3 Released
Comment 5 Itamar Heim 2014-01-21 17:30:07 EST
Closing - RHEV 3.3 Released
