Created attachment 811136 part 00 of engine-log-collector

Description of problem:
When an automatic pool has prestarted VMs and some of those VMs are taken by (assigned to) a user, the engine tries to satisfy the pool's prestarted VMs value: it starts new VMs until the number of running, unassigned VMs again equals that value.

Example:
* pool with 6 VMs, 3 prestarted
* 3 VMs are taken
* engine sees the prestarted VMs value is not met, so it starts 3 more VMs

This logic works, but some of these VMs fail to start (not all of them). The failure is related to the stateless snapshot. As the events below show, test-2 and test-4 started OK, but test-1 failed.

-%-
2013-Oct-11, 15:42 Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
2013-Oct-11, 15:37 Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
2013-Oct-11, 15:34 User admin@internal logged in.
2013-Oct-11, 15:32 Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
2013-Oct-11, 15:27 Failed to start VM test-1, because exist snapshot for stateless state. Snapshot will be deleted.
2013-Oct-11, 15:23 VM test-2 was restarted on Host dell-r210ii-03 as stateless
2013-Oct-11, 15:23 VM test-4 was restarted on Host dell-r210ii-03 as stateless
2013-Oct-11, 15:22 Starting VM test-4 as stateless was initiated.
2013-Oct-11, 15:22 Starting VM test-2 as stateless was initiated.
2013-Oct-11, 15:22 Starting VM test-1 as stateless was initiated.
-%-

The number of tasks in the Admin Portal keeps increasing...
-%-
engine=# select action_type,description,status,start_time from job where status <> 'FINISHED';
    action_type     |     description     | status  |         start_time
--------------------+---------------------+---------+----------------------------
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:42:30.586+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:47:31.084+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:27:29.038+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:32:29.313+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:47:31.267+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:32:29.399+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:37:29.891+02
 RunVm              | Launching VM test-1 | FAILED  | 2013-10-11 15:22:27.066+02
 RestoreStatelessVm | RestoreStatelessVm  | FAILED  | 2013-10-11 15:37:30.076+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:42:30.493+02
 RunVm              | Launching VM test-1 | STARTED | 2013-10-11 15:27:28.92+02
(11 rows)
-%-

Version-Release number of selected component (if applicable):
is18
vdsm-4.13.0-0.2.beta1.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have an automatic pool with 6 VMs, 3 prestarted, and a maximum of 6 VMs per user
2. Assign this pool to a user (UserRole)
3. Wait until all 3 VMs are prestarted
4. Log in as the user to the User Portal
5. Take/assign a VM (click the '>' icon) on the pool's main icon, and do that 3 times
6. The user now has 3 VMs assigned
7. Wait a while for the engine to start 3 new VMs (to fulfill the prestarted VMs value)

Actual results:
Some of the newly prestarted VMs fail to start.

Expected results:
All newly prestarted VMs start successfully.

Additional info:
Created attachment 811137 part 01 of engine-log-collector
The problem here is that we assume the VM is attached to a user when restoring a snapshot (this assumption was added recently in order to retrieve network-related information). However, when a prestarted VM fails to run, we restore the previous active snapshot while the VM is not attached to any user, and that causes an NPE. So we need to handle this case as well to prevent the NPE.

The scenario is as follows:
1. The VmPoolMonitor job tries to run a VM
2. A stateless snapshot is created for the VM
3. We then try to run the VM
4. We can't find a suitable host to run the VM on
5. As part of the rollback, we try to restore the previous active snapshot
6. The restore snapshot operation fails due to an NPE on "getCurrentUser().getId()" (at RestoreAllSnapshotsCommand#restoreConfiguration)
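
To illustrate the kind of guard this case needs, here is a minimal, self-contained sketch. It is not the engine's code or the actual fix: the `User` class, the `currentUserId` helper, and the returned strings are hypothetical stand-ins, and only the idea of tolerating a missing current user during the restore is taken from the analysis above.

```java
import java.util.Optional;
import java.util.UUID;

public class RestoreSnapshotSketch {

    /** Hypothetical stand-in for the engine's user object. */
    static final class User {
        private final UUID id;

        User(UUID id) {
            this.id = id;
        }

        UUID getId() {
            return id;
        }
    }

    /**
     * Returns the user's ID, or an empty Optional when there is no current
     * user (for example, when the restore is triggered by a background job
     * rather than by a logged-in user).
     */
    static Optional<UUID> currentUserId(User currentUser) {
        return Optional.ofNullable(currentUser).map(User::getId);
    }

    /**
     * Hypothetical restore step. Instead of calling getId() on a possibly
     * null user, it branches on whether a user is present and reports
     * which path was taken.
     */
    static String restoreConfiguration(User currentUser) {
        return currentUserId(currentUser)
                .map(id -> "restore with user-specific network info for " + id)
                .orElse("restore without user-specific network info");
    }

    public static void main(String[] args) {
        // No user attached, as in the prestarted-VM rollback scenario.
        System.out.println(restoreConfiguration(null));
        // A user is attached, as in the regular user-initiated flow.
        System.out.println(restoreConfiguration(new User(UUID.randomUUID())));
    }
}
```

With a guard of this shape, the rollback in step 5 can complete for a prestarted VM that has no user, while the user-initiated flow still gets the user ID it expects.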
rhevm-3.3.0-0.32.beta1.el6ev.noarch
Closing - RHEV 3.3 Released