Description of problem:
When pre-starting VMs from pool in larger scale (Pools of 30VMs 20 of which is set to be pre-started - 100 such pools being shut down and prestarted every 30 minutes) can cause tasks to get stuck on hypervisor side in finished state
(the task is removal of stateless snapshot).
Since the task is blocking one the VM cannot be pre-started in next round automatically. It can't be even started manually until user goes and cleans the tasks from storage and restarts vdsm.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have large amount of pools with prestarted VMs (100 pools, 30 VMs in pool, 20 are prestarted)
2. Shut down all the VMs in the pool and let them prestart repeatedly (each 30 minutes)
Majority of VMs will not get pre-started after few rounds, tasks stuck on vdsm side until they are deleted from storage task repository and vdsm is restarted.
Should work just fine, the tasks should not get stuck.
Logcollector on its way most likely to be shared privately due to limitations.
Are the (In reply to Tomas Dosek from comment #0)
> Description of problem:
> When pre-starting VMs from pool in larger scale (Pools of 30VMs 20 of which
> is set to be pre-started - 100 such pools being shut down and prestarted
> every 30 minutes) can cause tasks to get stuck on hypervisor side in
> finished state
> (the task is removal of stateless snapshot).
> Since the task is blocking one the VM cannot be pre-started in next round
> automatically. It can't be even started manually until user goes and cleans
> the tasks from storage and restarts vdsm.
Uncleared tasks on the vdsm side shouldn't affect VMs. Is engine blocking the VM pre-start? If not can you provide the relevant vdsm log messages?
Relevant message in engine logs:
2014-04-09 12:15:08,244 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-8) [f8c2040] Correlation ID: ca038b6, Job ID: 2fc9beb2-308b-4b76-a652-7b19223be2f1, Call Stack: null, Custom Event ID: -1, Message: Failed to start VM test-3-22, because exist snapshot for stateless state. Snapshot will be deleted.
On vdsm side there is no message indeed because the above mentioned message from engine is from the validator, not from vdsm.
The question is more of "chicken-egg" trouble.
What engine does is simply - is there a task related to guest on any hypervisor? If so is the task blocking? In case these two conditions are met, validator will not allow starting the VM.
The trouble here is - the task exists on hypervisor (even though the state is FINISHED, success) - so from my point of view there are even 2 bugs to fix.
A) The task is stuck in vdsm
B) Engine should also check for state of the task and not only blindly decide
based whether task exists or not
The main problem here is the NPE we get when trying to run stateless VM.
Basically what happens is that we invoke the create (stateless) snapshot operation, the tasks are initiated in VDSM, and when they are finished the engine fails to invoke the create VM operation so:
1. The VM doesn't start
2. The task for the create snapshot remains (Tomas, it was the create-snapshot tasks that you found "stuck" in VDSM right?)
The engine fails due to NPE while trying to add a step for the create VM operation, which is very similar to what happened in . I believe it is the same issue because:
1. It doesn't happen for all the VMs which means that there is a race, and there was a race in the other issue where the engine sometimes removed jobs too early
2. I see in the logs that the bug happened in 3.3.1 as the other bug
Because of this problem we end up with a VM that is down and contains stateless snapshot after trying to run VM as stateless. So the next time you try to run it, you'll get the error which was mentioned in comment #3 - the VM won't start, only the stateless snapshot will be removed (and bz 1072375 was opened to improve this flow, but currently that's how it should work).
The next time you'll try to run the VM (after the stateless snapshot was removed) - it should work even without removing the stuck task in VDSM. Tomas, did you try to run it again without clearing the task?
So it should be solved by the fix for .
Right the task for creating snapshot is the one that gets stuck. And we get following messages afterwards:
Message: Failed to start VM test-3-22, because exist snapshot for stateless state. Snapshot will be deleted.
Let me build a test build based on the bugfix you mention and provide it to the customer. If it fixes the issue we can close this one.
(In reply to Arik from comment #9)
bug 1083423 is the 3.3.3 child of bug 1076246
Once confirmed we can hopefully close this one as a duplicate
The bugfix mentioned above solved the original problem., but exposed other issues:
1. We found error messages like:
ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Failed in GetVmStatsVDS method
ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Error code noVM and error message VDSGenericException: VDSErrorExceptio
n: Failed to GetVmStatsVDS, error = Virtual machine does not exist
INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand
Those errors are generated because of the following scenario:
- VM is shut down (by the user within the guest)
- The VM process is getting down
- VDSM is doing disk related operations
- At this point, before the VM is removed from VDSM's containers, the VM is included in the list of VMs that VDSM send to the engine (because it is contained in the containers) but when we try to get its stats we get the error mentioned above
- The VM is removed from the VDSM's containers
- The engine detects that the VM went down
The vmGetStats call fails because of tries to get balloon-info, VDSM tries to interact with libvirt to get this info, gets an error that the domain doesn't exist and convert this error to 'noVM' error (BindingXMLRPC#wrapper).
The operation of fetching the balloon-info should reside in the statistics-retrieval thread and not in the statistics-respond thread, and that should solve it.
Note that there's no major harm by this errors, the engine just ignores them.
2. When there are many VMs which are being stopped together on one host, we might get into a situation where we find that stateless snapshot already exists for a VM we're about to run more often:
Lets say we detect that 50 VMs went down, those VMs will be added to a list by the monitoring thread and at the end of the monitoring iteration we'll try to handle them (VmPoolHandler#processVmPoolOnStopVm) one by one (in a synchronized way). It can take more than 8 minutes to process ~35 VMs. So if the prestarted-VMs monitoring job starts before we process all those VMs, and tries to start a VM that wasn't processed, we'll get into a situation where the VM will be started before we managed to remove its last used stateless snapshot.
The solution for this issue should be to do the 'cleanup' of VMs that went down in a non-blocking way, i.e delegate it to separate thread.
Note that the VM won't start in the first attempt, but should be started on the next attempt.
doesn't converge. awaiting results from customer testing….
Created attachment 900752 [details]
we have some preliminary patches for moving out snapshot processing out of monitoring cycle as part of bug 1098791
opened follow-up bugs for application list reporting issues which should take care of vmdynamic update errors. So only the time should remain, all the VMs should eventually succeed. Please let us know if you find any other error
Liran, do you have any clue about this? could this be related to the execute in batch method? inserted values seems to be ok, the 'console_cur_user_name' is NULL, is this an issue? can it be that it prints NULL for bad value?
I run it all on Postgres without any issue. I don't think the batch could be relevant. The only other problem I can think of is JDBC related - did you try to run it from code?
I suspect that the encoding of the database or of template0/template1 is not UTF-8
Can you please attache the result of the following
> psql -U postgres -l
bug 10857760 tracks the wrong locale of the DB
bug 1108675 tracks the pool behavior/delay
bug 1103710 tracks the guest-agent's application list not being filtered correctly (target 3.4.1)