Bug 1085760 - Prestarted VMs in a pool repeatedly shut down and started up at large scale can cause tasks to get stuck in finished state on the hypervisor, blocking the next prestarting round
Summary: Prestarted-VMs in pool repeatedly shut down and started up in large scale can...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Eli Mesika
QA Contact: meital avital
URL:
Whiteboard: infra
Depends On: 1103710
Blocks:
 
Reported: 2014-04-09 09:46 UTC by Tomas Dosek
Modified: 2019-04-28 09:38 UTC (History)
19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-06 11:49:42 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
UpdateVm payload (8.67 MB, text/plain)
2014-05-30 10:54 UTC, Tomas Dosek

Description Tomas Dosek 2014-04-09 09:46:31 UTC
Description of problem:

Pre-starting VMs from pools at larger scale (pools of 30 VMs, 20 of which are set to be pre-started; 100 such pools being shut down and prestarted every 30 minutes) can cause tasks to get stuck on the hypervisor side in finished state (the task is the removal of the stateless snapshot).

Since the task is a blocking one, the VM cannot be pre-started automatically in the next round. It can't even be started manually until the user cleans the tasks from storage and restarts vdsm.

Version-Release number of selected component (if applicable):
3.3.1

How reproducible:
100 %

Steps to Reproduce:
1. Have a large number of pools with prestarted VMs (100 pools, 30 VMs per pool, 20 of them prestarted)
2. Shut down all the VMs in the pools and let them prestart repeatedly (every 30 minutes)

Actual results:
The majority of VMs will not get pre-started after a few rounds; tasks remain stuck on the vdsm side until they are deleted from the storage task repository and vdsm is restarted.

Expected results:
Pre-starting should work just fine; the tasks should not get stuck.

Additional info:
Log collector output is on its way; it will most likely be shared privately due to limitations.

Comment 2 Federico Simoncelli 2014-04-09 22:12:39 UTC
(In reply to Tomas Dosek from comment #0)
> Description of problem:
> 
> When pre-starting VMs from pool in larger scale (Pools of 30VMs 20 of which
> is set to be pre-started - 100 such pools being shut down and prestarted
> every 30 minutes) can cause tasks to get stuck on hypervisor side in
> finished state
> (the task is removal of stateless snapshot).
> 
> Since the task is blocking one the VM cannot be pre-started in next round
> automatically. It can't be even started manually until user goes and cleans
> the tasks from storage and restarts vdsm.

Uncleared tasks on the vdsm side shouldn't affect VMs. Is the engine blocking the VM pre-start? If not, can you provide the relevant vdsm log messages?

Thanks.

Comment 3 Tomas Dosek 2014-04-10 05:29:18 UTC
Relevant message in engine logs:

2014-04-09 12:15:08,244 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-8) [f8c2040] Correlation ID: ca038b6, Job ID: 2fc9beb2-308b-4b76-a652-7b19223be2f1, Call Stack: null, Custom Event ID: -1, Message: Failed to start VM test-3-22, because exist snapshot for stateless state. Snapshot will be deleted.

On the vdsm side there is indeed no message, because the above-mentioned engine message comes from the validator, not from vdsm.

The question is more of a "chicken-and-egg" problem.

What the engine does is simple: is there a task related to the guest on any hypervisor? If so, is the task blocking? When these two conditions are met, the validator will not allow starting the VM.

The trouble here is that the task still exists on the hypervisor (even though its state is FINISHED, success), so from my point of view there are two bugs to fix:
A) The task is stuck in vdsm.
B) The engine should also check the state of the task and not decide blindly
   based on whether the task exists or not.
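
As an illustration of (B), here is a minimal sketch in Java; the type and method names are assumptions for illustration only, not the actual ovirt-engine classes. The point is that a task should block the start of a VM only while it is still running, not merely because a task record exists on a hypervisor:

import java.util.List;

// Hypothetical sketch of the suggested validator check; not ovirt-engine code.
public class RunVmTaskCheck {

    enum TaskStatus { RUNNING, FINISHED, FAILED }

    record HostTask(String vmId, TaskStatus status) {}

    static boolean vmStartBlockedByTasks(String vmId, List<HostTask> hostTasks) {
        return hostTasks.stream()
                .filter(t -> t.vmId().equals(vmId))
                // old behaviour: any matching task record blocks the start;
                // suggested behaviour: only tasks that are still RUNNING block it
                .anyMatch(t -> t.status() == TaskStatus.RUNNING);
    }
}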

Comment 9 Arik 2014-04-13 10:22:25 UTC
The main problem here is the NPE we get when trying to run stateless VM.

Basically what happens is that we invoke the create (stateless) snapshot operation, the tasks are initiated in VDSM, and when they are finished the engine fails to invoke the create VM operation, so:
1. The VM doesn't start
2. The task for the create snapshot remains (Tomas, it was the create-snapshot task that you found "stuck" in VDSM, right?)

The engine fails due to an NPE while trying to add a step for the create VM operation, which is very similar to what happened in [1]. I believe it is the same issue because:
1. It doesn't happen for all the VMs, which means there is a race, and there was a race in the other issue where the engine sometimes removed jobs too early
2. I see in the logs that the bug happened on 3.3.1, like the other bug

Because of this problem we end up with a VM that is down and contains a stateless snapshot after trying to run the VM as stateless. So the next time you try to run it, you'll get the error mentioned in comment #3 - the VM won't start, only the stateless snapshot will be removed (bz 1072375 was opened to improve this flow, but currently that's how it is supposed to work).

The next time you try to run the VM (after the stateless snapshot has been removed), it should work even without removing the stuck task in VDSM. Tomas, did you try to run it again without clearing the task?

So it should be solved by the fix for [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1076246

Comment 10 Tomas Dosek 2014-04-14 05:08:09 UTC
Right, the task for creating the snapshot is the one that gets stuck, and we get the following message afterwards:

Message: Failed to start VM test-3-22, because exist snapshot for stateless state. Snapshot will be deleted.

Let me build a test build based on the bugfix you mentioned and provide it to the customer. If it fixes the issue we can close this one.

Comment 11 Michal Skrivanek 2014-04-14 11:00:49 UTC
(In reply to Arik from comment #9)
bug 1083423 is the 3.3.3 child of bug 1076246
Once confirmed we can hopefully close this one as a duplicate

Comment 12 Tomas Dosek 2014-04-14 11:10:00 UTC
Indeed, indeed.

Comment 21 Arik 2014-05-14 12:21:18 UTC
The bugfix mentioned above solved the original problem, but exposed other issues:

1. We found error messages like:

ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Failed in GetVmStatsVDS method
ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Error code noVM and error message VDSGenericException: VDSErrorException: Failed to GetVmStatsVDS, error = Virtual machine does not exist
INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-39) [3cb2d52] Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmStatsVDSCommand return value

Those errors are generated because of the following scenario:
- The VM is shut down (by the user within the guest)
- The VM process is going down
- VDSM is doing disk-related operations
- At this point, before the VM is removed from VDSM's containers, the VM is included in the list of VMs that VDSM sends to the engine (because it is still contained in the containers), but when we try to get its stats we get the error mentioned above
- The VM is removed from VDSM's containers
- The engine detects that the VM went down

The vmGetStats call fails because it tries to get the balloon info: VDSM tries to interact with libvirt to get this info, gets an error that the domain doesn't exist, and converts this error to a 'noVM' error (BindingXMLRPC#wrapper).

The operation of fetching the balloon-info should reside in the statistics-retrieval thread and not in the statistics-respond thread, and that should solve it.

Note that there's no major harm from these errors; the engine just ignores them.
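
As a rough illustration of that fix idea (VDSM itself is Python; this Java sketch, with invented names, only shows the pattern): the balloon info is sampled and cached by the periodic statistics-retrieval thread, so the thread that answers getVmStats never calls into libvirt and cannot hit a "domain does not exist" error for a VM that is already gone:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the "sample in the retrieval thread, read from a
// cache in the respond thread" pattern; names and interval are illustrative.
public class BalloonStatsSampler {

    private final Map<String, Long> balloonInfoByVm = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sampler =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        // statistics-retrieval side: poll periodically
        sampler.scheduleAtFixedRate(this::sampleAllVms, 0, 15, TimeUnit.SECONDS);
    }

    private void sampleAllVms() {
        // query the hypervisor per running VM; if a domain no longer exists,
        // just drop its cache entry instead of propagating an error to a caller
    }

    // statistics-respond side: getVmStats reads the cached value only
    Long getBalloonInfo(String vmId) {
        return balloonInfoByVm.get(vmId);
    }
}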

2. When many VMs are stopped together on one host, we more often get into a situation where we find that a stateless snapshot already exists for a VM we're about to run:

Let's say we detect that 50 VMs went down. Those VMs will be added to a list by the monitoring thread, and at the end of the monitoring iteration we'll try to handle them (VmPoolHandler#processVmPoolOnStopVm) one by one (in a synchronized way). It can take more than 8 minutes to process ~35 VMs. So if the prestarted-VMs monitoring job starts before we process all those VMs and tries to start a VM that wasn't processed yet, the VM will be started before we managed to remove its last used stateless snapshot.

The solution for this issue should be to do the 'cleanup' of VMs that went down in a non-blocking way, i.e. delegate it to a separate thread.
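
A minimal sketch of such a non-blocking hand-off, assuming a simple executor (the dispatcher class and pool size below are invented for illustration; only VmPoolHandler#processVmPoolOnStopVm is taken from the description above): the monitoring thread just submits each downed VM to a worker pool and returns immediately instead of cleaning them up one by one:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the proposed non-blocking cleanup; not engine code.
public class VmPoolCleanupDispatcher {

    private final ExecutorService cleanupPool = Executors.newFixedThreadPool(8);

    void onVmsWentDown(List<String> vmIds) {
        for (String vmId : vmIds) {
            // delegate the per-VM cleanup to a separate thread so the
            // monitoring iteration is not blocked for minutes
            cleanupPool.submit(() -> processVmPoolOnStopVm(vmId));
        }
    }

    private void processVmPoolOnStopVm(String vmId) {
        // placeholder for the existing cleanup logic
        // (removal of the VM's last used stateless snapshot, etc.)
    }
}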

Note that the VM won't start on the first attempt, but should start on the next one.

Comment 25 Michal Skrivanek 2014-05-26 10:46:25 UTC
Doesn't converge. Awaiting results from customer testing…

Comment 31 Tomas Dosek 2014-05-30 10:54:16 UTC
Created attachment 900752 [details]
UpdateVm payload

Comment 34 Michal Skrivanek 2014-06-02 09:40:50 UTC
We have some preliminary patches for moving snapshot processing out of the monitoring cycle as part of bug 1098791.

Comment 36 Michal Skrivanek 2014-06-02 12:11:34 UTC
Opened follow-up bugs for the application-list reporting issues, which should take care of the vmdynamic update errors. So only the timing issue should remain; all the VMs should eventually succeed. Please let us know if you find any other errors.

Comment 38 Omer Frenkel 2014-06-05 08:17:21 UTC
Liran, do you have any clue about this? Could this be related to the execute-in-batch method? The inserted values seem to be OK; 'console_cur_user_name' is NULL, is this an issue? Can it be that it prints NULL for a bad value?

Comment 39 Liran Zelkha 2014-06-05 18:14:57 UTC
I ran it all on Postgres without any issue. I don't think the batch could be relevant. The only other problem I can think of is JDBC-related; did you try to run it from code?

Comment 40 Eli Mesika 2014-06-08 08:06:45 UTC
I suspect that the encoding of the database or of template0/template1 is not UTF-8.

Can you please attach the output of the following:

> psql -U postgres -l

Comment 48 Michal Skrivanek 2014-06-17 06:12:48 UTC
note:
bug 10857760 tracks the wrong locale of the DB
bug 1108675 tracks the pool behavior/delay
bug 1103710 tracks the guest-agent's application list not being filtered correctly (target 3.4.1)

