Bug 843058 - Can't run a large number of VMs simultaneously; getting the error "Cant find VDS to run the VM".
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified  OS: Unspecified
Priority: unspecified  Severity: high
: ---
: 3.2.0
Assigned To: Roy Golan
vvyazmin@redhat.com
virt
: Reopened
: 927078 (view as bug list)
Depends On:
Blocks: 915537
Reported: 2012-07-25 09:17 EDT by Leonid Natapov
Modified: 2013-06-10 17:08 EDT (History)
12 users (show)

See Also:
Fixed In Version: sf2
Doc Type: Bug Fix
Doc Text:
The pending memory count increases when the RunVm call is issued and decreases when the virtual machine changes to an Up state. Because the count was not decreased until the virtual machine was fully Up, the host's free memory appeared exhausted, which prevented a host from being selected to run further virtual machines. Consequently, a large number of virtual machines could not be run simultaneously. This update implements an interleaving solution in which the pending memory count is monitored, and runs are throttled if there is insufficient memory. Bulk running of virtual machines now succeeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 17:08:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
Attachments
engine log (2.71 MB, text/x-log)
2012-07-25 09:17 EDT, Leonid Natapov
vdsm log (1.09 MB, application/octet-stream)
2012-08-06 06:55 EDT, Leonid Natapov
engine debug log (5.12 MB, text/x-log)
2012-08-08 07:22 EDT, Roy Golan


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 7204 None None None Never

Description Leonid Natapov 2012-07-25 09:17:55 EDT
Created attachment 600296 [details]
engine log

Can't run a large number of VMs simultaneously; getting the error "Cant find VDS to run the VM". I have 20+ VMs that I am trying to run simultaneously. Some VMs turn on and switch to the Powering Up state, but some VMs fail to run. After the powering-up VMs are Up, I can successfully start the VMs that previously failed to run. I can run them one by one and it works fine. In the backend I get the following error:

2012-07-25 16:02:12,134 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-3-thread-43) [40d6b26e] Cant find VDS to run the VM e53f8a2e-4fc0-4d5d-81ea-53135622f577 on, so this VM will not be run.

full engine log attached
Comment 2 Roy Golan 2012-07-25 10:17:36 EDT
Can you specify your setup: number of hosts, and the memory and CPU of the VMs and hosts?
Comment 3 Simon Grinberg 2012-07-26 07:10:35 EDT
(In reply to comment #2)
> can you specify your setup: num of hosts and memory and CPU of the VMS and
> HOSTS

Leonid, could it be that you over-committed memory? If so, it's a known issue. You must wait until KSM kicks in before you can run further VMs.
Comment 6 Roy Golan 2012-08-06 04:37:53 EDT
I'm not sure it's a KSM issue. It could be IO, a timeout on the VDSM semaphore lock for running qemu, etc. Leonid, please specify which VMs you ran and attach the VDSM log.
Comment 8 Leonid Natapov 2012-08-06 06:54:38 EDT
Attaching vdsm.log file.

I am running 1 host in the cluster. The VMs are server machines with no OS.
Comment 9 Leonid Natapov 2012-08-06 06:55:08 EDT
Created attachment 602482 [details]
vdsm log
Comment 10 Roy Golan 2012-08-08 07:22:38 EDT
Created attachment 602998 [details]
engine debug log
Comment 11 Roy Golan 2012-08-08 07:43:21 EDT
The problem is that we increase the pending memory count in RunVm and decrease it only when VdsUpdateRunTimeInfo detects that the VM has gone to Up, so a burst of running VMs will always fail shortly after roughly half of the VMs have started.

One solution I can come up with is to throttle the VM runs so that the *monitoring* can interleave and decrement the pending memory. This probably means a slower flow, because we need a way to fire the monitoring (maybe parts of it, by code sharing?) after every VM run.

Anyway, I find it very bad UX when you have a monster host but you just can't bulk-run a mass of VMs on it.
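The accounting problem above can be illustrated with a toy model (hypothetical numbers, not engine code): while a VM is powering up, its memory is counted both in the host's reported usage and in the still-unreleased pending count, so a burst sized exactly for the host fails at about the halfway point.

```python
# Toy model of the pending-memory double count (not ovirt-engine code).
# Host sized for exactly 10 VMs of 256 MB; pending is only released
# when a VM reaches Up, which never happens during the burst.
TOTAL = 10 * 256
VM_MEM = 256

used = pending = 0
started, failed = [], []
for vm in range(10):
    if TOTAL - used - pending < VM_MEM:
        failed.append(vm)   # "Cant find VDS to run the VM"
        continue
    pending += VM_MEM       # RunVm reserves pending memory
    used += VM_MEM          # qemu starts; host already reports the memory as used
    started.append(vm)      # VM is Powering Up; pending NOT yet released

print(f"started {len(started)} of 10")  # → started 5 of 10
```

Each successfully started VM consumes its memory twice in the free-memory check, so only 5 of the 10 VMs fit before scheduling starts refusing.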
Comment 12 Simon Grinberg 2012-08-08 10:41:04 EDT
(In reply to comment #11)
> The problem is that we are summing the increasing pending memory count from
> the RunVm and decreasing it when VdsUpdateRunTimeInfo detects that the VM
> goes to UP  so a burst running VMs will always fail short after ~ 1/2 of the
> VMs to run.
> 
> one of the solutions I can come with is to throttle the VM run in a way the
> *monitoring* will be able to interleave and decrement the pending memory .
> this means probably slower flow because we need a way to fire the monitoring
> (maybe parts of it by code sharing?) after every VM run?
> 
> Anyway I find it very bad UX when you have a monster Host but you just can't
> bulk run a mass of VMs on it.

There are other consequences of firing up multiple VMs at the same time. For example: a timeout on 'wait for launch' that may happen when you spawn many VMs at once, IO storms when all VMs try to boot from the same shared storage, etc. You need to throttle anyhow.

The solution is to make the creation of the multiple objects asynchronous, and then throttle the actual creation. It's not bad UX; it's a reasonable limitation to prevent the Monday-morning effect. Actually, we have an RFE to do just that; I just can't find it ATM.
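The throttling idea can be sketched with a plain bounded semaphore (illustrative only; the MAX_CONCURRENT limit is made up and the real RFE may work differently): launches are accepted asynchronously, but only a fixed number proceed at once.

```python
import threading
import time

MAX_CONCURRENT = 3  # hypothetical limit, not an actual engine setting
sem = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = 0
peak = 0
done = []

def launch_vm(vm_id):
    """Simulated VM launch: at most MAX_CONCURRENT run at once."""
    global active, peak
    with sem:
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the actual launch work
        with lock:
            active -= 1
            done.append(vm_id)

threads = [threading.Thread(target=launch_vm, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak concurrency: {peak}")  # never exceeds MAX_CONCURRENT
```

All 20 requested launches eventually complete, but the semaphore keeps the concurrent work bounded, which is exactly the "accept everything, throttle the actual creation" shape described above.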
Comment 13 Roy Golan 2012-08-12 10:15:08 EDT
(In reply to comment #12)
> There are other consequences of firing up multiple VMs at the same time. 
> For example - timeout on 'wait for launch' that may happen when you spawn
> many VMs at once, IO storms when all VMs try to boot from the same shared
> storage, etc. You need to throttle anyhow. 
> 
> The solution is to have the creation of multiple object asynchronous, and
> then throttle the actual creation. It's not a bad UX, it's a reasonable
> limitation to prevent Monday morning effect. Actually we have an RFE to do
> just that, I just can find it ATM

I am not sure about the I/O storm you mentioned. I know VDSM has a semaphore for running a VM, with the number of cores as the semaphore count.

Anyhow, my take on this now will be to decrease the pending memory count when the VM status changes to POWERING_UP instead of UP, and to see if this hurries things up.
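A toy model (hypothetical numbers, not engine code) of why releasing the pending reservation at POWERING_UP helps: once the reservation is dropped as soon as the host starts reporting the VM's memory as used, nothing is double-counted and a burst sized exactly for the host fits.

```python
# Same toy host: room for exactly 10 VMs of 256 MB each.
TOTAL = 10 * 256
VM_MEM = 256

used = pending = 0
started = []
for vm in range(10):
    if TOTAL - used - pending < VM_MEM:
        continue            # would fail with "Cant find VDS to run the VM"
    pending += VM_MEM       # RunVm reserves pending memory
    used += VM_MEM          # qemu starts; memory reported as used
    pending -= VM_MEM       # released at POWERING_UP instead of UP
    started.append(vm)

print(f"started {len(started)} of 10")  # → started 10 of 10
```

With the earlier release, the free-memory check sees each VM's memory only once, so the whole burst starts.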
Comment 14 Roy Golan 2012-08-15 04:44:21 EDT
http://gerrit.ovirt.org/#/c/7204/
Comment 24 Doron Fediuck 2013-03-27 07:14:25 EDT
*** Bug 927078 has been marked as a duplicate of this bug. ***
Comment 27 vvyazmin@redhat.com 2013-05-26 16:27:29 EDT
No issues found when running 150 VMs simultaneously (via the Python SDK); each VM has 256 MB RAM.

Verified on RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
Comment 28 errata-xmlrpc 2013-06-10 17:08:06 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html
