Bug 843058

Summary: Can't run a large number of VMs simultaneously; getting error "Cant find VDS to run the VM". Product: Red Hat Enterprise Virtualization Manager
Product: Red Hat Enterprise Virtualization Manager    Reporter: Leonid Natapov <lnatapov>
Component: ovirt-engine    Assignee: Roy Golan <rgolan>
Status: CLOSED ERRATA QA Contact: vvyazmin <vvyazmin>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.1.0    CC: chetan, dyasny, hateya, iheim, lpeer, mhuth, ofrenkel, pstehlik, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Milestone: ---    Keywords: Reopened
Target Release: 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: sf2 Doc Type: Bug Fix
Doc Text:
The pending memory count increases when the RunVm call is issued, and decreases when the virtual machine changes to an Up state. Because the count was not decreased in time during a burst of starts, the accumulated pending memory exceeded the host's free memory, which prevented the host from being selected to run further virtual machines. Consequently, a large number of virtual machines could not be run simultaneously. This update implements an interleaving solution where the pending memory count is monitored and virtual machine starts are throttled if there is insufficient memory. Bulk running of virtual machines now succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-06-10 21:08:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 915537    
Attachments:
  engine log
  vdsm log
  engine debug log

Description Leonid Natapov 2012-07-25 13:17:55 UTC
Created attachment 600296 [details]
engine log

Can't run a large number of VMs simultaneously; getting the error "Cant find VDS to run the VM". I have 20+ VMs that I am trying to start simultaneously. Some VMs turn on and switch to the Powering Up state, but some VMs fail to run. After the Powering Up VMs are Up, I can successfully start the VMs that previously failed to run; I can run them one by one and it works fine. In the backend I get the following error:

2012-07-25 16:02:12,134 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-3-thread-43) [40d6b26e] Cant find VDS to run the VM e53f8a2e-4fc0-4d5d-81ea-53135622f577 on, so this VM will not be run.

Full engine log attached.

Comment 2 Roy Golan 2012-07-25 14:17:36 UTC
Can you specify your setup: number of hosts, and memory and CPU of the VMs and hosts?

Comment 3 Simon Grinberg 2012-07-26 11:10:35 UTC
(In reply to comment #2)
> can you specify your setup: num of hosts and memory and CPU of the VMS and
> HOSTS

Leonid, could it be that you overcommit memory? If so, then it's a known issue; you must wait until KSM kicks in before you can run further VMs.

Comment 6 Roy Golan 2012-08-06 08:37:53 UTC
I'm not sure it's a KSM issue; it could be I/O, a timeout on the VDSM semaphore lock for running qemu, etc. Leonid, please specify which VMs you used and attach the VDSM log.

Comment 8 Leonid Natapov 2012-08-06 10:54:38 UTC
Attaching vdsm.log file.

I am running 1 host in the cluster. The VMs are server machines with no OS installed.

Comment 9 Leonid Natapov 2012-08-06 10:55:08 UTC
Created attachment 602482 [details]
vdsm log

Comment 10 Roy Golan 2012-08-08 11:22:38 UTC
Created attachment 602998 [details]
engine debug log

Comment 11 Roy Golan 2012-08-08 11:43:21 UTC
The problem is that we increase the pending memory count from RunVm and only decrease it when VdsUpdateRunTimeInfo detects that the VM goes to UP, so a burst of running VMs will always fail shortly after roughly half of the VMs have started.

One of the solutions I can come up with is to throttle the VM runs so that the *monitoring* is able to interleave and decrement the pending memory. This probably means a slower flow, because we need a way to fire the monitoring (maybe only parts of it, by code sharing?) after every VM run.

Anyway, I find it very bad UX when you have a monster host but you just can't bulk-run a mass of VMs on it.
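
(For illustration only: a minimal Python sketch of the accounting described above. The real logic lives in the engine's RunVm and monitoring code; the class and function names below are hypothetical.)

# RunVm adds the VM's memory to a pending counter at schedule time, but the
# counter is only decremented when monitoring later sees the VM as UP, so a
# burst of starts exhausts the apparent free memory before any VM is UP.

class Host:
    def __init__(self, free_mem_mb):
        self.free_mem_mb = free_mem_mb
        self.pending_mem_mb = 0

    def can_run(self, vm_mem_mb):
        return self.free_mem_mb - self.pending_mem_mb >= vm_mem_mb


def run_vm(host, vm_mem_mb):
    if not host.can_run(vm_mem_mb):
        raise RuntimeError("Cant find VDS to run the VM")
    host.pending_mem_mb += vm_mem_mb       # incremented by RunVm


def on_vm_up(host, vm_mem_mb):
    host.pending_mem_mb -= vm_mem_mb       # decremented only when monitoring sees UP


host = Host(free_mem_mb=16384)
for i in range(20):                        # burst of 2 GB VMs, no monitoring in between
    try:
        run_vm(host, 2048)
    except RuntimeError as e:
        print("VM %d: %s" % (i, e))        # fails from the 9th VM on, although no VM is UP yet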

Comment 12 Simon Grinberg 2012-08-08 14:41:04 UTC
(In reply to comment #11)
> The problem is that we are summing the increasing pending memory count from
> the RunVm and decreasing it when VdsUpdateRunTimeInfo detects that the VM
> goes to UP  so a burst running VMs will always fail short after ~ 1/2 of the
> VMs to run.
> 
> one of the solutions I can come with is to throttle the VM run in a way the
> *monitoring* will be able to interleave and decrement the pending memory .
> this means probably slower flow because we need a way to fire the monitoring
> (maybe parts of it by code sharing?) after every VM run?
> 
> Anyway I find it very bad UX when you have a monster Host but you just can't
> bulk run a mass of VMs on it.

There are other consequences of firing up multiple VMs at the same time. For example, a timeout on 'wait for launch' may happen when you spawn many VMs at once, I/O storms occur when all VMs try to boot from the same shared storage, etc. You need to throttle anyhow.

The solution is to make the creation of multiple objects asynchronous, and then throttle the actual creation. It's not bad UX, it's a reasonable limitation to prevent the Monday-morning effect. Actually we have an RFE to do just that, I just can't find it ATM.
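
(Illustrative only: a rough Python sketch of the throttled asynchronous start described above. This is not the RFE's actual design; the names and the concurrency limit are made up.)

# Accept all start requests up front, then let only a limited number of
# launches proceed at a time; the pool size acts as the throttle.
import concurrent.futures
import time

MAX_CONCURRENT_STARTS = 5                  # hypothetical throttle value


def launch(vm_name):                       # stand-in for the real start call
    time.sleep(0.1)                        # simulate the launch taking a while
    return "%s started" % vm_name


def start_all(vm_names):
    # Returns immediately with futures; launches run in the background.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT_STARTS)
    return [pool.submit(launch, name) for name in vm_names]


futures = start_all(["vm-%02d" % i for i in range(20)])
for f in concurrent.futures.as_completed(futures):
    print(f.result())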

Comment 13 Roy Golan 2012-08-12 14:15:08 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > The problem is that we are summing the increasing pending memory count from
> > the RunVm and decreasing it when VdsUpdateRunTimeInfo detects that the VM
> > goes to UP  so a burst running VMs will always fail short after ~ 1/2 of the
> > VMs to run.
> > 
> > one of the solutions I can come with is to throttle the VM run in a way the
> > *monitoring* will be able to interleave and decrement the pending memory .
> > this means probably slower flow because we need a way to fire the monitoring
> > (maybe parts of it by code sharing?) after every VM run?
> > 
> > Anyway I find it very bad UX when you have a monster Host but you just can't
> > bulk run a mass of VMs on it.
> 
> There are other consequences of firing up multiple VMs at the same time. 
> For example - timeout on 'wait for launch' that may happen when you spawn
> many VMs at once, IO storms when all VMs try to boot from the same shared
> storage, etc. You need to throttle anyhow. 
> 
> The solution is to have the creation of multiple object asynchronous, and
> then throttle the actual creation. It's not a bad UX, it's a reasonable
> limitation to prevent Monday morning effect. Actually we have an RFE to do
> just that, I just can find it ATM

I am not sure about the I/O storm you mentioned. I know VDSM has a semaphore for running a VM, with the number of cores as the semaphore count.

Anyhow, my take on this now is to decrease the pending memory count when the VM status changes to POWERING_UP instead of UP, and to see if this hurries things up.
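
(Expressed against the hypothetical sketch from comment 11, the proposed change is simply to release pending memory on POWERING_UP rather than waiting for UP.)

# Hypothetical hook called by the monitoring cycle; previously the pending
# memory was released only when new_status == "UP".
def on_vm_status_change(host, vm_mem_mb, new_status):
    if new_status == "POWERING_UP":
        host.pending_mem_mb -= vm_mem_mb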

Comment 14 Roy Golan 2012-08-15 08:44:21 UTC
http://gerrit.ovirt.org/#/c/7204/

Comment 24 Doron Fediuck 2013-03-27 11:14:25 UTC
*** Bug 927078 has been marked as a duplicate of this bug. ***

Comment 27 vvyazmin@redhat.com 2013-05-26 20:27:29 UTC
No issues were found when running 150 VMs simultaneously (via the Python SDK); each VM had 256 MB RAM.

Verified on RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
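
(For reference, a sketch of how a bulk start like this can be driven from the RHEV 3.x Python SDK (ovirtsdk); the URL, credentials and VM names are placeholders, and the 256 MB VMs are assumed to already exist.)

from ovirtsdk.api import API

api = API(url='https://rhevm.example.com/api',
          username='admin@internal', password='password', insecure=True)

for i in range(150):
    vm = api.vms.get(name='test_vm_%03d' % i)
    vm.start()                             # queue all 150 starts back to back

api.disconnect()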

Comment 28 errata-xmlrpc 2013-06-10 21:08:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html