Bug 843058 - Can't run a large number of VMs simultaneously; getting the error "Cant find VDS to run the VM".
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified  OS: Unspecified
Priority: unspecified  Severity: high
: ---
: 3.2.0
Assigned To: Roy Golan
vvyazmin@redhat.com
virt
: Reopened
: 927078 (view as bug list)
Depends On:
Blocks: 915537
Reported: 2012-07-25 09:17 EDT by Leonid Natapov
Modified: 2013-06-10 17:08 EDT (History)
12 users (show)

See Also:
Fixed In Version: sf2
Doc Type: Bug Fix
Doc Text:
The pending memory count increases when the RunVm call is issued and decreases when the virtual machine changes to an Up state. Because the count was not decreased until the virtual machine was fully Up, the host's free memory appeared exhausted, which prevented a host from being selected to run further virtual machines. Consequently, a large number of virtual machines could not be run simultaneously. This update implements an interleaving solution in which the pending memory count is monitored, and runs are throttled if there is insufficient memory. Bulk running of virtual machines now succeeds.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 17:08:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
Attachments
engine log (2.71 MB, text/x-log)
2012-07-25 09:17 EDT, Leonid Natapov
vdsm log (1.09 MB, application/octet-stream)
2012-08-06 06:55 EDT, Leonid Natapov
engine debug log (5.12 MB, text/x-log)
2012-08-08 07:22 EDT, Roy Golan


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 7204 None None None Never

Description Leonid Natapov 2012-07-25 09:17:55 EDT
Created attachment 600296 [details]
engine log

Can't run a large number of VMs simultaneously; getting the error "Cant find VDS to run the VM". I have 20+ VMs that I am trying to run simultaneously. Some VMs turn on and switch to the Powering Up state, but some VMs fail to run. After the powering-up VMs are Up, I can successfully start the VMs that previously failed to run. I can run them one by one and it works fine. In the backend I get the following error:

2012-07-25 16:02:12,134 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-3-thread-43) [40d6b26e] Cant find VDS to run the VM e53f8a2e-4fc0-4d5d-81ea-53135622f577 on, so this VM will not be run.

full engine log attached
Comment 2 Roy Golan 2012-07-25 10:17:36 EDT
Can you specify your setup: number of hosts, and the memory and CPU of the VMs and hosts?
Comment 3 Simon Grinberg 2012-07-26 07:10:35 EDT
(In reply to comment #2)
> can you specify your setup: num of hosts and memory and CPU of the VMS and
> HOSTS

Leonid, could it be that you over-committed memory? If so, it's a known issue. You must wait until KSM kicks in before you can run further VMs.
Comment 6 Roy Golan 2012-08-06 04:37:53 EDT
I'm not sure it's a KSM issue. It could be IO, a timeout on the VDSM semaphore lock for running qemu, etc. Leonid, please specify which VMs you ran and attach the VDSM log.
Comment 8 Leonid Natapov 2012-08-06 06:54:38 EDT
Attaching vdsm.log file.

I am running 1 host in the cluster. The VMs are server machines with no OS.
Comment 9 Leonid Natapov 2012-08-06 06:55:08 EDT
Created attachment 602482 [details]
vdsm log
Comment 10 Roy Golan 2012-08-08 07:22:38 EDT
Created attachment 602998 [details]
engine debug log
Comment 11 Roy Golan 2012-08-08 07:43:21 EDT
The problem is that we increase the pending memory count in RunVm and decrease it only when VdsUpdateRunTimeInfo detects that the VM has gone to Up, so a burst of running VMs will always fail shortly after roughly half of the VMs have started.

One solution I can come up with is to throttle the VM runs so that the *monitoring* can interleave and decrement the pending memory. This probably means a slower flow, because we need a way to fire the monitoring (maybe parts of it, by code sharing?) after every VM run.

Anyway, I find it very bad UX when you have a monster host but you just can't bulk-run a mass of VMs on it.
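The accounting problem above can be illustrated with a toy model (hypothetical numbers, not engine code): while a VM is powering up, its memory is counted both in the host's reported usage and in the still-unreleased pending count, so a burst sized exactly for the host fails at about the halfway point.

```python
# Toy model of the pending-memory double count (not ovirt-engine code).
# Host sized for exactly 10 VMs of 256 MB; pending is only released
# when a VM reaches Up, which never happens during the burst.
TOTAL = 10 * 256
VM_MEM = 256

used = pending = 0
started, failed = [], []
for vm in range(10):
    if TOTAL - used - pending < VM_MEM:
        failed.append(vm)   # "Cant find VDS to run the VM"
        continue
    pending += VM_MEM       # RunVm reserves pending memory
    used += VM_MEM          # qemu starts; host already reports the memory as used
    started.append(vm)      # VM is Powering Up; pending NOT yet released

print(f"started {len(started)} of 10")  # → started 5 of 10
```

Each successfully started VM consumes its memory twice in the free-memory check, so only 5 of the 10 VMs fit before scheduling starts refusing.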
Comment 12 Simon Grinberg 2012-08-08 10:41:04 EDT
(In reply to comment #11)
> The problem is that we are summing the increasing pending memory count from
> the RunVm and decreasing it when VdsUpdateRunTimeInfo detects that the VM
> goes to UP  so a burst running VMs will always fail short after ~ 1/2 of the
> VMs to run.
> 
> one of the solutions I can come with is to throttle the VM run in a way the
> *monitoring* will be able to interleave and decrement the pending memory .
> this means probably slower flow because we need a way to fire the monitoring
> (maybe parts of it by code sharing?) after every VM run?
> 
> Anyway I find it very bad UX when you have a monster Host but you just can't
> bulk run a mass of VMs on it.

There are other consequences of firing up multiple VMs at the same time. For example: a timeout on 'wait for launch' that may happen when you spawn many VMs at once, IO storms when all VMs try to boot from the same shared storage, etc. You need to throttle anyhow.

The solution is to make the creation of the multiple objects asynchronous, and then throttle the actual creation. It's not bad UX; it's a reasonable limitation to prevent the Monday-morning effect. Actually, we have an RFE to do just that; I just can't find it ATM.
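The throttling idea can be sketched with a plain bounded semaphore (illustrative only; the MAX_CONCURRENT limit is made up and the real RFE may work differently): launches are accepted asynchronously, but only a fixed number proceed at once.

```python
import threading
import time

MAX_CONCURRENT = 3  # hypothetical limit, not an actual engine setting
sem = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = 0
peak = 0
done = []

def launch_vm(vm_id):
    """Simulated VM launch: at most MAX_CONCURRENT run at once."""
    global active, peak
    with sem:
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the actual launch work
        with lock:
            active -= 1
            done.append(vm_id)

threads = [threading.Thread(target=launch_vm, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak concurrency: {peak}")  # never exceeds MAX_CONCURRENT
```

All 20 requested launches eventually complete, but the semaphore keeps the concurrent work bounded, which is exactly the "accept everything, throttle the actual creation" shape described above.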
Comment 13 Roy Golan 2012-08-12 10:15:08 EDT
(In reply to comment #12)
> There are other consequences of firing up multiple VMs at the same time. 
> For example - timeout on 'wait for launch' that may happen when you spawn
> many VMs at once, IO storms when all VMs try to boot from the same shared
> storage, etc. You need to throttle anyhow. 
> 
> The solution is to have the creation of multiple object asynchronous, and
> then throttle the actual creation. It's not a bad UX, it's a reasonable
> limitation to prevent Monday morning effect. Actually we have an RFE to do
> just that, I just can find it ATM

I am not sure about the I/O storm you mentioned. I know VDSM has a semaphore for running a VM, with the number of cores as the semaphore count.

Anyhow, my take on this now will be to decrease the pending memory count when the VM status changes to POWERING_UP instead of UP, and to see if this hurries things up.
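A toy model (hypothetical numbers, not engine code) of why releasing the pending reservation at POWERING_UP helps: once the reservation is dropped as soon as the host starts reporting the VM's memory as used, nothing is double-counted and a burst sized exactly for the host fits.

```python
# Same toy host: room for exactly 10 VMs of 256 MB each.
TOTAL = 10 * 256
VM_MEM = 256

used = pending = 0
started = []
for vm in range(10):
    if TOTAL - used - pending < VM_MEM:
        continue            # would fail with "Cant find VDS to run the VM"
    pending += VM_MEM       # RunVm reserves pending memory
    used += VM_MEM          # qemu starts; memory reported as used
    pending -= VM_MEM       # released at POWERING_UP instead of UP
    started.append(vm)

print(f"started {len(started)} of 10")  # → started 10 of 10
```

With the earlier release, the free-memory check sees each VM's memory only once, so the whole burst starts.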
Comment 14 Roy Golan 2012-08-15 04:44:21 EDT
http://gerrit.ovirt.org/#/c/7204/
Comment 24 Doron Fediuck 2013-03-27 07:14:25 EDT
*** Bug 927078 has been marked as a duplicate of this bug. ***
Comment 27 vvyazmin@redhat.com 2013-05-26 16:27:29 EDT
No issues found when running 150 VMs simultaneously (via the Python SDK); each VM has 256 MB RAM.

Verified on RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
Comment 28 errata-xmlrpc 2013-06-10 17:08:06 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html
