Bug 843058
| Field | Value |
|---|---|
| Summary | Can't run large amount of VMs simultaneously. Getting error Cant find VDS to run the VM. |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 3.1.0 |
| Target Milestone | --- |
| Target Release | 3.2.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | virt |
| Fixed In Version | sf2 |
| Doc Type | Bug Fix |
| Reporter | Leonid Natapov <lnatapov> |
| Assignee | Roy Golan <rgolan> |
| QA Contact | vvyazmin <vvyazmin> |
| CC | chetan, dyasny, hateya, iheim, lpeer, mhuth, ofrenkel, pstehlik, Rhev-m-bugs, sgrinber, yeylon, ykaul |
| Keywords | Reopened |
| Type | Bug |
| Last Closed | 2013-06-10 21:08:06 UTC |
| Bug Blocks | 915537 |

Doc Text:

The pending memory count increases when the RunVm call is issued and decreases when the virtual machine changes to an Up state. When the count was not decreased, the accumulated pending memory exhausted the host's apparent free memory, which prevented any host from being selected to run virtual machines. Consequently, a large number of virtual machines could not be run simultaneously.

This update implements an interleaving solution in which the pending memory count is monitored, and virtual machine starts are throttled if there is insufficient memory. Bulk running of virtual machines now succeeds.
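The accounting problem described in the Doc Text above can be sketched with a minimal simulation. Everything here is an illustrative assumption (the `Host` class, `run_vm`, `on_vm_up`, and the memory sizes), not the actual ovirt-engine scheduler code:

```python
# Hypothetical sketch of pending-memory accounting: RunVm reserves memory
# immediately, and the reservation is only released when monitoring sees
# the VM reach Up. Without that release interleaving, a burst of starts
# exhausts the apparent free memory and further starts are rejected.

class Host:
    def __init__(self, free_mem_mb):
        self.free_mem_mb = free_mem_mb
        self.pending_mem_mb = 0  # reserved for VMs that are still starting

    def can_run(self, vm_mem_mb):
        # A VM fits only if free memory minus pending reservations covers it.
        return self.free_mem_mb - self.pending_mem_mb >= vm_mem_mb

    def run_vm(self, vm_mem_mb):
        # RunVm: reserve the memory up front.
        if not self.can_run(vm_mem_mb):
            raise RuntimeError("Cant find VDS to run the VM")
        self.pending_mem_mb += vm_mem_mb

    def on_vm_up(self, vm_mem_mb):
        # Monitoring cycle: release the reservation once the VM is Up.
        self.pending_mem_mb -= vm_mem_mb
        self.free_mem_mb -= vm_mem_mb

# Burst-start 20 VMs of 1024 MB on a 16 GB host. No monitoring cycle runs
# in between, so pending memory piles up and the burst fails partway.
host = Host(free_mem_mb=16384)
started = 0
try:
    for _ in range(20):
        host.run_vm(1024)
        started += 1
except RuntimeError:
    pass
print(started)  # 16: the 17th RunVm is rejected
```

The fix monitors the pending count and throttles further starts until monitoring has had a chance to run and release reservations.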
Can you specify your setup: number of hosts, and memory and CPU of the VMs and hosts?

(In reply to comment #2)
> can you specify your setup: num of hosts and memory and CPU of the VMS and
> HOSTS

Leonid, could it be that you over-commit memory? If so, it's a known issue: you must wait until KSM kicks in before you can run further VMs.

I'm not sure it's a KSM issue. It could be IO, a timeout on the VDSM semaphore lock for running qemu, etc. Leonid, please specify which VMs you ran and attach the VDSM log.

Attaching vdsm.log file. I am running 1 host in the cluster. The VMs are server machines with no OS.

Created attachment 602482 [details]
vdsm log
Created attachment 602998 [details]
engine debug log
The problem is that we keep summing the increasing pending memory count from RunVm and only decrease it when VdsUpdateRunTimeInfo detects that the VM goes to UP, so a burst of running VMs will always start failing after roughly half of the VMs.

One of the solutions I can come up with is to throttle the VM runs so that the *monitoring* is able to interleave and decrement the pending memory. This probably means a slower flow, because we need a way to fire the monitoring (maybe parts of it, by code sharing?) after every VM run.

Anyway, I find it very bad UX when you have a monster host but you just can't bulk-run a mass of VMs on it.

(In reply to comment #11)
> The problem is that we are summing the increasing pending memory count from
> the RunVm and decreasing it when VdsUpdateRunTimeInfo detects that the VM
> goes to UP [...]

There are other consequences of firing up multiple VMs at the same time. For example: a timeout on 'wait for launch' that may happen when you spawn many VMs at once, IO storms when all VMs try to boot from the same shared storage, etc. You need to throttle anyhow.

The solution is to make the creation of multiple objects asynchronous, and then throttle the actual creation. It's not bad UX; it's a reasonable limitation to prevent the Monday-morning effect.
Actually we have an RFE to do just that; I just can't find it ATM.

(In reply to comment #12)
> There are other consequences of firing up multiple VMs at the same time.
> [...] You need to throttle anyhow.

I am not sure about the I/O storm you mentioned. I know VDSM has a semaphore for running a VM, with the number of cores as the semaphore count. Anyhow, my take on this now will be to decrease the pending memory count when the VM status changes to POWERING_UP instead of UP, and to see if this hurries things up.

*** Bug 927078 has been marked as a duplicate of this bug. ***
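The change proposed above (releasing the pending reservation when the VM reaches POWERING_UP rather than waiting for UP) can be illustrated with a small event-replay simulation. All names, event streams, and sizes are hypothetical, not ovirt-engine code:

```python
# Sketch comparing when the pending-memory reservation is released:
# at UP (old behavior) vs. at POWERING_UP (the proposed change).
# Releasing earlier shrinks the window during which a burst of RunVm
# calls sees the memory as unavailable.

def simulate(release_at, events, vm_mem_mb=1024, free_mem_mb=4096):
    """Replay a start/monitor event stream; count RunVm successes/failures.

    events is a list like ["run", "powering_up", "run", "up", ...];
    release_at names the state at which pending memory is released.
    """
    pending = 0
    started = failed = 0
    for ev in events:
        if ev == "run":
            if free_mem_mb - pending >= vm_mem_mb:
                pending += vm_mem_mb
                started += 1
            else:
                failed += 1
        elif ev == release_at:
            pending = max(0, pending - vm_mem_mb)
    return started, failed

# 8 RunVm calls; each VM reaches POWERING_UP quickly but UP only later.
events = ["run", "run", "run", "run",
          "powering_up", "powering_up",
          "run", "run", "run", "run",
          "up", "up"]

print(simulate("up", events))           # old: pending released too late
print(simulate("powering_up", events))  # new: two extra starts succeed
```

With this event stream the old policy starts 4 VMs and fails 4, while releasing at POWERING_UP starts 6 and fails only 2, because the reservations for the two VMs already powering up are freed before the second wave of RunVm calls.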
No issues found when running 150 VMs simultaneously (via Python SDK), each VM with 256 MB RAM.

Verified on RHEVM 3.2 - SF17.1 environment:
RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html
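A client-side pattern for the kind of bulk start used in verification can be sketched as follows. This is a generic sketch: `start_vm` is a placeholder stub, not the actual ovirt Python SDK call, and the semaphore throttle is an illustrative client-side choice, not something the SDK provides:

```python
# Start many VMs concurrently while capping how many start requests are
# in flight at once, so the engine's pending-memory accounting is not
# flooded by a single burst.

import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 10
gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)
started = []
lock = threading.Lock()

def start_vm(name):
    # Placeholder for the real SDK start call.
    with lock:
        started.append(name)

def throttled_start(name):
    with gate:  # at most MAX_IN_FLIGHT starts run concurrently
        start_vm(name)

with ThreadPoolExecutor(max_workers=50) as pool:
    for i in range(150):
        pool.submit(throttled_start, f"vm-{i:03d}")

print(len(started))  # 150
```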
Created attachment 600296 [details]
engine log

Can't run a large amount of VMs simultaneously; getting error "Cant find VDS to run the VM".

I have 20+ VMs that I am trying to run simultaneously. Some VMs turn on and switch to the Powering Up state, but some VMs fail to run. After the Powering Up VMs are Up, I can successfully start the VMs which previously failed to run. I can run them one by one and it works fine.

In the backend I get the following error:

2012-07-25 16:02:12,134 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-3-thread-43) [40d6b26e] Cant find VDS to run the VM e53f8a2e-4fc0-4d5d-81ea-53135622f577 on, so this VM will not be run.

Full engine log attached.