Description of problem:
While a 111 VM ramp-up scenario was running against a single host, the last VM failed to start due to low memory, as reported by the engine:

68367:2016-09-28 12:08:35,878 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-13) [] Operation Failed: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host <hostnamecovered> did not satisfy internal filter Memory because its available memory is too low (361.000000 MB) to run the VM.]

The host had ~14 GB of free memory at that time, and there are no errors in the vdsm log. This looks engine-related: the engine prevents the VM from running and never issues the start-VM call to vdsm. KSM, ballooning, and overcommit are off (engine cluster level).

HW profile: 24 cores, 64 GB RAM, 1 TB local disk, 1 NFS SD over a 10 GbE private network (9000 MTU).

Attaching available logs. Looks like a regression.

Version-Release number of selected component (if applicable):
4.1.0-0.master.20160920231321.git50b92e5

How reproducible:
Not clear.

Steps to Reproduce:
1. Ramp up 111 VMs on a single host.
2.
3.

Actual results:
Failed to start 111 VMs.

Expected results:
The 111 VM ramp-up should pass, as in 4.0.

Additional info:
Created attachment 1205702 [details] engine logs
Could you upload debug engine logs too?
(In reply to Andrej Krejcir from comment #3)
> Could you upload debug engine logs too?

Already attached.
How much memory is assigned to a VM?

It may be that a running VM only consumes the memory it actually uses, so the host reports the unused portion as free even though it is assigned to the VM. The scheduler considers the full assigned memory, not only the used portion.

The attached logs have INFO level. DEBUG level would be useful to see the details of scheduling.
(In reply to Andrej Krejcir from comment #5)
> How much memory is assigned to a VM?
>
> It may be possible, that when a VM is running it only consumes the memory it
> actually uses, so the host reports unused memory as free, even if it is
> assigned to the VM.
> The scheduler considers the full assigned memory, not only the used portion.
>
> The attached logs have INFO level.
> DEBUG level would be useful to see details of scheduling.

512 MB.
111 * (512 MiB + 64 MiB) = 63 936 MiB

This does not look like a bug: the number of VMs times the assigned memory plus the default expected overhead per VM adds up to almost all of the host's memory.

We do not use the actual physical free memory for this check. We are trying to guarantee that all VMs can consume their full assigned memory at the same time when no over-commit is defined.
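For illustration, here is a rough model of that accounting (a minimal Python sketch only, not the engine's actual code; the 512 MiB per-VM assignment comes from comment 6, the 64 MiB per-VM overhead and 64 GiB host total are taken from this thread, and everything else is assumed):

    # Rough model of the scheduler-side memory accounting, illustrative only.
    host_total_mib = 64 * 1024        # 64 GiB host (HW profile above)
    vm_guaranteed_mib = 512           # memory assigned per VM (comment 6)
    vm_overhead_mib = 64              # assumed default per-VM overhead
    running_vms = 110                 # VMs already started

    # Every running VM reserves its full assigned memory plus overhead,
    # regardless of how much the guest is actually using.
    committed_mib = running_vms * (vm_guaranteed_mib + vm_overhead_mib)
    schedulable_mib = host_total_mib - committed_mib

    print(committed_mib)    # 63360 MiB committed
    print(schedulable_mib)  # 2176 MiB left before any host reserve

After the host's own reserved memory (not visible in the attached logs) is subtracted, what remains is presumably the ~361 MB the Memory filter reported, which is less than the 512 MiB plus overhead the 111th VM needs, even though ~14 GiB is physically free.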
Eldad, attach an engine log with DEBUG level enabled if you want to reopen this, so we can see all the numbers that went into the equation.
(In reply to Martin Sivák from comment #7)
> 111 * (512 MiB + 64 MiB) = 63 936 MiB
>
> This looks as not a bug: the amount of VMs + default expected overhead per
> VM add up almost to the host's available memory.
>
> We do not use the actual physical free memory for this check. We are trying
> to guarantee that all VMs are allowed to eat all their memory at the same
> time when no over-commit is defined.

Martin, in the description I mention that the host had 14 GB available when the VM failed to start.
https://bugzilla.redhat.com/show_bug.cgi?id=1380194#c0
And I am telling you that the engine does not care about physical memory. The host has 14 GiB available, because the VMs are not fully using their allocated memory. But we count them as if they were. Attach the debug log, there is no bug right now (the fact that we only allow 110 VMs to start instead of 111 is interesting, but not important enough by itself).
(In reply to Martin Sivák from comment #10)
> And I am telling you that the engine does not care about physical memory.
> The host has 14 GiB available, because the VMs are not fully using their
> allocated memory. But we count them as if they were.
>
> Attach the debug log, there is no bug right now (the fact that we only allow
> 110 VMs to start instead of 111 is interesting, but not important enough by
> itself).

Please raise the priority if needed.
Well, I am closing this again until you convince me we have a bug. All the information attached to this bug so far shows correct and expected behaviour.