Bug 1380194 - [scale] - wrong memory utilization for a host
Summary: [scale] - wrong memory utilization for a host
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.0-beta
Assignee: Andrej Krejcir
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-28 22:06 UTC by Eldad Marciano
Modified: 2017-01-19 16:08 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-01-19 16:08:15 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments

Description Eldad Marciano 2016-09-28 22:06:51 UTC
Description of problem:
While a ramp-up scenario of 111 VMs was running against a single host, the last VM failed to start due to low memory, as reported by the engine:
 
68367:2016-09-28 12:08:35,878 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-13) [] Operation Failed: [Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host <hostnamecovered> did not satisfy internal filter Memory because its available memory is too low (361.000000 MB) to run the VM.]

The host had enough memory (~14 GB) at that time.
There are no errors in the vdsm log.
This bug looks engine-related: the engine prevents the VM from running and never issues the start VM call to vdsm.


KSM, ballooning, and memory overcommit are off (at the engine cluster level).

HW profile:
24 cores, 64 GB RAM, 1 TB local disk, 1 NFS storage domain over a 10 Gb private network (9000 MTU).

Attaching available logs.

Looks like a regression.

Version-Release number of selected component (if applicable):
4.1.0-0.master.20160920231321.git50b92e5

How reproducible:
Not clear.

Steps to Reproduce:
1. Ramp up 111 VMs on a single host.
2.
3.

Actual results:
Failed to start all 111 VMs; the last VM is rejected.

Expected results:
The 111-VM ramp-up should pass, as it did in 4.0.

Additional info:

Comment 1 Eldad Marciano 2016-09-28 22:08:39 UTC
Created attachment 1205702 [details]
engine logs

Comment 3 Andrej Krejcir 2016-10-05 14:49:55 UTC
Could you upload debug engine logs too?

Comment 4 Eldad Marciano 2016-10-06 16:36:45 UTC
(In reply to Andrej Krejcir from comment #3)
> Could you upload debug engine logs too?

already attached

Comment 5 Andrej Krejcir 2016-11-28 11:58:49 UTC
How much memory is assigned to a VM?

It may be that when a VM is running it only consumes the memory it
actually uses, so the host reports unused memory as free, even if it is 
assigned to the VM.
The scheduler considers the full assigned memory, not only the used portion.

The attached logs have INFO level.
DEBUG level would be useful to see details of scheduling.
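
For illustration, a minimal sketch (not actual oVirt scheduler code; the helper names, the 64 MiB per-VM overhead constant, and the usage figures are assumptions made for this example) of the difference between what the host reports as free and what the scheduler treats as available:

# Illustrative sketch only -- not oVirt code. All values in MiB.
def scheduler_available_mem(host_total, assigned_per_vm, overhead_per_vm=64):
    # The scheduler subtracts the full *assigned* size of every running VM
    # (plus a fixed per-VM overhead), regardless of how much the guest uses.
    committed = sum(assigned + overhead_per_vm for assigned in assigned_per_vm)
    return host_total - committed

def host_reported_free_mem(host_total, used_per_vm):
    # The host only subtracts pages actually in use, so its "free" figure
    # can be much larger than what the scheduler is willing to commit.
    return host_total - sum(used_per_vm)

# Hypothetical numbers: 110 running VMs assigned 512 MiB, each using ~470 MiB.
print(scheduler_available_mem(64 * 1024, [512] * 110))  # 2176
print(host_reported_free_mem(64 * 1024, [470] * 110))   # 13836 (~13.5 GiB)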

Comment 6 Eldad Marciano 2016-11-28 12:23:12 UTC
(In reply to Andrej Krejcir from comment #5)
> How much memory is assigned to a VM?
> 
> It may be that when a VM is running it only consumes the memory it
> actually uses, so the host reports unused memory as free, even if it is 
> assigned to the VM.
> The scheduler considers the full assigned memory, not only the used portion.
> 
> The attached logs have INFO level.
> DEBUG level would be useful to see details of scheduling.

512 MB

Comment 7 Martin Sivák 2017-01-09 12:54:09 UTC
111 * (512 MiB + 64 MiB) = 63 936 MiB

This does not look like a bug: the VMs' assigned memory plus the default expected overhead per VM adds up to almost all of the host's available memory.

We do not use the actual physical free memory for this check. We are trying to guarantee that all VMs are allowed to eat all their memory at the same time when no over-commit is defined.
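
A back-of-the-envelope check of that arithmetic (illustrative only; the 64 GiB total comes from the HW profile in the description and the 64 MiB figure is the default per-VM overhead quoted above):

# Illustrative arithmetic only -- not oVirt code. Values in MiB.
vms = 111
assigned = 512            # per-VM memory, from comment 6
overhead = 64             # default expected per-VM overhead
host_total = 64 * 1024    # 64 GiB host from the HW profile

committed = vms * (assigned + overhead)
print(committed)               # 63936
print(host_total - committed)  # 1600 -> well under 2 GiB of headroom before
# any host-side reservation is subtracted, which is consistent with the
# memory filter rejecting the last VM while ~14 GiB of RAM sits physically
# unused.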

Comment 8 Martin Sivák 2017-01-09 13:20:29 UTC
Eldad, attach an engine log with DEBUG level enabled if you want to reopen this, so we can see all the numbers that went into the equation.

Comment 9 Eldad Marciano 2017-01-09 13:36:39 UTC
(In reply to Martin Sivák from comment #7)
> 111 * (512 MiB + 64 MiB) = 63 936 MiB
> 
> This does not look like a bug: the VMs' assigned memory plus the default
> expected overhead per VM adds up to almost all of the host's available memory.
> 
> We do not use the actual physical free memory for this check. We are trying
> to guarantee that all VMs are allowed to eat all their memory at the same
> time when no over-commit is defined.

Martin, in the description I mention that the host had ~14 GB available when the VM failed to start.

https://bugzilla.redhat.com/show_bug.cgi?id=1380194#c0

Comment 10 Martin Sivák 2017-01-09 13:58:00 UTC
And I am telling you that the engine does not care about physical memory. The host has 14 GiB available, because the VMs are not fully using their allocated memory. But we count them as if they were.

Attach the debug log; there is no bug right now (the fact that we only allow 110 VMs to start instead of 111 is interesting, but not important enough by itself).

Comment 11 Eldad Marciano 2017-01-19 15:40:39 UTC
(In reply to Martin Sivák from comment #10)
> And I am telling you that the engine does not care about physical memory.
> The host has 14 GiB available, because the VMs are not fully using their
> allocated memory. But we count them as if they were.
> 
> Attach the debug log; there is no bug right now (the fact that we only allow
> 110 VMs to start instead of 111 is interesting, but not important enough by
> itself).

Please raise the priority if needed.

Comment 12 Martin Sivák 2017-01-19 16:08:15 UTC
Well, I am closing this again until you convince me we have a bug. All the information attached to this bug so far shows correct and expected behaviour.

