Description of problem:

I was doing a deployment of RHV (self-hosted, 4 hosts) + OSE (3 nodes) + CFME. It got through the RHV and OSE deployments but failed in the CFME deployment with the following error:

CFME Launch failed with error ["Failed to power up a compute thistimeitwillwork-RHEV (RHEV) instance thistimeitwillwork-cfme.b.b: Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host hosted_engine_1 did not satisfy internal filter Memory because its available memory is too low (3083.000000 MB) to run the VM., The host hosted_engine_1 did not satisfy internal filter Memory because its available memory is too low (3083.000000 MB) to run the VM., The host hosted_engine_1 did not satisfy internal filter Memory because its available memory is too low (3083.000000 MB) to run the VM., The host hosted_engine_1 did not satisfy internal filter Memory because its available memory is too low (3083.000000 MB) to run the VM."]

I actually could log in to CFME and found that, according to both CFME itself and RHV, the CFME guest did not exist (even though it was running). One host had 3083 MB of memory left, and the other three hosts had 7244 MB left. All of the hosts had 16 GB of memory each.

Version-Release number of selected component (if applicable):
QCI-1.0-RHEL-7-20160824.t.1

How reproducible:
I think always

Steps to Reproduce:
1. Do a RHV + OSE + CFME deployment where there is not quite enough memory left on the first RHV host.

Actual results:
It fails with the error above.

Expected results:
An error when configuring CFME saying there is not enough memory to deploy CFME.
Possible RHEV issue. We need to investigate this further and attempt to recreate it. Our simplified aggregate checks seem to have correctly determined that enough memory *was* available across the 4 hypervisors. Moving to v1.1.
Right now self-hosted does not work, and the original configuration I tested this with involved self-hosted (and indeed that was where the conflict occurred, because the engine and the CFME box ended up on the same host). So at present I am waiting for a QCI 1.1 compose where self-hosted works in order to test this.
Unfortunately, I can't validate this now because OCP deployments are failing due to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1403864
This is in compose QCI-1.1-RHEL-7-20161209.t.0.
Verified in QCI-1.1-RHEL-7-20161215.t.0
I don't think this should be marked as verified. On QCI-1.1-RHEL-7-20161215.t.0, in a deployment of RHV self-hosted + OCP + CFME with 4 hypervisors (16 GB RAM each) and 4 OCP nodes (1 master + 3 workers), the CFME task fails because there is not enough RAM available:

----
D, [2016-12-20T13:24:38.048941 #19406] DEBUG -- : ====== CFME Launch run method ======
I, [2016-12-20T13:25:46.069406 #19406] INFO -- : ["Failed to power up a compute tpapaioa_3-RHEV (RHEV) instance tpapaioa-3-rhv-cfme.cfme.lab.eng.rdu2.redhat.com: Cannot run VM. There is no host that satisfies current scheduling constraints. See below for details:, The host hosted_engine_2 did not satisfy internal filter Memory because its available memory is too low (7243.000000 MB) to run the VM., The host hosted_engine_2 did not satisfy internal filter Memory because its available memory is too low (7243.000000 MB) to run the VM., The host hosted_engine_2 did not satisfy internal filter Memory because its available memory is too low (7243.000000 MB) to run the VM., The host hosted_engine_2 did not satisfy internal filter Memory because its available memory is too low (7243.000000 MB) to run the VM."]
----

There is no warning during creation of the deployment that there is insufficient memory for all 6 VMs (1 engine + 4 OCP nodes + 1 CFME). On the OpenShift > Master/Nodes tab:

         Resources needed    Resources available
vCPU     5                   16
RAM      32 GB               58.04 GB
Disk     135 GB              657.54 GB

Hovering over the tooltip next to "Resources available" shows "0 vCPUs, 0GB RAM, 0GB Disk reserved for CloudForms", even though CFME has been selected. There is also no mention of the resources required for the self-hosted engine VM, nor does there appear to be any accounting for the fact that, even though a total of ~58 GB RAM is available, each individual host has at most 16 GB.
Will revisit by trying a RHV self-hosted deployment with 4 x 16 GB RAM hypervisors as you've described above. Thank you.
I am punting this from the 1.1 release; the underlying issue is more involved than we first realized.

The heart of the issue is that we are treating the memory requirements incorrectly. We assumed we could consider the total memory available as a single pool, similar to shared disk usage, and do simple arithmetic. We can't: a VM must satisfy its RAM requirement from free RAM on a single hypervisor; it can't split its requirement across a second hypervisor. The memory actually consumed is segmented per hypervisor, based on how VMs get scheduled to each hypervisor.

For example: if we have 4 hypervisors, each with 16 GB of RAM, that is a total of 64 GB. Each hypervisor has ~16 GB - 2 GB (assuming that is the RAM the hypervisor reserves for itself) = 14 GB free. When it comes to running 5 VMs of, say, 8 GB each, we need 40 GB. We __thought__ we had sufficient memory: 4 hypervisors x 14 GB = 56 GB > 40 GB. The issue is that we need to account for where the VMs will run, and a VM only runs on a single hypervisor. When we schedule this, each hypervisor ends up with a single 8 GB VM, so each hypervisor has 16 GB - 2 GB (hypervisor usage) - 8 GB (VM) = ~6 GB free. None of the hypervisors in this case has a full chunk of 8 GB left to allocate to the fifth VM. The "pool" says we have enough RAM free, but treating this memory as a pool is inaccurate since it doesn't model real-world usage. (See the sketch below.)

Also note: to implement this correctly, we need to be aware of how RHV will schedule each VM onto a hypervisor. For example, if hypervisors differ in capabilities, say free CPUs or other requirements, we need to address that in our algorithm. At the moment I'm aware of:

- Memory needs
- Number of CPUs

We need to address the above 2 requirements when estimating how the RHV scheduler will place VMs onto hypervisors.
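To make the difference concrete, here is a minimal illustrative sketch in Ruby (not the actual QCI/fusor code). It contrasts the old aggregate "pool" check with a per-hypervisor placement check, using the numbers from the example above: 4 hypervisors with 16 GB each, an assumed 2 GB reserve per hypervisor, and five 8 GB VMs. The 2 GB reserve and the first-fit-decreasing placement are assumptions for illustration only; the real RHV scheduler applies its own filters and weights.

----
# Illustrative sketch only: aggregate "pool" check vs. per-hypervisor
# placement check. The 2 GB per-host reserve and the first-fit-decreasing
# placement are assumptions for the example, not RHV's actual policy.

HYP_RESERVED_GB = 2  # assumed RAM each hypervisor keeps for itself

hypervisors    = Array.new(4) { { free_gb: 16 - HYP_RESERVED_GB } }  # 4 hosts x 14 GB free
vm_requests_gb = [8, 8, 8, 8, 8]                                     # five 8 GB VMs

# Old check: treat all free RAM as one pool.
pool_free_gb = hypervisors.sum { |h| h[:free_gb] }   # 56 GB
pool_ok      = pool_free_gb >= vm_requests_gb.sum    # 56 >= 40 -> true (misleading)

# New check: each VM must fit entirely on a single hypervisor
# (first-fit, largest VM first).
placement_ok = vm_requests_gb.sort.reverse.all? do |vm_gb|
  host = hypervisors.find { |h| h[:free_gb] >= vm_gb }
  next false unless host
  host[:free_gb] -= vm_gb
  true
end

puts "pool check:      #{pool_ok}"       # => true  (claims we have enough RAM)
puts "placement check: #{placement_ok}"  # => false (no host has 8 GB free for the 5th VM)
----

A fuller version would also track per-host vCPU availability and mirror the RHV scheduler's filters, per the two requirements listed above.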