Bug 1846844 - Compute memory is overprovisioned despite setting ram_allocation_ratio=1.0 in nova.conf
Summary: Compute memory is overprovisioned despite setting ram_allocation_ratio=1.0 in nova.conf
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Sylvain Bauza
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-15 04:59 UTC by Rohini Diwakar
Modified: 2023-10-06 20:37 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-22 15:31:37 UTC
Target Upstream Version:
Embargoed:
rdiwakar: needinfo-




Links
System: Red Hat Issue Tracker | ID: OSP-23534 | Private: 0 | Priority: None | Status: None | Summary: None | Last Updated: 2023-03-21 19:35:18 UTC

Description Rohini Diwakar 2020-06-15 04:59:13 UTC
Description of problem:
One compute node has overprovisioned memory even though ram_allocation_ratio is set to 1.0 in the controller node's nova.conf. New instances fail to spawn on this particular node.

Total memory is 523778 MB and used memory is 552960 MB on wb-sdc-compute09.wbsdc.in
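(For reference, these figures look like the values reported by the hypervisor stats; a command along these lines should show them, assuming a reasonably recent python-openstackclient:)

$ openstack hypervisor show wb-sdc-compute09.wbsdc.in -c memory_mb -c memory_mb_used -c free_ram_mb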

Version-Release number of selected component (if applicable):
RHOSP13

Comment 6 Sylvain Bauza 2020-06-23 14:12:01 UTC
As Piotr explained, you need to set the allocation ratios on each compute service, not on the controller.

The default RAM allocation ratio is 1.5, not 1.0, which is why instances were allowed to consume more RAM than the compute service actually has.
If you set the value to 1.0 on compute09 and restart it, Placement won't accept new instances for this compute until you migrate some of the instances to another compute.
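For illustration, a minimal sketch of that per-compute change. The config path and container name below assume a standard containerized RHOSP 13 compute node (and that crudini is available); adjust them to your deployment:

# on wb-sdc-compute09, set the ratio in the nova.conf used by the nova_compute container
$ sudo crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT ram_allocation_ratio 1.0
# restart the compute service so the new ratio is reported to Placement
$ sudo docker restart nova_compute

In a director-managed deployment the same change should also be made through the corresponding TripleO parameter so it survives a redeploy.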

Comment 8 Sylvain Bauza 2020-06-26 14:52:40 UTC
To be clear, since you haven't modified the RAM ratios on the computes yet, they already allowed over-provisioning, which is why the free RAM is negative.

I don't know the flavor of the instance you want to create on compute09, but I suspect the requested RAM is larger than the RAM still available even with over-provisioning, and since you haven't configured swap to back the over-provisioned memory, the request is rejected.


There are two possibilities for you then:
 - you allow over-provisioning by allocating swap of the size of the over-provisioned RAM (see the sketch below),
or,
 - you disallow over-provisioning by setting the ratio to 1.0 on *all computes* and restarting the nova-compute services. No new instances will be allowed on the over-committed hosts, but you can then migrate your over-provisioned instances to compute services that have enough RAM to fit them.
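For the first option, a rough sketch of adding swap on the compute node; the swap file path and the 64G size are purely illustrative and should be sized to the amount of over-provisioned RAM:

$ sudo fallocate -l 64G /swapfile        # size it to cover the over-provisioned RAM
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots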

HTH.

Comment 10 Sylvain Bauza 2020-06-29 09:15:52 UTC
Could you please ask the customer to give us the Placement details for the compute09 Resource Provider (RP)?

First, get the RP UUID for compute09:
https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-list

$ openstack resource provider list --name wb-sdc-compute09.wbsdc.in



Then, ask for the MEMORY_MB inventory for the RP UUID you got:
(you should see the reported allocation ratio)
https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-inventory-show

$ openstack resource provider inventory show <uuid> MEMORY_MB


Finally, ask for this RP's usage (i.e. look at all the allocations):
https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-usage-show

$ openstack resource provider usage show <uuid>


Thanks,
-S

Comment 11 Rohini Diwakar 2020-07-01 02:14:42 UTC
Hi,

Please find the required output.

(overcloud) [stack@wb-sdc-undercloud01 ~]$ openstack resource provider list --name wb-sdc-compute09.wbsdc.in
+--------------------------------------+---------------------------+------------+
| uuid                                 | name                      | generation |
+--------------------------------------+---------------------------+------------+
| a85401d1-8ccb-4399-afdc-b43eb2ddc92f | wb-sdc-compute09.wbsdc.in |        170 |
+--------------------------------------+---------------------------+------------+
(overcloud) [stack@wb-sdc-undercloud01 ~]$
(overcloud) [stack@wb-sdc-undercloud01 ~]$ openstack resource provider inventory show a85401d1-8ccb-4399-afdc-b43eb2ddc92f MEMORY_MB
+------------------+--------+
| Field            | Value  |
+------------------+--------+
| allocation_ratio | 1.0    |
| max_unit         | 523778 |
| reserved         | 4096   |
| step_size        | 1      |
| min_unit         | 1      |
| total            | 523778 |
+------------------+--------+
(overcloud) [stack@wb-sdc-undercloud01 ~]$ openstack resource provider usage show a85401d1-8ccb-4399-afdc-b43eb2ddc92f
+----------------+--------+
| resource_class |  usage |
+----------------+--------+
| VCPU           |    168 |
| MEMORY_MB      | 507904 |
| DISK_GB        |      0 |
+----------------+--------+
(overcloud) [stack@wb-sdc-undercloud01 ~]$

This is the current usage report. Note that the cu has deleted 1 of the 2 instances that had been created to test over-provisioning (test output in my previous comment).

Comment 12 Stephen Finucane 2020-07-01 10:53:36 UTC
Thanks for providing the placement information. Based on that, there is no overallocation happening here, at least not as far as nova is concerned. Instances are consuming 507904 MB of RAM, which is less than the total available 523778 MB.

What has happened here is that you haven't correctly accounted for host overhead when configuring your host. When using the libvirt driver, the os-hypervisors API does not report free memory based purely on the amount of memory consumed by instances. Instead, it parses '/proc/meminfo' to calculate total free memory [1] and subtracts that from the total available memory [2]. Nova simply considers all memory not used by an instance to be "free"; however, it is clearly not only nova instances consuming memory on the host, hence the mismatch.
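To see what the driver is reading, the same counters can be inspected directly on compute09, e.g.:

$ grep -E 'MemTotal|MemFree|Buffers|Cached|HugePages_Total|Hugepagesize' /proc/meminfo

Memory consumed by hugepage allocations or by other host processes shows up here but is never attributed to any instance, which is how the reported "used" memory can exceed the sum of the instance flavors.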

The correct solution here is to increase the amount of memory reserved for the host. This can be done by configuring the '[DEFAULT] reserved_host_memory_mb' and '[DEFAULT] reserved_huge_pages' nova.conf values (or the 'NovaReservedHostMemory' and 'NovaReservedHugePages' heat parameters). The former is currently configured to 4096 MB, as seen in the output of the 'openstack resource provider inventory show' command (the 'reserved' value). I'm not sure what is consuming the additional memory, but I would suspect hugepage allocations; this should be pretty easy to identify.
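For example, a quick way to check whether hugepages account for the gap, followed by a sketch of raising the reservation. The config path and container name assume a containerized RHOSP 13 compute node, and the 16384 value is purely illustrative; size it to the host overhead you actually measure:

# check hugepage allocations on the compute node
$ grep -i huge /proc/meminfo
# raise the host memory reservation and restart nova-compute
$ sudo crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf DEFAULT reserved_host_memory_mb 16384
$ sudo docker restart nova_compute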

Also, as an aside, the os-hypervisors API is generally considered to be a poorly-designed API and should be avoided where possible. We will likely deprecate it in a future release.

[1] https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/host.py#L1089-L1099
[2] https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/host.py#L1117-L1118

Comment 16 Artom Lifshitz 2020-07-10 14:46:02 UTC
free_ram_mb can be negative if the compute host has been overprovisioned. The calculation is: (total host RAM) * (RAM allocation ratio) - (sum of all instances' flavors' RAM). To give a simple example, on a host with 10GB of memory and a RAM allocation ratio of 2.0, two 10GB instances can land. If the RAM allocation ratio is then decreased back down to 1.0, free_ram_mb becomes 10GB * 1.0 - (10GB + 10GB) = -10GB. I believe this is basically what has happened in this BZ.
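To make the arithmetic concrete with the numbers from that example (all values in MB):

$ echo $(( 10240 * 2 - (10240 + 10240) ))    # ratio 2.0, both instances running -> 0, host full but not negative
$ echo $(( 10240 * 1 - (10240 + 10240) ))    # ratio lowered to 1.0 -> -10240, i.e. free_ram_mb goes negative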

Because of that, I'd like to close this BZ as NOTABUG. If you feel that this is an incorrect assessment, and there's a problem here that the Nova engineering team should fix, by all means feel free to reopen this BZ.

Thanks!

Comment 17 Rohini Diwakar 2020-07-13 03:46:44 UTC
Hi,

Re-opening this bug as the cu has confirmed that the node wasn't overprovisioned and the ram_allocation_ratio has been 1.0 since the beginning. 

Cu statements: 
~~~
Also would like to know if anytime in past you have over-provisioned this node and then changed the value to 1.0 for ram_allocation?
Ans: NO

Were any modifications done to the configuration as per ram_allocation_ratio in the past? 
Ans: NO.
~~~

free_ram_mb is right now negative, is there any way to correct this?

Comment 18 Stephen Finucane 2020-07-13 13:13:58 UTC
(In reply to Rohini Diwakar from comment #17)
> Hi,
> 
> Re-opening this bug as the cu has confirmed that the node wasn't
> overprovisioned and the ram_allocation_ratio has been 1.0 since the
> beginning. 
> 
> Cu statements: 
> ~~~
> Also would like to know if anytime in past you have over-provisioned this
> node and then changed the value to 1.0 for ram_allocation?
> Ans: NO
> 
> Were any modifications done to the configuration as per ram_allocation_ratio
> in the past? 
> Ans: NO.
> ~~~
> 
> free_ram_mb is right now negative, is there any way to correct this?

Yes, as noted in comment 12, you need to increase '[DEFAULT] reserved_host_memory_mb' to account for host overhead. There appears to be process overhead consuming memory that nova is not accounting for. You can determine what it is by inspecting memory usage with a utility such as 'top'.
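For instance, something along these lines on compute09 (standard Linux tooling, nothing nova-specific):

# largest resident-memory consumers, qemu processes aside
$ ps aux --sort=-rss | head -n 20
# hugepage and overall memory breakdown
$ grep -i -E 'huge|memtotal|memfree' /proc/meminfo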

