Bug 1664702 - [OSP10] Oversubscription broken for instances with NUMA topologies
Summary: [OSP10] Oversubscription broken for instances with NUMA topologies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Stephen Finucane
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1519540 1664698 1664701
Blocks:
 
Reported: 2019-01-09 13:38 UTC by Stephen Finucane
Modified: 2023-03-21 19:10 UTC (History)
CC List: 10 users

Fixed In Version: openstack-nova-14.1.0-43.el7ost
Doc Type: Known Issue
Doc Text:
Previously, an update that made memory allocation pagesize-aware inadvertently disabled memory oversubscription for all instances with NUMA topologies, including implicit NUMA topologies created by features such as hugepages or CPU pinning. With this update, memory oversubscription works again for instances with a NUMA topology but no hugepages; it remains disabled for instances that use hugepages or CPU pinning.
Clone Of: 1664701
Environment:
Last Closed: 2019-04-30 16:59:16 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1810977 0 None None None 2019-01-09 13:38:18 UTC
Red Hat Product Errata RHBA-2019:0923 0 None None None 2019-04-30 16:59:23 UTC

Description Stephen Finucane 2019-01-09 13:38:19 UTC
+++ This bug was initially created as a clone of Bug #1664701 +++

Description of problem:

As described in [1], the fix to [2] appears to have inadvertently broken oversubscription of memory for instances with a NUMA topology but no hugepages.

Version-Release number of selected component (if applicable):

N/A

How reproducible:

Always.

Steps to Reproduce:

1. Create a flavor that will consume more than 50% of the available memory on your host(s) and specify an explicit NUMA topology. For example, on my all-in-one deployment, where the host has 32GB RAM, we will request a 20GB instance:

   $ openstack flavor create --vcpu 2 --disk 0 --ram 20480 test.numa
   $ openstack flavor set test.numa --property hw:numa_nodes=2

2. Boot an instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test

3. Boot another instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2
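
The RAM split implied by these steps can be sketched as follows. This is not actual nova code; the function name and the even-split-with-remainder rule are illustrative assumptions, but the arithmetic matches what the flavor requests: 20480 MB across two NUMA cells is 10240 MB per cell, the "Required: 10240" figure seen in the scheduler log below.

```python
# Illustrative sketch (not nova's implementation): how a flavor's RAM is
# divided across the NUMA cells requested via hw:numa_nodes.
def split_ram_across_numa_nodes(ram_mb, numa_nodes):
    """Divide flavor RAM evenly across the requested NUMA cells."""
    base = ram_mb // numa_nodes
    cells = [base] * numa_nodes
    # Park any remainder on the last cell; in practice nova expects an
    # even split unless hw:numa_mem.N overrides are given.
    cells[-1] += ram_mb - base * numa_nodes
    return cells

# The 20480 MB test.numa flavor with hw:numa_nodes=2:
print(split_ram_across_numa_nodes(20480, 2))  # [10240, 10240]
```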

Actual results:

The second instance fails to boot. We see the following messages in the scheduler logs.

  nova-scheduler[18295]: DEBUG nova.virt.hardware [None req-f7a6594b-8d25-424c-9c6e-8522f66ffd22 demo admin] No specific pagesize requested for instance, selected pagesize: 4 {{(pid=18318) _numa_fit_instance_cell /opt/stack/nova/nova/virt/hardware.py:1045}}
  nova-scheduler[18295]: DEBUG nova.virt.hardware [None req-f7a6594b-8d25-424c-9c6e-8522f66ffd22 demo admin] Not enough available memory to schedule instance with pagesize 4. Required: 10240, available: 5676, total: 15916. {{(pid=18318) _numa_fit_instance_cell /opt/stack/nova/nova/virt/hardware.py:1055}}

Reverting the patch that addressed the bug [3] restores the correct behaviour and the instance boots, though we obviously lose whatever benefits that change gave us.
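
To illustrate the failure mode, here is a minimal sketch of the two memory checks in play. The function names and shapes are made up for illustration, not copied from nova/virt/hardware.py, but they use the exact figures from the log above: the pagesize-aware path compares the request against free memory only, so the allocation ratio is never applied.

```python
# Hypothetical sketch of the memory fit check. Names are illustrative.
def fits_without_oversubscription(required_mb, total_mb, used_mb):
    # Broken behaviour: the pagesize-aware path only looks at free
    # memory on the NUMA cell, ignoring ram_allocation_ratio.
    available = total_mb - used_mb
    return required_mb <= available

def fits_with_oversubscription(required_mb, total_mb, used_mb, ratio):
    # Expected behaviour for NUMA instances without hugepages: the
    # limit is total * ram_allocation_ratio, as in the non-NUMA path.
    return used_mb + required_mb <= total_mb * ratio

# Figures from the log: required 10240, available 5676, total 15916.
total, used, required = 15916, 15916 - 5676, 10240
print(fits_without_oversubscription(required, total, used))    # False
print(fits_with_oversubscription(required, total, used, 1.5))  # True
```

With the strict check the second 10240 MB cell cannot fit into 5676 MB of free memory, while an oversubscription-aware check against total * ratio would admit it.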

Expected results:

The second instance should boot.

Additional info:

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001459.html
[2] https://bugs.launchpad.net/nova/+bug/1734204
[3] https://review.openstack.org/#/c/532168

Comment 6 Joe H. Rahme 2019-04-09 12:55:49 UTC
Verification steps:

# 2 compute nodes with ~6GB of memory each
 
        [stack@undercloud-0 ~]$ for i in 6 8; do ssh heat-admin.24.$i 'echo $(hostname) $(grep MemTotal /proc/meminfo)'; done
        compute-1 MemTotal: 5944884 kB
        compute-0 MemTotal: 5944892 kB
 
# Create a large flavor with numa_nodes
 
        [stack@undercloud-0 ~]$ openstack flavor create --vcpu 2 --disk 0 --ram 4096 test.numa
        [stack@undercloud-0 ~]$ openstack flavor set test.numa --property hw:numa_nodes=1
 
 
# Boot 2 instances with this flavor. This works because each instance lands on a separate compute node
 
        [stack@undercloud-0 ~]$ nova boot --poll --image cirros --flavor test.numa test1 --nic net-id=353d787b-7788-40b0-aaff-a0ab2325b64e
        [stack@undercloud-0 ~]$ nova boot --poll --image cirros --flavor test.numa test2 --nic net-id=353d787b-7788-40b0-aaff-a0ab2325b64e
 
 
# Negative test: booting a third instance fails with a 'No valid host' error
 
        [stack@undercloud-0 ~]$ nova boot --poll --image cirros --flavor test.numa test3 --nic net-id=353d787b-7788-40b0-aaff-a0ab2325b64e
       
# Set `ram_allocation_ratio` to 2.0 in nova.conf on the compute node

	[heat-admin@compute-1 ~]$ sudo grep ram_allocation_ratio /etc/nova/nova.conf
	ram_allocation_ratio=2.0

# Boot a 4th instance; it boots successfully

	[stack@undercloud-0 ~]$ nova boot --poll --image cirros --flavor test.numa test4 --nic net-id=353d787b-7788-40b0-aaff-a0ab2325b64e
	[stack@undercloud-0 ~]$ nova list
	+--------------------------------------+-------+--------+------------+-------------+------------------------+
	| ID                                   | Name  | Status | Task State | Power State | Networks               |
	+--------------------------------------+-------+--------+------------+-------------+------------------------+
	| 4baccd63-0a8e-4288-97a0-b2b449d45a39 | test1 | ACTIVE | -          | Running     | private=192.168.100.9  |
	| ff0a5dd2-a1b8-4937-a3e9-c8a45f5253dd | test2 | ACTIVE | -          | Running     | private=192.168.100.6  |
	| 5bb3597c-a193-479a-9292-6d652b799a66 | test3 | ERROR  | -          | NOSTATE     |                        |
	| 81ce205a-1a15-48f6-8055-3c1a39334602 | test4 | ACTIVE | -          | Running     | private=192.168.100.16 |
	+--------------------------------------+-------+--------+------------+-------------+------------------------+
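
The arithmetic behind this verification can be sketched as follows. The helper is hypothetical and deliberately ignores reserved host memory and other scheduler filters; it assumes an effective ratio of 1.0 on the NUMA path before the config change, which is enough to reproduce the observed pass/fail pattern.

```python
# Illustrative capacity check for the verification above (values from
# /proc/meminfo on the compute nodes; helper name is made up).
def instances_fit(count, flavor_ram_mb, host_total_mb, ratio):
    """True if `count` instances of the flavor fit under the ratio."""
    return count * flavor_ram_mb <= host_total_mb * ratio

host_mb = 5944884 // 1024  # ~5805 MB per compute node

print(instances_fit(1, 4096, host_mb, 1.0))  # True: test1/test2 each fit
print(instances_fit(2, 4096, host_mb, 1.0))  # False: test3 -> No valid host
print(instances_fit(2, 4096, host_mb, 2.0))  # True: test4 boots after ratio=2.0
```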


# Package version:
 openstack-nova-common.noarch     1:14.1.0-44.el7ost    @rhos-10.0-signed

Comment 8 errata-xmlrpc 2019-04-30 16:59:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0923

