Bug 1741214 - Nova scheduler fails spawning if a NUMA is close to exhaustion
Summary: Nova scheduler fails spawning if a NUMA is close to exhaustion
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Artom Lifshitz
QA Contact: nova-maint
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-14 13:57 UTC by Ignacio
Modified: 2019-08-15 21:17 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-15 21:17:17 UTC
Target Upstream Version:




Links
Red Hat Bugzilla 1741217 (CLOSED): Scheduler fails spawning if total instances memory does not fit in one NUMA (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1741217

Description Ignacio 2019-08-14 13:57:52 UTC
Description of problem:
Conditions:
CPU pinning and huge pages enabled on the computes.
4 NUMA nodes, each with 12 physical threads and 30 huge pages of 1 GB.
PCI affinity not used.
pinned=true metadata added to the host aggregate (see the example commands below).
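
For illustration, the aggregate setup described above could be achieved along these lines (the aggregate and host names are hypothetical):

    # Create the aggregate and tag it for pinned workloads
    openstack aggregate create pinned-aggregate
    openstack aggregate set --property pinned=true pinned-aggregate
    # Add the two computes (hostnames are placeholders)
    openstack aggregate add host pinned-aggregate compute-0
    openstack aggregate add host pinned-aggregate compute-1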

If a NUMA node is close to memory exhaustion but still has free vCPUs, and I deploy an instance that would exhaust the memory on that node, the scheduler will allow it, try to allocate the memory without success, and then retry on another host.
Unfortunately, the first attempt will have exhausted the memory of the node and triggered the OOM killer, which will randomly kill a qemu process or sometimes nova-compute itself.

Version-Release number of selected component (if applicable):
openstack-nova-scheduler-16.1.4-6.el7ost.noarch

How reproducible:
Reproducible by the customer.

Steps to Reproduce:
Create an aggregate of 2 computes.
STEP 1:
Spawn 10 instances (2 vCPUs, 4 GB each) on the 2 computes.
Scheduling is as expected: well balanced, and the system tends to fill NUMA node 0 first.

STEP 2: Spawn one instance (4 vCPUs, 12 GB) on the same aggregate of 2 computes. The scheduler tries host 1: the instance fits on NUMA node 0 in terms of vCPUs, but the memory allocation fails and the OOM killer is triggered. The scheduler then retries on the second host, where the same memory allocation failure occurs on its NUMA node 0; the instance is finally deployed on host 2, NUMA node 1, but the OOM killer is also triggered on this host, killing nova-compute and qemu processes on NUMA node 0.
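
For illustration, flavors matching the two steps could be created along these lines (flavor names and the exact extra specs are assumptions based on the conditions above; note that hw:mem_page_size is deliberately NOT set, which is what leads to the behaviour reported here):

    # STEP 1 flavor: 2 vCPUs, 4 GB, pinned CPUs, targeted at the aggregate
    openstack flavor create --vcpus 2 --ram 4096 --disk 10 pinned.small
    openstack flavor set pinned.small \
        --property hw:cpu_policy=dedicated \
        --property aggregate_instance_extra_specs:pinned=true
    # STEP 2 flavor: 4 vCPUs, 12 GB
    openstack flavor create --vcpus 4 --ram 12288 --disk 10 pinned.large
    openstack flavor set pinned.large \
        --property hw:cpu_policy=dedicated \
        --property aggregate_instance_extra_specs:pinned=true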

Actual results:
The OOM killer is triggered, killing nova-compute and qemu processes on NUMA node 0.

Expected results:
The scheduler should deploy the instance with its memory mode set to "preferred" so that it can allocate memory from any NUMA node and avoid triggering the OOM killer.

Additional info:

Comment 4 Artom Lifshitz 2019-08-15 21:17:17 UTC
> STEP 2: Spawn one instance (4 vCPUs, 12 GB) on the same aggregate of 2
> computes. The scheduler tries host 1: the instance fits on NUMA node 0 in
> terms of vCPUs, but the memory allocation fails and the OOM killer is
> triggered. The scheduler then retries on the second host, where the same
> memory allocation failure occurs on its NUMA node 0; the instance is
> finally deployed on host 2, NUMA node 1, but the OOM killer is also
> triggered on this host, killing nova-compute and qemu processes on NUMA
> node 0.

This can happen because Nova is only able to do memory accounting on a per-NUMA-node basis if a page size is set. While we can know the per-node memory usage of a VM that has the hw:numa_nodes extra spec set, and thus has its memory pinned to specific NUMA node(s), we're unable to do the same for "floating" VMs, and thus we don't have enough information to decide whether a particular NUMA node has enough memory for a VM.

The workaround is to set hw:mem_page_size in the flavor extra specs. Even for 4K pages, memory usage is then tracked on a per-node basis.
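
For example, the workaround could be applied along these lines (the flavor name is hypothetical; hw:mem_page_size also accepts values such as "large", "any", or an explicit size like "1GB"):

    # Request an explicit page size so Nova does per-NUMA-node memory
    # accounting; "small" means the default small pages (4K on x86_64)
    openstack flavor set pinned.small --property hw:mem_page_size=small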

> The scheduler should deploy the instance with its memory mode set to
> "preferred" so that it can allocate memory from any NUMA node and avoid
> triggering the OOM killer.

While the hw:numa_mempolicy flavor extra spec is indeed mentioned in the OSP documentation, this is a mistake: it was never implemented for libvirt upstream (we've filed bz 1741702 to fix our documentation). In addition, hw:numa_mempolicy is unlikely to ever be implemented upstream; as such, I'm closing this bug as WONTFIX.

Lemme know if you need further information about the workaround.

