Bug 1296882

Summary: Mixing flavors using "hw:cpu_policy": "dedicated" with flavors that set no CPU policy can result in an instance being scheduled on a NUMA node even if there is not enough free memory available
Product: Red Hat OpenStack
Component: openstack-nova
Version: 6.0 (Juno)
Target Release: 6.0 (Juno)
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Keywords: ZStream
Status: CLOSED NOTABUG
Reporter: Martin Schuppert <mschuppe>
Assignee: Sahid Ferdjaoui <sferdjao>
QA Contact: nlevinki <nlevinki>
CC: berrange, dasmith, eglynn, kchamart, sbauza, sferdjao, sgordon, srevivo, vromanso
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-12-26 09:19:37 UTC

Description Martin Schuppert 2016-01-08 10:52:48 UTC
Description of problem:
Scheduling instances on the same compute node, where some use "hw:cpu_policy": "dedicated" and others set no CPU policy, can result in an instance being placed on a NUMA node without enough free memory. As a result, the OOM killer will kick in at some point.

It is known that mixing instances whose flavors use "hw:cpu_policy": "dedicated" with instances that do not use a dedicated policy is not recommended, as they might end up sharing the same pCPUs [1]:

~~~
The scheduler will have to be enhanced so that it considers the usage of CPUs by existing guests. Use of a dedicated CPU policy will have to be accompanied by the setup of aggregates to split the hosts into two groups, one allowing overcommit of shared pCPUs and the other only allowing dedicated CPU guests. ie we do not want a situation with dedicated CPU and shared CPU guests on the same host. It is likely that the administrator will already need to setup host aggregates for the purpose of using huge pages for guest RAM. The same grouping will be usable for both dedicated RAM (via huge pages) and dedicated CPUs (via pinning).
~~~

But it seems that, as a side effect of the above, when an instance is spawned with a non-dedicated CPU policy, the next instance that uses the dedicated policy can be placed on NUMA node 0 even if there is not enough memory available there.

The reproducer steps below should make the situation clear.

Version-Release number of selected component (if applicable):
# rpm -qa |grep nova
openstack-nova-compute-2014.2.3-48.el7ost.noarch
openstack-nova-cert-2014.2.3-48.el7ost.noarch
openstack-nova-common-2014.2.3-48.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-console-2014.2.3-48.el7ost.noarch
openstack-nova-scheduler-2014.2.3-48.el7ost.noarch
openstack-nova-api-2014.2.3-48.el7ost.noarch
openstack-nova-novncproxy-2014.2.3-48.el7ost.noarch
python-nova-2014.2.3-48.el7ost.noarch
openstack-nova-conductor-2014.2.3-48.el7ost.noarch

How reproducible:

Reproducer - hardware with 2 NUMA nodes, 8 GB each:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 8162 MB
node 0 free: 6422 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 8192 MB
node 1 free: 6008 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

/etc/nova/nova.conf:

~~~
ram_allocation_ratio=1.0
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,CoreFilter,DifferentHostFilter,AggregateInstanceExtraSpecsFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,NUMATopologyFilter
~~~

# nova flavor-show dedicated1
+----------------------------+-------------------------------------------------------------------+
| Property                   | Value                                                             |
+----------------------------+-------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                             |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                 |
| disk                       | 3                                                                 |
| extra_specs                | {"hw:cpu_policy": "dedicated", "hw:cpu_threads_policy": "prefer"} |
| id                         | 51                                                                |
| name                       | dedicated1                                                        |
| os-flavor-access:is_public | True                                                              |
| ram                        | 5000                                                              |
| rxtx_factor                | 1.0                                                               |
| swap                       |                                                                   |
| vcpus                      | 1                                                                 |
+----------------------------+-------------------------------------------------------------------+
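
For reference, a flavor like the one shown above could have been created along these lines; the exact commands are not part of this report, so treat this as a sketch:

~~~
# name/id/ram/disk/vcpus match the flavor-show output above
nova flavor-create dedicated1 51 5000 3 1
nova flavor-key dedicated1 set hw:cpu_policy=dedicated hw:cpu_threads_policy=prefer
~~~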

For testing we make sure that the compute node cannot swap:
# swapoff -a
# nova net-list
+--------------------------------------+---------+------+
| ID                                   | Label   | CIDR |
+--------------------------------------+---------+------+
| 2159466c-23cc-45d7-aaea-90fac2c0d3fc | private | None |
| 5f2dc926-4d44-4b90-b326-e647abcb2961 | public  | None |
+--------------------------------------+---------+------+

# nova image-list
+--------------------------------------+--------+--------+--------+
| ID                                   | Name   | Status | Server |
+--------------------------------------+--------+--------+--------+
| 47331f47-19f0-45d0-8447-94d9edaeea8c | cirros | ACTIVE |        |
| 34d34283-037c-44e7-957a-05245dab9896 | fedora | ACTIVE |        |
+--------------------------------------+--------+--------+--------+

1) restart nova 
# openstack-service restart nova

2016-01-08 02:57:48.501 24122 DEBUG nova.openstack.common.service [-] ram_allocation_ratio           = 1.0 log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.501 24122 DEBUG nova.openstack.common.service [-] ram_weight_multiplier          = 1.0 log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.508 24122 DEBUG nova.openstack.common.service [-] scheduler_available_filters    = ['nova.scheduler.filters.all_filters'] log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.508 24122 DEBUG nova.openstack.common.service [-] scheduler_default_filters      = ['RetryFilter', 'AvailabilityZoneFilter', 'RamFilter', 'ComputeFilter', 'ComputeCapabilitiesFilter', 'ImagePropertiesFilter', 'CoreFilter', 'DifferentHostFilter', 'AggregateInstanceExtraSpecsFilter', 'ServerGroupAntiAffinityFilter', 'ServerGroupAffinityFilter', 'PciPassthroughFilter', 'NUMATopologyFilter'] log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996

2) dedicated
# nova boot --flavor dedicated1 --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root dedi1

3) small:
# nova boot --flavor m1.small --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root small1

4) dedicated
# nova boot --flavor dedicated1 --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root dedi2

# cd /etc/libvirt/qemu

Both "big" instances (sum 10GB RAM) are on numa node 0 :
# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>

# nova list
+--------------------------------------+--------+--------+------------+-------------+-------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks          |
+--------------------------------------+--------+--------+------------+-------------+-------------------+
| 138983e1-315c-4589-bdce-630407ad3232 | dedi1  | ACTIVE | -          | Running     | private=10.0.0.20 |
| 76479219-ceac-4859-ab8b-4af53112ae4c | dedi2  | ACTIVE | -          | Running     | private=10.0.0.22 |
| 17c44b69-39f0-442e-9064-251fb9bfc318 | small1 | ACTIVE | -          | Running     | private=10.0.0.19 |
+--------------------------------------+--------+--------+------------+-------------+-------------------+

# ip netns
qdhcp-2159466c-23cc-45d7-aaea-90fac2c0d3fc
qrouter-7e4ff38a-6fb2-48c4-8c9a-34e675f6fa01

# ip netns exec qdhcp-2159466c-23cc-45d7-aaea-90fac2c0d3fc bash

# scp stress-1.0.2-1.el7.rf.x86_64.rpm fedora.0.20:
# scp stress-1.0.2-1.el7.rf.x86_64.rpm fedora.0.22:

Log in to both and run stress ... one gets killed:

Jan  7 11:42:13 cisco-b420m3-01 kernel: Out of memory: Kill process 22865 (qemu-kvm) score 274 or sacrifice child
Jan  7 11:42:13 cisco-b420m3-01 kernel: Killed process 22865 (qemu-kvm) total-vm:9967432kB, anon-rss:2864168kB, file-rss:8792kB
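
The exact stress invocation is not recorded in this report; something along these lines inside each guest is enough to dirty most of the guest RAM and, with both 5 GB guests pinned to the same 8 GB node, push that node into OOM (arguments are illustrative):

~~~
# ~4 GB allocated and kept dirty per guest; two such guests on one node exceed its 8 GB
stress --vm 2 --vm-bytes 2G --vm-keep --timeout 300
~~~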

Not reproduced with:
* restart nova
* dedi1
* dedi2
* dedi3

# nova list
+--------------------------------------+-------+--------+------------+-------------+-------------------+
| ID                                   | Name  | Status | Task State | Power State | Networks          |
+--------------------------------------+-------+--------+------------+-------------+-------------------+
| f4ec1874-38ba-469a-995e-a4dd9b7b473c | dedi1 | ACTIVE | -          | Running     | private=10.0.0.11 |
| 0559c2d3-5bf8-4c46-8b79-52b600cdd9d4 | dedi2 | ACTIVE | -          | Running     | private=10.0.0.12 |
| 14fe7247-8c2e-4d8f-a1bd-89ccee3450ab | dedi3 | ERROR  | -          | NOSTATE     |                   |
+--------------------------------------+-------+--------+------------+-------------+-------------------+

2016-01-08 03:22:25.884 31001 INFO nova.filters [req-c42621ab-72cc-4ab4-b6b1-394ba1f8c96f None] Filter NUMATopologyFilter returned 0 hosts

===

Another test scenario:

* Flavor with 3GB RAM
# nova flavor-show dedicated
+----------------------------+--------------------------------+
| Property                   | Value                          |
+----------------------------+--------------------------------+
| OS-FLV-DISABLED:disabled   | False                          |
| OS-FLV-EXT-DATA:ephemeral  | 0                              |
| disk                       | 3                              |
| extra_specs                | {"hw:cpu_policy": "dedicated"} |
| id                         | 53                             |
| name                       | dedicated                      |
| os-flavor-access:is_public | True                           |
| ram                        | 3000                           |
| rxtx_factor                | 1.0                            |
| swap                       |                                |
| vcpus                      | 1                              |
+----------------------------+--------------------------------+

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 8162 MB
node 0 free: 6422 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 8192 MB
node 1 free: 6008 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

Sequence is:
* 3 instances - flavor dedicated
* 1 instance  - flavor small
* 1 instance  - flavor dedicated
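
The individual boot commands are not repeated here; they follow the same pattern as in the first scenario, roughly as sketched below (instance names are illustrative):

~~~
NET=2159466c-23cc-45d7-aaea-90fac2c0d3fc   # private network from above
IMG=34d34283-037c-44e7-957a-05245dab9896   # fedora image from above

# three pinned instances, then one non-pinned, then one more pinned
for name in dedi1 dedi2 dedi3; do
    nova boot --flavor dedicated --nic net-id=$NET --image $IMG \
        --security-groups default --key-name root $name
done
nova boot --flavor m1.small  --nic net-id=$NET --image $IMG \
    --security-groups default --key-name root small1
nova boot --flavor dedicated --nic net-id=$NET --image $IMG \
    --security-groups default --key-name root dedi4
~~~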

Result:
* instances 1 and 2: NUMA node 0
* instance 3: NUMA node 1 (expected)
* instance 4, using the default small flavor, can be scheduled wherever possible
* instance 5 gets scheduled on NUMA node 0
  => expected to either be scheduled on node 1, or to fail if there is not enough memory available on the compute node

# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memory mode='strict' nodeset='1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>

# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
2849 (qemu-kvm)     294      8   302
3025 (qemu-kvm)     294      8   303
3654 (qemu-kvm)       1    300   301
4347 (qemu-kvm)      82    210   292
4720 (qemu-kvm)     294      8   302
---------------  ------ ------ -----
Total               964    536  1500

When scheduling 4 instances with the dedicated flavor in a row, we see the expected behavior:
# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memory mode='strict' nodeset='1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memory mode='strict' nodeset='1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>

[1] https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html

Comment 2 Sahid Ferdjaoui 2016-12-26 09:19:37 UTC
Operators are expected to use host aggregates to isolate VMs that use CPU pinning (and NUMA-related functionality in general) from the rest of the world.
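
For reference, a common way to implement that isolation (a sketch only; host names and the "pinned" metadata key are illustrative) is to split the computes into two host aggregates and tie flavors to them via the AggregateInstanceExtraSpecsFilter already enabled in the filter list above:

~~~
# aggregate for hosts that only run pinned guests
nova aggregate-create pinned-hosts
nova aggregate-set-metadata pinned-hosts pinned=true
nova aggregate-add-host pinned-hosts compute-0.localdomain
nova flavor-key dedicated1 set aggregate_instance_extra_specs:pinned=true

# aggregate for hosts that allow shared/overcommitted guests
nova aggregate-create unpinned-hosts
nova aggregate-set-metadata unpinned-hosts pinned=false
nova aggregate-add-host unpinned-hosts compute-1.localdomain
nova flavor-key m1.small set aggregate_instance_extra_specs:pinned=false
~~~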