Description of problem:

Scheduling instances that use "hw:cpu_policy": "dedicated" together with instances that use no CPU policy on the same compute can result in an instance being scheduled on a NUMA node without enough available memory. As a result, the OOM killer will step in at some point. It is known that mixing instances whose flavors use "hw:cpu_policy": "dedicated" with non-dedicated instances is not recommended, as they might use the same pCPUs [1]:

~~~
The scheduler will have to be enhanced so that it considers the usage of CPUs
by existing guests. Use of a dedicated CPU policy will have to be accompanied
by the setup of aggregates to split the hosts into two groups, one allowing
overcommit of shared pCPUs and the other only allowing dedicated CPU guests.
ie we do not want a situation with dedicated CPU and shared CPU guests on the
same host. It is likely that the administrator will already need to setup host
aggregates for the purpose of using huge pages for guest RAM. The same grouping
will be usable for both dedicated RAM (via huge pages) and dedicated CPUs (via
pinning).
~~~

But as a side effect of the above, when an instance is spawned using a non-dedicated CPU policy, the next instance that uses the dedicated policy can end up on NUMA node 0 even if there is not enough memory available there. The reproducer steps below should make the situation clear.
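For contrast, the following is a hedged model (illustrative only, NOT Nova's actual code; node sizes and the 3000 MB flavor are taken from the second test scenario in the reproducer below) of what correct per-node memory accounting would decide for a sequence of pinned guests:

```python
# Hedged first-fit model of per-node memory claims, NOT Nova's code.
# Node capacities are the ones reported by `numactl --hardware` below.
NODE_MB = {0: 8162, 1: 8192}

def place(usage, ram_mb):
    """Pick the first node whose claimed memory still fits, and claim it."""
    for node, cap in NODE_MB.items():
        if usage[node] + ram_mb <= cap:
            usage[node] += ram_mb
            return node
    return None  # no node fits -> NoValidHost

usage = {0: 0, 1: 0}
# Three pinned 3000 MB guests: correct accounting packs two on node 0
# and pushes the third to node 1 (3000 * 3 = 9000 MB > 8162 MB).
print([place(usage, 3000) for _ in range(3)])  # -> [0, 0, 1]
# A fourth pinned guest also belongs on node 1 for the same reason.
print(place(usage, 3000))                      # -> 1
```

This matches the "4 dedicated instances in a row" behavior shown at the end of the report; the bug is that booting a shared-CPU guest in between makes the next pinned guest land on node 0 anyway, as if the earlier claims were no longer visible.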
Version-Release number of selected component (if applicable):

# rpm -qa | grep nova
openstack-nova-compute-2014.2.3-48.el7ost.noarch
openstack-nova-cert-2014.2.3-48.el7ost.noarch
openstack-nova-common-2014.2.3-48.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-console-2014.2.3-48.el7ost.noarch
openstack-nova-scheduler-2014.2.3-48.el7ost.noarch
openstack-nova-api-2014.2.3-48.el7ost.noarch
openstack-nova-novncproxy-2014.2.3-48.el7ost.noarch
python-nova-2014.2.3-48.el7ost.noarch
openstack-nova-conductor-2014.2.3-48.el7ost.noarch

How reproducible:

Reproducer - hardware with 2 NUMA nodes, each with 8 GB:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 8162 MB
node 0 free: 6422 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 8192 MB
node 1 free: 6008 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

/etc/nova/nova.conf:
~~~
ram_allocation_ratio=1.0
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,CoreFilter,DifferentHostFilter,AggregateInstanceExtraSpecsFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,NUMATopologyFilter
~~~

# nova flavor-show dedicated1
+----------------------------+-------------------------------------------------------------------+
| Property                   | Value                                                             |
+----------------------------+-------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                             |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                 |
| disk                       | 3                                                                 |
| extra_specs                | {"hw:cpu_policy": "dedicated", "hw:cpu_threads_policy": "prefer"} |
| id                         | 51                                                                |
| name                       | dedicated1                                                        |
| os-flavor-access:is_public | True                                                              |
| ram                        | 5000                                                              |
| rxtx_factor                | 1.0                                                               |
| swap                       |                                                                   |
| vcpus                      | 1                                                                 |
+----------------------------+-------------------------------------------------------------------+

For testing we make sure that the compute cannot swap:

# swapoff -a

# nova net-list
+--------------------------------------+---------+------+
| ID                                   | Label   | CIDR |
+--------------------------------------+---------+------+
| 2159466c-23cc-45d7-aaea-90fac2c0d3fc | private | None |
| 5f2dc926-4d44-4b90-b326-e647abcb2961 | public  | None |
+--------------------------------------+---------+------+

# nova image-list
+--------------------------------------+--------+--------+--------+
| ID                                   | Name   | Status | Server |
+--------------------------------------+--------+--------+--------+
| 47331f47-19f0-45d0-8447-94d9edaeea8c | cirros | ACTIVE |        |
| 34d34283-037c-44e7-957a-05245dab9896 | fedora | ACTIVE |        |
+--------------------------------------+--------+--------+--------+

1) restart nova:

# openstack-service restart nova

2016-01-08 02:57:48.501 24122 DEBUG nova.openstack.common.service [-] ram_allocation_ratio = 1.0 log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.501 24122 DEBUG nova.openstack.common.service [-] ram_weight_multiplier = 1.0 log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.508 24122 DEBUG nova.openstack.common.service [-] scheduler_available_filters = ['nova.scheduler.filters.all_filters'] log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996
2016-01-08 02:57:48.508 24122 DEBUG nova.openstack.common.service [-] scheduler_default_filters = ['RetryFilter', 'AvailabilityZoneFilter', 'RamFilter', 'ComputeFilter', 'ComputeCapabilitiesFilter', 'ImagePropertiesFilter', 'CoreFilter', 'DifferentHostFilter', 'AggregateInstanceExtraSpecsFilter', 'ServerGroupAntiAffinityFilter', 'ServerGroupAffinityFilter', 'PciPassthroughFilter', 'NUMATopologyFilter'] log_opt_values /usr/lib/python2.7/site-packages/oslo/config/cfg.py:1996

2) dedicated:

# nova boot --flavor dedicated1 --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root dedi1

3) small:

# nova boot --flavor m1.small --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root small1

4) dedicated:

# nova boot --flavor dedicated1 --nic net-id=2159466c-23cc-45d7-aaea-90fac2c0d3fc --image 34d34283-037c-44e7-957a-05245dab9896 --security-groups default --key-name root dedi2

# cd /etc/libvirt/qemu

Both "big" instances (10 GB RAM in total) are on NUMA node 0:

# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>

# nova list
+--------------------------------------+--------+--------+------------+-------------+-------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks          |
+--------------------------------------+--------+--------+------------+-------------+-------------------+
| 138983e1-315c-4589-bdce-630407ad3232 | dedi1  | ACTIVE | -          | Running     | private=10.0.0.20 |
| 76479219-ceac-4859-ab8b-4af53112ae4c | dedi2  | ACTIVE | -          | Running     | private=10.0.0.22 |
| 17c44b69-39f0-442e-9064-251fb9bfc318 | small1 | ACTIVE | -          | Running     | private=10.0.0.19 |
+--------------------------------------+--------+--------+------------+-------------+-------------------+

# ip netns
qdhcp-2159466c-23cc-45d7-aaea-90fac2c0d3fc
qrouter-7e4ff38a-6fb2-48c4-8c9a-34e675f6fa01

# ip netns exec qdhcp-2159466c-23cc-45d7-aaea-90fac2c0d3fc bash
# scp stress-1.0.2-1.el7.rf.x86_64.rpm fedora@10.0.0.20:
# scp stress-1.0.2-1.el7.rf.x86_64.rpm fedora@10.0.0.22:

Log in to both instances and run stress ...
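The arithmetic behind the OOM that follows can be sketched directly from the numbers above (a back-of-the-envelope check, using the node size from `numactl --hardware` and the 5000 MB flavor RAM):

```shell
# Two 5000 MB pinned guests strapped strictly to node 0 exceed the
# node's 8162 MB, so once both guests touch their memory the kernel
# has to OOM-kill one of them.
node0_mb=8162
committed_mb=$((5000 + 5000))   # dedi1 + dedi2, both strict on nodeset 0
if [ "$committed_mb" -gt "$node0_mb" ]; then
    echo "node 0 overcommitted by $((committed_mb - node0_mb)) MB"
fi
```

With swap disabled there is no escape valve, which is why the reproducer runs `swapoff -a` first.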
One of the instances gets killed:

Jan 7 11:42:13 cisco-b420m3-01 kernel: Out of memory: Kill process 22865 (qemu-kvm) score 274 or sacrifice child
Jan 7 11:42:13 cisco-b420m3-01 kernel: Killed process 22865 (qemu-kvm) total-vm:9967432kB, anon-rss:2864168kB, file-rss:8792kB

Not reproduced with:
* restart nova
* dedi1
* dedi2
* dedi3

# nova list
+--------------------------------------+-------+--------+------------+-------------+-------------------+
| ID                                   | Name  | Status | Task State | Power State | Networks          |
+--------------------------------------+-------+--------+------------+-------------+-------------------+
| f4ec1874-38ba-469a-995e-a4dd9b7b473c | dedi1 | ACTIVE | -          | Running     | private=10.0.0.11 |
| 0559c2d3-5bf8-4c46-8b79-52b600cdd9d4 | dedi2 | ACTIVE | -          | Running     | private=10.0.0.12 |
| 14fe7247-8c2e-4d8f-a1bd-89ccee3450ab | dedi3 | ERROR  | -          | NOSTATE     |                   |
+--------------------------------------+-------+--------+------------+-------------+-------------------+

2016-01-08 03:22:25.884 31001 INFO nova.filters [req-c42621ab-72cc-4ab4-b6b1-394ba1f8c96f None] Filter NUMATopologyFilter returned 0 hosts

=== Another test scenario:

* Flavor with 3 GB RAM

# nova flavor-show dedicated
+----------------------------+--------------------------------+
| Property                   | Value                          |
+----------------------------+--------------------------------+
| OS-FLV-DISABLED:disabled   | False                          |
| OS-FLV-EXT-DATA:ephemeral  | 0                              |
| disk                       | 3                              |
| extra_specs                | {"hw:cpu_policy": "dedicated"} |
| id                         | 53                             |
| name                       | dedicated                      |
| os-flavor-access:is_public | True                           |
| ram                        | 3000                           |
| rxtx_factor                | 1.0                            |
| swap                       |                                |
| vcpus                      | 1                              |
+----------------------------+--------------------------------+

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 8162 MB
node 0 free: 6422 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 8192 MB
node 1 free: 6008 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Sequence is:
* 3 instances - flavor dedicated
* 1 instance - flavor small
* 1 instance - flavor dedicated

Result:
* instances 1/2 on NUMA node 0
* instance 3 on NUMA node 1 (expected)
* instance 4, using the default small flavor, can be scheduled wherever possible
* instance 5 gets scheduled on NUMA node 0 => expected to either be scheduled on node 1, or to fail if there is not enough memory available on the compute

# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='1'/>
<memnode cellid='0' mode='strict' nodeset='1'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>

# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
2849 (qemu-kvm)     294      8   302
3025 (qemu-kvm)     294      8   303
3654 (qemu-kvm)       1    300   301
4347 (qemu-kvm)      82    210   292
4720 (qemu-kvm)     294      8   302
---------------  ------ ------ -----
Total               964    536  1500

When scheduling 4 instances with the dedicated flavor in a row, we see the expected behavior:

# for i in instance-000000* ; do virsh dumpxml ${i%%.*} | grep node; done
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='1'/>
<memnode cellid='0' mode='strict' nodeset='1'/>
<memory mode='strict' nodeset='1'/>
<memnode cellid='0' mode='strict' nodeset='1'/>

[1] https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html
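The faulty placement in the mixed run is easy to tally from the captured domain XML: count the strict `<memory>` pinnings per nodeset. A minimal sketch (using a here-doc with the grep output from the faulty run above as a stand-in for the live `virsh dumpxml` loop):

```shell
# Tally strict-pinned guests per NUMA node from dumpxml grep output.
# The <memnode> lines are skipped; only the top-level <memory> pinning
# counts one guest each.
counts=$(grep "<memory mode='strict'" <<'EOF' | grep -o "nodeset='[01]'" | sort | uniq -c
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memory mode='strict' nodeset='1'/>
<memnode cellid='0' mode='strict' nodeset='1'/>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
EOF
)
echo "$counts"   # three pinned guests on node 0, one on node 1
```

Three 3000 MB pinned guests on node 0 commit 9000 MB against its 8162 MB, which is exactly the overcommit the NUMATopologyFilter should have prevented.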
Operators are expected to use host aggregates to isolate VMs that use pinning (and, more broadly, NUMA-related functionality) from the rest of the workloads.
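Such a split could look like the following sketch (hostnames and aggregate names are placeholders; the `pinned` metadata key is an arbitrary choice that works together with the AggregateInstanceExtraSpecsFilter already enabled in nova.conf above):

```shell
# Hypothetical aggregate split - hostnames are placeholders.
nova aggregate-create pinned-hosts
nova aggregate-set-metadata pinned-hosts pinned=true
nova aggregate-add-host pinned-hosts compute-0.localdomain

nova aggregate-create unpinned-hosts
nova aggregate-set-metadata unpinned-hosts pinned=false
nova aggregate-add-host unpinned-hosts compute-1.localdomain

# Tie the dedicated flavor to the pinned aggregate ...
nova flavor-key dedicated1 set aggregate_instance_extra_specs:pinned=true
# ... and keep shared-CPU flavors off those hosts.
nova flavor-key m1.small set aggregate_instance_extra_specs:pinned=false
```

With this in place, dedicated and shared-CPU guests can no longer land on the same compute, which sidesteps the accounting problem described in this report.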