Description of problem:

As described in [1], the fix for [2] appears to have inadvertently broken oversubscription of memory for instances with a NUMA topology but no hugepages.

Version-Release number of selected component (if applicable):

N/A

How reproducible:

Always.

Steps to Reproduce:

1. Create a flavor that will consume > 50% of the available memory on your host(s) and specify an explicit NUMA topology. For example, on my all-in-one deployment where the host has 32GB RAM, we will request a 20GB instance:

   $ openstack flavor create --vcpu 2 --disk 0 --ram 20480 test.numa
   $ openstack flavor set test.numa --property hw:numa_nodes=2

2. Boot an instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test

3. Boot another instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2

Actual results:

The second instance fails to boot. We see the following error messages in the logs.

   nova-scheduler[18295]: DEBUG nova.virt.hardware [None req-f7a6594b-8d25-424c-9c6e-8522f66ffd22 demo admin] No specific pagesize requested for instance, selected pagesize: 4 {{(pid=18318) _numa_fit_instance_cell /opt/stack/nova/nova/virt/hardware.py:1045}}
   nova-scheduler[18295]: DEBUG nova.virt.hardware [None req-f7a6594b-8d25-424c-9c6e-8522f66ffd22 demo admin] Not enough available memory to schedule instance with pagesize 4. Required: 10240, available: 5676, total: 15916. {{(pid=18318) _numa_fit_instance_cell /opt/stack/nova/nova/virt/hardware.py:1055}}

Reverting the patch that addressed the bug [3] restores the correct behaviour and the instance boots. However, reverting also discards whatever benefits that change gave us.

Expected results:

The second instance should boot.

Additional info:

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001459.html
[2] https://bugs.launchpad.net/nova/+bug/1734204
[3] https://review.openstack.org/#/c/532168
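The numbers in the log message are consistent with a per-cell check that compares the request against free memory only. The following is a minimal sketch of that arithmetic, not nova's actual code: the function names `fits_strict` and `fits_with_overcommit` are hypothetical, the even split of flavor RAM across NUMA nodes is assumed, and the ratio of 1.5 is an assumed value used purely for illustration, since the report does not state the configured ratio.

```python
# Simplified model of a per-NUMA-cell memory check; not nova's actual code.


def fits_strict(required_mb, total_mb, used_mb):
    """Check against free memory only, ignoring any overcommit ratio."""
    return required_mb <= total_mb - used_mb


def fits_with_overcommit(required_mb, total_mb, used_mb, ratio):
    """Check against the cell total scaled by the RAM allocation ratio."""
    return used_mb + required_mb <= total_mb * ratio


FLAVOR_RAM_MB = 20480   # --ram 20480
NUMA_NODES = 2          # hw:numa_nodes=2
CELL_TOTAL_MB = 15916   # "total: 15916" from the log
ASSUMED_RATIO = 1.5     # assumption: not stated in the report

per_cell_mb = FLAVOR_RAM_MB // NUMA_NODES   # matches "Required" in the log
used_mb = per_cell_mb                       # first instance already placed
available_mb = CELL_TOTAL_MB - used_mb      # matches "available" in the log

print(per_cell_mb, available_mb)
print(fits_strict(per_cell_mb, CELL_TOTAL_MB, used_mb))
print(fits_with_overcommit(per_cell_mb, CELL_TOTAL_MB, used_mb, ASSUMED_RATIO))
```

Under these assumptions the strict check rejects the second instance, while a check that honours the allocation ratio would accept it.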
Test case A:

This is a positive test case.

1. Configure the host to use a RAM overcommit ratio of 16.0, adding the following to nova.conf:

   ram_allocation_ratio = 16.0

2. Create a flavor that will consume > 50% of the available memory on your host(s) and specify an explicit NUMA topology. For example, on my all-in-one deployment where the host has 32GB RAM, we will request a 20GB instance:

   $ openstack flavor create --vcpu 2 --disk 0 --ram 20480 test.numa
   $ openstack flavor set test.numa --property hw:numa_nodes=2

3. Boot an instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test

4. Boot another instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2

Both instances should boot.

Test case B:

This is a negative test case.

1. Configure the host to use a RAM overcommit ratio of 1.0, adding the following to nova.conf:

   ram_allocation_ratio = 1.0

2. Create a flavor that will consume > 50% of the available memory on your host(s) and specify an explicit NUMA topology. For example, on my all-in-one deployment where the host has 32GB RAM, we will request a 20GB instance:

   $ openstack flavor create --vcpu 2 --disk 0 --ram 20480 test.numa
   $ openstack flavor set test.numa --property hw:numa_nodes=2

3. Boot an instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test

4. Boot another instance using this flavor:

   $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2

The first instance should boot. The second instance should fail to boot.
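The expected results of the two test cases follow from how many per-cell requests fit once the allocation ratio is applied. A rough sketch of that reasoning, assuming the 32GB host exposes two NUMA cells of 16384 MB each (real hosts report somewhat less) and that the flavor RAM is split evenly across cells; `bootable_instances` is a hypothetical helper, not part of nova:

```python
# Expected outcome of test cases A and B under a simplified capacity model.
# Assumption: the 32 GB host exposes two NUMA cells of 16384 MB each.


def bootable_instances(cell_total_mb, per_cell_request_mb, ratio):
    """Number of instances whose per-cell share fits in one NUMA cell."""
    return int(cell_total_mb * ratio) // per_cell_request_mb


CELL_TOTAL_MB = 32768 // 2        # hypothetical even split of host RAM
PER_CELL_REQUEST_MB = 20480 // 2  # 20 GB flavor over hw:numa_nodes=2

for label, ratio in (("A", 16.0), ("B", 1.0)):
    count = bootable_instances(CELL_TOTAL_MB, PER_CELL_REQUEST_MB, ratio)
    print(f"Test case {label}: ratio={ratio}, instances that fit: {count}")
```

With these assumed cell sizes, test case A has room for both instances and test case B only for the first.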
As of OSP13, the ram alloc ratio is defined in the placement API

* nova version

(overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
14 -p 2019-04-05.1

* Compute nodes

(overcloud) [stack@undercloud-0 ~]$ openstack resource provider inventory show 8cc922c3-3f8d-4a2f-ac64-c5090423dcf1 MEMORY_MB
+------------------+-------+
| Field            | Value |
+------------------+-------+
| allocation_ratio | 3.0   |
| max_unit         | 6143  |
| reserved         | 4096  |
| step_size        | 1     |
| min_unit         | 1     |
| total            | 6143  |
+------------------+-------+

(overcloud) [stack@undercloud-0 ~]$ openstack resource provider inventory show f16eeb0d-5833-4471-b615-fb61ed3fa49c MEMORY_MB
+------------------+-------+
| Field            | Value |
+------------------+-------+
| allocation_ratio | 1.5   |
| max_unit         | 6143  |
| reserved         | 4096  |
| step_size        | 1     |
| min_unit         | 1     |
| total            | 6143  |
+------------------+-------+

* Instance flavor

(overcloud) [stack@undercloud-0 ~]$ openstack flavor show half_node_flavor
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 0                                    |
| id                         | 8394a344-24eb-4d72-bd83-3c34ef57e2d5 |
| name                       | half_node_flavor                     |
| os-flavor-access:is_public | True                                 |
| properties                 | hw:numa_nodes='1'                    |
| ram                        | 3000                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 2                                    |
+----------------------------+--------------------------------------+

* Boot 3 instances (validates that oversubscription works):

(overcloud) [stack@undercloud-0 ~]$ for i in {1..3}; do openstack server create --flavor half_node_flavor --image cirros-0.3.5-x86_64-disk.img --nic net-id=cada4c5e-3987-4ac8-9e82-9bff0a9fb291 $i --wait; done
+-------------------------------------+---------------------------------------------------------------------+
| Field                               | Value                                                               |
+-------------------------------------+---------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                              |
| OS-EXT-AZ:availability_zone         | nova                                                                |
| OS-EXT-SRV-ATTR:host                | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000026                                                   |
| OS-EXT-STS:power_state              | Running                                                             |
| OS-EXT-STS:task_state               | None                                                                |
| OS-EXT-STS:vm_state                 | active                                                              |
| OS-SRV-USG:launched_at              | 2019-04-29T15:18:12.000000                                          |
| OS-SRV-USG:terminated_at            | None                                                                |
| accessIPv4                          |                                                                     |
| accessIPv6                          |                                                                     |
| addresses                           | private=192.168.200.15                                              |
| adminPass                           | RRez8juH22Zh                                                        |
| config_drive                        |                                                                     |
| created                             | 2019-04-29T15:17:57Z                                                |
| flavor                              | half_node_flavor (8394a344-24eb-4d72-bd83-3c34ef57e2d5)             |
| hostId                              | 1ff3d27b587c01eff3a0d6a53e67207207ded52654e8597517c5ef26            |
| id                                  | 0973bd20-3acd-4331-a265-68d0016271e9                                |
| image                               | cirros-0.3.5-x86_64-disk.img (f272f221-fa36-4e18-800d-1bbcb9eb1bb2) |
| key_name                            | None                                                                |
| name                                | 1                                                                   |
| progress                            | 0                                                                   |
| project_id                          | be47c8cfb80446fd85e58f9ef060db0f                                    |
| properties                          |                                                                     |
| security_groups                     | name='default'                                                      |
| status                              | ACTIVE                                                              |
| updated                             | 2019-04-29T15:18:12Z                                                |
| user_id                             | 17be1bcfab9c438f99201442cd73f6e2                                    |
| volumes_attached                    |                                                                     |
+-------------------------------------+---------------------------------------------------------------------+

+-------------------------------------+---------------------------------------------------------------------+
| Field                               | Value                                                               |
+-------------------------------------+---------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                              |
| OS-EXT-AZ:availability_zone         | nova                                                                |
| OS-EXT-SRV-ATTR:host                | compute-0.localdomain                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0.localdomain                                               |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000029                                                   |
| OS-EXT-STS:power_state              | Running                                                             |
| OS-EXT-STS:task_state               | None                                                                |
| OS-EXT-STS:vm_state                 | active                                                              |
| OS-SRV-USG:launched_at              | 2019-04-29T15:18:45.000000                                          |
| OS-SRV-USG:terminated_at            | None                                                                |
| accessIPv4                          |                                                                     |
| accessIPv6                          |                                                                     |
| addresses                           | private=192.168.200.26                                              |
| adminPass                           | QgEzm29JBN3c                                                        |
| config_drive                        |                                                                     |
| created                             | 2019-04-29T15:18:30Z                                                |
| flavor                              | half_node_flavor (8394a344-24eb-4d72-bd83-3c34ef57e2d5)             |
| hostId                              | 07663e9fd4675aadfe5d7732494e7db23f8d4700383dd6697e8b05dd            |
| id                                  | de6581af-5dc5-408d-ae8a-b681741f4b5c                                |
| image                               | cirros-0.3.5-x86_64-disk.img (f272f221-fa36-4e18-800d-1bbcb9eb1bb2) |
| key_name                            | None                                                                |
| name                                | 2                                                                   |
| progress                            | 0                                                                   |
| project_id                          | be47c8cfb80446fd85e58f9ef060db0f                                    |
| properties                          |                                                                     |
| security_groups                     | name='default'                                                      |
| status                              | ACTIVE                                                              |
| updated                             | 2019-04-29T15:18:45Z                                                |
| user_id                             | 17be1bcfab9c438f99201442cd73f6e2                                    |
| volumes_attached                    |                                                                     |
+-------------------------------------+---------------------------------------------------------------------+

+-------------------------------------+---------------------------------------------------------------------+
| Field                               | Value                                                               |
+-------------------------------------+---------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                              |
| OS-EXT-AZ:availability_zone         | nova                                                                |
| OS-EXT-SRV-ATTR:host                | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000002c                                                   |
| OS-EXT-STS:power_state              | Running                                                             |
| OS-EXT-STS:task_state               | None                                                                |
| OS-EXT-STS:vm_state                 | active                                                              |
| OS-SRV-USG:launched_at              | 2019-04-29T15:19:17.000000                                          |
| OS-SRV-USG:terminated_at            | None                                                                |
| accessIPv4                          |                                                                     |
| accessIPv6                          |                                                                     |
| addresses                           | private=192.168.200.8                                               |
| adminPass                           | ZjvsHSY8BcXA                                                        |
| config_drive                        |                                                                     |
| created                             | 2019-04-29T15:19:06Z                                                |
| flavor                              | half_node_flavor (8394a344-24eb-4d72-bd83-3c34ef57e2d5)             |
| hostId                              | 1ff3d27b587c01eff3a0d6a53e67207207ded52654e8597517c5ef26            |
| id                                  | bb5ec2b9-eaa5-4b9b-beea-45b7d5ff0d66                                |
| image                               | cirros-0.3.5-x86_64-disk.img (f272f221-fa36-4e18-800d-1bbcb9eb1bb2) |
| key_name                            | None                                                                |
| name                                | 3                                                                   |
| progress                            | 0                                                                   |
| project_id                          | be47c8cfb80446fd85e58f9ef060db0f                                    |
| properties                          |                                                                     |
| security_groups                     | name='default'                                                      |
| status                              | ACTIVE                                                              |
| updated                             | 2019-04-29T15:19:17Z                                                |
| user_id                             | 17be1bcfab9c438f99201442cd73f6e2                                    |
| volumes_attached                    |                                                                     |
+-------------------------------------+---------------------------------------------------------------------+
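The inventory values explain why all three 3000 MB instances were accepted. Placement treats the usable capacity of an inventory as (total - reserved) * allocation_ratio; the sketch below applies that formula to the two MEMORY_MB inventories shown in this comment. The helper name `capacity_mb` is hypothetical, truncating to an integer is a simplification made here, and no mapping from provider UUID to compute host is assumed because the output does not show one.

```python
# Placement-style capacity: (total - reserved) * allocation_ratio.
# Truncation to int is a simplification for this sketch.


def capacity_mb(total, reserved, allocation_ratio):
    """Usable MEMORY_MB for one resource provider inventory."""
    return int((total - reserved) * allocation_ratio)


FLAVOR_RAM_MB = 3000  # half_node_flavor

# (total, reserved, allocation_ratio) copied from the inventory output
inventories = {
    "8cc922c3-3f8d-4a2f-ac64-c5090423dcf1": (6143, 4096, 3.0),
    "f16eeb0d-5833-4471-b615-fb61ed3fa49c": (6143, 4096, 1.5),
}

total_slots = 0
for uuid, (total, reserved, ratio) in inventories.items():
    cap = capacity_mb(total, reserved, ratio)
    slots = cap // FLAVOR_RAM_MB
    total_slots += slots
    print(f"{uuid}: capacity={cap} MB, instances that fit={slots}")

# With a ratio of 1.0 neither provider could host even one instance.
slots_without_overcommit = capacity_mb(6143, 4096, 1.0) // FLAVOR_RAM_MB
print(total_slots, slots_without_overcommit)
```

The result is a 2 + 1 split across the two providers, which agrees with the output: two instances landed on compute-1 and one on compute-0. Because no instance would fit at a ratio of 1.0, three successful boots confirm that oversubscription is being honoured.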
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0941