Bug 1517004
Summary: Insufficient free host memory pages available to allocate guest RAM with Open vSwitch DPDK in Red Hat OpenStack Platform 10

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 10.0 (Newton) |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| Reporter | Andreas Karis <akaris> |
| Assignee | Sahid Ferdjaoui <sferdjao> |
| QA Contact | Joe H. Rahme <jhakimra> |
| CC | aguetta, akaris, awaugama, berrange, cfields, dasmith, eglynn, gkadam, joea, kchamart, lyarwood, mmethot, nchandek, sbauza, sferdjao, sgordon, srevivo, stephenfin, vromanso |
| Keywords | Triaged |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2018-01-10 16:42:54 UTC |
Description
Andreas Karis
2017-11-23 21:04:17 UTC
Root Cause

Nova by default will first fill up NUMA node 0 if there are still free pCPUs. This issue happens when the requested pCPUs still fit into NUMA node 0, but the hugepages on NUMA node 0 aren't sufficient for the instance memory to fit. Unfortunately, at the time of this writing, one cannot tell nova to spawn an instance on a specific NUMA node.

Diagnostic Steps

On a hypervisor with 2 MB hugepages and 512 free hugepages per NUMA node:

~~~
[root@overcloud-compute-1 ~]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

And with the following NUMA architecture:

~~~
[root@overcloud-compute-1 nova]# lscpu | grep -i NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
~~~

Spawn 3 instances with the following flavor (1 vCPU and 512 MB of memory):

~~~
[stack@undercloud-4 ~]$ nova flavor-show m1.tiny
+----------------------------+-------------------------------------------------------------+
| Property                   | Value                                                       |
+----------------------------+-------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                       |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                           |
| disk                       | 8                                                           |
| extra_specs                | {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "large"} |
| id                         | 49debbdb-c12e-4435-97ef-f575990b352f                        |
| name                       | m1.tiny                                                     |
| os-flavor-access:is_public | True                                                        |
| ram                        | 512                                                         |
| rxtx_factor                | 1.0                                                         |
| swap                       |                                                             |
| vcpus                      | 1                                                           |
+----------------------------+-------------------------------------------------------------+
~~~

The new instance will boot and will use memory from NUMA 1:

~~~
[stack@undercloud-4 ~]$ nova list | grep d98772d1-119e-48fa-b1d9-8a68411cba0b
| d98772d1-119e-48fa-b1d9-8a68411cba0b | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe8d:a6ef, 10.0.0.102 |
~~~

~~~
[root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    256
Node 1 HugePages_Surp:      0
~~~

~~~
nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test0
~~~

The 3rd instance fails to boot:

~~~
[stack@undercloud-4 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                                           |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc | cirros-test0 | ERROR  | -          | NOSTATE     |                                                    |
| a44c43ca-49ad-43c5-b8a1-543ed8ab80ad | cirros-test0 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe0f:565b, 10.0.0.105 |
| e21ba401-6161-45e6-8a04-6c45cef4aa3e | cirros-test0 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe69:18bd, 10.0.0.111 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
~~~

From the compute node, we can see that free hugepages on NUMA node 0 are exhausted, whereas in theory there is still enough space on NUMA node 1:

~~~
[root@overcloud-compute-1 qemu]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

/var/log/nova/nova-compute.log reveals that the instance CPU shall be pinned to NUMA node 0:

~~~
<name>instance-00000006</name>
<uuid>1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc</uuid>
<metadata>
  <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
    <nova:package version="14.0.8-5.el7ost"/>
    <nova:name>cirros-test0</nova:name>
    <nova:creationTime>2017-11-23 19:53:00</nova:creationTime>
    <nova:flavor name="m1.tiny">
      <nova:memory>512</nova:memory>
      <nova:disk>8</nova:disk>
      <nova:swap>0</nova:swap>
      <nova:ephemeral>0</nova:ephemeral>
      <nova:vcpus>1</nova:vcpus>
    </nova:flavor>
    <nova:owner>
      <nova:user uuid="5d1785ee87294a6fad5e2bdddd91cc20">admin</nova:user>
      <nova:project uuid="8c307c08d2234b339c504bfdd896c13e">admin</nova:project>
    </nova:owner>
    <nova:root type="image" uuid="6350211f-5a11-4e02-a21a-cb1c0d543214"/>
  </nova:instance>
</metadata>
<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>524288</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
<vcpu placement='static'>1</vcpu>
<cputune>
  <shares>1024</shares>
  <vcpupin vcpu='0' cpuset='2'/>
  <emulatorpin cpuset='2'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
~~~

In the above, also note the nodeset='0' in the numatune section, which indicates that memory shall be claimed from NUMA node 0.
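A quick back-of-the-envelope check (an editorial illustration using only the meminfo, lscpu, and flavor values shown above, not something computed by Nova) of why the third guest cannot fit on NUMA node 0 even though node 1 still has free hugepages:

~~~python
# Editorial sketch of the hugepage arithmetic above; the numbers come from
# the meminfo/lscpu output in the diagnostics, not from Nova itself.
PAGE_SIZE_KB = 2048                              # 2 MB hugepages
guest_ram_kb = 512 * 1024                        # 512 MB m1.tiny flavor
pages_per_guest = guest_ram_kb // PAGE_SIZE_KB   # 256 pages per guest

free_pages = {0: 512, 1: 512}   # free hugepages per node (the other 512 are in use)
free_pcpus = {0: 4, 1: 4}       # pCPUs 0-3 on node 0, 4-7 on node 1

# Only two 512 MB guests fit into node 0's remaining hugepages ...
guests_that_fit_on_node0 = free_pages[0] // pages_per_guest          # 2
# ... but node 0 still has free dedicated pCPUs after those two guests,
# which is what Nova keys on when it pins the third guest there.
node0_still_has_free_pcpus = free_pcpus[0] - guests_that_fit_on_node0 > 0   # True
print(pages_per_guest, guests_that_fit_on_node0, node0_still_has_free_pcpus)
~~~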
Thanks for the excellent bug report here, Andreas. To summarize, the issue is that there are sufficient free hugepages on a second NUMA node but nova is not smart enough to choose CPUs from this node instead of the first one. Is this correct? If so, this sounds like a valid issue with the scheduler. As you've noted in the Customer Portal solution, the obvious workaround is to simply allocate more hugepages, but this wastes resources and nova should handle this better IMO.

Hi Stephen,

Yes, from a customer environment and from my lab, I can confirm this behavior. I did not look into the code, so I cannot tell you if this is really what happens. But looking at nova as a black box, this seems to be nova's behavior. I don't believe it's the scheduler: the scheduler passes, and the instance tries to boot on the compute node. It seems more to be related to openstack-nova-compute, which should select the other NUMA node with free memory, but it looks as if it considers free CPUs first and foremost when picking the NUMA node. When placing an instance on a NUMA node, nova should consider all resources (CPU, memory, etc.) and only then make a decision about where to put the instance. By the way, this could also be a configuration issue, if this feature already exists and simply is non-default.

Thanks,

Andreas

Instances that use a NUMA topology should be scheduled on an isolated host aggregate, because their memory is accounted differently. Mixing instances with and without a NUMA topology creates this kind of behavior. Basically, in the Nova world, all instances that use pinning, hugepages, realtime, or NUMA features have a NUMA topology.

In this case, instances aren't actually being mixed. All instances are Cisco CSR1kv, using a single flavor with 4G memory, 2 vCPUs, and {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "any"}. The first 10-12 instances start fine, but once all the 1G hugepages are allocated from NUMA0, with 2 left on NUMA0 and 52 available on NUMA1, that's when it fails. Multiple retries, where 10 instances are booted, result in 4-6 succeeding and the rest failing with this error. Retrying this repeatedly eventually results in the maximum number of instances running, but only after roughly 40% of the boot attempts fail.

I need a sosreport because I'm not able to reproduce the case. In my env the guests are well placed on the host NUMA nodes. I started several instances and all were well assigned to the NUMA nodes with hugepages available. Another point is that, to schedule on NUMA1, since you have cpu_policy=dedicated, you need to have free pCPUs on that host NUMA node. Also, are you using the vcpu_pin_set option to exclude some pCPUs? Another point I just noted: my tests are on master, since I don't think that there were changes in this part of the code, but based on comment 4 it seems that could be the case. I will try that. In the meantime, if you can share a sosreport, that could help.

Created attachment 1367497 [details]
sosreport from lab compute node
Ok, I found the issue. Not sure that it will be so easy to fix, but it is probably backportable in all cases. We only check for small pages available on the host NUMA node when verifying whether we can fit the guest NUMA node. The thing which makes the work a bit difficult is that the hugepages placement and page size selection (when using ANY) is done in a different place than the pinning. So my worry is that a large refactor is needed to fix the issue.
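To make the description above concrete, here is a deliberately simplified, hypothetical sketch of that kind of check. The class and function names below are invented for illustration and this is not the actual nova.virt.hardware code:

~~~python
# Hypothetical, simplified sketch of the behaviour described above -- NOT the
# actual Nova implementation. HostCell and fits_on_cell are invented names.
from dataclasses import dataclass

@dataclass
class HostCell:
    id: int
    avail_cpus: int
    avail_memory_mb: int       # "small page" memory accounting
    free_hugepages_2m: int     # tracked, but not consulted below

def fits_on_cell(cell, vcpus, ram_mb):
    # Only free pCPUs and small-page memory are checked here; hugepage
    # availability is decided elsewhere, so it never vetoes this choice.
    return cell.avail_cpus >= vcpus and cell.avail_memory_mb >= ram_mb

cells = [HostCell(0, avail_cpus=2, avail_memory_mb=16384, free_hugepages_2m=0),
         HostCell(1, avail_cpus=4, avail_memory_mb=16384, free_hugepages_2m=512)]

# Cell 0 is picked because it still has free pCPUs, even though its hugepages
# are exhausted -- qemu later fails with "Insufficient free host memory pages
# available to allocate guest RAM".
chosen = next(c for c in cells if fits_on_cell(c, vcpus=1, ram_mb=512))
print(chosen.id)   # 0
~~~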
*** Bug 1499083 has been marked as a duplicate of this bug. ***

*** Bug 1519540 has been marked as a duplicate of this bug. ***

(In reply to Sahid Ferdjaoui from comment #17)
> Ok I found the issue. Not sure that will be so easy to fix but probably
> backportable in all cases. We only check for small pages available on the
> host NUMA node when verifying whether we can fit the guest NUMA node. The
> thing which makes the work a bit difficult is that the hugepages placement
> and page size selection (when using ANY) is done in a different place than
> the pinning. So my worry is a large refactor to fix the issue.

Is there any setting for the flavor mem_page_size that might be a workaround? We attempted large, 4GB, 2048, 2, and any, with similar results. Or a kernel hugepage size/count? The instance flavor is 4G memory, no variations. Would this also occur when using CPU pinning and SR-IOV?

Hi Joe,

I also tried your suggestion in my lab, interleaving the vCPUs:

~~~
2017-12-15 15:09:36.357 547419 DEBUG oslo_service.service [req-d2f1dd34-cdcb-489f-99f6-820ed85e2a9f - - - - -] vcpu_pin_set = 1,5,2,6,3,7 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
~~~

But that doesn't work for me, either:

~~~
[root@overcloud-compute-1 ~]# lscpu | grep -i numa
NUMA node(s):          2
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
[root@overcloud-compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 33    instance-00000014              running
 34    instance-00000015              running
[root@overcloud-compute-1 ~]# virsh vcpupinset 33
error: unknown command: 'vcpupinset'
[root@overcloud-compute-1 ~]# virsh vcpupin 33
VCPU: CPU Affinity
----------------------------------
   0: 1
[root@overcloud-compute-1 ~]# virsh vcpupin 34
VCPU: CPU Affinity
----------------------------------
   0: 2
~~~

~~~
2017-12-15 15:15:33.875 547419 ERROR nova.virt.libvirt.guest [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] Error launching a defined domain with XML: <domain type='kvm'>
  <name>instance-00000016</name>
  <uuid>7edb8e02-203f-4974-8664-ba31014230a4</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="14.0.8-5.el7ost"/>
      <nova:name>cirros-test3</nova:name>
      <nova:creationTime>2017-12-15 15:15:30</nova:creationTime>
      <nova:flavor name="m1.tiny">

2017-12-15 15:15:34.859 547419 DEBUG nova.compute.manager [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] [instance: 7edb8e02-203f-4974-8664-ba31014230a4] Build of instance 7edb8e02-203f-4974-8664-ba31014230a4 was re-scheduled: internal error: process exited while connecting to monitor: 2017-12-15T15:15:33.498903Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1)
2017-12-15T15:15:33.672454Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/37-instance-00000016,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM
~~~

I can see some workarounds that consist of adding VMs used as padding, or of tweaking the vcpu_pin_set option:

- You can have an instance configured to fit into the last pCPUs available in host NUMA node 0 (consider using hw:cpu_policy=dedicated, hw:numa_nodes=1).
- You can update vcpu_pin_set to remove the last pCPUs in host NUMA node 0, start the guest, and then revert your change to the vcpu_pin_set option.

I did not notice it in my first investigation, but using DPDK implies that hugepages are consumed. If we look at your env without any guests running on compute-1, we can see that:

~~~
[root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

Both of the host NUMA nodes already have 512 pages consumed. An option, reserved_huge_pages, has been introduced in nova.conf [0]. Basically, what you want to do is indicate to Nova that a part of the available hugepages will be used by other components:

~~~
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
~~~

Please let me know whether that fixes your issue.

Thanks,
s.

[0] https://review.openstack.org/#/c/292499/
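For illustration only, here is an editorial sketch of the effect this reservation has on the accounting, assuming the semantics described above and in the linked review; it is not Nova's actual code:

~~~python
# Editorial sketch of reserved_huge_pages accounting on one NUMA node,
# assuming the semantics described above; not the Nova implementation.
total_pages = 1024          # 2 MB pages on this node
reserved_for_dpdk = 512     # reserved_huge_pages=node:N,size:2048,count:512
used_by_guests = 512        # e.g. two 512 MB guests already placed here

available = total_pages - reserved_for_dpdk - used_by_guests
pages_needed = (512 * 1024) // 2048   # 256 pages for one more 512 MB guest

# With the reservation, Nova sees 0 pages left on this node and places the
# guest elsewhere (or fails the build cleanly) instead of letting qemu hit
# "Insufficient free host memory pages available to allocate guest RAM".
print(available, available >= pages_needed)   # 0 False
~~~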
Hi Sahid,

It seems that you made some code changes in the lab:

~~~
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:            # TODO(sahid): We are converting all calls from a
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:            # TODO(sahid): We are converting all calls from a
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:        # TODO(sahid): Needs to use get_info but more changes have to
/usr/lib/python2.7/site-packages/nova/virt/hardware.py:# TODO(sahid): Move numa related to hardward/numa.py
/usr/lib/python2.7/site-packages/nova/virt/hardware.py:            LOG.debug("sahid mempages new %s", newcell.mempages)
[root@overcloud-compute-1 ~]# rpm -qf /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
python-nova-14.0.8-5.el7ost.noarch
[root@overcloud-compute-1 ~]# rpm -qV python-nova-14.0.8-5.el7ost.noarch
S.5....T.    /usr/lib/python2.7/site-packages/nova/virt/hardware.py
S.5....T.    /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
~~~

~~~
[root@overcloud-compute-1 ~]# yum reinstall python-nova-14.0.8-5.el7ost.noarch -y
(...)
~~~

~~~
[root@overcloud-compute-1 virt]# diff -ruN /root/nova-sahid /usr/lib/python2.7/site-packages/nova
diff -ruN /root/nova-sahid/virt/hardware.py /usr/lib/python2.7/site-packages/nova/virt/hardware.py
--- /root/nova-sahid/virt/hardware.py   2017-12-19 12:02:42.948676221 +0000
+++ /usr/lib/python2.7/site-packages/nova/virt/hardware.py     2017-10-24 03:44:48.000000000 +0000
@@ -829,8 +829,6 @@
     :returns: objects.InstanceNUMACell instance with pinning information,
               or None if instance cannot be pinned to the given host
     """
-    LOG.debug("host memepages %s", host_cell.mempages)
-    LOG.debug("instance pagesize requested: %s", instance_cell.pagesize)
     if host_cell.avail_cpus < len(instance_cell.cpuset):
         LOG.debug('Not enough available CPUs to schedule instance. '
                   'Oversubscription is not possible with pinned instances. '
@@ -929,10 +927,8 @@
     pagesize = None
     if instance_cell.pagesize:
-        LOG.debug("pagesize requested: %s", instance_cell.pagesize)
         pagesize = _numa_cell_supports_pagesize_request(
             host_cell, instance_cell)
-        LOG.debug("pagesize %s, node=%s", pagesize, host_cell)
         if not pagesize:
             LOG.debug('Host does not support requested memory pagesize. '
                       'Requested: %d kB', instance_cell.pagesize)
@@ -1403,7 +1399,6 @@
             if instancecell.pagesize and instancecell.pagesize > 0:
                 newcell.mempages = _numa_pagesize_usage_from_cell(
                     hostcell, instancecell, sign)
-                LOG.debug("sahid mempages new %s", newcell.mempages)
             if instance.cpu_pinning_requested:
                 pinned_cpus = set(instancecell.cpu_pinning.values())
                 if free:
diff -ruN /root/nova-sahid/virt/libvirt/driver.py /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
--- /root/nova-sahid/virt/libvirt/driver.py     2017-12-19 11:22:06.503893967 +0000
+++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py       2017-10-24 03:44:48.000000000 +0000
@@ -5343,7 +5343,7 @@
                 reserved=_get_reserved_memory_for_cell(
                     self, cell.id, pages.size))
             for pages in cell.mempages]
-        LOG.debug("mempages1 %s", mempages)
+
         cell = objects.NUMACell(id=cell.id, cpuset=cpuset,
                                 memory=cell.memory / units.Ki,
                                 cpu_usage=0, memory_usage=0,
[root@overcloud-compute-1 virt]#
~~~

All of the above is extra logging, though, so I assume it had no impact on your tests. I restarted openstack-nova-compute:

~~~
[root@overcloud-compute-1 virt]# systemctl restart openstack-nova-compute
(...)
[root@overcloud-compute-1 virt]# grep reserved_huge_pages /var/log/nova/nova-compute.log | tail -n1
2017-12-19 17:56:40.727 26691 DEBUG oslo_service.service [req-e681e97d-7d99-4ba8-bee7-5f7a3f655b21 - - - - -] reserved_huge_pages = [{'node': '0', 'count': '512', 'size': '2048'}, {'node': '1', 'count': '512', 'size': '2048'}] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
[root@overcloud-compute-1 virt]#
~~~

I repeated the test:

~~~
[stack@undercloud-4 ~]$ NETID=e17bd36d-4296-40ff-affe-803c954de05a ; for i in 2 3 ; do nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test$i ; sleep 3 ; done
~~~

I spawned a total of 6 VMs:

~~~
[stack@undercloud-4 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                                           |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| 18fc41df-1718-4d55-97b6-e7ce27c69054 | cirros-test1 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fede:e904, 10.0.0.102 |
| 436fadef-459b-4c7d-b146-3ea2f9120a00 | cirros-test2 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe7d:f526, 10.0.0.109 |
| 8f9d7634-e71f-4f37-bcd9-d4a2ee6adf9d | cirros-test3 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe40:a120, 10.0.0.114 |
| 6ba1cd0e-fe5c-4eb0-8323-71f22ca8e1dd | cirros-test4 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe46:db, 10.0.0.101   |
| 834b00b2-c521-40db-ac28-974ca2bdec8e | cirros-test5 | ERROR  | -          | NOSTATE     |                                                    |
| 53fb8e43-539a-499f-8831-100a307c0304 | cirros-test6 | ERROR  | -          | NOSTATE     |                                                    |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-1 virt]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
[root@overcloud-compute-1 virt]# virsh list
 Id    Name                           State
----------------------------------------------------
 54    instance-00000023              running
 55    instance-00000024              running
 56    instance-00000025              running
 57    instance-00000026              running
[root@overcloud-compute-1 virt]# for i in {54..57}; do virsh vcpupin $i; done
VCPU: CPU Affinity
----------------------------------
   0: 1
VCPU: CPU Affinity
----------------------------------
   0: 2
VCPU: CPU Affinity
----------------------------------
   0: 5
VCPU: CPU Affinity
----------------------------------
   0: 6
~~~

The problem seems to be fixed by this setting!

~~~
[root@overcloud-compute-1 virt]# grep reserved_huge /etc/nova/nova.conf -B1
[DEFAULT]
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
~~~
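As an editorial sanity check (not part of the original comment), the 4 ACTIVE / 2 ERROR split above is consistent with the capacity left once 512 pages per node are reserved for OVS-DPDK:

~~~python
# Editorial arithmetic: with 512 of 1024 pages per node reserved, each node
# holds two 512 MB guests (2 * 256 pages), so only 4 of the 6 requested
# instances can be placed on this compute node.
pages_per_node_for_guests = 1024 - 512
pages_per_guest = (512 * 1024) // 2048
guests_per_node = pages_per_node_for_guests // pages_per_guest   # 2
print(guests_per_node * 2)   # 4 -> cirros-test5 and cirros-test6 cannot fit
~~~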