Bug 1517004
Summary: Insufficient free host memory pages available to allocate guest RAM with Open vSwitch DPDK in Red Hat OpenStack Platform 10

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 10.0 (Newton) |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| Reporter | Andreas Karis <akaris> |
| Assignee | Sahid Ferdjaoui <sferdjao> |
| QA Contact | Joe H. Rahme <jhakimra> |
| CC | aguetta, akaris, awaugama, berrange, cfields, dasmith, eglynn, gkadam, joea, kchamart, lyarwood, mmethot, nchandek, sbauza, sferdjao, sgordon, srevivo, stephenfin, vromanso |
| Keywords | Triaged |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2018-01-10 16:42:54 UTC |
Description
Andreas Karis
2017-11-23 21:04:17 UTC
Root Cause

Nova by default will first fill up NUMA node 0 if there are still free pCPUs. This issue happens when the requested pCPUs still fit into NUMA node 0, but the hugepages on NUMA node 0 aren't sufficient for the instance memory to fit. Unfortunately, at the time of this writing, one cannot tell nova to spawn an instance on a specific NUMA node.

Diagnostic Steps

On a hypervisor with 2 MB hugepages and 512 free hugepages per NUMA node:

~~~
[root@overcloud-compute-1 ~]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

And with the following NUMA architecture:

~~~
[root@overcloud-compute-1 nova]# lscpu | grep -i NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
~~~

Spawn 3 instances with the following flavor (1 vCPU and 512 MB of memory):

~~~
[stack@undercloud-4 ~]$ nova flavor-show m1.tiny
+----------------------------+-------------------------------------------------------------+
| Property                   | Value                                                       |
+----------------------------+-------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                       |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                           |
| disk                       | 8                                                           |
| extra_specs                | {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "large"} |
| id                         | 49debbdb-c12e-4435-97ef-f575990b352f                        |
| name                       | m1.tiny                                                     |
| os-flavor-access:is_public | True                                                        |
| ram                        | 512                                                         |
| rxtx_factor                | 1.0                                                         |
| swap                       |                                                             |
| vcpus                      | 1                                                           |
+----------------------------+-------------------------------------------------------------+
~~~

The new instance will boot and will use memory from NUMA 1:

~~~
[stack@undercloud-4 ~]$ nova list | grep d98772d1-119e-48fa-b1d9-8a68411cba0b
| d98772d1-119e-48fa-b1d9-8a68411cba0b | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe8d:a6ef, 10.0.0.102 |
~~~

~~~
[root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    256
Node 1 HugePages_Surp:      0
~~~

~~~
nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test0
~~~

The 3rd instance fails to boot:

~~~
[stack@undercloud-4 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                                           |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc | cirros-test0 | ERROR  | -          | NOSTATE     |                                                    |
| a44c43ca-49ad-43c5-b8a1-543ed8ab80ad | cirros-test0 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe0f:565b, 10.0.0.105 |
| e21ba401-6161-45e6-8a04-6c45cef4aa3e | cirros-test0 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe69:18bd, 10.0.0.111 |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
~~~

From the compute node, we can see that free hugepages on NUMA node 0 are exhausted, whereas in theory there is still enough space on NUMA node 1:

~~~
[root@overcloud-compute-1 qemu]# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:      2048 kB
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:      2048 kB
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

/var/log/nova/nova-compute.log reveals that the instance CPU shall be pinned to NUMA node 0:

~~~
<name>instance-00000006</name>
<uuid>1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc</uuid>
<metadata>
  <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
    <nova:package version="14.0.8-5.el7ost"/>
    <nova:name>cirros-test0</nova:name>
    <nova:creationTime>2017-11-23 19:53:00</nova:creationTime>
    <nova:flavor name="m1.tiny">
      <nova:memory>512</nova:memory>
      <nova:disk>8</nova:disk>
      <nova:swap>0</nova:swap>
      <nova:ephemeral>0</nova:ephemeral>
      <nova:vcpus>1</nova:vcpus>
    </nova:flavor>
    <nova:owner>
      <nova:user uuid="5d1785ee87294a6fad5e2bdddd91cc20">admin</nova:user>
      <nova:project uuid="8c307c08d2234b339c504bfdd896c13e">admin</nova:project>
    </nova:owner>
    <nova:root type="image" uuid="6350211f-5a11-4e02-a21a-cb1c0d543214"/>
  </nova:instance>
</metadata>
<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>524288</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
<vcpu placement='static'>1</vcpu>
<cputune>
  <shares>1024</shares>
  <vcpupin vcpu='0' cpuset='2'/>
  <emulatorpin cpuset='2'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
~~~

In the above, also note the nodeset='0' in the numatune section, which indicates that memory shall be claimed from NUMA node 0.
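A quick back-of-the-envelope check (an editorial illustration using only the meminfo, lscpu, and flavor values shown above, not something computed by Nova) of why the third guest cannot fit on NUMA node 0 even though node 1 still has free hugepages:

~~~python
# Editorial sketch of the hugepage arithmetic above; the numbers come from
# the meminfo/lscpu output in the diagnostics, not from Nova itself.
PAGE_SIZE_KB = 2048                              # 2 MB hugepages
guest_ram_kb = 512 * 1024                        # 512 MB m1.tiny flavor
pages_per_guest = guest_ram_kb // PAGE_SIZE_KB   # 256 pages per guest

free_pages = {0: 512, 1: 512}   # free hugepages per node (the other 512 are in use)
free_pcpus = {0: 4, 1: 4}       # pCPUs 0-3 on node 0, 4-7 on node 1

# Only two 512 MB guests fit into node 0's remaining hugepages ...
guests_that_fit_on_node0 = free_pages[0] // pages_per_guest          # 2
# ... but node 0 still has free dedicated pCPUs after those two guests,
# which is what Nova keys on when it pins the third guest there.
node0_still_has_free_pcpus = free_pcpus[0] - guests_that_fit_on_node0 > 0   # True
print(pages_per_guest, guests_that_fit_on_node0, node0_still_has_free_pcpus)
~~~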
Thanks for the excellent bug report here, Andreas. To summarize, the issue is that there are sufficient free hugepages on a second NUMA node but nova is not smart enough to choose CPUs from this node instead of the first one. Is this correct? If so, this sounds like a valid issue with the scheduler. As you've noted in the Customer Portal solution, the obvious workaround is to simply allocate more hugepages, but this wastes resources and nova should handle this better IMO.

Hi Stephen,

Yes, from a customer environment and from my lab, I can confirm this behavior. I did not look into the code, so I cannot tell you if this is really what happens. But looking at nova as a black box, this seems to be nova's behavior. I don't believe it's the scheduler: the scheduler passes, and the instance tries to boot on the compute node. It seems more to be related to openstack-nova-compute, which should select the other NUMA node with free memory, but it looks as if it considers free CPUs first and foremost when picking the NUMA node. When placing an instance on a NUMA node, nova should consider all resources (CPU, memory, etc.) and only then make a decision about where to put the instance. By the way, this could also be a configuration issue, if this feature already exists and simply is non-default.

Thanks,

Andreas

Instances that use a NUMA topology should be scheduled on an isolated host aggregate, because their memory is accounted differently. Mixing instances with and without a NUMA topology creates this kind of behavior. Basically, in the Nova world, all instances that use pinning, hugepages, realtime, or NUMA features have a NUMA topology.

In this case, instances aren't actually being mixed. All instances are Cisco CSR1kv, using a single flavor with 4G memory, 2 vCPUs, and {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "any"}. The first 10-12 instances start fine, but once all the 1G hugepages are allocated from NUMA0, with 2 left on NUMA0 and 52 available on NUMA1, that's when it fails. Multiple retries, where 10 instances are booted, result in 4-6 succeeding and the rest failing with this error. Retrying this repeatedly eventually results in the maximum number of instances running, but only after roughly 40% of the boot attempts fail.

I need a sosreport because I'm not able to reproduce the case. In my env the guests are well placed on the host NUMA nodes. I started several instances and all were well assigned to the NUMA nodes with hugepages available. Another point is that, to schedule on NUMA1, since you have cpu_policy=dedicated, you need to have free pCPUs on that host NUMA node. Also, are you using the vcpu_pin_set option to exclude some pCPUs? Another point I just noted: my tests are on master, since I don't think that there were changes in this part of the code, but based on comment 4 it seems that could be the case. I will try that. In the meantime, if you can share a sosreport, that could help.

Created attachment 1367497 [details]
sosreport from lab compute node
Ok, I found the issue. Not sure that it will be so easy to fix, but it is probably backportable in all cases. We only check for small pages available on the host NUMA node when verifying whether we can fit the guest NUMA node. The thing which makes the work a bit difficult is that the hugepages placement and page size selection (when using ANY) is done in a different place than the pinning. So my worry is that a large refactor is needed to fix the issue.
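To make the description above concrete, here is a deliberately simplified, hypothetical sketch of that kind of check. The class and function names below are invented for illustration and this is not the actual nova.virt.hardware code:

~~~python
# Hypothetical, simplified sketch of the behaviour described above -- NOT the
# actual Nova implementation. HostCell and fits_on_cell are invented names.
from dataclasses import dataclass

@dataclass
class HostCell:
    id: int
    avail_cpus: int
    avail_memory_mb: int       # "small page" memory accounting
    free_hugepages_2m: int     # tracked, but not consulted below

def fits_on_cell(cell, vcpus, ram_mb):
    # Only free pCPUs and small-page memory are checked here; hugepage
    # availability is decided elsewhere, so it never vetoes this choice.
    return cell.avail_cpus >= vcpus and cell.avail_memory_mb >= ram_mb

cells = [HostCell(0, avail_cpus=2, avail_memory_mb=16384, free_hugepages_2m=0),
         HostCell(1, avail_cpus=4, avail_memory_mb=16384, free_hugepages_2m=512)]

# Cell 0 is picked because it still has free pCPUs, even though its hugepages
# are exhausted -- qemu later fails with "Insufficient free host memory pages
# available to allocate guest RAM".
chosen = next(c for c in cells if fits_on_cell(c, vcpus=1, ram_mb=512))
print(chosen.id)   # 0
~~~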
*** Bug 1499083 has been marked as a duplicate of this bug. ***

*** Bug 1519540 has been marked as a duplicate of this bug. ***

(In reply to Sahid Ferdjaoui from comment #17)
> Ok I found the issue. Not sure that will be so easy to fix but probably
> backportable in all cases. We only check for small pages available on the
> host NUMA node when verifying whether we can fit the guest NUMA node. The
> thing which makes the work a bit difficult is that the hugepages placement
> and page size selection (when using ANY) is done in a different place than
> the pinning. So my worry is a large refactor to fix the issue.

Is there any setting for the flavor mem_page_size that might be a workaround? We attempted large, 4GB, 2048, 2, and any, with similar results. Or a kernel hugepage size/count? The instance flavor is 4G memory, no variations. Would this also occur when using CPU pinning and SR-IOV?

Hi Joe,

I also tried your suggestion in my lab, interleaving the vCPUs:

~~~
2017-12-15 15:09:36.357 547419 DEBUG oslo_service.service [req-d2f1dd34-cdcb-489f-99f6-820ed85e2a9f - - - - -] vcpu_pin_set = 1,5,2,6,3,7 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
~~~

But that doesn't work for me, either:

~~~
[root@overcloud-compute-1 ~]# lscpu | grep -i numa
NUMA node(s):          2
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
[root@overcloud-compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 33    instance-00000014              running
 34    instance-00000015              running
[root@overcloud-compute-1 ~]# virsh vcpupinset 33
error: unknown command: 'vcpupinset'
[root@overcloud-compute-1 ~]# virsh vcpupin 33
VCPU: CPU Affinity
----------------------------------
   0: 1
[root@overcloud-compute-1 ~]# virsh vcpupin 34
VCPU: CPU Affinity
----------------------------------
   0: 2
~~~

~~~
2017-12-15 15:15:33.875 547419 ERROR nova.virt.libvirt.guest [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] Error launching a defined domain with XML: <domain type='kvm'>
  <name>instance-00000016</name>
  <uuid>7edb8e02-203f-4974-8664-ba31014230a4</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="14.0.8-5.el7ost"/>
      <nova:name>cirros-test3</nova:name>
      <nova:creationTime>2017-12-15 15:15:30</nova:creationTime>
      <nova:flavor name="m1.tiny">

2017-12-15 15:15:34.859 547419 DEBUG nova.compute.manager [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] [instance: 7edb8e02-203f-4974-8664-ba31014230a4] Build of instance 7edb8e02-203f-4974-8664-ba31014230a4 was re-scheduled: internal error: process exited while connecting to monitor: 2017-12-15T15:15:33.498903Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1)
2017-12-15T15:15:33.672454Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/37-instance-00000016,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM
~~~

I can see some workarounds that consist of adding VMs used as padding, or of tweaking the vcpu_pin_set option:

- You can have an instance configured to fit into the last pCPUs available in host NUMA node 0 (consider using hw:cpu_policy=dedicated, hw:numa_nodes=1).
- You can update vcpu_pin_set to remove the last pCPUs in host NUMA node 0, start the guest, and then revert your change to the vcpu_pin_set option.

I did not notice it in my first investigation, but using DPDK implies that hugepages are consumed. If we look at your env without any guests running on compute-1, we can see that:

~~~
[root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:    512
Node 1 HugePages_Surp:      0
~~~

Both of the host NUMA nodes already have 512 pages consumed. An option, reserved_huge_pages, has been introduced in nova.conf [0]. Basically, what you want to do is indicate to Nova that a part of the available hugepages will be used by other components:

~~~
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
~~~

Please let me know whether that fixes your issue.

Thanks,
s.

[0] https://review.openstack.org/#/c/292499/
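For illustration only, here is an editorial sketch of the effect this reservation has on the accounting, assuming the semantics described above and in the linked review; it is not Nova's actual code:

~~~python
# Editorial sketch of reserved_huge_pages accounting on one NUMA node,
# assuming the semantics described above; not the Nova implementation.
total_pages = 1024          # 2 MB pages on this node
reserved_for_dpdk = 512     # reserved_huge_pages=node:N,size:2048,count:512
used_by_guests = 512        # e.g. two 512 MB guests already placed here

available = total_pages - reserved_for_dpdk - used_by_guests
pages_needed = (512 * 1024) // 2048   # 256 pages for one more 512 MB guest

# With the reservation, Nova sees 0 pages left on this node and places the
# guest elsewhere (or fails the build cleanly) instead of letting qemu hit
# "Insufficient free host memory pages available to allocate guest RAM".
print(available, available >= pages_needed)   # 0 False
~~~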
Hi Sahid,

It seems that you made some code changes in the lab:

~~~
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:            # TODO(sahid): We are converting all calls from a
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:            # TODO(sahid): We are converting all calls from a
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:        # TODO(sahid): Needs to use get_info but more changes have to
/usr/lib/python2.7/site-packages/nova/virt/hardware.py:# TODO(sahid): Move numa related to hardward/numa.py
/usr/lib/python2.7/site-packages/nova/virt/hardware.py:            LOG.debug("sahid mempages new %s", newcell.mempages)
[root@overcloud-compute-1 ~]# rpm -qf /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
python-nova-14.0.8-5.el7ost.noarch
[root@overcloud-compute-1 ~]# rpm -qV python-nova-14.0.8-5.el7ost.noarch
S.5....T.    /usr/lib/python2.7/site-packages/nova/virt/hardware.py
S.5....T.    /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
~~~

~~~
[root@overcloud-compute-1 ~]# yum reinstall python-nova-14.0.8-5.el7ost.noarch -y
(...)
~~~

~~~
[root@overcloud-compute-1 virt]# diff -ruN /root/nova-sahid /usr/lib/python2.7/site-packages/nova
diff -ruN /root/nova-sahid/virt/hardware.py /usr/lib/python2.7/site-packages/nova/virt/hardware.py
--- /root/nova-sahid/virt/hardware.py   2017-12-19 12:02:42.948676221 +0000
+++ /usr/lib/python2.7/site-packages/nova/virt/hardware.py     2017-10-24 03:44:48.000000000 +0000
@@ -829,8 +829,6 @@
     :returns: objects.InstanceNUMACell instance with pinning information,
               or None if instance cannot be pinned to the given host
     """
-    LOG.debug("host memepages %s", host_cell.mempages)
-    LOG.debug("instance pagesize requested: %s", instance_cell.pagesize)
     if host_cell.avail_cpus < len(instance_cell.cpuset):
         LOG.debug('Not enough available CPUs to schedule instance. '
                   'Oversubscription is not possible with pinned instances. '
@@ -929,10 +927,8 @@
     pagesize = None
     if instance_cell.pagesize:
-        LOG.debug("pagesize requested: %s", instance_cell.pagesize)
         pagesize = _numa_cell_supports_pagesize_request(
             host_cell, instance_cell)
-        LOG.debug("pagesize %s, node=%s", pagesize, host_cell)
         if not pagesize:
             LOG.debug('Host does not support requested memory pagesize. '
                       'Requested: %d kB', instance_cell.pagesize)
@@ -1403,7 +1399,6 @@
             if instancecell.pagesize and instancecell.pagesize > 0:
                 newcell.mempages = _numa_pagesize_usage_from_cell(
                     hostcell, instancecell, sign)
-                LOG.debug("sahid mempages new %s", newcell.mempages)
             if instance.cpu_pinning_requested:
                 pinned_cpus = set(instancecell.cpu_pinning.values())
                 if free:
diff -ruN /root/nova-sahid/virt/libvirt/driver.py /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py
--- /root/nova-sahid/virt/libvirt/driver.py     2017-12-19 11:22:06.503893967 +0000
+++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py       2017-10-24 03:44:48.000000000 +0000
@@ -5343,7 +5343,7 @@
                 reserved=_get_reserved_memory_for_cell(
                     self, cell.id, pages.size))
             for pages in cell.mempages]
-        LOG.debug("mempages1 %s", mempages)
+
         cell = objects.NUMACell(id=cell.id, cpuset=cpuset,
                                 memory=cell.memory / units.Ki,
                                 cpu_usage=0, memory_usage=0,
[root@overcloud-compute-1 virt]#
~~~

All of the above is extra logging, though, so I assume it had no impact on your tests. I restarted openstack-nova-compute:

~~~
[root@overcloud-compute-1 virt]# systemctl restart openstack-nova-compute
(...)
[root@overcloud-compute-1 virt]# grep reserved_huge_pages /var/log/nova/nova-compute.log | tail -n1
2017-12-19 17:56:40.727 26691 DEBUG oslo_service.service [req-e681e97d-7d99-4ba8-bee7-5f7a3f655b21 - - - - -] reserved_huge_pages = [{'node': '0', 'count': '512', 'size': '2048'}, {'node': '1', 'count': '512', 'size': '2048'}] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
[root@overcloud-compute-1 virt]#
~~~

I repeated the test:

~~~
[stack@undercloud-4 ~]$ NETID=e17bd36d-4296-40ff-affe-803c954de05a ; for i in 2 3 ; do nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test$i ; sleep 3 ; done
~~~

I spawned a total of 6 VMs:

~~~
[stack@undercloud-4 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                                           |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| 18fc41df-1718-4d55-97b6-e7ce27c69054 | cirros-test1 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fede:e904, 10.0.0.102 |
| 436fadef-459b-4c7d-b146-3ea2f9120a00 | cirros-test2 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe7d:f526, 10.0.0.109 |
| 8f9d7634-e71f-4f37-bcd9-d4a2ee6adf9d | cirros-test3 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe40:a120, 10.0.0.114 |
| 6ba1cd0e-fe5c-4eb0-8323-71f22ca8e1dd | cirros-test4 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe46:db, 10.0.0.101   |
| 834b00b2-c521-40db-ac28-974ca2bdec8e | cirros-test5 | ERROR  | -          | NOSTATE     |                                                    |
| 53fb8e43-539a-499f-8831-100a307c0304 | cirros-test6 | ERROR  | -          | NOSTATE     |                                                    |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-1 virt]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
[root@overcloud-compute-1 virt]# virsh list
 Id    Name                           State
----------------------------------------------------
 54    instance-00000023              running
 55    instance-00000024              running
 56    instance-00000025              running
 57    instance-00000026              running
[root@overcloud-compute-1 virt]# for i in {54..57}; do virsh vcpupin $i; done
VCPU: CPU Affinity
----------------------------------
   0: 1
VCPU: CPU Affinity
----------------------------------
   0: 2
VCPU: CPU Affinity
----------------------------------
   0: 5
VCPU: CPU Affinity
----------------------------------
   0: 6
~~~

The problem seems to be fixed by this setting!

~~~
[root@overcloud-compute-1 virt]# grep reserved_huge /etc/nova/nova.conf -B1
[DEFAULT]
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
~~~
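As an editorial sanity check (not part of the original comment), the 4 ACTIVE / 2 ERROR split above is consistent with the capacity left once 512 pages per node are reserved for OVS-DPDK:

~~~python
# Editorial arithmetic: with 512 of 1024 pages per node reserved, each node
# holds two 512 MB guests (2 * 256 pages), so only 4 of the 6 requested
# instances can be placed on this compute node.
pages_per_node_for_guests = 1024 - 512
pages_per_guest = (512 * 1024) // 2048
guests_per_node = pages_per_node_for_guests // pages_per_guest   # 2
print(guests_per_node * 2)   # 4 -> cirros-test5 and cirros-test6 cannot fit
~~~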