Environment Red Hat OpenStack Platform 10 Issue When spawning an instance and scheduling it onto a compute node which still has sufficient pCPUs for the instance and also sufficient free huge pages for the instance memory, nova returns: Raw [stack@undercloud-4 ~]$ nova show 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc (...) | fault | {"message": "Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc. Last exception: internal error: process exited while connecting to monitor: 2017-11-23T19:53:20.311446Z qemu-kvm: -chardev pty,id=cha", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 492, in build_instances | | | filter_properties, instances[0].uuid) | | | File \"/usr/lib/python2.7/site-packages/nova/scheduler/utils.py\", line 184, in populate_retry | | | raise exception.MaxRetriesExceeded(reason=msg) | | | ", "created": "2017-11-23T19:53:22Z"} (...) And /var/log/nova/nova-compute.log on the compute node gives the following ERROR message: Raw 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [req-2ad59cdf-4901-4df1-8bd7-ebaea20b9361 5d1785ee87294a6fad5e2bdddd91cc20 8c307c08d2234b339c504bfdd896c13e - - -] [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] Instance failed to spawn 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] Traceback (most recent call last): 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2087, in _build_resources 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] yield resources 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1928, in _build_and_run_instance 
2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] block_device_info=block_device_info) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2674, in spawn 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] destroy_disks_on_failure=True) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5013, in _create_domain_and_network 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] destroy_disks_on_failure) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] self.force_reraise() 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] six.reraise(self.type_, self.value, self.tb) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4985, in _create_domain_and_network 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] post_xml_callback=post_xml_callback) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File 
"/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4903, in _create_domain 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] guest.launch(pause=pause) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 144, in launch 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] self._encoded_xml, errors='ignore') 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] self.force_reraise() 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] six.reraise(self.type_, self.value, self.tb) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 139, in launch 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] return self._domain.createWithFlags(flags) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] result = proxy_call(self._autowrap, f, *args, **kwargs) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 
1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] rv = execute(f, *args, **kwargs) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] six.reraise(c, e, tb) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] rv = meth(*args, **kwargs) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] libvirtError: internal error: process exited while connecting to monitor: 2017-11-23T19:53:20.311446Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] 2017-11-23T19:53:20.477183Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:21.021 
153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] 2017-11-23 19:53:21.021 153615 ERROR nova.compute.manager [instance: 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc] Additionally, libvirt creates the following log file: Raw [root@overcloud-compute-1 qemu]# cat instance-00000006.log 2017-11-23 19:53:02.145+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/5-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:03.217386Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:03.359799Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/5-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:03.630+0000: shutting down, reason=failed 2017-11-23 19:53:10.052+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. 
<http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-6-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-6-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device 
isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:11.466399Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:11.729226Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/6-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:12.159+0000: shutting down, reason=failed 2017-11-23 19:53:19.370+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-10.el7), hostname: overcloud-compute-1.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000006,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-7-instance-00000006/master-key.aes -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off -cpu SandyBridge,vme=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.8-5.el7ost,serial=4f88fcca-0cd3-4e19-8dc4-4436a54daff8,uuid=1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc,family=Virtual Machine' -no-user-config 
-nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-7-instance-00000006/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/run/openvswitch/vhu9758ef15-d2 -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:d6:89:65,bus=pci.0,addr=0x3 -add-fd set=0,fd=29 -chardev file,id=charserial0,path=/dev/fdset/0,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.2.8:2 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 2017-11-23T19:53:20.311446Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-11-23T19:53:20.477183Z qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-instance-00000006,share=yes,size=536870912,host-nodes=0,policy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM 2017-11-23 19:53:20.724+0000: shutting down, reason=failed
Root Cause Nova by default will first fill up NUMA node 0 if there are still free pCPUs. This issue happens when the requested pCPUs still fit into NUMA node 0, but the free hugepages on NUMA node 0 are not sufficient for the instance memory. Unfortunately, at the time of this writing, one cannot tell nova to spawn an instance on a specific NUMA node. Diagnostic Steps On a hypervisor with 2MB hugepages and 512 free hugepages per NUMA node: Raw [root@overcloud-compute-1 ~]# cat /sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 512 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 512 Node 1 HugePages_Surp: 0 And with the following NUMA architecture: Raw [root@overcloud-compute-1 nova]# lscpu | grep -i NUMA NUMA node(s): 2 NUMA node0 CPU(s): 0-3 NUMA node1 CPU(s): 4-7 Spawn 3 instances with the following flavor (1 vCPU and 512 MB of memory): Raw [stack@undercloud-4 ~]$ nova flavor-show m1.tiny +----------------------------+-------------------------------------------------------------+ | Property | Value | +----------------------------+-------------------------------------------------------------+ | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 0 | | disk | 8 | | extra_specs | {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "large"} | | id | 49debbdb-c12e-4435-97ef-f575990b352f | | name | m1.tiny | | os-flavor-access:is_public | True | | ram | 512 | | rxtx_factor | 1.0 | | swap | | | vcpus | 1 | +----------------------------+-------------------------------------------------------------+ The new instance will boot and will use memory from NUMA 1: Raw [stack@undercloud-4 ~]$ nova list | grep d98772d1-119e-48fa-b1d9-8a68411cba0b | d98772d1-119e-48fa-b1d9-8a68411cba0b | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe8d:a6ef, 10.0.0.102 | Raw [root@overcloud-compute-1 nova]# cat
/sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 256 Node 1 HugePages_Surp: 0 Raw nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test0 The 3rd instance fails to boot: Raw [stack@undercloud-4 ~]$ nova list +--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+ | 1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc | cirros-test0 | ERROR | - | NOSTATE | | | a44c43ca-49ad-43c5-b8a1-543ed8ab80ad | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe0f:565b, 10.0.0.105 | | e21ba401-6161-45e6-8a04-6c45cef4aa3e | cirros-test0 | ACTIVE | - | Running | provider1=2000:10::f816:3eff:fe69:18bd, 10.0.0.111 | +--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+ From the compute node, we can see that free hugepages on NUMA Node 0 are exhausted, whereas in theory there's still enough space on NUMA node 1: Raw [root@overcloud-compute-1 qemu]# cat /sys/devices/system/node/node*/meminfo | grep -i huge Node 0 AnonHugePages: 2048 kB Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 0 Node 0 HugePages_Surp: 0 Node 1 AnonHugePages: 2048 kB Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 512 Node 1 HugePages_Surp: 0 /var/log/nova/nova-compute.log reveals that the instance CPU shall be pinned to NUMA node 0: Raw <name>instance-00000006</name> <uuid>1b72e7a1-c298-4c92-8d2c-0a9fe886e9bc</uuid> <metadata> <nova:instance 
xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="14.0.8-5.el7ost"/> <nova:name>cirros-test0</nova:name> <nova:creationTime>2017-11-23 19:53:00</nova:creationTime> <nova:flavor name="m1.tiny"> <nova:memory>512</nova:memory> <nova:disk>8</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>1</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="5d1785ee87294a6fad5e2bdddd91cc20">admin</nova:user> <nova:project uuid="8c307c08d2234b339c504bfdd896c13e">admin</nova:project> </nova:owner> <nova:root type="image" uuid="6350211f-5a11-4e02-a21a-cb1c0d543214"/> </nova:instance> </metadata> <memory unit='KiB'>524288</memory> <currentMemory unit='KiB'>524288</currentMemory> <memoryBacking> <hugepages> <page size='2048' unit='KiB' nodeset='0'/> </hugepages> </memoryBacking> <vcpu placement='static'>1</vcpu> <cputune> <shares>1024</shares> <vcpupin vcpu='0' cpuset='2'/> <emulatorpin cpuset='2'/> </cputune> <numatune> <memory mode='strict' nodeset='0'/> <memnode cellid='0' mode='strict' nodeset='0'/> </numatune> In the above, also look at the nodeset='0' in the numatune section, which indicates that memory shall be claimed from NUMA 0.
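The behavior described in the Root Cause can be illustrated with a minimal sketch. This is not nova's actual code; the data structures and numbers are simplified from the output above. A placement check that only looks at free pCPUs picks NUMA node 0 even though only NUMA node 1 can satisfy the huge-page request:

```python
# Illustrative only: a fit check that selects the first NUMA node with
# enough free pCPUs, without also verifying free huge pages, picks node 0
# even though the instance memory can only be satisfied from node 1.

def pick_numa_node_cpu_only(nodes, vcpus_needed, pages_needed):
    """Naive selection: first node with enough free pCPUs."""
    for node_id, node in sorted(nodes.items()):
        if node["free_pcpus"] >= vcpus_needed:
            return node_id
    return None

def pick_numa_node_cpu_and_mem(nodes, vcpus_needed, pages_needed):
    """Resource-aware selection: node must satisfy CPUs *and* huge pages."""
    for node_id, node in sorted(nodes.items()):
        if (node["free_pcpus"] >= vcpus_needed
                and node["free_hugepages"] >= pages_needed):
            return node_id
    return None

# State mirroring the failing hypervisor: node 0 still has free pCPUs but
# its 2 MB huge pages are exhausted; node 1 has both.
nodes = {
    0: {"free_pcpus": 2, "free_hugepages": 0},
    1: {"free_pcpus": 2, "free_hugepages": 512},
}

# A 512 MB instance needs 256 x 2 MB pages.
print(pick_numa_node_cpu_only(nodes, 1, 256))     # 0 -> qemu-kvm preallocation fails
print(pick_numa_node_cpu_and_mem(nodes, 1, 256))  # 1 -> instance would boot
```

The CPU-only check reproduces the failure mode: libvirt pins the guest to node 0 (nodeset='0' above), and qemu-kvm then aborts with "Insufficient free host memory pages available to allocate guest RAM".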
Thanks for the excellent bug report here, Andreas. To summarize, the issue is that there are sufficient free hugepages on a second NUMA node but nova is not smart enough to choose CPUs from this node instead of the first one. Is this correct? If so, this sounds like a valid issue with the scheduler. As you've noted in the Customer Portal solution, the obvious workaround is to simply allocate more hugepages, but this wastes resources and nova should handle this better IMO.
Hi Stephen, Yes, from a customer environment and from my lab, I can confirm this behavior. I did not look into the code, so I cannot tell you if this is really what happens. But looking at nova as a black box, this seems to be nova's behavior. I don't believe it's the scheduler: the scheduler passes, and the instance tries to boot on the compute node. It seems more to be related to openstack-nova-compute, which should select the other NUMA node with free memory; instead, it looks as if it considers free CPUs first and foremost when picking the NUMA node. When placing an instance on a NUMA node, nova should consider all resources (CPU, memory, etc.) and only then make a decision about where to put the instance. By the way, this could also be a configuration issue, if such a feature already exists and simply is not the default. Thanks, Andreas
Instances that use a NUMA topology should be scheduled on an isolated host aggregate, because their memory is accounted differently. Mixing instances with and without a NUMA topology creates this kind of behavior. Basically, in the Nova world, all instances that use pinning, huge pages, realtime, or NUMA features have a NUMA topology.
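The rule stated above can be sketched as a rough check. This is an illustration only, not nova's implementation (nova's actual rules for building an implicit NUMA topology are more involved), but it captures which flavor extra specs the comment refers to:

```python
# Rough illustration: extra specs that cause nova to treat a guest as
# having a NUMA topology (pinning, huge pages, realtime, explicit NUMA).
def uses_numa_topology(extra_specs):
    if "hw:numa_nodes" in extra_specs:
        return True
    if extra_specs.get("hw:cpu_policy") == "dedicated":
        return True
    if "hw:mem_page_size" in extra_specs:
        return True
    if extra_specs.get("hw:cpu_realtime") in ("yes", "true"):
        return True
    return False

# The m1.tiny flavor from the Diagnostic Steps implies a NUMA topology:
print(uses_numa_topology({"hw:cpu_policy": "dedicated",
                          "hw:mem_page_size": "large"}))  # True
print(uses_numa_topology({}))                             # False
```

This is why the host-aggregate separation matters: any flavor matching one of these conditions is memory-accounted per NUMA cell, while plain flavors are not.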
In this case, instances aren't actually being mixed. All instances are Cisco CSR1kv, using a single flavor with 4G memory, 2 vCPUs, and {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "any"}. The first 10-12 instances start fine, but once all the 1G hugepages are allocated from NUMA node 0 (with 2 left on NUMA node 0 and 52 available on NUMA node 1), that's when it fails. Retries booting 10 instances at a time result in 4-6 succeeding and the rest failing with this error. Retrying repeatedly eventually brings up the maximum number of instances, but only after about 40% of them fail to start.
I need a sosreport because I'm not able to reproduce the case. In my env the guests are well placed on the host NUMA nodes. I started several instances and all were assigned to NUMA nodes with huge pages available. Another point is that, to schedule on NUMA node 1, since you have cpu_policy=dedicated, you need to have free pCPUs on that host NUMA node. Also, are you using the vcpu_pin_set option to exclude some pCPUs?
Another point I just noted: my tests are on master, since I don't think there were changes in this part of the code, but based on comment 4 it seems that could be the case. I will try that. But please, in the meantime, if you can share a sosreport, that would help.
Created attachment 1367497 [details] sosreport from lab compute node
Ok, I found the issue. Not sure it will be easy to fix, but it should be backportable in all cases. We only check for small pages available on the host NUMA node when verifying whether we can fit the guest NUMA node. What makes the work a bit difficult is that the hugepages placement and page size selection (when using ANY) are done in a different place than the pinning. So my worry is that a large refactor is needed to fix the issue.
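The check described above can be sketched under assumed data structures (these are not nova's real objects): when fitting a guest NUMA cell onto a host NUMA cell, the free-page count has to be verified for the page size the guest actually requests — or for each candidate size when using ANY — not only for small pages.

```python
# Illustrative page-size-aware fit check for one host NUMA cell.
# host_mempages maps page size (KiB) -> free page count.
def cell_fits(host_mempages, mem_kb, requested_size_kb=None):
    """Return the page size that can back mem_kb, or None if none fits."""
    if requested_size_kb is None:
        # 'any': prefer the largest page size that can satisfy the request.
        sizes = sorted(host_mempages, reverse=True)
    else:
        sizes = [requested_size_kb]
    for size in sizes:
        free_kb = host_mempages.get(size, 0) * size
        if free_kb >= mem_kb:
            return size
    return None

# Host cell with plenty of free small (4 KiB) pages but no free 2 MB pages.
# A check that only looks at small pages would wrongly accept this cell.
host = {4: 500000, 2048: 0}
print(cell_fits(host, 524288, 2048))   # None: 512 MB of 2 MB pages unavailable
print(cell_fits(host, 524288, None))   # 4: falls back to small pages
```

The bug, as described in the comment, is that only the small-page availability was consulted, so a cell like the one above passed the fit check and the guest later failed at qemu-kvm preallocation time.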
*** Bug 1499083 has been marked as a duplicate of this bug. ***
*** Bug 1519540 has been marked as a duplicate of this bug. ***
(In reply to Sahid Ferdjaoui from comment #17) > Ok I found the issue. Not sure that will be so easy to fix but probably > backportable in all cases. We only check for small pages available on the > host NUMA node when verifying whether we can fit the guest NUMA node. The > thing which makes the work a bit difficult is that the hugepages placement > and page size selection (when using ANY) is done in a different place than > the pinning. So my worry is a large refactor to fix the issue. Any setting for flavor mem_page_size that might be a workaround? We attempted large, 4GB, 2048, 2, any, with similar results. Or, kernel hugepage size/count? Instance flavor is 4G memory, no variations. Would this also occur when using CPU pinning and SR-IOV?
Hi Joe, I also tried your suggestion in my lab, interleaving the vCPUs: 2017-12-15 15:09:36.357 547419 DEBUG oslo_service.service [req-d2f1dd34-cdcb-489f-99f6-820ed85e2a9f - - - - -] vcpu_pin_set = 1,5,2,6,3,7 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622 But that doesn't work for me, either: [root@overcloud-compute-1 ~]# lscpu | grep -i numa NUMA node(s): 2 NUMA node0 CPU(s): 0-3 NUMA node1 CPU(s): 4-7 [root@overcloud-compute-1 ~]# virsh list Id Name State ---------------------------------------------------- 33 instance-00000014 running 34 instance-00000015 running [root@overcloud-compute-1 ~]# virsh vcpupinset 33 error: unknown command: 'vcpupinset' [root@overcloud-compute-1 ~]# virsh vcpupin 33 VCPU: CPU Affinity ---------------------------------- 0: 1 [root@overcloud-compute-1 ~]# virsh vcpupin 34 VCPU: CPU Affinity ---------------------------------- 0: 2 2017-12-15 15:15:33.875 547419 ERROR nova.virt.libvirt.guest [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] Error launching a defined domain with XML: <domain type='kvm'> <name>instance-00000016</name> <uuid>7edb8e02-203f-4974-8664-ba31014230a4</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="14.0.8-5.el7ost"/> <nova:name>cirros-test3</nova:name> <nova:creationTime>2017-12-15 15:15:30</nova:creationTime> <nova:flavor name="m1.tiny"> 2017-12-15 15:15:34.859 547419 DEBUG nova.compute.manager [req-fffa6257-c757-4ab8-8081-01d8ffabaa26 ffe8d5e0f97b4849bfcb901f52dcac76 4dc6c5de84134974a8282eb8a39f8cd1 - - -] [instance: 7edb8e 02-203f-4974-8664-ba31014230a4] Build of instance 7edb8e02-203f-4974-8664-ba31014230a4 was re-scheduled: internal error: process exited while connecting to monitor: 2017-12-15T15:15:33.49890 3Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/3 (label charserial1) 2017-12-15T15:15:33.672454Z qemu-kvm: -object 
memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/37-instance-00000016,share=yes,size=536870912,host-nodes=0,po licy=bind: os_mem_prealloc: Insufficient free host memory pages available to allocate guest RAM
I can see some workarounds, which consist of adding VMs used as padding, or of tweaking the vcpu_pin_set option. - You can boot an instance configured to fit into the last pCPUs available on host NUMA node 0 (consider using hw:cpu_policy=dedicated, hw:numa_nodes=1). - You can update vcpu_pin_set to remove the last pCPUs in host NUMA node 0, start the guest, and then revert your change to the vcpu_pin_set option.
I did not notice that in my first investigation, but using DPDK implies huge pages are consumed. If we look at your env without any guests running on compute-1, we can see that: [root@overcloud-compute-1 nova]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_ Node 0 HugePages_Total: 1024 Node 0 HugePages_Free: 512 Node 0 HugePages_Surp: 0 Node 1 HugePages_Total: 1024 Node 1 HugePages_Free: 512 Node 1 HugePages_Surp: 0 Both host NUMA nodes already consume 512 pages. An option "reserved_huge_pages" has been introduced in nova.conf [0]. Basically, what you want to do is indicate to Nova that part of the available huge pages will be used by other components: reserved_huge_pages=node:0,size:2048,count:512 reserved_huge_pages=node:1,size:2048,count:512 Please let me know whether that fixes your issue. Thanks, s. [0] https://review.openstack.org/#/c/292499/
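The accounting that reserved_huge_pages enables can be sketched as follows (a simplification; nova's real accounting lives in its NUMA cell objects): the reserved count is subtracted from each node's total before deciding whether a guest's huge-page request fits, so pages consumed by other components such as OVS-DPDK are never promised to instances.

```python
# Simplified view of per-node huge-page accounting with a reservation.
def free_pages_for_guests(total, used_by_guests, reserved):
    """Pages nova may still hand out to instances on this NUMA node."""
    return total - used_by_guests - reserved

# Node 0 from the output above: 1024 pages total, 512 already taken by
# DPDK. With reserved_huge_pages=node:0,size:2048,count:512, nova knows
# only 512 pages (1 GB of 2 MB pages) are actually usable by guests.
print(free_pages_for_guests(1024, 0, 512))    # 512

# Without the reservation, nova would believe 1024 pages are free and
# overcommit the node, reproducing the os_mem_prealloc failure.
print(free_pages_for_guests(1024, 0, 0))      # 1024
```

With the reservation in place, the scheduler and compute node both stop counting the DPDK-held pages, which is why the verification run below succeeds.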
Hi Sahid, It seems that you made some code changes in the lab: ~~~ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py: # TODO(sahid): We are converting all calls from a /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py: # TODO(sahid): We are converting all calls from a /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py: # TODO(sahid): Needs to use get_info but more changes have to /usr/lib/python2.7/site-packages/nova/virt/hardware.py:# TODO(sahid): Move numa related to hardward/numa.py /usr/lib/python2.7/site-packages/nova/virt/hardware.py: LOG.debug("sahid mempages new %s", newcell.mempages) [root@overcloud-compute-1 ~]# rpm -qf /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py python-nova-14.0.8-5.el7ost.noarch [root@overcloud-compute-1 ~]# rpm -qV python-nova-14.0.8-5.el7ost.noarch S.5....T. /usr/lib/python2.7/site-packages/nova/virt/hardware.py S.5....T. /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py ~~~ So I reinstalled the stock package: ~~~ [root@overcloud-compute-1 ~]# yum reinstall python-nova-14.0.8-5.el7ost.noarch -y (...) ~~~ ~~~ [root@overcloud-compute-1 virt]# diff -ruN /root/nova-sahid /usr/lib/python2.7/site-packages/nova diff -ruN /root/nova-sahid/virt/hardware.py /usr/lib/python2.7/site-packages/nova/virt/hardware.py --- /root/nova-sahid/virt/hardware.py 2017-12-19 12:02:42.948676221 +0000 +++ /usr/lib/python2.7/site-packages/nova/virt/hardware.py 2017-10-24 03:44:48.000000000 +0000 @@ -829,8 +829,6 @@ :returns: objects.InstanceNUMACell instance with pinning information, or None if instance cannot be pinned to the given host """ - LOG.debug("host memepages %s", host_cell.mempages) - LOG.debug("instance pagesize requested: %s", instance_cell.pagesize) if host_cell.avail_cpus < len(instance_cell.cpuset): LOG.debug('Not enough available CPUs to schedule instance. ' 'Oversubscription is not possible with pinned instances. 
' @@ -929,10 +927,8 @@ pagesize = None if instance_cell.pagesize: - LOG.debug("pagesize requested: %s", instance_cell.pagesize) pagesize = _numa_cell_supports_pagesize_request( host_cell, instance_cell) - LOG.debug("pagesize %s, node=%s", pagesize, host_cell) if not pagesize: LOG.debug('Host does not support requested memory pagesize. ' 'Requested: %d kB', instance_cell.pagesize) @@ -1403,7 +1399,6 @@ if instancecell.pagesize and instancecell.pagesize > 0: newcell.mempages = _numa_pagesize_usage_from_cell( hostcell, instancecell, sign) - LOG.debug("sahid mempages new %s", newcell.mempages) if instance.cpu_pinning_requested: pinned_cpus = set(instancecell.cpu_pinning.values()) if free: diff -ruN /root/nova-sahid/virt/libvirt/driver.py /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py --- /root/nova-sahid/virt/libvirt/driver.py 2017-12-19 11:22:06.503893967 +0000 +++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py 2017-10-24 03:44:48.000000000 +0000 @@ -5343,7 +5343,7 @@ reserved=_get_reserved_memory_for_cell( self, cell.id, pages.size)) for pages in cell.mempages] - LOG.debug("mempages1 %s", mempages) + cell = objects.NUMACell(id=cell.id, cpuset=cpuset, memory=cell.memory / units.Ki, cpu_usage=0, memory_usage=0, [root@overcloud-compute-1 virt]# ~~~ All of the above being extra logging, though, I guess that this had no impact on your tests. I restarted openstack-nova-compute: ~~~ [root@overcloud-compute-1 virt]# systemctl restart openstack-nova-compute (...) 
~~~
[root@overcloud-compute-1 virt]# grep reserved_huge_pages /var/log/nova/nova-compute.log | tail -n1
2017-12-19 17:56:40.727 26691 DEBUG oslo_service.service [req-e681e97d-7d99-4ba8-bee7-5f7a3f655b21 - - - - -] reserved_huge_pages = [{'node': '0', 'count': '512', 'size': '2048'}, {'node': '1', 'count': '512', 'size': '2048'}] log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2622
[root@overcloud-compute-1 virt]#
~~~

I repeated the test:

~~~
[stack@undercloud-4 ~]$ NETID=e17bd36d-4296-40ff-affe-803c954de05a ; for i in 2 3 ; do nova boot --nic net-id=$NETID --image cirros --flavor m1.tiny --key-name id_rsa cirros-test$i ; sleep 3 ; done
~~~

I spawned a total of 6 VMs:

~~~
[stack@undercloud-4 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                                           |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
| 18fc41df-1718-4d55-97b6-e7ce27c69054 | cirros-test1 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fede:e904, 10.0.0.102 |
| 436fadef-459b-4c7d-b146-3ea2f9120a00 | cirros-test2 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe7d:f526, 10.0.0.109 |
| 8f9d7634-e71f-4f37-bcd9-d4a2ee6adf9d | cirros-test3 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe40:a120, 10.0.0.114 |
| 6ba1cd0e-fe5c-4eb0-8323-71f22ca8e1dd | cirros-test4 | ACTIVE | -          | Running     | provider1=2000:10::f816:3eff:fe46:db, 10.0.0.101   |
| 834b00b2-c521-40db-ac28-974ca2bdec8e | cirros-test5 | ERROR  | -          | NOSTATE     |                                                    |
| 53fb8e43-539a-499f-8831-100a307c0304 | cirros-test6 | ERROR  | -          | NOSTATE     |                                                    |
+--------------------------------------+--------------+--------+------------+-------------+----------------------------------------------------+
~~~

~~~
[root@overcloud-compute-1 virt]# cat /sys/devices/system/node/node*/meminfo | grep -i hugepages_
Node 0 HugePages_Total:  1024
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:  1024
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
[root@overcloud-compute-1 virt]# virsh list
 Id    Name                           State
----------------------------------------------------
 54    instance-00000023              running
 55    instance-00000024              running
 56    instance-00000025              running
 57    instance-00000026              running

[root@overcloud-compute-1 virt]# for i in {54..57}; do virsh vcpupin $i; done
VCPU: CPU Affinity
----------------------------------
   0: 1

VCPU: CPU Affinity
----------------------------------
   0: 2

VCPU: CPU Affinity
----------------------------------
   0: 5

VCPU: CPU Affinity
----------------------------------
   0: 6
~~~

The problem seems to be fixed by this setting!
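The 4-of-6 outcome is consistent with simple hugepage arithmetic. A quick sketch; note that the 512 MB guest size is my assumption (the default m1.tiny memory), since the flavor details are not shown above:

```python
# Rough hugepage capacity arithmetic for this compute node (a sketch,
# not nova's accounting code).
PAGE_KB = 2048               # 2 MB hugepages, per the size:2048 reservation
total_pages_per_node = 1024  # "Node N HugePages_Total: 1024" above
reserved_per_node = 512      # reserved_huge_pages count per node

guest_pages_per_node = total_pages_per_node - reserved_per_node
guest_mb_per_node = guest_pages_per_node * PAGE_KB // 1024   # 1024 MB per node

vm_mb = 512                  # ASSUMPTION: default m1.tiny memory
vms_per_node = guest_mb_per_node // vm_mb                    # 2 per NUMA node
total_vms = 2 * vms_per_node                                 # 4 of the 6 boot

print(guest_mb_per_node, vms_per_node, total_vms)  # 1024 2 4
```

That matches the result: four instances ACTIVE, two in ERROR once the per-node guest budget is exhausted.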
~~~
[root@overcloud-compute-1 virt]# grep reserved_huge /etc/nova/nova.conf -B1
[DEFAULT]
reserved_huge_pages=node:0,size:2048,count:512
reserved_huge_pages=node:1,size:2048,count:512
~~~
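Each `reserved_huge_pages` entry is a comma-separated list of `key:value` pairs, and the option can be repeated once per NUMA node. A rough illustration of how one such line maps to the dict seen in the `reserved_huge_pages = [{'node': '0', ...}]` log output; this is my sketch, not nova's actual option parser:

```python
# Sketch: turn a reserved_huge_pages value such as "node:0,size:2048,count:512"
# into a dict of strings like the one oslo.config logs. Not nova's real parser.
def parse_reserved_huge_pages(value):
    return dict(item.split(":", 1) for item in value.split(","))

print(parse_reserved_huge_pages("node:0,size:2048,count:512"))
# {'node': '0', 'size': '2048', 'count': '512'}
```

Here `size` is the page size in kB and `count` is the number of pages of that size to hold back from guests on the given node.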