Nova is reporting incorrect available memory on the hypervisor. This is causing instances to crash when exceeding maximum memory. It seems that nova counts hugepages as available memory; however, these instances do not use a flavor with "mem_page_size" set and thus will not be assigned any hugepages.

nova reports "free_ram_mb=393907" vs meminfo "MemAvailable=255052952 kB", with ram_allocation_ratio = 1.0.

~~~
[root@compute2 qemu]# cat /proc/meminfo
MemTotal:        527923424 kB
MemFree:         254329696 kB
MemAvailable:    255052952 kB
Buffers:              3132 kB
Cached:            1683296 kB
SwapCached:              0 kB
Active:          132714168 kB
Inactive:           751716 kB
Active(anon):    131790480 kB
Inactive(anon):       4936 kB
Active(file):       923688 kB
Inactive(file):     746780 kB
Unevictable:        693856 kB
Mlocked:            693872 kB
SwapTotal:               0 kB
SwapFree:                0 kB
Dirty:                 960 kB
Writeback:               0 kB
AnonPages:       132473524 kB
Mapped:              79256 kB
Shmem:                2024 kB
Slab:               384300 kB
SReclaimable:       148304 kB
SUnreclaim:         235996 kB
KernelStack:         15584 kB
PageTables:         273516 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:     196852848 kB
Committed_AS:    134627936 kB
VmallocTotal:    34359738367 kB
VmallocUsed:       1769212 kB
VmallocChunk:    34089432048 kB
HardwareCorrupted:       0 kB
AnonHugePages:      610304 kB
HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
DirectMap4k:        385340 kB
DirectMap2M:      12906496 kB
DirectMap1G:     523239424 kB

+---------------------------+------------------------------------------+
| Property                  | Value                                    |
+---------------------------+------------------------------------------+
| cpu_info_arch             | x86_64                                   |
| cpu_info_features         | ["pku", "rtm", "tsc_adjust", "tsc-       |
|                           | deadline", "pge", "xsaveopt", "smep",    |
|                           | "fpu", "monitor", "lm", "tsc", "adx",    |
|                           | "fxsr", "tm", "pclmuldq", "xgetbv1",     |
|                           | "vme", "arat", "de", "aes", "pse",       |
|                           | "sse", "f16c", "ds", "mpx", "avx512f",   |
|                           | "avx2", "pbe", "mbm_local", "cx16",      |
|                           | "ds_cpl", "movbe", "cmt", "vmx", "sep",  |
|                           | "avx512dq", "xsave", "erms", "hle",      |
|                           | "est", "smx", "abm", "sse4.1", "sse4.2", |
|                           | "acpi", "mbm_total", "pdcm", "mmx",      |
|                           | "osxsave", "dca", "popcnt", "invtsc",    |
|                           | "tm2", "pcid", "rdrand", "avx512vl",     |
|                           | "x2apic", "smap", "clflush", "dtes64",   |
|                           | "xtpr", "avx512bw", "msr", "fma", "cx8", |
|                           | "mce", "avx512cd", "ht", "pni",          |
|                           | "rdseed", "apic", "fsgsbase", "rdtscp",  |
|                           | "ssse3", "pse36", "mtrr", "avx",         |
|                           | "syscall", "invpcid", "cmov",            |
|                           | "clflushopt", "pat", "3dnowprefetch",    |
|                           | "nx", "pae", "mca", "pdpe1gb", "xsavec", |
|                           | "lahf_lm", "sse2", "ss", "bmi1", "bmi2", |
|                           | "xsaves"]                                |
| cpu_info_model            | Skylake-Client                           |
| cpu_info_topology_cells   | 2                                        |
| cpu_info_topology_cores   | 20                                       |
| cpu_info_topology_sockets | 1                                        |
| cpu_info_topology_threads | 2                                        |
| cpu_info_vendor           | Intel                                    |
| current_workload          | 0                                        |
| disk_available_least      | 2034                                     |
| free_disk_gb              | 2027                                     |
| free_ram_mb               | 393907                                   |
| host_ip                   | 192.168.1.20                             |
| hypervisor_hostname       | compute2                                 |
| hypervisor_type           | QEMU                                     |
| hypervisor_version        | 2009000                                  |
| id                        | 57                                       |
| local_gb                  | 2047                                     |
| local_gb_used             | 20                                       |
| memory_mb                 | 523955                                   |
| memory_mb_used            | 130048                                   |
| running_vms               | 1                                        |
| service_disabled_reason   | None                                     |
| service_host              | compute2                                 |
| service_id                | 189                                      |
| state                     | up                                       |
| status                    | enabled                                  |
| vcpus                     | 80                                       |
| vcpus_used                | 16                                       |
+---------------------------+------------------------------------------+
~~~
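To make the discrepancy concrete, the numbers above can be checked directly. Nova computes free_ram_mb as memory_mb - memory_mb_used, where memory_mb covers all host RAM, hugepages included, while the kernel's MemAvailable excludes the hugepage pool. A quick Python sketch of the arithmetic (values copied from the output above):

```python
# Values from nova hypervisor-show (MB) and /proc/meminfo (kB) above.
memory_mb = 523955            # nova: total host RAM, hugepages included
memory_mb_used = 130048       # nova: RAM claimed by instances + reserved
hugepages_kb = 128 * 1048576  # 128 x 1 GiB hugepages per /proc/meminfo
mem_available_kb = 255052952  # kernel's MemAvailable

# Nova's view: hugepages are silently counted as schedulable RAM.
free_ram_mb = memory_mb - memory_mb_used
print(free_ram_mb)            # 393907, matching hypervisor-show

# The kernel's view is far lower; most of the gap is the hugepage pool.
gap_mb = free_ram_mb - mem_available_kb // 1024
print(gap_mb, hugepages_kb // 1024)   # 144832 vs 131072 (128 GiB pool)
```

So roughly 131 GB of the ~144 GB gap is memory that nova advertises as free but that the kernel has permanently reserved for 1 GiB hugepages no small-page instance can ever use.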
They are not specifying a CPU pin set for Nova, yet they are using CPU pinning, as can be seen in this example:

~~~
[stack@director tests]$ openstack flavor show xtest-x-flv
+----------------------------+----------------------------------------------------------+
| Field                      | Value                                                    |
+----------------------------+----------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                    |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                        |
| access_project_ids         | None                                                     |
| disk                       | 10                                                       |
| id                         | 9c364889-7cd2-42da-b0aa-ac9e8e1944a7                     |
| name                       | xtest-x-flv                                              |
| os-flavor-access:is_public | True                                                     |
| properties                 | hw:cpu_policy='dedicated', hw:cpu_thread_policy='prefer' |
| ram                        | 2048                                                     |
| rxtx_factor                | 1.0                                                      |
| swap                       |                                                          |
| vcpus                      | 2                                                        |
+----------------------------+----------------------------------------------------------+

[stack@director tests]$ openstack server show 534e22aa-cf88-47af-80e2-0ef955800d8e
+--------------------------------------+-----------------------------------------------------------------------------+
| Field                                | Value                                                                       |
+--------------------------------------+-----------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                      |
| OS-EXT-AZ:availability_zone          | nova                                                                        |
| OS-EXT-SRV-ATTR:host                 | compute2                                                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | compute2                                                                    |
| OS-EXT-SRV-ATTR:instance_name        | instance-000023ca                                                           |
| OS-EXT-STS:power_state               | Running                                                                     |
| OS-EXT-STS:task_state                | None                                                                        |
| OS-EXT-STS:vm_state                  | active                                                                      |
| OS-SRV-USG:launched_at               | 2017-11-29T20:28:23.000000                                                  |
| OS-SRV-USG:terminated_at             | None                                                                        |
| accessIPv4                           |                                                                             |
| accessIPv6                           |                                                                             |
| addresses                            | edn-net1=192.168.78.24, fd00:4888:2000:f411:524:ff2:0:28; sriov_a=fd00:a::3 |
| config_drive                         | True                                                                        |
| created                              | 2017-11-29T20:27:50Z                                                        |
| flavor                               | compute2 (9c364889-7cd2-42da-b0aa-ac9e8e1944a7)                             |
| hostId                               | 1a6530f1a6d7f440d44fcb0a2ea68e2f6e37797db8dd1ec92fae7ecb                    |
| id                                   | 534e22aa-cf88-47af-80e2-0ef955800d8e                                        |
| image                                | xtest-x-img (1776ac84-ee53-4702-8425-3a3e0a63ee9e)                          |
| key_name                             | xtest-x-key                                                                 |
| name                                 | xtest-x-srv0                                                                |
| os-extended-volumes:volumes_attached | [{u'id': u'dde6245f-fca7-4ded-bebb-b9c1ebe1039d'}]                          |
| progress                             | 0                                                                           |
| project_id                           | 643ab7a6e77841b78abbc67d8eb6bfbb                                            |
| properties                           |                                                                             |
| security_groups                      | [{u'name': u'xtest-scp'}]                                                   |
| status                               | ACTIVE                                                                      |
| updated                              | 2017-11-29T20:28:23Z                                                        |
| user_id                              | c7e640e605a841478ffaab52f254f96f                                            |
+--------------------------------------+-----------------------------------------------------------------------------+

[stack@director tests]$ ssh compute2 sudo virsh vcpuinfo instance-000023ca
VCPU:           0
CPU:            18
State:          running
CPU time:       50.8s
CPU Affinity:   ------------------y-------------------------------------------------------------

VCPU:           1
CPU:            58
State:          running
CPU time:       30.6s
CPU Affinity:   ----------------------------------------------------------y---------------------
~~~

~~~
egrep '^scheduler_available|^scheduler_default' /etc/nova/nova.conf
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_available_filters=nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter,AggregateInstanceExtraSpecsFilter,PciPassthroughFilter,SameHostFilter,DifferentHostFilter,AggregateCoreFilter
~~~

NUMATopologyFilter is configured.

-MM
This issue looks similar to bug 1499083. I'm concerned to see both of these options configured at the same time: hw:cpu_policy='dedicated', hw:cpu_thread_policy='prefer'.

Another point I do not understand: are the instances scheduled and started correctly, with no error? When we use cpu_policy=dedicated we normally do not let the memory float across NUMA nodes, so I would expect QEMU to raise an error, and Nova therefore not to start the instance, if the selected host NUMA node does not provide enough small pages.

- A potential workaround could be to configure the flavor to specifically use small pages by using: mem_page_size=MEMPAGES_SMALL
- Do you have any relevant error messages to share?
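For reference, the small-pages workaround suggested above maps to the hw:mem_page_size flavor extra spec (MEMPAGES_SMALL is nova's internal constant for it; 'small' pins the guest to 4k pages so the hugepage pool is never touched). A sketch against the flavor from this report, assuming admin credentials are loaded:

```shell
# Hypothetical example: force the flavor onto small (4k) pages so the
# scheduler stops handing out memory backed by the 1 GiB hugepage pool.
openstack flavor set --property hw:mem_page_size=small xtest-x-flv
openstack flavor show xtest-x-flv -c properties
```

Note this is the workaround, not a fix: it changes instance placement behaviour and has to be applied per flavor.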
@Sahid Yeah, that workaround does somewhat work, and so does disabling hugepages altogether. However, this does not fit the needs of the CU. As for the error, I will post it below.

-MM
*** This bug has been marked as a duplicate of bug 1517004 ***
Patch posted upstream: https://review.openstack.org/#/c/532168/
Customer confirmed that libvirt is properly reporting memory:

~~~
<capabilities>

  <host>
    <uuid>8588f4c2-7853-49da-95ed-5df1018f7c41</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>Haswell-noTSX</model>
      <vendor>Intel</vendor>
      <microcode version='58'/>
      <topology sockets='1' cores='12' threads='2'/>
      <feature name='vme'/>
      <feature name='ds'/>
      <feature name='acpi'/>
      <feature name='ss'/>
      <feature name='ht'/>
      <feature name='tm'/>
      <feature name='pbe'/>
      <feature name='dtes64'/>
      <feature name='monitor'/>
      <feature name='ds_cpl'/>
      <feature name='vmx'/>
      <feature name='smx'/>
      <feature name='est'/>
      <feature name='tm2'/>
      <feature name='xtpr'/>
      <feature name='pdcm'/>
      <feature name='dca'/>
      <feature name='osxsave'/>
      <feature name='f16c'/>
      <feature name='rdrand'/>
      <feature name='arat'/>
      <feature name='tsc_adjust'/>
      <feature name='cmt'/>
      <feature name='xsaveopt'/>
      <feature name='pdpe1gb'/>
      <feature name='abm'/>
      <feature name='invtsc'/>
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='1048576'/>
    </cpu>
    <power_management>
      <suspend_mem/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
        <uri_transport>rdma</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>201197900</memory>
          <pages unit='KiB' size='4'>25133651</pages>
          <pages unit='KiB' size='1048576'>96</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='21'/>
          </distances>
          <cpus num='24'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0,24'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1,25'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2,26'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3,27'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4,28'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5,29'/>
            <cpu id='6' socket_id='0' core_id='8' siblings='6,30'/>
            <cpu id='7' socket_id='0' core_id='9' siblings='7,31'/>
            <cpu id='8' socket_id='0' core_id='10' siblings='8,32'/>
            <cpu id='9' socket_id='0' core_id='11' siblings='9,33'/>
            <cpu id='10' socket_id='0' core_id='12' siblings='10,34'/>
            <cpu id='11' socket_id='0' core_id='13' siblings='11,35'/>
            <cpu id='24' socket_id='0' core_id='0' siblings='0,24'/>
            <cpu id='25' socket_id='0' core_id='1' siblings='1,25'/>
            <cpu id='26' socket_id='0' core_id='2' siblings='2,26'/>
            <cpu id='27' socket_id='0' core_id='3' siblings='3,27'/>
            <cpu id='28' socket_id='0' core_id='4' siblings='4,28'/>
            <cpu id='29' socket_id='0' core_id='5' siblings='5,29'/>
            <cpu id='30' socket_id='0' core_id='8' siblings='6,30'/>
            <cpu id='31' socket_id='0' core_id='9' siblings='7,31'/>
            <cpu id='32' socket_id='0' core_id='10' siblings='8,32'/>
            <cpu id='33' socket_id='0' core_id='11' siblings='9,33'/>
            <cpu id='34' socket_id='0' core_id='12' siblings='10,34'/>
            <cpu id='35' socket_id='0' core_id='13' siblings='11,35'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>201326592</memory>
          <pages unit='KiB' size='4'>25165824</pages>
          <pages unit='KiB' size='1048576'>96</pages>
          <distances>
            <sibling id='0' value='21'/>
            <sibling id='1' value='10'/>
          </distances>
          <cpus num='24'>
            <cpu id='12' socket_id='1' core_id='0' siblings='12,36'/>
            <cpu id='13' socket_id='1' core_id='1' siblings='13,37'/>
            <cpu id='14' socket_id='1' core_id='2' siblings='14,38'/>
            <cpu id='15' socket_id='1' core_id='3' siblings='15,39'/>
            <cpu id='16' socket_id='1' core_id='4' siblings='16,40'/>
            <cpu id='17' socket_id='1' core_id='5' siblings='17,41'/>
            <cpu id='18' socket_id='1' core_id='8' siblings='18,42'/>
            <cpu id='19' socket_id='1' core_id='9' siblings='19,43'/>
            <cpu id='20' socket_id='1' core_id='10' siblings='20,44'/>
            <cpu id='21' socket_id='1' core_id='11' siblings='21,45'/>
            <cpu id='22' socket_id='1' core_id='12' siblings='22,46'/>
            <cpu id='23' socket_id='1' core_id='13' siblings='23,47'/>
            <cpu id='36' socket_id='1' core_id='0' siblings='12,36'/>
            <cpu id='37' socket_id='1' core_id='1' siblings='13,37'/>
            <cpu id='38' socket_id='1' core_id='2' siblings='14,38'/>
            <cpu id='39' socket_id='1' core_id='3' siblings='15,39'/>
            <cpu id='40' socket_id='1' core_id='4' siblings='16,40'/>
            <cpu id='41' socket_id='1' core_id='5' siblings='17,41'/>
            <cpu id='42' socket_id='1' core_id='8' siblings='18,42'/>
            <cpu id='43' socket_id='1' core_id='9' siblings='19,43'/>
            <cpu id='44' socket_id='1' core_id='10' siblings='20,44'/>
            <cpu id='45' socket_id='1' core_id='11' siblings='21,45'/>
            <cpu id='46' socket_id='1' core_id='12' siblings='22,46'/>
            <cpu id='47' socket_id='1' core_id='13' siblings='23,47'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
      <baselabel type='kvm'>system_u:system_r:svirt_t:s0</baselabel>
      <baselabel type='qemu'>system_u:system_r:svirt_tcg_t:s0</baselabel>
    </secmodel>
    <secmodel>
      <model>dac</model>
      <doi>0</doi>
      <baselabel type='kvm'>+107:+107</baselabel>
      <baselabel type='qemu'>+107:+107</baselabel>
    </secmodel>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <machine maxCpus='240'>pc-i440fx-rhel7.4.0</machine>
      <machine canonical='pc-i440fx-rhel7.4.0' maxCpus='240'>pc</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.0.0</machine>
      <machine maxCpus='240'>rhel6.3.0</machine>
      <machine maxCpus='240'>rhel6.4.0</machine>
      <machine maxCpus='240'>rhel6.0.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.1.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.2.0</machine>
      <machine maxCpus='255'>pc-q35-rhel7.3.0</machine>
      <machine maxCpus='240'>rhel6.5.0</machine>
      <machine maxCpus='384'>pc-q35-rhel7.4.0</machine>
      <machine canonical='pc-q35-rhel7.4.0' maxCpus='384'>q35</machine>
      <machine maxCpus='240'>rhel6.6.0</machine>
      <machine maxCpus='240'>rhel6.1.0</machine>
      <machine maxCpus='240'>rhel6.2.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.3.0</machine>
      <domain type='qemu'/>
      <domain type='kvm'>
        <emulator>/usr/libexec/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
      <pae/>
      <nonpae/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <machine maxCpus='240'>pc-i440fx-rhel7.4.0</machine>
      <machine canonical='pc-i440fx-rhel7.4.0' maxCpus='240'>pc</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.0.0</machine>
      <machine maxCpus='240'>rhel6.3.0</machine>
      <machine maxCpus='240'>rhel6.4.0</machine>
      <machine maxCpus='240'>rhel6.0.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.1.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.2.0</machine>
      <machine maxCpus='255'>pc-q35-rhel7.3.0</machine>
      <machine maxCpus='240'>rhel6.5.0</machine>
      <machine maxCpus='384'>pc-q35-rhel7.4.0</machine>
      <machine canonical='pc-q35-rhel7.4.0' maxCpus='384'>q35</machine>
      <machine maxCpus='240'>rhel6.6.0</machine>
      <machine maxCpus='240'>rhel6.1.0</machine>
      <machine maxCpus='240'>rhel6.2.0</machine>
      <machine maxCpus='240'>pc-i440fx-rhel7.3.0</machine>
      <domain type='qemu'/>
      <domain type='kvm'>
        <emulator>/usr/libexec/qemu-kvm</emulator>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

</capabilities>
Connection to cmp0 closed.
~~~
The issue looks to be different now, right? This time it seems you can't spawn some instances, whereas previously they were getting killed by the OOM killer. Which instances are failing to boot? What is the error reported? Can you share the nova-compute logs? Are there any third-party components also consuming memory on the host?

It's important to consider reserving memory. You are trying to fit all memory on each NUMA node, and that is not something I would recommend. It's important to take into account the QEMU overhead. The guest memory can't float, so if any other component is using memory on that node (emulator threads, QEMU overhead, devices...), the guest could be killed.

Two options need to be taken into account [0]:

  reserved_host_memory_mb
  reserved_huge_pages

[0] https://docs.openstack.org/ocata/config-reference/compute/config-options.html
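To illustrate, those two options are set in nova.conf on each compute node. This is only a sketch with made-up sizes (tune to your host); reserved_huge_pages takes node/size/count triples and may be repeated per NUMA node, with size given as a page size in KiB:

```ini
[DEFAULT]
# Hold back host RAM for the OS, QEMU overhead and other daemons
# (example value -- not a recommendation).
reserved_host_memory_mb = 4096
# Reserve some of the 1 GiB pages (size in KiB) on each NUMA node
# for non-guest use (example values).
reserved_huge_pages = node:0,size:1048576,count:4
reserved_huge_pages = node:1,size:1048576,count:4
```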
Validated that the code works as intended, but noted that it breaks oversubscription of memory for instances with a NUMA topology but no hugepages: https://bugzilla.redhat.com/show_bug.cgi?id=1664702

verification steps
------------------

1. On the controller node of a fresh deployment:

~~~
$ mysql -u nova -ss -r -e "select numa_topology from nova.compute_nodes where ID=2 ;" | python -m json.tool
...
"mempages": [
    {
        "nova_object.changes": [
            "total",
            "used",
            "reserved",
            "size_kb"
        ],
        "nova_object.data": {
            "reserved": 0,
            "size_kb": 4,
            "total": 8594000,
            "used": 0
        },
        "nova_object.name": "NUMAPagesTopology",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.1"
    },
...
~~~

2. Launch an instance with a NUMA topology:

~~~
$ openstack flavor create --id 1234 --vcpus 1 --ram 64 --disk 15 test-flavor
$ openstack flavor set --property hw:numa_nodes=1 test-flavor
$ openstack server create test-vm --flavor test-flavor --image test-rhel75
~~~

3. Compare the value in the db:

~~~
$ mysql -u nova -ss -r -e "select numa_topology from nova.compute_nodes where ID=2 ;" | python -m json.tool
...
"mempages": [
    {
        "nova_object.changes": [
            "total",
            "used",
            "reserved",
            "size_kb"
        ],
        "nova_object.data": {
            "reserved": 0,
            "size_kb": 4,
            "total": 8594000,
            "used": 16384
        },
        "nova_object.name": "NUMAPagesTopology",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.1"
    },
...
~~~

Running a full diff on the output of both requests we can check:

~~~
$ diff before.log after.log
...
20c20
<     "cpu_usage": 0,
---
>     "cpu_usage": 1,
41c41
<     "memory_usage": 0,
---
>     "memory_usage": 64,
54c54
<     "used": 0
---
>     "used": 16384
62d61
...
~~~
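As a sanity check on the figures above: the "used" counter is expressed in pages of "size_kb", so the 64 MB flavor should consume exactly 16384 small pages. A quick Python check:

```python
size_kb = 4          # "size_kb" from the NUMAPagesTopology object
flavor_ram_mb = 64   # --ram 64 on the test flavor
used_pages = 16384   # "used" after the instance is launched

# 64 MB expressed in 4 kB pages should equal the recorded usage.
expected_pages = flavor_ram_mb * 1024 // size_kb
print(expected_pages == used_pages)  # True
```

This confirms the instance's RAM is now accounted against the small-page pool rather than silently overlapping the hugepage reservation.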
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0074