Setting Customer Escalation = Yes, as warranted by internal ACE EN-31894, case 02686309.
After updating the mlx firmware they no longer hit the rte panic, but although VXLAN is configured, the tunnel does not exist. We tried enabling the VXLAN feature in the driver, but that does not help. There is no ICMP or IPv4 traffic on the internal bridge except for announcements and broadcasts. There are no other flows.
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 0 | | access_project_ids | None | | description | None | | disk | 16 | | extra_specs | {'hw:cpu_policy': 'dedicated', 'hw:cpu_sockets': '1', 'hw:cpu_threads': '2', 'hw:cpu_thread_policy': 'require', 'hw:emulator_threads_policy': 'share', 'hw:mem_page_size': 'large'} | | id | 8e8efd7f-7f55-4620-899a-b252733f2950 | | name | dpdk-tuned | | os-flavor-access:is_public | True | | properties | hw:cpu_policy='dedicated', hw:cpu_sockets='1', hw:cpu_thread_policy='require', hw:cpu_threads='2', hw:emulator_threads_policy='share', hw:mem_page_size='large' | | ram | 16384 | | rxtx_factor | 1.0 | | swap | 0 | | vcpus | 8 | +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Field | Value | 
+-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | OS-DCF:diskConfig | MANUAL | | OS-EXT-AZ:availability_zone | dpdk | | OS-EXT-SRV-ATTR:host | osp16v00-dpdk.t-mobile.lab | | OS-EXT-SRV-ATTR:hostname | dpdk-02 | | OS-EXT-SRV-ATTR:hypervisor_hostname | osp16v00-dpdk.t-mobile.lab | | OS-EXT-SRV-ATTR:instance_name | instance-0000001d | | OS-EXT-SRV-ATTR:kernel_id | | | OS-EXT-SRV-ATTR:launch_index | 0 | | OS-EXT-SRV-ATTR:ramdisk_id | | | OS-EXT-SRV-ATTR:reservation_id | r-xie2u57a | | OS-EXT-SRV-ATTR:root_device_name | /dev/vda | | OS-EXT-SRV-ATTR:user_data | None | | OS-EXT-STS:power_state | Running | | OS-EXT-STS:task_state | None | | OS-EXT-STS:vm_state | active | | OS-SRV-USG:launched_at | 2020-08-04T18:14:33.000000 | | OS-SRV-USG:terminated_at | None | | accessIPv4 | | | accessIPv6 | | | addresses | nmnet-1107=10.145.57.246 | | adminPass | oNeQUKqim3jC | | config_drive | | | created | 2020-08-04T18:14:22Z | | description | None | | flavor | disk='16', ephemeral='0', extra_specs.hw:cpu_policy='dedicated', extra_specs.hw:cpu_sockets='1', extra_specs.hw:cpu_thread_policy='require', extra_specs.hw:cpu_threads='2', extra_specs.hw:emulator_threads_policy='share', extra_specs.hw:mem_page_size='large', original_name='dpdk-tuned', ram='16384', swap='0', vcpus='8' | | hostId | 140c0eebe41ebe8cd7e73c23d6b356588be051b09d5db31be9b82de9 | | host_status | UP | | id | 2bef57ad-f9bb-4f89-a644-54c9de1a416f | | image | tmo2.rhel8 (46795702-4672-4b6f-b437-df4616b2334d) | | key_name | None | | locked | False | | locked_reason | None | | name | dpdk-02 | | progress | 0 | | project_id | 8f8ab640878949f282ef5442d123ca8a | | properties | | | security_groups | name='default' | | 
server_groups | [] | | status | ACTIVE | | tags | [] | | trusted_image_certificates | None | | updated | 2020-08-04T18:14:32Z | | user_id | 0234a294d866447f969cef32d209beec | | volumes_attached | | +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ <domain type='kvm' id='5'> <name>instance-0000001d</name> <uuid>2bef57ad-f9bb-4f89-a644-54c9de1a416f</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="20.1.2-0.20200401205214.28324e6.el8ost"/> <nova:name>dpdk-02</nova:name> <nova:creationTime>2020-08-04 18:14:30</nova:creationTime> <nova:flavor name="dpdk-tuned"> <nova:memory>16384</nova:memory> <nova:disk>16</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>8</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="0234a294d866447f969cef32d209beec">admin</nova:user> <nova:project uuid="8f8ab640878949f282ef5442d123ca8a">admin</nova:project> </nova:owner> <nova:root type="image" uuid="46795702-4672-4b6f-b437-df4616b2334d"/> </nova:instance> </metadata> <memory unit='KiB'>16777216</memory> <currentMemory unit='KiB'>16777216</currentMemory> <vcpu placement='static' cpuset='0-3,64-67'>8</vcpu> <cputune> <shares>8192</shares> </cputune> <resource> <partition>/machine</partition> </resource> <sysinfo type='smbios'> <system> <entry name='manufacturer'>Red Hat</entry> <entry name='product'>OpenStack Compute</entry> <entry name='version'>20.1.2-0.20200401205214.28324e6.el8ost</entry> <entry name='serial'>2bef57ad-f9bb-4f89-a644-54c9de1a416f</entry> <entry name='uuid'>2bef57ad-f9bb-4f89-a644-54c9de1a416f</entry> <entry name='family'>Virtual Machine</entry> </system> </sysinfo> 
<os> <type arch='x86_64' machine='pc-i440fx-rhel7.6.0'>hvm</type> <boot dev='hd'/> <smbios mode='sysinfo'/> </os> <features> <acpi/> <apic/> </features> <cpu mode='custom' match='exact' check='full'> <model fallback='forbid'>EPYC-IBPB</model> <vendor>AMD</vendor> <topology sockets='1' cores='4' threads='2'/> <feature policy='require' name='x2apic'/> <feature policy='require' name='tsc-deadline'/> <feature policy='require' name='hypervisor'/> <feature policy='require' name='tsc_adjust'/> <feature policy='require' name='clwb'/> <feature policy='require' name='umip'/> <feature policy='require' name='arch-capabilities'/> <feature policy='require' name='cmp_legacy'/> <feature policy='require' name='perfctr_core'/> <feature policy='require' name='wbnoinvd'/> <feature policy='require' name='amd-ssbd'/> <feature policy='require' name='skip-l1dfl-vmentry'/> <feature policy='disable' name='monitor'/> <feature policy='disable' name='svm'/> <feature policy='require' name='topoext'/> </cpu> <clock offset='utc'> <timer name='pit' tickpolicy='delay'/> <timer name='rtc' tickpolicy='catchup'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' cache='none'/> <source file='/var/lib/nova/instances/2bef57ad-f9bb-4f89-a644-54c9de1a416f/disk'/> <backingStore type='file' index='1'> <format type='raw'/> <source file='/var/lib/nova/instances/_base/d054a106e1ec4afcdbba977f2ddc41035e866b16'/> <backingStore/> </backingStore> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <controller type='usb' index='0' model='piix3-uhci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias 
name='pci.0'/> </controller> <interface type='vhostuser'> <mac address='fa:16:3e:a9:3b:47'/> <source type='unix' path='/var/lib/vhost_sockets/vhufec3adce-e8' mode='server'/> <target dev='vhufec3adce-e8'/> <model type='virtio'/> <driver rx_queue_size='1024' tx_queue_size='1024'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/3'/> <log file='/var/lib/nova/instances/2bef57ad-f9bb-4f89-a644-54c9de1a416f/console.log' append='off'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/3'> <source path='/dev/pts/3'/> <log file='/var/lib/nova/instances/2bef57ad-f9bb-4f89-a644-54c9de1a416f/console.log' append='off'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> <address type='usb' bus='0' port='1'/> </input> <input type='mouse' bus='ps2'> <alias name='input1'/> </input> <input type='keyboard' bus='ps2'> <alias name='input2'/> </input> <graphics type='vnc' port='5901' autoport='yes' listen='192.168.20.21'> <listen type='address' address='192.168.20.21'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <stats period='10'/> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </memballoon> </devices> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+42477</label> <imagelabel>+107:+42477</imagelabel> </seclabel> </domain> ############################################################################################### (magenta) [stack@director16 envfiles]$ openstack flavor show dpdk-basic +----------------------------+--------------------------------------+ | Field | Value | 
+----------------------------+--------------------------------------+ | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 0 | | access_project_ids | None | | description | None | | disk | 16 | | extra_specs | {'hw:mem_page_size': 'large'} | | id | e01b7da7-ff70-4544-8af4-c050e29783c4 | | name | dpdk-basic | | os-flavor-access:is_public | True | | properties | hw:mem_page_size='large' | | ram | 16384 | | rxtx_factor | 1.0 | | swap | 0 | | vcpus | 8 | +----------------------------+--------------------------------------+ +-------------------------------------+------------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +-------------------------------------+------------------------------------------------------------------------------------------------------------------------------+ | OS-DCF:diskConfig | MANUAL | | OS-EXT-AZ:availability_zone | dpdk | | OS-EXT-SRV-ATTR:host | osp16v00-dpdk.t-mobile.lab | | OS-EXT-SRV-ATTR:hostname | dpdk-01 | | OS-EXT-SRV-ATTR:hypervisor_hostname | osp16v00-dpdk.t-mobile.lab | | OS-EXT-SRV-ATTR:instance_name | instance-0000001a | | OS-EXT-SRV-ATTR:kernel_id | | | OS-EXT-SRV-ATTR:launch_index | 0 | | OS-EXT-SRV-ATTR:ramdisk_id | | | OS-EXT-SRV-ATTR:reservation_id | r-vv2jo36n | | OS-EXT-SRV-ATTR:root_device_name | /dev/vda | | OS-EXT-SRV-ATTR:user_data | None | | OS-EXT-STS:power_state | Running | | OS-EXT-STS:task_state | None | | OS-EXT-STS:vm_state | active | | OS-SRV-USG:launched_at | 2020-08-04T18:09:13.000000 | | OS-SRV-USG:terminated_at | None | | accessIPv4 | | | accessIPv6 | | | addresses | nmnet-1107=10.145.57.237 | | adminPass | 9zDj94PoN5yM | | config_drive | | | created | 2020-08-04T18:09:03Z | | description | None | | flavor | disk='16', ephemeral='0', extra_specs.hw:mem_page_size='large', original_name='dpdk-basic', ram='16384', swap='0', vcpus='8' | | hostId | 140c0eebe41ebe8cd7e73c23d6b356588be051b09d5db31be9b82de9 | | 
host_status | UP | | id | c38a21bc-0d44-4731-afae-724358afe7c3 | | image | tmo2.rhel8 (46795702-4672-4b6f-b437-df4616b2334d) | | key_name | None | | locked | False | | locked_reason | None | | name | dpdk-01 | | progress | 0 | | project_id | 8f8ab640878949f282ef5442d123ca8a | | properties | | | security_groups | name='default' | | server_groups | [] | | status | ACTIVE | | tags | [] | | trusted_image_certificates | None | | updated | 2020-08-04T18:09:12Z | | user_id | 0234a294d866447f969cef32d209beec | | volumes_attached | | +-------------------------------------+------------------------------------------------------------------------------------------------------------------------------+ <domain type='kvm' id='4'> <name>instance-0000001a</name> <uuid>c38a21bc-0d44-4731-afae-724358afe7c3</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="20.1.2-0.20200401205214.28324e6.el8ost"/> <nova:name>dpdk-01</nova:name> <nova:creationTime>2020-08-04 18:09:10</nova:creationTime> <nova:flavor name="dpdk-basic"> <nova:memory>16384</nova:memory> <nova:disk>16</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>8</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="0234a294d866447f969cef32d209beec">admin</nova:user> <nova:project uuid="8f8ab640878949f282ef5442d123ca8a">admin</nova:project> </nova:owner> <nova:root type="image" uuid="46795702-4672-4b6f-b437-df4616b2334d"/> </nova:instance> </metadata> <memory unit='KiB'>16777216</memory> <currentMemory unit='KiB'>16777216</currentMemory> <vcpu placement='static' cpuset='0-3,64-67'>8</vcpu> <cputune> <shares>8192</shares> </cputune> <resource> <partition>/machine</partition> </resource> <sysinfo type='smbios'> <system> <entry name='manufacturer'>Red Hat</entry> <entry name='product'>OpenStack Compute</entry> <entry name='version'>20.1.2-0.20200401205214.28324e6.el8ost</entry> <entry 
name='serial'>c38a21bc-0d44-4731-afae-724358afe7c3</entry> <entry name='uuid'>c38a21bc-0d44-4731-afae-724358afe7c3</entry> <entry name='family'>Virtual Machine</entry> </system> </sysinfo> <os> <type arch='x86_64' machine='pc-i440fx-rhel7.6.0'>hvm</type> <boot dev='hd'/> <smbios mode='sysinfo'/> </os> <features> <acpi/> <apic/> </features> <cpu mode='custom' match='exact' check='full'> <model fallback='forbid'>EPYC-IBPB</model> <vendor>AMD</vendor> <topology sockets='8' cores='1' threads='1'/> <feature policy='require' name='x2apic'/> <feature policy='require' name='tsc-deadline'/> <feature policy='require' name='hypervisor'/> <feature policy='require' name='tsc_adjust'/> <feature policy='require' name='clwb'/> <feature policy='require' name='umip'/> <feature policy='require' name='arch-capabilities'/> <feature policy='require' name='cmp_legacy'/> <feature policy='require' name='perfctr_core'/> <feature policy='require' name='wbnoinvd'/> <feature policy='require' name='amd-ssbd'/> <feature policy='require' name='skip-l1dfl-vmentry'/> <feature policy='disable' name='monitor'/> <feature policy='disable' name='svm'/> <feature policy='require' name='topoext'/> </cpu> <clock offset='utc'> <timer name='pit' tickpolicy='delay'/> <timer name='rtc' tickpolicy='catchup'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' cache='none'/> <source file='/var/lib/nova/instances/c38a21bc-0d44-4731-afae-724358afe7c3/disk'/> <backingStore type='file' index='1'> <format type='raw'/> <source file='/var/lib/nova/instances/_base/d054a106e1ec4afcdbba977f2ddc41035e866b16'/> <backingStore/> </backingStore> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <controller type='usb' index='0' 
model='piix3-uhci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <interface type='vhostuser'> <mac address='fa:16:3e:d0:d9:88'/> <source type='unix' path='/var/lib/vhost_sockets/vhu1ad80d72-54' mode='server'/> <target dev='vhu1ad80d72-54'/> <model type='virtio'/> <driver rx_queue_size='1024' tx_queue_size='1024'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/1'/> <log file='/var/lib/nova/instances/c38a21bc-0d44-4731-afae-724358afe7c3/console.log' append='off'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/1'> <source path='/dev/pts/1'/> <log file='/var/lib/nova/instances/c38a21bc-0d44-4731-afae-724358afe7c3/console.log' append='off'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> <address type='usb' bus='0' port='1'/> </input> <input type='mouse' bus='ps2'> <alias name='input1'/> </input> <input type='keyboard' bus='ps2'> <alias name='input2'/> </input> <graphics type='vnc' port='5900' autoport='yes' listen='192.168.20.21'> <listen type='address' address='192.168.20.21'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <stats period='10'/> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </memballoon> </devices> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+42477</label> <imagelabel>+107:+42477</imagelabel> </seclabel> </domain>
I tried with these libvirt & qemu-kvm versions:
libvirt-6.7.0-1.el8.x86_64
qemu-kvm-5.0.0-2.module+el8.3.0+7379+0505d6ca.x86_64

On a machine with an AMD EPYC CPU:

<capabilities>
  <host>
    <uuid>38334c44-4735-4336-5538-343232523850</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <microcode version='134222416'/>
      <counter name='tsc' frequency='2096060000'/>
      <topology sockets='1' dies='1' cores='2' threads='2'/>
...
    <topology>
      <cells num='8'>
        <cell id='0'>
          <memory unit='KiB'>15986848</memory>
          <pages unit='KiB' size='4'>1911848</pages>
          <pages unit='KiB' size='2048'>1000</pages>
          <pages unit='KiB' size='1048576'>6</pages>
...
        <cell id='4'>
          <memory unit='KiB'>16417016</memory>
          <pages unit='KiB' size='4'>3577406</pages>
          <pages unit='KiB' size='2048'>5</pages>
          <pages unit='KiB' size='1048576'>2</pages>

1. Assign hugepages of 1 GiB and 2 MiB sizes:
echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

2. Mount the 1 GiB hugetlbfs:
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G

3. Start the domain:

<domain type='kvm'>
  <name>test1</name>
  <uuid>0d7f64bd-5bab-42bc-b1ce-c5f0be315477</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <topology sockets='4' dies='1' cores='1' threads='1'/>
    <feature policy='require' name='x2apic'/>
...
    <numa>
      <cell id='0' cpus='0-3' memory='1048576' unit='KiB' memAccess='shared'/>
    </numa>

# virsh start test1
Domain test1 started

4.
Tried to edit test1 with "virsh edit test1" and change nodeset='0' to nodeset='4'; node 4 is not recognized:

<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB' nodeset='0'/>  ==> nodeset='4'
  </hugepages>
</memoryBacking>

# virsh edit test1
error: hugepages: node 4 not found
Failed. Try again? [y,n,i,f,?]:

BTW, "virsh freepages" cannot display properly when no huge pages are assigned on node 1:

# virsh freepages --all
Node 0:
4KiB: 1461184
2048KiB: 1000
1048576KiB: 6

error: operation failed: page size 2048 is not available on node 1

From my test above, libvirt can report the NUMA info, but there is some issue when trying to touch any node other than node 0.
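As an aside, the per-node hugepage counts can be pulled out of the capabilities XML programmatically. A minimal sketch; the helper name is mine, and the sample is trimmed from the `virsh capabilities` output in comment 33 (sparse node ids 0 and 4, only the fields the sketch reads):

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of `virsh capabilities` output; values are
# taken from the host above.
CAPS_XML = """
<capabilities>
  <host>
    <topology>
      <cells num='8'>
        <cell id='0'>
          <pages unit='KiB' size='4'>1911848</pages>
          <pages unit='KiB' size='2048'>1000</pages>
          <pages unit='KiB' size='1048576'>6</pages>
        </cell>
        <cell id='4'>
          <pages unit='KiB' size='4'>3577406</pages>
          <pages unit='KiB' size='2048'>5</pages>
          <pages unit='KiB' size='1048576'>2</pages>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""

def hugepages_per_node(caps_xml: str) -> dict:
    """Map NUMA node id -> {page size in KiB: page count}, skipping the
    4 KiB base pages. Node ids may be sparse (e.g. 0 and 4 on this host)."""
    nodes = {}
    for cell in ET.fromstring(caps_xml).iter('cell'):
        pages = {
            int(p.get('size')): int(p.text)
            for p in cell.findall('pages')
            if p.get('size') != '4'
        }
        nodes[int(cell.get('id'))] = pages
    return nodes
```

This makes the sparseness explicit: iterating `range(max(nodes))` would be wrong here, which is exactly the kind of assumption the rest of this thread is about.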
There is a mistake in step 4 of comment 33; please ignore that step.
The steps to use memory from node 4 are as below:

<domain type='kvm'>
  <name>test1</name>
  <uuid>0d7f64bd-5bab-42bc-b1ce-c5f0be315477</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memnode cellid='0' mode='strict' nodeset='4'/>
  </numatune>
...
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <topology sockets='4' dies='1' cores='1' threads='1'/>
...
    <numa>
      <cell id='0' cpus='0-3' memory='1048576' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
...

Start the domain:

# virsh start test1
Domain test1 started

So, from this test using pure libvirt, libvirt can report the NUMA info and use it.
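A quick sanity check before `virsh start` is that every `<memnode>` cellid under `<numatune>` refers to a guest NUMA cell actually defined under `<cpu>/<numa>` (here, guest cell 0 is pinned to host node 4). A minimal sketch; the helper and trimmed sample XML are mine, not a libvirt API, and libvirt performs the authoritative validation:

```python
import xml.etree.ElementTree as ET

# Trimmed from the domain XML above: one guest NUMA cell, pinned by
# <numatune> to host node 4.
DOMAIN_XML = """
<domain type='kvm'>
  <numatune>
    <memnode cellid='0' mode='strict' nodeset='4'/>
  </numatune>
  <cpu mode='custom'>
    <numa>
      <cell id='0' cpus='0-3' memory='1048576' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
</domain>
"""

def memnode_cells_valid(domain_xml: str) -> bool:
    """Every <memnode cellid=...> must reference a guest cell defined
    under <cpu>/<numa>; libvirt rejects the definition otherwise."""
    root = ET.fromstring(domain_xml)
    guest_cells = {c.get('id') for c in root.findall('./cpu/numa/cell')}
    return all(m.get('cellid') in guest_cells
               for m in root.findall('./numatune/memnode'))
```

Note the check is about guest cell ids, not host nodes: whether host node 4 exists is exactly what the broken capabilities reporting in this bug gets wrong.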
Can you try disabling multiple NUMA nodes in the BIOS? AMD EPYC CPUs support reporting a configurable number of NUMA nodes. I can see from your example that you have 8 NUMA nodes (<cells num='8'>). On the host where this behavior was broken, we could see from /sys and /proc that the platform was configured to report all 64 cores as one NUMA node. This is significantly worse from a performance perspective, but sometimes people disable NUMA because they don't understand how to configure applications properly in a NUMA environment.
(In reply to smooney from comment #36)
> Can you try disabling multiple NUMA nodes in the BIOS?
> AMD EPYC CPUs support reporting a configurable number of NUMA nodes.
>
> I can see from your example that you have 8 NUMA nodes (<cells num='8'>).
> On the host where this behavior was broken, we could see from /sys and
> /proc that the platform was configured to report all 64 cores as one
> NUMA node.
>
> This is significantly worse from a performance perspective, but sometimes
> people disable NUMA because they don't understand how to configure
> applications properly in a NUMA environment.

I am using an HP DL385 Gen10 machine and I can't find where to disable multiple NUMA nodes in the BIOS. I only see a numa_group_size_optimize field, which is grayed out (can't be modified). Could you tell me where to set it with more detailed steps, or do you want to log in to the machine to set it? Thanks.
1. I tried to disable NUMA with the "numa=off" kernel parameter, and the host then reports a single cell:

<topology>
  <cells num='1'>
    <cell id='0'>
      <memory unit='KiB'>29785064</memory>
      <pages unit='KiB' size='4'>6397690</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>4</pages>
      <distances>
        <sibling id='0' value='10'/>
      </distances>
      <cpus num='16'>
        <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0'/>
        <cpu id='1' socket_id='0' die_id='0' core_id='8' siblings='1'/>
        <cpu id='2' socket_id='0' die_id='0' core_id='16' siblings='2'/>
        <cpu id='3' socket_id='0' die_id='0' core_id='24' siblings='3'/>
        <cpu id='4' socket_id='0' die_id='0' core_id='32' siblings='4'/>
        <cpu id='5' socket_id='0' die_id='0' core_id='40' siblings='5'/>
        <cpu id='6' socket_id='0' die_id='0' core_id='48' siblings='6'/>
        <cpu id='7' socket_id='0' die_id='0' core_id='56' siblings='7'/>
        <cpu id='8' socket_id='1' die_id='0' core_id='0' siblings='8'/>
        <cpu id='9' socket_id='1' die_id='0' core_id='8' siblings='9'/>
        <cpu id='10' socket_id='1' die_id='0' core_id='16' siblings='10'/>
        <cpu id='11' socket_id='1' die_id='0' core_id='24' siblings='11'/>
        <cpu id='12' socket_id='1' die_id='0' core_id='32' siblings='12'/>
        <cpu id='13' socket_id='1' die_id='0' core_id='40' siblings='13'/>
        <cpu id='14' socket_id='1' die_id='0' core_id='48' siblings='14'/>
        <cpu id='15' socket_id='1' die_id='0' core_id='56' siblings='15'/>
      </cpus>
    </cell>
  </cells>
</topology>

2. Assign hugepages of 1 GiB size:
echo 4 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

3. Mount the hugetlbfs:
mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages1G

4. Tried to start the domain, and it succeeded.
<domain type='kvm'> <name>test1</name> <uuid>0d7f64bd-5bab-42bc-b1ce-c5f0be315477</uuid> <memory unit='KiB'>1048576</memory> <currentMemory unit='KiB'>1048576</currentMemory> <memoryBacking> <hugepages> <page size='1048576' unit='KiB' nodeset='0'/> </hugepages> </memoryBacking> <vcpu placement='static'>4</vcpu> <numatune> <memnode cellid='0' mode='strict' nodeset='0'/> </numatune> <os> <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type> <boot dev='hd'/> </os> <features> <acpi/> <apic/> <vmport state='off'/> </features> <cpu mode='custom' match='exact' check='full'> <model fallback='forbid'>EPYC-IBPB</model> <vendor>AMD</vendor> <topology sockets='4' dies='1' cores='1' threads='1'/> <feature policy='require' name='x2apic'/> <feature policy='require' name='tsc-deadline'/> <feature policy='require' name='hypervisor'/> <feature policy='require' name='tsc_adjust'/> <feature policy='require' name='arch-capabilities'/> <feature policy='require' name='cmp_legacy'/> <feature policy='require' name='perfctr_core'/> <feature policy='require' name='skip-l1dfl-vmentry'/> <feature policy='disable' name='monitor'/> <feature policy='disable' name='svm'/> <feature policy='require' name='topoext'/> <numa> <cell id='0' cpus='0-3' memory='1048576' unit='KiB' memAccess='shared'/> </numa> </cpu> ... </domain> # virsh start test1 Domain test1 started
The BIOS options vary per vendor. I do not have access to an AMD system to say precisely what you need, but AMD and Dell both describe the ability to expose multiple NUMA nodes or a single one in terms of different memory interleaving options:

https://developer.amd.com/wp-content/resources/56308-NUMA%20Topology%20for%20AMD%20EPYC%E2%84%A2%20Naples%20Family%20Processors.PDF
https://downloads.dell.com/manuals/all-products/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/servers-solution-resources_white-papers12_en-us.pdf#%5B%7B%22num%22%3A10%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C33%2C405%2C0%5D

HP, on the other hand, has a separate NUMA-domains-per-socket option:

https://h20195.www2.hpe.com/V2/GetPDF.aspx/a00038346enw.pdf#%5B%7B%22num%22%3A29%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C34%2C259%2C0%5D

Effectively, I believe the customer has configured Channel-Pair Interleaving or Socket Interleaving, or set NUMA domains per socket to 1, which would result in 1 NUMA node per socket, or 1 NUMA node total if they have only one socket. By the way, I have seen libvirt correctly report NUMA topology on AMD EPYC systems in the past, so this does seem to be somewhat hardware-specific.
Smooney, yes, the BIOS options differ per vendor and product series. On my machine, with "Workload Profile" set to "Custom", I can set the memory interleaving mode to "Socket Interleaving" in the BIOS, but there is no "RBSU -> Memory Options -> NUMA memory domains per socket (1, 2, 4)" option. Although the interleaving mode is set, and some CPU options (such as "Enabled Cores per Processor" set to "0") are also set, the OS still sees 8 cells.
OK, so did you try setting "RBSU -> Memory Options -> NUMA memory domains per socket" to 1? That should have reduced your NUMA nodes from 8 to 2, or from 8 to 1 if you have one socket.

There was also a question to the customer regarding their BIOS options. In my first comment (https://bugzilla.redhat.com/show_bug.cgi?id=1860231#c27) I asked that they try to ensure that they have multiple NUMA nodes enabled in their BIOS. Ideally whoever is managing the case should have asked them to provide that info a week ago when it was requested, but it looks like that still has not been done.

I suspect that if the virt team is not able to reproduce this behavior by altering the BIOS options, we are rapidly approaching a point where we will be blocked by the lack of a reproducer. From an OpenStack point of view, while this info is missing from the libvirt response there is nothing more we can do, so if the virt team can't reproduce it locally I'm not sure where Red Hat can go from here to push this forward. We will need the customer to provide more info and try changing the BIOS settings to ensure they have multiple NUMA nodes exposed.
Maybe I didn't make it clear in my last comment: there is no option like "NUMA memory domains per socket" in the BIOS/RBSU, so I can't set it. What I set was "Socket Interleaving" and some CPU options, and that does not seem to disable the multiple nodes.
Ah, OK. In that case I have set a dev nack / reproducer flag on the bug, and I think we need to get more info from the customer.
Hello. Please let me know what info you need from the customer. Thanks!
Hello. libvirtd is running in the nova_libvirtd container, so I would have hoped nova_compute.log would have the logs. Right now I don't have the old environment running; I can get you the nova-compute.log from the new env. Thanks
(In reply to Sazzad Masud from comment #51)
> Hello
>
> The libvirtd is running on nova_libvirtd container. I would have hoped
> nova_compute.log should have the logs.

Not really. nova_compute.log is the Nova log. Libvirt debug logs come from libvirt itself and contain debugging info about libvirt's internal state (e.g. they may give me a hint why libvirt hasn't found any NUMA nodes).

> Right now I don't have the old
> environment running. I can get you the nova-compute.log from the new env.
>
> Thanks

BTW: the sosreport link from comment 0 is no longer available. Is there a way to download it, please?
Just a quick note: there's currently a bug in OSP's configuration of 'log_outputs': it prevents capturing the debug log filters, unfortunately. Here's the quick fix:

(1) Open this file on the Compute host:

/var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/libvirtd.conf

And change this line:

log_outputs="3:file:/var/log/libvirt/libvirtd.log"

To (notice the "1" here - the change):

log_outputs="1:file:/var/log/libvirt/libvirtd.log"

(2) Then restart the 'nova_libvirt' container:

$> podman restart nova_libvirt

(Which should also restart the 'libvirtd' service.)

(3) Then attach /var/log/containers/nova/libvirtd.log from the Compute host (it is the same file as /var/log/libvirt/libvirtd.log inside the 'nova_libvirt' container) as plain text to this bug.
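The edit in step (1) is mechanical, so it can be scripted across hosts. A hedged sketch; the function name is mine, not an OSP tool, and it only handles the exact line format shown above:

```python
import re

def enable_libvirtd_debug(conf_text: str) -> str:
    """Rewrite the log_outputs line from level 3 (warnings) to level 1
    (debug), leaving the log file path and all other settings untouched."""
    return re.sub(
        r'^log_outputs="3:file:([^"]+)"$',
        r'log_outputs="1:file:\1"',
        conf_text,
        flags=re.MULTILINE,
    )

before = 'log_outputs="3:file:/var/log/libvirt/libvirtd.log"'
print(enable_libvirtd_debug(before))
# prints: log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

The anchored, quoted pattern means an already-fixed file passes through unchanged, so the script is safe to re-run; the container restart in step (2) is still needed for the setting to take effect.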
I've found hardware internally that is identical to the customer's hardware and have managed to reproduce the problem and confirm this is a libvirt bug. The situation is as follows...

Dell PowerEdge R6515, with an AMD EPYC 7702P 64-Core Processor with HT enabled. This gives 128 logical CPUs. The firmware allows a choice of exposing 1, 2 or 4 NUMA nodes. The customer has it on the default settings, exposing a single NUMA node.

The bug does NOT occur if setting the firmware to expose 2 or 4 NUMA nodes instead, so that is a potential short-term workaround for the libvirt bug. Exposing multiple NUMA nodes may even have performance benefits for some workloads, but may also have negative impacts on Nova's ability to spawn guests with PCI devices, as the PCI devices will be associated with just one of the NUMA nodes.

In terms of the libvirt problem...

In libvirt 5.6.0, the code is a little different from current libvirt releases, which invalidates what I said in comment #29. virCapabilitiesInitNUMA() is called to populate NUMA topology. If libnuma reports NUMA is not available, libvirt populates fake NUMA; otherwise it populates real NUMA. There is no fallback if real NUMA population fails, which is what's happening in this case. Newer libvirt will always fall back to fake NUMA if anything fails.

The reason populating real NUMA info fails is a bug in virNumaGetNodeCPUs() dating from 2010, a hack introduced in:

commit 628c93574758abb59e71160042524d321a33543f
Author: Daniel P. Berrangé <berrange>
Date: Tue Aug 17 11:09:28 2010 -0400

    Fix handling of sparse NUMA topologies

    When finding a sparse NUMA topology, libnuma will return ENOENT the
    first time it is invoked. On subsequent invocations it will return
    success, but with an all-1's CPU mask. Check for this, to avoid
    polluting the capabilities XML with 4096 bogus CPUs

It compared the reported CPU mask to an all-ones mask to detect whether a NUMA node was not present.
This implicitly assumes that an all-ones mask is always an error condition and never valid data. This assumption is broken when the number of CPUs in a host is a power of 2 that is 64 or greater. What libvirt should have been doing is looking at the global "numa_all_nodes" mask exported by libnuma to determine what NUMA nodes are valid.
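The broken assumption is easy to demonstrate with a small self-contained sketch (plain C with hand-rolled 64-bit-word masks standing in for libnuma's bitmasks; the helper names are mine, not libvirt's). On a 128-CPU host with a single NUMA node, node 0's CPU mask is legitimately all ones, so the 2010 heuristic misclassifies a real node as absent, whereas checking the node against a valid-nodes mask (the information numa_all_nodes provides) does not:

```c
#include <stdint.h>
#include <stdbool.h>

#define MASK_WORDS 2   /* enough bits for a 128-CPU host */

bool mask_is_all_ones(const uint64_t mask[MASK_WORDS])
{
    for (int i = 0; i < MASK_WORDS; i++)
        if (mask[i] != UINT64_MAX)
            return false;
    return true;
}

/* Old heuristic (commit 628c935745): treat an all-ones CPU mask as
 * "this NUMA node does not exist". */
bool node_exists_old(const uint64_t cpumask[MASK_WORDS])
{
    return !mask_is_all_ones(cpumask);
}

/* What the comment above says libvirt should do instead: consult the
 * mask of valid nodes (as libnuma's numa_all_nodes exposes) rather
 * than guessing from the value of the CPU mask. */
bool node_exists_new(uint64_t valid_nodes_mask, int node)
{
    return (valid_nodes_mask >> node) & 1;
}
```

With one node holding all 128 CPUs, `node_exists_old()` wrongly reports the node as missing while `node_exists_new()` gets it right.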
I would recommend enabling 4 NUMA nodes, as that should provide the best performance. From an SR-IOV perspective, the flavors or image should be updated to use the 'preferred' policy to work around the fact that all PCI devices will be reported as associated with NUMA node 0. This can be done by adding "hw:pci_numa_affinity_policy=preferred" to the flavor extra specs or "hw_pci_numa_affinity_policy=preferred" to the image metadata; the flavor is probably more correct in this case.

Dan, while I think of it: libvirt needs to be able to handle the case where the NUMA nodes are non-linear too, e.g. 0,2,4,6. This can happen if you don't populate all the DIMM slots. Nova actually does not break in this configuration, and it's also arguably a system installer error to not populate the DIMMs following the manufacturer's guidelines, but if you are hardening the code anyway that might be another edge case to harden. I think libvirt actually reports the right thing in this case, but I said I would mention it.
Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2020-August/msg00726.html
Merged upstream as commit 24d7d85208 ("virnuma: Don't work around numa_node_to_cpus() for non-existent nodes"), i.e. v6.6.0-552-g24d7d85208.
Verified the patch with upstream libvirt on a dell-per6515 machine -

# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 0 size: 31413 MB
node 0 free: 16381 MB
node distances:
node   0
  0:  10

Part of output from "virsh capabilities" -

<topology>
  <cells num='1'>
    <cell id='0'>
      <memory unit='KiB'>32167400</memory>
      <pages unit='KiB' size='4'>5420410</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>10</pages>
      <distances>
        <sibling id='0' value='10'/>
      </distances>
      <cpus num='128'>
        <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0,64'/>
        <cpu id='1' socket_id='0' die_id='0' core_id='1' siblings='1,65'/>
        <cpu id='2' socket_id='0' die_id='0' core_id='2' siblings='2,66'/>
        <cpu id='3' socket_id='0' die_id='0' core_id='3' siblings='3,67'/>
        <cpu id='4' socket_id='0' die_id='0' core_id='4' siblings='4,68'/>
        <cpu id='5' socket_id='0' die_id='0' core_id='5' siblings='5,69'/>
        <cpu id='6' socket_id='0' die_id='0' core_id='6' siblings='6,70'/>
        <cpu id='7' socket_id='0' die_id='0' core_id='7' siblings='7,71'/>
        <cpu id='8' socket_id='0' die_id='0' core_id='8' siblings='8,72'/>
        ...
        <cpu id='127' socket_id='0' die_id='0' core_id='63' siblings='63,127'/>
      </cpus>
    </cell>
  </cells>
</topology>
(In reply to Jing Qi from comment #62)
> Verified the patch with upstream libvirt in a dell-per6515 machine-

Thing is, because of the following commit upstream libvirt will work even without the commit mentioned in comment 59. Sort of - for two NUMA nodes it would use the fallback and fill only one fake NUMA node.

commit 6cc992bd1a3d58c7daff8ee487e14076bed58d86
Author:     Daniel P. Berrangé <berrange>
AuthorDate: Fri Nov 29 09:55:59 2019 +0000
Commit:     Daniel P. Berrangé <berrange>
CommitDate: Mon Dec 9 10:17:27 2019 +0000

    conf: move NUMA capabilities into self contained object

    The NUMA cells are stored directly in the virCapsHostPtr struct.
    This moves them into their own struct allowing them to be stored
    independantly of the rest of the host capabilities. The change is
    used as an excuse to switch the representation to use a GPtrArray
    too.

    Reviewed-by: Michal Privoznik <mprivozn>
    Signed-off-by: Daniel P. Berrangé <berrange>

This commit is contained in v6.0.0-rc1. What we should test probably is to define a guest with two NUMA nodes, 128 vCPUs per each node and see what capabilities libvirt comes up with. After my commit (= with current upstream) the capabilities should contain two nodes; before, I think it would contain only one (with all vCPUs mapped under it).
(In reply to Michal Privoznik from comment #63)
> (In reply to Jing Qi from comment #62)
> > Verified the patch with upstream libvirt in a dell-per6515 machine-
>
> Thing is, because of the following commit upstream libvirt will work even
> without commit mentioned in comment 59. Sort of - for two NUMA nodes it
> would use the fallback and fill only one fake NUMA node.

Michal, I tried with the latest downstream libvirt version before I used upstream libvirt - libvirt-daemon-6.6.0-2.module+el8.3.0+7567+dc41c0a9.x86_64. The output of capabilities for the machine in comment 62 is as below (similar to comment 27).

<topology>
  <cells num='0'>
  </cells>
</topology>
> This commit is contained in v6.0.0-rc1. What we should test probably is to
> define a guest with two NUMA nodes, 128 vCPUs per each node and see what
> capabilities libvirt comes up with. After my commit (= with current
> upstream) the capabilities should contain two nodes, before I think it would
> contain only one (with all vCPUs mapped under it).

I used 254 vCPUs since more than 255 vCPUs require extended interrupt mode enabled on the iommu device; can you please help confirm if that's ok? I defined a guest with two NUMA nodes, 128 vCPUs for the first node and 127 vCPUs for the second one -

<cpu mode='host-passthrough' check='none' migratable='on'>
  <feature policy='require' name='svm'/>
  <numa>
    <cell id='0' cpus='0-127' memory='1048576' unit='KiB' memAccess='shared'/>
    <cell id='1' cpus='128-254' memory='1048576' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

With version libvirt-daemon-6.6.0-2.module+el8.3.0+7567+dc41c0a9.x86_64:

<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>1001112</memory>
      <pages unit='KiB' size='4'>250278</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='20'/>
      </distances>
      <cpus num='128'>
        <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0'/>
        <cpu id='1' socket_id='1' die_id='0' core_id='0' siblings='1'/>
        <cpu id='2' socket_id='2' die_id='0' core_id='0' siblings='2'/>
        <cpu id='3' socket_id='3' die_id='0' core_id='0' siblings='3'/>
        <cpu id='4' socket_id='4' die_id='0' core_id='0' siblings='4'/>
        ....
        <cpu id='127' socket_id='127' die_id='0' core_id='0' siblings='127'/>
      </cpus>
    </cell>
    <cell id='1'>
      <memory unit='KiB'>799924</memory>
      <pages unit='KiB' size='4'>199981</pages>
      <pages unit='KiB' size='2048'>0</pages>
      <pages unit='KiB' size='1048576'>0</pages>
      <distances>
        <sibling id='0' value='20'/>
        <sibling id='1' value='10'/>
      </distances>
      <cpus num='127'>
        <cpu id='128' socket_id='128' die_id='0' core_id='0' siblings='128'/>
        ....
        <cpu id='254' socket_id='254' die_id='0' core_id='0' siblings='254'/>
      </cpus>
    </cell>
  </cells>
</topology>
(In reply to Jing Qi from comment #65)
> > This commit is contained in v6.0.0-rc1. What we should test probably is to
> > define a guest with two NUMA nodes, 128 vCPUs per each node and see what
> > capabilities libvirt comes up with. After my commit (= with current
> > upstream) the capabilities should contain two nodes, before I think it would
> > contain only one (with all vCPUs mapped under it).
>
> I use 254 vcpus since more than 255 vCPUs require extended interrupt mode
> enabled on the iommu device, can you please help to confirm if it's ok?

Ah, sorry. Morning coffee hadn't kicked in when I was writing that. Your original test is okay. Please disregard comment 63.
Verified with libvirt-6.6.0-3.el8_rc.9b168aa093.x86_64. Steps are the same as in comment 62.
Note that the fix from comment 59 caused a regression on hosts with disjoint NUMA nodes. It is tracked in bug 1876956.
Verified with libvirt-daemon-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64. Steps are the same as in comment 62.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137