Bug 1963893
| Summary: | When there are two numa empty nodes in cmd, the boot fails | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Zhenyu Zhang <zhenyzha> |
| Component: | qemu-kvm | Assignee: | Guowen Shan <gshan> |
| qemu-kvm sub component: | General | QA Contact: | Zhenyu Zhang <zhenyzha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | medium | CC: | bhe, drjones, gshan, imammedo, jinzhao, juzhang, lcapitulino, mrezanin, qzhang, virt-maint, yihyu, yuhuang |
| Version: | 9.0 | Keywords: | Triaged |
| Target Milestone: | beta | | |
| Target Release: | 9.0 | | |
| Hardware: | aarch64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.2.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-17 12:23:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1924294 | | |
Description  Zhenyu Zhang  2021-05-24 10:40:28 UTC
Hi,

When I tested a KVM guest with the TCG accelerator on x86_64, I found that it triggers the 'FDT_ERR_EXISTS' issue whenever the number of NUMA nodes is greater than 2, even though the 3rd node is not empty. So it seems an arm64 guest can't get more than 2 NUMA nodes.

Thanks
Baoquan

(In reply to Baoquan He from comment #1)
> Hi,
>
> When I tested it on kvm guest with tcg accelerator on x86_64, and found it
> will trigger the 'FDT_ERR_EXISTS' issue when the amount of NUMA node is
> bigger than 2, even though the 3rd node is not empty.
>
> So seems arm64 kvm guest can't get more than 2 NUMA nodes.

Hello Baoquan,

When I use 3 or 4 non-empty nodes, the guest boots normally, and when I use 3 nodes with 1 empty NUMA node on the command line, the guest also boots normally. I also confirmed with the x86 feature owner that the case of two empty NUMA nodes boots normally on x86.

/usr/libexec/qemu-kvm \
 -cpu host \
 -m 6144,maxmem=32G,slots=4 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN1 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN2 \
 -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
 -numa node,memdev=mem-memN0,nodeid=0 \
 -numa node,memdev=mem-memN1,nodeid=1 \
 -numa node,memdev=mem-memN2,nodeid=2
VNC server running on ::1:5900

/usr/libexec/qemu-kvm \
 -cpu host \
 -m 8192,maxmem=32G,slots=4 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN1 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN2 \
 -object memory-backend-ram,size=2048M,policy=default,id=mem-memN3 \
 -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
 -numa node,memdev=mem-memN0,nodeid=0 \
 -numa node,memdev=mem-memN1,nodeid=1 \
 -numa node,memdev=mem-memN2,nodeid=2 \
 -numa node,memdev=mem-memN3,nodeid=3
VNC server running on ::1:5900

This is the main part of my qemu command; it runs on my laptop, an x86_64 system:

qemu-system-aarch64 -accel tcg -M virt,gic-version=max -nodefaults -cpu max \
 -m 4G,slots=4,maxmem=32G -smp 8,maxcpus=8,sockets=2,cores=2,threads=2 \
 -object memory-backend-ram,size=4G,policy=default,id=mem-memN0 \
 -numa node,memdev=mem-memN0,nodeid=0,cpus=0-3 \
 -object memory-backend-ram,size=1G,policy=default,id=mem-mem1 \
 -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1 \
 -numa node,nodeid=1,cpus=4-5 \
 -object memory-backend-ram,size=1G,policy=default,id=mem-mem2 \
 -device pc-dimm,node=2,id=dimm-mem2,memdev=mem-mem2 \
 -numa node,nodeid=2,cpus=6-7

Adding the last three lines for node 2 triggers the failure below; commenting them out fixes it.

qemu-system-aarch64: FDT: Failed to create subnode /memory@140000000: FDT_ERR_EXISTS
qemu-system-aarch64: network script /etc/qemu-ifdown failed with status 256

It's not about the x86_64 guest.
Thanks, Baoquan. I think this issue exists on arm64 and it can be reproduced with upstream QEMU using the following command line:

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
 -accel kvm -machine virt,gic-version=host \
 -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
 -object memory-backend-ram,id=mem0,size=512M \
 -object memory-backend-ram,id=mem1,size=512M \
 -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
 -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
 -numa node,nodeid=2 \
 -numa node,nodeid=3 \
 :
 -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes

qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: FDT_ERR_EXISTS

It's caused by duplicated (FDT) memory node names when multiple empty nodes are given; the corresponding FDT node can't be created in that case (a minimal libfdt illustration of this collision appears further below). A patch has been submitted to the community to use the NUMA ID instead of the base address in the node name, to avoid the conflict. Let's see what feedback I receive:
https://patchwork.kernel.org/project/qemu-devel/patch/20210601073004.106490-1-gshan@redhat.com/

Is it a common use-case to have empty nodes? Can libvirt generate this command line?

I'm not sure about libvirt. Zhenyu, could you help check whether an empty NUMA node is allowed by libvirt XML?

For QEMU itself, an empty NUMA node can be created and populated first, and then we can hot-add memory to this NUMA node. Otherwise, the hot-added memory has to be put into other, non-empty NUMA nodes.

(In reply to Guowen Shan from comment #8)
> I'm not sure about libvirt. Zhenyu, could you help to check if empty NUMA node
> is allowed by libvirt xml?
>
> For QEMU itself, empty NUMA node can be created and populated first and then
> we can hot-add memory to this NUMA node. Otherwise, the hot-added memory has
> to be put into other non-empty NUMA nodes.

Hello Gavin,

After testing, libvirt XML currently doesn't support an empty NUMA node:

<cpu mode='host-passthrough' check='none'>
  <model fallback='forbid'/>
  <numa>
    <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
    <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
    <cell id='2'/>
    <cell id='3'/>
  </numa>
</cpu>

# virsh define avocado-vt-vm1.xml
error: Failed to define domain from avocado-vt-vm1.xml
error: XML error: missing element or attribute './@memory'

And I found a related libvirt bug: Bug 1662586 - Can not create a numa cell in the domain xml with zero CPUs or zero memory -- CLOSED DEFERRED

Hi,

It seems libvirt XML supports an empty NUMA node by specifying memory='0'. E.g.

<numa>
  <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
  <cell id='1' cpus='2' memory='0' unit='KiB'/>
  <cell id='2' cpus='3' memory='0' unit='KiB'/>
</numa>

The generated QEMU CLI is '-numa node,nodeid=0,cpus=0-1,mem=1024 -numa node,nodeid=1,cpus=2,mem=0 -numa node,nodeid=2,cpus=3,mem=0'. It works well on an x86 host; the guest NUMA topology is as below.

# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 743 MB
node 0 free: 59 MB
node 1 cpus: 2
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 3
node 2 size: 0 MB
node 2 free: 0 MB

Thanks,
Yumei.

This issue is actually ARM64-specific, because I don't think the device tree is used by x86. There are two parts of work that need to be finished in order to fix the issue. First of all, there is no specification of how the memory node should be populated for an empty NUMA node; for this, I've posted one kernel patch to address it. Once the kernel patch is accepted, I can post a follow-up QEMU patch to implement and fix the issue:
https://lkml.org/lkml/2021/6/28/216
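For context on the failure mode, here is a minimal, self-contained libfdt sketch (not QEMU's actual hw/arm/virt.c code; the buffer size and node name are purely illustrative) showing why two nodes that end up with the same "memory@<base>" name cannot both be created: the second fdt_add_subnode() call fails with FDT_ERR_EXISTS, which is the error surfaced in the QEMU output above.

```c
/* Minimal libfdt illustration (not QEMU code): adding two subnodes with the
 * same "memory@<base>" name fails with FDT_ERR_EXISTS, which is what happens
 * when several empty NUMA nodes all derive their node name from the same
 * base address.  Build with e.g.: gcc fdt_dup.c -lfdt */
#include <stdio.h>
#include <libfdt.h>

int main(void)
{
    static char fdt[4096];

    int err = fdt_create_empty_tree(fdt, sizeof(fdt));
    if (err < 0) {
        fprintf(stderr, "fdt_create_empty_tree: %s\n", fdt_strerror(err));
        return 1;
    }

    /* The first node is created normally; a non-negative node offset is returned. */
    int first = fdt_add_subnode(fdt, 0, "memory@80000000");

    /* A second subnode with the identical name is rejected. */
    int second = fdt_add_subnode(fdt, 0, "memory@80000000");

    printf("first  = %d\n", first);
    printf("second = %d (%s)\n", second, fdt_strerror(second)); /* -FDT_ERR_EXISTS */
    return 0;
}
```

Both the NUMA-ID-based naming proposed in the patch above and the approach that was eventually merged (skipping the device-tree node entirely for empty NUMA nodes, see below) avoid this collision.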
(In reply to Yumei Huang from comment #10)
> Hi,
>
> Seems libvirt xml support empty numa node by specify memory='0'.
>
> E.g.
> <numa>
>   <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
>   <cell id='1' cpus='2' memory='0' unit='KiB'/>
>   <cell id='2' cpus='3' memory='0' unit='KiB'/>
> </numa>
>
> The QEMU cli generated is '-numa node,nodeid=0,cpus=0-1,mem=1024 -numa
> node,nodeid=1,cpus=2,mem=0 -numa node,nodeid=2,cpus=3,mem=0'.

I just noticed the behavior differs between the i440fx and q35 machine types. The test above was done with the pc-i440fx machine type. For the q35 machine type, '-numa node,mem' is no longer used (see bug 1783355); '-numa node,memdev' is used instead. The QEMU CLI is as below.

-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824}' \
-numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
-object '{"qom-type":"memory-backend-ram","id":"ram-node1","size":0}' \
-numa node,nodeid=1,cpus=2,memdev=ram-node1 \
-object '{"qom-type":"memory-backend-ram","id":"ram-node2","size":0}' \
-numa node,nodeid=2,cpus=3,memdev=ram-node2 \

The 'memory=xx' attribute in the XML defines the size of the memory backend, which doesn't take the value '0'. The domain can be defined, but can't be started.

# virsh start rhel8
error: Failed to start domain 'rhel8'
error: internal error: process exited while connecting to monitor: 2021-07-01T09:36:19.860281Z qemu-kvm: property 'size' of memory-backend-ram doesn't take value '0'

So the empty NUMA node is not supported under the q35 machine type. Sorry for the inconvenience caused earlier.

The kernel patches have been merged into linux-next. I will post QEMU patches to follow up and fix the reported issue once they reach the upstream kernel. Note that we don't have to backport the kernel patches.

https://lkml.org/lkml/2021/9/27/31
Documentation, dt, numa: Add note to empty NUMA node
of, numa: Fetch empty NUMA node ID from distance map

The QEMU patch (v5) fixing the issue has been merged upstream:

99abb72520 hw/arm/virt: Don't create device-tree node for empty NUMA node
https://mail.gnu.org/archive/html/qemu-arm/2021-10/msg00302.html

With qemu-kvm-6.2.0-1.el9, the operation meets expectations.

/usr/libexec/qemu-kvm \
 -accel kvm -machine virt,gic-version=host \
 -cpu host -smp 4,sockets=2,cores=2,threads=1 \
 -m 1024M,maxmem=64G \
 -object memory-backend-ram,id=mem0,size=512M \
 -object memory-backend-ram,id=mem1,size=512M \
 -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
 -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
 -numa node,nodeid=2 \
 -numa node,nodeid=3 \
 -monitor stdio
QEMU 6.2.0 monitor - type 'help' for more information
(qemu) VNC server running on ::1:5900
(qemu) info numa
4 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 2 3
node 1 size: 512 MB
node 1 plugged: 0 MB
node 2 cpus:
node 2 size: 0 MB
node 2 plugged: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 plugged: 0 MB
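The merged fix verified above (commit 99abb72520, "hw/arm/virt: Don't create device-tree node for empty NUMA node") boils down to skipping FDT memory-node generation for NUMA nodes that have no RAM, so two empty nodes can no longer produce the same "/memory@<base>" name. The sketch below only illustrates that idea; the struct and function names are invented for the example (they are not QEMU's actual identifiers), and the base addresses merely mirror the layout implied by the error messages in this report.

```c
/* Illustrative sketch of the "skip empty NUMA nodes" approach; the names
 * numa_node_info and build_fdt_memory_nodes are made up for this example. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct numa_node_info {
    uint64_t base;   /* guest-physical base of the node's RAM          */
    uint64_t size;   /* 0 for an empty node (no memdev attached)       */
};

static void build_fdt_memory_nodes(const struct numa_node_info *nodes, int count)
{
    for (int i = 0; i < count; i++) {
        if (nodes[i].size == 0) {
            /* Empty NUMA node: there is no RAM range to describe, and a
             * zero-sized node would reuse the same "/memory@<base>" name as
             * its neighbour, which is the FDT_ERR_EXISTS failure reported
             * in this bug.  Simply skip the device-tree node. */
            continue;
        }
        printf("/memory@%" PRIx64 ": reg = <0x%" PRIx64 " 0x%" PRIx64 ">, numa-node-id = %d\n",
               nodes[i].base, nodes[i].base, nodes[i].size, i);
    }
}

int main(void)
{
    /* Mirrors the verified command line: two 512M nodes plus two empty ones. */
    const struct numa_node_info nodes[] = {
        { 0x40000000, 512u << 20 },  /* node 0 */
        { 0x60000000, 512u << 20 },  /* node 1 */
        { 0x80000000, 0 },           /* node 2, empty */
        { 0x80000000, 0 },           /* node 3, empty */
    };
    build_fdt_memory_nodes(nodes, 4);
    return 0;
}
```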
Hello Guowen,

But I encountered an issue after logging in to the guest: the guest's internal NUMA check does not meet expectations. If I need to report a new bug here, please let me know.

1. Boot the guest:

/usr/libexec/qemu-kvm \
 -name 'avocado-vt-vm1' \
 -sandbox on \
 -blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
 -blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
 -blockdev node-name=file_aavmf_vars,driver=file,filename=/home/kvm_autotest_root/images/avocado-vt-vm1_rhel900-aarch64-virtio-scsi.qcow2_VARS.fd,auto-read-only=on,discard=unmap \
 -blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
 -machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
 -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
 -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
 -nodefaults \
 -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
 -device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
 -cpu host \
 -smp 4,sockets=2,cores=2,threads=1 \
 -m 1024M,maxmem=64G \
 -object memory-backend-ram,id=mem0,size=512M \
 -object memory-backend-ram,id=mem1,size=512M \
 -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
 -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
 -numa node,nodeid=2 \
 -numa node,nodeid=3 \
 -chardev socket,server=on,path=/tmp/monitor-qmpmonitor1-20211216-215103-cKvxIdvE,id=qmp_id_qmpmonitor1,wait=off \
 -mon chardev=qmp_id_qmpmonitor1,mode=control \
 -chardev socket,server=on,path=/tmp/monitor-catch_monitor-20211216-215103-cKvxIdvE,id=qmp_id_catch_monitor,wait=off \
 -mon chardev=qmp_id_catch_monitor,mode=control \
 -serial unix:'/tmp/serial-serial0-20211216-215103-cKvxIdvE',server=on,wait=off \
 -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
 -device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
 -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel900-aarch64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
 -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
 -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
 -device virtio-net-pci,mac=9a:e6:48:3b:ac:11,rombar=0,id=idkc23q8,netdev=idfdQQ3e,bus=pcie-root-port-4,addr=0x0 \
 -netdev tap,id=idfdQQ3e,vhost=on \
 -vnc :20 \
 -rtc base=utc,clock=host,driftfix=slew \
 -enable-kvm \
 -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=6 \
 -device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=7 \
 -monitor stdio

2. Check the NUMA topology inside the guest:

[root@dhcp19-243-4 ~]# numactl -H
available: 3 nodes (0-2)      <--------------------------- hit issue: only 3 NUMA nodes
node 0 cpus: 0 1
node 0 size: 462 MB
node 0 free: 25 MB
node 1 cpus: 2 3
node 1 size: 493 MB
node 1 free: 38 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10

Zhenyu, the issue you found in comment#17 isn't the one tracked by this bugzilla. The fix resolves the coredump, caused by the device tree, that occurs when multiple empty NUMA nodes are specified.
However, these empty NUMA nodes won't be exported to the guest, and that is expected behaviour.

Set the bug to verified according to Comment 16.

Discussed with Guowen regarding comment 17: the issue found there is a known upstream issue and we don't need a bug for it now. We will test again after the new virtio-mem features appear.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307