Bug 1963893

Summary: When two empty NUMA nodes are given on the command line, the boot fails
Product: Red Hat Enterprise Linux 9
Reporter: Zhenyu Zhang <zhenyzha>
Component: qemu-kvm
Assignee: Guowen Shan <gshan>
qemu-kvm sub component: General
QA Contact: Zhenyu Zhang <zhenyzha>
Status: CLOSED ERRATA
Severity: low
Priority: medium
CC: bhe, drjones, gshan, imammedo, jinzhao, juzhang, lcapitulino, mrezanin, qzhang, virt-maint, yihyu, yuhuang
Version: 9.0
Keywords: Triaged
Target Milestone: beta
Target Release: 9.0
Hardware: aarch64
OS: Linux
Fixed In Version: qemu-kvm-6.2.0-1.el9
Last Closed: 2022-05-17 12:23:27 UTC
Type: Bug
Bug Blocks: 1924294    

Description Zhenyu Zhang 2021-05-24 10:40:28 UTC
Description of problem:
When two empty NUMA nodes are given on the command line, the boot fails.

Version-Release number of selected component (if applicable):
Test Environment:
Host Distro: RHEL-9.0.0-20210518.3
Host Kernel: kernel-5.12.0-1.el9.aarch64
qemu-kvm: qemu-kvm-6.0.0-2.el9

How reproducible:
5/5

Steps to Reproduce:
1. Boot the guest with two empty NUMA nodes:
/usr/libexec/qemu-kvm \
-cpu host \
-m 4096,maxmem=32G,slots=4 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN1  \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
-numa node,memdev=mem-memN0,nodeid=0  \
-numa node,memdev=mem-memN1,nodeid=1  \
-numa node,nodeid=2  \
-numa node,nodeid=3 
qemu-kvm: FDT: Failed to create subnode /memory@140000000: FDT_ERR_EXISTS  ---------- hit this issue


Actual results:
boot fails

Expected results:
normal boot

Additional info:
1. When only one empty NUMA node is given on the command line, the boot succeeds:
/usr/libexec/qemu-kvm \
-cpu host \
-m 4096,maxmem=32G,slots=4 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN1  \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
-numa node,memdev=mem-memN0,nodeid=0  \
-numa node,memdev=mem-memN1,nodeid=1  \
-numa node,nodeid=2
VNC server running on ::1:5900


2. The same issue is hit on RHEL 8.5.0 too.
Test Environment:
Host Distro: RHEL-8.5.0-20210521.n.1
Host Kernel: kernel-4.18.0-305.8.el8.aarch64
qemu-kvm: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d

Comment 1 Baoquan He 2021-05-26 03:06:20 UTC
Hi,

When I tested this with the TCG accelerator on an x86_64 host, I found it triggers the 'FDT_ERR_EXISTS' issue whenever the number of NUMA nodes is greater than 2, even though the 3rd node is not empty.

So it seems an arm64 guest can't get more than 2 NUMA nodes.

Thanks
Baoquan

Comment 2 Zhenyu Zhang 2021-05-26 03:33:13 UTC
(In reply to Baoquan He from comment #1)
> Hi,
> 
> When I tested this with the TCG accelerator on an x86_64 host, I found it
> triggers the 'FDT_ERR_EXISTS' issue whenever the number of NUMA nodes is
> greater than 2, even though the 3rd node is not empty.
> 
> So it seems an arm64 guest can't get more than 2 NUMA nodes.

Hello Baoquan,

When I use 3 or 4 non-empty nodes, the guest boots normally,
and with 3 nodes of which 1 is an empty NUMA node, the guest also boots normally.

I also confirmed with the x86 feature owner that the two-empty-NUMA-node case boots normally on x86.


/usr/libexec/qemu-kvm \
-cpu host \
-m 6144,maxmem=32G,slots=4 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN1  \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN2  \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
-numa node,memdev=mem-memN0,nodeid=0  \
-numa node,memdev=mem-memN1,nodeid=1  \
-numa node,memdev=mem-memN2,nodeid=2
VNC server running on ::1:5900


/usr/libexec/qemu-kvm \
-cpu host \
-m 8192,maxmem=32G,slots=4 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN0 \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN1  \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN2  \
-object memory-backend-ram,size=2048M,policy=default,id=mem-memN3  \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
-numa node,memdev=mem-memN0,nodeid=0  \
-numa node,memdev=mem-memN1,nodeid=1  \
-numa node,memdev=mem-memN2,nodeid=2  \
-numa node,memdev=mem-memN3,nodeid=3
VNC server running on ::1:5900

Comment 3 Baoquan He 2021-05-26 10:00:04 UTC
This is the main part of my QEMU command; it is run on my laptop, an x86_64 system.

        qemu-system-aarch64
        -accel tcg
        -M virt,gic-version=max
        -nodefaults
        -cpu max
        -m 4G,slots=4,maxmem=32G
        -smp 8,maxcpus=8,sockets=2,cores=2,threads=2

        -object memory-backend-ram,size=4G,policy=default,id=mem-memN0
        -numa node,memdev=mem-memN0,nodeid=0,cpus=0-3

        -object memory-backend-ram,size=1G,policy=default,id=mem-mem1
        -device pc-dimm,node=1,id=dimm-mem1,memdev=mem-mem1
        -numa node,nodeid=1,cpus=4-5
        -object memory-backend-ram,size=1G,policy=default,id=mem-mem2
        -device pc-dimm,node=2,id=dimm-mem2,memdev=mem-mem2
        -numa node,nodeid=2,cpus=6-7

Adding the last three lines for node 2 triggers the failure below; commenting them out avoids it.

qemu-system-aarch64: FDT: Failed to create subnode /memory@140000000: FDT_ERR_EXISTS
qemu-system-aarch64: network script /etc/qemu-ifdown failed with status 256

So it's not about an x86_64 guest; the guest here is aarch64, emulated with TCG.

Thanks
Baoquan

Comment 6 Guowen Shan 2021-06-01 05:32:44 UTC
I think this issue exists on arm64, and it can be reproduced with upstream QEMU
using the following command line:

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
  -accel kvm -machine virt,gic-version=host                        \
  -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=512M                     \
  -object memory-backend-ram,id=mem1,size=512M                     \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
  -numa node,nodeid=2                                              \
  -numa node,nodeid=3                                              \
    :
  -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
    
  qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
                       FDT_ERR_EXISTS

It's caused by duplicated FDT memory node names when multiple empty
NUMA nodes are given; the corresponding FDT node can't be created in
that case. A patch has been submitted to the community to use the NUMA
node ID instead of the base address in the node name, which avoids the
conflict. Let's see what feedback I receive:

https://patchwork.kernel.org/project/qemu-devel/patch/20210601073004.106490-1-gshan@redhat.com/
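
For illustration, here is a minimal standalone libfdt sketch of the naming collision described above. This is not QEMU's code: the base address 0x80000000 is simply taken from the error message, and the program only demonstrates that two subnodes with the same "memory@<base>" name cannot be created.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <libfdt.h>

int main(void)
{
    static char blob[64 * 1024];
    char name[32];
    /* Both empty NUMA nodes end up with the address just past the populated
     * RAM, so they generate the same node name. */
    uint64_t base = 0x80000000ULL;

    fdt_create_empty_tree(blob, sizeof(blob));

    for (int nodeid = 2; nodeid <= 3; nodeid++) {
        snprintf(name, sizeof(name), "memory@%" PRIx64, base);
        int err = fdt_add_subnode(blob, 0, name);   /* 0 = root node offset */
        if (err < 0) {
            /* The second iteration fails, matching the reported error:
             * "Failed to create subnode /memory@80000000: FDT_ERR_EXISTS" */
            printf("node %d: %s: %s\n", nodeid, name, fdt_strerror(err));
        } else {
            printf("node %d: created /%s\n", nodeid, name);
        }
    }
    return 0;
}

Build with something like "gcc fdt-dup.c -lfdt" (the file name is arbitrary; the libfdt development headers are required). The first fdt_add_subnode() call succeeds and the second fails with FDT_ERR_EXISTS.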

Comment 7 Luiz Capitulino 2021-06-07 12:23:14 UTC
Is it a common use-case to have empty nodes? Can libvirt generate this command-line?

Comment 8 Guowen Shan 2021-06-23 06:10:02 UTC
I'm not sure about libvirt. Zhenyu, could you help check whether an empty NUMA
node is allowed by libvirt XML?

For QEMU itself, an empty NUMA node can be created up front, and memory can then
be hot-added to it. Otherwise, the hot-added memory has to be put into other,
non-empty NUMA nodes.
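
As an illustration of that workflow, here is a rough C sketch that hot-plugs a 1 GiB pc-dimm into empty node 2 over QMP. The socket path /tmp/qmp.sock and the IDs hot-mem2/dimm2 are made up for the example; it assumes the guest was started with spare maxmem and slots (e.g. -m 4096,maxmem=32G,slots=4 as in the reproducer) and with -qmp unix:/tmp/qmp.sock,server=on,wait=off. Error handling is mostly omitted, and whether the guest actually onlines the memory in the empty node depends on guest-side support.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Send one QMP command and print whatever the monitor answers. */
static void qmp(int fd, const char *cmd)
{
    char buf[4096];
    ssize_t n;

    write(fd, cmd, strlen(cmd));
    n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s", buf);
    }
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char greeting[4096];
    int fd;

    /* Assumed socket path; must match the -qmp option the guest was started with. */
    strncpy(addr.sun_path, "/tmp/qmp.sock", sizeof(addr.sun_path) - 1);

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("qmp socket");
        return 1;
    }

    read(fd, greeting, sizeof(greeting) - 1);      /* discard the QMP greeting */
    qmp(fd, "{\"execute\":\"qmp_capabilities\"}\n");

    /* Create a 1 GiB backend, then plug it as a pc-dimm into NUMA node 2. */
    qmp(fd, "{\"execute\":\"object-add\",\"arguments\":"
            "{\"qom-type\":\"memory-backend-ram\",\"id\":\"hot-mem2\","
            "\"size\":1073741824}}\n");
    qmp(fd, "{\"execute\":\"device_add\",\"arguments\":"
            "{\"driver\":\"pc-dimm\",\"id\":\"dimm2\",\"memdev\":\"hot-mem2\","
            "\"node\":2}}\n");

    close(fd);
    return 0;
}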

Comment 9 Zhenyu Zhang 2021-06-23 08:12:35 UTC
(In reply to Guowen Shan from comment #8)
> I'm not sure about libvirt. Zhenyu, could you help check whether an empty
> NUMA node is allowed by libvirt XML?
> 
> For QEMU itself, an empty NUMA node can be created up front, and memory can
> then be hot-added to it. Otherwise, the hot-added memory has to be put into
> other, non-empty NUMA nodes.

Hello Gavin,

After testing, libvirt XML currently doesn't support an empty NUMA node.

  <cpu mode='host-passthrough' check='none'>
    <model fallback='forbid'/>
    <numa>
      <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
      <cell id='2'/>
      <cell id='3'/>
    </numa>
  </cpu>

# virsh define avocado-vt-vm1.xml
error: Failed to define domain from avocado-vt-vm1.xml
error: XML error: missing element or attribute './@memory'


I also found a related libvirt bug:
Bug 1662586 - Can not create a numa cell in the domain xml with zero CPUs or zero memory -- CLOSED DEFERRED

Comment 10 Yumei Huang 2021-07-01 02:59:46 UTC
Hi, 

It seems libvirt XML supports an empty NUMA node when memory='0' is specified.

E.g.
    <numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2' memory='0' unit='KiB'/>
      <cell id='2' cpus='3' memory='0' unit='KiB'/>
    </numa>

The generated QEMU command line is '-numa node,nodeid=0,cpus=0-1,mem=1024 -numa node,nodeid=1,cpus=2,mem=0 -numa node,nodeid=2,cpus=3,mem=0'.


It works well on an x86 host; the guest NUMA topology is as below.

# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 743 MB
node 0 free: 59 MB
node 1 cpus: 2
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 3
node 2 size: 0 MB
node 2 free: 0 MB

Comment 11 Guowen Shan 2021-07-01 05:49:28 UTC
Thanks, Yumei. This issue is actually arm64 specific, because I don't think the
device tree is used on x86. Two pieces of work need to be finished in order to
fix the issue. First, there is no specification of how the memory node should
be populated for an empty NUMA node; I've posted a kernel patch to address
that. Once the kernel patch is accepted, I can post a follow-up QEMU patch to
implement the behaviour and fix the issue:

https://lkml.org/lkml/2021/6/28/216

Comment 12 Yumei Huang 2021-07-01 10:09:13 UTC
(In reply to Yumei Huang from comment #10)
> Hi, 
> 
> It seems libvirt XML supports an empty NUMA node when memory='0' is specified.
> 
> E.g.
>     <numa>
>       <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
>       <cell id='1' cpus='2' memory='0' unit='KiB'/>
>       <cell id='2' cpus='3' memory='0' unit='KiB'/>
>     </numa>
> 
> The generated QEMU command line is '-numa node,nodeid=0,cpus=0-1,mem=1024 -numa
> node,nodeid=1,cpus=2,mem=0 -numa node,nodeid=2,cpus=3,mem=0'.
> 

I just noticed that the behavior differs between the i440fx and q35 machine types.

The test above was under the pc-i440fx machine type.

For the q35 machine type, '-numa node,mem' is no longer used (see bug 1783355); instead, '-numa node,memdev' is used. The QEMU command line is as below.

    -object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824}' \
    -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
    -object '{"qom-type":"memory-backend-ram","id":"ram-node1","size":0}' \
    -numa node,nodeid=1,cpus=2,memdev=ram-node1 \
    -object '{"qom-type":"memory-backend-ram","id":"ram-node2","size":0}' \
    -numa node,nodeid=2,cpus=3,memdev=ram-node2 \

The 'memory=xx' attribute in the XML defines the size of the memory backend, which doesn't accept the value '0'. The domain can be defined, but it can't start.

# virsh start rhel8
error: Failed to start domain 'rhel8'
error: internal error: process exited while connecting to monitor: 2021-07-01T09:36:19.860281Z qemu-kvm: property 'size' of memory-backend-ram doesn't take value '0'

So an empty NUMA node is not supported under the q35 machine type.


Sorry for the confusion caused earlier.

Comment 13 Guowen Shan 2021-10-05 11:08:28 UTC
The kernel patches have been merged into linux-next. I will post
QEMU patches to follow up and fix the reported issue once they
reach the upstream kernel. Note that we don't have to backport
the kernel patches.

   https://lkml.org/lkml/2021/9/27/31

   Documentation, dt, numa: Add note to empty NUMA node
   of, numa: Fetch empty NUMA node ID from distance map

Comment 14 Guowen Shan 2021-10-25 03:04:53 UTC
The QEMU patch (v5) fixing the issue has been merged upstream:

99abb72520 hw/arm/virt: Don't create device-tree node for empty NUMA node

https://mail.gnu.org/archive/html/qemu-arm/2021-10/msg00302.html
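
Judging purely from the commit title, the fix skips the device-tree memory node for NUMA nodes that have no RAM rather than renaming it. Below is a hypothetical, self-contained sketch of that idea; the struct and function names are invented, and the addresses mirror the comment 6 reproducer, where RAM starts at 0x40000000 and both empty nodes would otherwise land at 0x80000000.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct numa_mem {
    uint64_t base;
    uint64_t size;
};

/* Print the /memory@<base> nodes that would be added to the FDT, skipping
 * NUMA nodes that have no RAM so no duplicate node name can be generated. */
static void create_fdt_memory_nodes(const struct numa_mem *nodes, int count)
{
    for (int i = 0; i < count; i++) {
        if (nodes[i].size == 0) {
            continue;   /* empty node: no device-tree entry is created */
        }
        printf("memory@%" PRIx64 ": numa-node-id %d, size 0x%" PRIx64 "\n",
               nodes[i].base, i, nodes[i].size);
    }
}

int main(void)
{
    /* Mirrors the comment 6 reproducer: nodes 0/1 have 512M, 2/3 are empty. */
    struct numa_mem nodes[] = {
        { 0x40000000, 512 * 1024 * 1024 },
        { 0x60000000, 512 * 1024 * 1024 },
        { 0x80000000, 0 },
        { 0x80000000, 0 },
    };

    create_fdt_memory_nodes(nodes, 4);
    return 0;
}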

Comment 16 Zhenyu Zhang 2021-12-17 03:08:21 UTC
With qemu-kvm-6.2.0-1.el9, the behaviour meets expectations.

/usr/libexec/qemu-kvm \
-accel kvm -machine virt,gic-version=host                        \
-cpu host -smp 4,sockets=2,cores=2,threads=1  \
-m 1024M,maxmem=64G \
-object memory-backend-ram,id=mem0,size=512M                     \
-object memory-backend-ram,id=mem1,size=512M                     \
-numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
-numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
-numa node,nodeid=2                                              \
-numa node,nodeid=3                                              \
-monitor stdio  
QEMU 6.2.0 monitor - type 'help' for more information
(qemu) VNC server running on ::1:5900

(qemu) info numa 
4 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 2 3
node 1 size: 512 MB
node 1 plugged: 0 MB
node 2 cpus:
node 2 size: 0 MB
node 2 plugged: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 plugged: 0 MB

Comment 17 Zhenyu Zhang 2021-12-17 03:13:24 UTC
Hello Guowen,

However, I encountered an issue after logging in to the guest.
The guest's internal NUMA topology does not meet expectations.
If I need to report a new bug for this, please let me know.

1. Boot the guest:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1'  \
-sandbox on  \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/home/kvm_autotest_root/images/avocado-vt-vm1_rhel900-aarch64-virtio-scsi.qcow2_VARS.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
-nodefaults \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
-cpu host \
-smp 4,sockets=2,cores=2,threads=1  \
-m 1024M,maxmem=64G \
-object memory-backend-ram,id=mem0,size=512M                     \
-object memory-backend-ram,id=mem1,size=512M                     \
-numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
-numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
-numa node,nodeid=2                                              \
-numa node,nodeid=3                                              \
-chardev socket,server=on,path=/tmp/monitor-qmpmonitor1-20211216-215103-cKvxIdvE,id=qmp_id_qmpmonitor1,wait=off  \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,server=on,path=/tmp/monitor-catch_monitor-20211216-215103-cKvxIdvE,id=qmp_id_catch_monitor,wait=off  \
-mon chardev=qmp_id_catch_monitor,mode=control  \
-serial unix:'/tmp/serial-serial0-20211216-215103-cKvxIdvE',server=on,wait=off \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel900-aarch64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device virtio-net-pci,mac=9a:e6:48:3b:ac:11,rombar=0,id=idkc23q8,netdev=idfdQQ3e,bus=pcie-root-port-4,addr=0x0  \
-netdev tap,id=idfdQQ3e,vhost=on  \
-vnc :20  \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=7 \
-monitor stdio 

2. Check the NUMA nodes inside the guest:
[root@dhcp19-243-4 ~]# numactl -H
available: 3 nodes (0-2)   --------------------------------- hit issue: only 3 NUMA nodes
node 0 cpus: 0 1
node 0 size: 462 MB
node 0 free: 25 MB
node 1 cpus: 2 3
node 1 size: 493 MB
node 1 free: 38 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2 
  0:  10  20  20 
  1:  20  10  20 
  2:  20  20  10

Comment 18 Guowen Shan 2021-12-17 04:39:58 UTC
Zhenyu, the issue you found in comment #17 isn't the one tracked by
this bugzilla. The fix resolves the boot failure seen when multiple
empty NUMA nodes are specified, which was caused by the device tree.
However, these empty NUMA nodes aren't exported to the guest, and
that is expected behaviour.

Comment 19 Zhenyu Zhang 2021-12-17 06:06:27 UTC
Setting the bug to VERIFIED according to comment 16.

Discussed comment 17 with Guowen:
the issue found there is a known upstream issue and we don't need a separate bug for it now.
We will test again once the new virtio-mem features arrive.

Comment 22 errata-xmlrpc 2022-05-17 12:23:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307