Description of problem:
When the guest NUMA node count is set to 2 and different numbers of CPUs are allocated to node0 and node1, the "/sys/devices/system/node/node0/cpulist" information in the guest is the opposite of the command-line setting.

Version-Release number of selected component (if applicable):
Test Environment:
Host Distro: RHEL-8.5.0-20210506.n.0
Host Kernel: kernel-4.18.0-305.1.el8.aarch64
Guest Kernel: kernel-4.18.0-305.6.el8.aarch64
qemu-kvm: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/home/kvm_autotest_root/images/avocado-vt-vm1_rhel850-aarch64-virtio-scsi.qcow2_VARS.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
-m 4096 \
-object memory-backend-ram,size=1024M,id=mem-mem0 \
-object memory-backend-ram,size=3072M,id=mem-mem1 \
-smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
-numa node,memdev=mem-mem0,cpus=4,cpus=5 \
-------------------------- Assign cpus=4,cpus=5 to node0
-numa node,memdev=mem-mem1,cpus=0,cpus=1,cpus=2,cpus=3 \
-------------------------- Assign cpus=0,cpus=1,cpus=2,cpus=3 to node1
-cpu 'host' \
-chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/monitor-qmpmonitor1-20210511-113935-rDQ6S2NS,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20210511-113935-rDQ6S2NS,server=on,wait=off \
-mon chardev=qmp_id_catch_monitor,mode=control \
-serial unix:'/tmp/serial-serial0-20210511-113935-rDQ6S2NS',server=on,wait=off \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-aarch64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device virtio-net-pci,mac=9a:f0:bc:07:d3:07,rombar=0,id=idaooiuW,netdev=idgAEVQK,bus=pcie-root-port-4,addr=0x0 \
-netdev tap,id=idgAEVQK,vhost=on \
-vnc :20 \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=7 \
-monitor stdio

2. Check in HMP; the output meets expectations:
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus: 4 5 -------------------------- cpus=4,cpus=5 assigned to node0
node 0 size: 1024 MB
node 0 plugged: 0 MB
node 1 cpus: 0 1 2 3 -------------------------- cpus=0,cpus=1,cpus=2,cpus=3 assigned to node1
node 1 size: 3072 MB
node 1 plugged: 0 MB

3. Check in the guest and hit this issue:
[root@dhcp19-243-41 home]# cat /sys/devices/system/node/node0/cpulist
0-3
[root@dhcp19-243-41 home]# cat /sys/devices/system/node/node1/cpulist
4-5

Actual results:
The guest CPU-to-NUMA-node assignment doesn't match the command line.

Expected results:
Consistent with the command line, as expected.

Additional info:
If cpu0 is assigned to node0, the result is as expected:
cli:
1. Boot a guest:
......
> -object memory-backend-ram,size=1024M,id=mem-mem0 \
> -object memory-backend-ram,size=3072M,id=mem-mem1 \
> -smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
> -numa node,memdev=mem-mem0,cpus=0,cpus=1 \
> -numa node,memdev=mem-mem1,cpus=2,cpus=3,cpus=4,cpus=5 \
......

2. Check in HMP; the output meets expectations:
(qemu) info numa
2 nodes
node 0 cpus: 0 1
node 0 size: 1024 MB
node 0 plugged: 0 MB
node 1 cpus: 2 3 4 5
node 1 size: 3072 MB
node 1 plugged: 0 MB

3. Check in the guest; the result is as expected:
[root@localhost ~]# cat /sys/devices/system/node/node0/cpulist
0-1
[root@localhost ~]# cat /sys/devices/system/node/node1/cpulist
2-5
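To make the comparison between the command-line assignment and the guest sysfs output mechanical, a small helper that expands a sysfs cpulist string (like the `0-3` read above) into an explicit CPU set can be useful. This is a generic sketch, not part of the original report; the names `parse_cpulist`, `guest`, and `wanted` are hypothetical:

```python
def parse_cpulist(s):
    """Expand a sysfs cpulist string like '0-3,5' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# The guest reported node0 -> '0-3' and node1 -> '4-5', while the command
# line asked for node0 -> {4, 5} and node1 -> {0, 1, 2, 3}:
guest = {0: parse_cpulist("0-3"), 1: parse_cpulist("4-5")}
wanted = {0: {4, 5}, 1: {0, 1, 2, 3}}
mismatched = [n for n in wanted if guest[n] != wanted[n]]
print(mismatched)  # node ids whose cpulist differs from the command line
```

In the reported configuration both node ids end up in the mismatch list, which is what prompted this report.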
Tested on x86 and ppc with the same packages as comment 0; this issue is not hit there.
Hi Zhenyu, thanks for filing this BZ. We believe this is RHEL9 material for Virt-ARM since that's where we plan to work on improving guest NUMA support. Also, we believe this work should be done in conjunction with bug 1632238, so I'm setting it as a dependency.
In another test scenario, on RHEL 9.0, the NUMA topology is also inconsistent: when the guest boots with 32 NUMA nodes, the guest sees only node 0, while the expected nodes are 0-31.

Test Environment:
Host Distro: RHEL-9.0.0-20210515.3
Host Kernel: kernel-5.12.0-1.el9.aarch64
Guest Kernel: kernel-5.12.0-1.el9.aarch64
qemu-kvm: qemu-kvm-6.0.0-1.el9

1. Boot guest with 32 NUMA nodes:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/home/kvm_autotest_root/images/avocado-vt-vm1_rhel900-aarch64-virtio-scsi.qcow2_VARS.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
-m 32768 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem0 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem1 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem2 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem3 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem4 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem5 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem6 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem7 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem8 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem9 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem10 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem11 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem12 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem13 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem14 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem15 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem16 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem17 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem18 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem19 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem20 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem21 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem22 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem23 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem24 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem25 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem26 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem27 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem28 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem29 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem30 \
-object memory-backend-ram,size=1024M,prealloc=yes,policy=default,id=mem-mem31 \
-smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
-numa node,memdev=mem-mem0 \
-numa node,memdev=mem-mem1 \
-numa node,memdev=mem-mem2 \
-numa node,memdev=mem-mem3 \
-numa node,memdev=mem-mem4 \
-numa node,memdev=mem-mem5 \
-numa node,memdev=mem-mem6 \
-numa node,memdev=mem-mem7 \
-numa node,memdev=mem-mem8 \
-numa node,memdev=mem-mem9 \
-numa node,memdev=mem-mem10 \
-numa node,memdev=mem-mem11 \
-numa node,memdev=mem-mem12 \
-numa node,memdev=mem-mem13 \
-numa node,memdev=mem-mem14 \
-numa node,memdev=mem-mem15 \
-numa node,memdev=mem-mem16 \
-numa node,memdev=mem-mem17 \
-numa node,memdev=mem-mem18 \
-numa node,memdev=mem-mem19 \
-numa node,memdev=mem-mem20 \
-numa node,memdev=mem-mem21 \
-numa node,memdev=mem-mem22 \
-numa node,memdev=mem-mem23 \
-numa node,memdev=mem-mem24 \
-numa node,memdev=mem-mem25 \
-numa node,memdev=mem-mem26 \
-numa node,memdev=mem-mem27 \
-numa node,memdev=mem-mem28 \
-numa node,memdev=mem-mem29 \
-numa node,memdev=mem-mem30 \
-numa node,memdev=mem-mem31 \
-cpu 'host' \
-chardev socket,server=on,id=qmp_id_qmpmonitor1,wait=off,path=/tmp/monitor-qmpmonitor1-20210511-143057-yyB1p2Ij \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,server=on,id=qmp_id_catch_monitor,wait=off,path=/tmp/monitor-catch_monitor-20210511-143057-yyB1p2Ij \
-mon chardev=qmp_id_catch_monitor,mode=control \
-serial unix:'/tmp/serial-serial0-20210511-143057-yyB1p2Ij',server=on,wait=off \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel900-aarch64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device virtio-net-pci,mac=9a:ec:79:d9:48:82,rombar=0,id=idzatz3H,netdev=idIivRnz,bus=pcie-root-port-4,addr=0x0 \
-netdev tap,id=idIivRnz,vhost=on \
-vnc :20 \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=7 \
-monitor stdio

2. Check guest NUMA nodes:
[root@fedora ~]# cat /sys/devices/system/node/possible
0
----------------------------------------- expected NUMA nodes are 0-31

Additional info:
With 3 NUMA nodes, the guest sees nodes 0-2:

-object memory-backend-ram,size=512M,id=mem-mem0 \
-object memory-backend-ram,size=1024M,id=mem-mem1 \
-object memory-backend-ram,size=2560M,id=mem-mem2 \
-smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
-numa node,memdev=mem-mem0,cpus=0,cpus=1 \
-numa node,memdev=mem-mem1,cpus=2,cpus=3 \
-numa node,memdev=mem-mem2,cpus=4,cpus=5 \

cat /sys/devices/system/node/possible
0-2

If we need to open a new bug here to track it separately, please let me know.
(In reply to Zhenyu Zhang from comment #0)
> 3.Check-in Guest and hit this issue
> [root@dhcp19-243-41 home]# cat /sys/devices/system/node/node0/cpulist
> 0-3
> [root@dhcp19-243-41 home]# cat /sys/devices/system/node/node1/cpulist
> 4-5

What memory is associated with node0 and with node1 per the guest? IOW, is this an issue with the guest kernel enumerating the nodes differently than QEMU, or with the cpulists being associated incorrectly?

(In reply to Zhenyu Zhang from comment #3)
> In another test scenario, the topology of numa is also inconsistent on
> rhel.9.0.
>
> When the guest boot with 32 numa nodes, guest numa node is 0 while expected
> numa node is 0-31.
> ...
> If we need to open a new bug here to track it separately, please let me know.

Yes, as this is a separate issue it should be tracked in a separate bug. However, it's not actually a bug: the RHEL9 kernel only supports up to 8 NUMA nodes. If the test works with 8, then there's no problem here.

Thanks,
drew
(In reply to Zhenyu Zhang from comment #0)
> -numa node,memdev=mem-mem0,cpus=4,cpus=5 \
> -numa node,memdev=mem-mem1,cpus=0,cpus=1,cpus=2,cpus=3 \
> ...
> (qemu) info numa
> 2 nodes
> node 0 cpus: 4 5
> node 1 cpus: 0 1 2 3
> ...
> [root@dhcp19-243-41 home]# cat /sys/devices/system/node/node0/cpulist
> 0-3
> [root@dhcp19-243-41 home]# cat /sys/devices/system/node/node1/cpulist
> 4-5

Hi, I noticed we have exactly the same behavior on x86 for the two cases: when cpu0 is attached to node 1, that node becomes node #0 in the guest; when cpu0 is attached to node 0, that one becomes node #0 in the guest. Adding Igor in CC, as I am surprised nobody has complained about this yet, if this is a bug.
(In reply to Zhenyu Zhang from comment #3)
> In another test scenario, the topology of numa is also inconsistent on
> rhel.9.0.
>
> When the guest boot with 32 numa nodes, guest numa node is 0 while expected
> numa node is 0-31.
> ...
> 2. Check guest numa node
> [root@fedora ~]# cat /sys/devices/system/node/possible
> 0
> ----------------------------------------- expected numa node is 0-31
> ...
> If we need to open a new bug here to track it separately, please let me know.

Hi, for that issue, please open a different BZ. Indeed, on x86 for the same NUMA config I get:
cat /sys/devices/system/node/possible
0-31
(In reply to Eric Auger from comment #6)
> (In reply to Zhenyu Zhang from comment #3)
> > In another test scenario, the topology of numa is also inconsistent on
> > rhel.9.0.
> >
> > When the guest boot with 32 numa nodes, guest numa node is 0 while expected
> > numa node is 0-31.
> ...
> > 2. Check guest numa node
> > [root@fedora ~]# cat /sys/devices/system/node/possible
> > 0
> > ----------------------------------------- expected numa node is 0-31

I suspect there are also errors/warnings in dmesg about this, because the AArch64 RHEL9 kernel doesn't support 32 NUMA nodes, it only supports 8 (CONFIG_NODES_SHIFT is 3).

...
> Hi, for that issue, please open a different BZ. Indeed, on x86 for the same
> NUMA config I get
> cat /sys/devices/system/node/possible
> 0-31

x86 supports 1024 nodes.
(In reply to Andrew Jones from comment #7)
> I suspect there are also errors/warnings in dmesg about this, because the
> AArch64 RHEL9 kernel doesn't support 32 NUMA nodes, it only supports 8
> (CONFIG_NODES_SHIFT is 3).
> ...
> x86 supports 1024 nodes.

I don't see any dmesg warning or error. However, effectively, as soon as you expose more than 8 nodes to the guest, cat /sys/devices/system/node/possible returns 0, whereas with 8 nodes it returns the expected value of 0-7. So this looks like NOTABUG. If we want CONFIG_NODES_SHIFT increased on aarch64, we shall open a separate BZ.
(In reply to Eric Auger from comment #8)
> I don't see any dmesg warning/error. However effectively as soon as you
> expose more than 8 nodes to the guest,
> cat /sys/devices/system/node/possible return 0 whereas with 8 nodes it
> returns the expected value of 0-7.
> So this looks as NOTABUG. If we want to get the CONFIG_NODES_SHIFT being
> increased on aarch64, we shall open a separate BZ.

Actually, I think we did increase the config. I didn't actually check the config when I replied before (I was going from memory), but now I just checked and we have CONFIG_NODES_SHIFT=6. So, assuming the machine you're on also has 6, this is a bug, because we should now support up to 64 nodes.
(In reply to Andrew Jones from comment #9)
> Actually, I think we did increase the config. I didn't actually check the
> config when I replied before (I was going from memory), but now I just
> checked and we have CONFIG_NODES_SHIFT=6. So, assuming the machine you're on
> also has 6, then this is a bug, because we should now support up to 64 nodes.

The config I just checked was config-5.14.0-6.el9.aarch64.
(In reply to Andrew Jones from comment #10)
> The config I just checked was
> config-5.14.0-6.el9.aarch64

And now I see it'll probably get bumped to 9:
https://gitlab.com/cki-project/kernel-ark/-/merge_requests/1333
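For reference, the maximum node count follows directly from the shift: the kernel's MAX_NUMNODES is 1 << CONFIG_NODES_SHIFT. A trivial check of the values discussed in this thread (the helper name `max_numa_nodes` is just for illustration):

```python
def max_numa_nodes(nodes_shift):
    # MAX_NUMNODES in the kernel is 1 << CONFIG_NODES_SHIFT
    return 1 << nodes_shift

print(max_numa_nodes(3))  # 8   (CONFIG_NODES_SHIFT=3, as in comment 7)
print(max_numa_nodes(6))  # 64  (config-5.14.0-6.el9.aarch64)
print(max_numa_nodes(9))  # 512 (the proposed kernel-ark bump)
```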
I tested with a guest where CONFIG_NODES_SHIFT=3.
With respect to the primary issue (reversed node ids), the SRAT GICC affinity structures look correct to me:
In proximity domain 0 we have ACPI processor UIDs 4,5
In proximity domain 1 we have ACPI processor UIDs 0-3

In the guest, the dmesg output shows the same info:
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x0 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x1 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x2 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x3 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x4 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x5 -> Node 1

But proximity domains are not logical node ids. If you look at the kernel's numa/srat.c there are mapping arrays between those (pxm_to_node_map and node_to_pxm_map):

acpi_parse_gicc_affinity /* Callback for Proximity Domain -> ACPI processor UID mapping */
(arch/arm64/kernel/acpi_numa.c)
void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
  acpi_map_pxm_to_node(pxm)
    node = first_unset_node(nodes_found_map);
    __acpi_map_pxm_to_node(pxm, node);

and acpi_table_parse_srat() is called in the order of the GICC Affinity Structure layout, which is built from CPU #0 upwards in QEMU's hw/arm/virt-acpi-build.c (build_srat). So this looks good to me.
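The first-seen mapping described above can be illustrated with a small simulation. This is only a sketch of the logic (the function name `map_pxms_to_nodes` is hypothetical, not kernel code): logical node ids are handed out to proximity domains in the order the SRAT parser first encounters them, and on the arm64 virt machine the GICC entries are laid out from CPU #0 upwards.

```python
def map_pxms_to_nodes(pxm_per_cpu):
    """Assign logical node ids to proximity domains in first-seen order,
    mimicking acpi_map_pxm_to_node()'s use of first_unset_node()."""
    pxm_to_node = {}
    for pxm in pxm_per_cpu:
        if pxm not in pxm_to_node:
            pxm_to_node[pxm] = len(pxm_to_node)  # next unset logical node id
    return pxm_to_node

# Comment 0's configuration: CPUs 0-3 are in PXM 1, CPUs 4-5 in PXM 0.
# SRAT GICC entries are emitted in CPU order, so PXM 1 is seen first and
# becomes logical node 0 -- matching the guest dmesg output above.
mapping = map_pxms_to_nodes([1, 1, 1, 1, 0, 0])
print(mapping)  # {1: 0, 0: 1}
```

This reproduces why QEMU's node 1 appears as the guest's node 0: the numbering is assigned by parse order, not copied from the proximity domain value.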
So to me it looks like the two issues reported here are not real bugs. Do you agree to close the BZ as NOTABUG? Otherwise, please refine your expectations according to the comments provided. Thanks, Eric
(In reply to Eric Auger from comment #14)
> So to me it looks the 2 different issues reported here are not real bugs. Do
> you agree to close the BZ as NOTABUG? Otherwise, please refine your
> expectations according to the provided comments. Thanks. Eric

Hello Eric, thanks for your patience. So we will eventually support at least 64 NUMA nodes on ARM RHEL 9.0? I will update our autotest scripts based on this bug. For this bug, I think it is more appropriate to close it as CURRENTRELEASE. What do you think?
As per https://gitlab.com/cki-project/kernel-ark/-/merge_requests/1333, it should be 2⁹ = 512 nodes. That covers the 2nd reported issue. As for the 1st one (proximity domain versus logical node id), do you agree with the conclusion? I am OK with either NOTABUG or CURRENTRELEASE.
(In reply to Eric Auger from comment #5)
> (In reply to Zhenyu Zhang from comment #0)
[...]
> Hi, I noticed we have the exact same behavior on x86 for the 2 cases (cpu0
> attached to node 1, this node becomes node #0 in the guest and cpu0 attached
> to node 0, then this latter is node #0 in the guest). Adding Igor in CC as I
> am surprised nobody complained yet about this, if this is a bug.

The numbers in QEMU and in the kernel don't have to be the same; in general the kernel uses its own numbering for NUMA nodes and CPUs. (Sometimes both match, when the kernel enumerates resources in the same order as declared by QEMU, but that happens by accident and not by design.) The important part is how resources are grouped together.
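Igor's point, that the numbering is arbitrary but the grouping must be preserved, suggests comparing node contents as unordered sets rather than by node id. A hypothetical sketch (the helper `same_grouping` and the memory figures for the guest side are illustrative assumptions, pending the answer to drew's question about which memory the guest associates with each node):

```python
def same_grouping(qemu_nodes, guest_nodes):
    """Compare two node-id -> (cpu set, mem MB) maps while ignoring the
    node numbering itself: only the (cpus, memory) groups must match."""
    def to_groups(nodes):
        return sorted((tuple(sorted(cpus)), mem) for cpus, mem in nodes.values())
    return to_groups(qemu_nodes) == to_groups(guest_nodes)

# Comment 0's case: node ids are swapped between QEMU and the guest, but if
# the guest's node0 also carries the 3072 MB backend (assumed here), each CPU
# set still travels with its memory, so the grouping is intact.
qemu = {0: ({4, 5}, 1024), 1: ({0, 1, 2, 3}, 3072)}
guest = {0: ({0, 1, 2, 3}, 3072), 1: ({4, 5}, 1024)}
print(same_grouping(qemu, guest))  # True
```

If the guest instead paired cpus 0-3 with the 1024 MB backend, this check would fail, and that would be a genuine association bug rather than a renumbering.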
(In reply to Eric Auger from comment #16)
> As per https://gitlab.com/cki-project/kernel-ark/-/merge_requests/1333, it
> should be 2⁹ = 512. This is for the 2d reported issue.
> Wrt the 1st one (proximity domain versus domain id), do you agree with the
> conclusion? I am OK with either NOTABUG or CURRENTRELEASE.

Hello Eric, do you know in which version the increased NUMA node count is expected to land? When I use 5.14.0-6.el9.aarch64 to boot a 128-node guest, the result of 'cat /sys/devices/system/node/possible' is 0, while booting a 64-node guest meets expectations. On the other hand, the number of NUMA nodes we finally support on RHEL 8 is 8, right?