Description of problem:
QEMU got a core dump when the guest was bound to a NUMA node that had no huge pages.

Version-Release number of selected component (if applicable):
kernel-4.18.0-75.el8.ppc64le
qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.ppc64le

How reproducible: 4/4

Steps to Reproduce:
1. Enable 1G huge pages on P9 by adding the following to the kernel command line:
   default_hugepagesz=1G hugepagesz=1G hugepages=100 hugepagesz=2M hugepages=10240
   mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
2. Bind the guest to node 0 with numactl:
   numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,cap-hpt-max-page-size=16M,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages1G -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio
3. If the number of huge pages on node 0 happens to be zero, qemu gets a core dump.
4. Check the count on node 0 after the core dump:
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   0
5. Check the overall huge page usage on the host; there were 50 pages on the other node:
   [root@ibm-p9wr-04 ~]# cat /proc/meminfo | grep Huge
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:      50
   HugePages_Free:       50
   HugePages_Rsvd:        0
   HugePages_Surp:        0
   Hugepagesize:    1048576 kB
   Hugetlb:        73400320 kB

Actual results:
Got a core dump.

Expected results:
No core dump under this extreme situation. On the x86 platform there was no core dump; error messages were printed instead.
Additional info: Whatever the circumstances, qemu shouldn't core dump under this extreme situation.
Created attachment 1541682 [details]
debuglog
This affects not only power8-compat guests but also native guests on P9, thanks. It is a ppc-only bug.
Additional information:
Tried it on a *rhel7* host, but it *wasn't* reproducible on the P8 host.
qemu-kvm-rhev-2.12.0-18.el7_6.3.ppc64le
kernel-3.10.0-957.10.1.el7.ppc64le

# numactl -m 0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=viri_pci0,bus=pci.0,addr=0x4 -blockdev node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=rhel76-ppc64-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=disk1 -device virtio-net-pci,mac=9a:4c:4d:4e:4f:60,id=idtniYmJ,vectors=4,netdev=idG7NvsN,bus=pci.0,addr=0x5 -netdev tap,id=idG7NvsN,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -smp 1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :1 -rtc base=utc,clock=host -boot menu=off,strict=off,order=cdn,once=c -enable-kvm -monitor stdio -chardev socket,id=serial_id_serial0,path=/tmp/4,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -m 1G,slots=256,maxmem=80G -numa node -monitor unix:/tmp/monitor3,server,nowait -mem-path /mnt/kvm_hugepage

QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: unable to map backing store for guest RAM: Cannot allocate memory
qemu-kvm: falling back to regular RAM allocation.

# cat /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
0
Not much so far (still waiting for the P8 machine to come back).

Thread 1 "qemu-kvm" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffff6587440 (LWP 14105)]
__memcpy_power7 () at ../sysdeps/powerpc/powerpc64/power7/memcpy.S:107
107             stvx    6,0,dst
(gdb) thread apply 1 bt

Thread 1 (Thread 0x7ffff6587440 (LWP 14105)):
#0  __memcpy_power7 () at ../sysdeps/powerpc/powerpc64/power7/memcpy.S:107
#1  0x0000000100194a08 in cpu_physical_memory_write_rom (as=<optimized out>, addr=<optimized out>, buf=<optimized out>, len=<optimized out>) at /usr/include/bits/string_fortified.h:34
#2  0x00000001004016d0 in rom_reset (unused=<optimized out>) at hw/core/loader.c:1098
#3  0x00000001003fc198 in qemu_devices_reset () at hw/core/reset.c:69
#4  0x0000000100289928 in spapr_machine_reset () at /usr/src/debug/qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.ppc64le/hw/ppc/spapr.c:1518
#5  0x00000001003b0b00 in qemu_system_reset (reason=<optimized out>) at vl.c:1750
#6  0x00000001001877a8 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4757
(gdb) continue
Continuing.
[Thread 0x7ffff554e850 (LWP 14632) exited]
[Switching to Thread 0x7ffff6587440 (LWP 14625)]

Thread 1 "qemu-kvm" hit Breakpoint 1, 0x00000001005fa7e8 in qemu_ram_mmap (fd=22, size=1073741824, align=1073741824, shared=false) at util/mmap-alloc.c:77
77      {
(gdb) cont
Continuing.
[New Thread 0x7ffff554e850 (LWP 14637)]

Thread 1 "qemu-kvm" hit Breakpoint 1, 0x00000001005fa7e8 in qemu_ram_mmap (fd=-1, size=262144, align=2097152, shared=false) at util/mmap-alloc.c:77
77      {
(gdb) cont

The relevant code in util/mmap-alloc.c:

#if defined(__powerpc64__) && defined(__linux__)
    /* Here qemu_fd_getpagesize(fd) is 1G and getpagesize() is 64k
     * (total is 1G for comment 0's -m 1G, 2G for -m 1536M). */
    int anonfd = fd == -1 || qemu_fd_getpagesize(fd) == getpagesize() ? -1 : fd;
    int flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;
    void *ptr = mmap(0, total, PROT_NONE, flags | MAP_PRIVATE, anonfd, 0);
#else
    void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
#endif

So the only difference between ppc64 and x86 (and others) is that on ppc64 we request a non-anonymous mapping. I'm not sure this is the right direction, as the problem is not reproducible on P8 per comment 4.
Hi Serhii,
For comment 4, I guessed some steps were missing, so I tried it again and hit the *same* problem as this one. I'm going to file a bug against *RHEL7.7* soon; any problems, please let me know.

kernel-3.10.0-1037.el7.ppc64le
qemu-kvm-rhev-2.12.0-26.el7.ppc64le

How reproducible: 3/3

Steps to Reproduce:
1. Configure huge pages on the P8 host:
   mkdir /dev/hugepages
   mount -t hugetlbfs -o pagesize=16M none /dev/hugepages
   echo 15000 > /proc/sys/vm/nr_hugepages
   echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
   [root@ibm-p8-kvm-02-qe runs]# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
   Node 0 AnonHugePages:     0 kB
   Node 0 HugePages_Total:   0
   Node 0 HugePages_Free:    0
   Node 0 HugePages_Surp:    0
2. Bind the guest's memory to NUMA node 0:
   numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /mnt/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio

Actual results:
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) Bus error
and QEMU finally quits.

Expected results:
It's an extreme situation, but even so it is not a good result for qemu to die such an unhappy death.
CLI:
numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio
Bug 1703765 - [RHEL7] QEMU dies unhappy death if nr of huge page is zero.
I investigated this today. I understand why the problem occurs: because qemu doesn't preallocate the guest memory (without -mem-prealloc), the kernel allows it to overcommit. However, when qemu needs to actually touch the guest memory, it can't find any on the right NUMA node and gets a SIGBUS. I can't see an easy way around this.

I'm confused by why x86 doesn't seem to hit this problem: AFAICT it should behave the same. I wonder if it was actually tested on an x86 machine with more than one NUMA node, and with the memory allocated on a different node to the one where qemu is bound. Can we confirm the steps and behaviour on x86, please?

[I tried to test x86 myself, but wasn't able to work out how to use Beaker to find an x86 machine which both permits RHEL8 installs and has more than one NUMA node.]
On an x86 host with two NUMA nodes, using similar steps to comment 0.

Build information:
kernel-4.18.0-84.el8.x86_64
qemu-kvm-core-3.1.0-24.module+el8.0.1+3132+0c0fb959.x86_64

1. mkdir /dev/hugepages
2. mount -t hugetlbfs -o pagesize=2M none /dev/hugepages
3. echo 1000 > /proc/sys/vm/nr_hugepages
4. cat /sys/devices/system/node/node*/meminfo | fgrep Huge
   Node 0 AnonHugePages:      2048 kB
   Node 0 ShmemHugePages:        0 kB
   Node 0 HugePages_Total:     125
   Node 0 HugePages_Free:      125
   Node 0 HugePages_Surp:        0
   Node 1 AnonHugePages:     16384 kB
   Node 1 ShmemHugePages:        0 kB
   Node 1 HugePages_Total:     875
   Node 1 HugePages_Free:      875
   Node 1 HugePages_Surp:        0
5. echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   [root@hp-dl380pg8-09 home]# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
   Node 0 AnonHugePages:      2048 kB
   Node 0 ShmemHugePages:        0 kB
   Node 0 HugePages_Total:       0
   Node 0 HugePages_Free:        0
   Node 0 HugePages_Surp:        0
   Node 1 AnonHugePages:     16384 kB
   Node 1 ShmemHugePages:        0 kB
   Node 1 HugePages_Total:     875
   Node 1 HugePages_Free:      875
   Node 1 HugePages_Surp:        0
6.
numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pc -nodefaults -device VGA,bus=pci.0,addr=0x2 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel610-64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio

(qemu) VNC server running on ::1:5900
error: kvm run failed Bad address
EAX=00000000 EBX=00000000 ECX=00000010 EDX=000f045c
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00007000
EIP=000f045c EFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6240 00000037
IDT=     000f627e 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=28 d0 0f 00 0f b7 d2 8d 44 24 02 e8 ff c7 ff ff 83 c4 28 c3 <53> 83 ec 20 e8 e0 c9 ff ff bb 00 00 00 40 8d 44 24 0c 50 8d 44 24 0c 50 8d 4c 24 0c 8d 54

Hello David,
With the same steps on x86, qemu doesn't die but just gives some warnings; on ppc, qemu quits. That is a big difference. Any problems, please let me know. Thanks.

Best regards,
Min
I believe the difference here between x86 and powerpc is essentially an accident, not related to anything fundamental:

* On x86, the first time the guest memory is accessed is when attempting to actually run guest instructions via KVM. KVM detects that it can't allocate the memory and exits with an EFAULT.
* On powerpc, before we attempt to execute guest instructions, we need to write the device tree into guest memory. This means we trigger the allocation failure within qemu itself, which is reported as a SIGBUS, killing the process.

In both cases, we get a failure as soon as memory is accessed during normal operation. The only way to detect this earlier is to pre-allocate the guest memory (e.g. using -mem-prealloc). If we do that, the problem goes away on both x86 and POWER.

I don't think there's any way to detect this failure more cleanly on powerpc, apart from installing a SIGBUS handler in qemu, which I believe is more trouble than it's worth.

So, I think this is behaving as expected - if your process is unable to allocate memory (under the constraints you've given it), a SIGBUS is what you get. So, I'm closing as WONTFIX.

Andrea, could you double check my reasoning above?
Yes, the difference between powerpc and x86 seems cosmetic. It's still a failure, just a more graceful one, because it's the ioctl returning -EFAULT (get_user_pages in the kernel hits the allocation failure) instead of qemu touching the memory first and getting a SIGBUS.

Without -mem-prealloc the guest can always crash unexpectedly if it runs out of hugepages, because other processes or other guests may be eating into the hugetlbfs pool. The failure may look nicer on x86, but from the guest's point of view it's equally bad. Not using -mem-prealloc means the admin must know what they're doing to keep hugepages from running out while the guest runs.

Like David said, using -mem-prealloc when consistency of the failure is required sounds like the preferred fix here; it's also safer.
Thanks for the confirmation Andrea.