Bug 1686261 - [RHEL8] QEMU dies an unhappy death if the number of huge pages is zero.
Summary: [RHEL8] QEMU dies an unhappy death if the number of huge pages is zero.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: ppc64le
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Gibson
QA Contact: Min Deng
URL:
Whiteboard:
Depends On:
Blocks: 1703765
 
Reported: 2019-03-07 06:10 UTC by Min Deng
Modified: 2019-12-02 07:57 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 04:14:26 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
debuglog (12.67 KB, text/plain)
2019-03-07 06:12 UTC, Min Deng

Description Min Deng 2019-03-07 06:10:47 UTC
Description of problem:
QEMU core dumps when the guest's memory is bound to a NUMA node that has no huge pages available.

Version-Release number of selected component (if applicable):
kernel-4.18.0-75.el8.ppc64le
qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.ppc64le

How reproducible:
4/4

Steps to Reproduce:
1. Enable 1G huge pages on the P9 host by adding the following to the kernel command line, then mount a 1G hugetlbfs:

default_hugepagesz=1G hugepagesz=1G hugepages=100 hugepagesz=2M hugepages=10240

mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G

2. Bind the guest to NUMA node 0:
  numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,cap-hpt-max-page-size=16M,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages1G -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio

3. If the number of huge pages on node 0 happens to be zero, QEMU core dumps.

4. Check the huge page count on node 0 after the core dump:
cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
0 
5. Check the overall huge page usage on the host; 50 pages are still free on the other node:

[root@ibm-p9wr-04 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      50
HugePages_Free:       50
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        73400320 kB

Actual results:
QEMU core dumps.

Expected results:
No core dump, even under this extreme situation.

On the x86 platform there is no core dump; QEMU prints error messages instead.


Additional info:

In any case, QEMU shouldn't core dump under this extreme situation.

Comment 1 Min Deng 2019-03-07 06:12:15 UTC
Created attachment 1541682 [details]
debuglog

Comment 3 Min Deng 2019-03-07 06:19:44 UTC
It's not only the power8-compat guest but also the native guest on P9 that is affected, thanks. It is a ppc-only bug.

Comment 4 Min Deng 2019-03-07 08:25:50 UTC
Additional information:
Tried it on a *RHEL7* host, but it *wasn't* reproducible on the P8 host.

qemu-kvm-rhev-2.12.0-18.el7_6.3.ppc64le
kernel-3.10.0-957.10.1.el7.ppc64le

#numactl -m 0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=viri_pci0,bus=pci.0,addr=0x4 -blockdev node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=rhel76-ppc64-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=disk1 -device virtio-net-pci,mac=9a:4c:4d:4e:4f:60,id=idtniYmJ,vectors=4,netdev=idG7NvsN,bus=pci.0,addr=0x5 -netdev tap,id=idG7NvsN,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -smp 1 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :1 -rtc base=utc,clock=host -boot menu=off,strict=off,order=cdn,once=c -enable-kvm -monitor stdio -chardev socket,id=serial_id_serial0,path=/tmp/4,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -m 1G,slots=256,maxmem=80G -numa node -monitor unix:/tmp/monitor3,server,nowait -mem-path /mnt/kvm_hugepage
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: unable to map backing store for guest RAM: Cannot allocate memory
qemu-kvm: falling back to regular RAM allocation.

#cat /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages  
0

Comment 6 Serhii Popovych 2019-04-15 05:52:48 UTC
Not much so far (still waiting for the P8 machine to come back).

Thread 1 "qemu-kvm" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffff6587440 (LWP 14105)]
__memcpy_power7 () at ../sysdeps/powerpc/powerpc64/power7/memcpy.S:107
107             stvx    6,0,dst

(gdb) thread apply 1 bt

Thread 1 (Thread 0x7ffff6587440 (LWP 14105)):
#0  __memcpy_power7 () at ../sysdeps/powerpc/powerpc64/power7/memcpy.S:107
#1  0x0000000100194a08 in cpu_physical_memory_write_rom (as=<optimized out>, addr=<optimized out>, buf=<optimized out>, len=<optimized out>)
    at /usr/include/bits/string_fortified.h:34
#2  0x00000001004016d0 in rom_reset (unused=<optimized out>) at hw/core/loader.c:1098
#3  0x00000001003fc198 in qemu_devices_reset () at hw/core/reset.c:69
#4  0x0000000100289928 in spapr_machine_reset () at /usr/src/debug/qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.ppc64le/hw/ppc/spapr.c:1518
#5  0x00000001003b0b00 in qemu_system_reset (reason=<optimized out>) at vl.c:1750
#6  0x00000001001877a8 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4757

(gdb) continue
Continuing.
[Thread 0x7ffff554e850 (LWP 14632) exited]
[Switching to Thread 0x7ffff6587440 (LWP 14625)]

Thread 1 "qemu-kvm" hit Breakpoint 1, 0x00000001005fa7e8 in qemu_ram_mmap (fd=22, size=1073741824, align=1073741824, shared=false)
    at util/mmap-alloc.c:77
77      {
(gdb) cont
Continuing.
[New Thread 0x7ffff554e850 (LWP 14637)]

Thread 1 "qemu-kvm" hit Breakpoint 1, 0x00000001005fa7e8 in qemu_ram_mmap (fd=-1, size=262144, align=2097152, shared=false) at util/mmap-alloc.c:77
77      {
(gdb) cont

#if defined(__powerpc64__) && defined(__linux__)
int anonfd = fd == -1 || qemu_fd_getpagesize(fd) == getpagesize() ? -1 : fd;
                         ^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^
                                  1G                     64k
                         (for comment 0 -m 1G, 2G for -m 1536M)
int flags = anonfd == -1 ? MAP_ANONYMOUS : MAP_NORESERVE;                                                                                         
void *ptr = mmap(0, total, PROT_NONE, flags | MAP_PRIVATE, anonfd, 0);
#else
void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
#endif

So the only difference between ppc64 and x86 (and others) is that we request a non-anonymous mapping. I'm not sure this is the right direction, since per comment 4 the problem is not reproducible on P8.
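
For context, here is a condensed, self-contained sketch of what the quoted qemu_ram_mmap() path does. This is an approximation reconstructed from the snippet above, not the exact upstream code; the helper name reserve_and_map() and the explicit pagesize parameter are illustrative only.

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

void *reserve_and_map(int fd, size_t size, size_t align, size_t pagesize)
{
    size_t total = size + align;        /* slack so we can align inside */
    void *guard;

#if defined(__powerpc64__) && defined(__linux__)
    /* Use the hugetlbfs fd for the PROT_NONE reservation only when its
     * page size differs from the base page size (e.g. 1G vs 64k), so the
     * kernel returns an address suitably aligned for huge pages. */
    int anonfd = (fd == -1 || pagesize == (size_t)getpagesize()) ? -1 : fd;
    int flags = (anonfd == -1) ? MAP_ANONYMOUS : MAP_NORESERVE;
    guard = mmap(0, total, PROT_NONE, flags | MAP_PRIVATE, anonfd, 0);
#else
    guard = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
#endif
    if (guard == MAP_FAILED) {
        return MAP_FAILED;
    }

    /* Place the real mapping at an aligned address inside the guard area.
     * (QEMU picks MAP_SHARED or MAP_PRIVATE from its 'shared' flag; the
     * backtrace above was taken with shared=false.)  Note that nothing
     * here commits pages on any particular NUMA node -- allocation only
     * happens at first touch, which is where the SIGBUS comes from. */
    uintptr_t aligned = ((uintptr_t)guard + align - 1) & ~(align - 1);
    return mmap((void *)aligned, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_FIXED, fd, 0);
}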

Comment 7 Min Deng 2019-04-15 06:12:11 UTC
Hi Serhii,
   Regarding comment 4, I guess some steps were missing, so I tried it again and hit the *same* problem as this one. I'm going to file a bug against *RHEL7.7* soon. Any problems, please let me know.

kernel-3.10.0-1037.el7.ppc64le
qemu-kvm-rhev-2.12.0-26.el7.ppc64le

How reproducible:
3/3
Steps to Reproduce:
1. Configure huge pages on the P8 host:
  mkdir /dev/hugepages
  mount -t hugetlbfs -o pagesize=16M none /dev/hugepages
  echo 15000 > /proc/sys/vm/nr_hugepages
  echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
  [root@ibm-p8-kvm-02-qe runs]# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
  Node 0 AnonHugePages:         0 kB
  Node 0 HugePages_Total:     0
  Node 0 HugePages_Free:      0
  Node 0 HugePages_Surp:      0

  

2. Bind the guest's memory to NUMA node 0:
numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /mnt/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio 

3. Actual results:
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) Bus error
and QEMU eventually quits.

Expected results:
It's an extreme situation, but QEMU dying such an unhappy death is still not a good outcome.

Comment 8 Min Deng 2019-04-15 06:13:23 UTC
CLI:
numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=serial_id_serial0,path=/tmp/5,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel76-ppc64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio

Comment 9 Min Deng 2019-04-28 08:04:05 UTC
Bug 1703765 - [RHEL7] QEMU dies an unhappy death if the number of huge pages is zero.

Comment 10 David Gibson 2019-05-10 06:26:04 UTC
I investigated this today.

I understand why the problem occurs: because qemu doesn't preallocate the guest memory (without -mem-prealloc), the kernel allows it to overcommit.  However, when qemu actually needs to touch the guest memory, it can't find any on the right NUMA node and gets a SIGBUS.  I can't see an easy way around this.
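
To make that concrete, here is a minimal standalone sketch of the failure mode, independent of QEMU. Assumptions: a 1G hugetlbfs is mounted at /dev/hugepages1G as in comment 0, the program is run under numactl --membind=0, and node 0 has zero 1G pages while the other node still has some.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (1UL << 30)  /* one 1G huge page */

int main(void)
{
    int fd = open("/dev/hugepages1G/repro", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    unlink("/dev/hugepages1G/repro");   /* auto-clean on exit */
    ftruncate(fd, SZ);

    /* The mmap itself succeeds: the hugetlbfs reservation is taken from
     * the global pool, which still has free pages on the other node. */
    void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mmap succeeded at %p; touching...\n", p);

    /* The fault-time allocation must honour the membind policy, finds no
     * 1G page on node 0, and the process dies with SIGBUS -- just like
     * the memcpy in rom_reset() in the backtrace of comment 6. */
    memset(p, 0, SZ);
    printf("touch succeeded\n");
    return 0;
}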

I'm confused by why x86 doesn't seem to hit this problem: AFAICT it should behave the same.

I wonder if it was actually tested on an x86 machine with more than one NUMA node, with the memory allocated on a different node from the one where qemu is bound.  Can we confirm the steps and behaviour on x86, please?

[I tried to test x86 myself, but wasn't able to work out how to use Beaker to find an x86 machine which both permits RHEL8 installs and has more than one NUMA node]

Comment 11 Min Deng 2019-05-10 07:47:36 UTC
Tested on an x86 host with two nodes, with steps similar to comment 0.
Build information
kernel-4.18.0-84.el8.x86_64
qemu-kvm-core-3.1.0-24.module+el8.0.1+3132+0c0fb959.x86_64
   1.mkdir /dev/hugepages
   2.mount -t hugetlbfs -o pagesize=2M none /dev/hugepages
   3.echo 1000 > /proc/sys/vm/nr_hugepages
   4.cat /sys/devices/system/node/node*/meminfo | fgrep Huge
Node 0 AnonHugePages:      2048 kB
Node 0 ShmemHugePages:        0 kB
Node 0 HugePages_Total:   125
Node 0 HugePages_Free:    125
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:     16384 kB
Node 1 ShmemHugePages:        0 kB
Node 1 HugePages_Total:   875
Node 1 HugePages_Free:    875
Node 1 HugePages_Surp:      0

  5.echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
[root@hp-dl380pg8-09 home]# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
Node 0 AnonHugePages:      2048 kB
Node 0 ShmemHugePages:        0 kB
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:     16384 kB
Node 1 ShmemHugePages:        0 kB
Node 1 HugePages_Total:   875
Node 1 HugePages_Free:    875
Node 1 HugePages_Surp:      0

  6. numactl --membind=0 /usr/libexec/qemu-kvm -name avocado-vt-vm1 -machine pc -nodefaults -device VGA,bus=pci.0,addr=0x2 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -device scsi-hd,id=image1,drive=drive_scsi11,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=rhel610-64-virtio-scsi.qcow2,node-name=drive_scsi1 -blockdev driver=qcow2,node-name=drive_scsi11,file=drive_scsi1 -m 1G,slots=256,maxmem=2T -mem-path /dev/hugepages -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -monitor stdio 

(qemu) VNC server running on ::1:5900
error: kvm run failed Bad address
EAX=00000000 EBX=00000000 ECX=00000010 EDX=000f045c
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00007000
EIP=000f045c EFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6240 00000037
IDT=     000f627e 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=28 d0 0f 00 0f b7 d2 8d 44 24 02 e8 ff c7 ff ff 83 c4 28 c3 <53> 83 ec 20 e8 e0 c9 ff ff bb 00 00 00 40 8d 44 24 0c 50 8d 44 24 0c 50 8d 4c 24 0c 8d 54


Hello David,
   With the same steps on x86, QEMU doesn't die but just prints some errors; on ppc, however, QEMU quits. That's a big difference. Any problems, please let me know. Thanks.
Best regards,
Min

Comment 12 David Gibson 2019-06-04 04:14:26 UTC
I believe the difference here between x86 and powerpc is essentially an accident, not related to anything fundamental:

 * On x86, the first time the guest memory is accessed is when attempting to actually run guest instructions via KVM.  KVM detects that it can't allocate the memory and exits with an EFAULT.

 * On powerpc, before we attempt to execute guest instructions, we need to write the device tree into guest memory.  This means we trigger the allocation failure within qemu itself, where it is reported as a SIGBUS, killing the process.

In both cases, we get a failure as soon as memory is accessed during normal operation.  The only way to detect this earlier is to pre-allocate the guest memory (e.g. using -mem-prealloc).  If we do that the problem goes away on both x86 and POWER.

I don't think there's any way to detect this failure more cleanly on powerpc, apart from installing a SIGBUS handler in qemu, which I believe is more trouble than it's worth.
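
For reference, such a SIGBUS handler would look roughly like the sketch below, loosely modeled on the page-touching that -mem-prealloc performs at startup; the function name is hypothetical and this is not QEMU's actual code.

#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

static sigjmp_buf env;

static void sigbus_handler(int sig)
{
    siglongjmp(env, 1);   /* jump back out of the faulting access */
}

/* Touch every page up front; return false instead of dying on SIGBUS. */
static bool touch_all_pages(char *area, size_t size, size_t pagesize)
{
    struct sigaction act = { .sa_handler = sigbus_handler }, old;
    sigaction(SIGBUS, &act, &old);

    bool ok = true;
    if (sigsetjmp(env, 1)) {
        ok = false;                    /* an access below faulted */
    } else {
        for (size_t off = 0; off < size; off += pagesize) {
            volatile char *p = area + off;
            *p = *p;                   /* force allocation of this page */
        }
    }

    sigaction(SIGBUS, &old, NULL);     /* restore the previous handler */
    return ok;
}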

So, I think this is behaving as expected - if your process is unable to allocate memory (under the constraints you've given it), a SIGBUS is what you get.

So, I'm closing as WONTFIX.

Andrea, could you double check my reasoning above?

Comment 13 Andrea Arcangeli 2019-06-05 22:03:48 UTC
Yes, the difference between powerpc and x86 seems cosmetic. It's still a failure, just a more graceful one: the ioctl returns -EFAULT because it's get_user_pages in the kernel that hits the allocation failure, instead of qemu touching the memory first and getting a SIGBUS.
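
In other words, on x86 the fault surfaces inside the KVM_RUN ioctl, so userspace can handle it gracefully. A hedged sketch of that path (vcpu_fd is assumed to be a KVM vCPU fd set up elsewhere, and the helper name is hypothetical):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int run_vcpu_once(int vcpu_fd)
{
    int ret = ioctl(vcpu_fd, KVM_RUN, 0);
    if (ret < 0 && errno == EFAULT) {
        /* get_user_pages() failed in the kernel; report it and stop the
         * vcpu instead of crashing -- this is the "kvm run failed Bad
         * address" message seen in comment 11. */
        fprintf(stderr, "kvm run failed %s\n", strerror(errno));
        return -1;
    }
    return ret;
}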

Without -mem-prealloc the guest can always crash unexpectedly if it runs out of hugepages, because other processes or other guests may be eating into the hugetlbfs pool. The failure may look nicer on x86, but from the guest's point of view it's equally bad. Not using -mem-prealloc means the admin is expected to know what they're doing and keep hugepages from running out while the guest runs.

As David said, using -mem-prealloc where consistency of the failure is required sounds like the preferred fix here; it's also safer.

Comment 14 David Gibson 2019-06-06 00:04:46 UTC
Thanks for the confirmation Andrea.

