Bug 1347498 - [ppc64le] Guest can't boot up with hugepage memdev
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: ppc64le
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Thomas Huth
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-17 05:12 UTC by Zhengtong
Modified: 2017-07-27 05:58 UTC (History)

Fixed In Version: qemu-kvm-rhev-2.6.0-16.el7
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 21:19:20 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2673 0 normal SHIPPED_LIVE qemu-kvm-rhev bug fix and enhancement update 2016-11-08 01:06:13 UTC

Description Zhengtong 2016-06-17 05:12:16 UTC
Description of problem:
When booting a guest with a pc-dimm that uses memory-backend-file as its backend, the guest gets stuck after the SLOF screen.

Version-Release number of selected component (if applicable):
Host kernel:3.10.0-433.el7.ppc64le
qemu-kvm-rhev-2.6.0-5.el7

How reproducible:
4/4

Steps to Reproduce:
1. On the host, allocate some huge pages:
# echo 266 > /proc/sys/vm/nr_hugepages
2. Mount hugetlbfs at the mount point:
# mount -t hugetlbfs -o pagesize=16384K none /mnt/kvm_hugepage
3. Boot the guest with the pc-dimm device:
# /usr/libexec/qemu-kvm \
...
   -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
    -device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
...
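
As a quick sanity check on the steps above, the following sketch works out whether the reservation from step 1 is large enough to back the 1G memory backend from step 3. It assumes the host really uses 16 MiB huge pages (taken from the `pagesize=16384K` mount option) and that the kernel actually reserved all 266 pages; the helper name `pages_needed` is made up for this illustration.

```python
# Sizes taken from the reproduction steps; pages_needed() is a hypothetical helper.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(backend_bytes: int, page_bytes: int) -> int:
    """Number of huge pages needed to back a memory backend, rounded up."""
    return -(-backend_bytes // page_bytes)


page_size = 16384 * KIB  # pagesize=16384K from the mount options
reserved = 266           # value written to /proc/sys/vm/nr_hugepages
dimm_size = 1 * GIB      # size=1G of the memory-backend-file object

needed = pages_needed(dimm_size, page_size)
print(needed, reserved >= needed)
```

Under these assumptions the DIMM needs far fewer pages than were reserved, so a shortage of huge pages is unlikely to explain the hang.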

Actual results:

The guest gets stuck after SLOF has loaded.

Expected results:

The guest should boot up without problems.

Additional info:

Full guest boot cmd:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_9lJfu6/monitor-qmpmonitor1-20160616-224516-9paJv3ys,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_9lJfu6/monitor-catch_monitor-20160616-224516-9paJv3ys,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_9lJfu6/serial-serial0-20160616-224516-9paJv3ys,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,snapshot=on,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/test/staf-kvm-devel/workspace/usr/share/avocado/data/avocado-vt/images/RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:93:94:95:96:97,id=ideoMOlt,vectors=4,netdev=idMh3R0C,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idMh3R0C,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 16384,slots=4,maxmem=32G \
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
    -device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -numa node,nodeid=0  \
    -numa node,nodeid=1 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -monitor stdio

Comment 2 Thomas Huth 2016-06-20 17:08:04 UTC
FWIW, the only relevant difference in the device tree between a failing boot with huge pages, and a normal, working boot without huge pages seems to be the "ibm,segment-page-sizes" property of the CPU nodes.

*Without* huge pages, the property contains:
 ibm,segment-page-sizes      0000000c  00000000  00000001  0000000c
                             00000000  00000010  00000110  00000001
                             00000010  00000001
and Linux prints during boot:
 ...
 Calling quiesce...
 returning from prom_init
 kexec: crashkernel=auto resulted in zero bytes of reserved memory.
 Using pSeries machine description
 Page sizes from device-tree:
 base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
 base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
 Using 1TB segments
 Found initrd at 0xc000000003700000:0xc000000004bb04cc
 ...

*With* huge pages, the property contains:
 ibm,segment-page-sizes      0000000c  00000000  00000002  0000000c
                             00000000  00000018  00000038  00000010
                             00000110  00000002  00000010  00000001
                             00000018  00000008  00000018  00000100
                             00000001  00000018  00000000
and since Linux does not print out anything after the "prom_init" line anymore, I had to take a dump of the VM here and extract the output from the printk buffer there:
kexec: crashkernel=auto resulted in zero bytes of reserved memory.
Allocated 4718592 bytes for 2048 pacas at c00000000fb80000
Using pSeries machine description
Page sizes from device-tree:
base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Using 1TB segments
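
For readers who want to map the raw property cells to the kernel messages, here is a minimal decoding sketch. The layout it assumes (base page shift, SLB encoding, number of entries, then one shift/penc pair per entry) is inferred from how the cells quoted above line up with the "Page sizes from device-tree" output; the function name is hypothetical and this is not kernel or QEMU code.

```python
from typing import List, Tuple


def decode_segment_page_sizes(cells: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (base_shift, shift, sllp, penc) tuples from the property cells.

    Assumed layout per segment size: base_shift, sllp, count,
    followed by `count` pairs of (shift, penc).
    """
    entries = []
    i = 0
    while i < len(cells):
        base_shift, sllp, count = cells[i:i + 3]
        i += 3
        for _ in range(count):
            shift, penc = cells[i:i + 2]
            i += 2
            entries.append((base_shift, shift, sllp, penc))
    return entries


# Cells of the property as dumped *with* huge pages.
with_huge_pages = [
    0x0000000c, 0x00000000, 0x00000002, 0x0000000c,
    0x00000000, 0x00000018, 0x00000038, 0x00000010,
    0x00000110, 0x00000002, 0x00000010, 0x00000001,
    0x00000018, 0x00000008, 0x00000018, 0x00000100,
    0x00000001, 0x00000018, 0x00000000,
]

decoded = decode_segment_page_sizes(with_huge_pages)
for base_shift, shift, sllp, penc in decoded:
    print(f"base_shift={base_shift}: shift={shift}, sllp=0x{sllp:04x}, penc={penc}")
```

The decoded entries can be compared line by line with the base_shift/shift/sllp/penc values in the printk output; the entries with shift=24 are the 16 MiB page sizes that only show up in the huge page case.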

Comment 3 Thomas Huth 2016-06-20 17:11:37 UTC
I just noticed there is even a kernel Oops recorded in the printk buffer of the crashed kernel:

Using 1TB segments
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-445.el7.ppc64le #1
task: c000000001116790 ti: c000000001180000 task.ti: c000000001180000
NIP: 0000000000c819e0 LR: 0000000000c819e0 CTR: c0000000000b9ed0
REGS: c000000001183c30 TRAP: 0700   Not tainted  (3.10.0-445.el7.ppc64le)
MSR: 8000000000023003 <SF,FP,ME,RI,LE>  CR: 22002048  XER: 00000000
CFAR: 000000000005ca84 SOFTE: 0 
GPR00: 0000000000c819e0 c000000001183eb0 c000000001185f00 fffffffffffffffe 
GPR04: 0000000000000000 c000000001183e00 40016e7779000015 0000000000000190 
GPR08: 000000014052440b 0000000000000000 0000000000000000 c000000001314c48 
GPR12: 0000000000002200 c00000000fb80000 000000003dc55d58 0000000002b59360 
GPR16: 0000000002b59358 0000000002000000 0000000000000060 0000000002b58c70 
GPR20: 000000003dc55d50 fffffffffffffffd 000000003dc55d10 000000003e454400 
GPR24: 000000002fff0000 c000000001208190 c000000000000000 0000000050000000 
GPR28: c0000000010b8638 c000000000000000 c0000000015100c0 c000000001314c48 
NIP [0000000000c819e0] 0xc819e0
LR [0000000000c819e0] 0xc819e0
Call Trace:
[c000000001183eb0] [0000000000c819e0] 0xc819e0 (unreliable)
[c000000001183f50] [0000000000c79bec] 0xc79bec
[c000000001183f90] [0000000000009b34] 0x9b34
Instruction dump:
7d2a4a14 7fbe4840 409c0034 e8df046e e8be0000 eb7e0008 e8ff0472 7cbdd378 
78a50100 7fa3eb78 7c9dda14 4b3dadbd 0b030000 3bde0020 4bffffbc 3860ffff 
---[ end trace 1b75b31a2719ed1c ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 180 seconds..RTAS system-reboot returned -1

Comment 4 Thomas Huth 2016-06-21 07:38:38 UTC
Not sure, but I think the problem might be that you've only added huge-page support to the additional DIMM, but not to the main memory. If I add the additional parameter "-mem-path /mnt/kvm_hugepage" so that huge pages are also enabled for the main memory, the guest boots up fine. So I have to find out whether it should work without the main "-mem-path" option, too (do you know whether this works on x86?), or whether we should simply refuse that configuration when starting QEMU...

Comment 5 Thomas Huth 2016-06-21 11:12:57 UTC
I've done some more research, and as far as I can see, the problem is indeed that the guest kernel tries to do htab_bolt_mapping() (see arch/powerpc/mm/hash_utils_64.c in the kernel sources) with huge pages on the main memory region which is not backed by huge pages on the host.

Comment 6 Thomas Huth 2016-06-21 11:33:51 UTC
Just found BZ 1265576 which seems to be related ... however, not sure yet why it's working there but not here...

Comment 7 Thomas Huth 2016-06-21 15:24:13 UTC
... ok, after reading through the other BZ, it seems like we're exactly hitting the situation described here:
https://bugzilla.redhat.com/show_bug.cgi?id=1265576#c20
So we likely need to add some more checks to QEMU to make sure that in such mixed configurations, we also do not signal huge page support to the guests.
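
The rule implied here can be sketched as follows: the largest page size that is safe to signal to the guest is the smallest backing page size across all of its RAM regions. This is only a conceptual illustration, not the QEMU patch; the function name is hypothetical, and the 64 KiB host base page size used in the example is an assumption.

```python
from typing import Iterable

KIB = 1024
MIB = 1024 * KIB


def max_safe_guest_page_size(main_ram_page: int, memdev_pages: Iterable[int]) -> int:
    """Largest page size that every RAM region of the guest can back.

    Conceptual sketch only: take the smallest backing page size over
    main memory and all additional memory backends.
    """
    return min([main_ram_page, *memdev_pages])


# Mixed configuration from this bug: main RAM on normal pages
# (64 KiB assumed here), pc-dimm backed by 16 MiB huge pages.
mixed = max_safe_guest_page_size(64 * KIB, [16 * MIB])

# Configuration with "-mem-path": main RAM is also on 16 MiB huge pages.
all_huge = max_safe_guest_page_size(16 * MIB, [16 * MIB])

print(mixed // KIB, all_huge // KIB)
```

In the mixed case the result stays at the normal page size, so 16 MiB pages should not be advertised to the guest; only when main memory is also huge page backed does the result reach 16 MiB.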

Comment 9 Thomas Huth 2016-06-23 19:35:35 UTC
Patch has been merged upstream:
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=86b50f2e1befc3340

Comment 10 Miroslav Rezanina 2016-07-01 08:24:25 UTC
Fix included in qemu-kvm-rhev-2.6.0-11.el7

Comment 12 Zhengtong 2016-07-02 08:57:41 UTC
I retested the case with the fixed version qemu-kvm-rhev-2.6.0-11.el7, but the problem does not seem to be resolved. The guest still can't boot up with the hugepage memdev. The command line is the same as in comment 0. After removing the hugepage memdev options from the command line, the guest boots up normally.

So I think the problem was not fixed in this version.

Host kernel version: 3.10.0-456.el7.ppc64le

Comment 13 Thomas Huth 2016-07-14 14:39:43 UTC
Darn! Looks like my patch only fixes the issue when there are no "-numa" options in the CLI parameter list. I was now able to reproduce the problem again with the following shortened command line, too:

/usr/libexec/qemu-kvm -nographic -vga none -hda /path/to/hd.img -m 16384,slots=4,maxmem=32G -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 -device pc-dimm,id=dimm-mem1,memdev=mem-mem1 -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 -numa node,nodeid=0 -numa node,nodeid=1

I'll have another closer look at the code to see what's going wrong now...

Comment 14 Thomas Huth 2016-07-15 08:32:24 UTC
I've now suggested another fix upstream:
http://news.gmane.org/find-root.php?message_id=1468570225-14101-1-git-send-email-thuth@redhat.com

Comment 15 Miroslav Rezanina 2016-07-26 06:56:40 UTC
Fix included in qemu-kvm-rhev-2.6.0-16.el7

Comment 16 Qunfang Zhang 2016-07-26 07:20:15 UTC
Hi, Zhengtong

Please help verify it in the latest version, thanks!

Comment 18 Zhengtong 2016-07-27 04:52:57 UTC
Tested with "qemu-kvm-rhev-2.6.0-16.el7": both ppc64 and ppc64le guests boot up successfully and no longer get stuck in the firmware.


Boot cmd:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_9lJfu6/monitor-qmpmonitor1-20160616-224516-9paJv3ys,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_9lJfu6/monitor-catch_monitor-20160616-224516-9paJv3ys,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_9lJfu6/serial-serial0-20160616-224516-9paJv3ys,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,snapshot=on,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/RHEL-Server-7.3-ppc64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:93:94:95:96:97,id=ideoMOlt,vectors=4,netdev=idMh3R0C,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idMh3R0C,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4096,slots=4,maxmem=32G \
    -object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
    -device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -numa node,nodeid=0  \
    -numa node,nodeid=1 \
    -device usb-kbd \
    -device usb-mouse \
    -vnc :0  \
    -monitor stdio



So the bug is verified

Comment 20 errata-xmlrpc 2016-11-07 21:19:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html

