Bug 1347498
| Summary: | [ppc64le] Guest can't boot up with hugepage memdev | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Zhengtong <zhengtli> |
| Component: | qemu-kvm-rhev | Assignee: | Thomas Huth <thuth> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.3 | CC: | juzhang, knoel, qzhang, thuth, virt-maint, xuhan, zhengtli |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | ppc64le | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-rhev-2.6.0-16.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-07 21:19:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
FWIW, the only relevant difference in the device tree between a failing boot with huge pages, and a normal, working boot without huge pages seems to be the "ibm,segment-page-sizes" property of the CPU nodes.
*Without* huge pages, the property contains:
ibm,segment-page-sizes 0000000c 00000000 00000001 0000000c
00000000 00000010 00000110 00000001
00000010 00000001
and Linux prints during boot:
...
Calling quiesce...
returning from prom_init
kexec: crashkernel=auto resulted in zero bytes of reserved memory.
Using pSeries machine description
Page sizes from device-tree:
base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Using 1TB segments
Found initrd at 0xc000000003700000:0xc000000004bb04cc
...
*With* huge pages, the property contains:
ibm,segment-page-sizes 0000000c 00000000 00000002 0000000c
00000000 00000018 00000038 00000010
00000110 00000002 00000010 00000001
00000018 00000008 00000018 00000100
00000001 00000018 00000000
and since Linux does not print out anything after the "prom_init" line anymore, I had to take a dump of the VM here and extract the output from the printk buffer there:
kexec: crashkernel=auto resulted in zero bytes of reserved memory.
Allocated 4718592 bytes for 2048 pacas at c00000000fb80000
Using pSeries machine description
Page sizes from device-tree:
base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Using 1TB segments
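The two property dumps above can be cross-checked against the kernel's "Page sizes from device-tree" output with a small decoder. This is only a sketch: it assumes the cell layout that the kernel output implies (base page shift, SLB encoding, number of entries, then one shift/penc pair per entry), and the function and variable names are made up for illustration.

```python
def parse_cells(text):
    """Turn a whitespace-separated dump of 32-bit hex cells into integers."""
    return [int(cell, 16) for cell in text.split()]


def decode_segment_page_sizes(cells):
    """Decode "ibm,segment-page-sizes" cells into (base_shift, shift, sllp, penc).

    Assumed layout per segment: base shift, SLB encoding (sllp), entry count,
    followed by that many (shift, penc) pairs.
    """
    entries = []
    i = 0
    while i < len(cells):
        base_shift, sllp, count = cells[i:i + 3]
        i += 3
        for _ in range(count):
            shift, penc = cells[i:i + 2]
            i += 2
            entries.append((base_shift, shift, sllp, penc))
    return entries


# Property values copied from the two device trees quoted in this report.
WITHOUT_HUGE = parse_cells("""
    0000000c 00000000 00000001 0000000c
    00000000 00000010 00000110 00000001
    00000010 00000001
""")
WITH_HUGE = parse_cells("""
    0000000c 00000000 00000002 0000000c
    00000000 00000018 00000038 00000010
    00000110 00000002 00000010 00000001
    00000018 00000008 00000018 00000100
    00000001 00000018 00000000
""")

for label, cells in (("without huge pages", WITHOUT_HUGE),
                     ("with huge pages", WITH_HUGE)):
    print(label)
    for base_shift, shift, sllp, penc in decode_segment_page_sizes(cells):
        print(f"  base_shift={base_shift}: shift={shift}, "
              f"sllp=0x{sllp:04x}, penc={penc}")
```

Decoded this way, the second property carries the three extra entries with shift=24 (16 MiB pages) that also show up in the printk buffer of the failing boot.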
I just noticed there is even a kernel Oops recorded in the printk buffer of the crashed kernel:
Using 1TB segments
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-445.el7.ppc64le #1
task: c000000001116790 ti: c000000001180000 task.ti: c000000001180000
NIP: 0000000000c819e0 LR: 0000000000c819e0 CTR: c0000000000b9ed0
REGS: c000000001183c30 TRAP: 0700 Not tainted (3.10.0-445.el7.ppc64le)
MSR: 8000000000023003 <SF,FP,ME,RI,LE> CR: 22002048 XER: 00000000
CFAR: 000000000005ca84 SOFTE: 0
GPR00: 0000000000c819e0 c000000001183eb0 c000000001185f00 fffffffffffffffe
GPR04: 0000000000000000 c000000001183e00 40016e7779000015 0000000000000190
GPR08: 000000014052440b 0000000000000000 0000000000000000 c000000001314c48
GPR12: 0000000000002200 c00000000fb80000 000000003dc55d58 0000000002b59360
GPR16: 0000000002b59358 0000000002000000 0000000000000060 0000000002b58c70
GPR20: 000000003dc55d50 fffffffffffffffd 000000003dc55d10 000000003e454400
GPR24: 000000002fff0000 c000000001208190 c000000000000000 0000000050000000
GPR28: c0000000010b8638 c000000000000000 c0000000015100c0 c000000001314c48
NIP [0000000000c819e0] 0xc819e0
LR [0000000000c819e0] 0xc819e0
Call Trace:
[c000000001183eb0] [0000000000c819e0] 0xc819e0 (unreliable)
[c000000001183f50] [0000000000c79bec] 0xc79bec
[c000000001183f90] [0000000000009b34] 0x9b34
Instruction dump:
7d2a4a14 7fbe4840 409c0034 e8df046e e8be0000 eb7e0008 e8ff0472 7cbdd378
78a50100 7fa3eb78 7c9dda14 4b3dadbd 0b030000 3bde0020 4bffffbc 3860ffff
---[ end trace 1b75b31a2719ed1c ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 180 seconds..RTAS system-reboot returned -1
Not sure, but I think the problem might be that you've only added huge-page support to the additional DIMM, but not to the main memory. If I add the additional parameter "-mem-path /mnt/kvm_hugepage" so that huge pages are also enabled for the main memory, the guest boots up fine.
So I've got to find out whether it should work without the main "-mem-path" option, too (do you know whether this works on x86?), or whether we should simply refuse that configuration when starting QEMU...
I've done some more research, and as far as I can see, the problem is indeed that the guest kernel tries to do htab_bolt_mapping() (see arch/powerpc/mm/hash_utils_64.c in the kernel sources) with huge pages on the main memory region, which is not backed by huge pages on the host.
Just found BZ 1265576 which seems to be related ... however, not sure yet why it's working there but not here...
... ok, after reading through the other BZ, it seems like we're hitting exactly the situation described here:
https://bugzilla.redhat.com/show_bug.cgi?id=1265576#c20
So we likely need to add some more checks to QEMU to make sure that in such mixed configurations, we also do not signal huge page support to the guests.
Suggested a patch upstream:
http://news.gmane.org/find-root.php?message_id=1466585405-3769-1-git-send-email-thuth@redhat.com
Patch has been merged upstream:
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=86b50f2e1befc3340
Fix included in qemu-kvm-rhev-2.6.0-11.el7
I retested the case with the fixed version qemu-kvm-rhev-2.6.0-11.el7, but it seems the problem was not resolved. The guest still can't boot up with the hugepage memdev. The start command line is the same as in comment #c0. After removing the hugepage memdev from the command line, the guest boots up normally. So I think the problem was not fixed in this version.
Host kernel version: 3.10.0-456.el7.ppc64le
Darn! Looks like my patch only fixes the issue when there are no "-numa" options in the CLI parameters list.
I was now able to reproduce the problem again with the following shortened command line, too:
/usr/libexec/qemu-kvm -nographic -vga none -hda /path/to/hd.img \
-m 16384,slots=4,maxmem=32G \
-object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1
I'll have another closer look at the code to see what's going wrong now...
I've now suggested another fix upstream:
http://news.gmane.org/find-root.php?message_id=1468570225-14101-1-git-send-email-thuth@redhat.com
Fix included in qemu-kvm-rhev-2.6.0-16.el7
Hi, Zhengtong
Please help verify it in the latest version, thanks!
Tested with "qemu-kvm-rhev-2.6.0-16.el7". Both ppc64 and ppc64le guests can boot up successfully and no longer get stuck in the firmware.
Boot cmd:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox off \
-machine pseries \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_9lJfu6/monitor-qmpmonitor1-20160616-224516-9paJv3ys,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_9lJfu6/monitor-catch_monitor-20160616-224516-9paJv3ys,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_9lJfu6/serial-serial0-20160616-224516-9paJv3ys,server,nowait \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
-device pci-ohci,id=usb1,bus=pci.0,addr=03 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
-drive id=drive_image1,snapshot=on,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/RHEL-Server-7.3-ppc64-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 \
-device virtio-net-pci,mac=9a:93:94:95:96:97,id=ideoMOlt,vectors=4,netdev=idMh3R0C,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on \
-netdev tap,id=idMh3R0C,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 4096,slots=4,maxmem=32G \
-object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-device usb-kbd \
-device usb-mouse \
-vnc :0 \
-monitor stdio
So the bug is verified
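The rule that the comments above converge on (do not signal huge page support to the guest when only part of its memory is backed by huge pages) can be illustrated with a short model. This is not QEMU's code and does not claim to mirror the merged patches; `RamRegion`, `advertisable_shifts` and the 64 KiB host base page size (shift 16) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RamRegion:
    """One chunk of guest RAM and the host page size that backs it."""
    name: str
    backing_page_shift: int  # log2 of the host page size, e.g. 24 for 16 MiB


def advertisable_shifts(candidate_shifts, regions):
    """Keep only page shifts that every RAM region can back.

    A page size is safe to offer to the guest only if it is no larger than
    the smallest host page size backing any part of guest memory.
    """
    limit = min(region.backing_page_shift for region in regions)
    return [shift for shift in candidate_shifts if shift <= limit]


CANDIDATES = [12, 16, 24]  # 4 KiB, 64 KiB, 16 MiB, as seen in the guest log

# Mixed setup from this report: main RAM on normal host pages (assumed to be
# 64 KiB), the extra DIMM on 16 MiB huge pages.
mixed = [RamRegion("main RAM", 16), RamRegion("dimm-mem1", 24)]

# Workaround setup: "-mem-path" puts the main RAM on huge pages as well.
all_huge = [RamRegion("main RAM (-mem-path)", 24), RamRegion("dimm-mem1", 24)]

print(advertisable_shifts(CANDIDATES, mixed))     # 16 MiB pages are dropped
print(advertisable_shifts(CANDIDATES, all_huge))  # 16 MiB pages are kept
```

Under this model the mixed configuration ends up with the same two page sizes that the working boot without huge pages reported, which is the behaviour the fixed package is expected to show.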
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-2673.html
Description of problem:
When booting a guest with a pc-dimm that uses memory-backend-file as its backend, the guest gets stuck after the SLOF output.
Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-433.el7.ppc64le
qemu-kvm-rhev-2.6.0-5.el7
How reproducible:
4/4
Steps to Reproduce:
1. On the host, allocate some huge pages:
#echo 266 > /proc/sys/vm/nr_hugepages
2. Mount the hugetlbfs at the mount point:
#mount -t hugetlbfs -o pagesize=16384K none /mnt/kvm_hugepage
3. Boot the guest with the pc-dimm device:
#/usr/libexec/qemu-kvm \
...
-object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
...
Actual results:
The guest gets stuck after SLOF has loaded.
Expected results:
The guest boots up without problems.
Additional info:
Full guest boot cmd:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox off \
-machine pseries \
-nodefaults \
-vga std \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_9lJfu6/monitor-qmpmonitor1-20160616-224516-9paJv3ys,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_9lJfu6/monitor-catch_monitor-20160616-224516-9paJv3ys,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_9lJfu6/serial-serial0-20160616-224516-9paJv3ys,server,nowait \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
-device pci-ohci,id=usb1,bus=pci.0,addr=03 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
-drive id=drive_image1,snapshot=on,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/test/staf-kvm-devel/workspace/usr/share/avocado/data/avocado-vt/images/RHEL-Server-7.3-ppc64le-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 \
-device virtio-net-pci,mac=9a:93:94:95:96:97,id=ideoMOlt,vectors=4,netdev=idMh3R0C,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on \
-netdev tap,id=idMh3R0C,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 16384,slots=4,maxmem=32G \
-object memory-backend-file,policy=default,mem-path=/mnt/kvm_hugepage,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1 \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-vnc :0 \
-monitor stdio
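As a sanity check on the reproduction steps in the description, the numbers can be worked through in a few lines. This is plain arithmetic on the values quoted in the steps (266 huge pages, pagesize=16384K, a DIMM of size=1G); the variable names are illustrative, and "1G" is taken to mean 1 GiB.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

huge_page_size = 16384 * KIB   # pagesize=16384K from the mount options
nr_hugepages = 266             # value written to /proc/sys/vm/nr_hugepages
dimm_size = 1 * GIB            # size=1G of the memory-backend-file object

pool_bytes = nr_hugepages * huge_page_size    # total huge page pool
pages_for_dimm = dimm_size // huge_page_size  # huge pages the DIMM consumes
page_shift = huge_page_size.bit_length() - 1  # log2 of the huge page size

print(f"huge page pool: {pool_bytes // MIB} MiB")
print(f"pages needed for the DIMM: {pages_for_dimm}")
print(f"huge page shift: {page_shift}")
```

The pool is therefore large enough for the 1 GiB DIMM, so a shortage of huge pages does not explain the hang, and the computed shift of 24 matches the base_shift=24 entry in the guest's page size list.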