Red Hat Bugzilla – Bug 1380258
ppc64le: > 1024GiB of guest RAM will conflict with IO
Last modified: 2017-11-27 22:42:13 EST
Description of problem:
Booting a guest with maxmem=2048G is very slow. In my experiment it still had not booted after 40 minutes. With maxmem=1024G the guest works fine. We were aware of this before and discussed it in a related bz (bug 1263039); creating this bz to track the issue.

Version-Release number of selected component (if applicable):
kernel-3.10.0-510.el7.ppc64le
qemu-kvm-rhev-2.6.0-27.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Boot a guest with maxmem=2048G, e.g.:
# /usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 32G,slots=4,maxmem=2048G \
    -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off \
    -nodefaults -monitor stdio -rtc base=utc -device spapr-vscsi,id=scsi0,reg=0x1000 \
    -drive file=rhel72-ppc64le-virtio.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,bootindex=1,id=scsi0-0-0-0 \
    -drive if=none,id=drive-scsi0-0-1-0,readonly=on \
    -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1-0 \
    -vnc :10 -msg timestamp=on -usb -device usb-tablet,id=tablet1 -vga std -qmp tcp:0:4666,server,nowait \
    -netdev tap,id=hostnet1,script=/etc/qemu-ifup,vhost=on \
    -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:54:5a:5f:5b:5c

Actual results:
The guest had not booted after 40 minutes; still waiting for it, and not sure when (or whether) it will come up.

Expected results:
The guest boots within a few seconds.

Additional info:
Host info:
# free -m
              total        used        free      shared  buff/cache   available
Mem:         519147       36950      480287          35        1908      480527
Swap:          4095           0        4095

# cat /proc/cpuinfo
processor : 0
cpu       : POWER8NVL (raw), altivec supported
clock     : 4023.000000MHz
revision  : 1.0 (pvr 004c 0100)
(identical entries repeat for processors 8, 16, 24, ..., 120)
timebase  : 512000000
platform  : PowerNV
model     : 8335-GTB
machine   : PowerNV 8335-GTB
firmware  : OPAL v3
Created attachment 1205815 [details] Screenshot of the guest
Still hadn't booted up after nearly 1 day.
Ouch. That's much worse than I thought. I'm investigating.
Even maxmem values just a bit above 1024G can reproduce it.
Unfortunately, I haven't been able to reproduce this. I'm unable to run a guest with maxmem > 256G on our machine, because it's not able to allocate enough contiguous memory for the guest's hash page table, so the guest doesn't start at all. What are the symptoms when the system doesn't start? Is there any output on the console at all? For simplicity, can you also please try with no VGA or USB devices, just the spapr-vty console?
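If it helps, a stripped-down invocation along those lines might look like the following (an illustrative sketch only; the drive and netdev options from comment 0 are omitted here and would need to be added back):

# /usr/libexec/qemu-kvm -machine pseries,accel=kvm,usb=off -m 32G,slots=4,maxmem=2048G \
    -smp 4 -nodefaults -nographic \
    -chardev stdio,id=console0 -device spapr-vty,chardev=console0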
Ok, after IRC discussion, I see that this is not the bug I thought it was. There's a known problem with slow startup with large maxmem values, but that occurs before even the firmware executes. This bug is a hang during actual kernel boot up, and only seems to occur with large enough maxmem *and* VGA+USB present.
Yes: with VGA removed, the guest boots up successfully within a few seconds with 2048G maxmem:

/usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 32G,slots=4,maxmem=2048G \
    -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off \
    -nodefaults -serial stdio -rtc base=utc -device spapr-vscsi,id=scsi0,reg=0x1000 \
    -drive file=rhel72-ppc64le-virtio.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,bootindex=1,id=scsi0-0-0-0 \
    -drive if=none,id=drive-scsi0-0-1-0,readonly=on \
    -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1-0 \
    -netdev tap,id=hostnet1,script=/etc/qemu-ifup,vhost=on \
    -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:54:5a:5f:5b:5c \
    -usb -device usb-tablet,id=tablet1
Some further observations:
* Trips with just VGA, but not USB (in this case the guest doesn't use the VGA as console, but we still see the hang on the vty console)
* Doesn't trip with i6300esb (as a different emulated PCI device)

My best guess at this point is that we're somehow getting an overlap between the memory and some IO region, but I don't really know how.
Re-tested with the following configurations and got no different results:
(1) RHEL7.3 guest and RHEL7.2-z
(2) std VGA and virtio-vga
All of them reproduce this bug.
An "info mtree" from qemu monitor could help. It seems the base address of PCI interface is at 0x0000010080000000, which is 1024 GiB + 2 GiB. So I think memory is overlapping PCI I/O address space. (qemu) info mtree address-space: memory 0000000000000000-ffffffffffffffff (prio 0, RW): system 0000000000000000-000000003fffffff (prio 0, RW): ppc_spapr.ram 0000000040000000-0000000fffffffff (prio 0, RW): hotplug-memory 0000010080000000-000001008000ffff (prio 0, RW): alias pci@800000020000000.io-alias @pci@800000020000000.io 0000000000000000-000000000000ffff 00000100a0000000-000001101fffffff (prio 0, RW): alias pci@800000020000000.mmio-alias @pci@800000020000000.mmio 0000000080000000-0000000fffffffff from include/hw/pci-host/spapr.h: #define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x00080000000ULL
Ouch. I suspected some kind of RAM / IO collision, but that's rather less subtle than I expected. I guess 1 TiB of RAM seemed like an enormous amount when I first wrote that constant. Ok, so there are two things we need to do:

1) In the short term, have qemu error out gracefully if more than 1 TiB of RAM is requested.

2) Longer term, change the default placement of the PHBs. We wanted to change the spacing anyway to allow for more big-IO cards in each PCI domain (particularly the nVidia cards, which have enormous MMIO BARs), so we can fold these two changes together.

So working out how to do the placement change without breaking compatibility or migration just took a big leap up my priority list.
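For illustration, a minimal sketch of what the short-term guard (item 1) could look like; the function name and its placement are hypothetical, not the actual patch, and a real check may need to account for where the hotplug-memory region actually ends rather than maxram_size alone:

/* Hypothetical sketch (in QEMU this would sit near the spapr machine init
 * code and need qapi/error.h and hw/boards.h).  Reject guests whose RAM
 * could reach the PCI windows that currently start at 1 TiB. */
#define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL   /* 1 TiB, from spapr.h */

static void spapr_validate_ram_size(MachineState *machine, Error **errp)
{
    if (machine->maxram_size > SPAPR_PCI_WINDOW_BASE) {
        error_setg(errp, "maxmem above 1 TiB is not supported: "
                   "RAM would collide with the PCI IO/MMIO windows");
    }
}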
*** Bug 1380618 has been marked as a duplicate of this bug. ***
Created attachment 1207073 [details] guest xml
Created attachment 1207074 [details] guest console logs
------- Comment From ckumar27@in.ibm.com 2016-10-04 01:38 EDT -------
*** This bug has been marked as a duplicate of bug 147001 ***
I've posted a series of patches to address this upstream. It's structured as a minimal fix (2 patches) followed by a more extensive fix which addresses additional problems (2 more patches). Once it's thrashed out upstream, my intention is to backport the minimal fix for 7.3.z; 7.4 should get the whole set via rebase.
I've revised the series mentioned in comment 17 and posted it upstream. I'm hoping this one will be good enough to merge.
This is now merged upstream, we should get the fix in the rebase.
Verified this bug with qemu-kvm-rhev-2.9.0-0.el7.patchwork201703291116.ppc64le.rpm using the same steps as in comment 0; the issue has been fixed. The guest now boots up successfully in around 1 minute. I'll re-test once the official qemu-kvm-rhev-2.9 comes out.
This bug is verified as passing with qemu-kvm-rhev-2.9.0-2.el7.ppc64le.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392