Description of problem:
An error in the HPT resizing code in qemu means that after successfully resizing the hashed page table with KVM, the new size is not properly recorded in qemu. As a result, an incorrect size will be used after a migration, most likely leading to a guest crash shortly afterwards.

Version-Release number of selected component (if applicable):

How reproducible:
Theoretically close to 100%; I haven't actually tried it in practice yet.

Steps to Reproduce:
1. Start a POWER8 guest on a POWER8 host
2. In the guest, check the HPT size in /sys/kernel/debug/powerpc/hpt_order
3. Increase the guest HPT size by one by writing the current value plus one to /sys/kernel/debug/powerpc/hpt_order
4. Verify that the HPT resize completed successfully by checking /sys/kernel/debug/powerpc/hpt_order again
5. Migrate the guest (single host migration should be sufficient)

Actual results:
Guest crash, hang or other failure.

Expected results:
Guest continues to run normally.

Additional info:
NOTE: The symptoms described above are theoretical; I haven't yet attempted this.
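Steps 2 to 4 could be scripted inside the guest roughly as follows. This is only a sketch: the helper names `read_hpt_order` and `bump_hpt_order` are made up for illustration, and it assumes the debugfs file accepts a plain decimal value, as the `echo` usage elsewhere in this report suggests. It needs root and a mounted debugfs.

```python
from pathlib import Path

# Path taken from the reproduction steps; it only exists inside a POWER guest.
HPT_ORDER_PATH = "/sys/kernel/debug/powerpc/hpt_order"


def read_hpt_order(path=HPT_ORDER_PATH):
    """Return the current HPT order (log2 of the HPT size in bytes)."""
    return int(Path(path).read_text().strip())


def bump_hpt_order(path=HPT_ORDER_PATH, delta=1):
    """Request a resize by `delta` orders and verify that it took effect.

    Returns (old_order, new_order); raises if the value read back does not
    match the requested one.
    """
    old = read_hpt_order(path)
    wanted = old + delta
    Path(path).write_text(f"{wanted}\n")
    new = read_hpt_order(path)
    if new != wanted:
        raise RuntimeError(f"HPT resize failed: wanted {wanted}, got {new}")
    return old, new
```

Calling `bump_hpt_order()` in the guest and then migrating it covers steps 2 to 5.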
An upstream fix is merged in my staging tree, but not in master yet.

Blocker justification:
* This is a critical bug in a major new feature for RHEL 7.5
* The fix is very simple: a one-liner
Tested on P8 with:
guest: kernel-3.10.0-847.el7.ppc64le
host: kernel-3.10.0-847.el7.ppc64le
      qemu-kvm-rhev-2.10.0-20.el7.ppc64le

CLI:
... -smp cpus=1,maxcpus=2,sockets=1,cores=1,threads=1 \
    -m 8192,slots=4,maxmem=32G \
    ... [-incoming tcp:0:4444]

Commands in src:
# cat /sys/kernel/debug/powerpc/hpt_order
26
# echo 27 > /sys/kernel/debug/powerpc/hpt_order
[ 49.353023] lpar: Attempting to resize HPT to shift 27
[ 49.556624] lpar: HPT resize to shift 27 complete (103 ms / 99 ms)
# cat /sys/kernel/debug/powerpc/hpt_order
27

Migration:
(qemu) migrate_set_speed 1G
(qemu) migrate tcp:localhost:4444

Result on dst, qemu crashes:
qemu-kvm: htab_load() bad index 4143282 (0+-1 entries) in htab stream (htab_shift=26)
qemu-kvm: error while loading state section id 625(spapr/htab)
qemu-kvm: load of migration failed: Invalid argument
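The "bad index" message is consistent with a simple bounds check against a table sized from the stale shift. The sketch below is not the qemu code; it assumes 16 bytes per hashed page table entry, that the check is `index + n_valid + n_invalid > number of entries`, and that the logged count of `-1` is an unsigned 16-bit 65535 printed as signed. The names `htab_entries` and `chunk_fits` are illustrative.

```python
HASH_PTE_SIZE_64 = 16  # assumed size in bytes of one hashed page table entry


def htab_entries(htab_shift):
    """Number of entries in an HPT of 2**htab_shift bytes."""
    return (1 << htab_shift) // HASH_PTE_SIZE_64


def chunk_fits(index, n_valid, n_invalid, htab_shift):
    """True if a chunk of the htab stream fits in a table of the given shift.

    The counts are treated as unsigned 16-bit values (assumption), so a
    logged -1 becomes 65535.
    """
    n_valid &= 0xFFFF
    n_invalid &= 0xFFFF
    return index + n_valid + n_invalid <= htab_entries(htab_shift)


# Values from the error message above: the chunk overruns a shift-26 table
# (4194304 entries) but fits in the shift-27 table the guest really uses.
print(chunk_fits(4143282, 0, -1, 26))  # False
print(chunk_fits(4143282, 0, -1, 27))  # True
```

Under these assumptions the failing chunk ends at entry 4208817, past the 4194304 entries of a shift-26 table, which matches the shift that qemu logged rather than the shift 27 reported by the guest.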
As expected (see comment 2), the following patch fixes the problem described in comment 3:

hw/ppc/spapr_hcall: set htab_shift after kvmppc_resize_hpt_commit
http://patchwork.ozlabs.org/patch/873114/
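Going only by the patch title and the description in this report, the bug and the fix can be modelled with a toy example. Nothing here is qemu code: `ToyMachine`, `commit_resize` and the `record` flag are hypothetical stand-ins for the machine state, the commit step, and the presence of the one-line fix.

```python
class ToyMachine:
    """Toy stand-in for the machine state; not the qemu data structure."""

    def __init__(self, htab_shift):
        self.htab_shift = htab_shift  # size qemu has recorded (used for migration)
        self.kvm_shift = htab_shift   # size the hypervisor is actually using


def commit_resize(machine, new_shift, record=True):
    """Model a successful HPT resize commit.

    With record=False the recorded shift is left stale, which models the
    bug; with record=True it is updated, which models the fix.
    """
    machine.kvm_shift = new_shift
    if record:
        machine.htab_shift = new_shift


buggy = ToyMachine(26)
commit_resize(buggy, 27, record=False)
print(buggy.htab_shift, buggy.kvm_shift)  # 26 27: migration would use the stale 26

fixed = ToyMachine(26)
commit_resize(fixed, 27)
print(fixed.htab_shift, fixed.kvm_shift)  # 27 27
```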
*** Bug 1547753 has been marked as a duplicate of this bug. ***
Reproduced on 3.10.0-845.el7.ppc64le and qemu-kvm-rhev-2.10.0-21.el7.ppc64le with the same steps as comment 3.

[root@ibm-p8-kvm-02-qe qzhang]# /usr/libexec/qemu-kvm \
    -name test \
    -machine pseries,accel=kvm,usb=off \
    -m 4G \
    -smp 4,sockets=1,cores=4,threads=1 \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -realtime mlock=off \
    -rtc base=utc \
    -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 \
    -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -drive if=none,id=drive-scsi0-0-1-0,readonly=on \
    -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 \
    -msg timestamp=on \
    -device nec-usb-xhci \
    -device usb-tablet,id=tablet1 \
    -qmp tcp:0:5555,server,nowait \
    -vnc :11 \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -monitor stdio \
    -vga std \
    -device usb-kbd \
    -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 \
    -netdev tap,script=/etc/qemu-ifup,id=netdev0 \
    -incoming tcp:0:5800
QEMU 2.10.0 monitor - type 'help' for more information
(qemu)
(qemu)
(qemu) info status
VM status: paused (inmigrate)
(qemu) 2018-02-27T06:52:05.874959Z qemu-kvm: htab_load() bad index 2097136 (1+6191 entries) in htab stream (htab_shift=25)
2018-02-27T06:52:05.875105Z qemu-kvm: error while loading state section id 542(spapr/htab)
2018-02-27T06:52:05.876045Z qemu-kvm: load of migration failed: Invalid argument
Reproduced the issue on kernel-3.10.0-842.el7.ppc64le and qemu-kvm-rhev-2.10.0-20.el7.ppc64le.
For the steps, please refer to comment 3.

DST CLI:
/usr/libexec/qemu-kvm \
    -name test \
    -machine pseries,accel=kvm,usb=off \
    -m 4G \
    -smp 4,sockets=1,cores=4,threads=1 \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -realtime mlock=off \
    -rtc base=utc \
    -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 \
    -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -drive if=none,id=drive-scsi0-0-1-0,readonly=on \
    -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 \
    -msg timestamp=on \
    -device nec-usb-xhci \
    -device usb-tablet,id=tablet1 \
    -qmp tcp:0:5556,server,nowait \
    -vnc :12 \
    -chardev socket,id=serial_id_serial0,path=/tmp/SS,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -monitor stdio \
    -vga std \
    -device usb-kbd \
    -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 \
    -netdev tap,script=/etc/qemu-ifup,id=netdev0 \
    -incoming tcp:0:5800
QEMU 2.10.0 monitor - type 'help' for more information
(qemu) 2018-03-02T03:21:37.812932Z qemu-kvm: htab_load() bad index 2059752 (0+-1 entries) in htab stream (htab_shift=25)
2018-03-02T03:21:37.813028Z qemu-kvm: error while loading state section id 542(spapr/htab)
2018-03-02T03:21:37.813932Z qemu-kvm: load of migration failed: Invalid argument
(In reply to Min Deng from comment #11)
> Reproduced the issue on kernel-3.10.0-842.el7.ppc64le and
> qemu-kvm-rhev-2.10.0-20.el7.ppc64le.
...
> (qemu) 2018-03-02T03:21:37.812932Z qemu-kvm: htab_load() bad index 2059752
> (0+-1 entries) in htab stream (htab_shift=25)
> 2018-03-02T03:21:37.813028Z qemu-kvm: error while loading state section id
> 542(spapr/htab)
> 2018-03-02T03:21:37.813932Z qemu-kvm: load of migration failed: Invalid
> argument

The backport missed the GA window, so it is not included in qemu-kvm-rhev-2.10.0-20.el7.ppc64le.
------- Comment From satheera.com 2018-05-17 01:18 EDT-------
Verified with the levels below and found it working.

Host:
3.10.0-862.el7.ppc64le
qemu-kvm-rhev-2.10.0-21.el7_5.1.ppc64le
libvirt-3.9.0-14.el7.ppc64le

Guest:
3.10.0-862.el7.ppc64le

Test:
# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.ppc64le on an ppc64le

localhost login: root
Password:
Last login: Fri May 4 20:46:17 on hvc0
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:              3           0           2           0           0           3
Swap:             1           0           1
# cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh attach-device vm1 mem.xml --live
Device attached successfully

# free -g
              total        used        free      shared  buff/cache   available
Mem:             11           0          10           0           0          11
Swap:             1           0           1
# cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh managedsave vm1
Domain vm1 state saved by libvirt

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

[root@localhost ~]# ls
anaconda-ks.cfg

Regards,
-Satheesh

------- Comment From satheera.com 2018-05-17 01:20 EDT-------
Small typo in the result pasted earlier:

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.ppc64le on an ppc64le

localhost login: root
Password:
Last login: Fri May 4 20:46:17 on hvc0
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:              3           0           2           0           0           3
Swap:             1           0           1
# cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh attach-device vm1 mem.xml --live
Device attached successfully

# free -g
              total        used        free      shared  buff/cache   available
Mem:             11           0          10           0           0          11
Swap:             1           0           1
# cat /sys/kernel/debug/powerpc/hpt_order
25

# virsh managedsave vm1
Domain vm1 state saved by libvirt

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

[root@localhost ~]# ls
anaconda-ks.cfg
------- Comment From satheera.com 2018-05-17 01:21 EDT-------
Closing as the issue is fixed.
Reproduced the issue on:
kernel-3.10.0-842.el7.ppc64le
qemu-kvm-rhev-2.10.0-20.el7.ppc64le

Verified the bug on:
kernel-3.10.0-915.el7.ppc64le - host
kernel-3.10.0-919.el7.ppc64le - guest
qemu-kvm-rhev-2.12.0-6.el7.ppc64le

Steps:
1. Boot up a guest on src:
/usr/libexec/qemu-kvm \
    -name test \
    -machine pseries,accel=kvm,usb=off \
    -m 4G \
    -smp 4,sockets=1,cores=4,threads=1 \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -realtime mlock=off \
    -rtc base=utc \
    -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 \
    -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -drive if=none,id=drive-scsi0-0-1-0,readonly=on \
    -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 \
    -msg timestamp=on \
    -device nec-usb-xhci \
    -device usb-tablet,id=tablet1 \
    -qmp tcp:0:5556,server,nowait \
    -vnc :12 \
    -chardev socket,id=serial_id_serial0,path=/tmp/SS,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -monitor stdio \
    -vga std \
    -device usb-kbd \
    -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 \
    -netdev tap,script=/etc/qemu-ifup,id=netdev0
On dst, cli:
... -incoming tcp:0:5800

2. Change hpt_order in the guest:
cat /sys/kernel/debug/powerpc/hpt_order
25
echo 26 > /sys/kernel/debug/powerpc/hpt_order

3. Migrate:
(qemu) migrate_set_speed 1G
(qemu) migrate tcp:localhost:5800

Actual results:
Migrated successfully.

Expected results:
Migrated successfully. Guest continues to run normally.

Based on the above, the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3443