Bug 1545016
Summary: | [ppc64] Migration will fail after HPT resizing | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | David Gibson <dgibson> | |
Component: | qemu-kvm-rhev | Assignee: | David Gibson <dgibson> | |
Status: | CLOSED ERRATA | QA Contact: | Min Deng <mdeng> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 7.5 | CC: | bugproxy, dzheng, hannsj_uhl, jen, jherrman, juzhang, knoel, lmiksik, lvivier, micai, michen, mrezanin, mtessun, qzhang, rhodain, salmy, toneata, virt-maint, xianwang | |
Target Milestone: | rc | Keywords: | Patch, ZStream | |
Target Release: | 7.6 | |||
Hardware: | ppc64le | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | qemu-kvm-rhev-2.12.0-1.el7 | Doc Type: | Bug Fix | |
Doc Text: |
Due to an error in the code for resizing the hashed page table (HPT), migrated guests on an IBM POWER host terminated unexpectedly. This update ensures that the size of the HPT is recorded correctly during migration, which prevents the described crashes from occurring.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1550136 1552627 1554956 | Environment: | ||
Last Closed: | 2018-11-01 11:04:15 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1507957, 1513404, 1528344, 1539427, 1550136, 1552627, 1554956, 1577865, 1609081 |
Description
David Gibson
2018-02-14 04:03:45 UTC
An upstream fix is merged in my staging tree, but not in master yet.

Blocker justification:
* This is a critical bug in a major new feature for RHEL 7.5
* The fix is very simple - a one liner

Tested on P8 with:
guest: kernel-3.10.0-847.el7.ppc64le
host: kernel-3.10.0-847.el7.ppc64le
qemu-kvm-rhev-2.10.0-20.el7.ppc64le

CLI:
... -smp cpus=1,maxcpus=2,sockets=1,cores=1,threads=1 \
    -m 8192,slots=4,maxmem=32G \
    ... [-incoming tcp:0:4444]

Commands in src:
# cat /sys/kernel/debug/powerpc/hpt_order
26
# echo 27 > /sys/kernel/debug/powerpc/hpt_order
[ 49.353023] lpar: Attempting to resize HPT to shift 27
[ 49.556624] lpar: HPT resize to shift 27 complete (103 ms / 99 ms)
# cat /sys/kernel/debug/powerpc/hpt_order
27

Migration:
(qemu) migrate_set_speed 1G
(qemu) migrate tcp:localhost:4444

Result on dst, qemu crashes:
qemu-kvm: htab_load() bad index 4143282 (0+-1 entries) in htab stream (htab_shift=26)
qemu-kvm: error while loading state section id 625(spapr/htab)
qemu-kvm: load of migration failed: Invalid argument

As expected (see comment 2), the following patch fixes the problem described in comment 3:

hw/ppc/spapr_hcall: set htab_shift after kvmppc_resize_hpt_commit
http://patchwork.ozlabs.org/patch/873114/

*** Bug 1547753 has been marked as a duplicate of this bug. ***

Reproduced on 3.10.0-845.el7.ppc64le and qemu-kvm-rhev-2.10.0-21.el7.ppc64le with the same steps as comment 3.
[root@ibm-p8-kvm-02-qe qzhang]# /usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 4G -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off -rtc base=utc -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 -msg timestamp=on -device nec-usb-xhci -device usb-tablet,id=tablet1 -qmp tcp:0:5555,server,nowait -vnc :11 -chardev socket,id=serial_id_serial0,path=/var/tmp/serial,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -monitor stdio -vga std -device usb-kbd -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 -netdev tap,script=/etc/qemu-ifup,id=netdev0 -incoming tcp:0:5800
QEMU 2.10.0 monitor - type 'help' for more information
(qemu)
(qemu)
(qemu) info status
VM status: paused (inmigrate)
(qemu) 2018-02-27T06:52:05.874959Z qemu-kvm: htab_load() bad index 2097136 (1+6191 entries) in htab stream (htab_shift=25)
2018-02-27T06:52:05.875105Z qemu-kvm: error while loading state section id 542(spapr/htab)
2018-02-27T06:52:05.876045Z qemu-kvm: load of migration failed: Invalid argument

Reproduced the issue on kernel-3.10.0-842.el7.ppc64le and qemu-kvm-rhev-2.10.0-20.el7.ppc64le.
For the steps, please refer to comment 3.
DST CLI:
/usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 4G -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off -rtc base=utc -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 -msg timestamp=on -device nec-usb-xhci -device usb-tablet,id=tablet1 -qmp tcp:0:5556,server,nowait -vnc :12 -chardev socket,id=serial_id_serial0,path=/tmp/SS,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -monitor stdio -vga std -device usb-kbd -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 -netdev tap,script=/etc/qemu-ifup,id=netdev0 -incoming tcp:0:5800
QEMU 2.10.0 monitor - type 'help' for more information
(qemu) 2018-03-02T03:21:37.812932Z qemu-kvm: htab_load() bad index 2059752 (0+-1 entries) in htab stream (htab_shift=25)
2018-03-02T03:21:37.813028Z qemu-kvm: error while loading state section id 542(spapr/htab)
2018-03-02T03:21:37.813932Z qemu-kvm: load of migration failed: Invalid argument

(In reply to Min Deng from comment #11)
> Reproduced the issue on kernel-3.10.0-842.el7.ppc64le and
> qemu-kvm-rhev-2.10.0-20.el7.ppc64le.
...
> (qemu) 2018-03-02T03:21:37.812932Z qemu-kvm: htab_load() bad index 2059752
> (0+-1 entries) in htab stream (htab_shift=25)
> 2018-03-02T03:21:37.813028Z qemu-kvm: error while loading state section id
> 542(spapr/htab)
> 2018-03-02T03:21:37.813932Z qemu-kvm: load of migration failed: Invalid
> argument

The backport has missed the GA window so it is not included in qemu-kvm-rhev-2.10.0-20.el7.ppc64le

------- Comment From satheera.com 2018-05-17 01:18 EDT-------
Verified with below levels and found working.
Host:
3.10.0-862.el7.ppc64le
qemu-kvm-rhev-2.10.0-21.el7_5.1.ppc64le
libvirt-3.9.0-14.el7.ppc64le

Guest:
3.10.0-862.el7.ppc64le

Test:
# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.ppc64le on an ppc64le

localhost login: root
Password:
Last login: Fri May 4 20:46:17 on hvc0
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:              3           0           2           0           0           3
Swap:             1           0           1
#cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh attach-device vm1 mem.xml --live
Device attached successfully

# free -g
              total        used        free      shared  buff/cache   available
Mem:             11           0          10           0           0          11
Swap:             1           0           1
#cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh managedsave vm1
Domain vm1 state saved by libvirt

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

[root@localhost ~]# ls
anaconda-ks.cfg

Regards,
-Satheesh

------- Comment From satheera.com 2018-05-17 01:20 EDT-------
Small typo in the result pasted earlier:

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.ppc64le on an ppc64le

localhost login: root
Password:
Last login: Fri May 4 20:46:17 on hvc0
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:              3           0           2           0           0           3
Swap:             1           0           1
#cat /sys/kernel/debug/powerpc/hpt_order
23

# virsh attach-device vm1 mem.xml --live
Device attached successfully

# free -g
              total        used        free      shared  buff/cache   available
Mem:             11           0          10           0           0          11
Swap:             1           0           1
#cat /sys/kernel/debug/powerpc/hpt_order
25

# virsh managedsave vm1
Domain vm1 state saved by libvirt

# virsh start vm1
Domain vm1 started

# virsh console vm1
Connected to domain vm1
Escape character is ^]

[root@localhost ~]# ls
anaconda-ks.cfg

------- Comment From satheera.com 2018-05-17 01:21 EDT-------
Closing as the issue is fixed.
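
For reference, the `htab_load() bad index` messages quoted in the comments above are consistent with a simple bounds check on the destination: a chunk of the HPT stream is rejected when it would run past the end of a table sized from the (stale) `htab_shift`. The following is a minimal Python sketch of that arithmetic, not QEMU's code. It assumes an HPT of 2^htab_shift bytes with 16-byte entries, and that the `-1` entry count in the log is a 16-bit value of 0xFFFF printed as signed. The function names are hypothetical.

```python
# Simplified model of the destination-side bounds check (assumptions are
# labeled; this is not the QEMU implementation).

HASH_PTE_SIZE_64 = 16  # assumed size in bytes of one hashed page table entry


def hpt_capacity(htab_shift):
    """Number of entry slots in an HPT of 2**htab_shift bytes."""
    return (1 << htab_shift) // HASH_PTE_SIZE_64


def as_u16(count):
    """Reinterpret a count logged as signed 16-bit (e.g. -1) as unsigned."""
    return count & 0xFFFF


def chunk_fits(index, n_valid, n_invalid, htab_shift):
    """True if the chunk [index, index + n_valid + n_invalid) stays in the table."""
    end = index + as_u16(n_valid) + as_u16(n_invalid)
    return end <= hpt_capacity(htab_shift)


# (index, n_valid, n_invalid, htab_shift) taken from the three logged errors.
logged = [
    (4143282, 0, -1, 26),
    (2097136, 1, 6191, 25),
    (2059752, 0, -1, 25),
]

for index, n_valid, n_invalid, shift in logged:
    fits = chunk_fits(index, n_valid, n_invalid, shift)
    print(f"shift={shift} capacity={hpt_capacity(shift)} index={index} fits={fits}")

# The first guest had been resized to shift 27 on the source; with that
# shift the same chunk is in range.
print("first chunk at shift 27 fits:", chunk_fits(4143282, 0, -1, 27))
```

Under these assumptions all three logged chunks fall outside the table implied by the reported `htab_shift`, while the first one fits once the post-resize shift of 27 is used. This matches the patch title: the new shift has to be recorded after the resize commit so that the destination sizes its table correctly.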
Reproduced the issue on
kernel-3.10.0-842.el7.ppc64le
qemu-kvm-rhev-2.10.0-20.el7.ppc64le

Verified the bug on
kernel-3.10.0-915.el7.ppc64le - host
kernel-3.10.0-919.el7.ppc64le - guest
qemu-kvm-rhev-2.12.0-6.el7.ppc64le

Steps:
1. Boot up a guest on src:
/usr/libexec/qemu-kvm -name test -machine pseries,accel=kvm,usb=off -m 4G -smp 4,sockets=1,cores=4,threads=1 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -realtime mlock=off -rtc base=utc -device virtio-scsi-pci,bus=pci.0,id=scsi0,addr=0x3 -drive file=rhel75-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on -device scsi-cd,bus=scsi0.0,drive=drive-scsi0-0-1-0,bootindex=2,id=scsi0-0-1 -msg timestamp=on -device nec-usb-xhci -device usb-tablet,id=tablet1 -qmp tcp:0:5556,server,nowait -vnc :12 -chardev socket,id=serial_id_serial0,path=/tmp/SS,server,nowait -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 -monitor stdio -vga std -device usb-kbd -device virtio-net-pci,id=net0,netdev=netdev0,bus=pci.0,addr=0x9 -netdev tap,script=/etc/qemu-ifup,id=netdev0
On dst, CLI:
... -incoming tcp:0:5800
2. Change hpt_order:
cat /sys/kernel/debug/powerpc/hpt_order
25
echo 26 > /sys/kernel/debug/powerpc/hpt_order
3. Migrate:
(qemu) migrate_set_speed 1G
(qemu) migrate tcp:localhost:5800

Actual results: Migrated successfully.
Expected results: Migrated successfully. Guest continues to run normally.

Based on the above, the issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443