Description of problem:
Guests reboot after migration from RHEL7.2.z -> RHEL7.4. Tested two guests, rhel7.2 and win2012r2; both reboot after migration.

Version-Release number of selected component (if applicable):
Source host (rhel7.2.z):
kernel-3.10.0-327.50.1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64
Destination host (rhel7.4):
kernel-3.10.0-558.el7.x86_64
qemu-kvm-rhev-2.8.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot a rhel7.2 or win2012r2 guest on the source host:
# /usr/libexec/qemu-kvm \
-name rhel7 \
-machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off \
-m 2048 \
-cpu Opteron_G4,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-nodefaults \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/mnt/rhel7.2.qcow2,if=none,id=drive-data-disk,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device ide-hd,drive=drive-data-disk,id=system-disk,logical_block_size=512,physical_block_size=512,min_io_size=32,opt_io_size=64,discard_granularity=512,ver=fuxc-ver,bus=ide.0,unit=0 \
-net none \
-monitor stdio \
-qmp tcp:0:4466,server,nowait \
-serial unix:/tmp/ttym,server,nowait \
-spice port=5910,addr=0.0.0.0,disable-ticketing,seamless-migration=on \
-device qxl-vga,id=video0,ram_size=134217728,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2

2. Boot the guest on the destination host with "-incoming tcp:0:5800".

3. Do the migration:
(qemu) migrate -d tcp:10.73.72.56:5800

Actual results:
After step 3, once migration has finished, the guest reboots on the destination.

Expected results:
The guest does not reboot.

Additional info:
I also used the same command line to test migration from RHEL7.3.z -> RHEL7.4 and did not hit this issue.
Confirmed. Happens on Intel as well with a 7.3 guest
Easy to reproduce with:

/usr/libexec/qemu-kvm -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off -cpu IvyBridge -m 4096 -no-hpet -drive file=/home/vms/7.3-fromimage.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline --device isa-serial,chardev=mon

With kvm tracing on the destination I see:

<series of sane looking stuff, page faults, all normal, some console IO>
CPU 0/KVM-11014 [014] .... 6629872.405344: kvm_exit: reason EPT_VIOLATION rip 0xffffffff81060eb6 info 181 0
CPU 0/KVM-11014 [014] .... 6629872.405345: kvm_page_fault: address 13fc09010 error_code 181
CPU 0/KVM-11014 [014] .... 6629872.405347: kvm_inj_exception: #PF (0x0)
CPU 0/KVM-11014 [014] d... 6629872.405347: kvm_entry: vcpu 0
CPU 0/KVM-11014 [014] .... 6629872.405350: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xffffffff8168df90 info 0 800000fd
CPU 0/KVM-11014 [014] .... 6629872.405352: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
CPU 0/KVM-11014 [014] d... 6629872.405371: kvm_entry: vcpu 0
CPU 0/KVM-11014 [014] .... 6629872.405373: kvm_exit: reason EPT_VIOLATION rip 0x8000 info 184 0
CPU 0/KVM-11014 [014] .... 6629872.405373: kvm_page_fault: address 38000 error_code 184
CPU 0/KVM-11014 [014] d... 6629872.405380: kvm_entry: vcpu 0
<then more page faults from 8000..3f000 with codes of 184 or 183>
then
CPU 0/KVM-11014 [014] .... 6629872.405486: kvm_exit: reason EXCEPTION_NMI rip 0xfe04 info 0 80000306
CPU 0/KVM-11014 [014] .... 6629872.405495: kvm_emulate_insn: 30000:fe04:ff ff (real)
CPU 0/KVM-11014 [014] .... 6629872.405496: kvm_inj_exception: #UD (0x0)
CPU 0/KVM-11014 [014] d... 6629872.405497: kvm_entry: vcpu 0
CPU 0/KVM-11014 [014] .... 6629872.405499: kvm_exit: reason TRIPLE_FAULT rip 0xfe04 info 0 0
CPU 0/KVM-11014 [014] .... 6629872.405501: kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
CPU 0/KVM-11014 [014] .... 6629872.405502: kvm_fpu: unload
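For anyone going through a longer capture of the same kind, the following is a minimal Python sketch that picks the SMM entry and the later triple fault out of a trace like the one above and reports the time between them. The function names (parse_trace, smm_to_triple_fault) and the SAMPLE excerpt are made up for this illustration, and the regular expression assumes the line layout shown in this comment; other ftrace configurations may print different columns.

```python
import re
from decimal import Decimal

# Assumed line layout, taken from the trace in this comment:
#   <task>-<pid> [<cpu>] <flags> <timestamp>: <event>: <detail>
LINE_RE = re.compile(
    r"^\s*(?P<task>.+?)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<detail>.*)$"
)


def parse_trace(text):
    """Return a list of (timestamp, event, detail) for lines that match."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            # Decimal keeps the microsecond timestamps exact.
            events.append((Decimal(m["ts"]), m["event"], m["detail"]))
    return events


def smm_to_triple_fault(events):
    """Seconds from the first kvm_enter_smm to the next TRIPLE_FAULT exit.

    Returns None if either event is missing.
    """
    smm_ts = None
    for ts, event, detail in events:
        if event == "kvm_enter_smm" and smm_ts is None:
            smm_ts = ts
        elif (event == "kvm_exit" and "TRIPLE_FAULT" in detail
              and smm_ts is not None):
            return ts - smm_ts
    return None


# Three lines copied from the trace above, used as a small sample.
SAMPLE = """\
CPU 0/KVM-11014 [014] .... 6629872.405352: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
CPU 0/KVM-11014 [014] d... 6629872.405371: kvm_entry: vcpu 0
CPU 0/KVM-11014 [014] .... 6629872.405499: kvm_exit: reason TRIPLE_FAULT rip 0xfe04 info 0 0
"""

if __name__ == "__main__":
    delta = smm_to_triple_fault(parse_trace(SAMPLE))
    print(f"SMM entry to triple fault: {delta} s")
```

Lines that do not match the layout (such as the "<then more page faults ...>" summary) are skipped rather than treated as errors, so the sketch can be fed a hand-edited excerpt.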
I think this is SMM/SMI related but am not sure yet. I think it ends up in SMM - for reasons I don't understand - then runs through an area of junk before finally hitting an undefined instruction and triple faulting. I've not tracked down what that initial 'external interrupt' is - it doesn't seem to match the vector of any registered device on the guest.
Still looking like SMM/SMI. 7.2 ends up setting CPU_INTERRUPT_SMI but ignores it. When we read the migration stream on the destination, that flag is still set and causes the SMI entry. What I've not figured out yet is why 7.3 doesn't do the entry, given that it has the SMI entry code.
OK, the reason 7.3 worked is that it had a bug where SMIs didn't get delivered; that was fixed in 7.4 by: 68c6efe07a4729b54947658df4fceed84f3d0fef
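To summarise the behaviour described in the last few comments, here is a toy Python model. It is not QEMU code: the bit value given to CPU_INTERRUPT_SMI is a placeholder, and the SMI_BEHAVIOUR table and enters_smm_after_load function are hypothetical names invented for this sketch. It only encodes what is stated in this bug: the 7.2 source leaves the flag set in the stream, 7.3 never delivered SMIs, and 7.4 does deliver them.

```python
# Placeholder bit for this sketch only; the real value lives in QEMU's headers.
CPU_INTERRUPT_SMI = 0x40

# How each host version treats a pending SMI flag, per the comments in this
# bug. The labels are hypothetical and exist only for this illustration.
SMI_BEHAVIOUR = {
    "7.2": "ignored",      # flag gets set, but nothing acts on it
    "7.3": "undelivered",  # SMI entry code exists, but a bug stops delivery
    "7.4": "delivered",    # delivery fixed, so a pending SMI enters SMM
}


def enters_smm_after_load(interrupt_request, dest):
    """True if the destination would enter SMM right after loading the stream."""
    pending = bool(interrupt_request & CPU_INTERRUPT_SMI)
    return pending and SMI_BEHAVIOUR[dest] == "delivered"


if __name__ == "__main__":
    # A stream from a 7.2 source that carries the stale SMI flag.
    stream_from_72 = CPU_INTERRUPT_SMI
    for dest in ("7.3", "7.4"):
        result = enters_smm_after_load(stream_from_72, dest)
        print(f"7.2 -> {dest}: enters SMM = {result}")
```

In this model only the 7.2 -> 7.4 path enters SMM, which matches the trace where the guest enters SMM and then triple faults.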
Posted downstream:
x86: Work around SMI breakages

Please try this with lots of machine types and also with EFI firmware on q35.
Took a different tack and posted upstream; we'll need to wait for it to swing back around.
Fixed upstream in 2.9, commit fc3a1fd74fac0e3233060aaaf923fe8ec104b48f
(In reply to Dr. David Alan Gilbert from comment #7)
> Posted downstream:
> x86: Work around SMI breakages
>
> Please try this with lots of machine types and also with EFI firmware on q35.

Verified this bug using:

rhel7.2.z host:
kernel-3.10.0-327.53.1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64

rhel7.4 host:
kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

Tested migration rhel7.2.z<->rhel7.4 with machine types "-M rhel7.2.0/rhel7.1.0/rhel7.0.0/rhel6.6.0/rhel6.5.0" and two guests: rhel7.2.z and win2012r2. The result is pass: migration finishes normally and the guest does not reboot after migration.

qemu-kvm-rhev-2.9.0-2.el7.x86_64 only supports pc-q35-rhel7.4.0 and pc-q35-rhel7.3.0 for q35. Tested rhel7.3.z<->rhel7.4 with EFI firmware on q35 and a rhel7.3.z guest: migration finishes normally and the guest does not reboot.
Based on comment #15, set this bug to be verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392