Description of problem:
The guest reboots after migration from RHEL7.2.z to RHEL7.4. Tested two guests, rhel7.2 and win2012r2; both reboot after migration.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot a rhel7.2 or win2012r2 guest on the source host:
# /usr/libexec/qemu-kvm \
-name rhel7 \
-machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off \
-m 2048 \
-cpu Opteron_G4,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/mnt/rhel7.2.qcow2,if=none,id=drive-data-disk,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device ide-hd,drive=drive-data-disk,id=system-disk,logical_block_size=512,physical_block_size=512,min_io_size=32,opt_io_size=64,discard_granularity=512,ver=fuxc-ver,bus=ide.0,unit=0 \
-net none \
-monitor stdio \
-qmp tcp:0:4466,server,nowait -serial unix:/tmp/ttym,server,nowait \
-spice port=5910,addr=0.0.0.0,disable-ticketing,seamless-migration=on \
-device qxl-vga,id=video0,ram_size=134217728,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2
2. Boot the guest on the destination host with the same command line plus "-incoming tcp:0:5800".
3. Start the migration from the source monitor:
(qemu) migrate -d tcp:10.73.72.56:5800
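The same migration can also be driven over the QMP socket that step 1 opens (-qmp tcp:0:4466,server,nowait). This is a minimal sketch assuming a POSIX shell on the source host. The helper names and DEST_URI are hypothetical; only the QMP commands themselves (qmp_capabilities, migrate, query-migrate) come from QEMU:

```shell
# Sketch only: helper names are hypothetical, not part of QEMU.
# Destination URI taken from step 3; adjust for your setup.
DEST_URI="tcp:10.73.72.56:5800"

# QMP commands that start a migration. Every new QMP connection must
# send qmp_capabilities before any other command is accepted.
qmp_migrate_cmds() {
    printf '%s\n' \
        '{"execute": "qmp_capabilities"}' \
        "{\"execute\": \"migrate\", \"arguments\": {\"uri\": \"$1\"}}"
}

# QMP commands that ask for the current migration state.
qmp_query_cmds() {
    printf '%s\n' \
        '{"execute": "qmp_capabilities"}' \
        '{"execute": "query-migrate"}'
}

# Read QMP replies on stdin and print the last "status" value seen
# (e.g. setup, active, completed, failed).
migration_status() {
    sed -n 's/.*"status": *"\([a-z-]*\)".*/\1/p' | tail -n 1
}
```

For example, `qmp_migrate_cmds "$DEST_URI" | socat -t 2 - TCP:127.0.0.1:4466` would start the migration, and `qmp_query_cmds | socat -t 2 - TCP:127.0.0.1:4466 | migration_status` should print "completed" once it finishes. socat is an assumption here; any TCP client that keeps the connection open long enough to receive the replies will work.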
Actual results:
After step 3, when migration finishes, the guest reboots on the destination.
Expected results:
The guest does not reboot.
I also used the same command line to test migration from RHEL7.3.z to RHEL7.4 and did not hit this issue.
This happens on Intel as well, with a 7.3 guest.
Easy to reproduce with:
/usr/libexec/qemu-kvm \
-machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off \
-cpu IvyBridge \
-m 4096 \
-no-hpet \
-drive file=/home/vms/7.3-fromimage.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-chardev stdio,mux=on,id=mon \
-mon chardev=mon,mode=readline \
-device isa-serial,chardev=mon
With KVM tracing enabled on the destination, I see:
<series of sane looking stuff, page faults, all normal, some console IO>
CPU 0/KVM-11014  .... 6629872.405344: kvm_exit: reason EPT_VIOLATION rip 0xffffffff81060eb6 info 181 0
CPU 0/KVM-11014  .... 6629872.405345: kvm_page_fault: address 13fc09010 error_code 181
CPU 0/KVM-11014  .... 6629872.405347: kvm_inj_exception: #PF (0x0)
CPU 0/KVM-11014  d... 6629872.405347: kvm_entry: vcpu 0
CPU 0/KVM-11014  .... 6629872.405350: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xffffffff8168df90 info 0 800000fd
CPU 0/KVM-11014  .... 6629872.405352: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
CPU 0/KVM-11014  d... 6629872.405371: kvm_entry: vcpu 0
CPU 0/KVM-11014  .... 6629872.405373: kvm_exit: reason EPT_VIOLATION rip 0x8000 info 184 0
CPU 0/KVM-11014  .... 6629872.405373: kvm_page_fault: address 38000 error_code 184
CPU 0/KVM-11014  d... 6629872.405380: kvm_entry: vcpu 0
<then more page faults from 8000..3f000 with codes of 184 or 183>
CPU 0/KVM-11014  .... 6629872.405486: kvm_exit: reason EXCEPTION_NMI rip 0xfe04 info 0 80000306
CPU 0/KVM-11014  .... 6629872.405495: kvm_emulate_insn: 30000:fe04:ff ff (real)
CPU 0/KVM-11014  .... 6629872.405496: kvm_inj_exception: #UD (0x0)
CPU 0/KVM-11014  d... 6629872.405497: kvm_entry: vcpu 0
CPU 0/KVM-11014  .... 6629872.405499: kvm_exit: reason TRIPLE_FAULT rip 0xfe04 info 0 0
CPU 0/KVM-11014  .... 6629872.405501: kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
CPU 0/KVM-11014  .... 6629872.405502: kvm_fpu: unload
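When reading a capture like this, it helps to cut it down to the SMM-related events. This is a minimal sketch assuming the trace text arrives on stdin in the format shown above (for instance from `trace-cmd report` after `trace-cmd record -e kvm`). The function names are hypothetical:

```shell
# Hypothetical helpers for trimming a KVM trace (text on stdin) down to
# the events that matter for this failure.

# Keep SMM entry, exception injection, emulated instructions and the
# triple fault / shutdown exits; drop routine page faults and exits.
smm_trace_filter() {
    grep -E 'kvm_enter_smm|kvm_inj_exception|kvm_emulate_insn|TRIPLE_FAULT|KVM_EXIT_SHUTDOWN'
}

# Print the smbase value reported by each kvm_enter_smm event.
smbase_from_trace() {
    sed -n 's/.*entering SMM, smbase \(0x[0-9a-fA-F]*\).*/\1/p'
}
```

Run on the capture above, the filter keeps the kvm_enter_smm line, the #PF and #UD injections, the emulated instruction at 30000:fe04 and the final TRIPLE_FAULT / KVM_EXIT_SHUTDOWN pair.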
I think this is SMM/SMI related, but I'm not sure yet.
It seems to end up in SMM (for reasons I don't understand), then runs through an area of junk before finally hitting an undefined instruction and triple faulting.
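The addresses in the trace fit a genuine SMM entry. Architecturally, an SMI starts the handler at offset 0x8000 in a CS segment whose base is SMBASE. With smbase 0x30000, the first fetch therefore lands at guest-physical 0x38000, which matches the EPT_VIOLATION with rip 0x8000 and address 38000. The kvm_emulate_insn location 30000:fe04 is linear address 0x3fe04, in the last page of the range the guest faulted through. A quick check of that arithmetic, with hypothetical helper names:

```shell
# Hypothetical helpers (not part of QEMU or KVM) for checking the SMM
# addresses in the trace. On SMI entry the CPU executes at offset 0x8000
# in a CS segment whose base is SMBASE.

# $1: SMBASE (e.g. 0x30000) -> address of the first instruction fetch.
smi_entry_addr() {
    printf '0x%x\n' $(( $1 + 0x8000 ))
}

# $1, $2: "csbase:ip" as printed by kvm_emulate_insn (hex, no 0x prefix)
# -> linear address of that instruction.
cs_ip_to_linear() {
    printf '0x%x\n' $(( 0x$1 + 0x$2 ))
}
```

`smi_entry_addr 0x30000` prints 0x38000, and `cs_ip_to_linear 30000 fe04` prints 0x3fe04.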
I haven't tracked down what that initial 'external interrupt' is; it doesn't seem to match the vector of any device registered in the guest.
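The "info" values on these exits can be decoded. This sketch assumes an Intel (VMX) host, where the second info value on a kvm_exit line is the VM-exit interruption-information field, laid out as described in the Intel SDM. The function name is hypothetical. The sketch decodes 800000fd as a valid external interrupt with vector 0xfd. It decodes the later EXCEPTION_NMI exit's 80000306 as hardware exception vector 6 (#UD), which matches the injected #UD. EXTERNAL_INTERRUPT exits are normally taken for interrupts aimed at the host, so that vector would be a host vector rather than one belonging to a guest device.

```shell
# Hypothetical decoder for a VMX VM-exit interruption-information value
# (hex, no 0x prefix). Field layout per the Intel SDM:
#   bits 7:0   vector
#   bits 10:8  type (0 = external interrupt, 2 = NMI, 3 = hardware exception)
#   bit 11     error code valid
#   bit 31     valid
decode_intr_info() {
    v=$(( 0x$1 ))
    printf 'valid=%d type=%d vector=0x%02x errcode=%d\n' \
        $(( (v >> 31) & 1 )) $(( (v >> 8) & 7 )) $(( v & 0xff )) $(( (v >> 11) & 1 ))
}
```

`decode_intr_info 800000fd` prints "valid=1 type=0 vector=0xfd errcode=0".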
It still looks like SMM/SMI.
7.2 ends up setting CPU_INTERRUPT_SMI but ignores it.
When the destination reads the migration stream, that bit arrives set and causes the SMI entry.
What I haven't figured out yet is why 7.3 doesn't do the entry; it has the SMI entry code.
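The hypothesis can be stated as a toy model: the pending-SMI bit travels in the migration stream, and only a destination that actually delivers pending SMIs acts on it. This is not QEMU code. SMI_PENDING is a hypothetical stand-in for CPU_INTERRUPT_SMI, and the delivery flag is an assumption of the model:

```shell
# Toy model only, not QEMU code.
# SMI_PENDING is a hypothetical stand-in for the CPU_INTERRUPT_SMI bit.
SMI_PENDING=1

# $1: interrupt_request bits as loaded from the migration stream
# $2: 1 if the destination delivers pending SMIs, 0 if it drops them
dest_after_load() {
    if [ $(( $1 & SMI_PENDING )) -ne 0 ] && [ "$2" -eq 1 ]; then
        echo "enter SMM"
    else
        echo "continue guest"
    fi
}
```

With a 7.2 source the stream carries the bit. `dest_after_load 1 1` (a destination that delivers SMIs) prints "enter SMM", while `dest_after_load 1 0` prints "continue guest".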
OK, the reason 7.3 worked is that it had a bug where SMIs didn't get delivered;
that was fixed in 7.4 by:
Posted downstream:
x86: Work around SMI breakages
Please try this with lots of machine types and also with EFI firmware on q35.
I've taken a different tack and posted it upstream; we'll need to wait for it to swing back around.
Fixed upstream in QEMU 2.9 by commit fc3a1fd74fac0e3233060aaaf923fe8ec104b48f.
(In reply to Dr. David Alan Gilbert from comment #7)
> Posted downstream:
> x86: Work around SMI breakages
> Please try this with lots of machine types and also with EFI firmware on q35.
Verify this bug using:
Tested migration rhel7.2.z <-> rhel7.4 with machine types "-M rhel7.2.0/rhel7.1.0/rhel7.0.0/rhel6.6.0/rhel6.5.0" using two guests, rhel7.2.z and win2012r2. Result: pass. Migration finishes normally and the guest does not reboot after migration.
qemu-kvm-rhev-2.9.0-2.el7.x86_64 only supports the pc-q35-rhel7.4.0 and pc-q35-rhel7.3.0 q35 machine types, so I tested rhel7.3.z <-> rhel7.4 with EFI firmware on q35 using a rhel7.3.z guest. Migration finishes normally and the guest does not reboot.
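The i440fx verification matrix can be enumerated with a small loop so that no combination is missed. This is a sketch. The machine-type names are copied as written in the comment, and the exact strings a given qemu-kvm build accepts may differ (`/usr/libexec/qemu-kvm -machine help` lists them). The variable and function names are hypothetical:

```shell
# Hypothetical enumeration of the migration test matrix described above.
# Names are as written in the comment; real machine-type strings may differ.
MACHINES="rhel7.2.0 rhel7.1.0 rhel7.0.0 rhel6.6.0 rhel6.5.0"
GUESTS="rhel7.2.z win2012r2"

# Print one "machine guest" pair per line: 5 machine types x 2 guests.
test_matrix() {
    for m in $MACHINES; do
        for g in $GUESTS; do
            echo "$m $g"
        done
    done
}
```

Each output line is one source/destination run to perform in both directions.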
Based on comment #15, setting this bug to VERIFIED.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.