Bug 1420679

Summary: Guest reboot after migration from RHEL7.2.z -> RHEL7.4
Product: Red Hat Enterprise Linux 7
Reporter: huiqingding <huding>
Component: qemu-kvm-rhev
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA
QA Contact: huiqingding <huding>
Severity: high
Docs Contact:
Priority: high
Version: 7.4
CC: chayang, dgilbert, huding, juzhang, knoel, michen, mrezanin, qzhang, virt-maint, xianwang, zhengtli
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.9.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-01 23:44:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1387372
Bug Blocks: 1376765

Description huiqingding 2017-02-09 10:01:03 UTC
Description of problem:
The guest reboots after migration from RHEL7.2.z -> RHEL7.4. Two guests were tested, rhel7.2 and win2012r2; both reboot after migration.

Version-Release number of selected component (if applicable):
Source host(rhel7.2.z):
kernel-3.10.0-327.50.1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64

Destination host(rhel7.4):
kernel-3.10.0-558.el7.x86_64
qemu-kvm-rhev-2.8.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot a rhel7.2 or win2012r2 guest on the source host:
# /usr/libexec/qemu-kvm \
-name rhel7 \
-machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off \
-m 2048 \
-cpu Opteron_G4,check \
-realtime mlock=off \
-smp 4,maxcpus=4,sockets=4,cores=1,threads=1 \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-nodefaults \
-rtc base=utc,driftfix=slew \
-boot order=c,menu=on,strict=on \
-drive file=/mnt/rhel7.2.qcow2,if=none,id=drive-data-disk,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device ide-hd,drive=drive-data-disk,id=system-disk,logical_block_size=512,physical_block_size=512,min_io_size=32,opt_io_size=64,discard_granularity=512,ver=fuxc-ver,bus=ide.0,unit=0 \
-net none \
-monitor stdio \
-qmp tcp:0:4466,server,nowait -serial unix:/tmp/ttym,server,nowait \
-spice port=5910,addr=0.0.0.0,disable-ticketing,seamless-migration=on \
-device qxl-vga,id=video0,ram_size=134217728,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2

2. Boot the guest on the destination host with "-incoming tcp:0:5800".

3. Do the migration:
(qemu) migrate -d tcp:10.73.72.56:5800
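
Step 3 can also be driven over the QMP socket that the command line in step 1 opens on port 4466, instead of typing into the HMP monitor. The following is a minimal sketch, not part of the original report: it only builds the JSON lines a client would send after reading the QMP greeting. The function name is made up for illustration. QMP's migrate returns immediately, so there is no counterpart to HMP's -d flag.

```python
import json


def qmp_migrate_messages(uri):
    """Return the QMP lines to send, in order, after the server greeting.

    qmp_migrate_messages is a hypothetical helper name; the three QMP
    commands themselves (qmp_capabilities, migrate, query-migrate) are
    standard QEMU commands.
    """
    commands = [
        # Leave capabilities negotiation mode; required before anything else.
        {"execute": "qmp_capabilities"},
        # Start the migration to the destination's -incoming address.
        {"execute": "migrate", "arguments": {"uri": uri}},
        # Poll this until the reported status is "completed" or "failed".
        {"execute": "query-migrate"},
    ]
    return [json.dumps(cmd) + "\r\n" for cmd in commands]


if __name__ == "__main__":
    for line in qmp_migrate_messages("tcp:10.73.72.56:5800"):
        print(line, end="")
```

Each line would be written to the socket at tcp:<source host>:4466, reading one JSON reply per command.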

Actual results:
After step 3, when migration finishes, the guest reboots on the destination.

Expected results:
The guest does not reboot.

Additional info:
I also used the same command line to test migration from RHEL7.3.z -> RHEL7.4 and did not hit this issue.

Comment 2 Dr. David Alan Gilbert 2017-02-16 10:36:55 UTC
Confirmed.
Happens on Intel as well with a 7.3 guest

Comment 3 Dr. David Alan Gilbert 2017-02-16 14:01:23 UTC
Easy to reproduce with:

/usr/libexec/qemu-kvm -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off -cpu IvyBridge -m 4096 -no-hpet -drive file=/home/vms/7.3-fromimage.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline --device isa-serial,chardev=mon

With kvm tracing on the destination I see:
<series of sane looking stuff, page faults, all normal, some console IO>
       CPU 0/KVM-11014 [014] .... 6629872.405344: kvm_exit: reason EPT_VIOLATION rip 0xffffffff81060eb6 info 181 0
       CPU 0/KVM-11014 [014] .... 6629872.405345: kvm_page_fault: address 13fc09010 error_code 181
       CPU 0/KVM-11014 [014] .... 6629872.405347: kvm_inj_exception: #PF (0x0)
       CPU 0/KVM-11014 [014] d... 6629872.405347: kvm_entry: vcpu 0
       CPU 0/KVM-11014 [014] .... 6629872.405350: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xffffffff8168df90 info 0 800000fd
       CPU 0/KVM-11014 [014] .... 6629872.405352: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
       CPU 0/KVM-11014 [014] d... 6629872.405371: kvm_entry: vcpu 0
       CPU 0/KVM-11014 [014] .... 6629872.405373: kvm_exit: reason EPT_VIOLATION rip 0x8000 info 184 0
       CPU 0/KVM-11014 [014] .... 6629872.405373: kvm_page_fault: address 38000 error_code 184
       CPU 0/KVM-11014 [014] d... 6629872.405380: kvm_entry: vcpu 0

<then more page faults from 8000..3f000 with codes of 184 or 183>
then
       CPU 0/KVM-11014 [014] .... 6629872.405486: kvm_exit: reason EXCEPTION_NMI rip 0xfe04 info 0 80000306
       CPU 0/KVM-11014 [014] .... 6629872.405495: kvm_emulate_insn: 30000:fe04:ff ff (real)
       CPU 0/KVM-11014 [014] .... 6629872.405496: kvm_inj_exception: #UD (0x0)
       CPU 0/KVM-11014 [014] d... 6629872.405497: kvm_entry: vcpu 0
       CPU 0/KVM-11014 [014] .... 6629872.405499: kvm_exit: reason TRIPLE_FAULT rip 0xfe04 info 0 0
       CPU 0/KVM-11014 [014] .... 6629872.405501: kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
       CPU 0/KVM-11014 [014] .... 6629872.405502: kvm_fpu: unload
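
The pattern to look for in a trace like this is a kvm_enter_smm event followed by a TRIPLE_FAULT exit. Below is a minimal sketch of a scanner for that pattern, assuming the trace line layout shown above; the regular expression and function name are illustrative and not part of any kvm tooling. Timestamps are returned as strings to avoid floating-point rounding.

```python
import re

# task-pid [cpu] flags timestamp: event: detail  (layout assumed from the
# excerpt above; other trace formats may need a different pattern)
EVENT_RE = re.compile(
    r"^\s*(?P<task>.+?-\d+)\s+\[\d+\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<detail>.*)$"
)

SAMPLE = """\
       CPU 0/KVM-11014 [014] .... 6629872.405350: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xffffffff8168df90 info 0 800000fd
       CPU 0/KVM-11014 [014] .... 6629872.405352: kvm_enter_smm: vcpu 0: entering SMM, smbase 0x30000
       CPU 0/KVM-11014 [014] d... 6629872.405371: kvm_entry: vcpu 0
       CPU 0/KVM-11014 [014] .... 6629872.405496: kvm_inj_exception: #UD (0x0)
       CPU 0/KVM-11014 [014] .... 6629872.405499: kvm_exit: reason TRIPLE_FAULT rip 0xfe04 info 0 0
       CPU 0/KVM-11014 [014] .... 6629872.405501: kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
"""


def smm_to_triple_fault(lines):
    """Return (smm_entry_ts, triple_fault_ts) as strings, or None.

    Reports the first TRIPLE_FAULT exit seen while the vcpu is believed
    to be in SMM. Lines that do not match the layout are skipped.
    """
    smm_ts = None
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue
        event, detail = m.group("event"), m.group("detail")
        if event == "kvm_enter_smm":
            # Assumption: any kvm_enter_smm line without "entering"
            # means the vcpu has left SMM again.
            smm_ts = m.group("ts") if "entering" in detail else None
        elif event == "kvm_exit" and "TRIPLE_FAULT" in detail:
            if smm_ts is not None:
                return smm_ts, m.group("ts")
    return None


if __name__ == "__main__":
    print(smm_to_triple_fault(SAMPLE.splitlines()))
```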

Comment 4 Dr. David Alan Gilbert 2017-02-16 18:26:36 UTC
I think this is SMM/SMI related but am not sure yet.
I think it ends up in SMM - for reasons I don't understand - then runs through an area of junk before finally hitting an undefined instruction and triple faulting.
I've not tracked down what that initial 'external interrupt' is - it doesn't seem to match the vector of any registered device on the guest.

Comment 5 Dr. David Alan Gilbert 2017-02-17 12:10:00 UTC
Still looking like SMM/SMI.
7.2 ends up setting CPU_INTERRUPT_SMI - but ignores it.
When we read the migration stream on the destination we end up with that flag set, which causes the SMI entry.
What I've not figured out yet is why 7.3 doesn't do the entry - it has the SMI entry code.

Comment 6 Dr. David Alan Gilbert 2017-02-17 12:48:51 UTC
OK, the reason 7.3 worked is that it had a bug where SMIs didn't get delivered;
that was fixed in 7.4 by:
68c6efe07a4729b54947658df4fceed84f3d0fef
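
The interaction described in comments 5 and 6 can be summarised with a toy model. This is only an illustration of the reasoning, not QEMU code: the class, function, and table names are invented, and the only facts taken from the comments are that a 7.2 source leaves an SMI flagged in the stream, and that of the three releases only 7.4 actually delivers it.

```python
from dataclasses import dataclass

# Whether each release's QEMU actually delivers a pending SMI to the guest,
# as described in comments 5 and 6 (7.2 ignores it, 7.3 had a delivery bug).
DELIVERS_SMI = {"7.2": False, "7.3": False, "7.4": True}


@dataclass
class Outcome:
    enters_smm: bool     # destination vcpu takes the SMI after loading state
    guest_reboots: bool  # observed result in this bug when it does


def migrate_from_72(dest_release, stream_has_pending_smi=True):
    """Toy model of loading a 7.2 migration stream on dest_release.

    Assumption: a guest booted on 7.2 was never set up to handle the SMI,
    so taking it on the destination ends in the triple fault seen in the
    trace, which shows up as a reboot.
    """
    enters_smm = stream_has_pending_smi and DELIVERS_SMI[dest_release]
    return Outcome(enters_smm=enters_smm, guest_reboots=enters_smm)


if __name__ == "__main__":
    for release in ("7.3", "7.4"):
        print(release, migrate_from_72(release))
```

In this model the 7.2 -> 7.3 case only survives because of the delivery bug, which matches the observation that fixing that bug exposed the reboot.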

Comment 7 Dr. David Alan Gilbert 2017-02-17 17:14:14 UTC
Posted downstream:
  x86: Work around SMI breakages

Please try this with lots of machine types and also with EFI firmware on q35.

Comment 8 Dr. David Alan Gilbert 2017-02-23 13:38:49 UTC
Took a different tack and posted upstream; we'll need to wait for it to swing back around.

Comment 9 Dr. David Alan Gilbert 2017-03-06 10:32:46 UTC
Fixed upstream in 2.9, commit fc3a1fd74fac0e3233060aaaf923fe8ec104b48f

Comment 15 huiqingding 2017-05-04 08:52:01 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> Posted downstream:
>   x86: Work around SMI breakages
> 
> Please try this with lots of machine types and also with EFI firmware on q35.

Verified this bug using:
rhel7.2.z host:
kernel-3.10.0-327.53.1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64
rhel7.4 host:
kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

Tested migration rhel7.2.z<->rhel7.4 with machine types "-M rhel7.2.0/rhel7.1.0/rhel7.0.0/rhel6.6.0/rhel6.5.0", using two guests: rhel7.2.z and win2012r2. Result: pass. Migration finishes normally and the guest does not reboot after migration.

For qemu-kvm-rhev-2.9.0-2.el7.x86_64, the only q35 machine types supported are pc-q35-rhel7.4.0 and pc-q35-rhel7.3.0. Tested rhel7.3.z<->rhel7.4 with EFI firmware on q35, with a rhel7.3.z guest. Migration finishes normally and the guest does not reboot.

Comment 16 huiqingding 2017-05-04 08:52:34 UTC
Based on comment #15, setting this bug to verified.

Comment 18 errata-xmlrpc 2017-08-01 23:44:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
