Bug 1414246

Summary: Win2016 guest hangs after migration between two hosts when the dst host clock is 10 minutes behind the src host
Product: Red Hat Enterprise Linux 6 Reporter: xianwang <xianwang>
Component: qemu-kvm Assignee: Hai Huang <hhuang>
Status: CLOSED NOTABUG QA Contact: xianwang <xianwang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.9 CC: ailan, chayang, dgilbert, hhuang, jen, michen, mkenneth, qzhang, rbalakri, virt-maint, xianwang, zhengtli
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Windows   
Last Closed: 2017-02-07 11:52:17 UTC Type: Bug

Description xianwang 2017-01-18 07:09:14 UTC
Description of problem:
Migrate a win2016 VM between two x86 AMD hosts where the dst host clock is 10 minutes behind the src host. After migration completes, the VM hangs while its network still works, i.e., the host can ping the VM successfully.

Version-Release number of selected component (if applicable):
Host(both src and dst):
2.6.32-682.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.500.el6.x86_64

Guest:
win2016
virtio-win.iso.el6

How reproducible:
4/4

Steps to Reproduce:
1. Boot a win2016 VM on the src host with the following qemu command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -machine pc  \
    -nodefaults  \
    -vga qxl \
    -device ich9-usb-ehci1,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=1d.0,firstport=0,bus=pci.0 \
    -drive id=drive_image1,if=none,aio=native,cache=none,format=qcow2,file=/root/win2016-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=04 \
    -drive id=drive_data1,if=none,aio=native,cache=none,format=qcow2,file=/root/data1.qcow2 \
    -device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,addr=05 \
    -device virtio-net-pci,mac=9a:e3:e4:e5:e6:e7,id=idY87fGi,vectors=4,netdev=idjQ51hm,bus=pci.0,addr=06 \
    -netdev tap,id=idjQ51hm,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4096 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu host \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/root/en_windows_server_2016_x64_dvd_9327751.iso \
    -device ide-drive,id=cd1,drive=drive_cd1,bootindex=1,bus=ide.0,unit=0 \
    -drive id=drive_virtio,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/root/virtio-win.iso.el6 \
    -device ide-drive,id=virtio,drive=drive_virtio,bootindex=2,bus=ide.1,unit=1 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -spice port=5901,disable-ticketing \
    -monitor stdio \
    -qmp tcp:0:8881,server,nowait \
    -rtc base=localtime,clock=host,driftfix=slew  \
    -boot order=cdn,once=d,menu=off,strict=off \
    -enable-kvm
2. Boot a VM on the dst host with the same qemu command line as on the src host, appending:
  -incoming tcp:0:5801
Note: the dst host clock is 10 minutes behind the src host, i.e., when the dst host reads 00:49:30 the src host reads 00:59:30.
3. Migrate from the src host to the dst host:
(qemu) migrate -d tcp:10.66.144.53:5801
(qemu) migrate_set_downtime 20
(qemu) info migrate
Migration status: completed
total time: 43042 milliseconds
downtime: 21191 milliseconds
transferred ram: 1542593 kbytes
remaining ram: 8 kbytes
total ram: 4325768 kbytes

Actual results:
After migration completes, the VM hangs on the dst host for about 10 minutes, after which it works normally again.

While the VM is hung, the dst host can ping it successfully. The output of "top" and "free -h" on the dst host is as follows:

[root@amd-5600-4-1 ~]# top
top - 01:54:44 up 2 days,  4:18,  3 users,  load average: 0.51, 0.74, 0.39
Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.8%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3855876k total,  3724732k used,   131144k free,     7640k buffers
Swap:  3997692k total,    69940k used,  3927752k free,  1984704k cached

[root@amd-5600-4-1 ~]# free -h
             total       used       free     shared    buffers     cached
Mem:          3.7G       3.5G       134M        12K       7.6M       1.9G
-/+ buffers/cache:       1.7G       2.0G
Swap:         3.8G        68M       3.7G

[root@amd-5600-4-1 ~]# ping 10.66.145.13
PING 10.66.145.13 (10.66.145.13) 56(84) bytes of data.
64 bytes from 10.66.145.13: icmp_seq=1 ttl=128 time=2.94 ms
64 bytes from 10.66.145.13: icmp_seq=2 ttl=128 time=0.753 ms
64 bytes from 10.66.145.13: icmp_seq=3 ttl=128 time=0.896 ms
64 bytes from 10.66.145.13: icmp_seq=4 ttl=128 time=0.744 ms
64 bytes from 10.66.145.13: icmp_seq=5 ttl=128 time=0.919 ms
64 bytes from 10.66.145.13: icmp_seq=6 ttl=128 time=0.776 ms
^C
--- 10.66.145.13 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5857ms
rtt min/avg/max/mdev = 0.744/1.171/2.940/0.794 ms



Expected results:
After migration completes, the VM works normally on the dst host.


Additional info:

Comment 1 xianwang 2017-01-18 08:52:52 UTC
This bug occurs on RHEL 6.9 x86_64 hosts, on both Intel and AMD machines.

Host(both src and dst):
2.6.32-682.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.500.el6.x86_64

Guest:
win2016
virtio-win.iso.el6

Comment 2 xianwang 2017-01-18 09:04:32 UTC
With a rhel69 guest, this bug does not occur on RHEL 6.9 x86_64 hosts, on either Intel or AMD machines.

Host(both src and dst):
2.6.32-682.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.500.el6.x86_64

Guest:
rhel69-64-virtio.qcow2
2.6.32-671.el6.x86_64

Comment 7 Jeff Nelson 2017-01-23 18:45:02 UTC
>Do migration between two x86 AMD hosts on win2016 vm, dst host is 10 minutes
>backwards than src host,

I believe this is documented as an invalid scenario; both src and dst hosts must have close-to-identical times. Reassigning for followup.
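
A pre-migration check along the following lines could flag such a mismatch before the migration is started. This is only a minimal sketch and not part of qemu-kvm: `clock_skew_seconds` and `check_skew` are hypothetical helper names, the 5-second default tolerance is an arbitrary assumption, and the two arguments are assumed to be epoch seconds such as the output of `date +%s` on each host.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the default tolerance are assumptions.

# Print dst minus src in seconds; a negative value means dst is behind src.
clock_skew_seconds() {
    echo $(( $2 - $1 ))
}

# Return 0 when the two clocks are within the tolerance, 1 otherwise.
# Arguments: src epoch seconds, dst epoch seconds, optional tolerance in seconds.
check_skew() {
    skew=$(clock_skew_seconds "$1" "$2")
    tolerance=${3:-5}
    if [ "$skew" -lt "-$tolerance" ] || [ "$skew" -gt "$tolerance" ]; then
        echo "clock skew ${skew}s exceeds ${tolerance}s, fix time sync before migrating"
        return 1
    fi
    echo "clock skew ${skew}s is within ${tolerance}s"
    return 0
}
```

On the src host this might be called as `check_skew "$(date +%s)" "$(ssh "$DST_HOST" date +%s)"`, where `DST_HOST` is a placeholder. The ssh round trip adds some error of its own, so a very tight tolerance is not meaningful with this approach.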

Comment 8 Dr. David Alan Gilbert 2017-01-23 18:51:23 UTC
Yes, I think it is documented that the src and dst hosts must have close times; however, I'm curious: if the dst host is 10 minutes behind, and you migrate the VM but leave it running for 10 minutes, does the VM then unhang at the end of those 10 minutes?

Comment 9 xianwang 2017-01-26 03:20:30 UTC
Hi, Jeff and David,
(1) If this scenario is invalid, why can't this bug be reproduced with a rhel69 guest on rhel69 hosts, or with a win2016 guest on rhel73 hosts? So far, this bug only reproduces with Windows guests on rhel69 hosts.

(2) When the dst host is 10 minutes before the src host, the Windows guest works well after migration. So does rhel69 only support the scenario where the dst host is 10 minutes before the src host, and not the scenario where the dst host is 10 minutes backward from the src host?

(3) For comment 8, I am not sure I understand David's point: "if the dst host is 10 minutes behind, and you migrate the VM, but leave the VM running for 10 minutes". After migration, the VM is hung and cannot be operated via vncviewer; 10 minutes later, the VM can be operated normally.

Comment 10 Dr. David Alan Gilbert 2017-01-26 09:21:44 UTC
(In reply to xianwang from comment #9)
> Hi, Jeff and David,
> (1)If this scenario is invalid, why this bug can't be reproduced for rhel69
> guest on rhel69 hosts, and can't be reproduced for win2016 guest on rhel73
> hosts? Now, this bug is only produced for windows guests on rhel69 hosts.

The problem here is related to the time as seen by the guest; rhel guests don't worry about that much, but Windows guests get upset by it.
However, I can't answer why this doesn't happen for win guests on rhel 73.

> (2)when the dst host is 10 minutes before than src host, windows guest can
> work well after migration. So, did rhel69 product only support the scenario
> that dst host is 10 minutes before than src host?  and not support the
> scenario that dst host is 10 minutes backward than src host? 

Your wording is incorrect - when you say 'backward' and 'before' which way around do you mean?  Give me an example of the time on both clocks.

> (3)For comment 8,I am not sure if I get David's point right."if the dst host
> is 10 minutes behind, and you migrate the VM, but leave the VM running for
> 10 minutes", after migration, the VM is hang and can't operate it via
> vncviewer,10 minutes later, vm can be operated normally.

One thing we've seen before is a case like this:
    a) Src host has clock at 9:00am
    b) Dst host has clock at 8:50am
    c) We migrate
    d) VM appears hung
    e) Once dst host gets to 9:00am the VM recovers

  I wanted to know if this is the case you're seeing.

Dave
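
The arithmetic behind steps a) to e) can be sketched as follows. This assumes only what is described in this thread: the guest appears hung until the dst clock catches up to the src time at migration, and a dst clock that is ahead causes no hang. `to_minutes` and `expected_hang_minutes` are hypothetical helper names, and the sketch ignores wraparound at midnight.

```shell
#!/bin/sh
# Convert an HH:MM wall-clock time to minutes since midnight.
# awk reads "09" as decimal 9, so leading zeros are safe here.
to_minutes() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Estimate how long the guest appears hung, in minutes.
# Arguments: src host time and dst host time at migration, both as HH:MM.
# Assumption: recovery happens when the dst clock reaches the src time.
expected_hang_minutes() {
    src=$(to_minutes "$1")
    dst=$(to_minutes "$2")
    if [ "$dst" -lt "$src" ]; then
        echo $(( src - dst ))
    else
        echo 0
    fi
}

expected_hang_minutes 9:00 8:50   # prints 10
expected_hang_minutes 8:50 9:00   # prints 0
```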

Comment 11 xianwang 2017-02-07 04:37:01 UTC
(In reply to Dr. David Alan Gilbert from comment #10)

Hi, David,

> > (2)when the dst host is 10 minutes before than src host, windows guest can
> > work well after migration. So, did rhel69 product only support the scenario
> > that dst host is 10 minutes before than src host?  and not support the
> > scenario that dst host is 10 minutes backward than src host? 
> 
> Your wording is incorrect - when you say 'backward' and 'before' which way
> around do you mean?  Give me an example of the time on both clocks.
>
I am sorry my wording was unclear. I mean: does rhel69 only support the scenario where the dst host is 10 minutes ahead of the src host, for example:
a) Src host has clock at 8:50am
b) Dst host has clock at 9:00am
and not the scenario where the dst host is 10 minutes behind the src host, for example:
a) Src host has clock at 9:00am
b) Dst host has clock at 8:50am
> 
> > (3)For comment 8,I am not sure if I get David's point right."if the dst host
> > is 10 minutes behind, and you migrate the VM, but leave the VM running for
> > 10 minutes", after migration, the VM is hang and can't operate it via
> > vncviewer,10 minutes later, vm can be operated normally.
> 
> One thing we've seen before is a case like this:
>     a) Src host has clock at 9:00am
>     b) Dst host has clock at 8:50am
>     c) We migrate
>     d) VM appears hung
>     e) Once dst host gets to 9:00am the VM recovers
>
Yes, the test steps and results for this bug match your steps a) through e).