Bug 523457

Summary: windows guests time drift after migration.
Product: Red Hat Enterprise Linux 5
Reporter: lihuang <lihuang>
Component: kvm
Assignee: Glauber Costa <gcosta>
Status: CLOSED DUPLICATE
QA Contact: Lawrence Lim <llim>
Severity: medium
Priority: high
Version: 5.4
CC: azarembo, bstein, dshaks, gcosta, juzhang, lmr, michen, ndai, ovirt-maint, srao, syeghiay, tburke, tools-bugs, virt-maint, ykaul
Target Milestone: rc
Keywords: Reopened
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-03-17 11:13:45 UTC
Bug Blocks: 518493
Attachments:
Description: time drift result from the scratch build
Flags: none

Description lihuang 2009-09-15 15:26:03 UTC
Description of problem:
On an Intel host, the offset of a win2k8 Datacenter (64-bit) guest is about 9.45s after migration.

On an AMD host, the offset of a winXP guest is about 10.75s.

More information can be found in
https://bugzilla.redhat.com/show_bug.cgi?id=521794#c6
and https://bugzilla.redhat.com/show_bug.cgi?id=521794#c8


Version-Release number of selected component (if applicable):
kvm-83-105.el5_4.3

How reproducible:


Steps to Reproduce:
1. Start the guest.
2. Sync the guest time with the host (ntpdate.exe -b $host).
3. Perform the migration.
4. Query the offset (ntpdate.exe -q $host).
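
Step 4 can be scripted so that the offset is captured as a number. This is only a minimal sketch: it assumes the query output contains a line of the form "server <ip>, stratum <n>, offset <seconds>, delay <seconds>", and the helper name extract_offset is hypothetical (it is not part of ntpdate). On a Windows guest the output of ntpdate.exe would have to be saved and parsed wherever a POSIX shell and awk are available.

```shell
#!/bin/sh
# extract_offset: hypothetical helper, not part of ntpdate.
# Reads `ntpdate -q` output on stdin and prints the offset (in seconds)
# taken from the first line that starts with "server ".
extract_offset() {
    awk '/^server / {
        for (i = 1; i < NF; i++) {
            if ($i == "offset") {
                sub(/,$/, "", $(i + 1))   # drop the trailing comma
                print $(i + 1)
                exit
            }
        }
    }'
}
```

Possible use on a machine that has both ntpdate and a POSIX shell: ntpdate -q "$host" | extract_offset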
  
Actual results:


Expected results:
no drift

Additional info:

CLI : 
/usr/libexec/qemu-kvm -smp 2 -m 4G -cpu qemu64,+sse2 -startdate now -drive
file=/mnt/winXP-32-virtio-back.raw,media=disk,if=ide -name host2guest
-usbdevice tablet -net nic,vlan=0,macaddr=00:55:18:02:B2:fe,model=virtio -net
tap,vlan=0,script=/etc/qemu-ifup -vnc :2 -monitor stdio -notify all
-rtc-td-hack

Comment 1 Glauber Costa 2009-09-15 16:07:32 UTC
Please test the rpm from:

https://brewweb.devel.redhat.com/taskinfo?taskID=1986372

Comment 2 Lawrence Lim 2009-09-21 15:02:01 UTC
Created attachment 361958 [details]
time drift result from the scratch build

Comment 3 Lucas Meneghel Rodrigues 2009-10-01 02:42:48 UTC
I was able to reproduce the problem. To summarize, the problem is readily reproducible and the time drifts we are seeing are in the range 1.2-1.3s, which is considered unacceptable.

Guest: Windows XP 32-bit
Host: Intel Xeon, 8 processors
model name	: Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
cpu MHz		: 2656.000
cache size	: 6144 KB
KVM version: kvm-83-105.el5_4.8 (latest build from brew - RHEL 5.4.z maintenance stream at the time this procedure was attempted.)

Host OS: RHEL 5.4, Linux virtlab104.virt.bos.redhat.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Possibly important data: I wasn't able to reproduce the problem using -net user, so this *might* indicate that the tap device + bridge setup is somehow a factor.

It took us some time to set up the environment and figure out the whole procedure, so I am recording it here in case someone else needs it. This procedure is based on the comment referenced in this report:

https://bugzilla.redhat.com/show_bug.cgi?id=521794#c8

1) Make sure you have a working bridge and an ifup script that will be used to configure the tap device for qemu. As an example, if the name of the bridge device is br0, the ifup script looks like:

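#!/bin/sh
# $1 is the tap interface name that qemu passes to this script
switch=br0
# Bring the tap device up without an IP address, then attach it to the bridge
/sbin/ifconfig "$1" 0.0.0.0 up
/usr/sbin/brctl addif "${switch}" "$1"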

If your lab has a DHCP server then your VM will be able to pick up an address via DHCP and therefore be visible for the rest of your lab's machines.

2) Install WinXP 32 bit under a VM (raw file format was chosen). The qemu command line used was:

/usr/libexec/qemu-kvm -smp 2 -m 4G -cpu qemu64,+sse2 -startdate now -drive file=winxp.raw,media=disk,if=ide -drive file=drivers.iso,if=ide,media=cdrom,index=2 -name host2guest -usbdevice tablet -net nic,vlan=0,macaddr=00:55:18:02:B2:fe,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup-eth0, -vnc :2 -monitor stdio -notify all -rtc-td-hack

This puts the qemu monitor on stdio, so it's more convenient to type commands in there.

3) Install the latest WinXP virtio network drivers available at the time of writing:
http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers

4) Install NTP for Windows, which can be found at
http://www.meinberg.de/download/ntp/windows/ntp-4.2.4p7@copenhagen-o-win32-setup.exe
Make sure the ntp daemon is never executed. We only want to use the ability to sync the guest's clock with the host's clock using ntpdate.exe.

5) Make sure your host can answer clock sync requests. I looked at the routes:

[root@virtlab104 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.16.72.0      *               255.255.252.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.16.75.254    0.0.0.0         UG    0      0        0 eth0

Then I had to add the line

restrict 10.16.72.0 mask 255.255.255.0 nomodify notrap

to /etc/ntp.conf and restart the ntpd service so that it answers sync requests.

6) Optionally, for convenience, flush the firewall rules and put SELinux into permissive mode so you can do the local migration more easily:

iptables -F

setenforce 0

7) On your host, for each physical CPU present, start a command that generates a high CPU load. If your system has 8 cores, for example, you can do the following:

for cpu in $(seq 0 7); do dd if=/dev/urandom of=/dev/null bs=500M count=10 & done

8) Prepare the VM that is going to receive migration accordingly. The command used was:

/usr/libexec/qemu-kvm -smp 2 -m 4G -cpu qemu64,+sse2 --incoming tcp:0:4444 -startdate now -drive file=winxp.raw,media=disk,if=ide -drive file=drivers.iso,if=ide,media=cdrom,index=2 -name host2guest -usbdevice tablet -net nic,vlan=0,macaddr=00:55:18:02:B2:fe,model=virtio -net user,vlan=0, -vnc :3 -monitor stdio -notify all -rtc-td-hack

9) On your Windows guest, open a terminal and sync the guest time with the host time:

cd C:\Program Files\NTP\bin
ntpdate.exe -b [hostname of the virt host]

Run it twice: the first run syncs the clocks and the second verifies that they are in sync.

10) Now, the migration itself. Run the migrate command on the first VM's monitor:

migrate -d tcp:0:4444

Wait until the migration is complete (info migrate on the source VM's qemu monitor will tell you when it is done).

11) Get to the new VM (we started it on VNC port 3) and then execute again:

ntpdate.exe -b [hostname of the virt host]

And you will see the drift.
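
The check in step 11 can also be turned into a pass/fail test. The following is a minimal sketch under stated assumptions: the helper name drift_exceeds is hypothetical, and the threshold is whatever limit the tester considers acceptable (the 1.0 s used in the usage note is an arbitrary example, not a value taken from this report).

```shell
#!/bin/sh
# drift_exceeds: hypothetical helper.
# Usage: drift_exceeds OFFSET_SECONDS THRESHOLD_SECONDS
# Exits with status 0 when the absolute value of the offset is greater
# than the threshold, and with status 1 otherwise.
drift_exceeds() {
    awk -v off="$1" -v thr="$2" 'BEGIN {
        if (off < 0) off = -off      # compare on the absolute offset
        exit !(off > thr)
    }'
}
```

For example, "drift_exceeds 1.25 1.0 && echo drift too large" would flag an offset in the 1.2-1.3s range reported above.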

Let me know if there's anything else I can do to help.

Comment 4 Dor Laor 2009-10-01 13:04:57 UTC
Did you do the migration on the same host?
If that's the case, 1.2s is certainly way too big. We need to get to the bottom of it to understand why it happens. Is it related to networking? On migration we publish a new MAC on the host bridge. Will ntp handle it?

Comment 5 Dor Laor 2009-10-01 13:14:28 UTC
I went to two QE engineers and tried to reproduce it on different systems, migrating between different hosts whose clocks were not synced.
All the cpus were loaded. One guest was rhel and the other winXP.

After 10 migrations I saw only a 4-second gap, i.e. 0.4s per migration.

Comment 6 Lucas Meneghel Rodrigues 2009-10-01 14:29:59 UTC
Yes, it was on the same host. The procedure I came up with is based on the reports mentioned in this bugzilla. NTP for Windows was very handy because it makes manual testing simple.

Comment 7 Lucas Meneghel Rodrigues 2009-10-02 19:43:42 UTC
Update: I thought at first that the problem was readily reproducible, because I was able to reproduce it several times, even with a clean boot. However, it turns out that it's intermittent: sometimes we see it, sometimes we don't. I am asking for Lawrence's assistance in debugging and reproducing the problem, so that Glauber and I can investigate further.

Comment 8 Dor Laor 2009-10-05 15:56:09 UTC
Today I ran a rhel5.4 guest with -smp 2 and -no-kvm-pit-reinjection, and added "divider=10 notsc lpj=xxxx" to the guest kernel cmd line. With both host and guest loaded, the guest had serious time drift.
The time drift exists regardless of migration! Migration may add 1-2 seconds but it's not the root cause. The main problem is that we drift without pv clock.

Avi estimates that the host scheduler treats qemu as a batch job and gives it bigger time slices, thus causing us to miss irq injections (since the previous irq has not yet received an EOI). Note that tuning some scheduler /sys parameters works around it on an upstream host kernel. There are no parameters for tweaking the scheduler in rhel5.4.

The problem is that if we do reinject irqs in the host (dropping the -no-kvm-pit-reinjection flag), the rhel guest makes its own adjustment to compensate for the lost irqs while the host does the same. The result is that the guest has a negative drift - its clock runs faster than the host.
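
The cost of missed timer interrupts can be put into rough numbers. This is a minimal sketch under stated assumptions, not a model of what kvm actually does: it assumes a tick-counting guest that falls behind by exactly one tick period for every interrupt that is never delivered, and it assumes that divider=10 gives a 100 Hz tick on a guest kernel built with HZ=1000. The helper name lost_seconds and the tick count in the usage note are hypothetical, not measurements from this bug.

```shell
#!/bin/sh
# lost_seconds: hypothetical helper.
# Usage: lost_seconds MISSED_TICKS TICK_RATE_HZ
# Prints how far a tick-counting clock falls behind, in seconds,
# assuming each missed tick costs exactly 1/TICK_RATE_HZ seconds.
lost_seconds() {
    awk -v ticks="$1" -v hz="$2" 'BEGIN { printf "%.3f\n", ticks / hz }'
}
```

Under these assumptions, "lost_seconds 120 100" prints 1.200: about 120 undelivered ticks at 100 Hz would be enough to explain a drift of a little over one second.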

Comment 16 Dor Laor 2009-11-10 13:48:25 UTC

*** This bug has been marked as a duplicate of bug 531701 ***

Comment 17 Miya Chen 2010-01-27 09:12:47 UTC
Time offset is up to 16 seconds after 8 migrations on localhost.

Host kernel: 2.6.18-185.el5
KVM version: kvm-83-152.el5
Guest type: win2k8-64
#cat /proc/cpuinfo
processor	: 95
vendor_id	: GenuineIntel
cpu family	: 6
model		: 29
model name	: Intel(R) Xeon(R) CPU           E7450  @ 2.40GHz
 
Steps:
1. Synchronize the host time
#ntpdate clock.redhat.com
2. Boot the guest
#/usr/libexec/qemu-kvm  -no-hpet -usbdevice tablet -rtc-td-hack -m 2G -smp 2 -drive file=/root/zhangjunyi/win28k-64-ide.qcow2,if=ide,boot=on  -net nic,vlan=0,macaddr=22:11:22:45:66:83,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -uuid `uuidgen` -cpu qemu64,+sse2 -balloon none -boot c -monitor stdio -vnc :10 -startdate now -notify all
3. Synchronize the guest time
4. Do ping-pong live migration on the local host.

The following is time offset report in guest:

1. before migration
>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 0.009389, delay 0.32307

27 Jan 05:57:15 ntpdate[1804]: step time server 66.187.233.4 offset 0.009389 sec



2.after 1st migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 1.732748, delay 0.34164

27 Jan 05:58:56 ntpdate[740]: step time server 66.187.233.4 offset 1.732748 sec


3.after 2nd migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 3.581796, delay 0.34743

27 Jan 06:00:44 ntpdate[2000]: step time server 66.187.233.4 offset 3.581796 sec



4.after 3rd migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 7.187067, delay 0.34047

27 Jan 06:02:04 ntpdate[1220]: step time server 66.187.233.4 offset 7.187067 sec



5.after 4th migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 10.116052, delay 0.32893

27 Jan 06:04:03 ntpdate[744]: step time server 66.187.233.4 offset 10.116052 sec



6.after 5th migration
>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 11.024944, delay 0.25447

27 Jan 06:06:01 ntpdate[1972]: step time server 66.187.233.4 offset 11.024944 sec


7.after 6th migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 12.723103, delay 0.32307

27 Jan 06:08:09 ntpdate[320]: step time server 66.187.233.4 offset 12.723103 sec



8.after 7th migration

C:\Users\Administrator>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 14.261751, delay 0.32790

27 Jan 06:10:44 ntpdate[1888]: step time server 66.187.233.4 offset 14.261751 sec


9.after 8th migration

>ntpdate -qb clock.redhat.com

server 66.187.233.4, stratum 1, offset 16.382875, delay 0.33516

27 Jan 06:12:15 ntpdate[1888]: step time server 66.187.233.4 offset 16.382875 sec
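
Because the offsets above are cumulative, the drift added by each individual migration is the difference between consecutive readings. A minimal sketch for computing those differences follows; the helper name offset_deltas is hypothetical, and it assumes the offsets have already been extracted, one per line and in measurement order.

```shell
#!/bin/sh
# offset_deltas: hypothetical helper.
# Reads one cumulative offset (in seconds) per line on stdin and prints
# the increase between each pair of consecutive readings.
offset_deltas() {
    awk 'NR > 1 { printf "%.6f\n", $1 - prev }
         { prev = $1 }'
}
```

Feeding it the nine offsets from this comment (0.009389 through 16.382875) gives eight per-migration increments; the largest is the jump from 3.581796 to 7.187067 after the 3rd migration.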

Comment 19 Dor Laor 2010-03-17 11:13:45 UTC
Closing as dup of 555727 since there is a drift regardless of migration

*** This bug has been marked as a duplicate of bug 555727 ***