Bug 624570

Summary: [RFC] sync guest time post savevm/loadvm
Product: Red Hat Enterprise Linux 6
Version: 6.0
Component: qemu-kvm
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Target Milestone: beta
Target Release: ---
Reporter: Shirley Zhou <szhou>
Assignee: Marcelo Tosatti <mtosatti>
QA Contact: Virtualization Bugs <virt-bugs>
CC: amit.shah, gcosta, lihuang, mkenneth, mshao, tburke, virt-maint, zamsden
Doc Type: Bug Fix
Last Closed: 2011-08-04 20:08:46 UTC
Bug Blocks: 580953

Description Shirley Zhou 2010-08-17 02:09:15 UTC
Description of problem:
time drift after savevm/loadvm

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.109.el6.x86_64
kernel-2.6.32-63.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Sync time on the host:
ntpdate -b clock.redhat.com
2. Run a RHEL6 guest on the above host as follows:
 /usr/libexec/qemu-kvm -m 4G -smp 4 -cpu qemu64,+x2apic -usbdevice tablet -drive file=/mnt/rhel6.qcow2,if=none,id=drive-virtio0,boot=on,werror=stop,rerror=stop,cache=none,format=qcow2 -device ide-drive,drive=drive-virtio0,id=virtio-blk-pci0 -netdev tap,id=hostnet0,script=/mnt/qemu-ifup,vhost=on,ifname=virtio_nic_2 -device virtio-net-pci,netdev=hostnet0,mac=52:54:00:cc:7e:f7,bus=pci.0,id=virtio1  -uuid a2341245-8765-1234-95da-1dd0a8891cc4 -name rhel6 -qmp tcp:0:4446,server,nowait   -device virtio-balloon-pci,id=ba1 -monitor stdio -boot c -vnc :1  -no-kvm-pit-reinjection   -rtc base=utc,clock=host,driftfix=slew
3. Sync time in the guest:
ntpdate -b clock.redhat.com
4. Query time:
ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset -0.000696, delay 0.32861
17 Aug 09:56:46 ntpdate[1970]: adjust time server 66.187.233.4 offset -0.000696 sec
5. Do savevm from the monitor, then query time:
ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset -0.090793, delay 0.33022
17 Aug 10:04:55 ntpdate[1975]: adjust time server 66.187.233.4 offset -0.090793 sec
6. Do loadvm, then query time:
ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset 78.049035, delay 0.32887
17 Aug 10:05:36 ntpdate[1994]: step time server 66.187.233.4 offset 78.049035 sec
  
Actual results:
After step 6 (loadvm), a huge time drift occurs.

Expected results:
There should not be so much time drift.

Additional info:

Comment 2 RHEL Program Management 2010-08-17 02:38:36 UTC
This issue has been proposed at a time when we are only considering
blocker issues for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Dor Laor 2010-11-21 22:33:22 UTC
Please try with -rtc=localtime

Comment 4 Shirley Zhou 2010-11-22 05:03:14 UTC
(In reply to comment #3)
> Please try with -rtc=localtime

Tried this issue with the time device options:
-no-kvm-pit-reinjection -rtc base=localtime,clock=host,driftfix=slew

And the bug also reproduces, as follows:

Guest: RHEL6.0 64 bit

1. before save snapshot:
# ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset -0.018592, delay 0.35168
21 Nov 23:56:27 ntpdate[2548]: adjust time server 66.187.233.4 offset -0.018592 sec
2.after save snapshot:
# ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset -0.001604, delay 0.31325
21 Nov 23:56:53 ntpdate[2549]: adjust time server 66.187.233.4 offset -0.001604 sec
3. after load snapshot
# ntpdate -q clock.redhat.com
server 66.187.233.4, stratum 1, offset 70.323973, delay 0.31897
21 Nov 23:58:16 ntpdate[2550]: step time server 66.187.233.4 offset 70.323973 sec

From the above results, we can see a huge time drift after loadvm.

Comment 5 Dor Laor 2010-11-22 07:59:35 UTC
It might be the time it takes to load the image.
Worth checking

Comment 6 Zachary Amsden 2010-11-22 21:57:25 UTC
I don't find this particularly surprising.

You've just stopped the guest from running by loading a snapshot, and you expect it to have kept up with real time while it was not running?

If you stop a guest and restart it, you'll need to resync it with time servers as a manual action.  NTP is not designed to cope with service outages; it works properly only on a continually running machine.  It will absolutely show a huge offset after a loadvm, which corresponds to the time it was not running.

I'm not sure the slew computation will properly compensate for an offset of time like this, which could be a bug, or not, depending on your perspective.  Obviously, it is undesirable to send millions of interrupts to a VM which has been suspended and loaded a few days later, so there are good reasons not to use slew to catch up time lost while the VM was not running.

Comment 7 Dor Laor 2010-11-25 10:51:01 UTC
Zach, if the -rtc localtime info overrode the RTC data saved in the VM, that would fix things, assuming ntp and other apps would then sync up from the OS (though even the OS needs to get an interrupt about the new RTC value).
Let's move this to RFC and defer it to 6.2 or later.

Comment 8 Glauber Costa 2010-11-25 13:26:11 UTC
I am in agreement with Dor here.

Upon load, there is a window of opportunity where we can do something; I'm just not sure yet what the best strategy is.

Comment 9 Zachary Amsden 2010-11-29 17:32:35 UTC
Best strategy: configure NTP with a 5-second rejection window so that it quits upon falling outside that window.

Then configure scripts so that NTP restarts and force-syncs time from the server upon exit.

No engineering changes need to be made; the software support for this should already exist.
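
For reference, a minimal sketch of the setup described above; the 5-second panic threshold, the wrapper script, and the reuse of clock.redhat.com are illustrative assumptions, not something shipped with the product:

/etc/ntp.conf:
# exit ntpd whenever the clock is off by more than 5 seconds
tinker panic 5

wrapper script (run in place of the normal ntpd service):
#!/bin/sh
# Hypothetical supervisor loop: whenever ntpd bails out (e.g. after a
# loadvm pushes the clock outside the panic window), step the clock
# once with ntpdate and start ntpd again.
while true; do
    ntpd -n                       # run ntpd in the foreground
    ntpdate -b clock.redhat.com   # step the clock after ntpd exits
done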

Comment 11 Marcelo Tosatti 2011-08-04 20:07:08 UTC
I agree with Zach; the current behaviour is correct. Correcting the
guest clock upon resuming is the responsibility of ntp, not of
the hypervisor hardware emulation.

Comment 12 Marcelo Tosatti 2011-08-04 20:08:07 UTC
(In reply to comment #6)
> I don't find this particularly surprising.
> 
> You've just stopped the guest from running by loading a snapshot, and you
> expect it to have kept up with real time while it was not running?
> 
> If you stop a guest and restart it, you'll need to resync it with time servers
> as a manual action.  NTP is not designed to cope with service outages; it
> works properly only on a continually running machine.  It will absolutely show
> a huge offset after a loadvm, which corresponds to the time it was not running.
> 
> I'm not sure the slew computation will properly compensate for an offset of
> time like this, which could be a bug, or not, depending on your perspective. 

If it does not compensate, then ntp should step the guest clock to
the correct value.

> Obviously, it is undesirable to send millions of interrupts to a VM which has
> been suspended and loaded a few days later, so there are good reasons not to
> use slew to catch up time lost while the VM was not running.

Comment 13 Marcelo Tosatti 2011-08-04 20:08:46 UTC
Closing as NOTABUG.