543137 – time drift in win2k364 KVM guest

Bug 543137 - time drift in win2k364 KVM guest

Summary: time drift in win2k364 KVM guest

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kvm
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Gleb Natapov
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	540569 547625
TreeView+	depends on / blocked

Reported:	2009-12-01 19:10 UTC by Issue Tracker
Modified:	2018-10-27 11:08 UTC (History)
CC List:	10 users (show)
Fixed In Version:	kvm-83-140.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 07:53:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
sosreport-bkon.58172-787563-664095.tar.bz2 (799.42 KB, application/x-bzip2) 2009-12-01 19:11 UTC, Joseph Kachuck	no flags	Details
drift_test.c (3.64 KB, text/plain) 2009-12-01 19:11 UTC, Joseph Kachuck	no flags	Details
win2k3-32bit guest time drift result in kvm-140 (9.72 KB, text/plain) 2009-12-24 02:29 UTC, Qunfang Zhang	no flags	Details
win2k3-64bit guest time drift result in kvm-140 (9.17 KB, text/plain) 2009-12-24 02:30 UTC, Qunfang Zhang	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0271	0	normal	SHIPPED_LIVE	Important: kvm security, bug fix and enhancement update	2010-03-29 13:19:48 UTC

Description Issue Tracker 2009-12-01 19:10:38 UTC

Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-12-01 19:10:40 UTC

Event posted on 12-01-2009 06:54am EST by Glen Johnson

=Comment: #0=================================================
Elizabeth A. Kon <bkon.com> -

---Problem Description---
win2k364 guest is experiencing significant time drift (e.g., 1 minute drift within 5 minutes of
starting testcase).

Seeing the drift did not depend on any other load on the host system. Interestingly,
I found that with time (i.e., the day after starting the testcase and leaving it running) the drift
was corrected and no longer occurring. But on restarting the testcase, the drift began again.

Attachment: testcase for producing drift

Contact Information = Elizabeth Kon/bkon.com Anthony Liguori/aliguori.com

---uname output---
Linux bc1cn6 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Machine Type = HS22

---Steps to Reproduce---
Start time_drift.exe testcase on the windows guest.
Within 5 minutes, using date on host and watching windows system clock, drift should be apparent.

---KVM Component Data---
Userspace tool common name: qemu-kvm
The userspace tool has the following bit modes: 64
Userspace rpm: kvm-83-105.el5_4.9

=Comment: #2=================================================
Elizabeth A. Kon <bkon.com> -

Attachment: sosreport
This event sent from IssueTracker by jkachuck  [IBM-LTC]
 issue 371234

Comment 2 Joseph Kachuck 2009-12-01 19:11:14 UTC

Created attachment 375166 [details]
sosreport-bkon.58172-787563-664095.tar.bz2

Comment 3 Joseph Kachuck 2009-12-01 19:11:37 UTC

Created attachment 375167 [details]
drift_test.c

Comment 4 Gleb Natapov 2009-12-03 12:57:28 UTC

What was the command line? Is it reproducable on 32bit version or is it unique for 64bit? Can you attach compiled version of drift_test.c? I don't have development environment for Windows. Will try to cross-compile on Linux, but having compiled version will be easier.

Comment 5 Issue Tracker 2009-12-03 15:20:47 UTC

Event posted on 12-03-2009 09:53am EST by Glen Johnson

------- Comment From bkon.com 2009-12-03 09:43 EDT-------
Following is my understanding of what is causing the drift, and an
explanation of the patch.

The current approach when interrupts are not being delivered in a timely
manner is to coalesce them and then inject them by dividing the rtc
interval into smaller intervals to speed up delivery of the missed
interrupts. The problem in this case is that we're dealing with a 2.6.18
kernel that doesn't have highres timers. This testcase causes windows to
run the rtc at its minimum period of 1ms so when the coalescing code tries
to split the rtc interval into smaller intervals, it can't, so the
delivery of coalesced interrupts can never catch up.

Originally, the coalescing code queued up a coalesced interrupt
immediately after the guest read the interrupt status (register C).
According to Gleb (as I understand it) there are situations where windows
reads register C and then spins waiting for the interrupt to deactivate.
If another interrupt is immediately queued, a hang results, because
windows does not see the interrupt deactivate, and is not looking at
register C. (Gleb please correct/clarify this as needed).

Since in this case, splitting the rtc interval doesn't work, the patch
attempts to immediately queue a coalesced interrupt after register C is
read, but if the coalesced timer pops and it appears that this interrupt
is still queued, the rtc deactivates the interrupt (to get windows out of
the hang) and increments the count of coalesced interrupts, since the
current one was not successfully delivered.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 6 Issue Tracker 2009-12-03 15:21:35 UTC

Event posted on 12-03-2009 09:12am EST by Glen Johnson

------- Comment From bkon.com 2009-12-03 09:10 EDT-------
I've attached the binary version of the testcase, and a patch that
corrects the drift for this testcase. Please review and comment.

Drift occurs in both win2k3 32 and 64 bit machines, and is corrected by
the patch in both cases.

Command line is:
/usr/libexec/qemu-kvm -drive
file=/scratch/images/beth/windows/win2k3_r2_std_64_raw_20G.img,if=ide -m
2048 -boot cd -net nic,macaddr=00:FF:FE:00:00:99,model=rtl8139 -net
tap,script=/etc/qemu-ifup -vnc :12 -usbdevice tablet -monitor stdio
-rtc-td-hack


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 7 Issue Tracker 2009-12-03 15:21:38 UTC

Event posted on 12-03-2009 09:12am EST by Glen Johnson

File uploaded: win2k3_drift.3.patch

This event sent from IssueTracker by jkachuck 
 issue 371234
it_file 278772

Comment 8 Issue Tracker 2009-12-03 15:21:40 UTC

Event posted on 12-03-2009 09:12am EST by Glen Johnson

File uploaded: drift_test.exe

This event sent from IssueTracker by jkachuck 
 issue 371234
it_file 278773

Comment 9 Gleb Natapov 2009-12-03 17:42:05 UTC

(In reply to comment #5)
> Event posted on 12-03-2009 09:53am EST by Glen Johnson
> 
> ------- Comment From bkon.com 2009-12-03 09:43 EDT-------
> Following is my understanding of what is causing the drift, and an
> explanation of the patch.
> 
> The current approach when interrupts are not being delivered in a timely
> manner is to coalesce them and then inject them by dividing the rtc
> interval into smaller intervals to speed up delivery of the missed
> interrupts. The problem in this case is that we're dealing with a 2.6.18
> kernel that doesn't have highres timers. This testcase causes windows to
> run the rtc at its minimum period of 1ms so when the coalescing code tries
> to split the rtc interval into smaller intervals, it can't, so the
> delivery of coalesced interrupts can never catch up.
> 
> Originally, the coalescing code queued up a coalesced interrupt
> immediately after the guest read the interrupt status (register C).
> According to Gleb (as I understand it) there are situations where windows
> reads register C and then spins waiting for the interrupt to deactivate.
> If another interrupt is immediately queued, a hang results, because
> windows does not see the interrupt deactivate, and is not looking at
> register C. (Gleb please correct/clarify this as needed).
> 
Yes I remember discussing this with Anthony. The description is correct. The problem is that Windows does that loop with interrupt disabled, so no progress is done. I wonder why drift was corrected after a day of running though.

Comment 10 Issue Tracker 2009-12-04 19:04:57 UTC

Event posted on 12-03-2009 01:37pm EST by Glen Johnson

------- Comment From bkon.com 2009-12-03 13:27 EDT-------
I don't have an answer about why the drift stops... still looking into
it.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 11 Issue Tracker 2009-12-04 19:05:00 UTC

Event posted on 12-04-2009 01:52pm EST by Glen Johnson

File uploaded: drift_test.c

This event sent from IssueTracker by jkachuck 
 issue 371234
it_file 279293

Comment 12 Issue Tracker 2009-12-04 19:05:03 UTC

Event posted on 12-04-2009 01:52pm EST by Glen Johnson

<cde:attachment>
Comment on attachment: drift_test.c

------- Comment on attachment From bkon.com 2009-12-04 13:47
EDT-------


Same testcase, just corrects a problem when printing interrupt rate,
causing rate to report 0, after i overflowed bounds of a float.
</cde:attachment>


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 13 Issue Tracker 2009-12-04 19:05:05 UTC

Event posted on 12-04-2009 01:53pm EST by Glen Johnson

File uploaded: drift_test.exe

This event sent from IssueTracker by jkachuck 
 issue 371234
it_file 279294

Comment 14 Issue Tracker 2009-12-04 19:05:08 UTC

Event posted on 12-04-2009 01:53pm EST by Glen Johnson

<cde:attachment>
Comment on attachment: drift_test.exe

------- Comment on attachment From bkon.com 2009-12-04 13:48
EDT-------


With corrected interrupt rate printf
</cde:attachment>


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 15 Issue Tracker 2009-12-04 20:20:00 UTC

Event posted on 12-04-2009 02:43pm EST by Glen Johnson

------- Comment From bkon.com 2009-12-04 14:35 EDT-------
No conclusions yet, but I will document what I see with the time drift
stopping after 1.25 hours:

Time drift gets to about 15 minutes slow within the 1.25 hours while the
testcase is executing. During that time I am watching the interrupt rate
as reported by the testcase and by added qemu debug.

Testcase (sampling every 5000 interrupts) consistently reports 1000 Hz
(interrupts/second)

qemu rtc_coalesced_timer pops at about 850 Hz
qemu rtc_periodic_timer pops at about 1024 Hz

After about 1.25 hours, the drift on the windows clock has corrected. The
testcase is still running. Now interrupts are reported as follows:

Testcase reports pattern something like:

5 Hz (I assume this is around when the time drift stopped, since before
that it was 1000 Hz for 1.25 hours) sampled every 5000 interrupts.
997.51 Hz
997.51
997.51
628.29
997.51
997.51
586.95
etc...

While testcase is reporting the above pattern, the rtc_coalesced_timer
continues to pop at about 850Hz and rtc_periodic_timer continues at 1024
Hz.

So at this point testcase is running ant time is correct.

I hit ctrl-c on the testcase, and windows changes the rtc back to 64 Hz.
At this point qemu debug reports about 64,000 coalesced interrupts
(s->irq_coalesced). Then the windows system clock, which had correct time
at this point, starts to move quickly forward at about 1 minute per 12
seconds wall time. It gets to about 15 minutes beyond current wall time
and then slows back down. During this time, as expected, qemu coalesced
timer is reporting about 350 interrupts per second, and periodic 64. So
the whizzing clock is caused by the delivery of coalesced interrupts.
Unfortunately, windows has somehow corrected itself so delivery of the
coalesced interrupts moves the clock beyond current time.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 16 Issue Tracker 2009-12-07 13:43:02 UTC

Event posted on 12-04-2009 05:52pm EST by Glen Johnson

------- Comment From bkon.com 2009-12-04 17:45 EDT-------
reproduced bug again with debug in cmos_ioport_read. There is no call to
cmos_ioport_read when clock catches up. windows clock just moves forward
14 minutes instantaneously. No whizzing clock. Coalesced count was around
800K when clock moved, but coalesced count didn't change. Coalesced count
keeps increasing. The only change in interrupt rates is that the time drift
testcase reports 6 interrupts per second at the time that the clock moves
forward, and after that enters the pattern of 997, 997, 639, 997, 997,
575, etc...

I will run w/o network connection to see if it still catches up.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 17 Issue Tracker 2009-12-07 18:29:30 UTC

Event posted on 12-07-2009 11:54am EST by Glen Johnson

------- Comment From bkon.com 2009-12-07 11:43 EDT-------
When the windows guest has no network connectivity (-net none) the guest
drift corrects after about an hour, when windows does a read of the time
from the rtc wallclock (i.e., ioport_read of RTC_SECONDS, RTC_MINUTES,
RTC_HOURS etc...) and corrects the drift, then the drift begins again.
This read does not occur when network connectivity exists, but the drift
corrects after about 1 hour there too, so it appears that windows uses NTP
to do a hard clock correction.

The problem this presents is that if the application load ends, allowing
delivery of accumulated interrupts after windows has corrected the time,
the guest time can move rapidly forward beyond current time. It may make
sense to consider discarding accumulated interrupts on an hourly basis.
Though this problem should not be as extreme after inclusion of the patch
to deliver interrupts on a more timely basis.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 19 Issue Tracker 2009-12-10 15:12:15 UTC

Event posted on 12-08-2009 05:30pm EST by Glen Johnson

------- Comment From bkon.com 2009-12-08 17:18 EDT-------
http://lists.gnu.org/archive/html/qemu-devel/2009-12/msg00894.html

Gleb has submitted a patch upstream and Anthony has accepted it, so now it
needs to get into RHEL 5.5 and Z stream.


This event sent from IssueTracker by jkachuck 
 issue 371234

Comment 26 Qunfang Zhang 2009-12-24 02:28:14 UTC

Can reproduce in kvm-83-139.el5.
Verified in kvm-83-140.el5 where this bug is fixed. But seems this issue still exists for win2k3-64 guest.Paste the test result here, and i will also update it to BZ.
Steps:
guest: win2k3-32, win2k3-64
1.sync time on host with ntp server.
1.boot the guest with the following command:
/usr/libexec/qemu-kvm -drive file=/home/win2003-32.qcow2,media=disk,if=ide -m 2G -smp 2G -net nic,vlan=0,macaddr=10:1a:4a:10:40:5f,model=rtl8139 -net tap,vlan=0,script=/etc/qemu-ifup -vnc :10 -usbdevice tablet -boot c -monitor stdio -rtc-td-hack -name kvm140
2.install and configure cygwin and ntp-client in guest.
3.download the drift_test.exe in BZ to guest.
4.sync time in guest with ntp server.
5.in guest's terminal(cygwin),input the following command to check the time offset value during 10 minutes.
for((i=1;i<60;i++));do ntpdate.exe -b -q clock.redhat.com;sleep 10;done
6.double-click the drift_test.exe to run the testcase.
7.check the time offset value of guest.

Result: win2k3-32 works well, but win2k3-64 drifts about 8 seconds within 10 minutes.

qzhang --> Gleb: just want to confirm with you, is this result acceptable or not?
Detail test result will be attached.

Comment 27 Qunfang Zhang 2009-12-24 02:29:25 UTC

Created attachment 380118 [details]
win2k3-32bit guest time drift result in kvm-140

Comment 28 Qunfang Zhang 2009-12-24 02:30:11 UTC

Created attachment 380119 [details]
win2k3-64bit guest time drift result in kvm-140

Comment 29 Gleb Natapov 2009-12-24 12:42:24 UTC

I wouldn't call this result acceptable. Can you retest with -smp 1? I don't see the drift with one cpu, but I can reproduce your result with smp. Can you also test if windows2008 guest has the same issue with smp too?

Comment 30 Qunfang Zhang 2009-12-25 05:20:37 UTC

Retest win2k3,win2k8 and win7,win2k8-64 drifts about 4 seconds within 10
minutes.List the test result.("ok" means no drift.)
           -smp 1       -smp 2
win2k3-32     ok          ok
win2k3-64     ok        8 sec within 10 min
win2k8-32     ok          ok
win2k8-64     ok        4 sec within 10 min
win7-32       ok          ok
win7-64       ok          ok

I used the same command line in comment#26 and kvm version is still 140.

Comment 31 Qunfang Zhang 2009-12-25 06:55:15 UTC

qzhang->Gleb:

Hi, Gleb, I have another question and want to confirm with you.
After I run the drift_test.exe in guest,the output interrupt rate is 1000 but sometimes with a sudden change(Please refer to the following output content).And if there's something wrong impacting on the time of guest?

In win2k3-64 guest:
Minimum period is 1 ms.
alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1207.55 interrupts/second

alarm_handler called, interrupt rate =   1300.81 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   2335.77 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second

alarm_handler called, interrupt rate =   1000.00 interrupts/second


In a win7-64 guest,the result is worse. 

alarm_handler called, interrupt rate =   1000.20 interrupts/second

alarm_handler called, interrupt rate =   1000.39 interrupts/second

alarm_handler called, interrupt rate = 426657.56 interrupts/second

alarm_handler called, interrupt rate = 512032.77 interrupts/second

alarm_handler called, interrupt rate = 511980.34 interrupts/second

alarm_handler called, interrupt rate =   1091.92 interrupts/second

alarm_handler called, interrupt rate =   1000.20 interrupts/second

alarm_handler called, interrupt rate =   1000.59 interrupts/second

alarm_handler called, interrupt rate =   1000.20 interrupts/second

alarm_handler called, interrupt rate =   1000.39 interrupts/second

alarm_handler called, interrupt rate =   1000.20 interrupts/second

alarm_handler called, interrupt rate =   1000.39 interrupts/second

alarm_handler called, interrupt rate =   1000.39 interrupts/second

alarm_handler called, interrupt rate =   1000.39 interrupts/second

Comment 33 Qunfang Zhang 2010-01-15 10:02:16 UTC

Arrange the test result on both AMD and Intel host here:
kvm version:kvm-83-140.el5

-----------------------------------------------------------
|          |       AMD host        |       Intel host     |
-----------------------------------------------------------
| Guest    |   up      |    smp    |    up    |    smp    |
-----------------------------------------------------------
|win2k3-32 |   OK      |   OK      |   OK     |    OK     |
-----------------------------------------------------------
|win2k3-64 |   OK      |5sec/10min |   OK     | 8sec/10min|
-----------------------------------------------------------
|win2k8-32 |   OK      |   OK      |   OK     |    OK     |
-----------------------------------------------------------
|win2k8-64 |   OK      |5sec/10min |   OK     | 4sec/10min|
-----------------------------------------------------------
|win2k8-R2 |   OK      |   OK      |   OK     |    OK     |     
-----------------------------------------------------------
| win7-32  |   OK      |   OK      |   OK     |    OK     |
-----------------------------------------------------------
| win7-64  |   OK      |   OK      |   OK     |    OK     |
-----------------------------------------------------------

Comment 34 Qunfang Zhang 2010-01-15 10:27:52 UTC

(1) After confirmed with Gleb,regarding this bug, maybe the best course of action
would be to verify that it is fixed on all systems except win2k3 win2k8
smp and close it. Open another one about timedrift with two guests above
with smp.
So,I will open another bug and close this one.

(2)for https://bugzilla.redhat.com/show_bug.cgi?id=543137#c33, paste my Intel and AMD host info here:
Intel host:(4 cpu)
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz

AMD host:(4 cpu)
processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 2
model name	: AMD Phenom(tm) 9600B Quad-Core Processor

Comment 35 Qunfang Zhang 2010-01-15 10:52:12 UTC

(In reply to comment #34)
> (1) After confirmed with Gleb,regarding this bug, maybe the best course of
> action
> would be to verify that it is fixed on all systems except win2k3 win2k8
> smp and close it. Open another one about timedrift with two guests above
> with smp.
> So,I will open another bug and close this one.

New bug:
https://bugzilla.redhat.com/show_bug.cgi?id=555727

Comment 39 errata-xmlrpc 2010-03-30 07:53:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html

Note You need to log in before you can comment on or make changes to this bug.