Escalated to Bugzilla from IssueTracker
Event posted on 12-01-2009 06:54am EST by Glen Johnson =Comment: #0================================================= Elizabeth A. Kon <bkon.com> - ---Problem Description--- win2k364 guest is experiencing significant time drift (e.g., 1 minute drift within 5 minutes of starting testcase). Seeing the drift did not depend on any other load on the host system. Interestingly, I found that with time (i.e., the day after starting the testcase and leaving it running) the drift was corrected and no longer occurring. But on restarting the testcase, the drift began again. Attachment: testcase for producing drift Contact Information = Elizabeth Kon/bkon.com Anthony Liguori/aliguori.com ---uname output--- Linux bc1cn6 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux Machine Type = HS22 ---Steps to Reproduce--- Start time_drift.exe testcase on the windows guest. Within 5 minutes, using date on host and watching windows system clock, drift should be apparent. ---KVM Component Data--- Userspace tool common name: qemu-kvm The userspace tool has the following bit modes: 64 Userspace rpm: kvm-83-105.el5_4.9 =Comment: #2================================================= Elizabeth A. Kon <bkon.com> - Attachment: sosreport This event sent from IssueTracker by jkachuck [IBM-LTC] issue 371234
Created attachment 375166 [details] sosreport-bkon.58172-787563-664095.tar.bz2
Created attachment 375167 [details] drift_test.c
What was the command line? Is it reproducable on 32bit version or is it unique for 64bit? Can you attach compiled version of drift_test.c? I don't have development environment for Windows. Will try to cross-compile on Linux, but having compiled version will be easier.
Event posted on 12-03-2009 09:53am EST by Glen Johnson ------- Comment From bkon.com 2009-12-03 09:43 EDT------- Following is my understanding of what is causing the drift, and an explanation of the patch. The current approach when interrupts are not being delivered in a timely manner is to coalesce them and then inject them by dividing the rtc interval into smaller intervals to speed up delivery of the missed interrupts. The problem in this case is that we're dealing with a 2.6.18 kernel that doesn't have highres timers. This testcase causes windows to run the rtc at its minimum period of 1ms so when the coalescing code tries to split the rtc interval into smaller intervals, it can't, so the delivery of coalesced interrupts can never catch up. Originally, the coalescing code queued up a coalesced interrupt immediately after the guest read the interrupt status (register C). According to Gleb (as I understand it) there are situations where windows reads register C and then spins waiting for the interrupt to deactivate. If another interrupt is immediately queued, a hang results, because windows does not see the interrupt deactivate, and is not looking at register C. (Gleb please correct/clarify this as needed). Since in this case, splitting the rtc interval doesn't work, the patch attempts to immediately queue a coalesced interrupt after register C is read, but if the coalesced timer pops and it appears that this interrupt is still queued, the rtc deactivates the interrupt (to get windows out of the hang) and increments the count of coalesced interrupts, since the current one was not successfully delivered. This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-03-2009 09:12am EST by Glen Johnson ------- Comment From bkon.com 2009-12-03 09:10 EDT------- I've attached the binary version of the testcase, and a patch that corrects the drift for this testcase. Please review and comment. Drift occurs in both win2k3 32 and 64 bit machines, and is corrected by the patch in both cases. Command line is: /usr/libexec/qemu-kvm -drive file=/scratch/images/beth/windows/win2k3_r2_std_64_raw_20G.img,if=ide -m 2048 -boot cd -net nic,macaddr=00:FF:FE:00:00:99,model=rtl8139 -net tap,script=/etc/qemu-ifup -vnc :12 -usbdevice tablet -monitor stdio -rtc-td-hack This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-03-2009 09:12am EST by Glen Johnson File uploaded: win2k3_drift.3.patch This event sent from IssueTracker by jkachuck issue 371234 it_file 278772
Event posted on 12-03-2009 09:12am EST by Glen Johnson File uploaded: drift_test.exe This event sent from IssueTracker by jkachuck issue 371234 it_file 278773
(In reply to comment #5) > Event posted on 12-03-2009 09:53am EST by Glen Johnson > > ------- Comment From bkon.com 2009-12-03 09:43 EDT------- > Following is my understanding of what is causing the drift, and an > explanation of the patch. > > The current approach when interrupts are not being delivered in a timely > manner is to coalesce them and then inject them by dividing the rtc > interval into smaller intervals to speed up delivery of the missed > interrupts. The problem in this case is that we're dealing with a 2.6.18 > kernel that doesn't have highres timers. This testcase causes windows to > run the rtc at its minimum period of 1ms so when the coalescing code tries > to split the rtc interval into smaller intervals, it can't, so the > delivery of coalesced interrupts can never catch up. > > Originally, the coalescing code queued up a coalesced interrupt > immediately after the guest read the interrupt status (register C). > According to Gleb (as I understand it) there are situations where windows > reads register C and then spins waiting for the interrupt to deactivate. > If another interrupt is immediately queued, a hang results, because > windows does not see the interrupt deactivate, and is not looking at > register C. (Gleb please correct/clarify this as needed). > Yes I remember discussing this with Anthony. The description is correct. The problem is that Windows does that loop with interrupt disabled, so no progress is done. I wonder why drift was corrected after a day of running though.
Event posted on 12-03-2009 01:37pm EST by Glen Johnson ------- Comment From bkon.com 2009-12-03 13:27 EDT------- I don't have an answer about why the drift stops... still looking into it. This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-04-2009 01:52pm EST by Glen Johnson File uploaded: drift_test.c This event sent from IssueTracker by jkachuck issue 371234 it_file 279293
Event posted on 12-04-2009 01:52pm EST by Glen Johnson <cde:attachment> Comment on attachment: drift_test.c ------- Comment on attachment From bkon.com 2009-12-04 13:47 EDT------- Same testcase, just corrects a problem when printing interrupt rate, causing rate to report 0, after i overflowed bounds of a float. </cde:attachment> This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-04-2009 01:53pm EST by Glen Johnson File uploaded: drift_test.exe This event sent from IssueTracker by jkachuck issue 371234 it_file 279294
Event posted on 12-04-2009 01:53pm EST by Glen Johnson <cde:attachment> Comment on attachment: drift_test.exe ------- Comment on attachment From bkon.com 2009-12-04 13:48 EDT------- With corrected interrupt rate printf </cde:attachment> This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-04-2009 02:43pm EST by Glen Johnson ------- Comment From bkon.com 2009-12-04 14:35 EDT------- No conclusions yet, but I will document what I see with the time drift stopping after 1.25 hours: Time drift gets to about 15 minutes slow within the 1.25 hours while the testcase is executing. During that time I am watching the interrupt rate as reported by the testcase and by added qemu debug. Testcase (sampling every 5000 interrupts) consistently reports 1000 Hz (interrupts/second) qemu rtc_coalesced_timer pops at about 850 Hz qemu rtc_periodic_timer pops at about 1024 Hz After about 1.25 hours, the drift on the windows clock has corrected. The testcase is still running. Now interrupts are reported as follows: Testcase reports pattern something like: 5 Hz (I assume this is around when the time drift stopped, since before that it was 1000 Hz for 1.25 hours) sampled every 5000 interrupts. 997.51 Hz 997.51 997.51 628.29 997.51 997.51 586.95 etc... While testcase is reporting the above pattern, the rtc_coalesced_timer continues to pop at about 850Hz and rtc_periodic_timer continues at 1024 Hz. So at this point testcase is running ant time is correct. I hit ctrl-c on the testcase, and windows changes the rtc back to 64 Hz. At this point qemu debug reports about 64,000 coalesced interrupts (s->irq_coalesced). Then the windows system clock, which had correct time at this point, starts to move quickly forward at about 1 minute per 12 seconds wall time. It gets to about 15 minutes beyond current wall time and then slows back down. During this time, as expected, qemu coalesced timer is reporting about 350 interrupts per second, and periodic 64. So the whizzing clock is caused by the delivery of coalesced interrupts. Unfortunately, windows has somehow corrected itself so delivery of the coalesced interrupts moves the clock beyond current time. This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-04-2009 05:52pm EST by Glen Johnson ------- Comment From bkon.com 2009-12-04 17:45 EDT------- reproduced bug again with debug in cmos_ioport_read. There is no call to cmos_ioport_read when clock catches up. windows clock just moves forward 14 minutes instantaneously. No whizzing clock. Coalesced count was around 800K when clock moved, but coalesced count didn't change. Coalesced count keeps increasing. The only change in interrupt rates is that the time drift testcase reports 6 interrupts per second at the time that the clock moves forward, and after that enters the pattern of 997, 997, 639, 997, 997, 575, etc... I will run w/o network connection to see if it still catches up. This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-07-2009 11:54am EST by Glen Johnson ------- Comment From bkon.com 2009-12-07 11:43 EDT------- When the windows guest has no network connectivity (-net none) the guest drift corrects after about an hour, when windows does a read of the time from the rtc wallclock (i.e., ioport_read of RTC_SECONDS, RTC_MINUTES, RTC_HOURS etc...) and corrects the drift, then the drift begins again. This read does not occur when network connectivity exists, but the drift corrects after about 1 hour there too, so it appears that windows uses NTP to do a hard clock correction. The problem this presents is that if the application load ends, allowing delivery of accumulated interrupts after windows has corrected the time, the guest time can move rapidly forward beyond current time. It may make sense to consider discarding accumulated interrupts on an hourly basis. Though this problem should not be as extreme after inclusion of the patch to deliver interrupts on a more timely basis. This event sent from IssueTracker by jkachuck issue 371234
Event posted on 12-08-2009 05:30pm EST by Glen Johnson ------- Comment From bkon.com 2009-12-08 17:18 EDT------- http://lists.gnu.org/archive/html/qemu-devel/2009-12/msg00894.html Gleb has submitted a patch upstream and Anthony has accepted it, so now it needs to get into RHEL 5.5 and Z stream. This event sent from IssueTracker by jkachuck issue 371234
Can reproduce in kvm-83-139.el5. Verified in kvm-83-140.el5 where this bug is fixed. But seems this issue still exists for win2k3-64 guest.Paste the test result here, and i will also update it to BZ. Steps: guest: win2k3-32, win2k3-64 1.sync time on host with ntp server. 1.boot the guest with the following command: /usr/libexec/qemu-kvm -drive file=/home/win2003-32.qcow2,media=disk,if=ide -m 2G -smp 2G -net nic,vlan=0,macaddr=10:1a:4a:10:40:5f,model=rtl8139 -net tap,vlan=0,script=/etc/qemu-ifup -vnc :10 -usbdevice tablet -boot c -monitor stdio -rtc-td-hack -name kvm140 2.install and configure cygwin and ntp-client in guest. 3.download the drift_test.exe in BZ to guest. 4.sync time in guest with ntp server. 5.in guest's terminal(cygwin),input the following command to check the time offset value during 10 minutes. for((i=1;i<60;i++));do ntpdate.exe -b -q clock.redhat.com;sleep 10;done 6.double-click the drift_test.exe to run the testcase. 7.check the time offset value of guest. Result: win2k3-32 works well, but win2k3-64 drifts about 8 seconds within 10 minutes. qzhang --> Gleb: just want to confirm with you, is this result acceptable or not? Detail test result will be attached.
Created attachment 380118 [details] win2k3-32bit guest time drift result in kvm-140
Created attachment 380119 [details] win2k3-64bit guest time drift result in kvm-140
I wouldn't call this result acceptable. Can you retest with -smp 1? I don't see the drift with one cpu, but I can reproduce your result with smp. Can you also test if windows2008 guest has the same issue with smp too?
Retest win2k3,win2k8 and win7,win2k8-64 drifts about 4 seconds within 10 minutes.List the test result.("ok" means no drift.) -smp 1 -smp 2 win2k3-32 ok ok win2k3-64 ok 8 sec within 10 min win2k8-32 ok ok win2k8-64 ok 4 sec within 10 min win7-32 ok ok win7-64 ok ok I used the same command line in comment#26 and kvm version is still 140.
qzhang->Gleb: Hi, Gleb, I have another question and want to confirm with you. After I run the drift_test.exe in guest,the output interrupt rate is 1000 but sometimes with a sudden change(Please refer to the following output content).And if there's something wrong impacting on the time of guest? In win2k3-64 guest: Minimum period is 1 ms. alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1207.55 interrupts/second alarm_handler called, interrupt rate = 1300.81 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 2335.77 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second alarm_handler called, interrupt rate = 1000.00 interrupts/second In a win7-64 guest,the result is worse. alarm_handler called, interrupt rate = 1000.20 interrupts/second alarm_handler called, interrupt rate = 1000.39 interrupts/second alarm_handler called, interrupt rate = 426657.56 interrupts/second alarm_handler called, interrupt rate = 512032.77 interrupts/second alarm_handler called, interrupt rate = 511980.34 interrupts/second alarm_handler called, interrupt rate = 1091.92 interrupts/second alarm_handler called, interrupt rate = 1000.20 interrupts/second alarm_handler called, interrupt rate = 1000.59 interrupts/second alarm_handler called, interrupt rate = 1000.20 interrupts/second alarm_handler called, interrupt rate = 1000.39 interrupts/second alarm_handler called, interrupt rate = 1000.20 interrupts/second alarm_handler called, interrupt rate = 1000.39 interrupts/second alarm_handler called, interrupt rate = 1000.39 interrupts/second alarm_handler called, interrupt rate = 1000.39 interrupts/second
Arrange the test result on both AMD and Intel host here: kvm version:kvm-83-140.el5 ----------------------------------------------------------- | | AMD host | Intel host | ----------------------------------------------------------- | Guest | up | smp | up | smp | ----------------------------------------------------------- |win2k3-32 | OK | OK | OK | OK | ----------------------------------------------------------- |win2k3-64 | OK |5sec/10min | OK | 8sec/10min| ----------------------------------------------------------- |win2k8-32 | OK | OK | OK | OK | ----------------------------------------------------------- |win2k8-64 | OK |5sec/10min | OK | 4sec/10min| ----------------------------------------------------------- |win2k8-R2 | OK | OK | OK | OK | ----------------------------------------------------------- | win7-32 | OK | OK | OK | OK | ----------------------------------------------------------- | win7-64 | OK | OK | OK | OK | -----------------------------------------------------------
(1) After confirmed with Gleb,regarding this bug, maybe the best course of action would be to verify that it is fixed on all systems except win2k3 win2k8 smp and close it. Open another one about timedrift with two guests above with smp. So,I will open another bug and close this one. (2)for https://bugzilla.redhat.com/show_bug.cgi?id=543137#c33, paste my Intel and AMD host info here: Intel host:(4 cpu) processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz AMD host:(4 cpu) processor : 3 vendor_id : AuthenticAMD cpu family : 16 model : 2 model name : AMD Phenom(tm) 9600B Quad-Core Processor
(In reply to comment #34) > (1) After confirmed with Gleb,regarding this bug, maybe the best course of > action > would be to verify that it is fixed on all systems except win2k3 win2k8 > smp and close it. Open another one about timedrift with two guests above > with smp. > So,I will open another bug and close this one. New bug: https://bugzilla.redhat.com/show_bug.cgi?id=555727
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0271.html