Bug 1098602
| Field | Value |
|---|---|
| Summary: | kvmclock: Ensure time in migration never goes backward (backport) |
| Product: | Red Hat Enterprise Linux 7 |
| Component: | qemu-kvm |
| Version: | 7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | high |
| Reporter: | Marcelo Tosatti <mtosatti> |
| Assignee: | Marcelo Tosatti <mtosatti> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | amit.shah, chayang, coli, imammedo, jkurik, jreznik, juzhang, knoel, mtosatti, rbalakri, rkrcmar, scui, tdosek, virt-maint, xfu, zhanghm.zhm |
| Target Milestone: | rc |
| Target Release: | --- |
| Keywords: | ZStream |
| Fixed In Version: | qemu-kvm-1.5.3-77.el7 |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| : | 1121550 1143054 (view as bug list) |
| Bug Blocks: | 1121550 |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Last Closed: | 2015-03-05 08:09:32 UTC |
Description
Marcelo Tosatti
2014-05-16 16:04:31 UTC
Tested this bug with qemu-kvm-1.5.3-62.el7.bz1076326.x86_64 and a RHEL 7.0 guest. These are the test steps and results.

Steps:

1. Sync time on host A:
#ntpdate clock.redhat.com

2. Sync time on host B:
#ntpdate clock.redhat.com

3. Boot the RHEL 7.0 guest on host A:
/usr/libexec/qemu-kvm -M pc -cpu Opteron_G5 -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1 -no-kvm-pit-reinjection -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7,num_queues=4 -drive file=/mnt/rhel7.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -vnc :1 -monitor stdio -net none -rtc base=utc,clock=host,driftfix=slew -serial unix:/tmp/monitor2,server,nowait

4. Boot the RHEL 7.0 guest on host B with -incoming:
/usr/libexec/qemu-kvm -M pc -cpu Opteron_G5 -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1 -no-kvm-pit-reinjection -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7,num_queues=4 -drive file=/mnt/rhel7.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -vnc :1 -monitor stdio -net none -rtc base=utc,clock=host,driftfix=slew -serial unix:/tmp/monitor2,server,nowait -incoming tcp:0:5555

5. Check the guest system time and the current clocksource inside the guest:
# date
Wed May 28 06:57:10 EDT 2014
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock

6. Check the host system time:
# date
Wed May 28 06:57:05 EDT 2014
So guest system time = host system time.

7. Load the guest CPUs with the stress tool inside the guest:
#stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10000s

8. Print the system time once per second with a script inside the guest:
while true;do date;sleep 1;done >time-result

9. Do ping-pong migration host A <--> host B.

10. Compare guest and host system time after 10 rounds.

Result:

Guest system time:
# date
Wed May 28 07:27:45 EDT 2014

Host system time:
# date
Wed May 28 07:28:52 EDT 2014

So the guest system time goes backward by about 1 minute after 10 rounds of ping-pong migration.

Marcelo, if my steps are wrong, please correct me. According to the result above, it seems the build qemu-kvm-1.5.3-62.el7.bz1076326.x86_64 didn't fix this issue. Please confirm.

(In reply to FuXiangChun from comment #6)
> If my steps are wrong, please correct me. According to the result above, it
> seems the build qemu-kvm-1.5.3-62.el7.bz1076326.x86_64 didn't fix this
> issue. Please confirm.

Fu,

The difference between guest/host time, after ping-pong migration with a loaded guest, should be the same with qemu-kvm-1.5.3-62.el7.x86_64, yes?

The patch fixes a different problem, which is, when executing from within the guest:

- R1 = read kvmclock
- R2 = read kvmclock
- R2 < R1 (smaller than)

(In reply to Marcelo Tosatti from comment #7)
> The difference between guest/host time, after ping-pong migration with a
> loaded guest, should be the same with qemu-kvm-1.5.3-62.el7.x86_64, yes?

According to the test steps in comment 6, QE retested qemu-kvm-1.5.3-60.el7.x86_64 and qemu-kvm-1.5.3-60.el7_0.2.x86_64 and got the same result as comment 6 (the guest system time goes backward about 1 minute after 10 rounds of ping-pong migration).
> The patch fixes a different problem, which is, when executing from within
> the guest:
>
> - R1 = read kvmclock
> - R2 = read kvmclock
> - R2 < R1 (smaller than)

Marcelo, QE needs to confirm a few questions with you.

Q1. How should we understand "R1 = read kvmclock", "R2 = read kvmclock" and "R2 < R1 (smaller than)"? It is not clear what R1, R2 and "read kvmclock" are. Are R1 and R2 the hardware clock (if so, QE can get it via the hwclock command)?

Q2. How should we understand "read kvmclock"? My understanding is that it is the system time, obtained with the date command, right?

Q3. QE did not find qemu-kvm-1.5.3-62.el7.x86_64 in brewweb. The latest qemu version there is qemu-kvm-1.5.3-60.el7_0.2.x86_64. How can we get it?

Q4. If this bug is fixed, the expected result is guest system time = host system time after ping-pong migration with a loaded guest, right?

Test summary: according to the steps in comment 6, qemu-kvm-1.5.3-62.el7.bz1076326.x86_64, qemu-kvm-1.5.3-60.el7.x86_64 and qemu-kvm-1.5.3-60.el7_0.2.x86_64 all gave the same result: the guest system time goes backward by about 1.5 minutes after 10 rounds of ping-pong migration.

(In reply to FuXiangChun from comment #8)
> Q1. How should we understand "R1 = read kvmclock", "R2 = read kvmclock" and
> "R2 < R1 (smaller than)"? Are R1 and R2 the hardware clock?

R1 and R2 are kvmclock reads (see the pvclock_clocksource_read function in arch/x86/kernel/pvclock.c).

> Q2. How should we understand "read kvmclock"? My understanding is that it is
> the system time, obtained with the date command, right?

No. You can verify it by using clock_gettime(CLOCK_MONOTONIC) from userspace, on a host with the TSC clocksource.

> Q3. QE did not find qemu-kvm-1.5.3-62.el7.x86_64 in brewweb. How can we get
> it?

I compiled it via GIT, so the source code is at http://git.app.eng.bos.redhat.com/virt/rhel7/qemu-kvm.git/

> Q4. If this bug is fixed, the expected result is guest system time = host
> system time after ping-pong migration with a loaded guest, right?

No, that is a different problem.

> Test summary: ... the guest system time goes backward by about 1.5 minutes
> after 10 rounds of ping-pong migration.

OK, thanks. I'll attach a testcase for this bug later.

Fix included in qemu-kvm-1.5.3-66.el7
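The following is an editorial sketch (not part of the original bug comments) of the backwards-read check Marcelo describes above: read the clock twice from inside the guest and flag R2 < R1. It uses clock_gettime(CLOCK_MONOTONIC), which is backed by kvm-clock when that is the guest's clocksource; the program name and build command are assumptions, not taken from this report.

```c
/* Sketch: detect backwards jumps between successive clock reads (R2 < R1).
 * Assumed build command: gcc -O2 -o clock-backwards clock-backwards.c
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t read_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    uint64_t r1 = read_ns();          /* R1 = read kvmclock */
    unsigned long fails = 0;

    for (;;) {
        uint64_t r2 = read_ns();      /* R2 = read kvmclock */
        if (r2 < r1) {                /* time went backwards */
            fails++;
            printf("backwards: r1=%llu r2=%llu (delta=%llu ns), fails=%lu\n",
                   (unsigned long long)r1, (unsigned long long)r2,
                   (unsigned long long)(r1 - r2), fails);
        }
        r1 = r2;
    }
}
```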
TESTCASE
-------------

Find a host machine with the following characteristics:

1) Using the TSC clocksource.

2) With a RHEL-7 guest running time-warp-test.c.

Check /var/lib/chrony/drift on the host; the first element must be negative, or alternatively the "chronyc tracking" command must report:

Frequency : xyz ppm slow

Then the necessary guest uptime can be calculated with:

h * 3600 * (ppm * 1/1000000) = 2*60

- where ppm is the parts-per-million frequency adjustment of the host as noted, without the negative sign;
- h is the hours of guest uptime necessary to achieve 2 minutes of drift.

2 minutes of drift should be sufficient for a time-backwards event (or guest hang) to be noticed across savevm/loadvm in a guest running time-warp-test.c (*).

(*) http://people.redhat.com/mingo/time-warp-test/time-warp-test.c
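As an editorial aid (not part of the testcase text), the uptime formula above can be turned into a small calculation; the 19.007 ppm figure used below is the value reported later in this bug, and everything else follows directly from the formula.

```c
/* Sketch: solve h * 3600 * (ppm / 1e6) = 2 * 60 for h (hours of guest uptime
 * needed for 2 minutes of drift), and also show the drift after 24 hours. */
#include <stdio.h>

int main(void)
{
    double ppm = 19.007;           /* host frequency adjustment, ppm slow (sign dropped) */
    double target_drift = 2 * 60;  /* 2 minutes, in seconds */

    double hours = target_drift / (3600.0 * (ppm / 1000000.0));
    printf("%.3f ppm -> %.0f hours (~%.0f days) of uptime for %.0f s of drift\n",
           ppm, hours, hours / 24.0, target_drift);

    double drift_24h = 24.0 * 3600.0 * (ppm / 1000000.0);
    printf("after 24 h of uptime: %.2f s of drift\n", drift_24h);
    return 0;
}
```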
Hi Marcelo,

I have hit a qemu core dump when testing s3 (savevm & loadvm) related operations using autotest.
version:
qemu-kvm-1.5.3-66.el7.x86_64
core dump info:
Core was generated by `/bin/qemu-kvm -S -name virt-tests-vm1 -sandbox off -M pc -nodefaults -vga cirru'.
Program terminated with signal 6, Aborted.
#0 0x00007fc86582b989 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007fc86582b989 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fc86582d098 in __GI_abort () at abort.c:90
#2 0x00007fc8658248f6 in __assert_fail_base (
fmt=0x7fc8659743e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fc86aedbfc0 "time.tsc_timestamp <= migration_tsc",
file=file@entry=0x7fc86aedbf88 "/builddir/build/BUILD/qemu-1.5.3/hw/i386/kvm/clock.c",
line=line@entry=64,
function=function@entry=0x7fc86aedc040 <__PRETTY_FUNCTION__.23497> "kvmclock_current_nsec") at assert.c:92
#3 0x00007fc8658249a2 in __GI___assert_fail (
assertion=assertion@entry=0x7fc86aedbfc0 "time.tsc_timestamp <= migration_tsc",
file=file@entry=0x7fc86aedbf88 "/builddir/build/BUILD/qemu-1.5.3/hw/i386/kvm/clock.c",
line=line@entry=64,
function=function@entry=0x7fc86aedc040 <__PRETTY_FUNCTION__.23497> "kvmclock_current_nsec") at assert.c:101
#4 0x00007fc86adaf930 in kvmclock_current_nsec (s=0x7fc86c1ef6b0)
at /usr/src/debug/qemu-1.5.3/hw/i386/kvm/clock.c:64
#5 kvmclock_vm_state_change (opaque=0x7fc86c1ef6b0, running=<optimized out>,
state=<optimized out>) at /usr/src/debug/qemu-1.5.3/hw/i386/kvm/clock.c:87
#6 0x00007fc86ad8216b in vm_state_notify (running=running@entry=1,
state=state@entry=RUN_STATE_RUNNING) at vl.c:1662
#7 0x00007fc86ad821ab in vm_start () at vl.c:1671
#8 0x00007fc86ad52485 in qmp_cont (errp=errp@entry=0x7fff550e4530) at qmp.c:179
#9 0x00007fc86ad4d518 in qmp_marshal_input_cont (mon=<optimized out>,
qdict=<optimized out>, ret=<optimized out>) at qmp-marshal.c:1318
#10 0x00007fc86add96c7 in qmp_call_cmd (cmd=<optimized out>, params=0x7fc86c9ecff0,
mon=0x7fc86c059950) at /usr/src/debug/qemu-1.5.3/monitor.c:4509
#11 handle_qmp_command (parser=<optimized out>, tokens=<optimized out>)
at /usr/src/debug/qemu-1.5.3/monitor.c:4575
#12 0x00007fc86ae86222 in json_message_process_token (lexer=0x7fc86c059de0,
token=0x7fc86c25b400, type=JSON_OPERATOR, x=37, y=158) at qobject/json-streamer.c:87
#13 0x00007fc86ae957af in json_lexer_feed_char (lexer=lexer@entry=0x7fc86c059de0,
ch=<optimized out>, flush=flush@entry=false) at qobject/json-lexer.c:303
#14 0x00007fc86ae9587e in json_lexer_feed (lexer=0x7fc86c059de0, buffer=<optimized out>,
size=<optimized out>) at qobject/json-lexer.c:356
#15 0x00007fc86ae863b9 in json_message_parser_feed (parser=<optimized out>,
buffer=<optimized out>, size=<optimized out>) at qobject/json-streamer.c:110
#16 0x00007fc86add8413 in monitor_control_read (opaque=<optimized out>,
buf=<optimized out>, size=<optimized out>) at /usr/src/debug/qemu-1.5.3/monitor.c:4596
#17 0x00007fc86ad471c1 in qemu_chr_be_write (len=<optimized out>,
buf=0x7fff550e4720 "}I\016U\377\177", s=0x7fc86c043600) at qemu-char.c:167
#18 tcp_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x7fc86c043600)
at qemu-char.c:2492
#19 0x00007fc86a080ac6 in g_main_dispatch (context=0x7fc86c043400) at gmain.c:3058
#20 g_main_context_dispatch (context=context@entry=0x7fc86c043400) at gmain.c:3634
#21 0x00007fc86ad19e9a in glib_pollfds_poll () at main-loop.c:187
#22 os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#23 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#24 0x00007fc86ac3ff70 in main_loop () at vl.c:1988
#25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4359
According to the core dump info and the qemu-kvm git log, I think the core dump was introduced by the patch for this bug.
I have also tested qemu-kvm-1.5.3-65.el7.x86_64, and that test passes.
Btw, I have not triggered this problem manually, only with autotest (100% reproducible). I will try more times to trigger it manually.
Based on the above, I think we should set this bug back to 'ASSIGNED'. Is that OK with you?
If there is anything wrong, feel free to correct me.
Thanks,
Cong
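For context on the assertion in the backtrace above: kvmclock_current_nsec in hw/i386/kvm/clock.c reads the guest's pvclock time structure from guest memory and converts the current TSC into nanoseconds, and the assertion fires when the structure's tsc_timestamp is newer than the TSC value QEMU uses for the migration. The code below is a simplified, editorial illustration of the pvclock arithmetic involved; it is not the RHEL/QEMU source, and the struct is reduced to the fields used here.

```c
/* Simplified illustration of how a pvclock/kvmclock reading becomes nanoseconds. */
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

struct pvclock_sketch {
    uint64_t tsc_timestamp;     /* TSC value when this structure was last updated */
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier: TSC ticks -> ns */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
};

static uint64_t pvclock_to_nsec(const struct pvclock_sketch *t, uint64_t current_tsc)
{
    /* Mirrors the check that fails in the backtrace: the snapshot must not be
     * newer than the TSC value used at migration time. */
    assert(t->tsc_timestamp <= current_tsc);

    uint64_t delta = current_tsc - t->tsc_timestamp;
    if (t->tsc_shift >= 0)
        delta <<= t->tsc_shift;
    else
        delta >>= -t->tsc_shift;

    /* 128-bit intermediate so the fixed-point multiply cannot overflow. */
    uint64_t nsec = (uint64_t)(((__uint128_t)delta * t->tsc_to_system_mul) >> 32);
    return nsec + t->system_time;
}

int main(void)
{
    struct pvclock_sketch t = {
        .tsc_timestamp = 1000000,
        .system_time = 5000,
        .tsc_to_system_mul = 1u << 31,  /* 0.5 ns per TSC tick in this example */
        .tsc_shift = 0,
    };
    /* 2000000 ticks after the snapshot -> 1000000 ns on top of system_time. */
    printf("%llu ns\n", (unsigned long long)pvclock_to_nsec(&t, 3000000));
    return 0;
}
```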
(In reply to CongLi from comment #15)
> Hi Marcelo,
>
> I have hit a qemu core dump when testing s3 (savevm & loadvm) related
> operations using autotest.

The qemu core dump is a kvmclock-related error.

Marcelo,
QE still cannot reproduce this bug with qemu-kvm-1.5.3-64.el7.x86_64. The following are the detailed steps. If there is any mistake, please correct me. Thanks.

1. Host A and host B sync their clocks from clock.redhat.com:
#ntpdate clock.redhat.com

2. Ensure both hosts are using the tsc clocksource:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

3. Boot a RHEL 7.0 guest with this command line:
/usr/libexec/qemu-kvm -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=4,cores=1,threads=1 -no-kvm-pit-reinjection -name rhel7.0 -uuid 990ea161-6b67-47b2-b803-19fb01d30d30 -rtc base=localtime,clock=host,driftfix=slew -drive file=/mnt/rhel7-64-ga.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,media=disk,aio=native,werror=stop,rerror=stop,serial=1234 -device virtio-blk-pci,drive=drive-virtio-disk,id=virtio-disk,bootindex=1 -monitor stdio -qmp tcp:0:5555,server,nowait -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -vnc :1

4. Run the test program inside the RHEL 7.0 guest:
./time-warp-test

5. Check "cat /var/lib/chrony/drift" inside the guest:
# cat /var/lib/chrony/drift
-21.219299 57.059106

6. Check "chronyc tracking":
Reference ID : 0.0.0.0 ()
Stratum : 0
Ref time (UTC) : Thu Jan 1 00:00:00 1970
System time : 0.000000000 seconds fast of NTP time
Last offset : 0.000000000 seconds
RMS offset : 0.000000000 seconds
Frequency : 21.219 ppm slow
Residual freq : 0.000 ppm
Skew : 0.000 ppm
Root delay : 0.000000 seconds
Root dispersion : 0.000000 seconds
Update interval : 0.0 seconds
Leap status : Not synchronised

7. Do migration to the destination host.

8. Check "cat /var/lib/chrony/drift" and "chronyc tracking" inside the guest.

Got the same result as in step 5 and step 6.

Also, correct a small mistake in the test program (http://people.redhat.com/mingo/time-warp-test/time-warp-test.c): change line 169
__asm__ __volatile__("movl $0,%0; rep; nop" : "=g"(*flag) :: "memory");
to
__asm__ __volatile__("mov $0,%0; rep; nop" : "=g"(*flag) :: "memory");
otherwise it fails to compile.

(In reply to FuXiangChun from comment #18)
> 5. Check "cat /var/lib/chrony/drift" inside the guest:
> # cat /var/lib/chrony/drift
> -21.219299 57.059106

The host must have a negative entry for the first element of /var/lib/chrony/drift, not the guest.
The guest must be left running for some time, at least 1 hour of uptime. Try 2 hours of uptime.

> 7. Do migration to the destination host.

No need to migrate to a destination host; savevm/loadvm on a single host is sufficient.

Fix included in qemu-kvm-1.5.3-77.el7

> > 5. Check "cat /var/lib/chrony/drift" inside the guest:
> > # cat /var/lib/chrony/drift
> > -21.219299 57.059106
>
> The host must have a negative entry for the first element of
> /var/lib/chrony/drift, not the guest.

I have a host using tsc whose drift is a negative value.

> The guest must be left running for some time, at least 1 hour of uptime.
>
> Try 2 hours of uptime.

Here do you mean that after the guest starts up, I should run time-warp-test.c in the guest for more than 2 hours, and at the same time keep doing savevm then loadvm?

Another question: according to your formula

h * 3600 * (ppm * 1/1000000) = 2*60

in my case I have "Frequency: 19.007 ppm slow", so the 'h' should be 33333 hours. Is there any way we can speed up the time needed to reproduce it?

(In reply to Chao Yang from comment #24)
> Another question: according to your formula
>
> h * 3600 * (ppm * 1/1000000) = 2*60
>
> in my case I have "Frequency: 19.007 ppm slow", so the 'h' should be 33333
> hours. Is there any way we can speed up the time needed to reproduce it?

h * 3600 * (ppm * 1/1000000) = 2*60
24 * 3600 * (0.0000190000) = x*60
1.6416000000 = x*60

So 1.6 seconds of drift in 24 hours. Given that time-warp-test is reading time values continuously, and that savevm stops the VM and immediately saves the KVM_GET_CLOCK value, 1.6 seconds should be enough.

I don't know of any method to speed that up off the top of my head; I will look it up and let you know.

Hi Marcelo,
I have been trying to reproduce this issue with qemu-kvm-1.5.3-60.el7.x86_64 with the following steps:
1. start a rhel 7 guest
2. run time-warp-test.c in guest
3. keep it running for more than 3 days
4. savevm then loadvm through monitor
Actual Result:
I stopped ntpd as well as chronyd both on the host and in the guest. After step 4, no hang happened in the guest.
Host info:
# cat /var/lib/chrony/drift
-18.903366 0.044623
# dmesg | grep -i clocksource
[ 0.163878] Switching to clocksource hpet
[ 1.407523] tsc: Refined TSC clocksource calibration: 3392.304 MHz
[ 1.407538] Switching to clocksource tsc
# uname -r
3.10.0-194.el7.x86_64
Questions:
1. When should I run the savevm/loadvm pair?
2. The compile instructions in time-warp-test.c do not work; it leads to a core dump when running. Is this normal?
3. I used "gcc -o time-warp-test.c time-warp-test.c" to compile and run. Is this OK?
4. time-warp-test reports "TSC: 2.30us, fail:0". What should it report when this issue is reproduced?
I'll attach the CLI used in my test.
(In reply to Chao Yang from comment #26)
> Actual Result:
> I stopped ntpd as well as chronyd both on the host and in the guest. After
> step 4, no hang happened in the guest.

You should not stop chronyd on the host. In the guest, you can stop it.

I have kept the VM running time-warp-test for 2 days, and this TSC host has -18 as the first element of /var/lib/chrony/drift. After savevm/loadvm I saw a 2-minute warp (the guest was 2 minutes slower than the host), but I didn't see any hang in the guest. Can I say I have reproduced the original issue? If not, what further operations should I do?

(In reply to Chao Yang from comment #29)
> After savevm/loadvm I saw a 2-minute warp (the guest was 2 minutes slower
> than the host), but I didn't see any hang in the guest. Can I say I have
> reproduced the original issue?

Where did you see the warp exactly? In the output of time-warp-test?

(In reply to Marcelo Tosatti from comment #30)
> Where did you see the warp exactly? In the output of time-warp-test?

No, it was from date, and the time in the guest caught up soon.
The warp test in the guest reported:

TSC: 2.36us, fail:0
TOD: 2.27us, fail:0
CLK: 2.25us, fail:0

Unless the warp test reports a failure in either TSC, TOD or CLK, I haven't reproduced it, right?

savevm then loadvm cannot reproduce this bug on a host with a guest that has been up for 13 days, with the warp test running in the guest.

Any suggestions for QE to reproduce and verify this bug?

Do you recall what kernel version was being used on the host? It should be easier to hit the bug with kernels < kernel-3.10.0-105.el7.

(In reply to Marcelo Tosatti from comment #32)
> Do you recall what kernel version was being used on the host?
>
> It should be easier to hit the bug with kernels < kernel-3.10.0-105.el7.

I was using kernel-3.10.0-194.el7.x86_64. Retrying with kernels < kernel-3.10.0-105.el7.

(In reply to Chao Yang from comment #33)
> I was using kernel-3.10.0-194.el7.x86_64. Retrying with kernels <
> kernel-3.10.0-105.el7.

I tested with kernel-3.10.0-104.el7 (both host and guest); after 24h of uptime the drift should be 1.27s, then savevm/loadvm; time-warp-test runs well and no hang happens.
(In reply to Chao Yang from comment #34)
> I tested with kernel-3.10.0-104.el7 (both host and guest); after 24h of
> uptime the drift should be 1.27s, then savevm/loadvm; time-warp-test runs
> well and no hang happens.

OK then, please mark the bug as verified (the standard kvmclock tests should be sufficient).

According to comment 35, setting this bz as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0349.html

The patch has been reverted upstream. I want to know what will happen in the RHEL downstream. Thanks.

https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg02811.html