Bug 1156194

Summary: [RHEV-M] Time sync after resuming from suspend
Product: Red Hat Enterprise Virtualization Manager Reporter: Marcelo Tosatti <mtosatti>
Component: vdsmAssignee: Milan Zamazal <mzamazal>
Status: CLOSED ERRATA QA Contact: Israel Pinto <ipinto>
Severity: medium Docs Contact:
Priority: high    
Version: 3.4.3CC: bazulay, danken, dgilbert, iheim, ipinto, jentrena, knoel, lpeer, lsurette, mavital, melewis, mgoldboi, michal.skrivanek, mkalinin, mprivozn, mtosatti, mzamazal, pagupta, pdwyer, sherold, srosenbe, uobergfe, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.2Keywords: ZStream
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, when a virtual machine was resumed from a suspended state its time was not updated, resulting in incorrect time in the guest operating system. This happened because the functionality to update time on resume was missing in VDSM. Now, the functionality has been added and the time is now updated after a virtual machine is resumed from a suspended state. As long as the guest operating system supports this feature and qemu-guest-agent is running. It is still recommended to have NTP services running in the guest operating system to provide precise time for the guest.
Story Points: ---
Clone Of:
: 1287620 (view as bug list) Environment:
Last Closed: 2016-03-09 19:26:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1115340    
Bug Blocks: 1230126, 1287620    
Attachments:
Description Flags
engine_log
none
vdsm_log
none
vm_qemu_log
none
libvirt_log none

Description Marcelo Tosatti 2014-10-23 18:59:54 UTC
RHEV-M should execute commands to sync guest time upon guest unpause.

This is RFE for OpenStack:
https://access.redhat.com/support/cases/00991668/

https://bugzilla.redhat.com/show_bug.cgi?id=1049040

Several customers have asked for this to be functional.

Comment 10 Marina Kalinin 2014-10-30 14:11:27 UTC
Marcelo, thanks a lot.
I updated the article: https://access.redhat.com/solutions/1244263

Comment 12 Michal Skrivanek 2014-10-31 12:01:28 UTC
(In reply to Marcelo Tosatti from comment #0)
> RHEV-M should execute commands to sync guest time upon guest unpause.

is there a reason why this would not be "hidden" in libvirt on cont?

Comment 13 Michal Privoznik 2014-10-31 13:44:58 UTC
(In reply to Michal Skrivanek from comment #12)
> (In reply to Marcelo Tosatti from comment #0)
> > RHEV-M should execute commands to sync guest time upon guest unpause.
> 
> is there a reason why this would not be "hidden" in libvirt on cont?

Yes there is. Firstly, the API comes in pair: get & set time. Secondly, libvirt is unable to make such decision whether guest time should be synced when doing domain resume. Libvirt knows nothing about guest time policy, whether it's using NTP or other mechanisms to sync time. It's mgmt app to make such decision.

Comment 14 Michal Skrivanek 2014-10-31 14:38:48 UTC
(In reply to Michal Privoznik from comment #13)
> (In reply to Michal Skrivanek from comment #12)
> > (In reply to Marcelo Tosatti from comment #0)
> > > RHEV-M should execute commands to sync guest time upon guest unpause.
> > 
> > is there a reason why this would not be "hidden" in libvirt on cont?
> 
> Yes there is. Firstly, the API comes in pair: get & set time. Secondly,
> libvirt is unable to make such decision whether guest time should be synced
> when doing domain resume. Libvirt knows nothing about guest time policy,
> whether it's using NTP or other mechanisms to sync time. It's mgmt app to
> make such decision.

Well, RHEV does not have that guest time policy either. There are no plans to add any.

If RHEV had such a mechanism it would have to be an interaction with the guest, i.e. using guest agent. And when you have such a thing there's no reason for anything else in qemu or libvirt, guest agent can just set the time to whatever.

If it's needed to have a decision whether to sync time or not, it needs to be part of the same call. And it doesn't make much sense to me to not make it a default

Comment 15 Michal Skrivanek 2014-11-04 10:06:16 UTC
since the resume API doesn't take any arguments yet we have to make an extra call in RHEV to sync the time.

Comment 19 Michal Skrivanek 2015-08-12 12:27:27 UTC
(In reply to Michal Privoznik from comment #13)
> (In reply to Michal Skrivanek from comment #12)
> > (In reply to Marcelo Tosatti from comment #0)
> > > RHEV-M should execute commands to sync guest time upon guest unpause.
> > 
> > is there a reason why this would not be "hidden" in libvirt on cont?
> 
> Yes there is. Firstly, the API comes in pair: get & set time. Secondly,
> libvirt is unable to make such decision whether guest time should be synced
> when doing domain resume. Libvirt knows nothing about guest time policy,
> whether it's using NTP or other mechanisms to sync time. It's mgmt app to
> make such decision.

I can't imagine a situation when someone would *not* want to reset the time to somethign close to reality. I think it makes sense to make it a libvirt default behavior. Scott?

Comment 20 Michal Privoznik 2015-08-12 13:50:36 UTC
(In reply to Michal Skrivanek from comment #19)
> (In reply to Michal Privoznik from comment #13)
> > (In reply to Michal Skrivanek from comment #12)
> > > (In reply to Marcelo Tosatti from comment #0)
> > > > RHEV-M should execute commands to sync guest time upon guest unpause.
> > > 
> > > is there a reason why this would not be "hidden" in libvirt on cont?
> > 
> > Yes there is. Firstly, the API comes in pair: get & set time. Secondly,
> > libvirt is unable to make such decision whether guest time should be synced
> > when doing domain resume. Libvirt knows nothing about guest time policy,
> > whether it's using NTP or other mechanisms to sync time. It's mgmt app to
> > make such decision.
> 
> I can't imagine a situation when someone would *not* want to reset the time
> to somethign close to reality. I think it makes sense to make it a libvirt
> default behavior. Scott?

Well, for instance if you have NTP enabled guest (which you want to maintain continuous time) and the time that domain was suspended was just a short while, e.g. couple of seconds. And what is still a short while and what is not - that's for mgmt app to decide, not libvirt.

Comment 21 Marcelo Tosatti 2015-08-12 21:28:38 UTC
(In reply to Michal Privoznik from comment #20)
> (In reply to Michal Skrivanek from comment #19)
> > (In reply to Michal Privoznik from comment #13)
> > > (In reply to Michal Skrivanek from comment #12)
> > > > (In reply to Marcelo Tosatti from comment #0)
> > > > > RHEV-M should execute commands to sync guest time upon guest unpause.
> > > > 
> > > > is there a reason why this would not be "hidden" in libvirt on cont?
> > > 
> > > Yes there is. Firstly, the API comes in pair: get & set time. Secondly,
> > > libvirt is unable to make such decision whether guest time should be synced
> > > when doing domain resume. Libvirt knows nothing about guest time policy,
> > > whether it's using NTP or other mechanisms to sync time. It's mgmt app to
> > > make such decision.
> > 
> > I can't imagine a situation when someone would *not* want to reset the time
> > to somethign close to reality. I think it makes sense to make it a libvirt
> > default behavior. Scott?
> 
> Well, for instance if you have NTP enabled guest (which you want to maintain
> continuous time) and the time that domain was suspended was just a short
> while, e.g. couple of seconds. And what is still a short while and what is
> not - that's for mgmt app to decide, not libvirt.

Michal,

Sorry, can't see why this example is a valid one. Please be more verbose?

Can't see why it should not be a libvirt default (the alternative is to expect 
all mgmt software to have this problem handled properly while users suffer 
from the bug).

Comment 22 Michal Privoznik 2015-08-13 08:00:16 UTC
Well, maybe that was not the best example. The other reason that libvirt should not do this automatically is that it would mask two actions into a single API. What should the API return if resuming succeeded but time reset didn't? It can't return success, it partially failed. So it will return failure. But the domain transitioned into a different state. So client needs to call yet another API to check the domain state. And since client ends up calling two APIs anyway, I rather not join two actions into a single one.

Comment 23 Michal Skrivanek 2015-08-13 08:09:25 UTC
(In reply to Michal Privoznik from comment #22)
> Well, maybe that was not the best example. The other reason that libvirt
> should not do this automatically is that it would mask two actions into a
> single API. What should the API return if resuming succeeded but time reset
> didn't? It can't return success, it partially failed. So it will return
> failure. But the domain transitioned into a different state. So client needs
> to call yet another API to check the domain state. And since client ends up
> calling two APIs anyway, I rather not join two actions into a single one.

I can give you a counterexample:) During migration in final phase when everything is finished the VM is resumed on destination from paused state. You can also view it as two independent actions...if resume is the only thing failing at the very end....

(restoring needinfo in comment #19)

Comment 24 Michal Privoznik 2015-08-13 09:06:22 UTC
(In reply to Michal Skrivanek from comment #23)
> (In reply to Michal Privoznik from comment #22)
>>

> I can give you a counterexample:) During migration in final phase when
> everything is finished the VM is resumed on destination from paused state.
> You can also view it as two independent actions...if resume is the only
> thing failing at the very end....

That's something slightly different. Either you have a live migration, where domain vCPUs are running and you don't care whether domain got some extra cpu time, or you are doing paused migration in which case the whole operation is atomic too.
This in fact is the reason I don't like the idea - we can't guarantee atomicity of two different operations, therefore they should not be hidden under the same API.

To extend my example, if libvirt would detect that partially failed state, during rollback it would need to suspend the domain again, so that domain state is the same as prior to calling the api. However, it would never be, because while libvirt is trying to reset the time, domain already is being scheduled and got a host cpu time.

> 
> (restoring needinfo in comment #19)

Ah, sorry for that.

Comment 25 Marcelo Tosatti 2015-08-14 11:14:45 UTC
(In reply to Michal Privoznik from comment #22)
> Well, maybe that was not the best example. The other reason that libvirt
> should not do this automatically is that it would mask two actions into a
> single API. What should the API return if resuming succeeded but time reset
> didn't? It can't return success, it partially failed. 

Success.

> So it will return failure. But the domain transitioned into a different state. 

How do you expect management to handle the failure?

> So client needs
> to call yet another API to check the domain state. And since client ends up
> calling two APIs anyway, I rather not join two actions into a single one.

I don't see a problem with that proposal.

Comment 26 Marcelo Tosatti 2015-08-14 11:24:51 UTC
(In reply to Marcelo Tosatti from comment #25)
> (In reply to Michal Privoznik from comment #22)
> > Well, maybe that was not the best example. The other reason that libvirt
> > should not do this automatically is that it would mask two actions into a
> > single API. What should the API return if resuming succeeded but time reset
> > didn't? It can't return success, it partially failed. 
> 
> Success.
> 
> > So it will return failure. But the domain transitioned into a different state. 
> 
> How do you expect management to handle the failure?
> 
> > So client needs
> > to call yet another API to check the domain state. And since client ends up
> > calling two APIs anyway, I rather not join two actions into a single one.
> 
> I don't see a problem with that proposal.

So the commands would be, from the mgmt perspective:

* current management sw:

1) resume
2) check for resume errors

outcomes:
* benefit: guest clock is synchronized if guest agent is setup properly.
* downside: none.

* clock sync aware management sw:
1) resume
2) check for resume errors
3) query clock sync initiated by resume
4) take action on 3.

Note that there is no strict need for atomicity because error checking 
for clock sync usually involves reporting (resume operation did not fail).

outcomes:
* benefit: guest clock is synchronized if guest agent is setup properly.
* downside: none.

Comment 28 Michal Privoznik 2015-10-12 11:32:34 UTC
(In reply to Marcelo Tosatti from comment #26)
> (In reply to Marcelo Tosatti from comment #25)
> > (In reply to Michal Privoznik from comment #22)
> > > Well, maybe that was not the best example. The other reason that libvirt
> > > should not do this automatically is that it would mask two actions into a
> > > single API. What should the API return if resuming succeeded but time reset
> > > didn't? It can't return success, it partially failed. 
> > 
> > Success.
> > 
> > > So it will return failure. But the domain transitioned into a different state. 
> > 
> > How do you expect management to handle the failure?
> > 
> > > So client needs
> > > to call yet another API to check the domain state. And since client ends up
> > > calling two APIs anyway, I rather not join two actions into a single one.
> > 
> > I don't see a problem with that proposal.
> 
> So the commands would be, from the mgmt perspective:
> 
> * current management sw:
> 
> 1) resume
> 2) check for resume errors

This ^^ will end up in two different API calls anyway. If the joint resume API returns an error, you have to call another API to see if domain is resumed and thus it's time sync what failed.

> 
> outcomes:
> * benefit: guest clock is synchronized if guest agent is setup properly.
> * downside: none.
> 
> * clock sync aware management sw:
> 1) resume
> 2) check for resume errors
> 3) query clock sync initiated by resume
> 4) take action on 3.
> 
> Note that there is no strict need for atomicity because error checking 
> for clock sync usually involves reporting (resume operation did not fail).
> 
> outcomes:
> * benefit: guest clock is synchronized if guest agent is setup properly.
> * downside: none.

So when there's not downside for any of the approaches, I think we should not wrap two semantically different operations into single API.

Comment 29 Michal Privoznik 2015-10-13 11:14:53 UTC
I've proposed an RFC upstream:

https://www.redhat.com/archives/libvir-list/2015-October/msg00382.html

Lets see what upstream thinks.

BTW: I'd appreciate any input from vdsm/RHEV developer here, since it's two low level developers discussing something here that upper layers may not even want.

Comment 30 Michal Skrivanek 2015-10-14 06:07:30 UTC
it looks settled now. So I'll just note my disagreement here below and we'll use the API in RHEV:)
I consider it an elementary functionality, one that you can reasonably expect any sane management client would want to do. As such an elementary management library/API should take care of such operations. 
It's not a big burden to push it up the stack, but I believe in the long run it would eliminate the extra work needed in higher management apps, eliminate repeating code flow patterns and issues

Comment 31 Moran Goldboim 2015-10-20 09:58:30 UTC
*** Bug 1247912 has been marked as a duplicate of this bug. ***

Comment 32 Marcelo Tosatti 2015-10-25 19:54:40 UTC
(In reply to Michal Skrivanek from comment #30)
> it looks settled now. So I'll just note my disagreement here below and we'll
> use the API in RHEV:)
> I consider it an elementary functionality, one that you can reasonably
> expect any sane management client would want to do. As such an elementary
> management library/API should take care of such operations. 
> It's not a big burden to push it up the stack, but I believe in the long run
> it would eliminate the extra work needed in higher management apps,
> eliminate repeating code flow patterns and issues

Agreed.

Comment 33 Michal Privoznik 2015-10-26 08:38:32 UTC
(In reply to Michal Skrivanek from comment #30)
> it looks settled now. So I'll just note my disagreement here below and we'll
> use the API in RHEV:)
> I consider it an elementary functionality, one that you can reasonably
> expect any sane management client would want to do. As such an elementary
> management library/API should take care of such operations. 
> It's not a big burden to push it up the stack, but I believe in the long run
> it would eliminate the extra work needed in higher management apps,
> eliminate repeating code flow patterns and issues

In fact, it would not. Either you will have:

if (virDomainResumeFlags($dom, SYNC_TIME) < 0) {
  if (virDomainIsActive())
    error_syncing_time;
  else
    error_resuming_domain;
}

you will need to:

if (virDomainResume($dom) < 0)
  error_resuming_domain;
if (virDomainSetTime($dom) < 0)
  error_syncing_time;


OR, in case as Marcelo suggested, return success even if syncing time failed (best effort):

if (virDomainResumeFlags($dom, SYNC_TIME) < 0)
  error_resuming_domain;

vs:

if (virDomainResume($dom) < 0)
  error_resuming_domain;
virDomainSetTime($dom);

At any rate - change in mgmt app code is required regardless if we wrap this into a single API because current resume API is missing flags. The only advantage is in case #3 where you save a single LOC.

Comment 36 Michal Skrivanek 2015-12-02 13:26:31 UTC
*** Bug 1270302 has been marked as a duplicate of this bug. ***

Comment 37 Israel Pinto 2016-01-21 16:31:25 UTC
Verify with 

Scenario:
Create VM with OS and start it, connect to VM and check time with date command
suspend VM for 5 minutes, connect VM and check time with date command.
Guest with RHEL 7.2 OS - check pass  
Guest with RHEL 6.7 OS - check failed - The time did not update to current time after resume VM. VM name vm_67 ID: 72021d20-3530-4a56-b98c-411d09dea9c6
See logs attached.

Comment 38 Israel Pinto 2016-01-21 16:35:16 UTC
(In reply to Israel Pinto from comment #37)
> Verify with 
> 
> Scenario:
> Create VM with OS and start it, connect to VM and check time with date
> command
> suspend VM for 5 minutes, connect VM and check time with date command.
> Guest with RHEL 7.2 OS - check pass  
> Guest with RHEL 6.7 OS - check failed - The time did not update to current
> time after resume VM. VM name vm_67 ID: 72021d20-3530-4a56-b98c-411d09dea9c6
> See logs attached.

Verify with: 
RHEVM Version: 3.6.2.6-0.1.el6
VDSM: vdsm-4.17.15-0.el7ev
libvirt:  libvirt-1.2.17-13.el7_2.2

Comment 39 Israel Pinto 2016-01-21 16:36:13 UTC
Created attachment 1116973 [details]
engine_log

Comment 40 Israel Pinto 2016-01-21 16:36:50 UTC
Created attachment 1116974 [details]
vdsm_log

Comment 41 Israel Pinto 2016-01-21 16:37:34 UTC
Created attachment 1116975 [details]
vm_qemu_log

Comment 42 Israel Pinto 2016-01-21 16:38:24 UTC
Created attachment 1116976 [details]
libvirt_log

Comment 43 Milan Zamazal 2016-01-22 09:51:26 UTC
According to the libvirt log, setting time was attempted by libvirt but failed.

What's the version of qemu-guest-agent on the RHEL6 guest? It seems the set-time capability wasn't introduced before 1.5 so if the version is older then setting time by QEMU is unsupported in the guest OS.

Comment 44 Israel Pinto 2016-01-24 07:56:30 UTC
I install qemu-guest-agent, and it works. 
Verify with: 
RHEVM Version: 3.6.2.6-0.1.el6
VDSM: vdsm-4.17.15-0.el7ev
libvirt:  libvirt-1.2.17-13.el7_2.2

qemu version on guest 6.7:
qemu-img-rhev-0.12.1.2-2.479.el6_7.3.x86_64
gpxe-roms-qemu-0.9.7-6.14.el6.noarch
qemu-guest-agent-0.12.1.2-2.479.el6_7.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.479.el6_7.3.x86_64

Comment 46 errata-xmlrpc 2016-03-09 19:26:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html