Bug 1620573 - [downstream clone - 4.2.7] [RFE] Time sync in VM after resuming from PAUSE state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.7
Target Release: ---
Assignee: Steven Rosenberg
QA Contact: Vitalii Yerys
URL:
Whiteboard:
Depends On: 1510856
Blocks:
 
Reported: 2018-08-23 08:18 UTC by RHV bug bot
Modified: 2021-12-10 17:06 UTC
CC: 14 users

Fixed In Version: v4.20.40
Doc Type: Enhancement
Doc Text:
Large snapshots can result in long pauses of a VM, which can affect the accuracy of its system time and of the time stamps and other time-related functions that depend on it. This release adds Guest Time Synchronization: when the feature is enabled and the guest agent is running, the VDSM process on the host attempts to synchronize the VM's system time with the host's system time when a snapshot completes and the VM is un-paused. To enable guest time synchronization for snapshots, use the time_sync_snapshot_enable option. To synchronize the VM's system time after other abnormal scenarios that may pause the VM, enable the time_sync_cont_enable option. By default, both options are disabled for backward compatibility.
Clone Of: 1510856
Environment:
Last Closed: 2018-11-05 15:02:07 UTC
oVirt Team: Virt
Target Upstream Version:

Links:
- Red Hat Issue Tracker RHV-44120 (last updated 2021-11-30 17:05:30 UTC)
- Red Hat Knowledge Base (Solution) 6547481 (last updated 2021-11-30 16:59:12 UTC)
- Red Hat Product Errata RHEA-2018:3478 (last updated 2018-11-05 15:02:51 UTC)
- oVirt gerrit 93963 (ovirt-4.2, MERGED): virt: vm: Update guest time after VM un-pausing (last updated 2018-08-28 12:54:18 UTC)

Description RHV bug bot 2018-08-23 08:18:29 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1510856 +++
======================================================================

Description of problem:
Thanks to BZ#1156194, when a VM is suspended and later restored, its time is synced.

The request is to extend this automatic time sync to the case where the VM resumes from a Pause state.

Version-Release number of selected component (if applicable):
RHEV 4.1.x

How reproducible:
Always

Steps to Reproduce:
1. Turn on a VM with the guest agent enabled.
2. Create a transient storage issue, or let a high-impact IO operation (such as a snapshot with memory for a huge VM) happen.
3. The VM goes into a Pause state.
4. Solve the issue, or wait for the IO operation to finish.

Actual results:
The VM's time is misaligned after it resumes.

Expected results:
RHV forces a time sync, as it does when the VM resumes from the Suspend state.

Additional info:
This RFE is critical for time-sensitive workloads running on RHV (transactional apps and so on).

(Originally by Andrea Perotti)

Comment 3 RHV bug bot 2018-08-23 08:18:45 UTC
Possible, but not sure whether it needs to be configurable.

(Originally by michal.skrivanek)

Comment 4 RHV bug bot 2018-08-23 08:18:51 UTC
Can you add more details about the actual scenario? Even when a VM pauses due to drive extension, it shouldn't take much time; no more than a few seconds, which are better handled by NTP inside the guest than by abrupt clock changes made externally.

(Originally by michal.skrivanek)

Comment 5 RHV bug bot 2018-08-23 08:18:56 UTC
The scenario we are talking about is a very time-sensitive, in-memory app, like a JBoss Data Grid installation, running on a VM with a huge amount of RAM.

Dealing with a pause of that VM *can* be worked out with NTP, but that requires a constantly aggressive configuration of the tool, while having the same behaviour for paused VMs as for suspended VMs can be more practical for some users.

Eventually this could be made configurable, e.g. setting after how many seconds in the Pause state RHV should enforce the clock change.

(Originally by Andrea Perotti)

Comment 6 RHV bug bot 2018-08-23 08:19:02 UTC
(In reply to Andrea Perotti from comment #4)
> The scenario we are talking about is a very time-sensitive, in-memory app,
> like a JBoss Data Grid installation, running on a VM with a huge amount of
> RAM.
> 
> Dealing with a pause of that VM *can* be worked out with NTP, but that
> requires a constantly aggressive configuration of the tool, while having
> the same behaviour for paused VMs as for suspended VMs can be more
> practical for some users.

If it is a time-sensitive app, wouldn't it be better to avoid ENOSPC pause states in the first place? For example: bigger allocation chunks, a lower watermark so the drive starts extending sooner, a bigger initial size for the thin-provisioned disk, etc. (a configuration sketch follows this comment).

> Eventually this could be made configurable, e.g. setting after how many
> seconds in the Pause state RHV should enforce the clock change.

Creating a config option to do a time sync after resume from pause is feasible; I would still leave it off by default, though. Implementing a configurable interval is more complicated and would delay this RFE, but if that's required, it's doable too.

Still, before starting on this, I believe we should check whether we are really solving the right thing; making VMs not pause in the first place might make more sense.

(Originally by michal.skrivanek)

Comment 7 RHV bug bot 2018-08-23 08:19:06 UTC
(In reply to Michal Skrivanek from comment #5)
> If it is a time-sensitive app, wouldn't it be better to avoid ENOSPC pause
> states in the first place? For example: bigger allocation chunks, a lower
> watermark so the drive starts extending sooner, a bigger initial size for
> the thin-provisioned disk, etc.

The customer is triggering this event when doing a full snapshot of a VM including memory, but transient storage connectivity issues can also lead to a Pause.

> Creating a config option to do a time sync after resume from pause is
> feasible; I would still leave it off by default, though.

I think just having it would be good enough for my customer, and having it now is more important than having it perfectly configurable.

(Originally by Andrea Perotti)

Comment 9 RHV bug bot 2018-08-23 08:19:16 UTC
Overall, I believe we should address the reason for the pausing in the first place. If it happens for snapshots, we should also check whether we can get rid of that pausing.

(Originally by Martin Tessun)

Comment 11 RHV bug bot 2018-08-23 08:19:26 UTC
Perhaps the after_vm_cont hook can be used? Hopefully it doesn't suffer from the same problem as the after_vm_pause hook in bug 1543103.

Other than that, the solution could look similar to OpenStack's https://review.openstack.org/#/c/316116/

(Originally by michal.skrivanek)

Comment 12 RHV bug bot 2018-08-23 08:19:31 UTC
I'll add another large application to this one too. In this case, storage goes offline, the VM pauses, storage comes back, and the VM resumes, but now its clock is way off, and that sets off a whole bad cascade of events.

There is nothing we can do about the storage going offline, and pausing the VM is the correct action when that happens. But if we can inject the correct time into that VM when it resumes, we can make lots of people happy.

- Greg

(Originally by Greg Scott)

Comment 26 Vitalii Yerys 2018-09-07 15:36:24 UTC
Verified upstream:

ovirt-engine-4.2.6.5-0.0.master.20180831090131.git1d64d4c.el7.noarch
vdsm-4.20.39-6.git00d5340.el7.x86_64

Comment 28 Raz Tamir 2018-09-19 12:30:38 UTC
QE verification bot: the bug was verified upstream

Comment 30 errata-xmlrpc 2018-11-05 15:02:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3478

