Bug 1635845

Summary: Report Guest-OS uptime in RHV-M
Product: Red Hat Enterprise Virtualization Manager
Reporter: Bimal Chollera <bcholler>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: low
Docs Contact:
Version: 4.2.6
CC: ahadas, bcholler, mavital, michal.skrivanek, mjankula, mkalinin, mtessun, mzamazal, pmatyas, rbarry, Rhev-m-bugs, sborella, trailtotale
Target Milestone: ovirt-4.3.2
Keywords: FutureFeature, Rebase
Target Release: 4.3.0
Flags: pmatyas: testing_plan_complete+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This release ensures that the VM uptime is cleared during a guest operating system reboot, so that the uptime displayed corresponds to the guest operating system.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-08 12:38:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1547768

Description Bimal Chollera 2018-10-03 18:53:02 UTC
Description of problem:

Virtual Machine 'Uptime' data is not updated in the RHV-M Webadmin/GUI/Portal when a VM is rebooted, and it doesn't match the uptime reported inside the VM.  This problem occurs whether the VM is rebooted from the RHV-M Webadmin/GUI/Portal or from within the VM: the portal continues to show the old Uptime data from before the reboot.

The 'Uptime' data gets updated only when the VM is powered off and started again.

Version-Release number of selected component (if applicable):

ovirt-engine-4.2.6.4-0.1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1.  	From RHV Webadmin/GUI/Portal -> Compute -> Virtual Machine
	Check the Uptime column in RHV Webadmin/GUI/Portal
	Right click and reboot the VM
	Check the Uptime column in RHV Webadmin/GUI/Portal; it will not be updated.
	Verify the uptime from within the VM


2.  	Reboot the VM from within the guest.
	From RHV Webadmin/GUI/Portal -> Compute -> Virtual Machine
	Check the Uptime column in RHV Webadmin/GUI/Portal; it will not be updated.


Actual results:

Uptime data doesn't match the uptime data in the VM.
The RHV Webadmin/GUI/Portal continues to reflect the old Uptime data from before the VM was rebooted.

Expected results:

Uptime data should match the uptime data in the VM.

Additional info:

Comment 1 Bimal Chollera 2018-10-03 18:55:11 UTC
This problem is not seen in RHV-M version 4.1.11.

Comment 3 Bimal Chollera 2018-10-03 19:53:18 UTC
Running "reboot" from within the guest isn't enough to trigger a change to the VM Uptime in the RHV-M Webadmin/GUI/Portal, as the qemu-kvm process never halts.
The qemu-kvm process has to halt for that value to change, which is what we observed when the guest is powered off and started.  It appears the Uptime in RHV-M is the uptime of the qemu-kvm process rather than the actual uptime of the VM.
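
For illustration, the qemu-kvm process uptime can be checked on the host and compared with the uptime reported inside the guest. A minimal sketch, assuming a Linux host and a known qemu-kvm PID (the helper name is hypothetical, not part of RHV):

    import os

    def qemu_process_uptime(pid):
        """Seconds since the given qemu-kvm process started (Linux /proc)."""
        ticks_per_sec = os.sysconf('SC_CLK_TCK')
        with open('/proc/%d/stat' % pid) as f:
            # Field 22 of /proc/<pid>/stat is the process start time,
            # in clock ticks since the host booted.
            start_ticks = int(f.read().rsplit(')', 1)[1].split()[19])
        with open('/proc/uptime') as f:
            host_uptime = float(f.read().split()[0])
        return host_uptime - start_ticks / ticks_per_sec

After a guest-initiated reboot this value keeps growing, while "uptime" inside the guest starts over, which matches the observation above.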

Comment 4 Martin Tessun 2018-10-10 11:56:42 UTC
Hi Bimal,

(In reply to Bimal Chollera from comment #0)
> Description of problem:
> 
> Virtual Machine 'Uptime' data is not updated in the RHV-M
> Webadmin/GUI/Portal when a VM is rebooted, and it doesn't match the uptime
> reported inside the VM.  This problem occurs whether the VM is rebooted
> from the RHV-M Webadmin/GUI/Portal or from within the VM: the portal
> continues to show the old Uptime data from before the reboot.

That is expected as long as the VM process (qemu-kvm) has not been restarted. The uptime of the VM is typically reported from within the VM (e.g., in RHEL using the uptime command).

> 
> The 'Uptime' data gets updated only when the VM is powered off and started again.
> 

That's expected, as this is what the uptime in RHV shows: the uptime of the qemu-kvm process the VM is running in.
So if you start a VM and leave it sitting in the BIOS for 2 hours without booting any OS at all, you will see 2 hours of uptime.

So this is not a bug, but a feature request.

As such: what exactly do you need, and what is the business justification for it? I don't see any reason for having that kind of info in RHV-M.
Unless there is a good justification, I would close this request.

If you want to recycle the qemu process with an RHV-initiated reboot, you can still do so. Otherwise, I would expect customers to have monitoring in place that reports the uptime of their VMs.

As said, the uptime reported in RHV is the uptime/runtime of the qemu process, not that of the guest OS.

Comment 5 Martin Tessun 2018-10-10 11:57:57 UTC
(In reply to Bimal Chollera from comment #1)
> This problem is not seen in RHV-M version 4.1.11.

It is. Just start the VM, leave it sitting in the BIOS for some hours, and then start the boot process.
The uptime in the VM and in RHV will differ by that amount of time.

Comment 7 Michal Skrivanek 2018-10-25 06:50:25 UTC
It's not supposed to have changed, actually; this looks like a bug. The code looks good to me, though: _startTime is reset in onReboot().
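
For context, a simplified sketch of the bookkeeping described here (illustrative Python, not the actual Vdsm code): _startTime is reset when the reboot event fires, so the reported elapsedTime starts over from zero.

    import time

    class VmUptimeSketch:
        """Simplified model of how Vdsm tracks a VM's elapsed time."""

        def __init__(self):
            self._startTime = time.time()

        def onReboot(self):
            # Called on the libvirt reboot event; resetting _startTime
            # makes the reported elapsedTime start over from zero.
            self._startTime = time.time()

        def getStats(self):
            return {'elapsedTime': int(time.time() - self._startTime)}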

Comment 9 Milan Zamazal 2018-10-25 13:39:05 UTC
elapsedTime is reported correctly by Vdsm in the VM stats, and it resets on a reboot from within the guest. But the Uptime shown in Engine is measured since VM start, not the value obtained from Vdsm.
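
For illustration, the mismatch looks roughly like this (a sketch in Python pseudocode; the engine itself is Java and these names are hypothetical):

    import time

    def uptime_shown_by_engine(vm_boot_time):
        # Engine-side calculation: measured from when the engine started
        # the VM, so a guest reboot never resets it.
        return time.time() - vm_boot_time

    def uptime_reported_by_vdsm(vm_stats):
        # Vdsm-side value: the elapsedTime stat, which is reset when the
        # guest reboots.
        return float(vm_stats['elapsedTime'])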

I'm not sure it's related, but this looks suspicious: https://gerrit.ovirt.org/78620, like a change that aims to improve uptime accuracy but ignores reboots.

Arik, could you please clarify what value Engine uses to display Uptime in the VM list in Webadmin?

Comment 10 Arik 2018-10-25 14:10:49 UTC
(In reply to Milan Zamazal from comment #9)
> Arik, could you please clarify what value Engine uses to display Uptime in
> the VM list in Webadmin?

Your analysis in comment 8 is correct - indeed, we now present a value that is calculated on the engine side and reflects the uptime of the qemu process. My understanding was that since this value was reported by vdsm even when no guest agent is installed, it reflects the uptime of the qemu process rather than that of the guest (for the same reasons Martin mentioned in comment 4, and because VDSM might lose track of the guest uptime if the guest is rebooted while VDSM is down).

If the guest uptime is needed, I would propose adding it to the guest agent so we can simply present it, without needing to adjust that value when resuming a suspended VM (which requires us to hold metadata along with memory snapshots).

Comment 11 Marina Kalinin 2018-10-25 16:59:47 UTC
Arik,
Thanks for your comment, but I am not sure I understand you 100%.
How did it work before the changes you are suggesting for the guest agent?

Comment 12 Marina Kalinin 2018-10-25 17:02:31 UTC
And on that note: what customers would prefer to see is definitely the uptime of the guest OS, so this should be fixed.
One other example of this is live migration: the qemu-kvm process is created anew on the destination host. What uptime is reported in that scenario?

Comment 13 Arik 2018-10-25 17:51:36 UTC
Previously, the engine used the value reported by VDSM, which is basically the same except that the latter is reset on guest reboot. Assuming that the user is not interested in the exact uptime (i.e., this uptime also includes the time to boot) and that VDSM catches all guest reboots, that value is generally fine and close to the uptime within the guest.

However, from a code perspective it is awkward. It would be much simpler for VDSM to retrieve it via the guest agent rather than "from outside" by tracing events from libvirt.

The real reason for this change, though, was to enable removing the metadata volume of each memory snapshot/hibernated VM (1 GB each on block storage; in the end we didn't remove it).

As for live migration, the uptime is not reset when the VM is migrated, only when the VM boots (so it is also not reset when the memory of the VM is restored).

Comment 16 Marina Kalinin 2018-10-26 14:05:57 UTC
(In reply to Arik from comment #13)
> Previously, the engine used the value reported by VDSM, which is basically
> the same except that the latter is reset on guest reboot. Assuming that
> the user is not interested in the exact uptime (i.e., this uptime also
> includes the time to boot) and that VDSM catches all guest reboots, that
> value is generally fine and close to the uptime within the guest.
> 
> However, from a code perspective it is awkward. It would be much simpler
> for VDSM to retrieve it via the guest agent rather than "from outside" by
> tracing events from libvirt.
> 
> The real reason for this change, though, was to enable removing the
> metadata volume of each memory snapshot/hibernated VM (1 GB each on block
> storage; in the end we didn't remove it).
Do I understand correctly that the metadata was not removed? Are there plans to remove it by the storage team?
> 
> As for live migration, the uptime is not reset when the VM is migrated,
> only when the VM boots (so it is also not reset when the memory of the VM
> is restored).

Comment 17 Marina Kalinin 2018-10-26 14:10:38 UTC
(In reply to Ryan Barry from comment #15)
> Granted, but the difficulty here is that any guest agent reporting is going
> to be a dead issue with RHEL8 guests, along with ovirt-guest-agent. Even if
> it is seen as a regression, we are basically left with 2 choices on RHEL8:
> 
> 1) Reporting the uptime the way it is at present
> 2) Reverting the changes Arik mentioned to keep the metadata block around
> for snapshots again, which increases overhead costs on storage
> 
> There's an old bug requesting this functionality from qemu-guest-agent,
> which we'd need to push and re-open, assuming platform will take it (but
> this still needs a guest agent):
> https://bugzilla.redhat.com/show_bug.cgi?id=1369850
> 
> Alternatively, we do have the option of a partial fix. If the customer is
> willing to accept the current uptime tracking mechanism for the cases Arik
> mentioned (memory snapshot/hibernated VM), we can get the uptime for
> "active" VMs reasonably easily through elapsedTime from vdsm.
> 
> Does this work for the customer?
It seems to me from reading the comments that using elapsedTime from vdsm is the best option for now, since this is how we used to do it before.
And we should add a note to the docs saying that this value is not accurate, for the reasons mentioned.

For the future, if we believe we want to go with this change, we need to work on a proper solution: get the fix into qemu-kvm and adjust our code to behave as correctly as possible whether a guest agent is installed or not.
I can open a separate RFE for this and reopen the qemu-kvm BZ.

For now, I suggest rolling back this change and reporting the value closest to the actual uptime.

Comment 18 Ryan Barry 2018-10-26 14:15:21 UTC
(In reply to Marina from comment #17)
> It seems to me from reading the comments that using elapsedTime from vdsm
> is the best option for now, since this is how we used to do it before.
> And we should add a note to the docs saying that this value is not
> accurate, for the reasons mentioned.

Well, we used to track it in metadata. The only reason that using this would no longer be accurate is in the case of hibernation/memory snapshots (and may actually be accurate then -- we'd need to test the value of elapsedTime to see what it reports once a snapshot is restored).

> 
> For the future, if we believe we want to go with this change, we need to
> work on a proper solution: get the fix into qemu-kvm and adjust our code
> to behave as correctly as possible whether a guest agent is installed or
> not.
> I can open a separate RFE for this and reopen the qemu-kvm BZ.

This will still require a guest agent no matter what.

> 
> For now, I suggest rolling back this change and reporting the value
> closest to the actual uptime.

It's essentially a whole new bug, since that patch was merged over a year ago along with a number of other related changes. It may be possible to roll it back, but that would also bar us from ever removing metadata volumes in the future when it becomes practical. Reverting a year-old patch is not practical.

Please ask the customer whether using elapsedTime is acceptable, which will report accurate uptimes for everything but snapshots/hibernation (and maybe those also; needs testing).

Comment 19 Michal Skrivanek 2018-10-26 14:58:32 UTC
(In reply to Ryan Barry from comment #18)
> (In reply to Marina from comment #17)
> > It seems to me from reading the comments that using elapsedTime from
> > vdsm is the best option for now, since this is how we used to do it
> > before.
> > And we should add a note to the docs saying that this value is not
> > accurate, for the reasons mentioned.
> 
> Well, we used to track it in metadata. The only reason that using this would
> no longer be accurate is in the case of hibernation/memory snapshots (and
> may actually be accurate then -- we'd need to test the value of elapsedTime
> to see what it reports once a snapshot is restored).

It is supposed to work; it used to work just fine. The tracking logic is fairly simple.


> > For the future, if we believe we want to go with this change, we need to
> > work on a proper solution: get the fix into qemu-kvm and adjust our code
> > to behave as correctly as possible whether a guest agent is installed or
> > not.
> > I can open a separate RFE for this and reopen the qemu-kvm BZ.
> 
> This will still require a guest agent no matter what.
> 
> > 
> > For now, I suggest rolling back this change and reporting the value
> > closest to the actual uptime.
> 
> It's essentially a whole new bug, since that patch was merged over a year
> ago along with a number of other related changes. It may be possible to
> roll it back, but that would also bar us from ever removing metadata
> volumes in the future when it becomes practical. Reverting a year-old
> patch is not practical.
> 
> Please ask the customer whether using elapsedTime is acceptable, which
> will report accurate uptimes for everything but snapshots/hibernation
> (and maybe those also; needs testing).


It's fairly accurate, so there should be no need to even ask: just report the right thing ;)
The guest-agent approach is out of the question; no need to sidetrack the discussion.

Comment 20 Arik 2018-10-26 18:49:51 UTC
(In reply to Marina from comment #16)
> Do I understand correctly that the metadata was not removed? Are there plans
> to remove it by the storage team?
Yes, unfortunately, we didn't remove it.
Regarding future plans, I don't know, but in any case, it's in the scope of the virt team.

(In reply to Michal Skrivanek from comment #19)
> The guest-agent approach is out of the question; no need to sidetrack the discussion.
Can you please explain why? Or is it due to non-technical reasons?

Comment 21 Michal Skrivanek 2018-10-30 09:09:38 UTC
This functionality was nacked in qemu-ga 2 years ago as a non-virt feature. It makes much more sense to track that "externally", as info from the guest can be spoofed easily; whenever there is a way to do things without the guest knowing, it's more accurate.

AFAICT elapsedTime is reported in the running stats, calculated from startTime, which is correctly handed over during migrations and hibernate/resume/recovery, so all that is needed is to just show it.
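
In other words, the fix boils down to displaying the Vdsm-reported stat instead of an engine-side calculation. A hedged sketch, assuming a stats dict as returned by Vdsm (illustrative, not the actual ovirt-engine code, which is Java):

    def displayed_uptime_seconds(vm_stats):
        # Prefer Vdsm's elapsedTime: it is reset on the libvirt reboot
        # event and carried over across migration and hibernate/resume,
        # so it tracks the guest far more closely than "now - boot_time".
        return int(float(vm_stats.get('elapsedTime', 0)))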

Comment 22 Michal Skrivanek 2018-10-30 09:19:31 UTC
Hm, the changes in 72e580ab63e73689c545573e5c5aa13a022964c4 look suspicious: you're using the boot_time property reported in the VM stats from VDSM, but I don't see that reported by VDSM for VMs, only for the host.

Comment 23 Arik 2018-10-31 09:12:29 UTC
(In reply to Michal Skrivanek from comment #22)
> Hm, the changes in 72e580ab63e73689c545573e5c5aa13a022964c4 look
> suspicious: you're using the boot_time property reported in the VM stats
> from VDSM, but I don't see that reported by VDSM for VMs, only for the
> host.

That's not something the engine gets from VDSM; it is the time on the engine when the VM was started.

Comment 24 Arik 2018-10-31 09:16:00 UTC
(In reply to Michal Skrivanek from comment #21)
> This functionality was nacked in qemu-ga 2 years ago as a non-virt feature.

Ah, so maybe there's a chance to change that, right? Two years ago we had both qemu-ga and ovirt-ga, and given that separation I could understand why the qemu guys would say "application list, uptime and so on should not be reported by us but rather by ovirt-ga". But now that we have moved things from ovirt-ga to qemu-ga, wouldn't it make sense to incorporate the OS uptime into qemu-ga as well?

> It makes much more sense to track that "externally", as info from the
> guest can be spoofed easily; whenever there is a way to do things without
> the guest knowing, it's more accurate.
> 
> AFAICT elapsedTime is reported in the running stats, calculated from
> startTime, which is correctly handed over during migrations and
> hibernate/resume/recovery, so all that is needed is to just show it.

Well, when it's about things you know better from "the outside", I agree. But when you are asked to reflect a value that is stored by the guest, it seems very awkward not to ask the guest for it :)

Comment 25 Michal Skrivanek 2018-10-31 10:56:20 UTC
(In reply to Arik from comment #24)
> (In reply to Michal Skrivanek from comment #21)
> > This functionality was nacked in qemu-ga 2 years ago as a non-virt feature.
> 
> Ah, so maybe there's a chance to change that, right? Two years ago we had
> both qemu-ga and ovirt-ga, and given that separation I could understand
> why the qemu guys would say "application list, uptime and so on should not
> be reported by us but rather by ovirt-ga". But now that we have moved
> things from ovirt-ga to qemu-ga, wouldn't it make sense to incorporate the
> OS uptime into qemu-ga as well?

So per them, no. Feel free to try again, of course, but seeing how "fast" the other changes are getting in, I don't think the chances are high.

> > It makes much more sense to track that "externally", as info from the
> > guest can be spoofed easily; whenever there is a way to do things
> > without the guest knowing, it's more accurate.
> > 
> > AFAICT elapsedTime is reported in the running stats, calculated from
> > startTime, which is correctly handed over during migrations and
> > hibernate/resume/recovery, so all that is needed is to just show it.
> 
> Well, when it's about things you know better from "the outside", I agree.
> But when you are asked to reflect a value that is stored by the guest, it
> seems very awkward not to ask the guest for it :)

Sure, but this one is still from the outside. The reboot event is something the guest doesn't know anything about either. It's still an uptime since you initiated booting, rather than "since the OS became fully ready".

Comment 26 Ryan Barry 2018-10-31 20:51:23 UTC
Going with elapsedTime...

Comment 28 Ryan Barry 2019-01-21 14:53:54 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 30 Petr Matyáš 2019-03-07 10:59:28 UTC
Verified on ovirt-engine-4.3.2-0.1.el7.noarch

Comment 32 errata-xmlrpc 2019-05-08 12:38:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085