1370120 – Inform the user HE VM needs to be restarted after cluster upgrade

Summary: Inform the user HE VM needs to be restarted after cluster upgrade

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.HostedEngine
Sub Component:
Version:	4.0.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.3.0
Target Release:	---
Assignee:	Phillip Bailey
QA Contact:	Nikolai Sednev
Docs Contact:
URL:
Whiteboard:
Depends On:	1364132
Blocks:

Reported:	2016-08-25 11:19 UTC by sefi litmanovich
Modified:	2023-09-14 03:30 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-03-09 12:55:04 UTC
oVirt Team:	SLA
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.3? mgoldboi: planning_ack+ dfediuck: devel_ack+ mavital: testing_ack+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1358383	0	high	CLOSED	The engine VM doesn't restart on Conroe hosts regardless of cluster CPU level	2021-02-22 00:41:40 UTC
oVirt gerrit	78340	0	master	ABANDONED	WIP >>> Include HE VM in cluster upgrade flow	2018-02-18 01:56:16 UTC

Internal Links: 1358383

Description sefi litmanovich 2016-08-25 11:19:44 UTC

Description of problem:

A new cluster upgrade flow is introduced in bz#1356027.
For normal VMs this means that, e.g., in a cluster compatibility upgrade from 3.6 to 4.0, any running VM is automatically set to the 3.6 compatibility version, and after the upgrade is done it is marked for restart so that it starts over with 4.0 compatibility. This allows the VMs to keep running after the cluster upgrade and not break due to inconsistencies between different compatibility versions.

This flow was tested with a pre-integration build, with HE (using the HE 3.6 to 4.0 migration flow: http://www.ovirt.org/develop/release-management/features/hosted-engine-migration-to-4-0/ ).
The behaviour of the HE VM after the cluster upgrade from 3.6 to 4.0 was different:

1. No mark for restart due to changed configuration appeared.
2. The compatibility version of the VM was changed to 4.0 (following the cluster), which means that the new flow doesn't touch HE VMs.
3. The VM in fact remained a 3.6 VM (the domain XML did not change, even after restarting the VM with hosted-engine --vm-poweroff ; hosted-engine --vm-start).
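One way to check point 3 is to read the machine type out of the VM's libvirt domain XML before and after the restart. The following is a minimal sketch, assuming the XML has already been dumped to a string; the sample document and the machine value in it are placeholders for illustration, not output captured from this environment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real dump would come from the host running the HE VM.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>HostedEngine</name>
  <os>
    <type arch='x86_64' machine='rhel-6.5.0'>hvm</type>
  </os>
</domain>
"""


def machine_type(domain_xml):
    """Return the machine attribute of <os><type>, or None if it is absent."""
    root = ET.fromstring(domain_xml)
    os_type = root.find("./os/type")
    if os_type is None:
        return None
    return os_type.get("machine")


print(machine_type(SAMPLE_DOMAIN_XML))
```

If the value is the same before and after the restart, the VM is still running with the old definition.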

Considering HE vm is handled differently and using the ha-agent, there probably should be some work done to adjust the behaviour to the new general upgrade cluster flow, including specific warning messages for HE vm.

Version-Release number of selected component (if applicable):
This was tested with a pre integration build:
ovirt-engine-4.0.2.7-0.0.master.20160823072450.git3b10fd7.el7.centos.noarch.
The RPMs were installed on the HE VM by interrupting the flow of the hosted-engine --upgrade-appliance tool. This should be tested again once the build that includes the cluster upgrade flow is merged.

How reproducible:

Steps to Reproduce:

1. Install hosted engine 3.6.
2. After the engine is up, add another SD and wait for the HE VM to appear in the engine.
3. Start the migration to 4.0 flow by setting global maintenance mode: hosted-engine --set-maintenance --mode=global.
4. Upgrade the hosted-engine-setup/vdsm RPMs to 4.0 on the host (a better flow would probably be to first shut down the HE VM with hosted-engine --vm-poweroff before upgrading the packages, and to start it with hosted-engine --vm-start afterwards).
*5. Edit /usr/share/ovirt-hosted-engine-setup/plugins/gr-he-common/vm/cloud_init.py in order to interrupt the flow before engine-setup is invoked, at line 884 (after engine-restore).
*6. Connect to the HE VM and remove the ovirt-engine packages installed by the appliance, install the packages with the cluster upgrade changes instead, and then run engine-backup --mode=restore and so on to restore the old engine's DB.
*7. Run engine-setup.
8. When the engine is up, change the cluster with HE to 4.0 (the host already has the new vdsm installed).

* Once an appliance that includes the cluster upgrade change is available, these steps are not needed at all.

Comment 1 Michal Skrivanek 2016-08-25 14:39:32 UTC

well even without that flow I'm kind of confused when exactly the HE VM becomes a proper 4.0 VM from the point of view of engine itself (to rephrase: when the domain xml definition is 4.0 and machine type of that VM is 4.0)

Comment 2 sefi litmanovich 2016-08-30 16:00:24 UTC

To add to comment 1, I retested the flow and noticed this time that the HE VM has the emulated machine flag set in the custom emulated machine option. My guess is that, for HA reasons, the HE developers did not want the VM to be affected by any changes in the supported emulated machine for a rhel6 VM. Is that correct?
This explains why the configuration isn't changed after the upgrade but, as Michal pointed out, that is a problem with HE upgrade/migration from 3.6 to 4.0 even without the new cluster upgrade flow.

Comment 3 Martin Sivák 2016-10-20 15:44:54 UTC

Roy, what is the status here? Was it a DUP after all or was it just fixed by some other bugfix?

Comment 4 Roy Golan 2016-11-28 14:24:03 UTC

Bug 1353838 now makes the HE VMs even with cluster emulated machine and cpu name.

I think the only thing missing is an indication that, if you upgraded from 3.6 to 4.0, your engine VM is still up on 3.6 and you need to cold reboot the engine VM to get the new configuration. Note: we need to make sure that upgrading the cluster from 3.6 to 4.0 triggers the creation of a new OVF.

Sefi, care to change this bug to RFE and adapt the description?

Comment 5 Martin Sivák 2017-01-18 13:48:32 UTC

So the underlying goal here is to make sure the hosted engine VMs look the same (or close) across all installations. That means we should update the devices and machine type (but not user editable fields like cpu type, memory size, nics, disks, ...) when the cluster compatibility level is upgraded.

There are a couple of open questions here:

- what should happen when the hosted engine capable hosts belong to multiple clusters (not recommended, but possible)?
- when and where should the devices be updated (he ovf conversion code, engine ovf dump, engine-setup and db upgrade scripts)?
- should there be a separate "template" for each cluster compatibility level?
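The goal stated above (refresh devices and machine type, keep user-editable fields) can be sketched as a simple merge. This is only an illustration under stated assumptions: the dictionary keys, the set of user-editable fields and the idea of a per-level "template" dictionary are hypothetical and do not reflect the actual engine or OVF schema.

```python
# Hypothetical field names; the real engine/OVF schema is not shown in this bug.
USER_EDITABLE = frozenset({"cpu_type", "memory_mb", "nics", "disks"})


def refresh_from_template(current, template, user_editable=USER_EDITABLE):
    """Take every field from the template, then restore the user-editable
    fields from the current VM configuration.

    Fields that exist only in `current` and are not user-editable are dropped.
    """
    refreshed = dict(template)
    for key in user_editable:
        if key in current:
            refreshed[key] = current[key]
    return refreshed


# Placeholder values for illustration only.
current_vm = {"machine_type": "old-machine", "memory_mb": 16384, "cpu_type": "Conroe"}
template_for_new_level = {"machine_type": "new-machine", "memory_mb": 4096}

print(refresh_from_template(current_vm, template_for_new_level))
```

In this sketch the machine type follows the template while the memory size and CPU type set by the user survive the upgrade.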

Comment 6 Sandro Bonazzola 2017-01-25 07:55:28 UTC

4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 7 Marina Kalinin 2017-02-18 04:45:31 UTC

(In reply to Martin Sivák from comment #5)
> So the underlying goal here is to make sure the hosted engine VMs look the
> same (or close) across all installations. That means we should update the
> devices and machine type (but not user editable fields like cpu type, memory
> size, nics, disks, ...) when the cluster compatibility level is upgraded.
> 
> There are couple of open questions here:
> 
> - what should happen when the hosted engine capable hosts belong to multiple
> clusters (not recommended, but possible)?
This is weird. What is the point of this option? The VM cannot live migrate between hosts in different clusters. So if we do allow hosts from different clusters to be part of an HE setup, we should block the live migration option of the HE VM here. Or, my preference, not allow such an option in the first place.
Regardless, assume live migration is not allowed between hosts in different clusters. Then, if we want to put the HE VM on a different cluster, we shut it down first, and the problem is solved.

> - when and where should the devices be updated (he ovf conversion code,
> engine ovf dump, engine-setup and db upgrade scripts)?
> - should there be a separate "template" for each cluster compatibility level?

Regardless, we should make sure this bug makes it to 4.0.z.

Comment 8 sefi litmanovich 2017-03-02 13:50:58 UTC

Not sure why this bug moved to on_qa. After talking with Roy, I get the sense that the issue raised in this bug isn't really resolved.
To reiterate the main issue in this bug: after an upgrade of HE, the engine starts with the cluster set to the older version and requires an update.

1. Updating the cluster gave no indication in the UI, or any message to the user, that restarting the HE VM is required.
2. The compatibility version on the HE VM's representation changes with the cluster to 4.0 (in the original case), although nothing has really changed on the HE VM.
3. The actual VM's XML in qemu still has the old emulated machine value (of 3.6 in this case).

I understand that the following bug takes care of updating the vm's xml in ovf:
https://bugzilla.redhat.com/show_bug.cgi?id=1358383

But this is not enough because:
1. The user still doesn't know he should reboot the system.
2. As far as I know, there is no mechanism that ensures the OVF is updated before the reboot, so there might be a scenario where, even after a reboot, the HE VM starts with the old XML.

If I am missing something please let me know, but since I don't see any changes in this bug and don't know what exactly to test now, I have to move this bug back to NEW.

Comment 9 Marina Kalinin 2017-03-07 05:01:29 UTC

Exactly.
Thank you, Sefi. This is how I read this bug as well.
And that's why I was also hoping to get it into 4.0.z

Comment 10 Yaniv Lavi 2017-03-09 10:39:42 UTC

Moving the needinfo to be considered by the assignee.

Comment 11 Martin Sivák 2017-03-09 11:01:37 UTC

So what you are really missing is an icon telling you "Please restart he engine VM"?

That too would be slightly confusing, because you can't do that from the admin UI (you have to ssh to a host and do that manually).

Also there is no way for the engine to restart itself.. well it could stop serving the health page and wait for the agent to restart it I guess.

All the configuration is properly updated and restart is all that is needed (but does it hurt anything if it keeps running with old configs until the first move?)

Comment 12 sefi litmanovich 2017-03-12 16:41:48 UTC

(In reply to Martin Sivák from comment #11)
> So what you are really missing is an icon telling you "Please restart he
> engine VM"?
> 
> That too would be slightly confusing, because you can't do that from the
> admin UI (you have to ssh to a host and do that manually).
> 
> Also there is no way for the engine to restart itself.. well it could stop
> serving the health page and wait for the agent to restart it I guess.
> 
> All the configuration is properly updated and restart is all that is needed
> (but does it hurt anything if it keeps running with old configs until the
> first move?)

I will explain the problem once again by describing the flow after reproducing it now with the latest 4.1.

1. Have a 4.0 HE running, with everything set up (storage domain and so on). The engine is installed on a rhel 7.3 appliance.
2. Move HE to global maintenance mode.
3. Update the host to 4.1 (updating vdsm and the ovirt-hosted-engine components).
4. Update the packages on the engine appliance to 4.1.
5. Run engine-setup to upgrade the engine.

At this point we have a 4.1 host running a rhel 7.3 vm with 4.1 engine - the vm is still running with machine type rhel-6.5.0 as default for 4.0.
The 'OvfUpdateIntervalInMinutes' value is set by default to 60.

6. Connect to the engine and upgrade the cluster to 4.1. A message appears regarding the upgrade with running VMs, suggesting that the VMs should be restarted. This message is for normal VMs, but one suggestion is to add an additional paragraph on top of it:

"In case this engine is a hosted engine, in order to apply the changes to the HE vm, please use hosted-engine agent on one of the managing hosts to restart the HE vm."

7. from the host - restart the vm (hosted-engine --vm-poweroff; hosted-engine --vm-start)

At this point because OvfUpdateIntervalInMinutes=60, if we did not wait the whole 60 minutes between cluster upgrade and vm restart the ovf will still have the old configuration saved (notice how the ad-hoc change of the ovf doesn't apply as the change is invoked internally). So when the vm starts it still has the 'old' machine type.
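The timing window described above can be expressed as a small worst-case check. This is a sketch under the assumption that the OVF is only refreshed by the periodic job, so a fresh OVF is guaranteed only after a full interval has elapsed since the cluster upgrade; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def ovf_may_be_stale(cluster_upgraded_at, vm_restarted_at, ovf_update_interval_minutes=60):
    """Return True if the restart happened before a full OVF update interval
    elapsed, i.e. the stored OVF is not guaranteed to contain the new config.

    Assumes the OVF is written only by the periodic update job.
    """
    elapsed = vm_restarted_at - cluster_upgraded_at
    return elapsed < timedelta(minutes=ovf_update_interval_minutes)


upgrade = datetime(2017, 3, 12, 10, 0)
print(ovf_may_be_stale(upgrade, upgrade + timedelta(minutes=15)))
print(ovf_may_be_stale(upgrade, upgrade + timedelta(minutes=60)))
```

A True result does not mean the OVF is definitely old, only that the periodic update may not have run yet.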

This is the exact scenario where the user "misses" the update. This is not to say that this is a tragedy or will cause some damage, but you can see how this is problematic.

A possible solution would be to apply ad-hoc change of the ovf in this case of cluster upgrade to ensure that on the next restart the vm will start as expected even if OvfUpdateIntervalInMinutes is set to 60.
Another possible solution is to supply the user with this information at the cluster upgrade phase, as suggested above. Another one might be the health page workaround you suggested.
At the least, a suggestion for a flow to apply the changes on the vm should exist in hosted-engine upgrade documentation.
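As an illustration of what such a documented flow could look like on the host side, here is a sketch that wraps the two commands from step 7 in a dry-run helper. The helper names are hypothetical, and the sketch does not wait for the VM to be fully down before starting it, nor for the OVF to be refreshed; both would need to be handled in a real procedure.

```python
import shlex
import subprocess


def restart_plan():
    """Commands from step 7, to be run on one of the hosted-engine hosts."""
    return [
        ["hosted-engine", "--vm-poweroff"],
        ["hosted-engine", "--vm-start"],
    ]


def run_plan(plan, dry_run=True):
    """Return the commands as shell strings; execute them only if dry_run is False."""
    rendered = []
    for argv in plan:
        rendered.append(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if a command fails.
            subprocess.run(argv, check=True)
    return rendered


for line in run_plan(restart_plan()):
    print(line)
```

With the default dry_run=True nothing is executed, so the plan can be reviewed first.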

Comment 13 Yaniv Kaul 2017-05-30 07:04:23 UTC

Is it going to make it to 4.1.3?

Comment 14 Yaniv Kaul 2017-06-22 09:58:35 UTC

(In reply to Yaniv Kaul from comment #13)
> Is it going to make it to 4.1.3?

ping?

Comment 15 Martin Sivák 2017-06-22 10:17:31 UTC

We are verifying the code changes, but since it has no exception flag then it will probably end up in 4.1.4.

Comment 18 Martin Sivák 2018-03-09 12:55:04 UTC

I would like to describe the things that are working.

- We have ovf that contains engine generated libvirtxml as of oVirt 4.2.2.
- The VM is updated asynchronously on the VM edit action (and many others)
- The VM is created through the engine using the standard blank template since 4.2.1

Those should make sure the VM is pretty standard with regards to all other VMs in a cluster.

But it makes no difference whatsoever if the VM runs with an older configuration for a couple of minutes or even hours, as long as it is running and serving the management application. So we will not be implementing any automated restart or indication in the UI for now. It just isn't worth the effort and gives the user no additional benefit.

If you find any place where this might actually cause a visible issue (a different libvirtxml is not one) or a situation where the engine would not be able to run properly, let us know. Until then I am closing this, as all the flows seem to be covered by what we already have.

Comment 19 Red Hat Bugzilla 2023-09-14 03:30:06 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
