Bug 1396672

Summary: modify output of the hosted engine CLI to show info on auto import process
Product: Red Hat Enterprise Virtualization Manager
Reporter: Marina Kalinin <mkalinin>
Component: ovirt-hosted-engine-ha
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.6.9
CC: alan.cowles, didi, gklein, gveitmic, lsurette, mkalinin, molasaga, rbalakri, srevivo, stirabos, trichard, ykaul, ylavi
Target Milestone: ovirt-4.1.0-beta
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
URL: https://www.ovirt.org/documentation/how-to/hosted-engine-host-OS-upgrade/
Whiteboard: integration
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Since Red Hat Enterprise Virtualization 3.6, ovirt-ha-agent has read its configuration, and the Manager virtual machine specification, from shared storage. Previously, these were just local files replicated on each involved host. This enhancement modifies the output of hosted-engine --vm-status to show whether the configuration and the Manager virtual machine specification have been correctly read from the shared storage on each reported host.
Story Points: ---
Clone Of:
: 1403735 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:53:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1403735    

Description Marina Kalinin 2016-11-18 22:51:15 UTC
Together with bz#1394448, we need to fix our documentation ASAP on how we recommend performing the HE upgrade from 3.5 to 3.6.

In this bug we need to fix the 3.5 to 3.6 with RHEL 7 hosts section [1].

Procedure 6.5. Updating the RHEV-H Self-Hosted Engine Host
Step 3:
Need to explicitly explain why this step is there and what its importance is.
This step is required to trigger the upgrade of HE SD from 3.5 to 3.6.
It is an essential part of the upgrade process and if it fails, the user should not proceed.
How to verify the upgrade succeeded? The Hosted Engine Storage Domain (HE SD) should appear in the UI under the Storage tab. Until this happens, the upgrade is either incomplete or has failed.

[1]
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Self-Hosted_Engine_Guide/Upgrading_the_Self-Hosted_Engine.html

Comment 1 Marina Kalinin 2016-11-19 03:15:30 UTC
Simone, can you please review?

Comment 2 Simone Tiraboschi 2016-11-22 14:41:01 UTC
I'm re-checking https://access.redhat.com/solutions/2351141

The central point is how the user can be sure that the upgrade procedure really ran, since it's not interactive but simply triggered by upgrading the RHEV-H 3.5/el7 host to RHEV-H 3.6/el7.

The best strategy is to grep /var/log/ovirt-hosted-engine-ha/agent.log on that host for '(upgrade_35_36) Successfully upgraded'.
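
For example, a quick check on that host (using the log path and message quoted above) could be:

  grep '(upgrade_35_36) Successfully upgraded' /var/log/ovirt-hosted-engine-ha/agent.log

If the line shows up, the storage upgrade completed on that host; if not, check the log for the maintenance-mode message mentioned below.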

The upgrade procedure should be pretty stable, but it requires some attention
to be sure that it worked as expected. For instance, it will work if, and only if, that host is in maintenance mode from the engine's point of view.

So, if the user finds something like:

(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

in /var/log/ovirt-hosted-engine-ha/agent.log, he has to put that host into maintenance mode from the engine and then manually restart ovirt-ha-agent on that host (systemd will retry just 10 times in a row, so the user has to restart it manually if he wasn't fast enough).
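
Concretely, once the host is in maintenance mode from the engine, the manual restart (assuming the standard systemd unit name for the agent mentioned above) would look like:

  systemctl restart ovirt-ha-agent

and the new upgrade attempt can then be followed again in /var/log/ovirt-hosted-engine-ha/agent.log.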

At the end he should see:
'(upgrade_35_36) Successfully upgraded'.

That host should now score 3400 points and the hosted-engine VM should automatically migrate there.
In order to check it:

[root@rhevh72 admin]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : rh68he20161115h1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 579062
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=579062 (Tue Nov 22 15:23:59 2016)
	host-id=1
	score=2400
	maintenance=False
	state=EngineDown


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : rh68he20161115h2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 578990
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=578990 (Tue Nov 22 15:24:01 2016)
	host-id=2
	score=2400
	maintenance=False
	state=EngineDown


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : rhevh72.localdomain
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 09ed71ab
Host timestamp                     : 1245

Another sign that the upgrade was successful is that in /etc/ovirt-hosted-engine/hosted-engine.conf we should find:
spUUID=00000000-0000-0000-0000-000000000000
and
conf_volume_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
conf_image_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
where 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' means any value.
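
As a quick way to check those values (using the configuration file path above), something like this should do:

  grep -E 'spUUID|conf_volume_UUID|conf_image_UUID' /etc/ovirt-hosted-engine/hosted-engine.conf

The spUUID line should show the all-zero UUID, and the two conf_*_UUID lines should be present with non-empty values.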

If something went wrong, for whatever reason, the user can retrigger the upgrade procedure by restarting ovirt-ha-agent on the affected host.

At this point the user can reinstall the other hosts (one at a time) with el7, add the RHEV agent 3.6 repo there, and redeploy hosted-engine on each of them.

After that (it's really important that the user moves to the next step only when the previous one is OK!!!), on each host, he has to find '(upgrade_35_36) Successfully upgraded' in /var/log/ovirt-hosted-engine-ha/agent.log.

At the end all the HE hosts should reach a score of 3400 points.
Only at this point the user has to:
- upgrade the engine to 3.6
- move the cluster compatibility level to 3.6.
The engine should trigger the import of the hosted-engine storage domain.
If successful, the user should see the hosted-engine storage domain as active in the engine.

It is really, really important that the user moves to the next action if and only if all the previous steps are OK.

Comment 5 Marina Kalinin 2016-11-22 19:22:10 UTC
Simone,
Thank you.

I will update the article with this very valuable information!

However, we still need to find the right wording for the official docs that cover the el7 hosts 3.5 to 3.6 upgrade, and this is what this bug is about.
I think for the official documentation it would be enough to say that the user should check the UI, and if the HE SD does not show up, they should contact support.

Comment 6 Simone Tiraboschi 2016-11-22 19:41:26 UTC
Other than properly documenting this, we can also modify, for 3.6.10, the output of
 hosted-engine --vm-status
to report, for each host, if everything was OK with the upgrade process.

Comment 7 Marina Kalinin 2016-11-22 20:14:09 UTC
Simone, is it also correct that if there is no other Data Domain in the DC, auto-import would not happen?
This is probably only a theoretical scenario, but worth mentioning.

Comment 8 Marina Kalinin 2016-11-22 20:16:14 UTC
(In reply to Simone Tiraboschi from comment #6)
> Other than properly documenting this, we can also modify, for 3.6.10, the
> output of
>  hosted-engine --vm-status
> to report, for each host, if everything was OK with the upgrade process.

This would be wonderful.
Do you want me to open a separate bug on this?

Comment 10 Simone Tiraboschi 2016-11-22 20:22:26 UTC
(In reply to Marina from comment #8)
> (In reply to Simone Tiraboschi from comment #6)
> > Other than properly documenting this, we can also modify, for 3.6.10, the
> > output of
> >  hosted-engine --vm-status
> > to report, for each host, if everything was OK with the upgrade process.
> 
> This would be wonderful.
> Do you want me to open a separate bug on this?

Yes, please

Comment 17 Simone Tiraboschi 2016-11-24 22:24:36 UTC
Oh, another relevant piece of info:
the auto-import procedure in the engine just looks for a storage domain called 'hosted_engine', but in 3.4 and the early 3.5 days the user could customize that name at setup time.

In that case he also has to run the following on the engine VM:

engine-config -s HostedEngineStorageDomainName={my_custom_name}
and then restart the engine, otherwise the engine will never find and import the hosted-engine storage domain.
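
Spelling it out, the sequence on the engine VM would be something like the following ({my_custom_name} is a placeholder for whatever name was chosen at setup time; this assumes the engine runs as the ovirt-engine systemd service):

  engine-config -s HostedEngineStorageDomainName={my_custom_name}
  systemctl restart ovirt-engine

After the restart, the auto-import should be able to find the custom-named storage domain.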

Comment 18 Germano Veit Michel 2016-11-25 06:20:44 UTC
(In reply to Simone Tiraboschi from comment #17)
> Oh, another relevant info:
> the auto-import procedure in the engine just looks for a storage domain
> called 'hosted_engine' but in 3.4 and earlier 3.5 days the user could
> customize that name at setup time.
> 
> In that case he has also to run on the engine VM:
> 
> engine-config -s HostedEngineStorageDomainName={my_custom_name}
> and than restart the engine otherwise the engine will never found and import
> the hosted-engine storage domain.

Thanks! I assume it's because BZ1301105 was never backported to 3.6.

Comment 19 Simone Tiraboschi 2016-11-25 08:49:52 UTC
(In reply to Germano Veit Michel from comment #18)
> > engine-config -s HostedEngineStorageDomainName={my_custom_name}
> > and than restart the engine otherwise the engine will never found and import
> > the hosted-engine storage domain.
> 
> Thanks! I assume it's because BZ1301105 was never backported to 3.6.

Yes, exactly, and in order to upgrade the engine VM to 4.0/el7, the hosted-engine storage domain should be correctly imported while still on 3.6.

Comment 20 Yaniv Lavi 2016-11-29 12:44:50 UTC
Can we please get a short clear list of the requested changes?

Comment 21 Germano Veit Michel 2016-12-02 07:06:40 UTC
(In reply to Yaniv Dary from comment #20)
> Can we please get a short clear list of the requested changes?

* Steps to Confirm HE SD was Imported
* Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

Down the road, if the 3.5 to 3.6 upgrade is not done properly, we get quite troubled 3.6 to 4.0 upgrades. See BZ #1400800.

Comment 22 Simone Tiraboschi 2016-12-02 10:26:22 UTC
(In reply to Germano Veit Michel from comment #21)
> (In reply to Yaniv Dary from comment #20)
> > Can we please get a short clear list of the requested changes?
> 
> * Steps to Confirm HE SD was Imported

This is quite/too complex from the ovirt-ha-agent point of view, since a proper fix would require checking the status of the hosted-engine storage domain in the engine over the API, but: the engine could be down, and currently we don't store any API credentials on the ovirt-ha-agent side.

> * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

For each host, we could add a couple of additional lines under the Extra metadata section in the output of hosted-engine --vm-status.

Comment 23 Germano Veit Michel 2016-12-02 22:35:47 UTC
(In reply to Simone Tiraboschi from comment #22)
> (In reply to Germano Veit Michel from comment #21)
> > (In reply to Yaniv Dary from comment #20)
> > > Can we please get a short clear list of the requested changes?
> > 
> > * Steps to Confirm HE SD was Imported
> 
> This is quite/too complex from ovirt-ha-agent point of view since a proper
> fix will require to check the status of the hosted-engine storage domain in
> the engine over the API but: the engine could be down, currently we don't
> store any API credentials at ovirt-ha-agent side

Why don't we check the OVFs? If it's imported, the OVFs will be there. And we already do something very similar when extracting vm.conf.

> 
> > * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)
> 
> for each host, we could add a a couple of additional lines under the Extra
> metadata section in the output of hosted-engine --vm-status

Nice!

Comment 24 Yaniv Kaul 2016-12-08 14:22:15 UTC
Simone, I don't see this getting into 3.6.10. Postpone to 3.6.11?

Comment 25 Simone Tiraboschi 2016-12-09 08:43:53 UTC
The relevant patch has already been merged on master (not sure why the gerrit hook wasn't triggered); it's just a matter of back-porting and verifying it.

Comment 28 Nikolai Sednev 2017-01-26 13:17:52 UTC
Moving to verified, as on 4.1 I'm getting these two lines, corresponding to a successful auto-import:

vm_conf_refresh_time=68357 (Thu Jan 26 15:00:02 2017)
conf_on_shared_storage=True
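
A quick way to spot those fields on any host (just filtering the output shown below):

  hosted-engine --vm-status | grep -E 'conf_on_shared_storage|vm_conf_refresh_time'

Both lines should be present for every host once the configuration is read from the shared storage.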




alma04 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 9ae7da8a
local_conf_timestamp               : 85165
Host timestamp                     : 85152
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=85152 (Thu Jan 26 15:00:03 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=85165 (Thu Jan 26 15:00:15 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4e11343f
local_conf_timestamp               : 68357
Host timestamp                     : 68345
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=68345 (Thu Jan 26 14:59:49 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=68357 (Thu Jan 26 15:00:02 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Moving to verified, as it works for me with these components on the hosts:
rhvm-appliance-4.1.20170119.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)


On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-doc-4.1.0-1.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.0.1-0.1.el7.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-branding-rhev-4.1.0-0.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 29 Alan 2017-04-11 10:58:50 UTC
Following the steps right here: https://access.redhat.com/solutions/2351141

When I get to 5.1 I place the node into maintenance in the Engine web interface and I restart the two services as described in 5.3.

I then tail -f /var/log/ovirt-hosted-engine-ha/agent.log | grep upgrade_35_36, looking for '(upgrade_35_36) Successfully upgraded' as suggested above, but I only find this message repeated every few seconds:

MainThread::INFO::2017-04-11 00:21:07,340::upgrade::1010::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version

After this runs for a while and I don't see success, I cancel the process and determine that the maintenance suggested in 5.1 is actually HE maintenance, not node maintenance. So I re-activate the node, confirm HE is synced up with 'hosted-engine --vm-status', and then place the HE into local maintenance via the TUI. I confirm we are in maintenance mode, restart the services, and am prompted with the following:

MainThread::ERROR::2017-04-11 01:01:30,462::upgrade::1013::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

I cancel the process again, and place the node in maintenance in the Engine web interface while also keeping it in local maintenance mode in HE. I restart the services once again and it goes back to the earlier message:

MainThread::INFO::2017-04-11 01:27:58,706::upgrade::1010::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version

Tailing the log file, that is the message that still persists as of early this morning.

Would it be possible to have additional verbosity in step 5.1 as to which maintenance mode is being prescribed? Also, is there a way to get updates on the progress of the upgrade, other than tailing the log file and looking for '(upgrade_35_36) Successfully upgraded' to appear?