Red Hat Bugzilla – Bug 1403735
[z-stream clone - 4.0.7] modify output of the hosted engine CLI to show info on auto import process
Last modified: 2017-03-20 07:52:20 EDT
+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1396672 +++
======================================================================
Together with bz#1394448, we need to fix our documentation ASAP on how we recommend the HE upgrade process from 3.5 to 3.6. In this bug we need to fix the 3.5 to 3.6 with RHEL 7 hosts section[1].

Procedure 6.5. Updating the RHEV-H Self-Hosted Engine Host, Step 3: we need to explicitly explain why this step is there and why it is important. This step is required to trigger the upgrade of the HE SD from 3.5 to 3.6. It is an essential part of the upgrade process, and if it fails, the user should not proceed.

How to verify the upgrade succeeded? The Hosted Engine Storage Domain (HE SD) should appear in the UI under the Storage tab. Until this happens, the upgrade is either not complete or has failed.

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Self-Hosted_Engine_Guide/Upgrading_the_Self-Hosted_Engine.html

(Originally by Marina Kalinin)
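A hedged aside on the host side of this step: before the HE SD upgrade can be expected to trigger, the upgraded host should already carry 3.6-level hosted-engine packages. A minimal sketch only; the 1.3.x series for ovirt-hosted-engine-ha is an assumption based on the "ha 1.3.xx" note later in this bug:

# Sketch: confirm the upgraded host carries 3.6-level hosted-engine packages.
rpm -q ovirt-hosted-engine-ha ovirt-hosted-engine-setup
# A 1.3.x ovirt-hosted-engine-ha build is expected on a 3.6-level host; a
# 1.2.x build would mean the host upgrade itself has not happened yet.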
Simone, can you please review? (Originally by Marina Kalinin)
I'm re-checking https://access.redhat.com/solutions/2351141

The central point is how the user can be sure that the upgrade procedure really upgraded, since it's not interactive but just triggered by the upgrade of the RHEV-H 3.5/el7 host to RHEV-H 3.6/el7.

The best strategy is to grep /var/log/ovirt-hosted-engine-ha/agent.log on that host for '(upgrade_35_36) Successfully upgraded'.

The upgrade procedure should be pretty stable, but it requires some attention to be sure that it worked as expected. For instance, it will work if, and only if, that host is in maintenance mode in the engine's eyes. So, if the user finds something like:

 (upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

in /var/log/ovirt-hosted-engine-ha/agent.log, he has to put that host into maintenance mode from the engine and then manually restart ovirt-ha-agent on that host (systemd will try just 10 times in a row, so the user has to restart it manually if he wasn't fast enough). At the end he should see: '(upgrade_35_36) Successfully upgraded'.

That host should now score 3400 points and the hosted-engine VM should automatically migrate there. In order to check it:

[root@rhevh72 admin]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : rh68he20161115h1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 579062
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=579062 (Tue Nov 22 15:23:59 2016)
    host-id=1
    score=2400
    maintenance=False
    state=EngineDown

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : rh68he20161115h2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 578990
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=578990 (Tue Nov 22 15:24:01 2016)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : rhevh72.localdomain
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 09ed71ab
Host timestamp                     : 1245

Another sign that the upgrade was successful is that under /etc/ovirt-hosted-engine/hosted-engine.conf we should find:

 spUUID=00000000-0000-0000-0000-000000000000

and

 conf_volume_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
 conf_image_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

where 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' means any value.

If something went wrong, for any reason, the user can retrigger the upgrade procedure by restarting ovirt-ha-agent on the affected host.

At this point the user can reinstall the other hosts (one at a time) with el7, add the RHEV agent 3.6 repo there and redeploy hosted-engine on each of them. After that (it's really important that the user moves to the next step only when the previous one is OK!!!), on each host he has to find '(upgrade_35_36) Successfully upgraded' in /var/log/ovirt-hosted-engine-ha/agent.log.

At the end all the HE hosts should reach a score of 3400 points. Only at this point the user has to:
- upgrade the engine to 3.6
- move the cluster compatibility level to 3.6

(Originally by Simone Tiraboschi)
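A minimal shell sketch of the per-host checks described in the previous comment; the log path, log messages and config path are taken from that comment, and an el7 host with systemd is assumed for the restart command:

# 1. Did the 3.5 -> 3.6 upgrade succeed on this host?
grep '(upgrade_35_36)' /var/log/ovirt-hosted-engine-ha/agent.log

# 2. If it reported "Unable to upgrade while not in maintenance mode", put the
#    host into maintenance from the engine, then retrigger the upgrade by
#    restarting the agent:
systemctl restart ovirt-ha-agent

# 3. Confirm the upgraded configuration is now referenced from the shared storage:
grep -E '^(spUUID|conf_volume_UUID|conf_image_UUID)=' /etc/ovirt-hosted-engine/hosted-engine.conf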
The engine should trigger the import of the hosted-engine storage domain. If successful, the user should see the hosted-engine storage domain in the engine as active. It is really important that the user moves to the next action if and only if all the previous steps are OK. (Originally by Simone Tiraboschi)
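One hedged way to confirm the import without the UI is to query the engine's REST API for storage domains. A sketch only: the engine FQDN, the admin@internal password and the 'hosted_engine' domain name are illustrative values (the name may have been customized at setup time, as a later comment notes):

# Sketch: list storage domains known to the engine and look for the HE domain.
# -k skips TLS verification and should be replaced by --cacert in practice.
curl -s -k -u 'admin@internal:password' \
     -H 'Accept: application/xml' \
     'https://engine.example.com/ovirt-engine/api/storagedomains' \
  | grep -i -A 3 'hosted_engine'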
Simone, thank you. I will update the article with this very valuable information!

However, we still need to find the right wording for the official docs that cover the 3.5 to 3.6 upgrade with el7 hosts, and this is what this bug is about. I think for the official documentation it would be enough to say that the user should check the UI, and if the HE SD does not show up, they should contact support.

(Originally by Marina Kalinin)
Other than properly documenting this, we can also modify, for 3.6.10, the output of hosted-engine --vm-status to report, for each host, if everything was OK with the upgrade process. (Originally by Simone Tiraboschi)
Simone, is it also correct that if there is no other Data Domain in the DC, the auto-import would not happen? This is probably only a theoretical scenario, but worth mentioning. (Originally by Marina Kalinin)
(In reply to Simone Tiraboschi from comment #6)
> Other than properly documenting this, we can also modify, for 3.6.10, the
> output of
> hosted-engine --vm-status
> to report, for each host, if everything was OK with the upgrade process.

This would be wonderful. Do you want me to open a separate bug on this?

(Originally by Marina Kalinin)
(In reply to Marina from comment #8)
> (In reply to Simone Tiraboschi from comment #6)
> > Other than properly documenting this, we can also modify, for 3.6.10, the
> > output of
> > hosted-engine --vm-status
> > to report, for each host, if everything was OK with the upgrade process.
> 
> This would be wonderful.
> Do you want me to open a separate bug on this?

Yes, please

(Originally by Simone Tiraboschi)
Oh, another relevant piece of info: the auto-import procedure in the engine just looks for a storage domain called 'hosted_engine', but in 3.4 and in the early 3.5 days the user could customize that name at setup time.

In that case he also has to run on the engine VM:

engine-config -s HostedEngineStorageDomainName={my_custom_name}

and then restart the engine; otherwise the engine will never find and import the hosted-engine storage domain. (Originally by Simone Tiraboschi)
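A sketch of the commands just mentioned, run on the engine VM; 'my_custom_name' is a placeholder for whatever name was chosen at setup time, and systemd is assumed for the engine restart:

# Tell the engine which storage domain name to look for during auto-import.
engine-config -s HostedEngineStorageDomainName=my_custom_name
engine-config -g HostedEngineStorageDomainName   # verify the stored value
# Restart the engine so the new value is picked up (systemd assumed here):
systemctl restart ovirt-engine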
(In reply to Simone Tiraboschi from comment #17)
> Oh, another relevant info:
> the auto-import procedure in the engine just looks for a storage domain
> called 'hosted_engine' but in 3.4 and earlier 3.5 days the user could
> customize that name at setup time.
> 
> In that case he has also to run on the engine VM:
> 
> engine-config -s HostedEngineStorageDomainName={my_custom_name}
> and than restart the engine otherwise the engine will never found and import
> the hosted-engine storage domain.

Thanks! I assume it's because BZ1301105 was never backported to 3.6.

(Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #18)
> > engine-config -s HostedEngineStorageDomainName={my_custom_name}
> > and than restart the engine otherwise the engine will never found and import
> > the hosted-engine storage domain.
> 
> Thanks! I assume it's because BZ1301105 was never backported to 3.6.

Yes, exactly, and in order to upgrade the engine VM to 4.0/el7, the hosted-engine storage domain should be correctly imported while still on 3.6.

(Originally by Simone Tiraboschi)
Can we please get a short clear list of the requested changes? (Originally by Yaniv Dary)
(In reply to Yaniv Dary from comment #20)
> Can we please get a short clear list of the requested changes?

* Steps to Confirm HE SD was Imported
* Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

Down the road, if the 3.5 to 3.6 upgrade is not done properly, we get quite troubled 3.6 to 4.0 upgrades. See BZ #1400800.

(Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #21)
> (In reply to Yaniv Dary from comment #20)
> > Can we please get a short clear list of the requested changes?
> 
> * Steps to Confirm HE SD was Imported

This is quite/too complex from the ovirt-ha-agent point of view, since a proper fix would require checking the status of the hosted-engine storage domain in the engine over the API, but the engine could be down, and currently we don't store any API credentials on the ovirt-ha-agent side.

> * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

For each host, we could add a couple of additional lines under the Extra metadata section in the output of hosted-engine --vm-status.

(Originally by Simone Tiraboschi)
(In reply to Simone Tiraboschi from comment #22)
> (In reply to Germano Veit Michel from comment #21)
> > (In reply to Yaniv Dary from comment #20)
> > > Can we please get a short clear list of the requested changes?
> > 
> > * Steps to Confirm HE SD was Imported
> 
> This is quite/too complex from ovirt-ha-agent point of view since a proper
> fix will require to check the status of the hosted-engine storage domain in
> the engine over the API but: the engine could be down, currently we don't
> store any API credentials at ovirt-ha-agent side

Why don't we check the OVFs? If it's imported, the OVFs will be there. And we already do something very similar when extracting vm.conf.

> > * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)
> 
> for each host, we could add a a couple of additional lines under the Extra
> metadata section in the output of hosted-engine --vm-status

Nice!

(Originally by Germano Veit Michel)
Simone, I don't see this getting into 3.6.10. Postpone to 3.6.11? (Originally by Yaniv Kaul)
The relevant patch has already been merged on master (not sure why the gerrit hook didn't trigger); it's just a matter of back-porting and verifying it. (Originally by Simone Tiraboschi)
Verified on:

# rpm -qa | grep hosted
ovirt-hosted-engine-ha-2.1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.2-1.el7ev.noarch

# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5f945a94
local_conf_timestamp               : 3030979
Host timestamp                     : 3030961
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3030961 (Tue Feb 28 14:46:00 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=3030979 (Tue Feb 28 14:46:17 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False
Verified on correct version:

# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-2.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : ab52e2b8
local_conf_timestamp               : 0
Host timestamp                     : 3055736
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3055736 (Tue Feb 28 21:38:55 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=0 (Thu Jan 1 02:00:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStart
    stopped=False
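Following up on the verification output above, a minimal sketch for checking the new indicator on each HE host; the field name is taken from the verified output, and the exact line layout may vary by build:

# Sketch: the updated status output should report the HE configuration as
# living on the shared storage.
hosted-engine --vm-status | grep -i conf_on_shared_storage
# Expected on an upgraded host: "conf_on_shared_storage : True"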
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0541.html