Bug 1396672
| Summary: | modify output of the hosted engine CLI to show info on auto import process | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Marina Kalinin <mkalinin> |
| Component: | ovirt-hosted-engine-ha | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED ERRATA | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.9 | CC: | alan.cowles, didi, gklein, gveitmic, lsurette, mkalinin, molasaga, rbalakri, srevivo, stirabos, trichard, ykaul, ylavi |
| Target Milestone: | ovirt-4.1.0-beta | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | https://www.ovirt.org/documentation/how-to/hosted-engine-host-OS-upgrade/ | | |
| Whiteboard: | integration | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Since Red Hat Enterprise Virtualization 3.6, ovirt-ha-agent has read its configuration, and the Manager virtual machine specification, from shared storage. Previously, these were just local files replicated on each involved host. This enhancement modifies the output of hosted-engine --vm-status to show whether the configuration and the Manager virtual machine specification have been correctly read from the shared storage on each reported host. | Story Points: | --- |
| Clone Of: | | | |
| : | 1403735 (view as bug list) | Environment: | |
| Last Closed: | 2017-04-25 00:53:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1403735 | | |
Description
Marina Kalinin
2016-11-18 22:51:15 UTC
Simone, can you please review? I'm re-checking https://access.redhat.com/solutions/2351141

The central point is how the user can be sure that the upgrade procedure really ran, since it is not interactive but is simply triggered by upgrading the RHEV-H 3.5/el7 host to RHEV-H 3.6/el7.

The best strategy is to grep /var/log/ovirt-hosted-engine-ha/agent.log on that host for '(upgrade_35_36) Successfully upgraded'.

The upgrade procedure should be pretty stable, but it requires some attention to be sure that it worked as expected. For instance, it will work if, and only if, that host is in maintenance mode from the engine's point of view. So, if the user finds something like:

    (upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

in /var/log/ovirt-hosted-engine-ha/agent.log, they have to put that host into maintenance mode from the engine and then, if needed, manually restart ovirt-ha-agent on that host (systemd will only retry 10 times in a row, so the user has to restart it manually if they weren't fast enough). At the end they should see '(upgrade_35_36) Successfully upgraded'.

That host should now score 3400 points and the hosted-engine VM should automatically migrate there. To check it:

    [root@rhevh72 admin]# hosted-engine --vm-status

    --== Host 1 status ==--

    Status up-to-date                  : True
    Hostname                           : rh68he20161115h1.localdomain
    Host ID                            : 1
    Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
    Score                              : 2400
    Local maintenance                  : False
    Host timestamp                     : 579062
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=579062 (Tue Nov 22 15:23:59 2016)
        host-id=1
        score=2400
        maintenance=False
        state=EngineDown

    --== Host 2 status ==--

    Status up-to-date                  : True
    Hostname                           : rh68he20161115h2.localdomain
    Host ID                            : 2
    Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
    Score                              : 2400
    Local maintenance                  : False
    Host timestamp                     : 578990
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=578990 (Tue Nov 22 15:24:01 2016)
        host-id=2
        score=2400
        maintenance=False
        state=EngineDown

    --== Host 3 status ==--

    Status up-to-date                  : True
    Hostname                           : rhevh72.localdomain
    Host ID                            : 3
    Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
    Score                              : 3400
    stopped                            : False
    Local maintenance                  : False
    crc32                              : 09ed71ab
    Host timestamp                     : 1245

Another sign that the upgrade was successful is that /etc/ovirt-hosted-engine/hosted-engine.conf should contain:

    spUUID=00000000-0000-0000-0000-000000000000

and

    conf_volume_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    conf_image_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

where 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' means any value.

If something went wrong, whatever the issue, the user can re-trigger the upgrade procedure by restarting ovirt-ha-agent on the affected host.

At this point the user can reinstall the other hosts (one at a time) with el7, add the rhev agent 3.6 repo there and redeploy hosted-engine on each of them. After that (it is really important that the user moves to the next step only when the previous one is OK!), on each host they have to find '(upgrade_35_36) Successfully upgraded' in /var/log/ovirt-hosted-engine-ha/agent.log. At the end all the HE hosts should reach a score of 3400 points.
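A minimal bash sketch of those per-host checks, for reference. The log path, log messages, configuration keys and service name are the ones quoted in this comment; exact message text may differ between builds, so treat this as illustrative rather than authoritative:

```bash
#!/bin/bash
# Sketch: confirm that the 3.5 -> 3.6 hosted-engine upgrade ran on this host.

LOG=/var/log/ovirt-hosted-engine-ha/agent.log
CONF=/etc/ovirt-hosted-engine/hosted-engine.conf

# 1. Did the agent report a successful upgrade?
if grep -q '(upgrade_35_36) Successfully upgraded' "$LOG"; then
    echo "upgrade reported as successful in $LOG"
elif grep -q '(upgrade_35_36) Unable to upgrade while not in maintenance mode' "$LOG"; then
    echo "put this host into maintenance mode from the engine, then run:"
    echo "  systemctl restart ovirt-ha-agent"
fi

# 2. Does the local configuration point at the shared conf volume?
#    Expected: spUUID=00000000-0000-0000-0000-000000000000 plus non-empty
#    conf_volume_UUID / conf_image_UUID values.
grep -E '^(spUUID|conf_volume_UUID|conf_image_UUID)=' "$CONF"

# 3. Check the score reported for this host (it should reach 3400).
hosted-engine --vm-status
```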
Only at this point does the user have to:
- upgrade the engine to 3.6
- move the cluster compatibility level to 3.6

The engine should then trigger the import of the hosted-engine storage domain. If it succeeds, the user should see the hosted-engine storage domain as active in the engine. It is really, really important that the user moves to the next action if and only if all the previous steps are OK.

Simone, thank you. I will update the article with this very valuable information! However, we still need to find the right wording for the official docs that cover the el7 hosts 3.5 to 3.6 upgrade, and this is what this bug is about. I think for the official documentation it would be enough to say that the user should check the UI, and if the HE SD does not show up, they should contact support.

Other than properly documenting this, we can also modify, for 3.6.10, the output of hosted-engine --vm-status to report, for each host, whether everything was OK with the upgrade process.

Simone, is it also correct that if there is no other Data Domain in the DC, the auto-import would not happen? This is probably only a theoretical scenario, but worth mentioning.

(In reply to Simone Tiraboschi from comment #6)
> Other than properly documenting this, we can also modify, for 3.6.10, the
> output of
> hosted-engine --vm-status
> to report, for each host, if everything was OK with the upgrade process.

This would be wonderful. Do you want me to open a separate bug on this?

(In reply to Marina from comment #8)
> (In reply to Simone Tiraboschi from comment #6)
> > Other than properly documenting this, we can also modify, for 3.6.10, the
> > output of
> > hosted-engine --vm-status
> > to report, for each host, if everything was OK with the upgrade process.
>
> This would be wonderful.
> Do you want me to open a separate bug on this?

Yes, please.

Oh, another relevant piece of info: the auto-import procedure in the engine just looks for a storage domain called 'hosted_engine', but in 3.4 and early 3.5 days the user could customize that name at setup time. In that case they also have to run, on the engine VM:

    engine-config -s HostedEngineStorageDomainName={my_custom_name}

and then restart the engine, otherwise the engine will never find and import the hosted-engine storage domain.

(In reply to Simone Tiraboschi from comment #17)
> Oh, another relevant info:
> the auto-import procedure in the engine just looks for a storage domain
> called 'hosted_engine' but in 3.4 and earlier 3.5 days the user could
> customize that name at setup time.
>
> In that case he has also to run on the engine VM:
>
> engine-config -s HostedEngineStorageDomainName={my_custom_name}
> and than restart the engine otherwise the engine will never found and import
> the hosted-engine storage domain.

Thanks! I assume it's because BZ1301105 was never backported to 3.6.

(In reply to Germano Veit Michel from comment #18)
> > engine-config -s HostedEngineStorageDomainName={my_custom_name}
> > and than restart the engine otherwise the engine will never found and import
> > the hosted-engine storage domain.
>
> Thanks! I assume it's because BZ1301105 was never backported to 3.6.

Yes, exactly, and in order to upgrade the engine VM to 4.0/el7, the hosted-engine storage domain should already be correctly imported while on 3.6.
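A minimal sketch of that rename workaround, run on the engine VM. The engine-config key and placeholder come from the comment above; the verification step and the ovirt-engine service name are assumptions based on a standard RHEV-M installation:

```bash
# On the engine VM: tell the auto-import code about the custom storage domain name.
# 'my_custom_name' is a placeholder for whatever name was chosen at hosted-engine setup time.
engine-config -s HostedEngineStorageDomainName=my_custom_name

# Optionally confirm the value was stored.
engine-config -g HostedEngineStorageDomainName

# Restart the engine so the new value is picked up
# (assumes the standard ovirt-engine systemd service).
systemctl restart ovirt-engine
```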
Can we please get a short, clear list of the requested changes?

(In reply to Yaniv Dary from comment #20)
> Can we please get a short clear list of the requested changes?

* Steps to confirm the HE SD was imported
* Steps to confirm the HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

Down the road, if the 3.5 to 3.6 upgrade is not done properly, we get quite troubled 3.6 to 4.0 upgrades. See BZ #1400800.

(In reply to Germano Veit Michel from comment #21)
> (In reply to Yaniv Dary from comment #20)
> > Can we please get a short clear list of the requested changes?
>
> * Steps to Confirm HE SD was Imported

This is quite complex from the ovirt-ha-agent point of view, since a proper fix would require checking the status of the hosted-engine storage domain in the engine over the API, but the engine could be down and we currently don't store any API credentials on the ovirt-ha-agent side.

> * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)

For each host, we could add a couple of additional lines under the Extra metadata section in the output of hosted-engine --vm-status.

(In reply to Simone Tiraboschi from comment #22)
> (In reply to Germano Veit Michel from comment #21)
> > (In reply to Yaniv Dary from comment #20)
> > > Can we please get a short clear list of the requested changes?
> >
> > * Steps to Confirm HE SD was Imported
>
> This is quite/too complex from ovirt-ha-agent point of view since a proper
> fix will require to check the status of the hosted-engine storage domain in
> the engine over the API but: the engine could be down, currently we don't
> store any API credentials at ovirt-ha-agent side

Why don't we check the OVFs? If it's imported, the OVFs will be there. And we already do something very similar when extracting vm.conf.

> > * Steps to Confirm HE SD was upgraded to 3.6 (ha 1.3.xx, conf volume...)
>
> for each host, we could add a couple of additional lines under the Extra
> metadata section in the output of hosted-engine --vm-status

Nice!

Simone, I don't see this getting into 3.6.10. Postpone to 3.6.11?

The relevant patch has already been merged on master (not sure why the gerrit hook didn't trigger); it's just a matter of back-porting and verifying it.
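Once those extra lines are in place, a quick per-host check could look something like the sketch below. The field names (conf_on_shared_storage, vm_conf_refresh_time) are taken from the 4.1 verification output further down in this bug, not from this comment itself:

```bash
# Sketch: check, per host, whether the configuration was read from shared storage,
# based on the extra fields this enhancement adds to the vm-status output.
# A correctly migrated host should report conf_on_shared_storage=True.
hosted-engine --vm-status | grep -E 'Hostname|conf_on_shared_storage|vm_conf_refresh_time'
```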
Moving to verified, as on 4.1 I'm getting these two lines, corresponding to a successful auto-import:

    vm_conf_refresh_time=68357 (Thu Jan 26 15:00:02 2017)
    conf_on_shared_storage=True

    alma04 ~]# hosted-engine --vm-status

    --== Host 1 status ==--

    conf_on_shared_storage             : True
    Status up-to-date                  : True
    Hostname                           : alma03.qa.lab.tlv.redhat.com
    Host ID                            : 1
    Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
    Score                              : 3400
    stopped                            : False
    Local maintenance                  : False
    crc32                              : 9ae7da8a
    local_conf_timestamp               : 85165
    Host timestamp                     : 85152
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=85152 (Thu Jan 26 15:00:03 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=85165 (Thu Jan 26 15:00:15 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

    --== Host 2 status ==--

    conf_on_shared_storage             : True
    Status up-to-date                  : True
    Hostname                           : alma04.qa.lab.tlv.redhat.com
    Host ID                            : 2
    Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
    Score                              : 3400
    stopped                            : False
    Local maintenance                  : False
    crc32                              : 4e11343f
    local_conf_timestamp               : 68357
    Host timestamp                     : 68345
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=68345 (Thu Jan 26 14:59:49 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=68357 (Thu Jan 26 15:00:02 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Moving to verified, as it works for me with these components on the hosts:

    rhvm-appliance-4.1.20170119.1-1.el7ev.noarch
    ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
    ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
    ovirt-host-deploy-1.6.0-1.el7ev.noarch
    ovirt-imageio-common-0.5.0-0.el7ev.noarch
    ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
    qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
    libvirt-client-2.0.0-10.el7_3.4.x86_64
    mom-0.5.8-1.el7ev.noarch
    vdsm-4.19.2-2.el7ev.x86_64
    ovirt-setup-lib-1.1.0-1.el7ev.noarch
    ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
    ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
    ovirt-vmconsole-1.0.4-1.el7ev.noarch
    sanlock-3.4.0-1.el7.x86_64
    Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
    Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
    Red Hat Enterprise Linux Server release 7.3 (Maipo)

On the engine:

    rhev-guest-tools-iso-4.1-3.el7ev.noarch
    rhevm-doc-4.1.0-1.el7ev.noarch
    rhevm-dependencies-4.1.0-1.el7ev.noarch
    rhevm-setup-plugins-4.1.0-1.el7ev.noarch
    rhevm-4.1.0.1-0.1.el7.noarch
    rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
    rhevm-branding-rhev-4.1.0-0.el7ev.noarch
    Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
    Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
    Red Hat Enterprise Linux Server release 7.3 (Maipo)

Following the steps right here: https://access.redhat.com/solutions/2351141

When I get to 5.1, I place the node into maintenance in the Engine web interface and restart the two services as described in 5.3.
I then run tail -f /var/log/ovirt-hosted-engine-ha/agent.log | grep upgrade_35_36, looking for '(upgrade_35_36) Successfully upgraded' as suggested above, but I only find this message repeated every few seconds:

    MainThread::INFO::2017-04-11 00:21:07,340::upgrade::1010::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version

After this runs for a while and I don't see success, I cancel the process. I determine that the maintenance suggested in 5.1 is actually HE maintenance, not node maintenance, so I re-activate the node, confirm HE is synced up with 'hosted-engine --vm-status', and then place the HE into local maintenance via the TUI. I confirm we are in maintenance mode, restart the services, and am presented with the following:

    MainThread::ERROR::2017-04-11 01:01:30,462::upgrade::1013::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

I cancel the process again, and place the node in maintenance in the Engine web interface while it is also still in local maintenance mode in HE. I restart the services once again and it returns to the previous message:

    MainThread::INFO::2017-04-11 01:27:58,706::upgrade::1010::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version

Tailing the log file, that is the message that still persists early this morning.

Would it be possible to have additional verbosity in step 5.1 as to which maintenance mode is being prescribed? Also, is there a way to get updates on the progress of the upgrade other than tailing the log file and looking for '(upgrade_35_36) Successfully upgraded' to appear?
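For anyone hitting the same ambiguity, a rough sketch of the two distinct maintenance states involved and a crude way to watch progress. The hosted-engine --set-maintenance option and the ovirt-ha-agent/ovirt-ha-broker services are standard on hosted-engine hosts; assuming these are the "two services" the KB article refers to, and that the steps 5.1/5.3 mentioned above are defined only in that article:

```bash
# (a) Host maintenance as seen by the engine: set from the Engine web UI
#     (Hosts -> select host -> Maintenance). The upgrade_35_36 code checks
#     this state, per the error message quoted in this bug.

# (b) Hosted-engine local maintenance, controlled from the host CLI:
hosted-engine --set-maintenance --mode=local     # enter local maintenance
hosted-engine --vm-status                        # confirm "Local maintenance : True"

# Restart the HA services to re-trigger the upgrade attempt
# (assuming ovirt-ha-broker and ovirt-ha-agent are the two services meant above):
systemctl restart ovirt-ha-broker ovirt-ha-agent

# Crude progress monitor: watch only the upgrade_35_36 messages.
tail -f /var/log/ovirt-hosted-engine-ha/agent.log | grep --line-buffered 'upgrade_35_36'

# When finished, leave local maintenance:
hosted-engine --set-maintenance --mode=none
```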