Description of problem:

After upgrading from 4.1.6 to 4.2-pre the HA subsystem no longer starts.

broker.log ends with this line:
storage_broker::96::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: 'metadata_image_UUID can't be ''

agent.log has errors starting up monitors as well:
RequestError: Failed to start monitor ping, options {'addr': '66.187.230.126'}: [Errno 2] No such file or directory

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.0-1.el7.centos.noarch

How reproducible:
Reproduced in our environment, which was upgraded all the way from 3.6 -> 4.0 -> 4.1.

Steps to Reproduce:
1. upgrade HA host from 4.1 to 4.2
2. log in and check ovirt-ha-broker and ovirt-ha-agent statuses

Actual results:
HA subsystem is down

Expected results:
HA subsystem working

Additional info:
/etc/ovirt-hosted-engine/hosted-engine.conf has empty values for several parameters:
...
domainType=nfs3
vdsm_use_ssl=true
gateway=66.187.230.126
bridge=ovirtmgmt
metadata_volume_UUID=
metadata_image_UUID=
lockspace_volume_UUID=
lockspace_image_UUID=
# The following are used only for iSCSI storage
iqn=
portal=
user=
password=
port=
...
I checked the logs; the timestamp on the file matches the update to one of the 4.1 releases:

...
Aug 14 16:32:06 Installed: vdsm-4.19.24-1.el7.centos.x86_64
Aug 14 16:32:07 Updated: ovirt-hosted-engine-ha-2.1.4-1.el7.centos.noarch
Aug 14 16:32:08 Updated: ovirt-hosted-engine-setup-2.1.3.5-1.el7.centos.noarch
Aug 14 16:32:26 Updated: ovirt-release41-4.1.4-1.el7.centos.noarch
...
Just to avoid confusion: the system did receive updates after the above log snippet; that was simply the update that last touched hosted-engine.conf.

2016:
Jun 23 13:55:09 Installed: vdsm-4.16.30-0.el7.centos.x86_64
Oct 21 18:31:47 Updated: vdsm-4.17.32-1.el7.noarch

2017:
Jan 17 13:41:07 Updated: vdsm-4.18.21-1.el7.centos.x86_64
Feb 24 09:45:55 Updated: vdsm-4.19.4-1.el7.centos.x86_64
Apr 21 21:58:20 Updated: vdsm-4.19.10.1-1.el7.centos.x86_64
Aug 14 16:32:06 Installed: vdsm-4.19.24-1.el7.centos.x86_64
Oct 10 14:26:04 Updated: vdsm-4.19.31-1.el7.centos.x86_64
Dec 05 15:12:20 Updated: vdsm-4.20.9-1.el7.centos.x86_64

Other hosts still on 4.1 all have the same file with empty values, and HA works properly on them.
metadata_volume_UUID is effectively empty. I assume the system was initially deployed around the 3.3 timeframe. We have code to upgrade it; the question is why it never triggered in the past.
Looking at ancient logs, the environment was initially deployed on 2014-08-05 using otopi-1.2.1, which looks like oVirt 3.4. It was later updated all the way to its current 4.1 state.

Was metadata_image_UUID used by older versions? If yes, where was it taken from?

Can we make the HA Broker behave in a similar fashion: if the value is missing, grab it from the HE storage domain metadata and update the file? Or is the only way here to perform this step manually? If so, where do I get the values of metadata_volume_UUID and friends?
(In reply to Evgheni Dereveanchin from comment #7)
> Looking at ancient logs, the environment was initially deployed on
> 2014-08-05 using otopi-1.2.1, which looks like oVirt 3.4. It was later
> updated all the way to its current 4.1 state.
>
> Was metadata_image_UUID used by older versions?

In 3.4 the metadata area was just a file on NFS; since 3.5 it has been a proper vdsm-handled volume.

> If yes, where was it taken from?

We had upgrade code, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1313917
Now we need to understand whether it failed for some reason in the past.

> Can we make the HA Broker behave in a similar fashion: if the value is
> missing, grab it from the HE storage domain metadata and update the file?
> Or is the only way here to perform this step manually? If so, where do I get
> the values of metadata_volume_UUID and friends?
I am starting to think that the upgrade was successful but we have a regression in how ovirt-ha-broker accesses it.

In 3.4 on NFS we were just using a file. The 3.6/el6 -> 4.0/el7 upgrade code eventually created a new volume, deleted the previous file and created a symlink to the volume. In https://gerrit.ovirt.org/#/c/61345/ I read: "Volume creation will also remove the previous file and it will replace it with a symlink pointing to the new volume. Upon restart, all the hosts will point to the new volume since they'll simply consume the symlink."

So, instead of directly fixing /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts (ovirt-hosted-engine-setup --upgrade-appliance was used on just one host), we were relying on ovirt-ha-broker simply consuming the symlink. Now, since https://gerrit.ovirt.org/#/c/81011/, ovirt-ha-broker explicitly looks up all the volume UUIDs in /etc/ovirt-hosted-engine/hosted-engine.conf, but those are not set for the volumes created during the 3.6/el6 -> 4.0/el7 upgrade.
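To illustrate the two access patterns, here is a rough sketch; the paths and UUID names below are placeholders, not values taken from this environment. Before the second patch the broker opened a well-known symlink, while now it composes the path from the config values, so empty UUIDs make it fail before it ever touches the storage:

# pre-4.2 access: open the fixed symlink; whatever it points at is used
/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/ha_agent/hosted-engine.metadata

# 4.2 access: the path is composed from hosted-engine.conf, hence the
# "metadata_image_UUID can't be ''" error when the key is empty
/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/images/<metadata_image_UUID>/<metadata_volume_UUID>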
Simone, thanks for the insight! How was this upgrade flow supposed to be triggered? From the patches it looks like hosted-engine-setup for 4.0 should have warned about the migration, which then needed to be performed manually. Am I right? In our case, from what I remember, the hosts were rebuilt as el7 around 3.5, so that is probably the last time hosted-engine-setup was run on them. After that it was probably just "yum update".
Severity?
I'm setting high severity and medium priority. Based on the assumption (comment #8) that *this can't happen on 3.5 and higher*, this is a corner case. Removing blocker; we may target this to 4.2.1.
I agree with Moran; this probably affects only a minor fraction of environments, those which were deployed with Hosted Engine before 3.5 and upgraded to 4.1. It should be enough to at least document workaround steps on how to fill in the values for ha-broker manually. Upgrading HE from 3.4 to 4.1 requires at least an Engine OS reinstall, so I assume admins who have done it successfully are quite familiar with oVirt and can perform the manual steps without issues as long as they're documented in this BZ.
I accidentally cleared Simone's needinfo request from #9 with my comment, so re-adding it.
Looking at the HE storage domain from a 4.1 host together with Martin, it looks like there are no symlinks to metadata and lockspace:

# ls -la ha_agent
total 2036
drwxr-xr-x. 2 vdsm kvm    4096 Aug  5  2014 .
drwxr-xr-x. 6 vdsm kvm    4096 Aug  5  2014 ..
-rw-rw----. 1 vdsm kvm 1048576 Dec  6 12:50 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Dec  6 12:50 hosted-engine.metadata

Most probably some upgrade step was skipped (we will need to review the upgrade doc), yet up until 4.2 this still worked fine. We'd probably need to stop the HE services now and do the upgrade: create the disks, copy data, make symlinks and fix the config files.
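For contrast, on a correctly upgraded domain the two entries should be symlinks into images/ rather than plain files. A hedged sketch of the expected shape (UUIDs are placeholders; ownership, size and timestamp columns omitted):

# ls -la ha_agent
hosted-engine.lockspace -> ../images/<lockspace_image_UUID>/<lockspace_volume_UUID>
hosted-engine.metadata -> ../images/<metadata_image_UUID>/<metadata_volume_UUID>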
Created attachment 1363679 [details]
Workaround script
Under the hypothesis that the upgrade code correctly generated the missing volumes while upgrading the engine VM to el7 via hosted-engine --upgrade-appliance (and this seems not to be Evgheni's case, according to comment 17), the script at https://bugzilla.redhat.com/attachment.cgi?id=1363679 will print out the correct values to be set under /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts.
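For reference, the script's output is meant to fill the four keys that are currently empty in hosted-engine.conf; schematically (the <...> values are placeholders for whatever the script prints):

metadata_volume_UUID=<volume UUID printed by the script>
metadata_image_UUID=<image UUID printed by the script>
lockspace_volume_UUID=<volume UUID printed by the script>
lockspace_image_UUID=<image UUID printed by the script>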
Thanks Simone and Martin. Indeed, the "hosted-engine --upgrade-appliance" step was skipped when upgrading this environment from 3.6 to 4.0, so the volumes were missing. Everyone going via the official upgrade path should not be affected.

We worked around the issue the following way (a condensed command sketch follows the list):

1) create two disks on hosted_storage from the Engine UI and write down their image_id from the Disks tab
2) stop the HA broker and agent on all hosts (this does not affect any VMs)
3) verify the hosted-engine lockspace was released (sanlock client status) and release it where needed by running "sanlock client rem_lockspace -s LINE_FROM_STATUS"
4) mount the HE storage manually if needed (hosted-engine --connect-storage)
5) initialize the lockspace manually in the newly created file by running "sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/...path.to.new.lockspace.file" (note that it may need double-escaped characters to work; this was needed for our NFS)
6) create symlinks for backwards compatibility
7) update /etc/ovirt-hosted-engine/hosted-engine.conf with the respective volume and image UUIDs
8) start the broker on the 4.2 host and immediately set global maintenance, just in case
9) start the agent, then start the broker and agent on the remaining hosts
10) wait till all hosts are visible in "hosted-engine --vm-status" output
11) done; remove global maintenance
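A condensed sketch of steps 2-7 as shell commands, run as root. Everything in <angle brackets> is a placeholder for a value from your environment, and the sanlock invocations simply mirror the commands quoted in the list above, so double-check them against your setup before running anything:

# systemctl stop ovirt-ha-agent ovirt-ha-broker   # step 2, on every HA host
# sanlock client status                           # step 3, look for a hosted-engine lockspace line
# sanlock client rem_lockspace -s <line_from_status>   # step 3, only if the lockspace is still held
# hosted-engine --connect-storage                 # step 4, mount the HE storage domain
# sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/images/<lockspace_image_UUID>/<lockspace_volume_UUID>   # step 5
# cd /rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/ha_agent   # step 6
# ln -sf ../images/<metadata_image_UUID>/<metadata_volume_UUID> hosted-engine.metadata
# ln -sf ../images/<lockspace_image_UUID>/<lockspace_volume_UUID> hosted-engine.lockspace

Step 7 is then filling the four UUID keys in /etc/ovirt-hosted-engine/hosted-engine.conf on every host, as shown in the snippet above.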
(In reply to Evgheni Dereveanchin from comment #20)
> Thanks Simone and Martin. Indeed, the "hosted-engine --upgrade-appliance"
> step was skipped when upgrading this environment from 3.6 to 4.0, so the
> volumes were missing. Everyone going via the official upgrade path should
> not be affected.

Everyone that deployed hosted-engine on NFS on 3.4 is affected: the missing volumes are supposed to be created by 'hosted-engine --upgrade-appliance', but that was simply replacing the files with symlinks pointing to the new volumes. Now ovirt-ha-agent ignores the symlinks as well.
Simone, I think what Evgheni is saying is that this _only_ affects setups that:

- were installed using 3.3 or 3.4
- AND use NFS
- AND skipped hosted-engine --upgrade-appliance when upgrading

I believe this is rare enough that a knowledge base article or a release note might be a good enough resolution of this bug.
(In reply to Martin Sivák from comment #22)
> Simone, I think what Evgheni is saying is that this _only_ affects setups
> that:
>
> - were installed using 3.3 or 3.4
> - AND use NFS
> - AND skipped hosted-engine --upgrade-appliance when upgrading

It will also affect systems correctly upgraded with 'hosted-engine --upgrade-appliance': in that case all the volumes will be there, and the symlink will be there as well, but ovirt-hosted-engine-setup is not supposed to update hosted-engine.conf on all the hosts, so ovirt-ha-broker is going to fail anyway because it cannot find metadata_image_UUID in the config file, exactly as in this bug.

> I believe this is rare enough that a knowledge base article or a release
> note might be a good enough resolution of this bug.
Removing needinfo on me since I see engine on PHX has been upgraded to 4.2. Workaround script has been provided in comment #18.
*** Bug 1613278 has been marked as a duplicate of this bug. ***