Description of problem:

After upgrading from 4.1.6 to 4.2-pre the HA subsystem no longer starts.

broker.log ends with this line:
storage_broker::96::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: 'metadata_image_UUID can't be ''

agent.log has errors starting up monitors as well:
RequestError: Failed to start monitor ping, options {'addr': '66.187.230.126'}: [Errno 2] No such file or directory

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.0-1.el7.centos.noarch

How reproducible:
Reproduced in our environment, which was upgraded all the way from 3.6 -> 4.0 -> 4.1.

Steps to Reproduce:
1. upgrade HA host from 4.1 to 4.2
2. log in and check ovirt-ha-broker and ovirt-ha-agent statuses

Actual results:
HA subsystem is down

Expected results:
HA subsystem working

Additional info:
/etc/ovirt-hosted-engine/hosted-engine.conf has empty values for several parameters:
...
domainType=nfs3
vdsm_use_ssl=true
gateway=66.187.230.126
bridge=ovirtmgmt
metadata_volume_UUID=
metadata_image_UUID=
lockspace_volume_UUID=
lockspace_image_UUID=
# The following are used only for iSCSI storage
iqn=
portal=
user=
password=
port=
...
I checked the logs; the timestamp on the file matches the update to one of the 4.1 releases:

...
Aug 14 16:32:06 Installed: vdsm-4.19.24-1.el7.centos.x86_64
Aug 14 16:32:07 Updated: ovirt-hosted-engine-ha-2.1.4-1.el7.centos.noarch
Aug 14 16:32:08 Updated: ovirt-hosted-engine-setup-2.1.3.5-1.el7.centos.noarch
Aug 14 16:32:26 Updated: ovirt-release41-4.1.4-1.el7.centos.noarch
...
Just to avoid confusion: the system did receive updates after the above log snippet; that was simply the update that last touched hosted-engine.conf.

2016:
Jun 23 13:55:09 Installed: vdsm-4.16.30-0.el7.centos.x86_64
Oct 21 18:31:47 Updated: vdsm-4.17.32-1.el7.noarch

2017:
Jan 17 13:41:07 Updated: vdsm-4.18.21-1.el7.centos.x86_64
Feb 24 09:45:55 Updated: vdsm-4.19.4-1.el7.centos.x86_64
Apr 21 21:58:20 Updated: vdsm-4.19.10.1-1.el7.centos.x86_64
Aug 14 16:32:06 Installed: vdsm-4.19.24-1.el7.centos.x86_64
Oct 10 14:26:04 Updated: vdsm-4.19.31-1.el7.centos.x86_64
Dec 05 15:12:20 Updated: vdsm-4.20.9-1.el7.centos.x86_64

Other hosts still on 4.1 all have the same file with empty values, and HA works properly on them.
metadata_volume_UUID is effectively empty. I assume the system was initially deployed around the 3.3 timeframe. We have code to upgrade it; the question is why it never triggered in the past.
Looking at ancient logs, the environment was initially deployed on 2014-08-05 using otopi-1.2.1, which looks like oVirt 3.4. It was later updated all the way to its current 4.1 state.

Was metadata_image_UUID used by older versions? If yes, where was it taken from?

Can we make the HA Broker behave in a similar fashion: if the value is missing, grab it from the HE storage domain metadata and update the file? Or is the only way here to perform this step manually? If so, where do I get the values of metadata_volume_UUID and friends?
(In reply to Evgheni Dereveanchin from comment #7)
> Looking at ancient logs, the environment was initially deployed on
> 2014-08-05 using otopi-1.2.1, which looks like oVirt 3.4. It was later
> updated all the way to its current 4.1 state.
>
> Was metadata_image_UUID used by older versions?

In 3.4 the metadata area was just a file on NFS; since 3.5 it has been a proper vdsm-handled volume.

> If yes, where was it taken from?

We had upgrade code, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1313917
Now we need to understand whether it failed for some reason in the past.

> Can we make the HA Broker behave in a similar fashion: if the value is
> missing, grab it from the HE storage domain metadata and update the file?
> Or is the only way here to perform this step manually? If so, where do I get
> the values of metadata_volume_UUID and friends?
I am starting to think that the upgrade was successful but we have a regression in how ovirt-ha-broker accesses it.

In 3.4 on NFS we were just using a file. The 3.6/el6 -> 4.0/el7 upgrade code eventually created a new volume, deleted the previous file and created a symlink to the volume. In https://gerrit.ovirt.org/#/c/61345/ I read: "Volume creation will also remove the previous file and it will replace it with a symlink pointing to the new volume. Upon restart, all the hosts will point to the new volume since they'll simply consume the symlink."

So, instead of directly fixing /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts (ovirt-hosted-engine-setup --upgrade-appliance was used on just one host), we were relying on ovirt-ha-broker simply consuming the symlink. Now, since https://gerrit.ovirt.org/#/c/81011/, ovirt-ha-broker explicitly looks up all the volume UUIDs in /etc/ovirt-hosted-engine/hosted-engine.conf, but those are not set for the volumes created during the 3.6/el6 -> 4.0/el7 upgrade.
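To illustrate the two access patterns, here is a rough sketch; the paths and UUID names below are placeholders, not values taken from this environment. Before the second patch the broker opened a well-known symlink, while now it composes the path from the config values, so empty UUIDs make it fail before it ever touches the storage:

# pre-4.2 access: open the fixed symlink; whatever it points at is used
/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/ha_agent/hosted-engine.metadata

# 4.2 access: the path is composed from hosted-engine.conf, hence the
# "metadata_image_UUID can't be ''" error when the key is empty
/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/images/<metadata_image_UUID>/<metadata_volume_UUID>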
Simone, thanks for the insight! How was this upgrade flow supposed to be triggered? From the patches it looks like hosted-engine-setup for 4.0 should have warned about the migration, which then needed to be performed manually. Am I right? In our case, from what I remember, the hosts were rebuilt as el7 around 3.5, so that is probably the last time hosted-engine-setup was run on them. After that it was probably just "yum update".
Severity?
I'm setting high severity and medium priority. Based on the assumption (comment #8) that *this can't happen on 3.5 and higher*, this is a corner case. Removing blocker; we may target this to 4.2.1.
I agree with Moran; this probably affects only a minor fraction of environments, those which were deployed with Hosted Engine before 3.5 and upgraded to 4.1. It should be enough to at least document workaround steps on how to fill in the values for ha-broker manually. Upgrading HE from 3.4 to 4.1 requires at least an Engine OS reinstall, so I assume admins who have done it successfully are quite familiar with oVirt and can perform the manual steps without issues as long as they're documented in this BZ.
I accidentally cleared Simone's needinfo request from #9 with my comment, so re-adding it.
Looking at the HE storage domain from a 4.1 host together with Martin, it looks like there are no symlinks to metadata and lockspace:

# ls -la ha_agent
total 2036
drwxr-xr-x. 2 vdsm kvm    4096 Aug  5  2014 .
drwxr-xr-x. 6 vdsm kvm    4096 Aug  5  2014 ..
-rw-rw----. 1 vdsm kvm 1048576 Dec  6 12:50 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Dec  6 12:50 hosted-engine.metadata

Most probably some upgrade step was skipped (we will need to review the upgrade doc), yet up until 4.2 this still worked fine. We'd probably need to stop the HE services now and do the upgrade: create the disks, copy data, make symlinks and fix the config files.
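For contrast, on a correctly upgraded domain the two entries should be symlinks into images/ rather than plain files. A hedged sketch of the expected shape (UUIDs are placeholders; ownership, size and timestamp columns omitted):

# ls -la ha_agent
hosted-engine.lockspace -> ../images/<lockspace_image_UUID>/<lockspace_volume_UUID>
hosted-engine.metadata -> ../images/<metadata_image_UUID>/<metadata_volume_UUID>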
Created attachment 1363679 [details]
Workaround script
Under the hypothesis that the upgrade code correctly generated the missing volumes while upgrading the engine VM to el7 via hosted-engine --upgrade-appliance (and this seems not to be Evgheni's case, according to comment 17), the script at https://bugzilla.redhat.com/attachment.cgi?id=1363679 will print out the correct values to be set under /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts.
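For reference, the script's output is meant to fill the four keys that are currently empty in hosted-engine.conf; schematically (the <...> values are placeholders for whatever the script prints):

metadata_volume_UUID=<volume UUID printed by the script>
metadata_image_UUID=<image UUID printed by the script>
lockspace_volume_UUID=<volume UUID printed by the script>
lockspace_image_UUID=<image UUID printed by the script>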
Thanks Simone and Martin. Indeed, the "hosted-engine --upgrade-appliance" step was skipped when upgrading this environment from 3.6 to 4.0, so the volumes were missing. Everyone going via the official upgrade path should not be affected.

We worked around the issue the following way (a condensed command sketch follows the list):

1) create two disks on hosted_storage from the Engine UI and write down their image_id from the Disks tab
2) stop the HA broker and agent on all hosts (this does not affect any VMs)
3) verify the hosted-engine lockspace was released (sanlock client status) and release it where needed by running "sanlock client rem_lockspace -s LINE_FROM_STATUS"
4) mount the HE storage manually if needed (hosted-engine --connect-storage)
5) initialize the lockspace manually in the newly created file by running "sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/...path.to.new.lockspace.file" (note that it may need double-escaped characters to work; this was needed for our NFS)
6) create symlinks for backwards compatibility
7) update /etc/ovirt-hosted-engine/hosted-engine.conf with the respective volume and image UUIDs
8) start the broker on the 4.2 host and immediately set global maintenance, just in case
9) start the agent, then start the broker and agent on the remaining hosts
10) wait till all hosts are visible in "hosted-engine --vm-status" output
11) done; remove global maintenance
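A condensed sketch of steps 2-7 as shell commands, run as root. Everything in <angle brackets> is a placeholder for a value from your environment, and the sanlock invocations simply mirror the commands quoted in the list above, so double-check them against your setup before running anything:

# systemctl stop ovirt-ha-agent ovirt-ha-broker   # step 2, on every HA host
# sanlock client status                           # step 3, look for a hosted-engine lockspace line
# sanlock client rem_lockspace -s <line_from_status>   # step 3, only if the lockspace is still held
# hosted-engine --connect-storage                 # step 4, mount the HE storage domain
# sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/images/<lockspace_image_UUID>/<lockspace_volume_UUID>   # step 5
# cd /rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/ha_agent   # step 6
# ln -sf ../images/<metadata_image_UUID>/<metadata_volume_UUID> hosted-engine.metadata
# ln -sf ../images/<lockspace_image_UUID>/<lockspace_volume_UUID> hosted-engine.lockspace

Step 7 is then filling the four UUID keys in /etc/ovirt-hosted-engine/hosted-engine.conf on every host, as shown in the snippet above.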
(In reply to Evgheni Dereveanchin from comment #20)
> Thanks Simone and Martin. Indeed, the "hosted-engine --upgrade-appliance"
> step was skipped when upgrading this environment from 3.6 to 4.0, so the
> volumes were missing. Everyone going via the official upgrade path should
> not be affected.

Everyone that deployed hosted-engine on NFS on 3.4 is affected: the missing volumes are supposed to be created by 'hosted-engine --upgrade-appliance', but that was simply replacing the files with symlinks pointing to the new volumes. Now ovirt-ha-agent ignores the symlinks as well.
Simone, I think what Evgheni is saying is that this _only_ affects setups that:

- were installed using 3.3 or 3.4
- AND use NFS
- AND skipped hosted-engine --upgrade-appliance when upgrading

I believe this is rare enough that a knowledge base article or a release note might be a good enough resolution of this bug.
(In reply to Martin Sivák from comment #22)
> Simone, I think what Evgheni is saying is that this _only_ affects setups
> that:
>
> - were installed using 3.3 or 3.4
> - AND use NFS
> - AND skipped hosted-engine --upgrade-appliance when upgrading

It will also affect systems correctly upgraded with 'hosted-engine --upgrade-appliance': in that case all the volumes will be there, and the symlink will be there as well, but ovirt-hosted-engine-setup is not supposed to update hosted-engine.conf on all the hosts, so ovirt-ha-broker is going to fail anyway because it cannot find metadata_image_UUID in the config file, exactly as in this bug.

> I believe this is rare enough that a knowledge base article or a release
> note might be a good enough resolution of this bug.
Removing needinfo on me since I see engine on PHX has been upgraded to 4.2. Workaround script has been provided in comment #18.
*** Bug 1613278 has been marked as a duplicate of this bug. ***