1613278 – hosted-engine.conf missing values after HE re-deploy on host

Status:	CLOSED DUPLICATE of bug 1521011
Alias:	None
Product:	ovirt-hosted-engine-setup
Classification:	oVirt
Component:	General
Version:	2.2.24
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	ovirt-4.2.7
Target Release:	---
Assignee:	Simone Tiraboschi
QA Contact:	Nikolai Sednev

Reported:	2018-08-07 11:08 UTC by Evgheni Dereveanchin
Modified:	2018-10-01 07:24 UTC
CC List:	6 users
Last Closed:	2018-10-01 07:24:01 UTC
oVirt Team:	Integration
Flags:	rule-engine: ovirt-4.2+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1493384	0	medium	CLOSED	[downstream clone - 4.1.7] Additional HE host deploy fails due to 'received downloaded data size is wrong'	2021-02-22 00:41:40 UTC

Internal Links: 1493384

Description Evgheni Dereveanchin 2018-08-07 11:08:01 UTC

Description of problem:
When a host is re-added to an HE pool, hosted-engine.conf is missing most of its important values, which causes broker startup failures.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-engine-4.2.5.2-1.el7.noarch

Steps to Reproduce:
1. Perform a host reinstall, select HE to UNDEPLOY
2. Perform yet another reinstall of the same host, DEPLOY this time

Actual results:
/etc/ovirt-hosted-engine/hosted-engine.conf contains just host_id and nothing else, which causes HA services to fail:

MainThread::WARNING::2018-08-06 23:07:44,641::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: 'metadata_image_UUID can't be 'None'

Expected results:
HE services start fine

Additional info:
This happened on an HE environment upgraded multiple times that has no host_id=1 at the moment.

Comment 4 Martin Perina 2018-08-08 07:03:21 UTC

AFAIK re-adding a host to the engine is not officially supported without a complete OS reinstallation.

Comment 5 Evgheni Dereveanchin 2018-08-08 12:10:25 UTC

I did not re-add the host, I clicked the "reinstall" button in the UI twice - first to undeploy HE and then to re-deploy it. As far as I know that is the only way to add HE hosts nowadays (it was done through "hosted-engine --deploy" before).

When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is removed completely. Where is metadata_image_UUID supposed to come from during deploy?

Comment 6 Martin Perina 2018-08-08 12:36:21 UTC

(In reply to Evgheni Dereveanchin from comment #5)
> I did not re-add the host, I clicked the "reinstall" button in the UI twice
> - first to undeploy HE and then to re-deploy it. As far as I know that is
> the only way to add HE hosts nowadays (it was done through "hosted-engine
> --deploy" before).
> 
> When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is
> removed completely. Where is metadata_image_UUID supposed to come from
> during deploy?

Ah, sorry, I misunderstood your comment. AFAIK the above should be working.

Comment 7 Simone Tiraboschi 2018-08-09 13:06:40 UTC

(In reply to Evgheni Dereveanchin from comment #5)
> When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is
> removed completely. Where is metadata_image_UUID supposed to come from
> during deploy?

It should come from the configuration volume on the shared storage.

Can you please attach engine.log for the relevant time frame?

Comment 9 Simone Tiraboschi 2018-08-20 15:01:29 UTC

The issue comes from here:

2018-08-06 19:06:21,625-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-8) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt-srv01 command HSMGetAllTasksStatusesVDS failed: Internal file read failure: ('partial data 10240 from 20480',)

The engine tried to parse a 20480 bytes tar archive that was instead 10240 bytes long.
I think it's still a side effect/duplicate of https://bugzilla.redhat.com/1493384

Comment 10 Evgheni Dereveanchin 2018-08-21 15:19:57 UTC

Hi Simone and thanks for finding the probable root cause. 

How can I check this file manually, and what should cause a regeneration of this file? Is it somewhere on the HostedEngine storage domain?

As noted, our env is running 4.2 (updated all the way back from 3.4 I think) with the sole remaining HE host (ovirt-srv01) running the following software versions:


vdsm-4.20.27.1-1.el7.centos.x86_64
ovirt-hosted-engine-ha-2.2.11-1.el7.centos.noarch

Other hosts have newer software versions yet I need to deploy HE on them before being able to evacuate ovirt-srv01 to update it.

Comment 11 Simone Tiraboschi 2018-08-21 16:35:30 UTC

(In reply to Evgheni Dereveanchin from comment #10)
> How can I check this file manually

[root@tiramd1 ~]# . /etc/ovirt-hosted-engine/hosted-engine.conf
[root@tiramd1 ~]# dd if=/var/run/vdsm/storage/${sdUUID}/${conf_image_UUID}/${conf_volume_UUID} of=/dev/null
40+0 records in
40+0 records out
20480 bytes (20 kB) copied, 0,000197022 s, 104 MB/s

If you see 10240 here, we have found the problem.
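
The same size check can be scripted. The following is only a minimal sketch and not part of the hosted-engine tooling: the function name check_conf_volume_size and the EXPECTED_SIZE variable are made up for illustration, and the 20480 figure is the size the engine reported expecting in the log quoted in comment 9.

```shell
# Hypothetical helper (not part of hosted-engine): compare the size of a
# volume with the size the engine expects.
# 20480 is taken from the engine log ("partial data 10240 from 20480").
EXPECTED_SIZE=20480

check_conf_volume_size() {
    vol_path=$1
    # Read the whole volume and count its bytes, as the dd command does.
    actual=$(wc -c < "$vol_path") || return 2
    # Normalize: some wc implementations pad the number with spaces.
    actual=$((actual))
    if [ "$actual" -eq "$EXPECTED_SIZE" ]; then
        echo "OK: $actual bytes"
    else
        echo "MISMATCH: $actual bytes, expected $EXPECTED_SIZE"
        return 1
    fi
}
```

After sourcing hosted-engine.conf as shown above, it could be called with the same path that was passed to dd: check_conf_volume_size "/var/run/vdsm/storage/${sdUUID}/${conf_image_UUID}/${conf_volume_UUID}"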

> what should cause a regeneration of this
> file?  

Changing any HE configuration value with something like 
  hosted-engine --set-shared-config gateway 192.168.1.1 --type=he_shared

will rewrite the whole tar archive with (now after https://bugzilla.redhat.com/show_bug.cgi?id=1493384 ) the right size.

> Is it somewhere on the HostedEngine storage domain?

It's on a specific configuration volume on the hosted-engine storage domain.
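
Since the configuration volume is described in this bug as a tar archive, its members can presumably be listed with plain tar. This is a sketch under that assumption only; the function name list_conf_archive is hypothetical, and the path would be the same one used in the dd check.

```shell
# Hypothetical helper: list the files stored in the configuration archive.
# Assumes the volume at "$1" is a plain tar archive, as described in this bug.
list_conf_archive() {
    vol_path=$1
    # -t lists the archive members, -f names the archive to read.
    tar -tf "$vol_path"
}
```

Listing only reads the volume, so it should not alter anything on the shared storage.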

Comment 12 Evgheni Dereveanchin 2018-08-22 11:30:15 UTC

Indeed, the file is 10240 bytes in our case. Can I update the conf volume while HE is running on the host, or is it best to enable global maintenance and shut down the Hosted Engine first? As previously noted, there's currently just one HE host deployed.

I've also checked the KB article linked to the other BZ, and it lists the option of setting HostedEngineConfigDiskSizeInBytes=10240 via engine-config. Is this still relevant for 4.2, or is rewriting the config volume preferred?

Comment 13 Simone Tiraboschi 2018-08-22 11:48:43 UTC

(In reply to Evgheni Dereveanchin from comment #12)
> Indeed the file is 10240 bytes in our case. Can I update the conf volume
> while HE is running on the host 

Yes, no issue on that.

> I've also checked the KB article linked to the other BZ and it lists an
> option of specifying a config option HostedEngineConfigDiskSizeInBytes=10240
> via engine-config - is this still relevant for 4.2 or re-writing the config
> volume is preferred?

Rewriting the config volume is better, in my opinion.

Comment 14 Evgheni Dereveanchin 2018-08-23 16:10:29 UTC

Thanks, I was able to fix the volume size by running hosted-engine --set-shared-config. dd shows the right size and the correct values are printed by --get-shared-config (the gateway in our case was actually wrong, so this tool helped fix it).


Before I applied the fix however I added several storage domains which seems to have triggered a different bug with VDSM - logged BZ#1621468 to investigate.

Will try to deploy other hosts when that is sorted and report back.

Comment 15 Simone Tiraboschi 2018-08-23 16:18:20 UTC

(In reply to Evgheni Dereveanchin from comment #14)

> Before I applied the fix however I added several storage domains which seems
> to have triggered a different bug with VDSM - logged BZ#1621468 to
> investigate.

I think it's harmless and just a side effect of this one.

The point is why you got a volume which is 10240 bytes long instead of the 20480 bytes expected by the engine.

How did you deploy the first host? Which ovirt-hosted-engine-setup version did you initially use?

Comment 16 Evgheni Dereveanchin 2018-08-24 08:59:56 UTC

This environment was initially deployed as 3.4 and upgraded all the way up to 4.2 once new releases came out. HE host reinstall was last done during 3.5->3.6 update two years ago. I have not touched them ever since (just periodic updates). I believe the configuration volume was introduced some time after that but before the 4.1.7 fix. It was probably created during an engine upgrade and has been sitting in this form ever since.

Comment 17 Simone Tiraboschi 2018-08-24 09:42:21 UTC

(In reply to Evgheni Dereveanchin from comment #16)
> This environment was initially deployed as 3.4 and upgraded all the way up
> to 4.2 once new releases came out. HE host reinstall was last done during
> 3.5->3.6 update two years ago. I have not touched them ever since (just
> periodic updates). I believe the configuration volume was introduced some
> time after that but before the 4.1.7 fix. It was probably created during an
> engine upgrade and has been sitting in this form ever since.

OK, so we can simply close this as a duplicate of https://bugzilla.redhat.com/1493384

*** This bug has been marked as a duplicate of bug 1493384 ***

Comment 18 Evgheni Dereveanchin 2018-08-28 16:24:46 UTC

Unfortunately the issue is still in place after fixing the volume size.
Re-opening this; I will upload fresh logs in a second. hosted-engine.conf is no longer as empty, yet it is still missing metadata_image_UUID, so HA-broker fails to start with the same error as stated in #0.

Comment 23 Evgheni Dereveanchin 2018-09-03 14:46:57 UTC

As this issue is likely caused by some old bugs that caused missing values, it's not worth investigating the root cause, but I still want to get this environment operational again. Can I just copy in the values one by one and write them to the shared storage using "hosted-engine --set-shared-config", or is there a better way to recover from this metadata corruption?

Comment 24 Simone Tiraboschi 2018-09-04 10:04:39 UTC

(In reply to Evgheni Dereveanchin from comment #23)
> As this issue is likely caused by some old bugs that caused missing values

Is it still this? https://bugzilla.redhat.com/show_bug.cgi?id=1521011#c20

> it's not worth investigating the root cause but I still want to get this
> environment operational again. Can I just copy in the values one by one and
> write them to the shared storage using "hosted-engine --set-shared-config"
> or there's a better way to recover from this metadata corruption?

Copying the missing value is the way to go.
The best option is using 
hosted-engine --set-shared-config metadata_volume_UUID 365a6733-aefa-42fc-94b3-868bb0901374 --type=he_shared
and
hosted-engine --set-shared-config metadata_volume_UUID 365a6733-aefa-42fc-94b3-868bb0901374 --type=he_local

to fix the local copy of the file and also the master copy on the shared storage for the future.
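
When several keys need the same treatment, the pair of commands can be generated rather than typed by hand. The helper below is a hypothetical dry-run sketch: it only prints the commands, in the form shown above, so that they can be reviewed before being run. The function name print_set_config_commands is made up for illustration.

```shell
# Hypothetical dry-run helper: print, but do not execute, the two commands
# that set one key in both the shared and the local configuration.
print_set_config_commands() {
    key=$1
    value=$2
    for cfg_type in he_shared he_local; do
        echo "hosted-engine --set-shared-config $key $value --type=$cfg_type"
    done
}
```

For example, print_set_config_commands metadata_volume_UUID 365a6733-aefa-42fc-94b3-868bb0901374 prints the two commands quoted above.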

Comment 25 Evgheni Dereveanchin 2018-09-06 13:51:45 UTC

Thanks Simone! Indeed, it looks like the aftermath of a previous upgrade.

What about values other than metadata_volume_UUID?

A diff of hosted-engine.conf reveals the following on the config volume:

conf_image_UUID       - absent
conf_volume_UUID      - absent
lockspace_image_UUID  - empty
lockspace_volume_UUID - empty
metadata_image_UUID   - empty
metadata_volume_UUID  - empty
spUUID                - zeroes on working host
vm_disk_vol_ID        - absent

Should I set some of them to avoid future problems? I believe ones that ended up empty in the sharedconfig volume should have proper values set.
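
A comparison like the one above can be approximated with a small script. This is a sketch under stated assumptions: the file is in the key=value form that allows it to be sourced as in comment 11, the key list is copied from the diff above, and the function name report_missing_keys is hypothetical. It does not flag an all-zero spUUID, since that value is neither absent nor empty.

```shell
# Keys copied from the diff above; the helper itself is hypothetical.
HE_KEYS="conf_image_UUID conf_volume_UUID lockspace_image_UUID lockspace_volume_UUID metadata_image_UUID metadata_volume_UUID spUUID vm_disk_vol_ID"

report_missing_keys() {
    conf_file=$1
    for key in $HE_KEYS; do
        # First line starting with "key=", or an empty string if none exists.
        line=$(grep -m1 "^${key}=" "$conf_file" || true)
        if [ -z "$line" ]; then
            echo "$key absent"
        elif [ -z "${line#*=}" ]; then
            # The key is present but nothing follows the "=" sign.
            echo "$key empty"
        fi
    done
}
```

It would be run against a copy of hosted-engine.conf, for example report_missing_keys /etc/ovirt-hosted-engine/hosted-engine.conf for the local file.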

Comment 26 Sandro Bonazzola 2018-09-20 14:27:12 UTC

Simone anything to be fixed or documented? Or can we close?

Comment 27 Simone Tiraboschi 2018-10-01 07:24:01 UTC

(In reply to Sandro Bonazzola from comment #26)
> Simone anything to be fixed or documented? Or can we close?

Yes, it's just the result of a bad upgrade in the past.
We have a workaround here: https://bugzilla.redhat.com/show_bug.cgi?id=1521011#c18
and in kbs

*** This bug has been marked as a duplicate of bug 1521011 ***
