Description of problem:
When a host is re-added to an HE pool, hosted-engine.conf is missing most of the important values, which causes broker startup failures.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-engine-4.2.5.2-1.el7.noarch

Steps to Reproduce:
1. Perform a host reinstall, select HE to UNDEPLOY
2. Perform yet another reinstall of the same host, DEPLOY this time

Actual results:
/etc/ovirt-hosted-engine/hosted-engine.conf contains just host_id and nothing else, which causes HA services to fail:

MainThread::WARNING::2018-08-06 23:07:44,641::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: 'metadata_image_UUID can't be 'None'

Expected results:
HE services start fine

Additional info:
This happened on an HE environment upgraded multiple times that has no host_id=1 at the moment.
AFAIK re-adding a host to the engine is not officially supported without a complete OS reinstallation.
I did not re-add the host, I clicked the "reinstall" button in the UI twice - first to undeploy HE and then to re-deploy it. As far as I know that is the only way to add HE hosts nowadays (it was done through "hosted-engine --deploy" before). When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is removed completely. Where is metadata_image_UUID supposed to come from during deploy?
(In reply to Evgheni Dereveanchin from comment #5)
> I did not re-add the host, I clicked the "reinstall" button in the UI twice
> - first to undeploy HE and then to re-deploy it. As far as I know that is
> the only way to add HE hosts nowadays (it was done through "hosted-engine
> --deploy" before).
>
> When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is
> removed completely. Where is metadata_image_UUID supposed to come from
> during deploy?

Ah, sorry, I misunderstood your comment. AFAIK the above should be working.
(In reply to Evgheni Dereveanchin from comment #5)
> When Undeploy is selected /etc/ovirt-hosted-engine/hosted-engine.conf is
> removed completely. Where is metadata_image_UUID supposed to come from
> during deploy?

It should come from the configuration volume on the shared storage. Can you please attach engine.log for the relevant time frame?
The issue comes from here:

2018-08-06 19:06:21,625-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-8) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt-srv01 command HSMGetAllTasksStatusesVDS failed: Internal file read failure: ('partial data 10240 from 20480',)

The engine tried to parse a 20480-byte tar archive that was actually only 10240 bytes long. I think it's still a side effect/duplicate of https://bugzilla.redhat.com/1493384
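To make the failure mode concrete, here is a minimal sketch of a fixed-size read that fails on a short file. It is only an illustration of the idea: the function name read_exact is hypothetical and this is not the actual VDSM or engine code, only the error text is modelled on the log line above.

```python
def read_exact(path, expected):
    """Read exactly `expected` bytes from `path` or raise on a short read.

    Hypothetical helper for illustration only; not the VDSM implementation.
    """
    with open(path, "rb") as handle:
        data = handle.read(expected)
    if len(data) != expected:
        # Same wording as the error reported in engine.log
        raise IOError("partial data %d from %d" % (len(data), expected))
    return data
```

With a volume that holds only 10240 bytes and a caller that asks for 20480, this raises "partial data 10240 from 20480", which matches what the engine logged.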
Hi Simone, and thanks for finding the probable root cause. How can I check this file manually, and what would cause it to be regenerated? Is it somewhere on the HostedEngine storage domain?

As noted, our env is running 4.2 (updated all the way from 3.4, I think) with the sole remaining HE host (ovirt-srv01) running the following software versions:

vdsm-4.20.27.1-1.el7.centos.x86_64
ovirt-hosted-engine-ha-2.2.11-1.el7.centos.noarch

Other hosts have newer software versions, yet I need to deploy HE on them before I can evacuate ovirt-srv01 to update it.
(In reply to Evgheni Dereveanchin from comment #10)
> How can I check this file manually

[root@tiramd1 ~]# . /etc/ovirt-hosted-engine/hosted-engine.conf
[root@tiramd1 ~]# dd if=/var/run/vdsm/storage/${sdUUID}/${conf_image_UUID}/${conf_volume_UUID} of=/dev/null
40+0 records in
40+0 records out
20480 bytes (20 kB) copied, 0,000197022 s, 104 MB/s

If you see 10240 here, we have found it.

> what should cause a regeneration of this
> file?

Changing any HE configuration value with something like

hosted-engine --set-shared-config gateway 192.168.1.1 --type=he_shared

will rewrite the whole tar archive with the right size (now that https://bugzilla.redhat.com/show_bug.cgi?id=1493384 is fixed).

> Is it somewhere on the HostedEngine storage domain?

It's on a specific configuration volume on the hosted-engine storage domain.
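The dd check above can also be scripted. The following is a rough sketch, assuming hosted-engine.conf is a plain key=value file and that the volume is reachable under /var/run/vdsm/storage as in the dd command. The function names and the EXPECTED_SIZE constant are made up for this example; 20480 is simply the size the engine expected in this thread.

```python
import os

EXPECTED_SIZE = 20480  # size the engine expected in this report (assumption)


def parse_conf(path):
    """Parse a key=value file such as hosted-engine.conf into a dict."""
    values = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def conf_volume_path(conf, root="/var/run/vdsm/storage"):
    """Build the volume path from the same three keys the dd command uses."""
    return os.path.join(
        root, conf["sdUUID"], conf["conf_image_UUID"], conf["conf_volume_UUID"]
    )


def readable_bytes(path, chunk=1024 * 1024):
    """Count bytes by reading to EOF, as dd does (works for block devices too)."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            data = handle.read(chunk)
            if not data:
                return total
            total += len(data)
```

On a host, one would call parse_conf("/etc/ovirt-hosted-engine/hosted-engine.conf"), pass the result to conf_volume_path, and compare readable_bytes of that path with EXPECTED_SIZE. A KeyError from conf_volume_path means one of the three keys is missing from the local file.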
Indeed, the file is 10240 bytes in our case. Can I update the conf volume while HE is running on the host, or is it best to enable global maintenance and shut down the Hosted Engine? As previously noted, there's currently just one HE host deployed.

I've also checked the KB article linked to the other BZ, and it lists the option of setting HostedEngineConfigDiskSizeInBytes=10240 via engine-config. Is this still relevant for 4.2, or is rewriting the config volume preferred?
(In reply to Evgheni Dereveanchin from comment #12)
> Indeed the file is 10240 bytes in our case. Can I update the conf volume
> while HE is running on the host

Yes, no issue with that.

> I've also checked the KB article linked to the other BZ and it lists an
> option of specifying a config option HostedEngineConfigDiskSizeInBytes=10240
> via engine-config - is this still relevant for 4.2 or re-writing the config
> volume is preferred?

Rewriting the config volume is better, in my opinion.
Thanks, I was able to fix the volume size by running hosted-engine --set-shared-config. dd shows the right size, and correct values are printed by --get-shared-config (the gateway in our case was actually wrong, so this tool helped fix it).

Before I applied the fix, however, I added several storage domains, which seems to have triggered a different bug in VDSM. I logged BZ#1621468 to investigate. I will try to deploy other hosts when that is sorted and report back.
(In reply to Evgheni Dereveanchin from comment #14)
> Before I applied the fix however I added several storage domains which seems
> to have triggered a different bug with VDSM - logged BZ#1621468 to
> investigate.

I think it's harmless and just a side effect of this one. The question is why you got a volume that is 10240 bytes long instead of the 20480 expected by the engine. How did you deploy the first host? Which ovirt-hosted-engine-setup version did you use initially?
This environment was initially deployed as 3.4 and upgraded all the way up to 4.2 once new releases came out. HE host reinstall was last done during 3.5->3.6 update two years ago. I have not touched them ever since (just periodic updates). I believe the configuration volume was introduced some time after that but before the 4.1.7 fix. It was probably created during an engine upgrade and has been sitting in this form ever since.
(In reply to Evgheni Dereveanchin from comment #16)
> This environment was initially deployed as 3.4 and upgraded all the way up
> to 4.2 once new releases came out. HE host reinstall was last done during
> 3.5->3.6 update two years ago. I have not touched them ever since (just
> periodic updates). I believe the configuration volume was introduced some
> time after that but before the 4.1.7 fix. It was probably created during an
> engine upgrade and has been sitting in this form ever since.

OK, so we can simply close this as a duplicate of https://bugzilla.redhat.com/1493384

*** This bug has been marked as a duplicate of bug 1493384 ***
Unfortunately, the issue is still present after fixing the volume size. I am re-opening this and will upload fresh logs in a second. hosted-engine.conf is no longer as empty as before, yet it is still missing metadata_image_UUID, so the HA broker fails to start with the same error as stated in comment #0.
As this issue is likely caused by some old bugs that left values missing, it's not worth investigating the root cause, but I still want to get this environment operational again. Can I just copy in the values one by one and write them to the shared storage using "hosted-engine --set-shared-config", or is there a better way to recover from this metadata corruption?
(In reply to Evgheni Dereveanchin from comment #23)
> As this issue is likely caused by some old bugs that caused missing values

Is it still this?
https://bugzilla.redhat.com/show_bug.cgi?id=1521011#c20

> it's not worth investigating the root cause but I still want to get this
> environment operational again. Can I just copy in the values one by one and
> write them to the shared storage using "hosted-engine --set-shared-config"
> or there's a better way to recover from this metadata corruption?

Copying the missing value is the way to go. The best option is using

hosted-engine --set-shared-config metadata_volume_UUID 365a6733-aefa-42fc-94b3-868bb0901374 --type=he_shared

and

hosted-engine --set-shared-config metadata_volume_UUID 365a6733-aefa-42fc-94b3-868bb0901374 --type=he_local

to fix the local copy of the file and also the master copy on the shared storage for the future.
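When more than one key has to be restored, the pair of commands above can be generated rather than typed by hand. This is a small sketch that only builds the command strings for review and does not execute anything. The function name build_fix_commands is invented for this example, and the keys and values passed in are assumed to have been copied from a known-good source.

```python
import shlex


def build_fix_commands(missing, types=("he_shared", "he_local")):
    """Return hosted-engine command lines that set each key for each type.

    `missing` maps a config key to the value it should have. Nothing is
    executed here; the strings are meant to be reviewed and run manually.
    """
    commands = []
    for key, value in sorted(missing.items()):
        for conf_type in types:
            argv = [
                "hosted-engine",
                "--set-shared-config",
                key,
                value,
                "--type=" + conf_type,
            ]
            commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands
```

Calling it with {"metadata_volume_UUID": "365a6733-aefa-42fc-94b3-868bb0901374"} yields the same two command lines quoted in this comment, first for he_shared and then for he_local.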
Thanks Simone! Indeed, it looks like the aftermath of a previous upgrade. What about the values other than metadata_volume_UUID? A diff of hosted-engine.conf reveals the following on the config volume:

conf_image_UUID - absent
conf_volume_UUID - absent
lockspace_image_UUID - empty
lockspace_volume_UUID - empty
metadata_image_UUID - empty
metadata_volume_UUID - empty
spUUID - zeroes on working host
vm_disk_vol_ID - absent

Should I set some of them to avoid future problems? I believe the ones that ended up empty in the shared config volume should have proper values set.
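A comparison like the one above could be scripted roughly as follows. This is a sketch under the assumption that both copies of hosted-engine.conf are plain key=value text. The names parse_conf_text and compare_conf are made up for this example, and how the copy from the config volume is extracted is left out.

```python
def parse_conf_text(text):
    """Parse key=value text into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def compare_conf(reference, candidate):
    """Classify each reference key as absent, empty or differing in candidate.

    Keys whose values match are omitted from the result.
    """
    report = {}
    for key, ref_value in reference.items():
        if key not in candidate:
            report[key] = "absent"
        elif candidate[key] == "":
            report[key] = "empty"
        elif candidate[key] != ref_value:
            report[key] = "differs"
    return report
```

Here the reference would be the file from a working host and the candidate the copy stored on the config volume. Keys reported as absent or empty are the ones to review before setting them with --set-shared-config.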
Simone, is there anything to be fixed or documented? Or can we close?
(In reply to Sandro Bonazzola from comment #26)
> Simone anything to be fixed or documented? Or can we close?

Yes, it's just the result of a bad upgrade in the past. We have a workaround here: https://bugzilla.redhat.com/show_bug.cgi?id=1521011#c18 and in the KBs.

*** This bug has been marked as a duplicate of bug 1521011 ***