Created attachment 1519837 [details]
engine logs

Description of problem:
During restore, if the second ha-host (the current SPM) gets disconnected, the restore fails with "Failed to attach Storage due to an error on the Data Center master Storage Domain. -Please activate the master Storage Domain first." and HTTP response code 409:

[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "deprecations": [{"msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'", "version": 2.8}], "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Failed to attach Storage due to an error on the Data Center master Storage Domain.\n-Please activate the master Storage Domain first.]\". HTTP response code is 409."}
          Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]:

2019-01-10 14:32:01,459+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host
2019-01-10 14:32:01,460+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = puma19.scl.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{hostId='e44ac540-e514-4678-b29a-cffe02d0ab6d', vds='Host[puma19.scl.lab.tlv.redhat.com,e44ac540-e514-4678-b29a-cffe02d0ab6d]'})' execution failed: java.net.NoRouteToHostException: No route to host
2019-01-10 14:32:04,468+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to puma19.scl.lab.tlv.redhat.com/10.35.160.47

Version-Release number of selected component (if applicable):
ovirt-engine-setup-4.2.8.2-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy vintage HE over FC on two ha-hosts.
2. Add an iSCSI data storage domain for guest VMs.
3. Create and run 4 guest VMs, 2 on each ha-host, and add an ISO domain.
4. Make sure that the first ha-host "A" is SPM and the HE VM is running on it.
5. Set global maintenance.
6. Back up the engine and copy the backup file to a safe place.
7. Set the second ha-host "B" as SPM.
8. Reprovision "A" and restore HE over a clean NFS storage domain, using "hosted-engine --deploy --restore-from-file=file" (see the command sketch below).
9. During the restore, restart or power off host "B".

Actual results:
During the restore, if host "B" is unavailable, there is a delay of about 10 minutes while SPM contention moves to host "A", and the deployment fails because of that delay. The customer needs to retry manually, after which the restore succeeds; see the attached logs.

Expected results:
The restore should not drop the user because of the delay caused by SPM contention moving to host "A" from the unreachable host "B".

Additional info:
sosreport from host "A".
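For reference, a minimal sketch of the commands behind steps 5, 6 and 8; the backup file name and paths are placeholders, not the ones used in this run:

  # Step 5, on the host running the HE VM: enter global maintenance.
  hosted-engine --set-maintenance --mode=global

  # Step 6, on the engine VM: back up the engine, then copy the file somewhere safe.
  engine-backup --mode=backup --file=/root/engine-backup.tar.gz --log=/root/engine-backup.log

  # Step 8, on the reprovisioned host "A": restore over a clean NFS storage domain.
  hosted-engine --deploy --restore-from-file=/root/engine-backup.tar.gz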
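As a hypothetical aid when retrying after the ~10 minute SPM contention delay described above, the engine REST API can report each host's SPM status; the FQDN and credentials below are placeholders and assume the restored engine is already reachable:

  # Hypothetical check: each <host> element in the reply carries an <spm>
  # section showing whether that host currently holds SPM.
  curl -s -k -u 'admin@internal:password' \
      -H 'Accept: application/xml' \
      'https://engine.example.com/ovirt-engine/api/hosts'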
Created attachment 1519838 [details] deployment.log
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0. If you think this bug should block 4.3.0, please re-target and set the blocker flag.
Moving to 4.3.2 since this has not been identified as a blocker for 4.3.1.
This happens only in a corner case and no easy fix is available; re-targeting to 4.4.
Can this be reproduced with 4.4 and ansible deployment?
(In reply to Sandro Bonazzola from comment #15)
> Can this be reproduced with 4.4 and ansible deployment?

To check that, we should first fix https://bugzilla.redhat.com/show_bug.cgi?id=1795672.
I've just tried to reproduce this, migrating from one NFS volume to another, and everything worked just fine using ansible. I think we may close this as WORKSFORME.

rhvm-appliance-4.4-20200417.0.el8ev.x86_64
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
Linux 4.18.0-193.2.1.el8_2.x86_64 #1 SMP Mon May 4 14:32:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)