Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1665138

Summary: During restore, if the second SPM ha-host gets disconnected, the restore fails with "Please activate the master Storage Domain first". HTTP response code is 409.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nikolai Sednev <nsednev>
Component: ovirt-hosted-engine-setup
Assignee: Evgeny Slutsky <eslutsky>
Status: CLOSED WORKSFORME
QA Contact: Nikolai Sednev <nsednev>
Severity: medium
Docs Contact:
Priority: low
Version: 4.2.8
CC: cshao, gveitmic, lsurette
Target Milestone: ovirt-4.4.1
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-14 09:09:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1786458, 1795238, 1795672
Bug Blocks:
Attachments:
engine logs (flags: none)
deployment.log (flags: none)

Description Nikolai Sednev 2019-01-10 14:37:53 UTC
Created attachment 1519837 [details]
engine logs

Description of problem:
During the restore, if the second (SPM) ha-host gets disconnected, the restore fails with:
[ ERROR ] -Please activate the master Storage Domain first.]". HTTP response code is 409.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "deprecations": [{"msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'", "version": 2.8}], "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Failed to attach Storage due to an error on the Data Center master Storage Domain.\n-Please activate the master Storage Domain first.]\". HTTP response code is 409."}
          Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]: 

2019-01-10 14:32:01,459+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host
2019-01-10 14:32:01,460+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = puma19.scl.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{hostId='e44ac540-e514-4678-b29a-cffe02d0ab6d', vds='Host[puma19.scl.lab.tlv.redhat.com,e44ac540-e514-4678-b29a-cffe02d0ab6d]'})' execution failed: java.net.NoRouteToHostException: No route to host
2019-01-10 14:32:04,468+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to puma19.scl.lab.tlv.redhat.com/10.35.160.47
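For reference, one way an operator can confirm that SPM contention has settled and re-activate the master storage domain before retrying is through the engine REST API. The following is a minimal sketch, not part of the hosted-engine-setup tooling; ENGINE_FQDN, ADMIN_PASSWORD, DC_ID and SD_ID are placeholders, not values taken from this bug:

#!/bin/bash
# Sketch only: poll the engine REST API until some host reports SPM status
# "spm", then ask the engine to activate the master storage domain.
API="https://ENGINE_FQDN/ovirt-engine/api"
AUTH="admin@internal:ADMIN_PASSWORD"

# SPM contention away from an unreachable host takes roughly 10 minutes to
# time out, so poll for up to 12 minutes.
for _ in $(seq 1 72); do
    if curl -ks -u "$AUTH" "$API/hosts" | grep -q "<status>spm</status>"; then
        echo "SPM has settled."
        break
    fi
    sleep 10
done

# Activate the master storage domain in its data center (empty action body).
curl -ks -u "$AUTH" -X POST -H "Content-Type: application/xml" \
     -d "<action/>" "$API/datacenters/DC_ID/storagedomains/SD_ID/activate"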



Version-Release number of selected component (if applicable):
ovirt-engine-setup-4.2.8.2-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy vintage HE over FC on two ha-hosts.
2. Add an iSCSI data storage domain for guest VMs.
3. Create and run 4 guest VMs, 2 on each ha-host, and add an ISO domain.
4. Make sure the first ha-host, "A", is SPM and the HE VM is running on it.
5. Set global maintenance.
6. Back up the engine and copy the backup file to a safe place.
7. Set the second ha-host, "B", as SPM.
8. Reprovision "A" and restore HE onto a clean NFS storage domain using "hosted-engine --deploy --restore-from-file=file" (see the command sketch after this list).
9. During the restore, restart or power off host "B".
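For concreteness, the commands behind steps 5-8 look roughly like this (a sketch; the backup file name and log path are placeholders, not taken from this bug):

# Step 5, on an ha-host: keep the HA agents from interfering.
hosted-engine --set-maintenance --mode=global

# Step 6, on the engine VM: back up the engine, then copy the file off the VM.
engine-backup --mode=backup --file=engine-backup.tar.gz --log=backup.log

# Step 8, on the reprovisioned host "A": restore onto a clean NFS storage domain.
hosted-engine --deploy --restore-from-file=engine-backup.tar.gz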

Actual results:
During the restore, if host "B" is unavailable, it takes about 10 minutes for SPM to contend over to host "A", and the deployment fails because of that delay. The customer then has to retry manually, after which the restore succeeds; see the attached logs.
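The manual retry described above amounts to re-running the restore with the same backup file once the contention window has passed, e.g. (file name is a placeholder):

# Re-run the restore once SPM has moved to host "A".
hosted-engine --deploy --restore-from-file=engine-backup.tar.gz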

Expected results:
The restore should not drop the user because of the delay caused by SPM contending over to host "A" from the unreachable host "B".

Additional info:
sosreport from host "A".

Comment 1 Nikolai Sednev 2019-01-10 14:40:47 UTC
Created attachment 1519838 [details]
deployment.log

Comment 5 Sandro Bonazzola 2019-01-21 08:28:50 UTC
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0.
If you think this bug should block 4.3.0, please re-target and set the blocker flag.

Comment 13 Sandro Bonazzola 2019-02-18 07:54:59 UTC
Moving to 4.3.2, as this was not identified as a blocker for 4.3.1.

Comment 14 Sandro Bonazzola 2019-02-27 08:53:47 UTC
This happens only in a corner case and no easy fix is available; re-targeting to 4.4.

Comment 15 Sandro Bonazzola 2019-11-27 08:51:00 UTC
Can this be reproduced with 4.4 and ansible deployment?

Comment 16 Nikolai Sednev 2020-02-02 10:46:47 UTC
(In reply to Sandro Bonazzola from comment #15)
> Can this be reproduced with 4.4 and ansible deployment?
To check that we should first fix https://bugzilla.redhat.com/show_bug.cgi?id=1795672.

Comment 17 Nikolai Sednev 2020-05-14 09:09:37 UTC
I've just tried to reproduce this by migrating from one NFS volume to another, and everything worked fine using ansible.
I think we may close this as WORKSFORME.

rhvm-appliance-4.4-20200417.0.el8ev.x86_64  
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
Linux 4.18.0-193.2.1.el8_2.x86_64 #1 SMP Mon May 4 14:32:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)