Bug 1665138 - During restore, if the second SPM ha-host gets disconnected, the restore fails with "Please activate the master Storage Domain first." HTTP response code is 409.
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.2.8
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ovirt-4.4.1
Assignee: Evgeny Slutsky
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1786458 1795238 1795672
Blocks:
 
Reported: 2019-01-10 14:37 UTC by Nikolai Sednev
Modified: 2020-05-14 09:09 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-14 09:09:37 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments
engine logs (1.48 MB, text/plain) - 2019-01-10 14:37 UTC, Nikolai Sednev
deployment.log (29.05 KB, text/plain) - 2019-01-10 14:40 UTC, Nikolai Sednev


Links
Red Hat Knowledge Base (Solution) 3822242 - last updated 2019-01-22 03:41:20 UTC

Description Nikolai Sednev 2019-01-10 14:37:53 UTC
Created attachment 1519837 [details]
engine logs

Description of problem:
During restore, if the second SPM ha-host gets disconnected, the restore fails with:
[ ERROR ] -Please activate the master Storage Domain first.]". HTTP response code is 409.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "deprecations": [{"msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'", "version": 2.8}], "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Failed to attach Storage due to an error on the Data Center master Storage Domain.\n-Please activate the master Storage Domain first.]\". HTTP response code is 409."}
          Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]: 
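
The deploy fails at the storage-domain attach step because the Data Center's master storage domain is not yet active on the new SPM. As a workaround sketch, one could poll the engine's REST API until the master domain reports "active" before retrying the attach; the FQDN, credentials, UUID, and JSON field names below are placeholders and assumptions based on the oVirt v4 API, not taken from this report:

  # Hypothetical poll: wait until the Data Center's master storage domain
  # is active before retrying the attach. ENGINE_FQDN, ENGINE_PASS and
  # DC_ID are placeholders.
  while true; do
      status=$(curl -sk -u "admin@internal:${ENGINE_PASS}" \
          -H 'Accept: application/json' \
          "https://${ENGINE_FQDN}/ovirt-engine/api/datacenters/${DC_ID}/storagedomains" \
          | jq -r '.storage_domain[] | select(.master == "true") | .status')
      [ "${status}" = "active" ] && break
      sleep 30
  done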

2019-01-10 14:32:01,459+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host
2019-01-10 14:32:01,460+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-17) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = puma19.scl.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{hostId='e44ac540-e514-4678-b29a-cffe02d0ab6d', vds='Host[puma19.scl.lab.tlv.redhat.com,e44ac540-e514-4678-b29a-cffe02d0ab6d]'})' execution failed: java.net.NoRouteToHostException: No route to host
2019-01-10 14:32:04,468+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to puma19.scl.lab.tlv.redhat.com/10.35.160.47
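
The engine cannot reach host "B" (NoRouteToHostException) while "B" still holds the SPM role, so the pool is left without an active SPM until contention completes. On a host, the pool's SPM state (Free / Contend / SPM) can be inspected directly; a sketch, assuming vdsm-client is installed and the storage pool UUID is known (placeholder below):

  # Show the SPM status the host reports for the storage pool.
  vdsm-client StoragePool getSpmStatus storagepoolID=<pool-uuid>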



Version-Release number of selected component (if applicable):
ovirt-engine-setup-4.2.8.2-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy vintage HE over FC on two ha-hosts.
2. Add an iSCSI data storage domain for guest VMs.
3. Create and run 4 guest VMs, 2 on each ha-host, and add an ISO domain.
4. Make sure that the first ha-host "A" is SPM and the HE VM is running on it.
5. Set global maintenance.
6. Back up the engine and copy the backup file to a safe place.
7. Set the second ha-host "B" as SPM.
8. Reprovision "A" and restore HE over a clean NFS storage domain, using "hosted-engine --deploy --restore-from-file=file" (see the command sketch after this list).
9. During the restore, restart or power off host "B".
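
For steps 6 and 8, the commands look roughly as follows; the file paths are examples, not taken from this report:

  # Step 6, run inside the HE VM while in global maintenance: back up the engine.
  engine-backup --mode=backup --file=/root/engine-backup.tar.gz --log=/root/engine-backup.log

  # Step 8, run on the reprovisioned host "A": restore from the backup file.
  hosted-engine --deploy --restore-from-file=/root/engine-backup.tar.gz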

Actual results:
During the restore, if host "B" is unavailable, there is a delay of about 10 minutes while the SPM role contends over to host "A", and the deployment fails because of that delay. The customer then needs to retry manually, after which the restore succeeds; see the attached logs.
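
The manual retry amounts to driving the restore again once host "A" has acquired SPM; per the output quoted above, the failed run also re-prompts for the storage type, so the retry can happen within the same session or via a fresh run (path is a placeholder matching the sketch above):

  # Retry once SPM contention has completed on host "A"; per this report,
  # the second attempt succeeds.
  hosted-engine --deploy --restore-from-file=/root/engine-backup.tar.gz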

Expected results:
The restore should not drop the user because of the delay caused by the SPM role contending over to host "A" from the unreachable host "B".

Additional info:
sosreport from host "A".

Comment 1 Nikolai Sednev 2019-01-10 14:40:47 UTC
Created attachment 1519838 [details]
deployment.log

Comment 5 Sandro Bonazzola 2019-01-21 08:28:50 UTC
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0.
If you think this bug should block 4.3.0 please re-target and set blocker flag.

Comment 13 Sandro Bonazzola 2019-02-18 07:54:59 UTC
Moving to 4.3.2, as this has not been identified as a blocker for 4.3.1.

Comment 14 Sandro Bonazzola 2019-02-27 08:53:47 UTC
This happens only in a corner case and no easy fix is available; re-targeting to 4.4.

Comment 15 Sandro Bonazzola 2019-11-27 08:51:00 UTC
Can this be reproduced with 4.4 and ansible deployment?

Comment 16 Nikolai Sednev 2020-02-02 10:46:47 UTC
(In reply to Sandro Bonazzola from comment #15)
> Can this be reproduced with 4.4 and ansible deployment?
To check that, we should first fix https://bugzilla.redhat.com/show_bug.cgi?id=1795672.

Comment 17 Nikolai Sednev 2020-05-14 09:09:37 UTC
I've just tried to reproduce this by migrating from one NFS volume to another, and everything worked fine using the Ansible deployment.
I think we may close this as WORKSFORME.

rhvm-appliance-4.4-20200417.0.el8ev.x86_64  
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
Linux 4.18.0-193.2.1.el8_2.x86_64 #1 SMP Mon May 4 14:32:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

