Description of problem:
During restore, new or reprovisioned hosts are required to have a mapping to the old iSCSI hosted storage. If you try to reprovision one of your old ha-hosts for the HE restore, or use a new host for that matter, and your HE was previously deployed on an iSCSI storage domain, that host must be properly mapped to the iSCSI target; otherwise the restore fails with:

[ INFO ] TASK [Add NFS storage domain]
[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Cannot add storage server connection when Host status is not up]". HTTP response code is 409.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "deprecations": [{"msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'", "version": 2.8}], "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot add storage server connection when Host status is not up]\". HTTP response code is 409."}

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch
rhvm-appliance-4.2-20181026.1.el7.noarch
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on host "A" over iSCSI.
2. Add an additional ha-host "B" and make sure the HE VM is running on host "B".
3. Set global maintenance, take a backup, and reprovision host "B" without remapping it to the old iSCSI hosted storage target.
4. Run "hosted-engine --deploy --restore-from-file=/yourpath/restorefile" to restore from host "B" on NFS instead of iSCSI.

Actual results:
[ INFO ] TASK [Add NFS storage domain]
[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Cannot add storage server connection when Host status is not up]". HTTP response code is 409.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "deprecations": [{"msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'", "version": 2.8}], "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot add storage server connection when Host status is not up]\". HTTP response code is 409."}

Expected results:
The old hosted engine's storage domain should not influence the hosted engine restore procedure.
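For reference, a minimal shell sketch (not part of the original report) of how the reprovisioned host could be checked and remapped to the old iSCSI hosted-storage target before starting the restore. The portal address and target IQN below are placeholders, and the SAN must still accept this host's initiator IQN:

  # Check which initiator IQN this host presents to the SAN
  cat /etc/iscsi/initiatorname.iscsi

  # Check for an existing session, then discover and log in to the old target
  iscsiadm -m session
  iscsiadm -m discovery -t sendtargets -p 10.0.0.10:3260
  iscsiadm -m node -T iqn.2018-11.example.com:he-storage -p 10.0.0.10:3260 --login

  # Only then start the restore
  hosted-engine --deploy --restore-from-file=/yourpath/restorefile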
Adding logs here and more details:

[ ERROR ] {u'_ansible_parsed': True, u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.000129626 s, 79.0 MB/s', u'tar: 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'changed': True, u'end': u'2018-11-12 18:22:09.105088', u'_ansible_item_label': {u'image_id': u'b5637066-c345-41ef-8c06-b74c21b7778d', u'name': u'OVF_STORE', u'id': u'cdc311e5-ef5b-4576-94cc-1f0a53de0cc9'}, u'stdout': u'', u'failed': True, u'_ansible_item_result': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2018-11-12 18:22:08.485310', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=69e41970-14b3-48b8-95bd-b22d64f572e8 storagedomainID=663d668d-c72c-41dc-9962-e7e726e00cc4 imageID=cdc311e5-ef5b-4576-94cc-1f0a53de0cc9 volumeID=b5637066-c345-41ef-8c06-b74c21b7778d | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf", u'item': {u'image_id': u'b5637066-c345-41ef-8c06-b74c21b7778d', u'name': u'OVF_STORE', u'id': u'cdc311e5-ef5b-4576-94cc-1f0a53de0cc9'}, u'delta': u'0:00:00.619778', u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=69e41970-14b3-48b8-95bd-b22d64f572e8 storagedomainID=663d668d-c72c-41dc-9962-e7e726e00cc4 imageID=cdc311e5-ef5b-4576-94cc-1f0a53de0cc9 volumeID=b5637066-c345-41ef-8c06-b74c21b7778d | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000129626 s, 79.0 MB/s\ntar: 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False}

[ ERROR ] {u'_ansible_parsed': True, u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.000124448 s, 82.3 MB/s', u'tar: 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'changed': True, u'end': u'2018-11-12 18:24:21.196652', u'_ansible_item_label': {u'image_id': u'3831c261-0c90-4ad1-acea-e71e11c3f6b6', u'name': u'OVF_STORE', u'id': u'4612359a-f267-4035-96e5-1d424497cbc6'}, u'stdout': u'', u'failed': True, u'_ansible_item_result': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2018-11-12 18:24:20.575761', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=69e41970-14b3-48b8-95bd-b22d64f572e8 storagedomainID=663d668d-c72c-41dc-9962-e7e726e00cc4 imageID=4612359a-f267-4035-96e5-1d424497cbc6 volumeID=3831c261-0c90-4ad1-acea-e71e11c3f6b6 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf", u'item': {u'image_id': u'3831c261-0c90-4ad1-acea-e71e11c3f6b6', u'name': u'OVF_STORE', u'id': u'4612359a-f267-4035-96e5-1d424497cbc6'}, u'delta': u'0:00:00.620891', u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=69e41970-14b3-48b8-95bd-b22d64f572e8 storagedomainID=663d668d-c72c-41dc-9962-e7e726e00cc4 imageID=4612359a-f267-4035-96e5-1d424497cbc6 volumeID=3831c261-0c90-4ad1-acea-e71e11c3f6b6 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000124448 s, 82.3 MB/s\ntar: 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False}

[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
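For readability, the check failing in both entries above is the following pipeline (taken verbatim from the log, only re-wrapped here): it prepares the OVF_STORE volume and looks for the new HE VM's OVF inside the tar archive, and tar exits non-zero because 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf is not found in the archive:

  vdsm-client Image prepare \
      storagepoolID=69e41970-14b3-48b8-95bd-b22d64f572e8 \
      storagedomainID=663d668d-c72c-41dc-9962-e7e726e00cc4 \
      imageID=cdc311e5-ef5b-4576-94cc-1f0a53de0cc9 \
      volumeID=b5637066-c345-41ef-8c06-b74c21b7778d \
    | grep path | awk '{ print $2 }' \
    | xargs -I{} sudo -u vdsm dd if={} \
    | tar -tvf - 566421ac-7ccb-4bff-8bed-909051ce7fee.ovf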
Created attachment 1504799 [details] deployment logs from alma04 aka host "B"
Created attachment 1504801 [details] sosreport from alma04 aka host "B"
This is not specific to iSCSI or to the hosted-engine storage domain. The strong requirement, and we have to update our documentation to reflect that, is that at redeployment the host should be able to do a connectStorageServer to all the storage servers listed in the datacenter where we want to add the host. If just one of the connectStorageServer commands fails (for instance because the host got a new iSCSI initiator IQN that got refused by the SAN, as in this case; see the sketch after this comment), the host will be set as non operational after a few connectStorageServer attempts, and the deploy will fail since we cannot proceed on a non operational host.

The specific storage domain can then be completely corrupted or even missing (in another test we completely emptied the folder containing an NFS storage domain, but connectStorageServer was fine since the NFS server was still there), yet the connectStorageServer command has to work. Starting to guess which storage domain is still there and which is not, and automatically attempting recovery actions (like deactivating the SD), is in my opinion too risky since it can impact VMs running on other hosts.

What we can reasonably do when storage servers listed in the backup file are not available at restore time (for any reason):
a. Let the user inject an ansible task file with custom compensation actions to be executed before adding the host used in the restore process.
b. Ask the user to redeploy on a custom new datacenter so that the engine will not try other storage connections on that host and the host will not be set as non operational. At the end of the deployment the user will get a running and reachable engine in a temporary and isolated datacenter, but he can use that engine to try interactive manual recovery actions. Once the user has a working environment, he can take a new backup where everything is working and restore again to put the new host in the desired data-center and cluster.
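For illustration only (not part of the original comment): a minimal shell sketch of the kind of manual compensation meant above for the refused-IQN case, run on the reprovisioned host before retrying the restore. The IQN value is a placeholder and assumes the SAN ACLs still reference the old host's initiator name:

  # Put back the initiator IQN that the SAN ACLs already accept (placeholder value)
  echo "InitiatorName=iqn.1994-05.com.redhat:old-host-b" > /etc/iscsi/initiatorname.iscsi
  systemctl restart iscsid
  # Then repeat the target discovery and login shown under the bug description
  # so connectStorageServer can succeed during the restore.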
*** Bug 1649001 has been marked as a duplicate of this bug. ***
This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Why did you close this bug as a duplicate of bug 1654935? Your bug was opened much later than this one.
1648987 Reported: 2018-11-12 16:26 UTC by Nikolai Sednev
1654935 Reported: 2018-11-30 04:57 UTC by Tahlia Richardson
(In reply to Nikolai Sednev from comment #14)
> Why did you close this bug as a duplicate of bug 1654935?
> Your bug was opened much later than this one.
> 1648987 Reported: 2018-11-12 16:26 UTC by Nikolai Sednev
> 1654935 Reported: 2018-11-30 04:57 UTC by Tahlia Richardson

It's been a while since then, so the best answer I can give is "it seemed like the right thing to do at the time". If you needed this one open for metrics or tracking, I'll keep that in mind for future cases, and the RHV docs team may be willing to switch them around. From the docs side, AFAIK there isn't really a convention for which bug to keep open, as long as the work gets done.