Bug 1712667 - [downstream clone - 4.3.5] hosted-engine deploy (restore-from-file) fails if any non-management logical network is marked as required in backup file.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.2.8-3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.5
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1686575 1720896
Blocks: 1719314
 
Reported: 2019-05-22 05:18 UTC by RHV bug bot
Modified: 2021-12-23 14:43 UTC
CC List: 7 users

Fixed In Version: ovirt-ansible-hosted-engine-setup-1.0.21
Doc Type: Bug Fix
Doc Text:
Previously, hosted-engine-setup automatically configured only the management network on the host used at restore time. If the backup file contained references to additional required logical networks, the missing networks prevented the host from booting. Now, hosted-engine-setup detects the missing networks and displays a hint to enable the user to connect to the engine manually, so the host can boot.
Clone Of: 1686575
Environment:
Last Closed: 2019-08-12 11:53:52 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
sosreport from alma03 (10.33 MB, application/x-xz), 2019-06-04 12:20 UTC, Nikolai Sednev
sosreport from alma03 (10.60 MB, application/x-xz), 2019-06-05 09:08 UTC, Nikolai Sednev
sosreport from alma03 (10.53 MB, application/x-xz), 2019-06-05 13:40 UTC, Nikolai Sednev


Links
Github oVirt ovirt-ansible-hosted-engine-setup pull 183 (closed): Let the user pause execution to interactively bring up the host (last updated 2021-02-12 16:38:20 UTC)
Github oVirt ovirt-ansible-hosted-engine-setup pull 206 (closed): Bind the engine VM to its disk by id (last updated 2021-02-12 16:38:20 UTC)
Red Hat Issue Tracker RHV-44400 (last updated 2021-12-23 14:43:30 UTC)
Red Hat Knowledge Base (Solution) 4088711: [RHV] The hosted-engine deploy (restore-from-file) fails if any non-management logical network is defined as a required ... (last updated 2019-05-22 05:18:49 UTC)
Red Hat Product Errata RHBA-2019:2434 (last updated 2019-08-12 11:53:54 UTC)

Internal Links: 1822023

Description RHV bug bot 2019-05-22 05:18:40 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1686575 +++
======================================================================

Description of problem:
hosted-engine deploy (restore-from-file) fails if any non-management logical network is marked as required in the backup file: the host is set to non-operational because the required network is missing, and the deployment aborts.

Errors from the deployment logs:
~~~
2019-03-07 20:33:50,711+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"}

2019-03-07 20:35:00,862+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
~~~

Errors from the engine logs:
~~~
2019-03-07 20:33:42,342+05 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-16) [6fad6d2a] Host '<hostname>' is set to Non-Operational, it is missing the following networks: 'test'
2019-03-07 20:33:42,397+05 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-16) [6fad6d2a] EVENT_ID: VDS_SET_NONOPERATIONAL_NETWORK(519), Host <hostname> does not comply with the cluster Default networks, the following networks are missing on host: 'test'
~~~


Version-Release number of selected component (if applicable):

ovirt-hosted-engine-setup-2.2.34-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
rhvm-4.2.8.5-0.1.el7ev.noarch

How reproducible:
Always


Steps to Reproduce:
1. Have a backup file in which one or more non-management logical networks are marked as required.
2. hosted-engine --deploy --restore-from-file=backup/file_name

Actual results:
Deployment fails.

Expected results:
Deployment should succeed even when the backup file contains required non-management logical networks.

Additional info:

(Originally by Ameya Charekar)

Comment 2 RHV bug bot 2019-05-22 05:18:43 UTC
Setting the severity of this bug to high: it is a showstopper when a user is recovering the hosted engine after data loss or corruption, because there is no old setup to go back to in order to mark the network as not required. This is a production-down scenario.

A workaround is to use the enginevm_after_engine_setup hook to mark the networks as not required before the host is added, for example:


/usr/share/ovirt-hosted-engine-setup/ansible/hooks/enginevm_after_engine_setup/fix_network.yml

# auth_sso.yml is the setup's own SSO login task file; it populates the ovirt_auth variable used below
- include_tasks: auth_sso.yml

- name: Wait for the engine to reach a stable condition
  wait_for: timeout=300

# Replace require_network_1/require_network_2 with the networks marked as required in the backup
- name: Mark the non-management logical networks as not required in the Default cluster
  ovirt_network:
    auth: "{{ ovirt_auth }}"
    name: "{{ item }}"
    data_center: Default
    clusters:
      - name: Default
        required: false
  with_items:
    - require_network_1
    - require_network_2
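Tasks placed in the enginevm_after_engine_setup hook directory should run right after engine-setup completes on the restored engine and before the host is added back, which is why the networks can be demoted to not required in time; the auth_sso.yml include reuses the setup's own SSO login tasks so that ovirt_auth is populated for the ovirt_network call.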

(Originally by Nijin Ashok)

Comment 4 Nikolai Sednev 2019-06-04 12:20:09 UTC
Restore fails with:
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.

Software Version:
ovirt-engine-setup-4.3.4.2-0.1.el7.noarch
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.9-1.el7ev.noarch
Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Comment 5 Nikolai Sednev 2019-06-04 12:20:49 UTC
Created attachment 1577043 [details]
sosreport from alma03

Comment 6 Nikolai Sednev 2019-06-05 09:07:55 UTC
[ ERROR ] AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false, "msg": "Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.."}

Moving back to assigned.

Comment 7 Nikolai Sednev 2019-06-05 09:08:55 UTC
Created attachment 1577484 [details]
sosreport from alma03

Comment 8 Simone Tiraboschi 2019-06-05 09:45:48 UTC
(In reply to Nikolai Sednev from comment #6)
> [ ERROR ] AuthError: Error during SSO authentication access_denied : Cannot
> authenticate user 'None@N/A': No valid profile found in credentials..
> [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false,
> "msg": "Error during SSO authentication access_denied : Cannot authenticate
> user 'None@N/A': No valid profile found in credentials.."}

OK, understood: it is essentially a timeout on the authentication token when the user takes too long to bring up the host.

On engine.log we see:
2019-06-05 11:31:43,115+03 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-2) [1705f268] OAuthException invalid_grant: The provided authorization grant for the auth code has expired.
2019-06-05 11:31:43,118+03 ERROR [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-4) [] Cannot authenticate using authentication Headers: invalid_grant: The provided authorization grant for the auth code has expired.
2019-06-05 11:31:43,136+03 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-4) [] OAuthException access_denied: Cannot authenticate user 'None@N/A': No valid profile found in credentials..

while on the host:
2019-06-05 10:51:53,682+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.igyeot_he_setup_lock is removed, delete it once ready to proceed]
2019-06-05 11:31:41,009+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ok: [localhost]
2019-06-05 11:31:41,712+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
2019-06-05 11:43:06,473+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 {u'exception': u'Traceback (most recent call last):\n  File "/tmp/ansible_ovirt_host_facts_payload_Sqklpx/__main__.py", line 110, in main\n    follow=\'cluster\',\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 12141, in list\n    return self._internal_get(headers, query, wait)\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 211, in _internal_get\n    return future.wait() if wait else future\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 54, in wait\n    response = self._connection.wait(self._context)\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 497, in wait\n    return self.__wait(context, failed_auth)\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 527, in __wait\n    self._sso_token = self._get_access_token()\n  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 628, in _get_access_token\n    sso_error[1]\nAuthError: Error during SSO authentication access_denied : Cannot authenticate user \'None@N/A\': No valid profile found in credentials..\n', u'_ansible_no_log': False, u'changed': False, u'msg': u"Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..", u'attempts': 120, u'invocation': {u'module_args': {u'all_content': False, u'pattern': u'name=alma03.qa.lab.tlv.redhat.com', u'fetch_nested': False, u'cluster_version': None, u'nested_attributes': []}}}
2019-06-05 11:43:06,574+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..


So 40 minutes there (10:51 -> 11:31) are currently too long for the authentication token that has been generated at:

2019-06-05 10:51:32,720+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Obtain SSO token using username/password credentials]
2019-06-05 10:51:34,824+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ok: [localhost]

So we definitely have to regenerate the token after the pause to be on the safe side.
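For illustration only, a minimal sketch of what regenerating the token right after the pause could look like in the role (the real fix landed via the pull requests linked above; he_fqdn and he_admin_password are assumed variable names, not necessarily the ones the role actually uses):

# Sketch only: obtain a fresh SSO token once the user removes the lock file,
# so the following "Wait for the host to be up" task does not rely on an expired grant
- name: Refresh the SSO token after the interactive pause
  ovirt_auth:
    url: "https://{{ he_fqdn }}/ovirt-engine/api"   # assumed variable name
    username: admin@internal
    password: "{{ he_admin_password }}"             # assumed variable name
    insecure: true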

Comment 9 Nikolai Sednev 2019-06-05 13:39:22 UTC
Even without waiting too long, the restore still fails with the following error:
[ INFO  ] TASK [ovirt.hosted_engine_setup : Wait for OVF_STORE disk content]
[ ERROR ] {u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.000147846 s, 69.3 MB/s', u'tar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'ansible_loop_var': u'item', u'end': u'2019-06-05 16:30:37.422598', u'_ansible_item_label': {u'image_id': u'20060b0c-38e4-4023-8348-6c9f39043534', u'name': u'OVF_STORE', u'id': u'54f02da4-1496-4442-8b7b-e6097f477574'}, u'stdout': u'', u'failed': True, u'changed': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2019-06-05 16:30:36.656917', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=54f02da4-1496-4442-8b7b-e6097f477574 volumeID=20060b0c-38e4-4023-8348-6c9f39043534 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'item': {u'image_id': u'20060b0c-38e4-4023-8348-6c9f39043534', u'name': u'OVF_STORE', u'id': u'54f02da4-1496-4442-8b7b-e6097f477574'}, u'delta': u'0:00:00.765681', u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=54f02da4-1496-4442-8b7b-e6097f477574 volumeID=20060b0c-38e4-4023-8348-6c9f39043534 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'removes': None, u'argv': None, u'warn': False, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000147846 s, 69.3 MB/s\ntar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False}
[ ERROR ] {u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.00015368 s, 66.6 MB/s', u'tar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'ansible_loop_var': u'item', u'end': u'2019-06-05 16:32:51.941100', u'_ansible_item_label': {u'image_id': u'3f972ef5-2830-4766-bed7-61d2341bce51', u'name': u'OVF_STORE', u'id': u'e12301d7-a462-4a61-8b8b-0af23c35f87f'}, u'stdout': u'', u'failed': True, u'changed': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2019-06-05 16:32:51.125418', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=e12301d7-a462-4a61-8b8b-0af23c35f87f volumeID=3f972ef5-2830-4766-bed7-61d2341bce51 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'item': {u'image_id': u'3f972ef5-2830-4766-bed7-61d2341bce51', u'name': u'OVF_STORE', u'id': u'e12301d7-a462-4a61-8b8b-0af23c35f87f'}, u'delta': u'0:00:00.815682', u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=e12301d7-a462-4a61-8b8b-0af23c35f87f volumeID=3f972ef5-2830-4766-bed7-61d2341bce51 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'removes': None, u'argv': None, u'warn': False, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.00015368 s, 66.6 MB/s\ntar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

See the sosreport attached from the host.

Comment 10 Nikolai Sednev 2019-06-05 13:40:31 UTC
Created attachment 1577587 [details]
sosreport from alma03

Comment 13 Nikolai Sednev 2019-07-09 12:02:41 UTC
Works fine on these components:
ovirt-engine-setup-4.3.5.3-0.1.el7.noarch
ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch
Linux 3.10.0-1060.el7.x86_64 #1 SMP Mon Jul 1 18:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 Beta (Maipo)

During restore I had to answer "YES" to "Pause the execution after adding this host to the engine?
          You will be able to iteratively connect to the restored engine in order to manually review and remediate its configuration before proceeding with the deployment:
          please ensure that all the datacenter hosts and storage domain are listed as up or in maintenance mode before proceeding.
          This is normally not required when restoring an up to date and coherent backup. (Yes, No)[No]: yes"

Then at some point I was asked to make changes to the running engine environment:
"[ INFO  ] You can now connect to https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO  ] TASK [ovirt.hosted_engine_setup : include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO  ] changed: [localhost -> localhost]
[ INFO  ] TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to proceed]
"
When I was done and the host was listed as "up", I manually deleted "/tmp/ansible.WGeSW8_he_setup_lock" from the host, and the deployment continued as expected until it finished successfully.
Then I disabled global maintenance via the CLI and the restore was complete.

Moving to verified.

Comment 14 Steve Goodman 2019-07-21 14:19:43 UTC
Please review the doc text:

Previously, hosted-engine-setup automatically configured only the management network on the host used at restore time. If the backup file contained references to additional required logical networks, the missing networks prevented the host from booting.

Now, hosted-engine-setup detects the missing networks and displays a hint to enable the user to connect to the engine manually, so the host can boot.

Comment 16 errata-xmlrpc 2019-08-12 11:53:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2434

Comment 17 Daniel Gur 2019-08-28 13:13:55 UTC
sync2jira

Comment 18 Daniel Gur 2019-08-28 13:18:09 UTC
sync2jira

Comment 19 Simone Tiraboschi 2020-05-28 09:29:35 UTC
(In reply to Steve Goodman from comment #14)
> Please review the doc text:
> 
> Previously, hosted-engine-setup automatically configured only the management
> network on the host used at restore time. If the backup file contained
> references to additional required logical networks, the missing networks
> prevented the host from booting.
> 
> Now, hosted-engine-setup detects the missing networks and displays a hint to
> enable the user to connect to the engine manually, so the host can boot.

Fine, thanks

