+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1686575 +++ ====================================================================== Description of problem: hosted-engine deploy (restore-from-file) fails if any non-management logical network is marked as required in backup file as host is marked as non-operational due to missing required network. Errors from deployment logs:- ~~~ 2019-03-07 20:33:50,711+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"} 2019-03-07 20:35:00,862+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"} ~~~ Errors from engine logs:- ~~~ 2019-03-07 20:33:42,342+05 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-16) [6fad6d2a] Host '<hostname>' is set to Non-Operational, it is missing the following networks: 'test' 2019-03-07 20:33:42,397+05 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-16) [6fad6d2a] EVENT_ID: VDS_SET_NONOPERATIONAL_NETWORK(519), Host <hostname> does not comply with the cluster Default networks, the following networks are missing on host: 'test' ~~~ Version-Release number of selected component (if applicable): ovirt-hosted-engine-setup-2.2.34-1.el7ev.noarch ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch rhvm-4.2.8.5-0.1.el7ev.noarch How reproducible: Always Steps to Reproduce: 1. Have a backup_file required non-management logical networks. 2. hosted-engine --deploy --restore-from-file=backup/file_name 3. Actual results: Deployment fails. Expected results: It should work even with required non-management logical networks Additional info: (Originally by Ameya Charekar)
Setting the severity of this bug to high as this will be a showstopper when a user is recovering the hosted engine after a data loss or corruption where the user cannot go back to old setup and mark the network as not required. This will be a production down scenario. A workaround will be to use hook enginevm_after_engine_setup to set the network as not required before adding the host. /usr/share/ovirt-hosted-engine-setup/ansible/hooks/enginevm_after_engine_setup/fix_network.yml - include_tasks: auth_sso.yml - name: Wait for the engine to reach a stable condition wait_for: timeout=300 - name: fix network ovirt_network: auth: "{{ ovirt_auth }}" name: "{{ item }}" data_center: Default clusters: - name: Default required: False with_items: - require_network_1 - require_network_2 (Originally by Nijin Ashok)
Restore fails with: [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"} [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"} [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook [ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch. Software Version: ovirt-engine-setup-4.3.4.2-0.1.el7.noarch ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch ovirt-hosted-engine-setup-2.3.9-1.el7ev.noarch Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.6 (Maipo)
Created attachment 1577043 [details] sosreport from alma03
[ ERROR ] AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.. [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false, "msg": "Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.."} Moving back to assigned.
Created attachment 1577484 [details] sosreport from alma03
(In reply to Nikolai Sednev from comment #6) > [ ERROR ] AuthError: Error during SSO authentication access_denied : Cannot > authenticate user 'None@N/A': No valid profile found in credentials.. > [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false, > "msg": "Error during SSO authentication access_denied : Cannot authenticate > user 'None@N/A': No valid profile found in credentials.."} OK, understood: it's basically a kind of timeout on the authentication token if the user takes too much to bring up the host. On engine.log we see: 2019-06-05 11:31:43,115+03 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-2) [1705f268] OAuthException invalid_grant: The provided authorization grant for the auth code has expired. 2019-06-05 11:31:43,118+03 ERROR [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-4) [] Cannot authenticate using authentication Headers: invalid_grant: The provided authorization grant for the auth code has expired. 2019-06-05 11:31:43,136+03 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-4) [] OAuthException access_denied: Cannot authenticate user 'None@N/A': No valid profile found in credentials.. while on the host: 2019-06-05 10:51:53,682+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.igyeot_he_setup_lock is removed, delete it once ready to proceed] 2019-06-05 11:31:41,009+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ok: [localhost] 2019-06-05 11:31:41,712+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Wait for the host to be up] 2019-06-05 11:43:06,473+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 {u'exception': u'Traceback (most recent call last):\n File "/tmp/ansible_ovirt_host_facts_payload_Sqklpx/__main__.py", line 110, in main\n follow=\'cluster\',\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 12141, in list\n return self._internal_get(headers, query, wait)\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 211, in _internal_get\n return future.wait() if wait else future\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 54, in wait\n response = self._connection.wait(self._context)\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 497, in wait\n return self.__wait(context, failed_auth)\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 527, in __wait\n self._sso_token = self._get_access_token()\n File "/usr/lib64/python2.7/site-packages/ovirtsdk4/__init__.py", line 628, in _get_access_token\n sso_error[1]\nAuthError: Error during SSO authentication access_denied : Cannot authenticate user \'None@N/A\': No valid profile found in credentials..\n', u'_ansible_no_log': False, u'changed': False, u'msg': u"Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..", u'attempts': 120, u'invocation': {u'module_args': {u'all_content': False, u'pattern': u'name=alma03.qa.lab.tlv.redhat.com', u'fetch_nested': False, u'cluster_version': None, u'nested_attributes': []}}} 2019-06-05 11:43:06,574+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.. So 40 minutes there (10:51 -> 11:31) are currently too much for the authentication token that as been generated at: 2019-06-05 10:51:32,720+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 TASK [ovirt.hosted_engine_setup : Obtain SSO token using username/password credentials] 2019-06-05 10:51:34,824+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ok: [localhost] So we definitively have to regenerate the token after the pause to be on the safe side.
In case of not waiting too much, the restore is still not working with the following error: [ INFO ] TASK [ovirt.hosted_engine_setup : Wait for OVF_STORE disk content] [ ERROR ] {u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.000147846 s, 69.3 MB/s', u'tar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'ansible_loop_var': u'item', u'end': u'2019-06-05 16:30:37.422598', u'_ansible_item_label': {u'image_id': u'20060b0c-38e4-4023-8348-6c9f39043534', u'name': u'OVF_STORE', u'id': u'54f02da4-1496-4442-8b7b-e6097f477574'}, u'stdout': u'', u'failed': True, u'changed': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2019-06-05 16:30:36.656917', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=54f02da4-1496-4442-8b7b-e6097f477574 volumeID=20060b0c-38e4-4023-8348-6c9f39043534 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'item': {u'image_id': u'20060b0c-38e4-4023-8348-6c9f39043534', u'name': u'OVF_STORE', u'id': u'54f02da4-1496-4442-8b7b-e6097f477574'}, u'delta': u'0:00:00.765681', u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=54f02da4-1496-4442-8b7b-e6097f477574 volumeID=20060b0c-38e4-4023-8348-6c9f39043534 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'removes': None, u'argv': None, u'warn': False, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000147846 s, 69.3 MB/s\ntar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False} [ ERROR ] {u'stderr_lines': [u'20+0 records in', u'20+0 records out', u'10240 bytes (10 kB) copied, 0.00015368 s, 66.6 MB/s', u'tar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive', u'tar: Exiting with failure status due to previous errors'], u'ansible_loop_var': u'item', u'end': u'2019-06-05 16:32:51.941100', u'_ansible_item_label': {u'image_id': u'3f972ef5-2830-4766-bed7-61d2341bce51', u'name': u'OVF_STORE', u'id': u'e12301d7-a462-4a61-8b8b-0af23c35f87f'}, u'stdout': u'', u'failed': True, u'changed': True, u'msg': u'non-zero return code', u'rc': 2, u'start': u'2019-06-05 16:32:51.125418', u'attempts': 12, u'cmd': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=e12301d7-a462-4a61-8b8b-0af23c35f87f volumeID=3f972ef5-2830-4766-bed7-61d2341bce51 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'item': {u'image_id': u'3f972ef5-2830-4766-bed7-61d2341bce51', u'name': u'OVF_STORE', u'id': u'e12301d7-a462-4a61-8b8b-0af23c35f87f'}, u'delta': u'0:00:00.815682', u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u"vdsm-client Image prepare storagepoolID=356451ec-85f3-11e9-b0b8-00163e7bb853 storagedomainID=a0c3f50b-abfb-4ea0-803f-72a0ff52ecb0 imageID=e12301d7-a462-4a61-8b8b-0af23c35f87f volumeID=3f972ef5-2830-4766-bed7-61d2341bce51 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - d3c8b009-16a3-41f0-a76e-10722fae6430.ovf", u'removes': None, u'argv': None, u'warn': False, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'stdout_lines': [], u'stderr': u'20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.00015368 s, 66.6 MB/s\ntar: d3c8b009-16a3-41f0-a76e-10722fae6430.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors', u'_ansible_no_log': False} [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook See sosreport attached from host.
Created attachment 1577587 [details] sosreport from alma03
Works fine on these components: ovirt-engine-setup-4.3.5.3-0.1.el7.noarch ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch Linux 3.10.0-1060.el7.x86_64 #1 SMP Mon Jul 1 18:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.7 Beta (Maipo) During restore I had to answer "YES" to "Pause the execution after adding this host to the engine? You will be able to iteratively connect to the restored engine in order to manually review and remediate its configuration before proceeding with the deployment: please ensure that all the datacenter hosts and storage domain are listed as up or in maintenance mode before proceeding. This is normally not required when restoring an up to date and coherent backup. (Yes, No)[No]: yes" Then at some point I was asked to make changes to running engine environment: "[ INFO ] You can now connect to https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up' [ INFO ] TASK [ovirt.hosted_engine_setup : include_tasks] [ INFO ] ok: [localhost] [ INFO ] TASK [ovirt.hosted_engine_setup : Create temporary lock file] [ INFO ] changed: [localhost -> localhost] [ INFO ] TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to proceed] " When I was done and the host got listed "up", I manually deleted "/tmp/ansible.WGeSW8_he_setup_lock" from host and deployment continued as desired, until it successfully finished. Then I disabled global maintenance via CLI and restore was finished. Moving to verified.
Please review the doc text: Previously, hosted-engine-setup automatically configured only the management network on the host used at restore time. If the backup file contained references to additional required logical networks, the missing networks prevented the host from booting. Now, hosted-engine-setup detects the missing networks and displays a hint to enable the user to connect to the engine manually, so the host can boot.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2434
sync2jira
(In reply to Steve Goodman from comment #14) > Please review the doc text: > > Previously, hosted-engine-setup automatically configured only the management > network on the host used at restore time. If the backup file contained > references to additional required logical networks, the missing networks > prevented the host from booting. > > Now, hosted-engine-setup detects the missing networks and displays a hint to > enable the user to connect to the engine manually, so the host can boot. Fine, thanks