Bug 1953029 - HE deployment fails on "Add lines to answerfile" [NEEDINFO]
Summary: HE deployment fails on "Add lines to answerfile"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-ansible-collection
Version: 4.4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.4.7
Target Release: 4.4.7
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-23 18:38 UTC by amashah
Modified: 2021-07-22 15:26 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-22 15:26:02 UTC
oVirt Team: Integration
Target Upstream Version:
didi: needinfo? (amashah)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | oVirt ovirt-ansible-collection pull 277 | 0 | None | open | role: hosted_engine_setup: add_host: Use local_vm_ip | 2021-06-03 12:11:11 UTC
Red Hat Product Errata | RHSA-2021:2866 | 0 | None | None | None | 2021-07-22 15:26:18 UTC

Description amashah 2021-04-23 18:38:06 UTC
Description of problem:
During deployment of HE (in this case, when using a restore file), the deployment fails with:

~~~
2021-04-20 21:04:38,513+0000 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:111 TASK [ovirt.ovirt.hosted_engine_setup : Add lines to answerfile]
2021-04-20 21:04:39,216+0000 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:105 {'censored': "the output has been hidden due to the fact that 'no_log: true' was specified for this result", 'changed': False}
2021-04-20 21:04:39,317+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
~~~


Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
Unknown, customer hit it


Actual results:
HE deployment from backup fails

Expected results:
HE deployment from backup completes successfully.

Additional info:
If no_log is changed to false in 03_engine_initial_tasks.yml the error becomes more clear:

[ ERROR ] {'msg': 'Destination /root/ovirt-engine-answers does not exist !', 'failed': True, 'rc': 257, 'invocation': {'module_args': {'directory_mode': None, 'force': None, 'remote_src': None, 'backrefs': False, 'insertafter': None, 'path': '/root/ovirt-engine-answers', 'owner': None, 'follow': False, 'validate': None, 'group': None, 'insertbefore': None, 'unsafe_writes': None, 'create': False, 'setype': None, 'content': None, 'serole': None, 'state': 'present', 'selevel': None, 'regexp': None, 'line': 'OVESETUP_CONFIG/adminPassword=str:password', 'src': None, 'seuser': None, 'delimiter': None, 'mode': None, 'firstmatch': False, 'attributes': None, 'backup': False}}, '_ansible_no_log': False, 'changed': False, 'item': 'OVESETUP_CONFIG/adminPassword=str:password', 'ansible_loop_var': 'item', '_ansible_item_label': 'OVESETUP_CONFIG/adminPassword=str:password'}


Looking at the lineinfile ansible module, https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html

There is a parameter 'create':

~~~
boolean
Choices:
no ←
yes
Used with state=present.
If specified, the file will be created if it does not already exist.
By default it will fail if the file is missing.
~~~

In the playbook, we don't use this: https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/blob/master/tasks/bootstrap_local_vm/03_engine_initial_tasks.yml#L33

Potential workarounds:

 1. Create the file in advance (e.g. # touch /root/ovirt-engine-answers).
 
 2. Modify the playbook and add create: yes to the task.
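
The effect of the 'create' parameter can be illustrated with a small Python sketch. This is not the lineinfile implementation, only an imitation of its documented behaviour for state=present; the function name ensure_line and the scratch path are assumptions made for illustration.

```python
import os
import tempfile


def ensure_line(path, line, create=False):
    """Append `line` to `path` unless it is already present.

    Imitates lineinfile with state=present: a missing file is an error
    unless create=True. Returns True if the file was changed.
    """
    if not os.path.exists(path):
        if not create:
            raise FileNotFoundError(f"Destination {path} does not exist !")
        lines = []
    else:
        with open(path) as handle:
            lines = handle.read().splitlines()
    if line in lines:
        return False
    lines.append(line)
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical scratch path standing in for /root/ovirt-engine-answers.
    target = os.path.join(tempfile.mkdtemp(), "ovirt-engine-answers")
    try:
        ensure_line(target, "OVESETUP_CONFIG/adminPassword=str:password")
    except FileNotFoundError as error:
        print("without create:", error)
    changed = ensure_line(target, "OVESETUP_CONFIG/adminPassword=str:password", create=True)
    print("with create, changed:", changed)
```

With create left at its default the call fails in the same way as the task in the log above; with create=True the file is silently created, which is also why this workaround can hide the reason the file was missing.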

Comment 1 Yedidyah Bar David 2021-04-29 11:45:12 UTC
(In reply to amashah from comment #0)
> [ ERROR ] {'msg': 'Destination /root/ovirt-engine-answers does not exist !',

Can you please check why /root/ovirt-engine-answers is missing?

It's part of the appliance image. Most likely something/someone removed it, or there was some corruption/failure/etc.

I'd rather not apply the workaround you suggest as a permanent "fix", because this will likely mask some real problem somewhere that should better be addressed directly.

Comment 9 Yedidyah Bar David 2021-05-24 11:32:34 UTC
Investigation of attached logs suggests that the failure was a result of:

1. On the host deploying the hosted-engine, having a line in /etc/hosts pointing the engine's FQDN to a wrong IP address.

2. Having a VM listening on ssh on that IP address and with the same root password.

3. The code adding a local entry as a first line in /etc/hosts not being effective. Perhaps due to an update of ansible or some other infra/library package, caching of some kind, etc.

The result seems to have been that ansible connected to that address and successfully completed some tasks there - all those before 'Add lines to answerfile' in [1] - and then failed in 'Add lines to answerfile' because the file did not exist.
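
A minimal Python sketch of how such a stale /etc/hosts entry could be spotted before deploying. The function name hosts_addresses, the sample hosts content, the FQDN and both IP addresses are hypothetical; the sketch also assumes the common case where the first matching line is the one that takes effect.

```python
def hosts_addresses(hosts_text, name):
    """List the addresses of every hosts-file entry naming `name`, in file order."""
    wanted = name.lower()
    found = []
    for raw in hosts_text.splitlines():
        # Drop comments, then split into: address, canonical name, aliases.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (f.lower() for f in fields[1:]):
            found.append(fields[0])
    return found


# Hypothetical content; on a real host this would be read from /etc/hosts.
SAMPLE = """\
127.0.0.1 localhost
192.0.2.10 engine.example.com engine  # stale entry
"""

if __name__ == "__main__":
    expected = "192.168.222.5"  # hypothetical address of the local engine VM
    found = hosts_addresses(SAMPLE, "engine.example.com")
    if found and found[0] != expected:
        print(f"first entry for the engine FQDN is {found[0]}, expected {expected}")
```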

How to continue?

To prevent/workaround:

1. Do not have a wrong line in /etc/hosts. Generally speaking, oVirt/RHV, like most networked software, is very sensitive to correct name resolution. Double check that forward and back resolution work as expected before trying to deploy.

2. Have different root passwords on different machines. This would have likely caused ansible to fail earlier, thus perhaps making it slightly easier to spot the bug.

3. Use separate networks as applicable, to prevent wrong access.
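
The forward and back resolution check from item 1 can be sketched in Python with the standard socket module. The function name check_resolution is an assumption; the resolver functions are passed in as parameters only so the logic can be exercised without touching real DNS.

```python
import socket


def check_resolution(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve `fqdn` to an address and back.

    Returns (address, names, consistent), where `consistent` is True when
    the reverse lookup of the address yields `fqdn` as a name or alias.
    Resolver errors (socket.gaierror, socket.herror) are left to propagate.
    """
    address = forward(fqdn)
    primary, aliases, _addresses = reverse(address)
    names = [primary, *aliases]
    consistent = fqdn.lower() in (n.lower() for n in names)
    return address, names, consistent
```

Calling check_resolution with only the engine's FQDN uses the system resolver, so it goes through /etc/hosts and DNS in whatever order the host is configured to use.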

To fix, I'll try:

1. Patch add_engine_as_ansible_host.yml and add there 'ansible_host: "{{ local_vm_ip.stdout_lines[0] }}"'

2. Perhaps patch [1] to check that the machine has a local address as above

[1] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/tasks/bootstrap_local_vm/03_engine_initial_tasks.yml
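
One possible reading of the second fix item, sketched in Python: before running further tasks, compare the addresses the connected machine reports with the expected local VM address. This is not what the linked patch does; the function name machine_has_address and the example addresses are hypothetical.

```python
import ipaddress


def machine_has_address(reported_addresses, expected_address):
    """True if `expected_address` is among the addresses the machine reports."""
    expected = ipaddress.ip_address(expected_address)
    return any(ipaddress.ip_address(a) == expected for a in reported_addresses)


if __name__ == "__main__":
    # Hypothetical values: addresses gathered from the machine ansible
    # actually reached, and the address expected for the local engine VM.
    reported = ["192.0.2.10"]
    expected = "192.168.222.5"
    if not machine_has_address(reported, expected):
        print("connected to a machine that does not own the local VM address")
```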

Comment 12 Nikolai Sednev 2021-06-16 22:58:28 UTC
I suppose that verification should be general backup and restore on latest 4.4.7.3-0.3.el8ev?

Comment 13 Yedidyah Bar David 2021-06-17 05:53:54 UTC
(In reply to Nikolai Sednev from comment #12)
> I suppose that verification should be general backup and restore on latest
> 4.4.7.3-0.3.el8ev?

Generally speaking, HE deploy, new setup or restore, should be enough for sanity testing.

See also comment 9.

For the record: The linked patch only handles a theoretical flow that I guessed had happened based on the provided logs, even though I failed to reproduce it. It might also be related to customization of name resolution - /etc/resolv.conf, /etc/nsswitch.conf, use of nscd, libnss_db, etc.

Comment 14 Nikolai Sednev 2021-06-22 16:22:03 UTC
Backup and restore from ovirt-engine-setup-base-4.4.7.3-0.3.el8ev.noarch to ovirt-engine-setup-4.4.7.4-0.9.el8ev.noarch, from NFS to NFS:
I ran "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma03_rhevm_4_4_7"
Pause the execution after adding this host to the engine?
You will be able to connect to the restored engine in order to manually review and remediate its configuration.
This is normally not required when restoring an up to date and coherent backup.
Pause after adding the host? (Yes, No)[No]: yes

Got to the part:
[ INFO  ] You can now connect to https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.2yhnku1o_he_setup_lock is removed, delete it once ready to proceed]

Then upgraded the engine to latest bits:
ovirt-engine-setup-4.4.7.4-0.9.el8ev.noarch
ansible-2.9.21-1.el8ae.noarch
ovirt-ansible-collection-1.5.1-1.el8ev.noarch
python3-ansible-runner-1.4.6-2.el8ar.noarch
ansible-runner-service-1.0.7-1.el8ev.noarch
Linux nsednev-he-1.qa.lab.tlv.redhat.com 4.18.0-305.7.1.el8_4.x86_64 #1 SMP Mon Jun 14 17:25:42 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Then I ran: "rm -rf /tmp/ansible.2yhnku1o_he_setup_lock" and continued with the restore to a different NFS share.

*New ovirt-ansible-collection-1.5.1-1.el8ev.noarch did not cause me any issues during the restore, although I performed it from an engine running ovirt-ansible-collection-1.5.0-1.el8ev.noarch and then restored to ovirt-ansible-collection-1.5.1-1.el8ev.noarch.

[ INFO  ] Hosted Engine successfully deployed
[ INFO  ] Other hosted-engine hosts have to be reinstalled in order to update their storage configuration. From the engine, host by host, please set maintenance mode and then click on reinstall button ensuring you choose DEPLOY in hosted engine tab.
[ INFO  ] Please note that the engine VM ssh keys have changed. Please remove the engine VM entry in ssh known_hosts on your clients.

Moving to verified.

Comment 20 errata-xmlrpc 2021-07-22 15:26:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: RHV Engine and Host Common Packages security update [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2866

