Bug 1893385 - hosted-engine deploy (restore-from-file) fails if any non-management logical network is marked as required in backup file
Summary: hosted-engine deploy (restore-from-file) fails if any non-management logical network is marked as required in backup file
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-ansible-collection
Version: 4.4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-30 23:42 UTC by Steffen Froemer
Modified: 2021-02-02 13:58 UTC
CC List: 10 users

Fixed In Version: ovirt-ansible-collection-1.2.3
Doc Type: Enhancement
Doc Text:
In previous versions, when using 'hosted-engine --restore-from-file' to restore or upgrade, if the backup included extra required networks in the cluster, and if the user did not reply 'Yes' to the question about pausing the execution, deployment failed. In this release, regardless of the answer to 'pause?', if the host is found to be in state "Non Operational", deployment will pause, outputting relevant information to the user, and waiting until a lock file is removed. This should allow the user to then connect to the web admin UI and manually handle the situation, activate the host, and then remove the lock file and continue the deployment. This release also allows supplying a custom hook to fix such issues automatically.
Clone Of:
Environment:
Last Closed: 2021-02-02 13:58:29 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
nsednev: needinfo+


Attachments


Links
Github oVirt ovirt-ansible-collection pull 181 (closed): roles: hosted_engine_setup: Add after_add_host hook - last updated 2021-02-11 08:59:54 UTC
Red Hat Knowledge Base (Solution) 4088711 - last updated 2020-10-30 23:42:37 UTC
Red Hat Product Errata RHBA-2021:0312 - last updated 2021-02-02 13:58:35 UTC

Description Steffen Froemer 2020-10-30 23:42:38 UTC
Description of problem:
During an upgrade from a RHV 4.3 hosted engine, deployment of the new HE fails if any non-management logical network is marked as required in the backup file.



Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.4.6-2.el8ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Have a backup file that contains non-management logical networks marked as required.
2. hosted-engine --deploy --restore-from-file=backup/file_name

Actual results:
It fails with the following errors.

Deployment error logs:

~~~
2020-10-28 23:16:55,614+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
2020-10-28 23:18:49,098+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
2020-10-28 23:18:51,005+0000 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed executing ansible-playbook
2020-10-28 23:19:16,527+0000 ERROR otopi.plugins.gr_he_common.core.misc misc._terminate:167 Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
~~~

Expected results:
The deployment should succeed.

Additional info:
The workaround from KCS #4088711 works well:

# cat > /usr/share/ansible/roles/ovirt.hosted-engine-setup/hooks/enginevm_after_engine_setup/network_fix.yml << EOF
- include_tasks: auth_sso.yml
- name: Wait for the engine to reach a stable condition
  wait_for: timeout=300
- name: fix network
  ovirt_network:
     auth: "{{ ovirt_auth }}"
     name: "{{ item }}"
     data_center: Default
     clusters:
        - name: Default
          required: False
  with_items:
     - backend
     - frontend
     - public-zone
     - security-zone
     - storage
EOF
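
Note that this workaround leaves those networks marked as non-required in the restored engine. Below is a minimal sketch of a standalone follow-up playbook to mark them as required again once all hosts are Up - the engine URL, credentials and network names are placeholders, not values taken from this bug, and with ovirt-ansible-collection the modules would be addressed as ovirt.ovirt.*:

~~~
---
# Hypothetical follow-up playbook (run after the restore has finished and
# all hosts are Up again). Adjust URL, credentials and names to your setup.
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Obtain an SSO token from the restored engine
      ovirt_auth:
        url: https://engine.example.com/ovirt-engine/api   # placeholder
        username: admin@internal
        password: "{{ engine_password }}"
        insecure: true                                     # or supply ca_file

    - name: Mark the networks as required again in the Default cluster
      ovirt_network:
        auth: "{{ ovirt_auth }}"
        name: "{{ item }}"
        data_center: Default
        clusters:
          - name: Default
            required: true
      loop:
        - backend
        - frontend
        - public-zone
        - security-zone
        - storage

    - name: Revoke the SSO token
      ovirt_auth:
        state: absent
        ovirt_auth: "{{ ovirt_auth }}"
~~~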

Comment 1 Steffen Froemer 2020-10-30 23:44:29 UTC
It's related to, or the same as, https://bugzilla.redhat.com/show_bug.cgi?id=1686575

Comment 2 Yedidyah Bar David 2020-11-02 08:24:08 UTC
(In reply to Steffen Froemer from comment #0)
> Steps to Reproduce:
> 1. Have a backup_file required non-management logical networks.
> 2. hosted-engine --deploy --restore-from-file=backup/file_name

> Expected results:
> it should work

(In reply to Steffen Froemer from comment #1)
> It's related or same as https://bugzilla.redhat.com/show_bug.cgi?id=1686575

Indeed. Did you check this bug? The "solution" isn't automatic, as in "it should work".

Copying also bug 1712667 comment 13:
> During restore I had to answer "YES" to "Pause the execution after adding
> this host to the engine?
>           You will be able to iteratively connect to the restored engine in
> order to manually review and remediate its configuration before proceeding
> with the deployment:
>           please ensure that all the datacenter hosts and storage domain are
> listed as up or in maintenance mode before proceeding.
>           This is normally not required when restoring an up to date and
> coherent backup. (Yes, No)[No]: yes"
> 
> Then at some point I was asked to make changes to running engine environment:
> "[ INFO  ] You can now connect to
> https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status
> of this host and eventually remediate it, please continue only when the host
> is listed as 'up'
> [ INFO  ] TASK [ovirt.hosted_engine_setup : include_tasks]
> [ INFO  ] ok: [localhost]
> [ INFO  ] TASK [ovirt.hosted_engine_setup : Create temporary lock file]
> [ INFO  ] changed: [localhost -> localhost]
> [ INFO  ] TASK [ovirt.hosted_engine_setup : Pause execution until
> /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to
> proceed]
> "
> When I was done and the host got listed "up", I manually deleted
> "/tmp/ansible.WGeSW8_he_setup_lock" from host and deployment continued as
> desired, until it successfully finished.
> Then I disabled global maintenance via CLI and restore was finished.
> 
> Moving to verified.

I didn't check the docs - it might be worth a doc bug, if this flow isn't explained there well enough.
I tend to close the current bug as a duplicate of one of the above bugs.

Comment 4 Marina Kalinin 2020-11-02 16:39:08 UTC
It is very unclear to the user that this is required.
This is part of any migration process to 4.4, in addition to DR recovery.

I suggest we fix it in the software and adjust our documentation until then.
CEE is working on the documentation part + Upgrade helper.
Let's keep this bug to track this problem and see if we can address it in the software.

Comment 5 Nikolai Sednev 2020-11-02 19:32:31 UTC
1. I deployed rhvm-4.3.11.3-0.1.el7.noarch on 3 HA hosts over NFS, running these components:
ovirt-ansible-hosted-engine-setup-1.0.38-1.el7ev.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7ev.noarch
ovirt-ansible-repositories-1.1.6-1.el7ev.noarch
ansible-2.9.13-1.el7ae.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.13-2.el7ev.noarch
Linux 3.10.0-1160.7.1.el7.x86_64 #1 SMP Thu Oct 29 16:14:02 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.9 (Maipo)
2. I prepared the environment to use one additional non-management logical network on all 3 hosts.
3. I set the environment to global maintenance.
4. I ran a backup and copied its files to my laptop:
engine-backup --mode=backup --file=nsednev_from_alma07_rhevm_4_3 --log=Log_nsednev_from_alma07_rhevm_4_3
5. I reprovisioned alma07, the host that was running the engine and was also the SPM host, to RHEL 8.3.
6. I restored from the backup file with "[root@alma07 ~]# hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_rhevm_4_3", using the latest 4.4.3-12 build from 30.10.20, on a clean alma07, after copying the backup file from my laptop to alma07's /root. The host had these components:
ovirt-hosted-engine-setup-2.4.8-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
ovirt-ansible-collection-1.2.0-1.el8ev.noarch
ansible-2.9.14-1.el8ae.noarch
Linux 4.18.0-240.2.1.el8_3.x86_64 #1 SMP Tue Oct 27 08:54:58 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Red Hat Enterprise Linux release 8.3 (Ootpa)

7. During restore I answered "yes" to:
Pause the execution after adding this host to the engine?
          You will be able to iteratively connect to the restored engine in order to manually review and remediate its configuration before proceeding with the deployment:
          please ensure that all the datacenter hosts and storage domain are listed as up or in maintenance mode before proceeding.
          This is normally not required when restoring an up to date and coherent backup. (Yes, No)[No]: yes
8. Added the missing network to alma07 via:
[ INFO  ] You can now connect to https://alma07.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [ovirt.ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.fso1sz7n_he_setup_lock is removed, delete it once ready to proceed]
9. When finished, I deleted the lock file: "rm -rf /tmp/ansible.fso1sz7n_he_setup_lock".
10. The restore finished successfully: "[ INFO  ] Hosted Engine successfully deployed".

Comment 6 Yedidyah Bar David 2020-11-03 07:42:57 UTC
(In reply to Marina Kalinin from comment #4)
> It is very unclear to the user that this is required.
> This is part of any migration process to 4.4, in addition to DR recovery.
> 
> I suggest we fix it in the software and adjust our documentation until then.

How exactly?

Change the "pause" prompt to default to "Yes" (will require each user to
manually remove the lock file once ready, even if the host is up and no
intervention was required)?

Change the text?

Something more complex?

Comment 7 Marina Kalinin 2020-11-03 18:40:50 UTC
Michal had some ideas when we discussed this at the build meeting.

I suggested notifying the customer, after the backup is taken, that there are required networks and that they need to be aware of the issue at restore time. But otherwise I can't advise how exactly to fix it.

Comment 8 Steffen Froemer 2020-11-04 22:33:58 UTC
As Marina said, this issue:
 1) is not documented in the upgrade process, so every user performing an upgrade would fail on this step.
 2) should be changed into a process where the customer does not need to interact with the system until it is deployed completely.


It's not clear what requirements are placed on the host. If the host is required to have these networks configured, why can't we configure the host as required?
We should have all the information available. If additional settings or information are required, they could be defined through the host-deploy process before it is started. That said, it should be possible for the customer to set up the expected networks before the backup is restored.

This could be done either:
 - automatically, based on information written during backup
 - manually, by the customer, using an ansible/yaml file

Comment 9 Michal Skrivanek 2020-11-05 10:24:59 UTC
We can't fix it automatically, but we can detect the case and instruct users to fix it instead of aborting immediately.

Comment 10 Yedidyah Bar David 2020-11-05 10:46:49 UTC
(In reply to Steffen Froemer from comment #8)
> As Marina said, this issue isn't 
>  1) documented through update process, so every user performing upgrade
> would fail on this step. 
>  2) should be changed to a process the customer does not need to interact
> with the system until it's deployed completely.
> 
> 
> It's not clear, what requirements are set to the host. When the host does
> require to have set the networks configured, why we can't configure the host
> like required?

Because we can't know which NIC should be attached to which required network.
Even if we have this information in the backup, it can be different on the
host you restore on (due to the hardware being different, even slightly - such
as a NIC being in a different PCI slot).

> We should have all information available. If additional settings or
> information required, they could be defined through host-deploy process
> before the process is started. That said, it should be possible to setup the
> expected network from the customer before the backup is restored. 
> 
> This could be done either
>  - automatically, based on information written during backup
>  - manually using ansible/yaml file through customer

I do not think we want to go this route. The whole point of the migration
to node-zero hosted-engine deploy was to stop duplicating functionality
we already have in the engine. The engine has good means (UI/API) to let users
set up networks. Let's use that.

Users that want to automate this, and know what they are doing, can already do
this, using an enginevm_after_engine_setup hook, as in comment 0 - but please
note that the path is different now, since the move to ovirt-ansible-collection
(bug 1894875 is now filed for this).

So this bug is meant for interactive restores. For these, I think it's enough
that we allow the user to manually fix/configure/whatever, even if they didn't
realize they might want/need to, and didn't reply 'Yes' to "pause?". Right?

Michal and I now discussed this in private, and have some ideas about the
details, but first have to get an ack for above.

At minimum, I think it won't be that bad even to just do a one-line change,
which is to change the default answer to 'pause?' to 'Yes'.

Comment 11 Marina Kalinin 2020-11-05 21:51:22 UTC
(In reply to Yedidyah Bar David from comment #10)
> (In reply to Steffen Froemer from comment #8)
> > As Marina said, this issue isn't 
> >  1) documented through update process, so every user performing upgrade
> > would fail on this step. 
> >  2) should be changed to a process the customer does not need to interact
> > with the system until it's deployed completely.
> > 
> > 
> > It's not clear, what requirements are set to the host. When the host does
> > require to have set the networks configured, why we can't configure the host
> > like required?
> 
> Because we can't know which nic should be attached to which required network.
> Even if we have this information in the backup, it can be different on the
> host you restore on (due to hardware being different, even slightly (such as
> nic is in a different pci slot)).
> 
> > We should have all information available. If additional settings or
> > information required, they could be defined through host-deploy process
> > before the process is started. That said, it should be possible to setup the
> > expected network from the customer before the backup is restored. 
> > 
> > This could be done either
> >  - automatically, based on information written during backup
> >  - manually using ansible/yaml file through customer
> 
> I do not think we want to go this route. The whole point of the migration
> to node-zero hosted-engine-deploy, was to stop duplicating functionality
> we already have in the engine. The engine has good means (UI/API) to let
> users
> setup networks. Let's use that.
> 
> Users that want to automate this, and know what they are doing, can already
> do
> this, using a enginevm_after_engine_setup hook, as in comment 0 - but please
> note that the path is different now, since the move to
> ovirt-ansible-collection
> (now filed bug 1894875 for this).
> 
> So this bug is meant for interactive restores. For these, I think it's enough
> that we allow the user to manually fix/configure/whatever, even if they
> didn't
> realize they might want/need to, and didn't reply 'Yes' to "pause?". Right?
> 
> Michal and I now discussed this in private, and have some ideas about the
> details, but first have to get an ack for above.
> 
> At minimum, I think it won't be that bad even to just do a one-line change,
> which is to change the default answer to 'pause?' to 'Yes'.
Sounds like a good idea. Though I would also like to get my idea from above implemented, if possible: give the customer an earlier notification, at the end of taking the backup, that this step will be needed. This will enable the customer to be more proactive and know what to expect during the restore.

Comment 12 Marina Kalinin 2020-11-05 21:52:55 UTC
Didi,

We need proper, correct steps for the customer today.
Can you please work with Steffen to get them properly documented?
Once we have that done, we can update the documentation and the upgrade helper and give you more time to address the issue in the code.

Thank you!

Comment 13 Yedidyah Bar David 2020-11-08 07:30:17 UTC
(In reply to Marina Kalinin from comment #12)
> Didi,
> 
> We need proper correct steps for hte customer today.

The steps should be:

If you have any reason whatsoever to think that deployment might
fail, and perhaps even if you do not:

1. When asked whether to pause the execution after adding the host
to the engine, reply 'Yes'.

2. At some point during deployment, you'll see something like:

[ INFO  ] You can now connect to https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO  ] TASK [ovirt.hosted_engine_setup : include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO  ] changed: [localhost -> localhost]
[ INFO  ] TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to proceed]

At that point, it will pause and wait.

3. Login to the engine's web admin ui.

4. Verify that the host is in the Up state. If it's not, manually do
whatever is needed to bring it up.

5. In particular, if you had required networks before backup,
make sure to assign host NICs to the required networks as applicable.

6. Once ready, remove the file indicated in step (2) to let the
process continue.

> Can you please work with Steffen to get them properly documented?

I think the above is already part of the documentation, perhaps in less
suggestive language... If not, please arrange that it becomes so.

> Once we have that done, we can update the documentation and the upgrade
> helper and give you more time to address the issue in the code.

There is no problem with the code :-), IMO.

The only code changes I personally considered and discussed are
only about the above interaction - to make the text clearer,
perhaps change the default answer to 'Yes', perhaps prompt
again on failure even if user replied 'No', etc.

> 
> Thank you!

:-)

Comment 14 Yedidyah Bar David 2020-11-10 11:09:28 UTC
Steffen/Marina - can you please ack comment 10?

If so:

Is it enough/ok to just change the default answer to 'pause?' to Yes? Perhaps with some text changes for the prompt there (suggestions are welcome)?

Thanks!

Comment 15 Steffen Froemer 2020-11-11 07:24:44 UTC
Didi, I'm not very happy with this solution, to simply change the default answer of "pausing" the deployment process and wait for customer interaction.

As already mentioned, we would only know on which NIC the networks need to be assigned when the host is the same as before it could fail on new hardware.
Would it be a more convenient way, to give the customer 2 options, when we detect the host does have required networks set?

- Provide setup-host-network.yml to apply the correct network settings during setup-routine (similar to the hook from [1] but with correct setup)
- Ask customer to change networks to non-required before the deployment is started. 

Would that be possible?

One additional question, what firewall-requirements are needed to get access to the hosted-engine during the deployment process?
Where does the engine listen to?

Thanks, Steffen


[1]: https://access.redhat.com/solutions/4088711

Comment 16 Yedidyah Bar David 2020-11-11 11:26:21 UTC
(In reply to Steffen Froemer from comment #15)
> Didi, I'm not very happy with this solution, to simply change the default
> answer of "pausing" the deployment process and wait for customer interaction.
> 
> As already mentioned, we would only know on which NIC the networks need to
> be assigned when the host is the same as before it could fail on new
> hardware.

(I assume you meant this as two separate statements - "... as before. It ...".
Otherwise, I fail to parse).

> Would it be a more convenient way, to give the customer 2 options, when we
> detect the host does have required networks set?

Doing this "when we detect", currently, is rather late. I expect you meant
to add new code that tests this beforehand.

> 
> - Provide setup-host-network.yml to apply the correct network settings
> during setup-routine (similar to the hook from [1] but with correct setup)

If users know what to provide, they can already do that beforehand.
We assume that your problem is mainly for the case they do not know (either
because they lack experience/skill or simply because it's different hardware
and they didn't yet test).

> - Ask customer to change networks to non-required before the deployment is
> started. 

Do you really think that's an option?

I admit I do not have much experience with this side of the project.
I thought changing networks is cluster-wide and is not something you'd
like to do casually, without preparation/planning/etc.

> 
> Would that be possible?

We'd rather not add code to check the backup's content to see if it includes
required networks, etc. I realize that without this, the failure would happen
much later, which can be inconvenient.

However, we *think* that it's not so bad, if, when it does fail, this does
not completely fail deploy/restore, but allows the user to manually fix and
then continue.

If you decide that this is completely unacceptable, then we'll have to "bite
the bullet" and add this complexity.

> 
> One additional question, what firewall-requirements are needed to get access
> to the hosted-engine during the deployment process?
> Where does the engine listen to?

I didn't try this myself, but see previous comments - the tool automatically
temporarily opens up access to the engine web ui, via an ssh tunnel - you
connect to the host on port 6900 and should reach the engine.

I guess this requires more documentation, btw - searching for relevant
keywords and for "6900" only finds me the above-mentioned bugs.

Thanks!

Comment 17 Steffen Froemer 2020-11-11 13:30:06 UTC
(In reply to Yedidyah Bar David from comment #16)
> 
> (I assume you meant this as two separate statments - "... as before. It ...".
> Otherwise, I fail to parse).
Ignore this part. It does not contain a useful amount of information :)

> 
> > Would it be a more convenient way, to give the customer 2 options, when we
> > detect the host does have required networks set?
> 
> Doing this "when we detect", currently, is rather late. I expect you meant
> to add new code that tests this beforehand.
Maybe. Details in the next two points.

> 
> > 
> > - Provide setup-host-network.yml to apply the correct network settings
> > during setup-routine (similar to the hook from [1] but with correct setup)
> 
> If users know what to provide, they can already do that beforehand.
> We assume that your problem is mainly for the case they do not know (either
> because they lack experience/skill or simply because it's different hardware
> and they didn't yet test).
The biggest problem here is the missing documentation. The current workaround I know of is to mark the networks as non-required for the deployment host using the hook mechanism.
This is not the solution I would aim for.
As an experienced user, it would be possible to create a yaml file to add via the hook mechanism, which would set up my host network as required. This should include:
 - host IP
 - host bond/vlan settings
 - rhv network / host network->nic assignment



> 
> > - Ask customer to change networks to non-required before the deployment is
> > started. 
> 
> Do you really think that's an option?
> 
> I admit I do not have much experience with this side of the project.
> I thought changing networks is cluster-wide and is not something you'd
> like to do casually, without preparation/planning/etc.
As the current workaround does exactly this, it might be an easier step for the customer.
The changes can be applied again afterwards as well. But I would prefer the option above.


> 
> > 
> > Would that be possible?
> 
> We'd rather not add code to check the backup's content to see if it includes
> required networks etc. . I realize that without this, the failure would
> happen
> much later, which can be inconvenient.
> 
> However, we *think* that it's not so bad, if, when it does fail, this does
> not completely fail deploy/restore, but allows the user to manually fix and
> then continue.
Technically this would be true, as nothing will break. But if our expectation is that the upgrade experience fails the first time in order to give the customer advice on how to act for the current upgrade process, that is the worst user experience we can provide. From the perspective of a customer buying an enterprise product, I would definitely not expect such behavior!


> 
> If you decide that this is completely unacceptable, then we'll have to "bite
> the bullet" and do add this complexity.
See above


> > One additional question, what firewall-requirements are needed to get access
> > to the hosted-engine during the deployment process?
> > Where does the engine listen to?
> 
> I didn't try this myself, but see previous comments - the tool automatically
> temporarily opens up access to the engine web ui, via an ssh tunnel - you
> connect to the host on port 6900 and should reach the engine.
> 
> I guess this requires more documentation, btw - searching for relevant
> keywords and with "6900" only finds me above-mentioned bugs.
> 
That's another problem. Most customers I'm aware of do not have direct access to the hosts, only via SSH (port 22).
Their network is secured by firewalls in between, and the admin is not able to connect to ports other than 80/443 of the manager (fixed IP).
This means that even if the installation routine pauses, it will not help such customers fix the issue.

We need other options here.

Comment 18 Marina Kalinin 2020-11-12 02:50:44 UTC
1. If I understand correctly, we have the hook to make the networks not required for the cases where the customer does not have access to the hypervisor and cannot open the RHVM UI when it is paused, right?
Do we have the same hook for 4.4? If not, we must provide it.
1.1. BTW, does the hook really make the networks non-required? If so, we also need to explain this to the user, and the user may want to correct this after the environment is up.

2. I went through the upgrade guide and I do not see where we mention that this step is needed. So we would need a documentation bug to add this.
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/upgrade_guide/index
Please correct me if I am wrong and it is there.

3. Code fix suggestions:
3.1. During engine-backup, check if there are required networks besides ovirtmgmt and let the user know at the end of engine-backup that this step will be needed and which networks they need to keep in mind.
3.2. Clarify the text during hosted engine deployment to make it more obvious what is expected from the user.


Didi, 
Would you like to have new bugs for items 1 and 3, or can we cover them here?

Comment 19 Yedidyah Bar David 2020-11-12 09:46:59 UTC
(In reply to Marina Kalinin from comment #18)
> 1. If I understand correctly, we have the hook to make the networks not
> required for the cases that customer does not have access to the hypervisor
> and cannot open the RHVM UI when it is paused, right?
> Do we have the same hook for 4.4? If not, we must provide it.

Not sure what you mean here. If it's "do we have the hook _mechanism_", then
yes, we have it. If it's whether we have the actual content, then - assuming
the example provided in "Additional info" works, tested and verified - we
have it. But it's just an example - users willing to use it will need
at minimum to amend the list of networks there (under "with_items:").

> 1.1. BTW, does the hook really make the networks non-required? Then we also
> need to explain this to the user and the user may want to correct this after
> the environment is up.

That's my reading. I didn't test it.

> 
> 2. I went through the upgrade guide and I do not see where we mention that
> this step is needed. So we would need a documentation bug to add this.
> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/
> html-single/upgrade_guide/index
> Please correct me if I am wrong and it is there.

I didn't read all of it. The "deploy with --restore-from-file" part, step 8
of [1], does not provide any details. It probably implies that this process
is very similar to a clean install - the well-detailed [2] - which is
mostly true, with a few notable exceptions - one of which is the "pause?"
prompt (which only appears with --restore-from-file). So from a doc POV, it's
probably enough to add some content to [1], perhaps as another paragraph
or a "Note".

[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/upgrade_guide/index#Upgrading_the_Manager_to_4-4_4-3_SHE

[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/installing_red_hat_virtualization_as_a_self-hosted_engine_using_the_command_line/index#Deploying_the_Self-Hosted_Engine_Using_the_CLI_install_RHVM

> 
> 3. Code fix suggestions:
> 3.1. During engine-backup check if there are required networks besides
> ovirtmgmt and let user know by the end of engine-backup that this step would
> be needed and those are the networks they need to keep in mind.

This is very easy to do - just supply the text you want and we can add it.

That said, IMO it's useless. People do not read such outputs unless the
program hangs telling them "Press Enter to continue", and then they also
usually press Enter without reading. Often, this is run by cron or whatever.
Last but not least - this requires a change in 4.3, so you should file the
bug there - and I am not sure it's considered worth adding there.

> 3.2. Clarify the text during hosted engine deployment to make it more
> obvious what is expected from the user.

Only the text? Also change the default of "pause?" to Yes? Something more
complex? Please coordinate with Steffen (and PM if needed) and get to a
conclusion. Thanks!

> 
> 
> Didi, 
> Would you like to have new bugs for items 1 and 3 or we can cover them here?

I think (3.2) can stay as the current bug, with another bug for (1).

If you also want (3.1), please open a bug on engine-backup on 4.3. Thanks.

Comment 20 Yedidyah Bar David 2020-11-16 10:36:32 UTC
(In reply to Steffen Froemer from comment #17)
> If I'm an experienced user, I would be possible to create a yaml-file to add
> to the hooks-mechanism, which will setup my host-network like required. This
> should include
>  - host IP
>  - host bond/vlan settings
>  - rhv network / host network->nic assignment

I see your point.

Can you please link/attach logs for an example failed deploy/restore?

I guess it failed during "Wait for the host to be up" (but probably never had the host up). Not sure.

I am not sure our current ansible modules allow passing the above information directly when adding a host.
So if that's what you want, we basically have two options:

1. First patch the ansible module to allow that, then patch hosted-engine to allow hooking into this (during adding the host).

2. Use the current workaround (set networks as non-required), and allow another hook, between "Add host" and "Wait for the host to be up". There, you'll be able to supply your own hook, which will assign the needed networks (using ovirt_host_networks) and try to activate the host again.

We can do both, of course - and let the user pick one.
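
For illustration, a rough sketch of what such an after_add_host hook could look like, following the pattern of the enginevm_after_engine_setup hook in the description. "myhost", "eth0" and "net1" are placeholders; the host-network module is named ovirt_host_network in the ovirt.ovirt collection, and whether 'state: present' re-activates the host is an assumption to verify against the ovirt_host module docs. The required_networks_fix.yml example that was eventually added to the role is the authoritative version:

~~~
---
# Hypothetical hooks/after_add_host/ task file - a sketch, not the shipped example.
- include_tasks: auth_sso.yml          # obtain ovirt_auth, as in the other hooks

- name: Attach the required logical network to a host NIC
  ovirt_host_network:
    auth: "{{ ovirt_auth }}"
    name: myhost                       # host name as registered in the engine (placeholder)
    interface: eth0                    # NIC that should carry the network (placeholder)
    networks:
      - name: net1                     # the required logical network (placeholder)

- name: Activate the host again
  ovirt_host:
    auth: "{{ ovirt_auth }}"
    name: myhost
    state: present                     # assumption: re-activates a non-operational host
~~~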

> Technically this would be true, as nothing will break. But if our exception
> will be to have the upgrade experience fail the first time, to give customer
> advices how to act for the current upgrade process, this is the worst user
> experience we can provide. From a customer perspective buying an enterprise
> product, I would definitely not expecting such behavior!

It will never seem like a failure - just another part of the interaction.
I also think (didn't test - see above) that it's currently the only way,
whether you do this manually or automatically (using a hook we'll add or
adding a specific feature in the code).

If all goes well, the only place a user will see "failure" is in the logs.
Is that so bad?

> That's another problem. Most customers I'm aware of, does not have access to
> the hosts directly and only to SSH (port 22). 
> Their network is secured by different firewalls in between and the admin is
> not able to connect different ports than 80/443 of the manager (fixed IP)
> This means even if the installation routine will pause, it will not help any
> customer to fix the issue.
> 
> We need other options here.

I think you can do whatever you want using an ssh tunnel. Something like:

ssh -L443:localhost:6900 root@host (perhaps with sudo)

and add a (temporary) suitable entry to your local /etc/hosts.

I didn't try that.

Comment 21 Steve Goodman 2020-11-17 05:53:59 UTC
Based on comment 5, this is what I understand we need to add to the Upgrade Guide for SHE:

7. During restore answer "yes" to:
   Pause the execution after adding this host to the engine?

8. Add missing network to alma07:
  a. HOW DO I DO THIS STEP?: Connect to https://alma07.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it,   
  b. Continue when the host is listed as 'up'
9. Delete the lock file:
   "rm -rf /tmp/ansible.fso1sz7n_he_setup_lock".


THIS IS WHAT'S NOT CLEAR:
How do I add the missing network to alma07?
 - How do I connect to it while in the middle of deployment? Do I need to open another terminal? Or connect from a different machine? Or what?
 - Once I've connected, how do I remediate the status of the host?

Comment 22 Yedidyah Bar David 2020-11-17 07:22:16 UTC
(In reply to Steve Goodman from comment #21)
> Based on comment 5, This is what I understand we need to add to the Upgrade
> Guide for SHE:
> 
> 7. During restore answer "yes" to:
>    Pause the execution after adding this host to the engine?
> 
> 8. Add missing network to alma07:
>   a. HOW DO I DO THIS STEP?: Connect to
> https://alma07.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status
> of this host and eventually remediate it,   
>   b. Continue when the host is listed as 'up'
> 9. Delete the lock file:
>    "rm -rf /tmp/ansible.fso1sz7n_he_setup_lock".
> 
> 
> THIS IS WHAT'S NOT CLEAR:
> How do I add the missing network to alma07?
>  - How do I connect to it while in the middle of deployment? Do I need to
> open another terminal? Or connect from a different machine? Or what?

Just use a browser with this url. It should work. It's temporary - will stop working when deployment finishes.

>  - Once I've connected, how do I remediate the status of the host?

That depends on why it's not up. If it's only due to the issue in the current bug, you should go to "Network Interfaces" -> "Setup Host Networks", and assign the relevant interfaces to the required networks.

Please create a doc bug for this. It seems like we'll also do some code changes for the current bug.

Comment 23 Nikolai Sednev 2020-11-17 08:29:48 UTC
Removing needinfo per comment #22.

Comment 24 Yedidyah Bar David 2020-11-24 20:18:42 UTC
Using the current patches in the linked PR, verified both ways. Deployed 4.3 hosted-engine, added dummy nic and network (required), ran backup. Then restored in 4.4.

1. Manually - it paused (even though I replied 'No' to "Pause?" - there is a separate var for this, defaulting to true, which we do not ask about). I created a dummy nic, connected to the web admin UI at :6900, attached the nic to the network, activated the host and removed the lock file.

2. Automatically - I edited the example hook, copied it to the right place as instructed in it, used an answer file generated by a previous run, and the deploy ran fully automatically, without a prompt, until asking about storage.

Sounds reasonable?

Comment 25 Steffen Froemer 2020-11-26 00:09:05 UTC
(In reply to Yedidyah Bar David from comment #24)
> Using the current patches in the linked PR, verified both ways. Deployed 4.3
> hosted-engine, added dummy nic and network (required), ran backup. Then
> restored in 4.4.
> 
> 1. Manually - it paused (even though I replied 'No' to "Pause?". There is a
> separate var for this, default true, we do not ask about it), I created a
> dummy nic, connected to web admin ui at :6900, attached the nic to the
> network, activated the host and removed the lock file.
> 
> 2. Automatically - Edited the example hook, copied to the right place as
> instructed in it, used an answer file generated by a previous run, and
> deploy ran fully automatically without a prompt until asking about storage.
> 
> Sounds reasonable?

Yes. Now it only needs to be documented well.
Also, I just created a small playbook which allows customers to easily create the hook automatically, based on the old platform.

https://github.com/knumskull/ovirt-ansible

What do you think?

Comment 26 Yedidyah Bar David 2020-11-26 06:22:17 UTC
(In reply to Steffen Froemer from comment #25)
> Yes. now it only need to be documented well.

Definitely. But I hope it will be clear and intuitive enough even for those that do not read documentation.

> Also I just created a small playbook, which allows customers to easily
> create the hook automatically, based on the old platform.
> 
> https://github.com/knumskull/ovirt-ansible
> 
> what do  you think?

Looks good to me.

I understand it's designed to be run against the live old engine, right?
So it's useful for upgrades, but not for "real" restores.

Comment 27 Yedidyah Bar David 2020-11-26 15:19:38 UTC
QE:

To reproduce/verify:

1. Deploy 4.3 hosted-engine
2. Change the "Default" Cluster to have more than one network, and have the other network also "required".
If you do not have a host with more than one nic, you can create a dummy one with e.g. 'ip link add dummy_1 type dummy'.
3. Take a backup
4. Try to upgrade to 4.4 using this backup, twice:
4.1. As-is, just following the docs. Accept the default 'No' to 'pause?'. Instead of failing (which is what happens in previous versions), it should:
- Tell you that adding the host failed
- Provide some details. Specifically, the output should include something about the required networks. If it does not, that's a bug, please report it and attach logs.
- Include a link to the web admin UI. The URL will point to the host, not the engine, on port 6900. This is temporary, and works only during the deployment.
- Output a lock file you should remove once finished.
So: Connect to the web admin, fix the issue (you might need to add a dummy nic as above), activate the host, then remove the lock file. It should continue successfully.
4.2 Alternatively, supply a hook, see [1][2] for details:
- Copy the file from /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/examples/required_networks_fix.yml to /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/hooks/after_add_host/
- Edit it replacing "myhost", "eth0" and "net1" as applicable
- Then try to upgrade/restore. It should succeed without asking anything until the storage prompt.

If you intend to reuse the same host for testing more than one flow without a complete reinstall, please note that 'ovirt-hosted-engine-cleanup' does not completely clean network data, so subsequent attempts might be affected. I pushed a patch for this [3], but didn't yet open a bug. So either also run, after cleanup, 'vdsm-tool clear-nets --exclude-net ovirtmgmt', or reinstall the OS.

Doc: The only doc for now is [1][2]. We probably want a doc bug as well.

[1] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/README.md#make-changes-in-the-engine-vm-during-the-deployment
[2] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/examples/required_networks_fix.yml
[3] https://gerrit.ovirt.org/112336

Comment 29 Yedidyah Bar David 2020-12-06 15:05:48 UTC
We have a related doc bug 1695523.

Comment 30 Nikolai Sednev 2020-12-14 09:47:10 UTC
ovirt-ansible-collection-1.2.3

Comment 34 errata-xmlrpc 2021-02-02 13:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312

