Description of problem:
During upgrade from a RHV-4.3 hosted engine, the deployment of the new HE will fail if any non-management logical network is marked as required in the backup file.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.4.6-2.el8ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Have a backup file with required non-management logical networks.
2. hosted-engine --deploy --restore-from-file=backup/file_name

Actual results:
It fails with the following deployment error logs:
~~~
2020-10-28 23:16:55,614+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
2020-10-28 23:18:49,098+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
2020-10-28 23:18:51,005+0000 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed executing ansible-playbook
2020-10-28 23:19:16,527+0000 ERROR otopi.plugins.gr_he_common.core.misc misc._terminate:167 Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
~~~

Expected results:
It should work.

Additional info:
The workaround from KCS #4088711 works well:

# cat > /usr/share/ansible/roles/ovirt.hosted-engine-setup/hooks/enginevm_after_engine_setup/network_fix.yml << EOF
- include_tasks: auth_sso.yml
- name: Wait for the engine to reach a stable condition
  wait_for: timeout=300
- name: fix network
  ovirt_network:
    auth: "{{ ovirt_auth }}"
    name: "{{ item }}"
    data_center: Default
    clusters:
      - name: Default
        required: False
  with_items:
    - backend
    - frontend
    - public-zone
    - security-zone
    - storage
EOF
It's related or same as https://bugzilla.redhat.com/show_bug.cgi?id=1686575
(In reply to Steffen Froemer from comment #0)
> Steps to Reproduce:
> 1. Have a backup_file required non-management logical networks.
> 2. hosted-engine --deploy --restore-from-file=backup/file_name
>
> Expected results:
> it should work

(In reply to Steffen Froemer from comment #1)
> It's related or same as https://bugzilla.redhat.com/show_bug.cgi?id=1686575

Indeed. Did you check this bug? The "solution" isn't automatic, as in "it should work".

Copying also bug 1712667 comment 13:

> During restore I had to answer "YES" to "Pause the execution after adding
> this host to the engine?
> You will be able to iteratively connect to the restored engine in
> order to manually review and remediate its configuration before proceeding
> with the deployment:
> please ensure that all the datacenter hosts and storage domain are
> listed as up or in maintenance mode before proceeding.
> This is normally not required when restoring an up to date and
> coherent backup. (Yes, No)[No]: yes"
>
> Then at some point I was asked to make changes to running engine environment:
> "[ INFO ] You can now connect to
> https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status
> of this host and eventually remediate it, please continue only when the host
> is listed as 'up'
> [ INFO ] TASK [ovirt.hosted_engine_setup : include_tasks]
> [ INFO ] ok: [localhost]
> [ INFO ] TASK [ovirt.hosted_engine_setup : Create temporary lock file]
> [ INFO ] changed: [localhost -> localhost]
> [ INFO ] TASK [ovirt.hosted_engine_setup : Pause execution until
> /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to
> proceed]"
>
> When I was done and the host got listed "up", I manually deleted
> "/tmp/ansible.WGeSW8_he_setup_lock" from host and deployment continued as
> desired, until it successfully finished.
> Then I disabled global maintenance via CLI and restore was finished.
>
> Moving to verified.
Didn't check the docs - it might be worth a doc bug, if this flow isn't explained there well enough. I tend to close the current bug as a duplicate of one of the above bugs.
It is very unclear to the user that this is required. This is part of any migration process to 4.4, in addition to DR recovery.

I suggest we fix it in the software, and adjust our documentation until then. CEE is working on the documentation part + the Upgrade helper. Let's keep this bug to track this problem and see if we can address it in the software.
1. I deployed rhvm-4.3.11.3-0.1.el7.noarch on 3 HA hosts over NFS, running on these components:
ovirt-ansible-hosted-engine-setup-1.0.38-1.el7ev.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7ev.noarch
ovirt-ansible-repositories-1.1.6-1.el7ev.noarch
ansible-2.9.13-1.el7ae.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.13-2.el7ev.noarch
Linux 3.10.0-1160.7.1.el7.x86_64 #1 SMP Thu Oct 29 16:14:02 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.9 (Maipo)

2. I prepared the environment to use one additional non-management logical network on all 3 hosts.

3. I set the environment to global maintenance.

4. I ran a backup and copied its files to my laptop:
engine-backup --mode=backup --file=nsednev_from_alma07_rhevm_4_3 --log=Log_nsednev_from_alma07_rhevm_4_3

5. I reprovisioned alma07 (the host that was running the engine; it was also the SPM host) to RHEL8.3.

6. I restored from the backup file, e.g. "[root@alma07 ~]# hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_rhevm_4_3", while using the latest 4.4.3-12 build from 30.10.20, on a clean alma07, after copying the backup file from my laptop to alma07's /root. The host had these components:
ovirt-hosted-engine-setup-2.4.8-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
ovirt-ansible-collection-1.2.0-1.el8ev.noarch
ansible-2.9.14-1.el8ae.noarch
Linux 4.18.0-240.2.1.el8_3.x86_64 #1 SMP Tue Oct 27 08:54:58 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)

7. During restore I answered "yes" to:
Pause the execution after adding this host to the engine?
You will be able to iteratively connect to the restored engine in order to manually review and remediate its configuration before proceeding with the deployment:
please ensure that all the datacenter hosts and storage domain are listed as up or in maintenance mode before proceeding.
This is normally not required when restoring an up to date and coherent backup.
(Yes, No)[No]: yes

8. Added the missing network to alma07 via:
[ INFO ] You can now connect to https://alma07.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.fso1sz7n_he_setup_lock is removed, delete it once ready to proceed]

9. When finished, I deleted the lock: "rm -rf /tmp/ansible.fso1sz7n_he_setup_lock".

10. Restore successfully finished: "[ INFO ] Hosted Engine successfully deployed".
(In reply to Marina Kalinin from comment #4)
> It is very unclear to the user that this is required.
> This is part of any migration process to 4.4, in addition to DR recovery.
>
> I suggest we fix it in the software and adjust our documentation until then.

How exactly? Change the "pause" prompt to default to "Yes" (which will require each user to manually remove the lock file once ready, even if the host is up and no intervention was required)? Change the text? Something more complex?
Michal had some ideas when we discussed this at the build meeting. I suggested notifying the customer, after the backup is taken, that there are required networks and that they need to be aware of the issue at restore time. But otherwise - I can't advise how exactly to fix it.
As Marina said, this issue:
1) isn't documented in the upgrade process, so every user performing an upgrade will fail at this step;
2) should be changed into a process where the customer does not need to interact with the system until it's deployed completely.

It's not clear what requirements are set on the host. When the host is required to have the networks configured, why can't we configure the host as required? We should have all the information available. If additional settings or information are required, they could be defined through the host-deploy process before the process is started. That said, it should be possible for the customer to set up the expected network before the backup is restored.

This could be done either:
- automatically, based on information written during backup
- manually, using an ansible/yaml file provided by the customer
We can't fix it automatically, but we can detect the case and instruct users to fix it, instead of aborting immediately.
(In reply to Steffen Froemer from comment #8)
> As Marina said, this issue isn't
> 1) documented through update process, so every user performing upgrade
> would fail on this step.
> 2) should be changed to a process the customer does not need to interact
> with the system until it's deployed completely.
>
> It's not clear, what requirements are set to the host. When the host does
> require to have set the networks configured, why we can't configure the host
> like required?

Because we can't know which nic should be attached to which required network. Even if we have this information in the backup, it can be different on the host you restore on (due to the hardware being different, even slightly - such as a nic being in a different pci slot).

> We should have all information available. If additional settings or
> information required, they could be defined through host-deploy process
> before the process is started. That said, it should be possible to setup the
> expected network from the customer before the backup is restored.
>
> This could be done either
> - automatically, based on information written during backup
> - manually using ansible/yaml file through customer

I do not think we want to go this route. The whole point of the migration to node-zero hosted-engine-deploy was to stop duplicating functionality we already have in the engine. The engine has good means (UI/API) to let users set up networks. Let's use that.

Users that want to automate this, and know what they are doing, can already do this, using an enginevm_after_engine_setup hook, as in comment 0 - but please note that the path is different now, since the move to ovirt-ansible-collection (now filed bug 1894875 for this).

So this bug is meant for interactive restores. For these, I think it's enough that we allow the user to manually fix/configure/whatever, even if they didn't realize they might want/need to, and didn't reply 'Yes' to "pause?". Right?
Michal and I now discussed this in private, and have some ideas about the details, but first have to get an ack for the above.

At minimum, I think it won't be that bad even to just do a one-line change, which is to change the default answer to 'pause?' to 'Yes'.
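As a side note on the path change mentioned above: with ovirt-ansible-collection, the role ships inside the ovirt.ovirt collection, so a hook like the one in comment 0 would go under the collection's role directory rather than the old ovirt.hosted-engine-setup role path. The path below is an assumption based on the standard ansible collection layout - verify it on the installed host before relying on it:

```shell
# Assumed new hook location after the move to ovirt-ansible-collection
# (standard collection layout; this path is a guess - verify locally):
hookdir=/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/hooks/enginevm_after_engine_setup

# Confirm the directory exists, then drop the hook from comment 0 there:
ls "$hookdir"
cp network_fix.yml "$hookdir"/
```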
(In reply to Yedidyah Bar David from comment #10)
> (In reply to Steffen Froemer from comment #8)
> > As Marina said, this issue isn't
> > 1) documented through update process, so every user performing upgrade
> > would fail on this step.
> > 2) should be changed to a process the customer does not need to interact
> > with the system until it's deployed completely.
> >
> > It's not clear, what requirements are set to the host. When the host does
> > require to have set the networks configured, why we can't configure the host
> > like required?
>
> Because we can't know which nic should be attached to which required network.
> Even if we have this information in the backup, it can be different on the
> host you restore on (due to hardware being different, even slightly (such as
> nic is in a different pci slot)).
>
> > We should have all information available. If additional settings or
> > information required, they could be defined through host-deploy process
> > before the process is started. That said, it should be possible to setup the
> > expected network from the customer before the backup is restored.
> >
> > This could be done either
> > - automatically, based on information written during backup
> > - manually using ansible/yaml file through customer
>
> I do not think we want to go this route. The whole point of the migration
> to node-zero hosted-engine-deploy, was to stop duplicating functionality
> we already have in the engine. The engine has good means (UI/API) to let
> users setup networks. Let's use that.
>
> Users that want to automate this, and know what they are doing, can already
> do this, using a enginevm_after_engine_setup hook, as in comment 0 - but
> please note that the path is different now, since the move to
> ovirt-ansible-collection (now filed bug 1894875 for this).
>
> So this bug is meant for interactive restores.
> For these, I think it's enough
> that we allow the user to manually fix/configure/whatever, even if they
> didn't realize they might want/need to, and didn't reply 'Yes' to "pause?".
> Right?
>
> Michal and I now discussed this in private, and have some ideas about the
> details, but first have to get an ack for above.
>
> At minimum, I think it won't be that bad even to just do a one-line change,
> which is to change the default answer to 'pause?' to 'Yes'.

Sounds like a good idea. Though I would also like to get my idea from above implemented, if possible: give the customer an earlier notification, already at the end of taking the backup, that this step will be needed. This will enable the customer to be more proactive and know what to expect during the restore.
Didi,

We need proper, correct steps for the customer today. Can you please work with Steffen to get them properly documented? Once we have that done, we can update the documentation and the upgrade helper and give you more time to address the issue in the code.

Thank you!
(In reply to Marina Kalinin from comment #12)
> We need proper, correct steps for the customer today.

The steps should be: if you have any reason whatsoever to think that deployment might fail, and perhaps even if you do not:

1. When asked whether to pause the execution after adding the host to the engine, reply 'Yes'.

2. At some point during deployment, you'll see something like:
[ INFO ] You can now connect to https://alma03.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it, please continue only when the host is listed as 'up'
[ INFO ] TASK [ovirt.hosted_engine_setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Create temporary lock file]
[ INFO ] changed: [localhost -> localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Pause execution until /tmp/ansible.WGeSW8_he_setup_lock is removed, delete it once ready to proceed]
At that point, it will pause and wait.

3. Log in to the engine's web admin UI.

4. Verify that the host is in the Up state. If it's not, manually do whatever is needed to bring it up.

5. In particular, if you had required networks before the backup, make sure to assign host NICs to the required networks as applicable.

6. Once ready, remove the file indicated in step 2 to let the process continue.

> Can you please work with Steffen to get them properly documented?

I think the above is already part of the documentation, perhaps in less suggestive language... If not, please arrange that it becomes so.

> Once we have that done, we can update the documentation and the upgrade
> helper and give you more time to address the issue in the code.

There is no problem with the code :-), IMO. The only code changes I personally considered and discussed are about the above interaction - making the text clearer, perhaps changing the default answer to 'Yes', perhaps prompting again on failure even if the user replied 'No', etc.

> Thank you!

:-)
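The pause mechanism in steps 2 and 6 is just a lock file that the deployment polls until it disappears. The sketch below imitates that behavior with an illustrative filename (the real name, like /tmp/ansible.WGeSW8_he_setup_lock, is random and printed in the deployment output):

```shell
# Illustrative sketch of the pause/continue mechanism only - the
# lock file name here is made up; the real one is printed by deploy.
lock=/tmp/he_setup_lock.example
touch "$lock"

# Stand-in for the admin removing the lock once the host is Up:
( sleep 1; rm -f "$lock" ) &

# The deployment loops until the lock file disappears, then continues:
while [ -e "$lock" ]; do
    sleep 0.2
done
echo "lock removed - deployment continues"
```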
Steffen/Marina - can you please ack comment 10? If so: Is it enough/ok to just change the default answer to 'pause?' to Yes? Perhaps with some text changes for the prompt there (suggestions are welcome)? Thanks!
Didi, I'm not very happy with this solution of simply changing the default answer for "pausing" the deployment process and waiting for customer interaction.

As already mentioned, we would only know on which NIC the networks need to be assigned when the host is the same as before; it could fail on new hardware.

Would it be more convenient to give the customer 2 options when we detect that the host has required networks set?

- Provide a setup-host-network.yml to apply the correct network settings during the setup routine (similar to the hook from [1], but with the correct setup)
- Ask the customer to change networks to non-required before the deployment is started.

Would that be possible?

One additional question: what firewall requirements are needed to get access to the hosted engine during the deployment process? Where does the engine listen?

Thanks, Steffen

[1]: https://access.redhat.com/solutions/4088711
(In reply to Steffen Froemer from comment #15)
> Didi, I'm not very happy with this solution, to simply change the default
> answer of "pausing" the deployment process and wait for customer interaction.
>
> As already mentioned, we would only know on which NIC the networks need to
> be assigned when the host is the same as before it could fail on new
> hardware.

(I assume you meant this as two separate statements - "... as before. It ...". Otherwise, I fail to parse.)

> Would it be a more convenient way, to give the customer 2 options, when we
> detect the host does have required networks set?

Doing this "when we detect", currently, is rather late. I expect you meant to add new code that tests this beforehand.

> - Provide setup-host-network.yml to apply the correct network settings
> during setup-routine (similar to the hook from [1] but with correct setup)

If users know what to provide, they can already do that beforehand. We assume that your problem is mainly for the case where they do not know (either because they lack experience/skill, or simply because it's different hardware and they didn't yet test).

> - Ask customer to change networks to non-required before the deployment is
> started.

Do you really think that's an option? I admit I do not have much experience with this side of the project. I thought changing networks is cluster-wide and is not something you'd like to do casually, without preparation/planning/etc.

> Would that be possible?

We'd rather not add code to check the backup's content to see if it includes required networks etc. I realize that without this, the failure will happen much later, which can be inconvenient.

However, we *think* that it's not so bad if, when it does fail, this does not completely fail the deploy/restore, but allows the user to manually fix and then continue.

If you decide that this is completely unacceptable, then we'll have to "bite the bullet" and add this complexity.
> One additional question, what firewall-requirements are needed to get access
> to the hosted-engine during the deployment process?
> Where does the engine listen to?

I didn't try this myself, but see previous comments - the tool automatically and temporarily opens up access to the engine web UI via an ssh tunnel: you connect to the host on port 6900 and should reach the engine.

I guess this requires more documentation, btw - searching for relevant keywords and for "6900" only finds me the above-mentioned bugs.

Thanks!
(In reply to Yedidyah Bar David from comment #16)
> (I assume you meant this as two separate statments - "... as before. It ...".
> Otherwise, I fail to parse).

Ignore this part. It does not contain a useful amount of information :)

> > Would it be a more convenient way, to give the customer 2 options, when we
> > detect the host does have required networks set?
>
> Doing this "when we detect", currently, is rather late. I expect you meant
> to add new code that tests this beforehand.

Maybe. Details in the next two steps.

> > - Provide setup-host-network.yml to apply the correct network settings
> > during setup-routine (similar to the hook from [1] but with correct setup)
>
> If users know what to provide, they can already do that beforehand.
> We assume that your problem is mainly for the case they do not know (either
> because they lack experience/skill or simply because it's different hardware
> and they didn't yet test).

The biggest problem here is the missing documentation. The current workaround I know of is to mark the networks as non-required for the deployment host using the hooks mechanism. This is not the solution I would aim for.

If I were an experienced user, it would be possible to create a yaml file, added via the hooks mechanism, which sets up my host network as required. This should include:
- host IP
- host bond/vlan settings
- rhv network / host network->nic assignment

> > - Ask customer to change networks to non-required before the deployment is
> > started.
>
> Do you really think that's an option?
>
> I admit I do not have much experience with this side of the project.
> I thought changing networks is cluster-wide and is not something you'd
> like to do casually, without preparation/planning/etc.

As the current workaround does exactly this, it might be an easier step for the customer. The changes can be applied afterwards as well. But I would prefer the option above.

> > Would that be possible?
> We'd rather not add code to check the backup's content to see if it includes
> required networks etc. . I realize that without this, the failure would
> happen much later, which can be inconvenient.
>
> However, we *think* that it's not so bad, if, when it does fail, this does
> not completely fail deploy/restore, but allows the user to manually fix and
> then continue.

Technically this would be true, as nothing will break. But if our expectation is to have the upgrade experience fail the first time, in order to give the customer advice on how to act for the current upgrade process, this is the worst user experience we can provide. From the perspective of a customer buying an enterprise product, I would definitely not expect such behavior!

> If you decide that this is completely unacceptable, then we'll have to "bite
> the bullet" and do add this complexity.

See above.

> > One additional question, what firewall-requirements are needed to get access
> > to the hosted-engine during the deployment process?
> > Where does the engine listen to?
>
> I didn't try this myself, but see previous comments - the tool automatically
> temporarily opens up access to the engine web ui, via an ssh tunnel - you
> connect to the host on port 6900 and should reach the engine.
>
> I guess this requires more documentation, btw - searching for relevant
> keywords and with "6900" only finds me above-mentioned bugs.

That's another problem. Most customers I'm aware of do not have direct access to the hosts, other than SSH (port 22). Their network is secured by different firewalls in between, and the admin is not able to connect to ports other than 80/443 of the manager (fixed IP). This means that even if the installation routine pauses, it will not help the customer fix the issue.

We need other options here.
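The yaml hook Steffen describes could, in principle, be sketched with the collection's host-network module. This is a hypothetical illustration only, not a tested procedure: the task file name, host variable, NIC name, and network name are all made up, and whether such a hook can run at the right point of deployment is exactly what is discussed later in the thread.

```yaml
# Hypothetical hook sketch (untested; names below are placeholders):
# attach a NIC to a required logical network on the restored host.
- include_tasks: auth_sso.yml

- name: Attach host NIC to the required logical network
  ovirt.ovirt.ovirt_host_network:
    auth: "{{ ovirt_auth }}"
    name: alma07.example.com   # made-up host name
    interface: eth1            # made-up NIC; differs per hardware
    networks:
      - name: storage          # made-up required network
        boot_protocol: none
    save: true
```

The hard part, as Didi notes, is that the interface name cannot be known from the backup alone - it has to come from the customer, which is why a customer-supplied hook (rather than an automatic fix) is being discussed.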
1. If I understand correctly, we have the hook to make the networks not required for the cases where the customer does not have access to the hypervisor and cannot open the RHVM UI when it is paused, right? Do we have the same hook for 4.4? If not, we must provide it.

1.1. BTW, does the hook really make the networks non-required? Then we also need to explain this to the user, and the user may want to correct this after the environment is up.

2. I went through the upgrade guide and I do not see where we mention that this step is needed. So we would need a documentation bug to add this.
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/upgrade_guide/index
Please correct me if I am wrong and it is there.

3. Code fix suggestions:
3.1. During engine-backup, check if there are required networks besides ovirtmgmt, and let the user know by the end of engine-backup that this step will be needed and which networks they need to keep in mind.
3.2. Clarify the text during hosted-engine deployment to make it more obvious what is expected from the user.

Didi,
Would you like to have new bugs for items 1 and 3, or can we cover them here?
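A check like the one suggested in 3.1 could be sketched as a query against the engine database before taking the backup. This is a hedged sketch, not an official procedure: the table and column names are assumptions about the engine schema and should be verified against your version first.

```shell
# Hedged sketch only: list required non-management logical networks
# from the engine DB before backup. Schema names (network,
# network_cluster, required) are assumptions - verify per version.
sudo -u postgres psql engine -t -c "
  SELECT DISTINCT n.name
    FROM network n
    JOIN network_cluster nc ON nc.network_id = n.id
   WHERE nc.required
     AND n.name <> 'ovirtmgmt';"
```

If this prints any network names, the restore flow discussed in this bug (pause, assign NICs, remove the lock file) will be needed.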
(In reply to Marina Kalinin from comment #18)
> 1. If I understand correctly, we have the hook to make the networks not
> required for the cases that customer does not have access to the hypervisor
> and cannot open the RHVM UI when it is paused, right?
> Do we have the same hook for 4.4? If not, we must provide it.

Not sure what you mean here. If it's "do we have the hook _mechanism_", then yes, we have it. If it's whether we have the actual content, then assuming the example provided in "Additional info" works, tested and verified, then we have it - but it's just an example - users willing to use it will need, at minimum, to amend the list of networks there (under "with_items:").

> 1.1. BTW, does the hook really make the networks non-required? Then we also
> need to explain this to the user and the user may want to correct this after
> the environment is up.

That's my reading. I didn't test it.

> 2. I went through the upgrade guide and I do not see where we mention that
> this step is needed. So we would need a documentation bug to add this.
> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/upgrade_guide/index
> Please correct me if I am wrong and it is there.

I didn't read all of it. The "deploy with --restore-from-file" part, step 8 of [1], does not provide any details. It probably implies that this process is very similar to a clean install - the well-detailed [2] - which is mostly true, with a few notable exceptions, one of which is the "pause?" prompt (which only appears with --restore-from-file). So from a doc POV, it's probably enough to add some content to [1], perhaps as another paragraph or a "Note".
[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/upgrade_guide/index#Upgrading_the_Manager_to_4-4_4-3_SHE
[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/installing_red_hat_virtualization_as_a_self-hosted_engine_using_the_command_line/index#Deploying_the_Self-Hosted_Engine_Using_the_CLI_install_RHVM

> 3. Code fix suggestions:
> 3.1. During engine-backup check if there are required networks besides
> ovirtmgmt and let user know by the end of engine-backup that this step would
> be needed and those are the networks they need to keep in mind.

This is very easy to do - just supply the text you want and we can add it. That said, IMO it's useless. People do not read such outputs unless the program hangs telling them "Press Enter to continue", and then they also usually press Enter without reading. Often, this is run by cron or whatever. Last but not least - this requires a change in 4.3, so you should file the bug there - and I am not sure it's considered worth adding there.

> 3.2. Clarify the text during hosted engine deployment to make it more
> obvious what is expected from the user.

Only the text? Also change the default of "pause?" to Yes? Something more complex? Please coordinate with Steffen (and PM if needed) and get to a conclusion. Thanks!

> Didi,
> Would you like to have new bugs for items 1 and 3 or we can cover them here?

I think (3.2.) can be left as the current bug, and another bug opened for (1.). If you also want (3.1.), please open a bug on engine-backup on 4.3. Thanks.
(In reply to Steffen Froemer from comment #17)
> If I'm an experienced user, I would be possible to create a yaml-file to add
> to the hooks-mechanism, which will setup my host-network like required. This
> should include
> - host IP
> - host bond/vlan settings
> - rhv network / host network->nic assignment

I see your point. Can you please link/attach logs for an example failed deploy/restore? I guess it failed during "Wait for the host to be up" (but probably never had the host up). Not sure.

I am not sure our current ansible modules allow passing the above information directly when adding a host. So if that's what you want, we basically have two options:

1. First patch the ansible module to allow that, then patch hosted-engine to allow hooking into this (during adding the host).

2. Use the current workaround (set networks as non-required), and allow another hook, between "Add host" and "Wait for the host to be up". There, you'll be able to supply your own hook, which will assign the needed networks (using ovirt_host_networks) and try to activate the host again.

We can do both, of course - and let the user pick one.

> Technically this would be true, as nothing will break. But if our exception
> will be to have the upgrade experience fail the first time, to give customer
> advices how to act for the current upgrade process, this is the worst user
> experience we can provide. From a customer perspective buying an enterprise
> product, I would definitely not expecting such behavior!

It will never seem like a failure - just another part of the interaction. I also think (didn't test - see above) that it's currently the only way, whether you do this manually or automatically (using a hook we'll add, or adding a specific feature in the code). If all goes well, the only place a user will see "failure" is in the logs. Is that so bad?

> That's another problem. Most customers I'm aware of, does not have access to
> the hosts directly and only to SSH (port 22).
> Their network is secured by different firewalls in between and the admin is
> not able to connect different ports than 80/443 of the manager (fixed IP)
> This means even if the installation routine will pause, it will not help any
> customer to fix the issue.
>
> We need other options here.

I think you can do whatever you want using an ssh tunnel. Something like:

ssh -L443:localhost:6900 root@host

(perhaps with sudo) and add a (temporary) suitable entry to your local /etc/hosts. I didn't try that.
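Spelling out that tunnel suggestion as a hedged sketch (untested, as noted above; the hostnames are placeholders, and binding local port 443 requires root):

```shell
# On the admin workstation: forward local port 443 through the host's
# SSH port (22, which the firewall allows) to the temporary engine UI
# on the host's port 6900. Hostnames below are made up.
sudo ssh -L 443:localhost:6900 root@rhvh-host.example.com

# In a second terminal: point the engine FQDN at localhost so the
# browser resolves it through the tunnel (temporary entry; remove it
# when done):
echo '127.0.0.1 engine.example.com' | sudo tee -a /etc/hosts

# Then browse to https://engine.example.com/ovirt-engine/
```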
Based on comment 5, this is what I understand we need to add to the Upgrade Guide for SHE:

7. During restore, answer "yes" to:
Pause the execution after adding this host to the engine?

8. Add the missing network to alma07:
a. HOW DO I DO THIS STEP?: Connect to https://alma07.qa.lab.tlv.redhat.com:6900/ovirt-engine/ and check the status of this host and eventually remediate it.
b. Continue when the host is listed as 'up'.

9. Delete the lock file:
"rm -rf /tmp/ansible.fso1sz7n_he_setup_lock".

THIS IS WHAT'S NOT CLEAR:
How do I add the missing network to alma07?
- How do I connect to it while in the middle of deployment? Do I need to open another terminal? Or connect from a different machine? Or what?
- Once I've connected, how do I remediate the status of the host?
(In reply to Steve Goodman from comment #21)
> THIS IS WHAT'S NOT CLEAR:
> How do I add the missing network to alma07?
> - How do I connect to it while in the middle of deployment? Do I need to
> open another terminal? Or connect from a different machine? Or what?

Just use a browser with this URL. It should work. It's temporary - it will stop working when the deployment finishes.

> - Once I've connected, how do I remediate the status of the host?

That depends on why it's not up. If it's only due to the issue of the current bug, you should go to "Network Interfaces" -> "Setup Host Networks", and assign the relevant interfaces to the required networks.

Please create a doc bug for this. It seems like we'll also do some code changes for the current bug.
Removing needinfo per comment #22.
Using the current patches in the linked PR, I verified both ways: deployed a 4.3 hosted-engine, added a dummy nic and a network (marked required), and ran a backup. Then I restored in 4.4.

1. Manually - it paused (even though I replied 'No' to "Pause?"; there is a separate variable for this, default true, which we do not ask about). I created a dummy nic, connected to the web admin UI at :6900, attached the nic to the network, activated the host, and removed the lock file.

2. Automatically - I edited the example hook, copied it to the right place as instructed in it, and used an answer file generated by a previous run; the deployment ran fully automatically, without a prompt, until asking about storage.

Sounds reasonable?
(In reply to Yedidyah Bar David from comment #24)
> Sounds reasonable?

Yes. Now it only needs to be documented well.

Also, I just created a small playbook which allows customers to easily create the hook automatically, based on the old platform:

https://github.com/knumskull/ovirt-ansible

What do you think?
(In reply to Steffen Froemer from comment #25)
> Yes. now it only need to be documented well.

Definitely. But I hope it will be clear and intuitive enough even for those who do not read documentation.

> Also I just created a small playbook, which allows customers to easily
> create the hook automatically, based on the old platform.
>
> https://github.com/knumskull/ovirt-ansible
>
> what do you think?

Looks good to me. I understand it's designed to be run against the live old engine, right? So it is useful for upgrades, but not for "real" restores.
QE: To reproduce/verify:

1. Deploy a 4.3 hosted-engine.

2. Change the "Default" cluster to have more than one network, and mark the other network as "required" too. If you do not have a host with more than one nic, you can create a dummy one, e.g. with 'ip link add dummy_1 type dummy'.

3. Take a backup.

4. Try to upgrade to 4.4 using this backup, twice:

4.1. As-is, just following the docs. Accept the default 'No' to 'pause?'. Instead of failing (which is what happens in previous versions), it:
- Tells you that adding the host failed.
- Provides some details. Specifically, the output should include something about the required networks. If it does not, that's a bug - please report it and attach logs.
- Includes a link to the web admin UI. The URL will include the host, not the engine, on port :6900. This is temporary, and works only during the deployment.
- Outputs the name of a lock file you should remove once finished.

So: connect to the web admin UI, fix the issue (you might need to add a dummy nic as above), activate the host, then remove the lock file. The deployment should continue successfully.

4.2. Alternatively, supply a hook; see [1][2] for details:
- Copy the file from /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/examples/required_networks_fix.yml to /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/hooks/after_add_host/
- Edit it, replacing "myhost", "eth0" and "net1" as applicable.
- Then try to upgrade/restore. It should succeed without asking anything until the storage prompt.

If you intend to reuse the same host for testing more than one flow without a complete reinstall, please note that 'ovirt-hosted-engine-cleanup' does not completely clean networks data, so later attempts might be affected. I pushed a patch for this [3], but didn't open a bug yet. So either run, after cleanup, also 'vdsm-tool clear-nets --exclude-net ovirtmgmt', or reinstall the OS.

Doc: The only doc for now is [1][2].
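For illustration, a minimal sketch of what the edited hook in step 4.2 might look like after replacing the placeholders. This is an assumption about the example's shape, not the verbatim content of required_networks_fix.yml; "myhost", "eth0" and "net1" are the placeholder names mentioned above, and the ovirt_host_network module from the ovirt.ovirt collection is assumed to do the attachment:

```yaml
# Hypothetical sketch of hooks/after_add_host/required_networks_fix.yml.
# Replace "myhost", "eth0" and "net1" with your host name, the interface
# to attach, and the required logical network from the backup.
- include_tasks: auth_sso.yml

- name: Attach the required network to the host interface
  ovirt_host_network:
    auth: "{{ ovirt_auth }}"
    name: myhost
    interface: eth0
    networks:
      - name: net1
```

With such a hook in place, the deployment should not pause on the non-operational host, since the required network is attached right after the host is added.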
We probably want a doc bug as well.

[1] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/README.md#make-changes-in-the-engine-vm-during-the-deployment
[2] https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hosted_engine_setup/examples/required_networks_fix.yml
[3] https://gerrit.ovirt.org/112336
We have a related doc bug 1695523.
ovirt-ansible-collection-1.2.3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312