Bug 1853225
| Summary: | Failed importing the Hosted Engine VM | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Nikolai Sednev <nsednev> |
| Component: | General | Assignee: | Arik <ahadas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4.0 | CC: | ahadas, aoconnor, bugs, dfodor, khakimi, lsvaty, michal.skrivanek, mkalinin, pelauter, rhodain, sgoodman, stirabos |
| Target Milestone: | ovirt-4.4.1-1 | Keywords: | Reopened, Triaged |
| Target Release: | --- | Flags: | pm-rhel: ovirt-4.4+, aoconnor: blocker+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | org.ovirt.engine-root-4.4.1.10-1 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-05 06:28:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Nikolai Sednev
2020-07-02 08:23:41 UTC
Created attachment 1699606 [details]
RHEL7.8 stuck in preparing for maintenance with an old HE-VM
please provide more details on the exact steps taken.

(In reply to Michal Skrivanek from comment #2)
> please provide more details on the exact steps taken.

The issue arose at step 12.

1. Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07 hosted the HE-VM, was the SPM and was the first ha-host on which HE was deployed; alma03 and alma04 were then added as ha-hosts; all three hosts had IBRS CPUs) with the following components:
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
2. Added NFS data storage for guest VMs.
3. Added 6 guest VMs (3 RHEL 7 and 3 RHEL 8) and distributed them evenly across the 3 ha-hosts.
4. Set the environment to global maintenance, stopped the engine ("systemctl stop ovirt-engine" on the engine VM) and created a backup file on the engine with "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3 --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
5. Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
6. Reprovisioned alma07 to the latest RHEL 8.2 with these components:
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
rhvm-appliance-4.4-20200604.0.el8ev.x86_64
Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux 4.18.0-193.11.1.el8_2.x86_64 #1 SMP Fri Jun 26 16:18:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
7. Copied the backup file from the laptop to /root on the reprovisioned, clean alma07.
8. Restored the engine's DB using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3" and got a Software Version:4.4.1.2-0.10.el8ev engine deployed on alma07, using rhvm-appliance-4.4-20200604.0.el8ev.x86_64.
9. Removed global maintenance from the UI of the engine and then moved alma03 and alma04 into local maintenance.
10. Placed alma03 into local maintenance and then removed it from the environment to reprovision it to RHEL 8.2.
11. Installed ovirt-hosted-engine-setup on alma03 and added it back to the environment as an ha-host.
12. Tried to move alma04 to local maintenance, but failed: it was running 3 VMs (2 guest VMs, which got migrated, and 1 old HE-VM, which could not be migrated), so the host got stuck in "Preparing for maintenance".
13. Forcefully rebooted alma04 with the old HE-VM still running on it and reprovisioned it to RHEL 8.2.
14. Marked in the UI that alma04 was rebooted, placed it into local maintenance while the host was shown as down, and then removed the host from the environment.
15. Installed ovirt-hosted-engine-setup on alma04 and added it back to the environment as an ha-host.
16. Bumped up the host-cluster 4.3 -> 4.4 and received this error: "Operation Canceled. Error while executing action: Update of cluster compatibility version failed because there are VMs/Templates [HostedEngine] with incorrect configuration. To fix the issue, please go to each of them, edit, change the Custom Compatibility Version of the VM/Template to the cluster level you want to update the cluster to and press OK. If the save does not pass, fix the dialog validation. After successful cluster update, you can revert your Custom Compatibility Version change."
17. Could not proceed to bumping up the data-center 4.3 -> 4.4.

*My host-cluster remained 4.3 after the restore was complete, i.e. when I was finished with alma03 and alma04.
**After step 13 the old HE-VM jumped to alma03, and I could not set alma03 to local maintenance when I tried to clear the UI's "Update available" notice by upgrading the host through the UI; it got stuck in "Preparing for maintenance" because the old HE-VM could not be migrated elsewhere. alma03 was already RHEL 8.2 at this point.
***To summarize, I failed to finish the upgrade; the old HE-VM remains in the environment as a zombie, shifting among the ha-hosts uncontrolled and preventing them from being set into local maintenance.
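For reference, a minimal sketch of the backup-and-restore flow from steps 4 to 8 above, assuming the same file names as in the report; the intermediate copy via a laptop is shown as a plain scp to an illustrative `laptop` host and is not part of the original steps:

```bash
# On the 4.3 environment: freeze it and take the engine backup (steps 4-5).
hosted-engine --set-maintenance --mode=global   # run on an ha-host; the report used the equivalent UI/global-maintenance action
systemctl stop ovirt-engine                     # on the engine VM
engine-backup --mode=backup \
  --file=nsednev_from_alma07_SPM_rhevm_4_3 \
  --log=Log_nsednev_from_alma07_SPM_rhevm_4_3
scp nsednev_from_alma07_SPM_rhevm_4_3 Log_nsednev_from_alma07_SPM_rhevm_4_3 laptop:   # illustrative off-VM copy

# On the reprovisioned RHEL 8.2 host alma07 (steps 7-8): restore into a new 4.4 deployment.
scp laptop:nsednev_from_alma07_SPM_rhevm_4_3 /root/
hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3
```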
(In reply to Nikolai Sednev from comment #3)
> 8. Restored the engine's DB using "hosted-engine --deploy
> --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3" and got a
> Software Version:4.4.1.2-0.10.el8ev engine deployed on alma07, using
> rhvm-appliance-4.4-20200604.0.el8ev.x86_64.

How did this finish? Did you have a running engine at this point? The next step suggests you did...

> 9. Removed global maintenance from the UI of the engine and then moved
> alma03 and alma04 into local maintenance.

...can you confirm you were logging into the new 4.4 instance?

> 12. Tried to move alma04 to local maintenance, but failed: it was running
> 3 VMs (2 guest VMs, which got migrated, and 1 old HE-VM, which could not be
> migrated), so the host got stuck in "Preparing for maintenance".

So the 2 guest VMs make sense, since that's what you had initially. The question is how the HE VM started on this host. It seems that can happen if you remove global maintenance in step 9 while reusing the same HE SD. Did you use the same HE SD without wiping it?

The rest is not too interesting, as the problem already happens here.
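As an aside on the question above: which hosts still consider themselves in global maintenance, and which host believes it is running the HE VM, can be checked per host with the standard hosted-engine status command. A minimal sketch, using the host names from this report:

```bash
# Each host reports the maintenance state and engine VM state as seen through
# the HE storage domain it is attached to, so old 4.3 hosts and the new 4.4
# host can disagree after a restore onto a new HE SD.
for h in alma03 alma04 alma07; do
    echo "== $h =="
    ssh root@"$h" hosted-engine --vm-status
done
```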
(In reply to Michal Skrivanek from comment #4)
> How did this finish? Did you have a running engine at this point? The next
> step suggests you did...

Yes.

> ...can you confirm you were logging into the new 4.4 instance?

Yes.

> So the 2 guest VMs make sense, since that's what you had initially. The
> question is how the HE VM started on this host. It seems that can happen if
> you remove global maintenance in step 9 while reusing the same HE SD. Did
> you use the same HE SD without wiping it?

No, I think I hit https://bugzilla.redhat.com/show_bug.cgi?id=1830872. The deployment was made on an exclusively new storage volume on different storage.
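The "new SD" answer above could also be double-checked on the hosts themselves. As a hedged illustration (not part of the original report), the HE storage domain each host points at is recorded in the hosted-engine configuration file; the path and key names below are assumptions based on common ovirt-hosted-engine-ha defaults:

```bash
# Compare the HE storage settings on an old host and on the new host;
# differing storage/sdUUID values indicate the deployments use different HE SDs.
for h in alma03 alma07; do
    echo "== $h =="
    ssh root@"$h" "grep -E '^(storage|sdUUID)=' /etc/ovirt-hosted-engine/hosted-engine.conf"
done
```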
Probably it's related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872 (an older appliance was used during verification), hence I'll try to verify again using the latest rhvm-appliance-4.3-20200702.0.el7.

New SD, ok. Why do you think bug 1830872 is related?

If you're doing it again, please try to find out whether, after step 9, you still see global maintenance set on the old hosts (since they are looking at the old SD, that should be the case). At no point in time may the old hosts move out of global maintenance; that would result in exactly what you've reported.

The error at step 16 is a different thing and is fixed already (bug 1847513).

Waiting for new test results using ovirt-engine >= 4.4.1.5, covering bug #1847513 and bug #1830872.

(In reply to Michal Skrivanek from comment #6)
> New SD, ok. Why do you think bug 1830872 is related?

The issue, I think, is with an old appliance, but it's the latest available to QA from CI in brew. rhvm-appliance-4.4-20200604.0.el8ev = Software Version:4.4.1.2-0.10.el8ev; I have to get the engine with Software Version:4.4.1.5-0.17.el8ev to get rid of https://bugzilla.redhat.com/show_bug.cgi?id=1830872. I'll try to find a way to update it to the latest bits before engine-setup starts during the restore.

Works for me on the latest Software Version:4.4.1.7-0.3.el8ev.
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)
The reported issue no longer exists.

I've got to the same situation again: the old HE-VM keeps running on one of the old hosts like a zombie:

alma03 ~]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 4     HostedEngine                   running

On the reprovisioned alma07 I've made a restore, and I see the new HE-VM running as expected:

alma07 ~]# virsh -r list --all
 Id   Name           State
------------------------------
 2    HostedEngine   running
 3    VM5            running
 4    VM2            running

The SD used for the new engine is nsednev_he_2; the old SD is nsednev_he_1. We should not get this flow during restore. I still have the environment, please contact me if you need a reproducer. Reopening this bug for investigation.

Components used on the engine:
ovirt-engine-setup-4.4.1.7-0.3.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

On host alma07 (the one that was running the restore):
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
qemu-kvm-4.2.0-28.module+el8.2.1+7211+16dfe810.x86_64
libvirt-client-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
sanlock-3.8.0-2.el8.x86_64
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Steps during reproduction:
1. Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07 hosted the HE-VM, was the SPM and was the first ha-host on which HE was deployed; alma03 and alma04 were then added as ha-hosts; all three hosts had IBRS CPUs) with the following components:
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
2. Added 6 guest VMs (3 RHEL 7 and 3 RHEL 8) and distributed them evenly across the 3 ha-hosts.
3. Set the environment to global maintenance, stopped the engine ("systemctl stop ovirt-engine" on the engine VM) and created a backup file on the engine with "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3 --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
4. Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
5. Reprovisioned alma07 to the latest RHEL 8.2 with these components:
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
qemu-kvm-4.2.0-28.module+el8.2.1+7211+16dfe810.x86_64
libvirt-client-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
sanlock-3.8.0-2.el8.x86_64
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)
6. Copied the backup file from the laptop to /root on the reprovisioned, clean alma07.
7. Restored the engine's DB using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3", fetched the newest repos to the engine during deployment, and got a Software Version:4.4.1.7-0.3.el8ev engine deployed on alma07.
8. Removed global maintenance from the UI of the engine.
9. Moved alma03 to local maintenance.
10. Found that alma03 was stuck in "Preparing for maintenance", checked its VMs, and found that the VM named "HostedEngine" was running in parallel on both alma03 and alma07!
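For completeness, the duplicate HE-VM from step 10 can be confirmed directly on the hosts in the same way as in the reopening comment above, by listing the libvirt domains on both the old and the new host. A minimal sketch using the host names from this report:

```bash
# Read-only domain listing on each host; two hosts both showing a running
# "HostedEngine" domain is the situation described in step 10.
for h in alma03 alma07; do
    echo "== $h =="
    ssh root@"$h" virsh -r list --all
done
```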
(In reply to Nikolai Sednev from comment #12)
> 8. Removed global maintenance from the UI of the engine.

Which host did you use for removing global maintenance?

(In reply to Sandro Bonazzola from comment #13)
> Which host did you use for removing global maintenance?

From the UI I clicked on alma03 to highlight it and then disabled global maintenance when the option appeared. There should not be any difference in which host I click to get the option to disable global maintenance, as it is global and should not be host-specific; global maintenance affects all ha-hosts in the hosted-engine host-cluster.

Thank you for clarifying which host was selected. alma03 was an old host at that time. That caused the old setup to reactivate, and that is why it started the old HE VM. This must not happen; we need to make it clear in the documentation.

I suggest moving this to Documentation: recommend moving out of global maintenance only once all HE hosts are upgraded, and warn about making sure you select the new host if you need or want to do it sooner.

(In reply to Michal Skrivanek from comment #15)
> I suggest moving this to Documentation: recommend moving out of global
> maintenance only once all HE hosts are upgraded, and warn about making sure
> you select the new host if you need or want to do it sooner.

I addressed this as part of bug 1802650.

Please fill in the "Fixed In Version:" field with the exact component version. The latest rhvm available for QA is rhvm-4.4.1.10-0.1.el8ev.noarch from rhvm-appliance-4.4-20200722.0.el8ev.x86_64.rpm.

Nothing to verify; the issue had been fixed.

alma04 ~]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 1     VM5                            running
 3     VM6                            running
 4     VM4                            running
 5     VM3                            running

alma07 ~]# virsh -r list --all
 Id   Name           State
------------------------------
 2    HostedEngine   running
 3    VM1            running
 4    VM2            running

alma03 had been sent to local maintenance and then removed for further host upgrade. No "zombie" old HE-VM appeared when I tried to disable global maintenance from either of the old 4.3 ha-hosts (alma03 or alma04, check the attached recordings).

Works for me on:
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
rhvm-appliance-4.4-20200722.0.el8ev.x86_64
Software Version:4.4.1.10-0.1.el8ev
Linux 4.18.0-193.13.2.el8_2.x86_64 #1 SMP Mon Jul 13 23:17:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

The backup was made from an iSCSI-deployed HE 4.3, and the restore was performed to a different iSCSI volume on the same storage. Moving to verified.

Created attachment 1702248 [details]
recording 1
Created attachment 1702249 [details]
recording 2
Created attachment 1702250 [details]
recording 3
This bugzilla is included in the oVirt 4.4.1.1 async release, published on July 13th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.1.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. |