Bug 1853225 - Failed importing the Hosted Engine VM
Summary: Failed importing the Hosted Engine VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.1-1
Assignee: Arik
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-02 08:23 UTC by Nikolai Sednev
Modified: 2020-08-05 06:28 UTC
CC: 12 users

Fixed In Version: org.ovirt.engine-root-4.4.1.10-1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-05 06:28:20 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker+


Attachments (Terms of Use)
sosreport from the new engine (5.75 MB, application/x-xz), 2020-07-02 08:23 UTC, Nikolai Sednev
RHEL7.8 stuck in preparing for maintenance with an old HE-VM (73.66 KB, image/png), 2020-07-02 08:30 UTC, Nikolai Sednev
recording 1 (787.96 KB, application/x-matroska), 2020-07-23 15:57 UTC, Nikolai Sednev
recording 2 (4.29 MB, application/x-matroska), 2020-07-23 15:58 UTC, Nikolai Sednev
recording 3 (9.40 MB, application/x-matroska), 2020-07-23 15:58 UTC, Nikolai Sednev


Links
oVirt gerrit 110226 (master, MERGED): core: run global maintenance on hosted engine's run_on_vds (last updated 2020-09-24 10:22:50 UTC)
oVirt gerrit 110317 (ovirt-engine-4.4.1.z, MERGED): core: run global maintenance on hosted engine's run_on_vds (last updated 2020-09-24 10:22:45 UTC)
oVirt gerrit 110345 (master, MERGED): core: run global maintenance on hosted engine's run_on_vds - fixed (last updated 2020-09-24 10:22:46 UTC)
oVirt gerrit 110347 (ovirt-engine-4.4.1.z, MERGED): core: run global maintenance on hosted engine's run_on_vds - fixed (last updated 2020-09-24 10:22:45 UTC)

Description Nikolai Sednev 2020-07-02 08:23:41 UTC
Created attachment 1699604 [details]
sosreport from the new engine

Description of problem:
During the 4.3 to 4.4 backup and restore flow that is part of the upgrade, I ran into an issue when placing one of the three ha-hosts into local maintenance after the engine upgrade was finished and the engine was on 4.4. The ha-host had to be reprovisioned and upgraded from RHEL7.8 to RHEL8.2 as part of the upgrade, but not all VMs got migrated off it and it got stuck in "Preparing for maintenance". I checked which VM was still running on host alma04, and it turned out to be the old HE-VM:

alma04 ~]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 6     HostedEngine                   running
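
A quick way to check whether the old 4.3 HA stack on that host is still active and still claims the HE-VM (a sketch; both services are part of ovirt-hosted-engine-ha):

alma04 ~]# systemctl status ovirt-ha-agent ovirt-ha-broker   # are the old HA services still running?
alma04 ~]# hosted-engine --vm-status                         # which host the old stack thinks is running the HE-VM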

In the engine logs I saw the following:
2020-07-02 10:54:52,603+03 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (EE-ManagedThreadFactory-engine-Thread-16515) [629228fd] Lock freed to object 'EngineLock:{exclusiveLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=VM, HostedEngine=VM_NAME]', sharedLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=REMOTE_VM]'}'
2020-07-02 10:54:52,603+03 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (EE-ManagedThreadFactory-engine-Thread-16515) [629228fd] Failed importing the Hosted Engine VM
2020-07-02 10:54:53,507+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-74) [78db2a67] Lock Acquired to object 'EngineLock:{exclusiveLocks='[35b34d53-53d0-4b7e-bafe-39e5f81d5d88=PROVIDER]', sharedLocks=''}'
2020-07-02 10:54:53,515+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-74) [78db2a67] Running command: SyncNetworkProviderCommand internal: true.
2020-07-02 10:54:53,637+03 INFO  [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-7) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2020-07-02 10:54:53,783+03 INFO  [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-74) [78db2a67] Lock freed to object 'EngineLock:{exclusiveLocks='[35b34d53-53d0-4b7e-bafe-39e5f81d5d88=PROVIDER]', sharedLocks=''}'
2020-07-02 10:55:07,607+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [] VM '9667d591-ec0c-4289-91c1-dfbc8f36e54c' was discovered as 'Up' on VDS 'bfff7524-a664-450b-a51b-89a791215394'(alma04.qa.lab.tlv.redhat.com)
2020-07-02 10:55:07,622+03 INFO  [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [69da8f11] Running command: AddUnmanagedVmsCommand internal: true.
2020-07-02 10:55:07,623+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [69da8f11] START, DumpXmlsVDSCommand(HostName = alma04.qa.lab.tlv.redhat.com, Params:{hostId='bfff7524-a664-450b-a51b-89a791215394', vmIds='[9667d591-ec0c-4289-91c1-dfbc8f36e54c]'}), log id: 66257fa0
2020-07-02 10:55:07,627+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [69da8f11] FINISH, DumpXmlsVDSCommand, return: {9667d591-ec0c-4289-91c1-dfbc8f36e54c=<domain type='kvm' id='6'>
  <name>HostedEngine</name>
  <uuid>9667d591-ec0c-4289-91c1-dfbc8f36e54c</uuid>
  <metadata xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-tune:qos/>
    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:memGuaranteedSize type="int">0</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:startTime type="float">1593676020.56</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="00:16:3e:7b:b8:53">
        <ovirt-vm:deviceId>586d7a62-53b8-4e39-8185-19b5a1026be8</ovirt-vm:deviceId>
        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="hdc">
        <ovirt-vm:deviceId>cbaff936-c35c-446f-81b7-082752a876ae</ovirt-vm:deviceId>
        <ovirt-vm:shared>false</ovirt-vm:shared>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="vda">
        <ovirt-vm:deviceId>548cd107-4279-479c-8226-82f67813aaa5</ovirt-vm:deviceId>
        <ovirt-vm:domainID>b3ddc753-2419-4299-bf42-f68a961721d1</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/vda2</ovirt-vm:guestName>
        <ovirt-vm:imageID>c6eaae81-7d92-418f-b20e-b8d0c2e195b1</ovirt-vm:imageID>
        <ovirt-vm:poolID>00000000-0000-0000-0000-000000000000</ovirt-vm:poolID>
        <ovirt-vm:shared>exclusive</ovirt-vm:shared>
        <ovirt-vm:volumeID>548cd107-4279-479c-8226-82f67813aaa5</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>b3ddc753-2419-4299-bf42-f68a961721d1</ovirt-vm:domainID>
                <ovirt-vm:imageID>c6eaae81-7d92-418f-b20e-b8d0c2e195b1</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1/b3ddc753-2419-4299-bf42-f68a961721d1/images/c6eaae81-7d92-418f-b20e-b8d0c2e195b1/548cd107-4279-479c-8226-82f67813aaa5.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1/b3ddc753-2419-4299-bf42-f68a961721d1/images/c6eaae81-7d92-418f-b20e-b8d0c2e195b1/548cd107-4279-479c-8226-82f67813aaa5</ovirt-vm:path>
                <ovirt-vm:volumeID>548cd107-4279-479c-8226-82f67813aaa5</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
</ovirt-vm:vm>
  </metadata>
  <memory unit='KiB'>13068288</memory>
  <currentMemory unit='KiB'>13068288</currentMemory>
  <vcpu placement='static' current='4'>8</vcpu>
  <cputune>
    <shares>1020</shares>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>Red</entry>
      <entry name='product'>RHEV Hypervisor</entry>
      <entry name='version'>7.9-1.el7</entry>
      <entry name='serial'>4c4c4544-0059-4410-8054-b8c04f573032</entry>
      <entry name='uuid'>9667d591-ec0c-4289-91c1-dfbc8f36e54c</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.3.0'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>SandyBridge</model>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='xsaveopt'/>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>destroy</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' error_policy='stop'/>
      <source startupPolicy='optional'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source file='/var/run/vdsm/storage/b3ddc753-2419-4299-bf42-f68a961721d1/c6eaae81-7d92-418f-b20e-b8d0c2e195b1/548cd107-4279-479c-8226-82f67813aaa5'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <serial>c6eaae81-7d92-418f-b20e-b8d0c2e195b1</serial>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <lease>
      <lockspace>b3ddc753-2419-4299-bf42-f68a961721d1</lockspace>
      <key>548cd107-4279-479c-8226-82f67813aaa5</key>
      <target path='/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_nsednev__he__1/b3ddc753-2419-4299-bf42-f68a961721d1/images/c6eaae81-7d92-418f-b20e-b8d0c2e195b1/548cd107-4279-479c-8226-82f67813aaa5.lease'/>
    </lease>
    <interface type='bridge'>
      <mac address='00:16:3e:7b:b8:53'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <link state='up'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='virtio' port='0'/>
      <alias name='console0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/9667d591-ec0c-4289-91c1-dfbc8f36e54c.com.redhat.rhevm.vdsm'/>
      <target type='virtio' name='com.redhat.rhevm.vdsm' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/9667d591-ec0c-4289-91c1-dfbc8f36e54c.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/9667d591-ec0c-4289-91c1-dfbc8f36e54c.org.ovirt.hosted-engine-setup.0'/>
      <target type='virtio' name='org.ovirt.hosted-engine-setup.0' state='disconnected'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='3'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='vnc' port='5906' autoport='yes' listen='0' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='address' address='0'/>
    </graphics>
    <video>
      <model type='vga' vram='32768' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='none'/>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c280,c788</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c280,c788</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>
}, log id: 66257fa0
2020-07-02 10:55:07,639+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-23) [69da8f11] null architecture type, replacing with x86_64, VM [HostedEngine]
2020-07-02 10:55:07,641+03 INFO  [org.ovirt.engine.core.bll.HostedEngineImporter] (EE-ManagedThreadFactory-engine-Thread-16522) [69da8f11] Try to import the Hosted Engine VM 'VM [HostedEngine]'
2020-07-02 10:55:07,645+03 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (EE-ManagedThreadFactory-engine-Thread-16522) [b451074] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=VM, HostedEngine=VM_NAME]', sharedLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=REMOTE_VM]'}'
2020-07-02 10:55:07,649+03 WARN  [org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (EE-ManagedThreadFactory-engine-Thread-16522) [b451074] Validation of action 'ImportVm' failed for user SYSTEM. Reasons: VAR__ACTION__IMPORT,VAR__TYPE__VM,ACTION_TYPE_FAILED_CANNOT_ADD_IFACE_DUE_TO_MAC_DUPLICATES,$ACTION_TYPE_FAILED_CANNOT_ADD_IFACE_DUE_TO_MAC_DUPLICATES_LIST  net0,$ACTION_TYPE_FAILED_CANNOT_ADD_IFACE_DUE_TO_MAC_DUPLICATES_LIST_COUNTER 1
2020-07-02 10:55:07,649+03 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (EE-ManagedThreadFactory-engine-Thread-16522) [b451074] Lock freed to object 'EngineLock:{exclusiveLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=VM, HostedEngine=VM_NAME]', sharedLocks='[9667d591-ec0c-4289-91c1-dfbc8f36e54c=REMOTE_VM]'}'
2020-07-02 10:55:07,649+03 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (EE-ManagedThreadFactory-engine-Thread-16522) [b451074] Failed importing the Hosted Engine VM
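
The ACTION_TYPE_FAILED_CANNOT_ADD_IFACE_DUE_TO_MAC_DUPLICATES validation above indicates that the restored engine already tracks a VM nic with the same MAC address as the stale HostedEngine's net0 (00:16:3e:7b:b8:53), so the importer refuses to add it. A minimal sketch for locating the conflicting nic from the engine VM, assuming the 4.4 schema still exposes vm_interface.mac_addr and that the bundled engine-psql.sh wrapper (which forwards its arguments to psql) is installed:

# run on the engine VM; adjust if your schema or paths differ
/usr/share/ovirt-engine/dbscripts/engine-psql.sh -c \
  "SELECT vm_guid, name, mac_addr FROM vm_interface WHERE mac_addr = '00:16:3e:7b:b8:53';"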

The issue blocks the upgrade, as it's not possible to upgrade all of the old ha-hosts, bump the host-cluster compatibility version from 4.3 to 4.4, and then bump the data-center.

Version-Release number of selected component (if applicable):
ovirt-engine-setup-4.4.1.2-0.10.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
Linux 4.18.0-193.11.1.el8_2.x86_64 #1 SMP Fri Jun 26 16:18:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)


How reproducible:
100%

Steps to Reproduce:
1.Upgrade from a 4.3 environment with 3 ha-hosts to a 4.4 environment using the backup & restore flow.

Actual results:
One of the three ha-hosts cannot be placed into maintenance: the old HE-VM is still running on it and refuses to migrate to the new RHEL8.2 ha-hosts, which blocks me from finishing the whole upgrade properly and bumping the host-cluster from 4.3 to 4.4, and the data-center as well.

Expected results:
The upgrade should finish successfully.

Additional info:
Sosreport from the new HE-VM is attached.

Comment 1 Nikolai Sednev 2020-07-02 08:30:59 UTC
Created attachment 1699606 [details]
RHEL7.8 stuck in preparing for maintenance with an old HE-VM

Comment 2 Michal Skrivanek 2020-07-02 09:10:32 UTC
please provide more details on the exact steps taken.

Comment 3 Nikolai Sednev 2020-07-02 11:28:22 UTC
(In reply to Michal Skrivanek from comment #2)
> please provide more details on the exact steps taken.
The issue arose at step 12.


1.Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07, hosting the HE-VM and acting as SPM, was the initial ha-host on which HE was deployed; the alma03 and alma04 ha-hosts were added afterwards; all three hosts had IBRS CPUs) with the following components:
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
2.Added NFS data storage for guest-VMs.
3.Added 6 guest-VMs, 3RHEL7 and 3RHEL8 and distributed them evenly across 3 ha-hosts.
4.Set the environment to global maintenance, stopped the engine ("systemctl stop ovirt-engine" on the engine-VM) and created backup file from the engine "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3 --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
5.Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
6.Reprovisioned alma07 to latest RHEL8.2 with these components:
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
rhvm-appliance-4.4-20200604.0.el8ev.x86_64
Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux 4.18.0-193.11.1.el8_2.x86_64 #1 SMP Fri Jun 26 16:18:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
7.Copied backup file from laptop to /root on reprovisioned and clean alma07.
8.Restored engine's DB using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3" and got Software Version:4.4.1.2-0.10.el8ev engine deployed on alma07, using rhvm-appliance-4.4-20200604.0.el8ev.x86_64.
9.Removed global maintenance from the UI of the engine and then moved alma03 and alma04 into local maintenance.
10.Placed alma03 into local maintenance and then removed it from the environment to reprovision it to RHEL8.2.
11.Installed ovirt-hosted-engine-setup on alma03 and added it back to environment as ha-host.
12.Tried to move alma04 to local maintenance, but failed to do so: it was running 3 VMs, 2 guest-VMs, which got migrated, and 1 old HE-VM, which could not be migrated, so the host got stuck in "Preparing for maintenance".
13.I forcefully rebooted alma04 with the old HE-VM running on it and reprovisioned it to RHEL8.2.
14.Marked in the UI that alma04 was rebooted, placed it into local maintenance while the host was shown as down, and then removed the host from the environment.
15.Installed ovirt-hosted-engine-setup on alma04 and added it back to environment as ha-host.
16.Bumped up host-cluster 4.3->4.4 and received this error:
"Operation Canceled
Error while executing action: Update of cluster compatibility version failed because there are VMs/Templates [HostedEngine] with incorrect configuration. To fix the issue, please go to each of them, edit, change the Custom Compatibility Version of the VM/Template to the cluster level you want to update the cluster to and press OK. If the save does not pass, fix the dialog validation. After successful cluster update, you can revert your Custom Compatibility Version change."
17.I couldn't proceed to bumping up data-center 4.3->4.4.

*My host-cluster remained at 4.3 after the restore was complete, even once I had finished with alma03 and alma04.
**After step 13 the old HE-VM jumped to alma03, and I could not set alma03 to local maintenance when I tried to get rid of the UI's "Update available" notice and upgrade the host using the UI; it got stuck in "Preparing for maintenance" as the old HE-VM could not be migrated elsewhere. alma03 was already RHEL8.2 at this point.

***To summarize, I failed to finish the upgrade; the old HE-VM will remain in the environment as a zombie, shifting among the ha-hosts, uncontrolled, and preventing them from being set to local maintenance.
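
For reference, the backup and restore commands from steps 4 and 8 above, consolidated into a minimal sketch (file names and paths are the ones used in this report; global maintenance is set with the usual hosted-engine CLI):

# on one of the 4.3 ha-hosts
hosted-engine --set-maintenance --mode=global

# on the 4.3 engine VM
systemctl stop ovirt-engine
engine-backup --mode=backup \
  --file=nsednev_from_alma07_SPM_rhevm_4_3 \
  --log=Log_nsednev_from_alma07_SPM_rhevm_4_3

# copy the backup off the engine VM, reprovision alma07 to RHEL 8.2,
# copy the backup to /root on the clean host, then restore there:
hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3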

Comment 4 Michal Skrivanek 2020-07-02 11:59:06 UTC
(In reply to Nikolai Sednev from comment #3)
> (In reply to Michal Skrivanek from comment #2)
> > please provide more details on the exact steps taken.
> The issue arose at step 12.
> 
> 
> 1.Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat
> Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07 hosting HE-VM
> and it's an SPM and it was the first initial ha-host on which HE was
> deployed first, then alma03 and alma04 ha-hosts were added, all three hosts
> were IBRS CPU hosts) with following components:
> ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
> ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
> Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64
> x86_64 x86_64 GNU/Linux
> 2.Added NFS data storage for guest-VMs.
> 3.Added 6 guest-VMs, 3RHEL7 and 3RHEL8 and distributed them evenly across 3
> ha-hosts.
> 4.Set the environment to global maintenance, stopped the engine ("systemctl
> stop ovirt-engine" on the engine-VM) and created backup file from the engine
> "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3
> --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
> 5.Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and
> nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
> 6.Reprovisioned alma07 to latest RHEL8.2 with these components:
> ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
> ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
> rhvm-appliance-4.4-20200604.0.el8ev.x86_64
> Red Hat Enterprise Linux release 8.2 (Ootpa)
> Linux 4.18.0-193.11.1.el8_2.x86_64 #1 SMP Fri Jun 26 16:18:58 UTC 2020
> x86_64 x86_64 x86_64 GNU/Linux
> 7.Copied backup file from laptop to /root on reprovisioned and clean alma07.
> 8.Restored engine's DB using "hosted-engine --deploy
> --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3" and got
> Software Version:4.4.1.2-0.10.el8ev engine deployed on alma07, using
> rhvm-appliance-4.4-20200604.0.el8ev.x86_64.

How did this finish? Did you have a running engine at this point? The next step suggests you did...

> 9.Removed global maintenance from the UI of the engine and then moved alma03
> and alma04 in to local maintenance. 

... can you confirm you were logging into the new 4.4 instance?

> 10.Placed alma03 to local maintenance and then removed it from the
> environment to reprovision it to RHEL8.2.
> 11.Installed ovirt-hosted-engine-setup on alma03 and added it back to
> environment as ha-host.
> 12.Tried to move alma04 to local maintenance, but failed to do so, it was
> running 3 VMs, 2 guest-VMs, which got migrated and 1 old HE-VM, which could
> not get migrated and host got stuck in "Preparing for maintenance".

so the 2 guest VMs make sense, since that's what you had initially. The question is how the HE VM started on this host. It seems this can happen when you remove global maintenance in step 9 while reusing the same HE SD.
Did you use the same HE SD without wiping it?

The rest is not too interesting, as the problem already happens here.

Comment 5 Nikolai Sednev 2020-07-02 12:23:22 UTC
(In reply to Michal Skrivanek from comment #4)
> (In reply to Nikolai Sednev from comment #3)
> > (In reply to Michal Skrivanek from comment #2)
> > > please provide more details on the exact steps taken.
> > The issue arose at step 12.
> > 
> > 
> > 1.Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat
> > Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07 hosting HE-VM
> > and it's an SPM and it was the first initial ha-host on which HE was
> > deployed first, then alma03 and alma04 ha-hosts were added, all three hosts
> > were IBRS CPU hosts) with following components:
> > ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
> > ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
> > Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64
> > x86_64 x86_64 GNU/Linux
> > 2.Added NFS data storage for guest-VMs.
> > 3.Added 6 guest-VMs, 3RHEL7 and 3RHEL8 and distributed them evenly across 3
> > ha-hosts.
> > 4.Set the environment to global maintenance, stopped the engine ("systemctl
> > stop ovirt-engine" on the engine-VM) and created backup file from the engine
> > "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3
> > --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
> > 5.Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and
> > nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
> > 6.Reprovisioned alma07 to latest RHEL8.2 with these components:
> > ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
> > ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
> > rhvm-appliance-4.4-20200604.0.el8ev.x86_64
> > Red Hat Enterprise Linux release 8.2 (Ootpa)
> > Linux 4.18.0-193.11.1.el8_2.x86_64 #1 SMP Fri Jun 26 16:18:58 UTC 2020
> > x86_64 x86_64 x86_64 GNU/Linux
> > 7.Copied backup file from laptop to /root on reprovisioned and clean alma07.
> > 8.Restored engine's DB using "hosted-engine --deploy
> > --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3" and got
> > Software Version:4.4.1.2-0.10.el8ev engine deployed on alma07, using
> > rhvm-appliance-4.4-20200604.0.el8ev.x86_64.
> 
> how did this finish. Did you have a running engine at this point? next step
> suggests you did....
Yes
> 
> > 9.Removed global maintenance from the UI of the engine and then moved alma03
> > and alma04 in to local maintenance. 
> 
> ... can you confirm you were logging into the new 4.4 instance?
Yes
> 
> > 10.Placed alma03 to local maintenance and then removed it from the
> > environment to reprovision it to RHEL8.2.
> > 11.Installed ovirt-hosted-engine-setup on alma03 and added it back to
> > environment as ha-host.
> > 12.Tried to move alma04 to local maintenance, but failed to do so, it was
> > running 3 VMs, 2 guest-VMs, which got migrated and 1 old HE-VM, which could
> > not get migrated and host got stuck in "Preparing for maintenance".
> 
> so the 2 guest VMs make sense, since that's what you had initially. The
> question is how did the HE start on this host. It seems it can happen when
> you remove the global maintenance in step 9 while reusing the same HE SD. 
> Did you use the same HE SD without wiping it?
> 
No, I think I hit https://bugzilla.redhat.com/show_bug.cgi?id=1830872. The deployment was made on an entirely new storage volume on a different storage.
> the rest is not too interesting as the problem happens here already.

It's probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872 and the older appliance used during verification, hence I'll try to verify again using the latest rhvm-appliance-4.3-20200702.0.el7.

Comment 6 Michal Skrivanek 2020-07-02 13:37:02 UTC
new SD, ok. Why do you think bug 1830872 is related?

if you're doing it again, please try to find out whether, after step 9, you still see global maintenance set on the old hosts (since they are looking at the old SD, that should be the case). At no point in time may the old hosts move out of global maintenance - that would result in exactly what you've reported.
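
A quick way to check that on each remaining old (4.3) host, as a sketch (the exact banner text may vary between ovirt-hosted-engine-ha versions):

alma04 ~]# hosted-engine --vm-status | grep -i 'global maintenance'
# expected while the old stack is still in global maintenance:
# !! Cluster is in GLOBAL MAINTENANCE mode. Will not touch VMs !!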

Error at step 16 is a different thing and fixed already (bug 1847513)

Comment 7 Sandro Bonazzola 2020-07-02 14:03:19 UTC
Waiting for new tests results using ovirt-engine >= 4.4.1.5 for covering bug #1847513 and bug #1830872

Comment 8 Nikolai Sednev 2020-07-02 17:11:10 UTC
(In reply to Michal Skrivanek from comment #6)
> new SD, ok. Why do you think bug 1830872 is related?
> 
> if you're doing it again, please try to find out if after step 9 you still
> see global maintenance set on the old hosts (since they are looking at the
> old SD it should be the case). At no point in time the old hosts may move
> out of global maintenance - that would result exactly in what you've reported
> 
> Error at step 16 is a different thing and fixed already (bug 1847513)

The issue, I think, is with an old appliance, but it's the latest available to QA from CI in brew.
rhvm-appliance-4.4-20200604.0.el8ev = Software Version:4.4.1.2-0.10.el8ev; I have to get an engine with Software Version:4.4.1.5-0.17.el8ev to get rid of https://bugzilla.redhat.com/show_bug.cgi?id=1830872.
I'll try to find a way to update it to the latest bits before engine-setup starts during the restore.

Comment 9 Nikolai Sednev 2020-07-06 13:14:15 UTC
Works for me on latest Software Version:4.4.1.7-0.3.el8ev.
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Reported issue no longer exists.

Comment 10 Nikolai Sednev 2020-07-07 13:35:22 UTC
I've hit the same situation again; I see the old HE-VM keeps running on one of the old hosts like a zombie:
alma03 ~]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 4     HostedEngine                   running
On the reprovisioned alma07 I performed the restore, and I see the new HE-VM running as expected:
alma07 ~]# virsh -r list --all
 Id   Name           State
------------------------------
 2    HostedEngine   running
 3    VM5            running
 4    VM2            running

The SD used for the new engine is nsednev_he_2; the old SD is nsednev_he_1.
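
A quick way to confirm which storage domain the stale HostedEngine on alma03 is actually pointing at (a sketch; the domainID and lease path appear in the ovirt-vm metadata, as in the XML dump in the description):

alma03 ~]# virsh -r dumpxml HostedEngine | grep -E 'domainID|leasePath'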

We should not get this flow during restore.
I still have the environment; please contact me if you need a reproducer.

Reopening this bug for investigation.

Comment 11 Nikolai Sednev 2020-07-07 13:39:27 UTC
Components used on engine:
ovirt-engine-setup-4.4.1.7-0.3.el8ev.noarch
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

On host alma07 (the one that was running the restore):
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
qemu-kvm-4.2.0-28.module+el8.2.1+7211+16dfe810.x86_64
libvirt-client-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
sanlock-3.8.0-2.el8.x86_64
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Comment 12 Nikolai Sednev 2020-07-07 14:21:48 UTC
Steps during reproduction:
1.Deployed Software Version:4.3.10.3-0.2.el7 over NFS on 3 Red Hat Enterprise Linux Server release 7.9 Beta (Maipo) hosts (alma07, hosting the HE-VM and acting as SPM, was the initial ha-host on which HE was deployed; the alma03 and alma04 ha-hosts were added afterwards; all three hosts had IBRS CPUs) with the following components:
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
2.Added 6 guest-VMs, 3RHEL7 and 3RHEL8 and distributed them evenly across 3 ha-hosts.
3.Set the environment to global maintenance, stopped the engine ("systemctl stop ovirt-engine" on the engine-VM) and created backup file from the engine "engine-backup --mode=backup --file=nsednev_from_alma07_SPM_rhevm_4_3 --log=Log_nsednev_from_alma07_SPM_rhevm_4_3".
4.Copied both files (Log_nsednev_from_alma07_SPM_rhevm_4_3 and nsednev_from_alma07_SPM_rhevm_4_3) to my laptop.
5.Reprovisioned alma07 to latest RHEL8.2 with these components:
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
qemu-kvm-4.2.0-28.module+el8.2.1+7211+16dfe810.x86_64
libvirt-client-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
sanlock-3.8.0-2.el8.x86_64
Linux 4.18.0-193.12.1.el8_2.x86_64 #1 SMP Thu Jul 2 15:48:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)
6.Copied backup file from laptop to /root on reprovisioned and clean alma07.
7.Restored the engine's DB using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma07_SPM_rhevm_4_3", fetched the newest repos to the engine during deployment, and got a Software Version:4.4.1.7-0.3.el8ev engine deployed on alma07.
8.Removed global maintenance from the UI of the engine.
9.Moved alma03 to local maintenance. 
10.Found that alma03 was stuck in "Preparing for maintenance", checked its VMs, and found that a VM named "HostedEngine" was running in parallel on both alma03 and alma07!

Comment 13 Sandro Bonazzola 2020-07-08 08:27:55 UTC
(In reply to Nikolai Sednev from comment #12)

> 8.Removed global maintenance from the UI of the engine.

Which host did you use for removing global maintenance?

Comment 14 Nikolai Sednev 2020-07-08 08:44:58 UTC
(In reply to Sandro Bonazzola from comment #13)
> (In reply to Nikolai Sednev from comment #12)
> 
> > 8.Removed global maintenance from the UI of the engine.
> 
> Which host did you use for removing global maintenance?

From the UI I clicked on alma03 to highlight it and then disabled global maintenance when the option appeared. It should not matter which host I click on to get the option to disable global maintenance, as it is global and not host-specific; global maintenance affects all ha-hosts in the hosted-engine host-cluster.
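
For what it's worth, the same operation can be done from the CLI on an explicitly chosen host, which removes the ambiguity of which HA stack receives the command (a sketch; run it on a new, already-upgraded ha-host such as alma07):

alma07 ~]# hosted-engine --set-maintenance --mode=none   # leave global maintenance
alma07 ~]# hosted-engine --vm-status                     # confirm what the HA agents now see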

Comment 15 Michal Skrivanek 2020-07-08 13:09:49 UTC
Thank you for clarifying which host was selected. alma03 was an old host at that time; selecting it caused the old setup to reactivate, which is why it started the old HE VM. This must not happen; we need to make it clear in the documentation.

I suggest moving this to Documentation: recommend moving out of global maintenance only once all HE hosts are upgraded, and warn about making sure you select the new host if you need/want to do it sooner.

Comment 16 Steve Goodman 2020-07-09 16:06:21 UTC
(In reply to Michal Skrivanek from comment #15)
> I suggest to move this to Documentation, suggest to move out of global
> maintenance only once all HE hosts are upgraded, and warn about making sure
> you select the new host if you need/want to do it sooner

I addressed this as part of bug 1802650.

Comment 20 Nikolai Sednev 2020-07-23 10:54:52 UTC
Please fill in the "Fixed In Version:" field with the exact component version.

Comment 21 Nikolai Sednev 2020-07-23 12:52:41 UTC
Latest rhvm available for QA is rhvm-4.4.1.10-0.1.el8ev.noarch from rhvm-appliance-4.4-20200722.0.el8ev.x86_64.rpm.
Nothing to verify.

Comment 22 Nikolai Sednev 2020-07-23 15:56:22 UTC
The issue has been fixed.
alma04 ~]# virsh -r list --all
 Id    Name                           State
----------------------------------------------------
 1     VM5                            running
 3     VM6                            running
 4     VM4                            running
 5     VM3                            running

alma07 ~]# virsh -r list --all
 Id   Name           State
------------------------------
 2    HostedEngine   running
 3    VM1            running
 4    VM2            running

alma03 had been sent to local maintenance, then removed for further host upgrade.

No "zombie" old HE-VM appeared when I tried to disable global maintenance from either of the old 4.3 ha-hosts (alma03 or alma04; check the attached recordings).
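
The per-host check behind that statement, as a small sketch (hostnames are the ones from this environment; passwordless root ssh is assumed):

for h in alma04 alma07; do
    echo "== $h =="
    ssh root@$h 'virsh -r list --all'
done
# only alma07, the new 4.4 host, should show a running HostedEngine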

Works for me on:
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.5-1.el8ev.noarch
rhvm-appliance-4.4-20200722.0.el8ev.x86_64
Software Version:4.4.1.10-0.1.el8ev
Linux 4.18.0-193.13.2.el8_2.x86_64 #1 SMP Mon Jul 13 23:17:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

The backup was made from an iSCSI-deployed HE 4.3 and the restore was performed to a different iSCSI volume on the same storage.
Moving to verified.

Comment 23 Nikolai Sednev 2020-07-23 15:57:36 UTC
Created attachment 1702248 [details]
recording 1

Comment 24 Nikolai Sednev 2020-07-23 15:58:00 UTC
Created attachment 1702249 [details]
recording 2

Comment 25 Nikolai Sednev 2020-07-23 15:58:38 UTC
Created attachment 1702250 [details]
recording 3

Comment 26 Sandro Bonazzola 2020-08-05 06:28:20 UTC
This bugzilla is included in oVirt 4.4.1.1 Async release, published on July 13th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

