Bug 1422470 - [downstream clone - 4.0.7] Restoring self-hosted engine from backup has conflict between new and old HostedEngine VM
Summary: [downstream clone - 4.0.7] Restoring self-hosted engine from backup has conflict between new and old HostedEngine VM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1240466
Blocks:
 
Reported: 2017-02-15 12:03 UTC by rhev-integ
Modified: 2020-05-14 15:38 UTC
CC: 21 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, when restoring a backup of a self-hosted engine in a different environment for disaster recovery purposes, administrators were sometimes required to remove the previous self-hosted engine's storage domain and virtual machine. This had to be done from within the engine's database, which is a risk-prone procedure. In this release, a new CLI option enables administrators to remove the previous self-hosted engine's storage domain and virtual machine directly from the backup of the engine during the restore procedure.
Clone Of: 1240466
Environment:
Last Closed: 2017-03-16 15:33:46 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0542 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 64966 0 master MERGED hosted-engine: add a DB cleaner utility 2020-11-16 10:13:44 UTC

Description rhev-integ 2017-02-15 12:03:42 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1240466 +++
======================================================================

Description of problem:
The current backup/restore procedure for the self-hosted engine runs into a conflict: the old HostedEngine VM is still present in the restored database, and this causes problems when the new environment is deployed.

After the Manager has been restored (and engine-setup has run), the HostedEngine VM appears in the environment in an Unknown state, as do all VMs until the new host finishes deployment, becomes active, and contends for SPM, after which all VMs go into a Down state (non-HostedEngine VMs can then be started). The HostedEngine VM shown in the Manager is a ghost from the backup, and it prevents the new HostedEngine VM from appearing, presumably because the old and new HostedEngine VMs have the same name (I imagine this would not be the case if the old VM's name had been edited, as in: https://access.redhat.com/articles/1248993). The old HostedEngine VM cannot be brought into an Up state or removed by any conventional means because it is not controlled by the Manager.

The current workaround is to edit the name of the new HostedEngine VM to differentiate the two VMs, at which point it appears in the Manager as 'external-<newName>' and in an Up state. After this, the old HostedEngine VM (and its associated snapshot) can be removed from the engine database. This procedure is documented in the following article: https://access.redhat.com/solutions/1517683
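
As an illustration only, a read-only query of this kind can help locate the ghost VM before following the article (the table and column names are assumptions about the engine database schema, and the linked article remains the authoritative procedure):

# (assumption: local access to the 'engine' database as the postgres user)
# sudo -u postgres psql engine -c "SELECT vm_guid, vm_name FROM vm_static WHERE vm_name = 'HostedEngine';"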

More information can also be found throughout BZ#1232136


Version-Release number of selected component (if applicable):
3.4 and 3.5 (Have not tested with 3.3)

How reproducible:
Every time.

Steps to Reproduce:
1. Back up the self-hosted engine with the engine-backup tool
2. Deploy the self-hosted engine on a new host (this can also be an old host, provided it was not hosting any VMs at the time of backup)
3. Restore the self-hosted engine with the engine-backup tool on the new HostedEngine VM and run engine-setup (see the command sketch after this list)
4. Log into the freshly restored Manager
5. Shake fist at persistent ghost of old HostedEngine VM
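
For reference, the backup and restore invocations used in these steps have roughly the following shape (file names are placeholders and the exact options depend on the product version; see comments 10 and 35 for concrete runs):

# engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
# engine-backup --mode=restore --file=engine.backup --log=engine-restore.log --provision-db --restore-permissions
# engine-setup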

Actual results:
The HostedEngine VM is in an Unknown and then a Down state, and cannot be brought into an Up state or removed by conventional means.

Expected results:
The new HostedEngine VM supersedes the old one and is in Up state at the completion of hosted-engine deployment.

Additional info:
I have saved logs from the last time I ran this procedure, if they would be useful.

(Originally by Andrew Burden)

Comment 1 rhev-integ 2017-02-15 12:03:55 UTC
Andrew, can you attach the logs? Roy, I think we should check this workflow with the import/edit feature you're working on.

(Originally by Sandro Bonazzola)

Comment 7 rhev-integ 2017-02-15 12:04:34 UTC
Meital, has the existing procedure been tested on 3.6?

(Originally by Sandro Bonazzola)

Comment 8 rhev-integ 2017-02-15 12:04:41 UTC
(In reply to Sandro Bonazzola from comment #6)
> Meital, has the existing procedure been tested on 3.6?

We followed http://file.bne.redhat.com/~juwu/Self-Hosted_Engine_Guide/#Backup_and_Restore_Overview during the migration from a 3.6 bare-metal engine to a 3.6 hosted engine. The backup and restore were performed according to that guide, and everything worked fine.

(Originally by Nikolai Sednev)

Comment 9 rhev-integ 2017-02-15 12:04:48 UTC
Nikolai, note that migration from bare metal and restore of a hosted-engine backup are different. Please check restoring the backup.

(Originally by Sandro Bonazzola)

Comment 10 rhev-integ 2017-02-15 12:04:56 UTC
Performed:
1) On the HE VM, with the engine, DWH, reports, and console proxy running, I ran the command "engine-backup --mode=backup --file=nsednev_from_nsednev_he_1_rhevm_3_6 --log=Log_nsednev_from_nsednev_he_1_rhevm_3_6".

Results:
Backing up:
Notifying engine
- Files
- Engine database 'engine'
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
Packing into file 'nsednev_from_nsednev_he_1_rhevm_3_6'
Notifying engine
Done.

2) Copied Log_nsednev_from_nsednev_he_1_rhevm_3_6 and nsednev_from_nsednev_he_1_rhevm_3_6 to the second (new) host, which will replace the first hosted-engine host.

3) Powered off the first hosted-engine host.
4) Deployed HE to a clean NFS share on the second (new) host and, during deployment, answered "no" to "Automatically execute engine-setup on the engine appliance (rhevm-appliance-20160413.0-1) on first boot (Yes, No)[Yes]?".
5) Copied the backed-up files to /root/backup on the engine VM and restored them:
[root@nsednev-he-1 ~]# mkdir backup
[root@nsednev-he-1 ~]# ll
total 8724
drwxr-xr-x. 2 root root    4096 Apr 19 05:27 backup
-rw-r--r--. 1 root root    3830 Apr 19 05:24 Log_nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root 8919541 Apr 19 05:25 nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root    1117 Apr 13 11:55 ovirt-engine-answers
[root@nsednev-he-1 ~]# cp Log_nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# cp nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# engine-backup --mode=restore --log=/root/backup/Log_nsednev_from_nsednev_he_1_rhevm_3_6 --file=/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6 --provision-db --provision-dwh-db --provision-reports-db --restore-permissions
Preparing to restore:
- Unpacking file '/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6'
Restoring:
- Files
Provisioning PostgreSQL users/databases:
- user 'engine', database 'engine'
- user 'ovirt_engine_history', database 'ovirt_engine_history'
- user 'ovirt_engine_reports', database 'ovirt_engine_reports'
Restoring:
- Engine database 'engine'
  - Cleaning up temporary tables in engine database 'engine'
  - Resetting DwhCurrentlyRunning in dwh_history_timekeeping in engine database
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
You should now run engine-setup.
Done.
6) Ran engine-setup and, once it finished, continued with the HE deployment on the second host.
7) Selected "(1) Continue setup - oVirt-Engine installation is ready and ovirt-engine service is up".
8) Screenshots are attached.

(Originally by Nikolai Sednev)

Comment 11 rhev-integ 2017-02-15 12:05:05 UTC
Created attachment 1149102 [details]
HE_unknown.png

(Originally by Nikolai Sednev)

Comment 12 rhev-integ 2017-02-15 12:05:12 UTC
Created attachment 1149104 [details]
sosreport from the backed up engine

(Originally by Nikolai Sednev)

Comment 13 rhev-integ 2017-02-15 12:05:20 UTC
Created attachment 1149105 [details]
sosreport from second host

(Originally by Nikolai Sednev)

Comment 14 rhev-integ 2017-02-15 12:05:27 UTC
Roy, Didi, Simone, can you have a look at the procedure, screenshots, and logs?
We should end up with the VM status known and the storage up.

(Originally by Sandro Bonazzola)

Comment 15 rhev-integ 2017-02-15 12:05:35 UTC
Adding screenshot from HE Storage.

(Originally by Nikolai Sednev)

Comment 16 rhev-integ 2017-02-15 12:05:42 UTC
Created attachment 1149112 [details]
Storage screenshot.png

(Originally by Nikolai Sednev)

Comment 17 rhev-integ 2017-02-15 12:05:50 UTC
The procedure as described in https://bugzilla.redhat.com/show_bug.cgi?id=1240466#c9 is fine.
We need to retest it since, AFAIK, it is now working correctly.

(Originally by Simone Tiraboschi)

Comment 18 rhev-integ 2017-02-15 12:05:57 UTC
Checked on:
rhevm-3.6.7.3-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch

The problem still exists: because a VM with the name "HostedEngine" exists under the engine, the engine cannot start the auto-import process for the new "HostedEngine" VM.
2016-06-09 11:15:38,016 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (org.ovirt.thread.pool-6-thread-8) [66c4f646] Failed importing the Hosted Engine VM
See attached engine log.

(Originally by Artyom Lukianov)

Comment 19 rhev-integ 2017-02-15 12:06:03 UTC
Created attachment 1166349 [details]
new engine log

(Originally by Artyom Lukianov)

Comment 20 rhev-integ 2017-02-15 12:06:12 UTC
I also tried to destroy the HE storage domain to start the auto-import process from the beginning, but now the engine fails to import the HE SD at all.
From host vdsm log:
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 104, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
IndexError: list index out of range
So I will also add host vdsm logs.

(Originally by Artyom Lukianov)

Comment 21 rhev-integ 2017-02-15 12:06:20 UTC
Created attachment 1166353 [details]
new vdsm log

(Originally by Artyom Lukianov)

Comment 22 rhev-integ 2017-02-15 12:06:29 UTC
OK, so now I see the whole picture.

What is failing now is basically a full migration of a hosted-engine setup from one storage domain to a new one.

This is failing because the engine backup itself contains references to the previous hosted-engine storage domain and to the old engine VM, with different disk UUIDs and so on.

This is not blocking the migration from 3.6 EL6 EAP6 to 4.0 EL7 EAP7 (rhbz#1302228), since in that case we are using the same hosted-engine storage domain and the same VM, just editing it to use a new disk deployed from the EL7 appliance.

In order to let the user move from one hosted-engine storage domain to another, which can be a good idea if the whole storage device failed or if the user wants to change the hosted-engine storage domain type, we also need to somehow (at backup or at restore time) filter out any reference to the old hosted-engine storage domain and to the old hosted-engine VM (the engine will then look for them as it does just after a fresh deployment).
We also need to ask the user to redeploy all the other hosts, since the new metadata and lockspace volumes don't contain any reference to them; we didn't clone the two volumes but just recreated them.

(Originally by Simone Tiraboschi)

Comment 23 rhev-integ 2017-02-15 12:06:36 UTC
Moving to 4.0.1 based on comment #21.

(Originally by Yaniv Dary)

Comment 24 rhev-integ 2017-02-15 12:06:44 UTC
Can you provide the manual steps to make it work, for a KBase article, or can we not support this at this point?

(Originally by Yaniv Dary)

Comment 26 rhev-integ 2017-02-15 12:06:59 UTC
Simone, please check if this is still an issue.
Migration from 3.6 to 4.0 looks pretty much like the backup/restore described here.

(Originally by Sandro Bonazzola)

Comment 27 rhev-integ 2017-02-15 12:07:06 UTC
(In reply to Sandro Bonazzola from comment #25)
> Simone please check if this is still an issue.
> Migration from 3.6 to 4.0 looks pretty much like the backup / restore
> described here.

Yes, it's still an issue.
In the 3.6 to 4.0 migration we are basically upgrading in place, so the engine VM UUID, the hosted-engine storage domain UUID, and the host UUIDs are really the same, and so there is no issue there.

This issue will instead happen when we try to restore (for disaster recovery purposes, for instance) an engine backup taken on one environment onto a slightly or completely different one.
In that case, for instance, the UUID of the new engine VM and the UUID of its disk could differ from what we have in the engine DB; the same goes for the old hosted-engine storage domain (the engine will try to remount what it originally imported), and so on.
Unfortunately, due to other reasons, the hosted-engine elements are also locked in the engine, so the user cannot simply remove them through the engine.

Probably the best solution would be to filter them out at backup creation or restore time.

(Originally by Simone Tiraboschi)
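
The filtering described above is what eventually shipped: engine-backup gained restore options (--he-remove-storage-vm, --he-remove-hosts) that drop these references at restore time; they are exercised in comment 35 below. A minimal sketch, with file names as placeholders:

# engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-db --provision-dwh-db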

Comment 28 rhev-integ 2017-02-15 12:07:15 UTC
Verified on rhevm-4.1.0.2-0.2.el7.noarch

(Originally by Artyom Lukianov)

Comment 35 Artyom 2017-02-28 11:25:43 UTC
Verified on:
rhevm-4.0.7.3-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.2-1.el7ev.noarch

1. Deploy the HE environment
2. Add the storage domain to the engine (to start the auto-import process)
3. Wait until the engine has the HE VM
4. Set global maintenance
5. Back up the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host
7. Clean the host from the HE deployment (reprovisioning)
8. Run the HE deployment again
9. Answer "No" to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]?"
10. Log in to the HE VM and copy the backup file from the host to the HE VM
11. Run the restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
12. Run engine-setup: # engine-setup --offline
13. Finish the HE deployment process

The engine is up and has the HE SD and HE VM in an active state.

Be aware of the following bugs under 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1416459
https://bugzilla.redhat.com/show_bug.cgi?id=1416466

Comment 36 Artyom 2017-03-02 08:42:29 UTC
Verified on the correct version:
# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-2.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

Comment 38 Artyom 2017-03-02 14:11:12 UTC
After a number of backup/restore operations, it looks like we still have a problem with the auto-import operation:
2017-03-02 05:07:40,924 INFO  [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] VDS '847b4fe4-671c-4635-ac0a-6801064c3c95' was added to the Resource Manager
2017-03-02 05:07:40,951 INFO  [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] Finished initializing ResourceManager
2017-03-02 05:07:40,981 INFO  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (ServerService Thread Pool -- 55) [] Command [id=aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70]: Compensating DELETED_OR_UPDATED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainDynamic; snapshot: id=00000000-0000-0000-0000-000000000000.
2017-03-02 05:07:40,992 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (ServerService Thread Pool -- 55) [] transaction rolled back
2017-03-02 05:07:40,992 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 55) [] Failed to run compensation on startup for Command 'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70': CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic (
        available_disk_size,
        id,
        used_disk_size
        )
    VALUES (
        v_available_disk_size,
        v_id,
        v_used_disk_size
        )"
PL/pgSQL function insertstorage_domain_dynamic(integer,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic (

We are missing the following patch in our downstream build: https://gerrit.ovirt.org/#/c/72902

Comment 39 Artyom 2017-03-02 15:55:53 UTC
Verified with new vdsm:
vdsm-4.18.24-3.el7ev.x86_64

Ran the backup/restore process 5 times; in all 5 runs the HE VM and HE SD were active.

Comment 42 errata-xmlrpc 2017-03-16 15:33:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html

