Bug 1240466 - Restoring self-hosted engine from backup has conflict between new and old HostedEngine VM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.0-beta
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1409112
Blocks: 1420604 1422470
 
Reported: 2015-07-07 01:44 UTC by Andrew Burden
Modified: 2020-05-14 14:59 UTC
CC: 18 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, when restoring a backup of a self-hosted engine on a different environment, administrators were sometimes required to remove the previous self-hosted engine's storage domain and virtual machine. This was accomplished from within the engine's database, which was a risk-prone procedure. With this update, a new CLI option enables administrators to remove the previous self-hosted engine's storage domain and virtual machine directly from the backup of the engine during the restore procedure.
Clone Of:
Clones: 1422470
Environment:
Last Closed: 2017-04-25 00:51:27 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
nsednev: testing_plan_complete+


Attachments
HE_unknown.png (130.81 KB, image/png) - 2016-04-20 13:54 UTC, Nikolai Sednev
sosreport from the backed up engine (7.03 MB, application/x-xz) - 2016-04-20 14:07 UTC, Nikolai Sednev
sosreport from second host (9.60 MB, application/x-xz) - 2016-04-20 14:09 UTC, Nikolai Sednev
Storage screenshot.png (130.95 KB, image/png) - 2016-04-20 14:34 UTC, Nikolai Sednev
new engine log (919.99 KB, text/plain) - 2016-06-09 15:19 UTC, Artyom
new vdsm log (2.40 MB, application/zip) - 2016-06-09 15:40 UTC, Artyom


Links
Red Hat Bugzilla 1241811 (high, CLOSED): Adding hosted-engine hosts to restored engine is messy (last updated 2021-02-22 00:41:40 UTC)
Red Hat Product Errata RHEA-2017:1002 (normal, SHIPPED_LIVE): ovirt-hosted-engine-setup bug fix and enhancement update (2017-04-18 20:14:37 UTC)
oVirt gerrit 64966 (master, MERGED): hosted-engine: add a DB cleaner utility (2021-01-24 12:39:47 UTC)

Internal Links: 1241811

Description Andrew Burden 2015-07-07 01:44:53 UTC
Description of problem:
The current backup/restore procedure for the self-hosted engine runs into a conflict where the old HostedEngine VM is still present in the restored database, which creates a problem when the new environment is deployed.

After the Manager has been restored (and engine-setup has run), the HostedEngine VM is present in the environment in an Unknown state, as are all VMs until the new host has finished deployment, becomes active, and contends for SPM, after which all VMs go into a Down state (non-HostedEngine VMs can then be started). The HostedEngine VM that is present in the Manager is a ghost from the backup, and it prevents the new HostedEngine VM from appearing, presumably because the old and new HostedEngine VMs have the same name (I imagine this would not be the case if the old VM had had its name edited, as in https://access.redhat.com/articles/1248993). The old HostedEngine VM cannot be brought into an Up state or removed by any conventional means because it is not controlled by the Manager.

The current workaround is to edit the name of the new HostedEngine VM, to differentiate the two VMs, at which point it will appear in the Manager as 'external-<newName>' and in an Up state. After this, the old HostedEngine VM (and associated snapshot) can be removed from the engine database. This procedure is documented in the following article: https://access.redhat.com/solutions/1517683
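
For orientation only, a read-only sketch of locating the leftover record in the engine database (this assumes the standard engine schema, where vm_static holds vm_guid and vm_name; the actual removal touches several related tables and should follow the KB article above):

# run as root on the engine VM; lists the stale HostedEngine row, modifies nothing
sudo -u postgres psql engine -c \
  "SELECT vm_guid, vm_name FROM vm_static WHERE vm_name = 'HostedEngine';"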

More information can also be found throughout BZ#1232136.


Version-Release number of selected component (if applicable):
3.4 and 3.5 (Have not tested with 3.3)

How reproducible:
Every time.

Steps to Reproduce:
1. Backup Self-Hosted Engine with engine-backup tool
2. Deploy Self-Hosted Engine on new host (can also be an old host, provided it was not hosting any VMs at the time of backup)
3. Restore Self-Hosted Engine with the engine-backup tool on the new HostedEngine VM and run engine-setup (see the command sketch below)
4. Log into the freshly restored Manager
5. Shake fist at persistent ghost of old HostedEngine VM
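
In command terms, steps 1-3 amount to roughly the following (file names are placeholders; comment 9 below has a full transcript, including the extra --provision-dwh-db/--provision-reports-db flags used when DWH and Reports are installed):

# 1. on the existing engine VM
engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
# 2. on the new host (copy the backup file to the new engine VM once it is up)
hosted-engine --deploy
# 3. on the new engine VM
engine-backup --mode=restore --file=engine.backup --log=engine-restore.log --provision-db --restore-permissions
engine-setup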

Actual results:
The HostedEngine VM is in an Unknown and then a Down state, and cannot be brought into an Up state or removed by conventional means.

Expected results:
The new HostedEngine VM supersedes the old one and is in Up state at the completion of hosted-engine deployment.

Additional info:
I have saved logs from the last time I ran this procedure, if they would be useful.

Comment 1 Sandro Bonazzola 2015-07-07 06:38:23 UTC
Andrew, can you attach the logs? Roy, I think we should check this workflow with the import/edit feature you're working on.

Comment 6 Sandro Bonazzola 2016-04-11 09:39:48 UTC
Meital, has the existing procedure been tested on 3.6?

Comment 7 Nikolai Sednev 2016-04-14 08:43:30 UTC
(In reply to Sandro Bonazzola from comment #6)
> Meital, has the existing procedure been tested on 3.6?

We followed http://file.bne.redhat.com/~juwu/Self-Hosted_Engine_Guide/#Backup_and_Restore_Overview during the 3.6 bare-metal-based engine migration to 3.6 HE. The backup and restore were done according to that guide. Everything worked fine.

Comment 8 Sandro Bonazzola 2016-04-18 09:43:05 UTC
Nikolai, note that migration from bare metal and restore of a hosted-engine backup are different things. Please check the restore of the backup.

Comment 9 Nikolai Sednev 2016-04-20 13:53:50 UTC
Performed:
1) On the HE VM with the engine, DWH, reports, and console proxy running, I ran the "engine-backup --mode=backup --file=nsednev_from_nsednev_he_1_rhevm_3_6 --log=Log_nsednev_from_nsednev_he_1_rhevm_3_6" command.

Results:
Backing up:
Notifying engine
- Files
- Engine database 'engine'
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
Packing into file 'nsednev_from_nsednev_he_1_rhevm_3_6'
Notifying engine
Done.

2) Copied Log_nsednev_from_nsednev_he_1_rhevm_3_6 and nsednev_from_nsednev_he_1_rhevm_3_6 to the second (new) host, which will replace the first hosted-engine host.

3) Powered off the first hosted-engine host.
4) Deployed HE to a clean NFS share on the second (new) host, and answered "no" to "Automatically execute engine-setup on the engine appliance (rhevm-appliance-20160413.0-1) on first boot (Yes, No)[Yes]?" during deployment.
5) Copied the backed-up files to /root/backup on the engine VM and restored them:
[root@nsednev-he-1 ~]# mkdir backup
[root@nsednev-he-1 ~]# ll
total 8724
drwxr-xr-x. 2 root root    4096 Apr 19 05:27 backup
-rw-r--r--. 1 root root    3830 Apr 19 05:24 Log_nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root 8919541 Apr 19 05:25 nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root    1117 Apr 13 11:55 ovirt-engine-answers
[root@nsednev-he-1 ~]# cp Log_nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# cp nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# engine-backup --mode=restore --log=/root/backup/Log_nsednev_from_nsednev_he_1_rhevm_3_6 --file=/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6 --provision-db --provision-dwh-db --provision-reports-db --restore-permissions
Preparing to restore:
- Unpacking file '/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6'
Restoring:
- Files
Provisioning PostgreSQL users/databases:
- user 'engine', database 'engine'
- user 'ovirt_engine_history', database 'ovirt_engine_history'
- user 'ovirt_engine_reports', database 'ovirt_engine_reports'
Restoring:
- Engine database 'engine'
  - Cleaning up temporary tables in engine database 'engine'
  - Resetting DwhCurrentlyRunning in dwh_history_timekeeping in engine database
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
You should now run engine-setup.
Done.
6) Ran engine-setup and, once it finished, continued with HE deployment on the second host.
7) Selected "(1) Continue setup - oVirt-Engine installation is ready and ovirt-engine service is up".
8) Got the screenshots as attached.

Comment 10 Nikolai Sednev 2016-04-20 13:54:36 UTC
Created attachment 1149102 [details]
HE_unknown.png

Comment 11 Nikolai Sednev 2016-04-20 14:07:45 UTC
Created attachment 1149104 [details]
sosreport from the backed up engine

Comment 12 Nikolai Sednev 2016-04-20 14:09:27 UTC
Created attachment 1149105 [details]
sosreport from second host

Comment 13 Sandro Bonazzola 2016-04-20 14:32:34 UTC
Roy, Didi, Simone, can you have a look at the procedure, screenshots, and logs?
We should end up with the VM status known and the storage up.

Comment 14 Nikolai Sednev 2016-04-20 14:33:22 UTC
Adding screenshot from HE Storage.

Comment 15 Nikolai Sednev 2016-04-20 14:34:04 UTC
Created attachment 1149112 [details]
Storage screenshot.png

Comment 16 Simone Tiraboschi 2016-06-08 07:50:50 UTC
The procedure as described in https://bugzilla.redhat.com/show_bug.cgi?id=1240466#c9 is fine.
We need to retest it since, as far as I know, it is now working correctly.

Comment 17 Artyom 2016-06-09 15:18:53 UTC
Checked on:
rhevm-3.6.7.3-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch

The problem still exists: because a VM with the name "HostedEngine" already exists under the engine, the engine cannot start the auto-import process of the new "HostedEngine" VM.
2016-06-09 11:15:38,016 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (org.ovirt.thread.pool-6-thread-8) [66c4f646] Failed importing the Hosted Engine VM
See attached engine log.

Comment 18 Artyom 2016-06-09 15:19:31 UTC
Created attachment 1166349 [details]
new engine log

Comment 19 Artyom 2016-06-09 15:30:43 UTC
I also tried to destroy the HE storage domain to start the auto-import process from the beginning, but now the engine fails to import the HE SD at all.
From the host vdsm log:
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 104, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
IndexError: list index out of range
So I will also add host vdsm logs.

Comment 20 Artyom 2016-06-09 15:40:55 UTC
Created attachment 1166353 [details]
new vdsm log

Comment 21 Simone Tiraboschi 2016-06-13 09:33:01 UTC
OK, so now I see the whole picture.

What is failing now is basically a full migration of a hosted-engine setup from one storage domain to a new one.

This is failing since the engine backup itself contains references to the previous hosted-engine storage domain and to the old engine VM, with a different disk UUID and so on.

This is not blocking the migration from 3.6 EL6 EAP6 to 4.0 EL7 EAP7 (rhbz#1302228), since in that case we are using the same hosted-engine storage domain and the same VM, just editing it to use a new disk deployed from the EL7 appliance.

In order to let the user move from one hosted-engine storage domain to another, which can be a good idea if the whole storage device failed or if the user wants to change the hosted-engine storage domain type, we also need to somehow (at backup or at restore time) filter out any reference to the old hosted-engine storage domain and to the old hosted-engine VM, so that the engine will look for them just as it does after a fresh deployment.
We also need to ask the user to redeploy all the other hosts, since the new metadata and lockspace volumes don't contain any reference to them (we didn't clone the two volumes, just recreated them).

Comment 22 Yaniv Lavi 2016-06-13 12:23:56 UTC
Moving to 4.0.1 based on comment #21.

Comment 23 Yaniv Lavi 2016-06-13 12:24:58 UTC
Can you provide the manual steps to make it work, for a KBase article? Otherwise we cannot support this at this point.

Comment 25 Sandro Bonazzola 2016-09-01 07:54:44 UTC
Simone please check if this is still an issue.
Migration from 3.6 to 4.0 looks pretty much like the backup / restore described here.

Comment 26 Simone Tiraboschi 2016-09-01 08:08:51 UTC
(In reply to Sandro Bonazzola from comment #25)
> Simone please check if this is still an issue.
> Migration from 3.6 to 4.0 looks pretty much like the backup / restore
> described here.

Yes, it's still an issue.
In the 3.6 to 4.0 migration we are basically upgrading in place, so the engine VM UUID, the hosted-engine storage domain UUID, and the host UUIDs are really the same, and so there is no issue there.

This issue will happen instead when we try to restore (for disaster recovery purposes, for instance) an engine backup taken on one environment onto a slightly or completely different one.
In that case, for instance, the UUID of the new engine VM and the UUID of its disk could be different from what we have in the engine DB; the same goes for the old hosted-engine storage domain (the engine will try to remount what it originally imported) and so on.
Unfortunately, for other reasons, the hosted-engine elements are also locked in the engine, so the user cannot simply remove them through the engine.

Probably the best solution would be to filter them out at backup creation or restore time.
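
For reference, restore-time filtering is what the enhancement described in the Doc Text above exposes as engine-backup restore options. A minimal sketch, assuming the option names --he-remove-storage-vm and --he-remove-hosts (verify against engine-backup --help on the installed version):

# restore on the new engine VM while dropping the old hosted-engine storage
# domain/VM and the old hosted-engine hosts from the restored database
engine-backup --mode=restore --file=engine.backup --log=engine-restore.log --provision-db --restore-permissions --he-remove-storage-vm --he-remove-hosts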

Comment 27 Artyom 2017-01-25 15:05:43 UTC
Verified on rhevm-4.1.0.2-0.2.el7.noarch

