+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1240466 +++
======================================================================

Description of problem:

The current backup/restore procedure for the self-hosted engine runs into a conflict where the HostedEngine VM is present in the database and creates a problem when the new environment is deployed.

After the Manager has been restored (and engine-setup has run), the HostedEngine VM is present in the environment in an Unknown state - as are all VMs until the new host has finished deployment, becomes active, and contends for SPM, after which all VMs go into a Down state (non-HostedEngine VMs can then be started). The HostedEngine VM present in the Manager is a ghost from the backup, and prevents the new HostedEngine VM from appearing, presumably because the old and new HostedEngine VMs have the same name (I imagine this would not be the case if the old VM had had its name edited, as in: https://access.redhat.com/articles/1248993). The old HostedEngine VM cannot be brought into an Up state or removed by any conventional means because it is not controlled by the Manager.

The current workaround is to rename the new HostedEngine VM to differentiate the two VMs, at which point it appears in the Manager as 'external-<newName>' and in an Up state. After this, the old HostedEngine VM (and its associated snapshot) can be removed from the engine database. This procedure is documented in the following article:
https://access.redhat.com/solutions/1517683

More information can also be found throughout BZ#1232136

Version-Release number of selected component (if applicable):
3.4 and 3.5 (have not tested with 3.3)

How reproducible:
Every time.

Steps to Reproduce:
1. Back up the self-hosted engine with the engine-backup tool
2. Deploy the self-hosted engine on a new host (can also be an old host, provided it was not hosting any VMs at the time of backup)
3. Restore the self-hosted engine with the engine-backup tool on the new HostedEngine VM and run engine-setup
4. Log into the freshly restored Manager
5. Shake fist at persistent ghost of old HostedEngine VM

Actual results:
The HostedEngine VM is in an Unknown and then Down state and cannot be brought into an Up state or removed by conventional means.

Expected results:
The new HostedEngine VM supersedes the old one and is in an Up state at the completion of hosted-engine deployment.

Additional info:
I have saved logs from the last time I ran this procedure, if they would be useful.

(Originally by Andrew Burden)
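The failure mode described above - the ghost VM from the backup blocking auto-import because both VMs carry the name "HostedEngine" - can be illustrated with a minimal sketch. The function name and logic here are purely illustrative stand-ins, not the actual engine code:

```python
# Illustrative sketch only: the auto-import step cannot proceed when the
# restored database already contains a VM with the same name as the new
# hosted-engine VM. Names and logic here are hypothetical, not engine code.
def can_auto_import(existing_vm_names, new_vm_name="HostedEngine"):
    """Return True if no name collision blocks the auto-import."""
    return new_vm_name not in existing_vm_names

# The ghost VM restored from the backup blocks the import...
print(can_auto_import({"HostedEngine", "web01"}))  # False
# ...while renaming (the documented workaround) avoids the collision.
print(can_auto_import({"HostedEngine", "web01"}, "HostedEngine-new"))  # True
```

This also matches the workaround in https://access.redhat.com/solutions/1517683: once the names differ, the new VM can appear (as 'external-<newName>') and the old row can be removed.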
Andrew, can you attach the logs? Roy, I think we should check this workflow with the import / edit feature you're working on. (Originally by Sandro Bonazzola)
Meital, has the existing procedure been tested on 3.6? (Originally by Sandro Bonazzola)
(In reply to Sandro Bonazzola from comment #6)
> Meital, has the existing procedure been tested on 3.6?

We followed http://file.bne.redhat.com/~juwu/Self-Hosted_Engine_Guide/#Backup_and_Restore_Overview during the 3.6 bare-metal-based engine migration to 3.6 HE. The backup and restore were performed according to that guide. Everything worked fine.

(Originally by Nikolai Sednev)
Nikolai, note that migration from bare metal and restore of hosted-engine backup are different. Please check restore of the backup (Originally by Sandro Bonazzola)
Performed:

1) On the HE VM, with the engine and DWH/reports/console-proxy running, I ran:
   engine-backup --mode=backup --file=nsednev_from_nsednev_he_1_rhevm_3_6 --log=Log_nsednev_from_nsednev_he_1_rhevm_3_6
   Results:
   Backing up:
   Notifying engine
   - Files
   - Engine database 'engine'
   - DWH database 'ovirt_engine_history'
   - Reports database 'ovirt_engine_reports'
   Packing into file 'nsednev_from_nsednev_he_1_rhevm_3_6'
   Notifying engine
   Done.
2) Copied Log_nsednev_from_nsednev_he_1_rhevm_3_6 and nsednev_from_nsednev_he_1_rhevm_3_6 to the second (new) host, which will replace the first hosted-engine host.
3) Powered off the first hosted-engine host.
4) Deployed HE to a clean NFS share on the second (new) host, and answered "no" to "Automatically execute engine-setup on the engine appliance (rhevm-appliance-20160413.0-1) on first boot (Yes, No)[Yes]?" during deployment.
5) Copied the backed-up files to /root/backup on the engine VM and restored them:
   [root@nsednev-he-1 ~]# mkdir backup
   [root@nsednev-he-1 ~]# ll
   total 8724
   drwxr-xr-x. 2 root root    4096 Apr 19 05:27 backup
   -rw-r--r--. 1 root root    3830 Apr 19 05:24 Log_nsednev_from_nsednev_he_1_rhevm_3_6
   -rw-r--r--. 1 root root 8919541 Apr 19 05:25 nsednev_from_nsednev_he_1_rhevm_3_6
   -rw-r--r--. 1 root root    1117 Apr 13 11:55 ovirt-engine-answers
   [root@nsednev-he-1 ~]# cp Log_nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
   [root@nsednev-he-1 ~]# cp nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
   [root@nsednev-he-1 ~]# engine-backup --mode=restore --log=/root/backup/Log_nsednev_from_nsednev_he_1_rhevm_3_6 --file=/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6 --provision-db --provision-dwh-db --provision-reports-db --restore-permissions
   Preparing to restore:
   - Unpacking file '/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6'
   Restoring:
   - Files
   Provisioning PostgreSQL users/databases:
   - user 'engine', database 'engine'
   - user 'ovirt_engine_history', database 'ovirt_engine_history'
   - user 'ovirt_engine_reports', database 'ovirt_engine_reports'
   Restoring:
   - Engine database 'engine'
     - Cleaning up temporary tables in engine database 'engine'
     - Resetting DwhCurrentlyRunning in dwh_history_timekeeping in engine database
   - DWH database 'ovirt_engine_history'
   - Reports database 'ovirt_engine_reports'
   You should now run engine-setup.
   Done.
6) Ran engine-setup and, once finished, continued with the HE deployment on the second host.
7) Chose "(1) Continue setup - oVirt-Engine installation is ready and ovirt-engine service is up".
8) I've got the screenshots as attached.

(Originally by Nikolai Sednev)
Created attachment 1149102 [details] HE_unknown.png (Originally by Nikolai Sednev)
Created attachment 1149104 [details] sosreport from the backed up engine (Originally by Nikolai Sednev)
Created attachment 1149105 [details] sosreport from second host (Originally by Nikolai Sednev)
Roy, Didi, Simone, can you have a look at the procedure, screenshots and logs? We should end up with the VM status known and the storage up. (Originally by Sandro Bonazzola)
Adding screenshot from HE Storage. (Originally by Nikolai Sednev)
Created attachment 1149112 [details] Storage screenshot.png (Originally by Nikolai Sednev)
The procedure as described in https://bugzilla.redhat.com/show_bug.cgi?id=1240466#c9 is fine. We need to retest it since AFAIK now it's correctly working. (Originally by Simone Tiraboschi)
Checked on:
rhevm-3.6.7.3-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch

The problem still exists: because a VM with the name "HostedEngine" exists under the engine, the engine cannot start the auto-import process of the new "HostedEngine" VM.

2016-06-09 11:15:38,016 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (org.ovirt.thread.pool-6-thread-8) [66c4f646] Failed importing the Hosted Engine VM

See attached engine log.

(Originally by Artyom Lukianov)
Created attachment 1166349 [details] new engine log (Originally by Artyom Lukianov)
I also tried to destroy the HE storage domain to start the auto-import process from the beginning, but now the engine failed to import the HE SD at all. From the host vdsm log:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 104, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
IndexError: list index out of range

So I will also add the host vdsm logs.

(Originally by Artyom Lukianov)
Created attachment 1166353 [details] new vdsm log (Originally by Artyom Lukianov)
OK, so now I see the whole picture.

What is failing now is basically a full migration of a hosted-engine setup from one storage domain to a new one. This fails because the engine backup itself contains a reference to the previous hosted-engine storage domain and to the old engine VM, with different disk uuids and so on.

This does not block the migration from 3.6 EL6 EAP6 to 4.0 EL7 EAP7 (rhbz#1302228), since in that case we are using the same hosted-engine storage domain and the same VM, just editing it to use a new disk deployed from the EL7 appliance.

In order to let the user move from one hosted-engine storage domain to another - which can be a good idea if the whole storage device failed, or if the user wants to change the hosted-engine storage domain type - we also need to somehow (at backup or at restore time) filter out any reference to the old hosted-engine storage domain and to the old hosted-engine VM (the engine will then look for them as it would just after a fresh deployment). We also need to ask the user to redeploy all the other hosts, since the new metadata and lockspace volumes don't contain any reference to them: we didn't clone the two volumes, but just recreated them.

(Originally by Simone Tiraboschi)
Moving to 4.0.1 based on comment #21. (Originally by Yaniv Dary)
Can you provide the manual steps to make it work, for a KBase article? Otherwise we cannot support this at this point. (Originally by Yaniv Dary)
Simone please check if this is still an issue. Migration from 3.6 to 4.0 looks pretty much like the backup / restore described here. (Originally by Sandro Bonazzola)
(In reply to Sandro Bonazzola from comment #25)
> Simone please check if this is still an issue.
> Migration from 3.6 to 4.0 looks pretty much like the backup / restore
> described here.

Yes, it's still an issue.

In the 3.6 to 4.0 migration we are basically upgrading in place, so the engine VM uuid, the hosted-engine storage domain uuid and the host uuids are really the same, and so there is no issue there.

This issue will happen instead when we try to restore (for disaster recovery purposes, for instance) an engine backup taken on one environment onto a slightly or completely different one. In that case, for instance, the uuid of the new engine VM and the uuid of its disk could be different from what we have in the engine DB, and the same goes for the old hosted-engine storage domain (the engine will try to remount what it originally imported), and so on. Unfortunately, due to other reasons, the hosted-engine elements are also locked in the engine, so the user cannot simply remove them through the engine. Probably the best solution would be to filter them out at backup creation or restore time.

(Originally by Simone Tiraboschi)
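The restore-time filtering suggested above can be sketched against a toy schema. The table and column names below are simplified stand-ins, not the real oVirt engine schema, and the function is a hypothetical illustration of what "filter out the hosted-engine references" means:

```python
import sqlite3

# Toy schema standing in for the engine DB: a vm_static table whose origin
# column marks the hosted-engine VM, and a storage_domain_static table with
# a flag for the hosted-engine storage domain. Simplified stand-ins only,
# not the real oVirt schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vm_static (vm_name TEXT, origin TEXT);
CREATE TABLE storage_domain_static (name TEXT, is_hosted_engine INTEGER);
INSERT INTO vm_static VALUES ('HostedEngine', 'HOSTED_ENGINE'),
                             ('web01', 'OVIRT');
INSERT INTO storage_domain_static VALUES ('hosted_storage', 1),
                                         ('data_sd', 0);
""")

def strip_hosted_engine_refs(conn):
    """Drop the stale hosted-engine rows so the restored engine can
    re-import the new hosted-engine VM and storage domain, as it would
    after a fresh deployment."""
    conn.execute("DELETE FROM vm_static WHERE origin = 'HOSTED_ENGINE'")
    conn.execute("DELETE FROM storage_domain_static WHERE is_hosted_engine = 1")

strip_hosted_engine_refs(conn)
print([r[0] for r in conn.execute("SELECT vm_name FROM vm_static")])  # ['web01']
```

This is conceptually what the later engine-backup restore options (--he-remove-storage-vm, --he-remove-hosts, shown in the verification comment below in comment form) ended up providing.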
Verified on rhevm-4.1.0.2-0.2.el7.noarch (Originally by Artyom Lukianov)
Verified on:
rhevm-4.0.7.3-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.2-1.el7ev.noarch

1. Deploy the HE environment
2. Add the storage domain to the engine (to start the auto-import process)
3. Wait until the engine has the HE VM
4. Set global maintenance
5. Back up the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host
7. Clean the host from the HE deploy (reprovisioning)
8. Run the HE deployment again
9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]?"
10. Enter the HE VM and copy the backup file from the host to the HE VM
11. Run the restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
12. Run engine-setup: # engine-setup --offline
13. Finish the HE deployment process

The engine is up and has the HE SD and HE VM in the active state.

Be aware of bugs under 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1416459
https://bugzilla.redhat.com/show_bug.cgi?id=1416466
Verified on the correct version:
# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-2.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch
(In reply to Artyom from comment #35)
> Be aware of bugs under 4.1:
> https://bugzilla.redhat.com/show_bug.cgi?id=1416459
> https://bugzilla.redhat.com/show_bug.cgi?id=1416466

We backported their fixes as well:
https://bugzilla.redhat.com/show_bug.cgi?id=1425893
https://bugzilla.redhat.com/show_bug.cgi?id=1425890
After a number of backup-restore operations, it looks like we still have a problem with the auto-import operation:

2017-03-02 05:07:40,924 INFO [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] VDS '847b4fe4-671c-4635-ac0a-6801064c3c95' was added to the Resource Manager
2017-03-02 05:07:40,951 INFO [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] Finished initializing ResourceManager
2017-03-02 05:07:40,981 INFO [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (ServerService Thread Pool -- 55) [] Command [id=aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70]: Compensating DELETED_OR_UPDATED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainDynamic; snapshot: id=00000000-0000-0000-0000-000000000000.
2017-03-02 05:07:40,992 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (ServerService Thread Pool -- 55) [] transaction rolled back
2017-03-02 05:07:40,992 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 55) [] Failed to run compensation on startup for Command 'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70': CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic ( available_disk_size, id, used_disk_size ) VALUES ( v_available_disk_size, v_id, v_used_disk_size )" PL/pgSQL function insertstorage_domain_dynamic(integer,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".

We are missing the patch in our downstream build - https://gerrit.ovirt.org/#/c/72902
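The PostgreSQL error in the log above is the generic pattern of inserting a child row whose key is missing from the parent table. A minimal sketch with a toy two-table schema (not the real engine schema) reproduces the same class of failure:

```python
import sqlite3

# Minimal reproduction of the same class of error as in the log above:
# inserting into a child table (storage_domain_dynamic) a key that is not
# present in the parent table (storage_domain_static). Toy schema only,
# not the real oVirt engine schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs FK enforcement enabled
conn.executescript("""
CREATE TABLE storage_domain_static (id TEXT PRIMARY KEY);
CREATE TABLE storage_domain_dynamic (
    id TEXT PRIMARY KEY REFERENCES storage_domain_static(id),
    available_disk_size INTEGER,
    used_disk_size INTEGER
);
""")

try:
    # The compensation code inserted a snapshot row with an all-zero uuid
    # that does not exist in storage_domain_static, so the FK fires.
    conn.execute(
        "INSERT INTO storage_domain_dynamic VALUES (?, ?, ?)",
        ("00000000-0000-0000-0000-000000000000", 0, 0),
    )
except sqlite3.IntegrityError as e:
    print("FK violation:", e)
```

The fix referenced above (https://gerrit.ovirt.org/#/c/72902) addresses the engine-side compensation path that produced the dangling all-zero id, not the constraint itself.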
Verified with new vdsm:
vdsm-4.18.24-3.el7ev.x86_64

Ran the backup/restore process 5 times; in all 5 runs the HE VM and HE SD were active.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0542.html