Bug 1422470
| Summary: | [downstream clone - 4.0.7] Restoring self-hosted engine from backup has conflict between new and old HostedEngine VM | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
| Component: | ovirt-engine | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.1 | CC: | aburden, bmcclain, dfediuck, didi, eheftman, gklein, gveitmic, lsurette, lveyde, mavital, mkalinin, mlipchuk, rbalakri, rgolan, Rhev-m-bugs, sbonazzo, srevivo, stirabos, trichard, ykaul, ylavi |
| Target Milestone: | ovirt-4.0.7 | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Previously, when restoring a backup of a self-hosted engine on a different environment for disaster recovery purposes, administrators were sometimes required to remove the previous self-hosted engine's storage domain and virtual machine. This was done from within the engine's database, which is a risk-prone procedure. In this release, a new CLI option enables administrators to remove the previous self-hosted engine's storage domain and virtual machine directly from the backup of the engine during the restore procedure. | Story Points: | --- |
| Clone Of: | 1240466 | Environment: | |
| Last Closed: | 2017-03-16 15:33:46 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1240466 | | |
| Bug Blocks: | | | |
Description
rhev-integ
2017-02-15 12:03:42 UTC
Andrew, can you attach the logs? Roy, I think we should check this workflow with the import / edit feature you're working on. (Originally by Sandro Bonazzola)

Meital, has the existing procedure been tested on 3.6? (Originally by Sandro Bonazzola)

(In reply to Sandro Bonazzola from comment #6)
> Meital, has the existing procedure been tested on 3.6?
We followed http://file.bne.redhat.com/~juwu/Self-Hosted_Engine_Guide/#Backup_and_Restore_Overview during the 3.6 bare-metal engine migration to 3.6 HE. The backup and restore were made according to the same guide. Everything worked fine. (Originally by Nikolai Sednev)

Nikolai, note that migration from bare metal and restore of a hosted-engine backup are different. Please check restore of the backup. (Originally by Sandro Bonazzola)

Performed:

1) On the HE VM with a running engine and DWH, reports and console-proxy, I ran:

```
engine-backup --mode=backup --file=nsednev_from_nsednev_he_1_rhevm_3_6 --log=Log_nsednev_from_nsednev_he_1_rhevm_3_6
```

Results:

```
Backing up:
Notifying engine
- Files
- Engine database 'engine'
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
Packing into file 'nsednev_from_nsednev_he_1_rhevm_3_6'
Notifying engine
Done.
```

2) Copied Log_nsednev_from_nsednev_he_1_rhevm_3_6 and nsednev_from_nsednev_he_1_rhevm_3_6 to the second (new) host, which will replace the first hosted-engine host.
3) Powered off the first hosted-engine host.
4) Deployed HE to a clean NFS share on the second (new) host, answering "no" to "Automatically execute engine-setup on the engine appliance (rhevm-appliance-20160413.0-1) on first boot (Yes, No)[Yes]?" during deployment.
5) Copied the backed-up files to /root/backup on the engine VM and restored them:

```
[root@nsednev-he-1 ~]# mkdir backup
[root@nsednev-he-1 ~]# ll
total 8724
drwxr-xr-x. 2 root root    4096 Apr 19 05:27 backup
-rw-r--r--. 1 root root    3830 Apr 19 05:24 Log_nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root 8919541 Apr 19 05:25 nsednev_from_nsednev_he_1_rhevm_3_6
-rw-r--r--. 1 root root    1117 Apr 13 11:55 ovirt-engine-answers
[root@nsednev-he-1 ~]# cp Log_nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# cp nsednev_from_nsednev_he_1_rhevm_3_6 /root/backup/
[root@nsednev-he-1 ~]# engine-backup --mode=restore --log=/root/backup/Log_nsednev_from_nsednev_he_1_rhevm_3_6 --file=/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6 --provision-db --provision-dwh-db --provision-reports-db --restore-permissions
Preparing to restore:
- Unpacking file '/root/backup/nsednev_from_nsednev_he_1_rhevm_3_6'
Restoring:
- Files
Provisioning PostgreSQL users/databases:
- user 'engine', database 'engine'
- user 'ovirt_engine_history', database 'ovirt_engine_history'
- user 'ovirt_engine_reports', database 'ovirt_engine_reports'
Restoring:
- Engine database 'engine'
- Cleaning up temporary tables in engine database 'engine'
- Resetting DwhCurrentlyRunning in dwh_history_timekeeping in engine database
- DWH database 'ovirt_engine_history'
- Reports database 'ovirt_engine_reports'
You should now run engine-setup.
Done.
```

6) Ran engine-setup and, once it finished, continued with HE deployment on the second host.
7) Selected "(1) Continue setup - oVirt-Engine installation is ready and ovirt-engine service is up".
8) Attached the screenshots. (Originally by Nikolai Sednev)

Created attachment 1149102 [details]
HE_unknown.png
(Originally by Nikolai Sednev)
Created attachment 1149104 [details]
sosreport from the backed up engine
(Originally by Nikolai Sednev)
Created attachment 1149105 [details]
sosreport from second host
(Originally by Nikolai Sednev)
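The backup/restore flow exercised above can be summarized as a short shell sketch. The `engine-backup` options are the ones used in the procedure; filenames are illustrative, and since `engine-backup` only exists on an engine machine, the sketch composes and prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch of the backup/restore flow from the procedure above.
# Filenames are illustrative.
BACKUP_FILE=engine.backup
BACKUP_LOG=engine-backup.log

# On the old HE VM: take a full engine backup.
backup_cmd="engine-backup --mode=backup --file=$BACKUP_FILE --log=$BACKUP_LOG"

# On the new HE VM (after copying the backup file over): restore it,
# provisioning the engine, DWH and reports databases.
restore_cmd="engine-backup --mode=restore \
--file=/root/backup/$BACKUP_FILE --log=/root/backup/$BACKUP_LOG \
--provision-db --provision-dwh-db --provision-reports-db \
--restore-permissions"

# Print the commands instead of executing them.
echo "$backup_cmd"
echo "$restore_cmd"
# engine-setup must then be run before finishing the HE deployment.
```

After the restore, `engine-setup` is run on the engine VM and the hosted-engine deployment on the host is completed.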
Roy, Didi, Simone, can you have a look at the procedure, screenshots and logs? We should end up with the VM status known and the storage up. (Originally by Sandro Bonazzola)

Adding screenshot from HE Storage. (Originally by Nikolai Sednev)

Created attachment 1149112 [details]
Storage screenshot.png
(Originally by Nikolai Sednev)
The procedure as described in https://bugzilla.redhat.com/show_bug.cgi?id=1240466#c9 is fine. We need to retest it since, AFAIK, it is now working correctly. (Originally by Simone Tiraboschi)

Checked on:
rhevm-3.6.7.3-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch

The problem still exists: because a VM with the name "HostedEngine" exists under the engine, the engine cannot start the auto-import process of the new "HostedEngine" VM.

```
2016-06-09 11:15:38,016 ERROR [org.ovirt.engine.core.bll.HostedEngineImporter] (org.ovirt.thread.pool-6-thread-8) [66c4f646] Failed importing the Hosted Engine VM
```

See attached engine log. (Originally by Artyom Lukianov)

Created attachment 1166349 [details]
new engine log
(Originally by Artyom Lukianov)
I also tried to destroy the HE storage domain to start the auto-import process from the beginning, but now the engine failed to import the HE SD at all.
From the host vdsm log:

```
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 104, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
IndexError: list index out of range
```
So I will also add host vdsm logs.
(Originally by Artyom Lukianov)
Created attachment 1166353 [details]
new vdsm log
(Originally by Artyom Lukianov)
OK, so now I see the whole picture. What is failing now is basically a full migration of a hosted-engine setup from one storage domain to a new one. This fails because the engine backup itself contains a reference to the previous hosted-engine storage domain and to the old engine VM, with different disk uuids and so on. This does not block the migration from 3.6 EL6 EAP6 to 4.0 EL7 EAP7 (rhbz#1302228), since in that case we use the same hosted-engine storage domain and the same VM, just editing it to use a new disk deployed from the EL7 appliance. In order to let the user move from one hosted-engine storage domain to another, which can be a good idea if the whole storage device failed or if the user wants to change the hosted-engine storage domain type, we also need to somehow (at backup or at restore time) filter out any reference to the old hosted-engine storage domain and to the old hosted-engine VM (the engine will then look for them just as after a fresh deployment). We also need to ask the user to redeploy all the other hosts, since the new metadata and lockspace volumes don't contain any reference to them: we didn't clone the two volumes but just recreated them. (Originally by Simone Tiraboschi)

Moving to 4.0.1 based on comment #21. (Originally by Yaniv Dary)

Can you provide the manual steps to make it work for a KBase, or can we not support this at this point? (Originally by Yaniv Dary)

Simone, please check if this is still an issue. Migration from 3.6 to 4.0 looks pretty much like the backup / restore described here. (Originally by Sandro Bonazzola)

(In reply to Sandro Bonazzola from comment #25)
> Simone please check if this is still an issue.
> Migration from 3.6 to 4.0 looks pretty much like the backup / restore
> described here.
Yes, it's still an issue. In the 3.6 to 4.0 migration we are basically upgrading in place, so the engine VM uuid, the hosted-engine storage domain uuid and the host uuids are really the same, and so there is no issue there.

This issue will happen instead when we try to restore (for disaster recovery purposes, for instance) an engine backup taken on one environment onto a slightly or completely different one. In that case, for instance, the uuid of the new engine VM and the uuid of its disk could be different from what we have in the engine DB; the same holds for the old hosted-engine storage domain (the engine will try to remount what it originally imported) and so on. Unfortunately, due to other reasons, the hosted-engine elements are also locked in the engine, so the user cannot simply remove them through the engine. Probably the best solution would be to filter them out at backup creation or restore time. (Originally by Simone Tiraboschi)

Verified on rhevm-4.1.0.2-0.2.el7.noarch (Originally by Artyom Lukianov)

Verified on:
rhevm-4.0.7.3-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.2-1.el7ev.noarch

1. Deploy the HE environment.
2. Add the storage domain to the engine (to start the auto-import process).
3. Wait until the engine has the HE VM.
4. Set global maintenance.
5. Back up the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host.
7. Clean the host from the HE deployment (reprovisioning).
8. Run the HE deployment again.
9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]?"
10. Log in to the HE VM and copy the backup file from the host to the HE VM.
11. Run the restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
12. Run engine setup: # engine-setup --offline
13. Finish the HE deployment process.

The engine is up and has the HE SD and HE VM in the active state.

Be aware of bugs under 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1416459
https://bugzilla.redhat.com/show_bug.cgi?id=1416466

Verified on the correct version:

```
# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-2.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch
```

(In reply to Artyom from comment #35)
> Be aware of bugs under 4.1:
> https://bugzilla.redhat.com/show_bug.cgi?id=1416459
> https://bugzilla.redhat.com/show_bug.cgi?id=1416466
We backported their fixes as well:
https://bugzilla.redhat.com/show_bug.cgi?id=1425893
https://bugzilla.redhat.com/show_bug.cgi?id=1425890

After a number of backup-restore operations, it looks like we still have the problem with the auto-import operation:
```
2017-03-02 05:07:40,924 INFO [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] VDS '847b4fe4-671c-4635-ac0a-6801064c3c95' was added to the Resource Manager
2017-03-02 05:07:40,951 INFO [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] Finished initializing ResourceManager
2017-03-02 05:07:40,981 INFO [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (ServerService Thread Pool -- 55) [] Command [id=aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70]: Compensating DELETED_OR_UPDATED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainDynamic; snapshot: id=00000000-0000-0000-0000-000000000000.
2017-03-02 05:07:40,992 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (ServerService Thread Pool -- 55) [] transaction rolled back
2017-03-02 05:07:40,992 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 55) [] Failed to run compensation on startup for Command 'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70': CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
Where: SQL statement "INSERT INTO storage_domain_dynamic (
    available_disk_size,
    id,
    used_disk_size
)
VALUES (
    v_available_disk_size,
    v_id,
    v_used_disk_size
)"
PL/pgSQL function insertstorage_domain_dynamic(integer,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
Where: SQL statement "INSERT INTO storage_domain_dynamic (
```
We are missing the patch in our downstream build - https://gerrit.ovirt.org/#/c/72902
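For reference, the key invocation from the verification steps above is the restore with the new `--he-remove-storage-vm` and `--he-remove-hosts` options. A minimal sketch (the command is only runnable on an engine machine, so it is composed and printed here):

```shell
#!/bin/sh
# Restore command from the verification steps above. The new
# --he-remove-storage-vm / --he-remove-hosts options drop the old
# hosted-engine storage domain, VM and hosts from the restored backup,
# so the engine can re-import the new environment cleanly.
restore_cmd="engine-backup --mode=restore --scope=all \
--file=engine.backup --log=engine-restore.log \
--he-remove-storage-vm --he-remove-hosts \
--restore-permissions --provision-dwh-db --provision-db"

# Printed rather than executed; engine-backup exists only on the HE VM.
echo "$restore_cmd"
# engine-setup --offline is then run before finishing the HE deployment.
```

The `--he-remove-hosts` option matters because the other hosts must be redeployed anyway: the recreated metadata and lockspace volumes no longer reference them.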
Verified with the new vdsm: vdsm-4.18.24-3.el7ev.x86_64. Ran the backup/restore process 5 times; in all 5 runs the HE VM and HE SD were active.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html