Bug 1422486
Summary: | [downstream clone - 4.0.7] [HE] high availability compromised due to duplicate spm id | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
Component: | ovirt-engine | Assignee: | Denis Chaplygin <dchaplyg> |
Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 4.0.6 | CC: | dfediuck, gklein, lsurette, mavital, mkalinin, rbalakri, Rhev-m-bugs, sbonazzo, srevivo, stirabos, tnisan, trichard, ykaul, ylavi |
Target Milestone: | ovirt-4.0.7 | Keywords: | Triaged, ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | integration | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The self-hosted engine always uses an SPM ID of 1 during installation of the first self-hosted engine host, without checking the database settings. This release adds options to adjust the database during the restore process.
For disaster recovery, the --he-remove-hosts option has been added so that all hosts with SPM_ID=1 are updated and assigned a different SPM ID.
For bare-metal-to-self-hosted-engine migration, a new engine-migrate-he.py script is provided. This script should be called before migration and supplied with the Manager REST API login, password, endpoint, and path to the CA certificate. Hosts in the selected data center with SPM_ID=1 will be put into Maintenance mode so that they can safely accept a new ID. Migration can then continue as usual, using the --he-remove-hosts option.
|
Story Points: | --- |
Clone Of: | 1417518 | Environment: | |
Last Closed: | 2017-03-16 15:33:58 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1417518 | ||
Bug Blocks: | 1431635 |
Description
rhev-integ
2017-02-15 12:19:02 UTC
Here I see just two options:

1. Always filter out (or change the ID of) the host with spm_id=1 at restore time; this should be pretty safe.
2. Start hosted-engine-setup with spm_id=1, deploy as usual, and add the host via the engine API. Since we have another host with spm_id=1, the engine will choose a different spm_id, so we need to get it somehow (REST API, VDSM, sanlock?) and change the hosted-engine configuration before the next run. Currently nothing removes the lock on exit, since we just assume that the engine is going to refresh for id=1, so we also have to remove the lock; but if we do that while the engine VM is running, the VM would probably be killed, and if we wait for the engine VM to be down, nothing prevents the engine from starting the auto-import process, since another storage domain could be in the restored DB as well.

(Originally by Simone Tiraboschi)

Simone, I have yet to write the KCS article on how to fix this. In your opinion, what would be the best course of action to avoid the HostedEngine VM getting killed?

I assume we need to get host_id in hosted-engine.conf in sync with vds_spm_id_map. So what about this:

1) Get the vds_spm_id_map
2) Put HE into maintenance mode
3) Shut down/migrate the HE VM and make it release the IDs lock (with a single host it can't move to maintenance)
4) Adjust hosted-engine.conf
5) Clean up the metadata slots of the old IDs
6) Reboot

Any better ideas?

I was thinking, one can also hit this when deploying additional hosts, right? Just use a host_id which is already used in vds_spm_id_map; no need to do all that I did in the reproduction steps. Right?

Thank you

(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #3)
> I assume we need to get host_id in hosted-engine.conf in sync with
> vds_spm_id_map. So what about this:
>
> 1) Get the vds_spm_id_map
> 2) Put HE into maintenance mode
> 3) Shut down/migrate the HE VM and make it release the IDs lock (with a single host it can't move to maintenance)
> 4) Adjust hosted-engine.conf
> 5) Clean up the metadata slots of the old IDs
> 6) Reboot
>
> Any better ideas?

This is one option; the other is to change the spm_id of the conflicting hosts in the DB and reboot the host not involved in hosted-engine.

> I was thinking, one can also hit this when deploying additional hosts, right?
> Just use a host_id which is already used in vds_spm_id_map; no need to do
> all that I did in the reproduction steps. Right?

We deprecated (in 4.0) and removed the possibility of deploying additional hosted-engine hosts from the CLI; now the user can only add additional hosted-engine hosts from the engine (web UI or REST API), so we are sure that the spm_id will be coherent between the engine and the local configuration of the hosted-engine hosts.

(Originally by Simone Tiraboschi)
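The mismatch discussed above can be spotted mechanically. Below is a minimal sketch, assuming psycopg2, direct access to the engine database, and placeholder credentials, that compares the local host_id from /etc/ovirt-hosted-engine/hosted-engine.conf with the IDs recorded in vds_spm_id_map. It is illustrative only: it reports a potential conflict, it does not fix anything, and the credentials and column names should be checked against the actual deployment.

```python
#!/usr/bin/env python
# Illustrative sketch only: report engine-side SPM IDs that collide with the
# local hosted-engine host_id. Credentials, host and paths are placeholders.
import psycopg2

HE_CONF = '/etc/ovirt-hosted-engine/hosted-engine.conf'


def local_host_id(path=HE_CONF):
    # hosted-engine.conf is a simple key=value file; host_id is the ID used
    # for the sanlock/SPM slot of this host.
    with open(path) as f:
        for line in f:
            if line.startswith('host_id='):
                return int(line.split('=', 1)[1].strip())
    raise RuntimeError('host_id not found in %s' % path)


def engine_spm_ids(dbname='engine', user='engine', password='changeme', host='localhost'):
    # Read vds_spm_id_map joined with vds_static to get host names.
    conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
    cur = conn.cursor()
    cur.execute("""
        SELECT s.vds_name, m.vds_spm_id
          FROM vds_spm_id_map m
          JOIN vds_static s ON s.vds_id = m.vds_id
    """)
    rows = cur.fetchall()
    conn.close()
    return rows


if __name__ == '__main__':
    host_id = local_host_id()
    print('local hosted-engine host_id: %d' % host_id)
    for name, spm_id in engine_spm_ids():
        # A row with the same spm_id that belongs to a *different* host is the
        # conflict discussed in the comments above; the operator has to judge
        # which row corresponds to the local host.
        marker = '  <-- same ID as local host_id' if spm_id == host_id else ''
        print('%-40s spm_id=%d%s' % (name, spm_id, marker))
```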
Simone, can we close this for the current release, given that we deprecated the CLI deployment in 4.0 and this does not happen using the web UI?

(Originally by Sandro Bonazzola)

(In reply to Sandro Bonazzola from comment #5)
> Simone, can we close this for the current release, given that we deprecated the CLI
> deployment in 4.0 and this does not happen using the web UI?

Hi Sandro,

As stated in comment #0, this can easily be hit in two scenarios:

A: Re-deploying HE (i.e. disaster recovery)
B: Moving from standalone RHV-M to HE

Neither of them is related to deploying additional hosts via the web UI. Is there another BZ tracking the root cause of this (inappropriate SPM ID for the initial host)?

(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #6)
> Is there another BZ tracking the root cause of this (inappropriate SPM ID
> for the initial host)?

No, we can work on this bug.

(Originally by Simone Tiraboschi)

Sandro, can you please check whether the fix really exists in build 4.0.7? I can see that the two patches attached to the bug do not satisfy this:

http://gerrit.ovirt.org/71393 - not merged at all
http://gerrit.ovirt.org/72037 - merged only on master

About https://gerrit.ovirt.org/#/c/72037/: it has Change-Id: I48e11a863b28b0c9a32f5a19b72392afbe717fe9 and it looks like it has not been merged, so moving back to POST and moving the needinfo to Simone, who moved the bug to MODIFIED.

About http://gerrit.ovirt.org/71393: it has Change-Id: Ib7b10f57a9350adf3da73580c4a69e5ce317502e and, according to https://gerrit.ovirt.org/#/q/Ib7b10f57a9350adf3da73580c4a69e5ce317502e, it has been merged for 4.0.7. Back to POST.
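For context on the verification that follows: the key step of a migration helper such as engine-migrate-he.py is to put any host currently holding the needed SPM ID into Maintenance through the Manager REST API and wait until the engine confirms the state change. The snippet below is a minimal sketch of that step using the oVirt Python SDK (ovirtsdk4), assuming the conflicting host's name has already been identified (for example from vds_spm_id_map) and using placeholder endpoint, credentials, and host name; it is an illustration, not the shipped engine-migrate-he.py script.

```python
#!/usr/bin/env python
# Illustrative sketch of the "move conflicting host to Maintenance" step.
# The endpoint, credentials and host name are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='changeme',
    ca_file='/etc/pki/ovirt-engine/ca.pem',
)
try:
    hosts_service = connection.system_service().hosts_service()

    # The host that currently owns the SPM ID we need to free,
    # identified beforehand (e.g. by inspecting vds_spm_id_map).
    host = hosts_service.list(search='name=hosted_engine_1')[0]
    host_service = hosts_service.host_service(host.id)

    if host.status != types.HostStatus.MAINTENANCE:
        print('Putting host %s to maintenance' % host.name)
        host_service.deactivate()

    # Poll until the engine reports the host in Maintenance, so it no longer
    # holds its sanlock/SPM slot and the ID can be reassigned safely.
    while host_service.get().status != types.HostStatus.MAINTENANCE:
        print('Waiting for host to switch into Maintenance state')
        time.sleep(5)
    print('Host is in Maintenance state')
finally:
    connection.close()
```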
Verified on: rhevm-4.0.7.4-0.1.el7ev.noarch

# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-3.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

1. Deploy the HE environment.
2. The HE environment has two hosts, one HE and one non-HE:

engine=# select vds_id, vds_name from vds_static;
                vds_id                |           vds_name
--------------------------------------+------------------------------
 1f6a216d-dff4-4b55-b5fb-a54cc2dca920 | alma06.qa.lab.tlv.redhat.com   # non-HE host
 f3a839e1-1daf-4474-b4ac-5261e1818244 | hosted_engine_2                # HE host
(2 rows)

engine=# select * from vds_spm_id_map;
           storage_pool_id            | vds_spm_id |                vds_id
--------------------------------------+------------+--------------------------------------
 00000001-0001-0001-0001-0000000001f7 |          1 | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 00000001-0001-0001-0001-0000000001f7 |          2 | f3a839e1-1daf-4474-b4ac-5261e1818244

3. Run the engine-migrate-he.py script:

# python engine-migrate-he.py
Engine REST API url[https://nsednev-he-1.qa.lab.tlv.redhat.com/ovirt-engine/api]:
Engine REST API username[admin@internal]:
Engine REST API password:123456
Engine CA certificate file[/etc/pki/ovirt-engine/ca.pem]:
Putting host hosted_engine_1 to maintenance
Waiting for host to switch into Maintenance state
Waiting for host to switch into Maintenance state
Waiting for host to switch into Maintenance state
Host is in Maintenance state

5. Back up the engine:
# engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host.
7. Clean the HE host from the HE deployment (reprovisioning).
8. Run the HE deployment again.
9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]?"
10. Enter the HE VM and copy the backup file from the host to the HE VM.
11. Run the restore command:
# engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
12. Run engine setup:
# engine-setup --offline
13. Finish the HE deployment process.

The engine is up, the HE storage domain and HE VM are in the active state, and the DB tables look like:

engine=# select * from vds_spm_id_map;
           storage_pool_id            | vds_spm_id |                vds_id
--------------------------------------+------------+--------------------------------------
 00000001-0001-0001-0001-0000000001f7 |          2 | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 00000001-0001-0001-0001-0000000001f7 |          1 | 563d36d0-dc48-4134-a350-23563f842131
(2 rows)

engine=# select vds_name, vds_id from vds_static;
           vds_name           |                vds_id
------------------------------+--------------------------------------
 alma06.qa.lab.tlv.redhat.com | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 hosted_engine_1              | 563d36d0-dc48-4134-a350-23563f842131

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html
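As a follow-up check after a restore with --he-remove-hosts, every SPM ID in vds_spm_id_map should be held by exactly one host, matching the tables above. Below is a minimal sketch of such a uniqueness check, assuming psycopg2 and placeholder engine database credentials; it is illustrative only and not part of the verification record above.

```python
#!/usr/bin/env python
# Illustrative post-restore sanity check: flag any SPM ID shared by more than one host.
import collections

import psycopg2

conn = psycopg2.connect(dbname='engine', user='engine', password='changeme', host='localhost')
cur = conn.cursor()
cur.execute("""
    SELECT m.vds_spm_id, s.vds_name
      FROM vds_spm_id_map m
      JOIN vds_static s ON s.vds_id = m.vds_id
     ORDER BY m.vds_spm_id
""")
by_id = collections.defaultdict(list)
for spm_id, name in cur.fetchall():
    by_id[spm_id].append(name)
conn.close()

# Each SPM ID should map to exactly one host; duplicates reproduce the original bug.
for spm_id, names in sorted(by_id.items()):
    status = 'OK' if len(names) == 1 else 'DUPLICATE'
    print('spm_id=%d %-10s %s' % (spm_id, status, ', '.join(names)))
```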