Bug 1422486 - [downstream clone - 4.0.7] [HE] high availability compromised due to duplicate spm id
Summary: [downstream clone - 4.0.7] [HE] high availability compromised due to duplicat...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Denis Chaplygin
QA Contact: Artyom
URL:
Whiteboard: integration
Depends On: 1417518
Blocks: 1431635
 
Reported: 2017-02-15 12:19 UTC by rhev-integ
Modified: 2020-03-11 15:56 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The self-hosted engine always uses an SPM ID of 1 during installation of the first self-hosted engine host, without checking the database settings. This release adds options to change the database during the restore process. For disaster recovery, the --he-remove-hosts option has been added so that all hosts with SPM_ID=1 are updated and assigned a different SPM ID. For bare metal to self-hosted engine migration, a new engine-migrate-he.py script is provided. This script should be called before migration and supplied with the Manager REST API login/password/endpoint and the path to the CA certificate. Hosts in the selected data center with SPM_ID=1 will be put into Maintenance mode, so they can safely accept the new ID. Migration can then continue as usual, using the --he-remove-hosts option.
Clone Of: 1417518
Environment:
Last Closed: 2017-03-16 15:33:58 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2897821 0 None None None 2017-02-15 12:19:47 UTC
Red Hat Product Errata RHBA-2017:0542 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 71393 0 None None None 2017-02-15 12:19:47 UTC
oVirt gerrit 72037 0 master MERGED he: Bare metal migration helper script added. 2017-02-28 10:12:14 UTC
oVirt gerrit 73239 0 ovirt-engine-4.0 MERGED search: Add spm_id as a searchable field for Host 2017-02-28 11:23:40 UTC
oVirt gerrit 73240 0 ovirt-engine-4.0 MERGED he: Bare metal migration helper script added. 2017-02-28 11:23:36 UTC
oVirt gerrit 73996 0 ovirt-engine-4.1 MERGED he: Notify user to redeploy HE hosts after recovery. 2017-03-27 11:22:19 UTC

Description rhev-integ 2017-02-15 12:19:02 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1417518 +++
======================================================================

Description of problem:

A Hosted Engine host and a non-HE host can end up with the same SPM id.
The Hosted Engine won't start and the whole environment is down, because the HE host can't acquire its ids lock on the HE SD.

This basically happens when deploying a new host with HE (for disaster recovery or for migration to HE).

The new host being re-deployed/migrated to HE automatically assumes spm id 1, since the HE domain is clean. However, another host is already set to id 1 in the DB (it can be running or not). The HE deployment only looks at the ids on the HE SD and knows nothing about any other running SD/Master; it can't guess, so it selects id 1.

Once the engine comes up, the HE SD is added to the DB and the other host is activated. That older host now tries to grab its lock on the HE SD and fails, yet everything appears fine and keeps running for ages. But it is definitely NOT good.

Now reboot the HE host (for any reason): the other host grabs the lock for id 1 on the HE SD and never releases it again. The HE won't start and the environment is down. It looks like corrupted lockspaces, but it isn't; cleaning up the lockspaces won't help, and one has to figure out that two hosts are fighting over the same ID, which is not that simple.

Version-Release number of selected component (if applicable):
rhevm-4.0.6.3-0.1.el7ev.noarch
vdsm-4.18.21-1.el7ev.x86_64
sanlock-3.4.0-1.el7.x86_64
ovirt-hosted-engine-ha-2.0.6-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.4.1-2.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
I see basically two ways to hit this; there might be more.

A: Re-deploying HE (i.e. disaster recovery)
B: Moving from Standalone RHV-M to HE

B could be avoided via documentation, but I don't think A can be.

So, to illustrate it, please see how to hit the problem with way B (Moving from Standalone RHV-M to HE).

1. Previously Running RHV Environment
   * Bare-Metal RHV-M
   * 1 Host (vds c0976879)
   * 1 Storage Domain (SD dc7e6fad)
   * Host uses spm id 1, see:

              storage_pool_id            | vds_spm_id |                vds_id                
   --------------------------------------+------------+--------------------------------------
    588e8d50-023f-0158-0292-0000000002f3 |          1 | c0976879-6165-4545-b67c-1bd2361112d5

   * We can see sanlock on vds c0976879 taking id 1:
   s dc7e6fad-c00c-4aeb-8c93-f2887d764019:1:/rhev/data-center/mnt/<IP>\:_storage_storage/dc7e6fad-c00c-4aeb-8c93-f2887d764019/dom_md/ids:0
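   (For reference, the two outputs above can be collected with commands like the following; just a sketch, assuming the default "engine" database name:)
     # on the engine machine, dump the spm id map
     su - postgres -c "psql engine -c 'select * from vds_spm_id_map;'"
     # on the host, list the sanlock lockspaces it currently holds
     sanlock client status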

2. To migrate to HE, follow https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/paged/self-hosted-engine-guide/chapter-4-migrating-from-bare-metal-to-a-rhel-based-self-hosted-environment
   * Note it says one of the options is:
     - Prepare a new host with the ovirt-hosted-engine-setup package installed.      
   * Now we have Host 215f48bf, which runs HE
   * Host 215f48bf got spm id 1 on the HE SD (70345242), see:
   s 70345242-bcad-4e0b-ba2e-4b761e8132a3:1:/rhev/data-center/mnt/<IP>\:_storage_hosted/70345242-bcad-4e0b-ba2e-4b761e8132a3/dom_md/ids:0

3. Once the migration is finished and the new engine activates both hosts, all seems fine, but it's not:
   * The original host c0976879 just holds id 1 on the original SD and fails to get id 1 on the HE SD 70345242
   * The HE host 215f48bf holds id 1 on the HE SD 70345242 and id 2 on the original SD dc7e6fad. 
In more detail:
   * vds c0976879 still has id 1 on SD dc7e6fad:
   s dc7e6fad-c00c-4aeb-8c93-f2887d764019:1:/rhev/data-center/mnt/<IP>\:_storage_storage/dc7e6fad-c00c-4aeb-8c93-f2887d764019/dom_md/ids:0
[1]* vds c0976879 upon activation failed to get id 1 on SD 70345242 (HE):
   * vds 215f48bf (HE) got id 2 on SD dc7e6fad:
   s dc7e6fad-c00c-4aeb-8c93-f2887d764019:2:/rhev/data-center/mnt/<IP>\:_storage_storage/dc7e6fad-c00c-4aeb-8c93-f2887d764019/dom_md/ids:0
[2]* vds 215f48bf (HE) got id 1 on SD (HE):
   s 70345242-bcad-4e0b-ba2e-4b761e8132a3:1:/rhev/data-center/mnt/<IP>\:_storage_hosted/70345242-bcad-4e0b-ba2e-4b761e8132a3/dom_md/ids:0
   * And we see the HE Host should be using id 2:
              storage_pool_id            | vds_spm_id |                vds_id                
   --------------------------------------+------------+--------------------------------------
    588e8d50-023f-0158-0292-0000000002f3 |          1 | c0976879-6165-4545-b67c-1bd2361112d5
    588e8d50-023f-0158-0292-0000000002f3 |          2 | 215f48bf-5162-46c6-9e6b-bdb2465d1e88

4. The IDs are all crossed and things are not good. Then, once we reboot the HE Host:
   * vds c0976879 still has id 1 on SD dc7e6fad
   * vds c0976879 finally gets id 1 on SD (HE).

5. Once the HE Host comes up again, it can't get its ids lock and everything fails. HE is down.
   * HE host 215f48bf can't get id 1 as it is being held by the other host c0976879
   * it still tries to get spm id 1, due to hosted-engine.conf(?):
     host_id=1

And it never corrects itself without manual intervention. Depending on the order in which the hosts are rebooted/shut down, the problem may hit the HE Host (bad outcome) or the non-HE Host (not too bad). But the problem remains that the HE Host is set to use id 2 in the DB while it uses id 1 on the HE SD.

Apparently we need some extra logic to handle this. Should the HE really use HE host_id as a key to the ids lockspace (SD host_id)?
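
A quick way to spot this mismatch (a sketch; the config path and the "engine" database name are the defaults):
   # on the HE host: the id the HE services use for the lockspace
   grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf
   # on the engine: the id the engine expects the host to use
   su - postgres -c "psql engine -c 'select vds_spm_id, vds_id from vds_spm_id_map;'"
If host_id here does not match the host's vds_spm_id, the host is racing another host for the same sanlock id.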

Actual results:
[1] Shouldn't this produce an error to warn the user that something is wrong? How can a Host activate and show up as fine if it can't even acquire its ids lock?
[2] Once activated it still has ID 1; maybe the trick is to switch it to id 2 at the final stage of the HE deploy (add host)?

Expected results:
No hosts fighting over the same id; the Hosted Engine VM is able to start.

(Originally by Germano Veit Michel)

Comment 3 rhev-integ 2017-02-15 12:19:13 UTC
Here I see just two options:
1. Always filter out (or change the id of) the host with spm_id=1 at restore time; this should be pretty safe.
2. Start hosted-engine-setup with spm_id=1, deploy as usual and add the host via the engine API. Since we have another host with spm_id=1, the engine will choose a different spm_id, so we need to obtain it somehow (REST API, VDSM, sanlock?) and change the hosted-engine configuration before the next run.
Currently nothing removes the lock on exit, since we just assume that the engine is going to refresh it for id=1, so we also have to remove the lock. But if we do that while the engine VM is running, the VM would probably get killed; and if we wait for the engine VM to be down, nothing prevents the engine from starting the auto-import process, since another storage domain could be in the restored DB as well.
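
For illustration only, option 1 boils down to freeing spm_id 1 in the restored database before the engine is started, e.g. with something like the following sketch (assuming the default "engine" database; part of the eventual fix shipped as the --he-remove-hosts option of engine-backup, see the Doc Text and the verification below):
   # run against the restored DB, before the engine is started
   su - postgres -c "psql engine -c 'UPDATE vds_spm_id_map SET vds_spm_id = (SELECT max(vds_spm_id) + 1 FROM vds_spm_id_map) WHERE vds_spm_id = 1;'"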

(Originally by Simone Tiraboschi)

Comment 4 rhev-integ 2017-02-15 12:19:18 UTC
Simone,

I have yet to write the KCS article on how to fix this. In your opinion, what would be the best course of action to avoid the HostedEngine VM getting killed?

I assume we need to get host_id in hosted-engine.conf in sync with vds_spm_id_map. So what about this:

1) Get the vds_spm_id_map
2) HE maintenance mode
3) Shutdown/Migrate HE, make it release the ids lock (if single host can't move to maintenance)
4) Adjust hosted-engine.conf
5) Cleanup metadata slots of old IDs
6) Reboot
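
Roughly, in commands (a sketch only; the "engine" database name and config path are the defaults, the host_id value is the one from this bug, and the metadata cleanup step is not shown):
   # 1) on the engine: get the current map
   su - postgres -c "psql engine -c 'select * from vds_spm_id_map;'"
   # 2) + 3) on the HE host: global maintenance, then shut the HE VM down so the ids lock is released
   hosted-engine --set-maintenance --mode=global
   hosted-engine --vm-shutdown
   # 4) make host_id match the host's vds_spm_id from the map (2 in this bug)
   sed -i 's/^host_id=.*/host_id=2/' /etc/ovirt-hosted-engine/hosted-engine.conf
   # 5) cleanup of the stale metadata slots (not shown here)
   # 6) reboot the host
   reboot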

Any better ideas?

I was thinking, one can also hit this when deploying additional hosts, right? Just use a host_id which is already used in vds_spm_id_map. No need to do all that I did in the reproduction steps. Right?

Thank you

(Originally by Germano Veit Michel)

Comment 5 rhev-integ 2017-02-15 12:19:22 UTC
(In reply to Germano Veit Michel from comment #3)
> I assume we need to get host_id in hosted-engine.conf in sync with
> vds_spm_id_map. So what about this:
> 
> 1) Get the vds_spm_id_map
> 2) HE maintenance mode
> 3) Shutdown/Migrate HE, make it release the ids lock (if single host can't
> move to maintenance)
> 4) Adjust hosted-engine.conf
> 5) Cleanup metadata slots of old IDs
> 6) Reboot
> 
> Any better ideas?

This is one option; the other is to change the spm_id of the conflicting hosts in the DB and reboot the host not involved in hosted-engine.

> I was thinking, one can also hit this when deploying additional hosts, right?
> Just use a host_id which is already used in vds_spm_id_map. No need to do
> all that I did in the reproduction steps. Right?

We deprecated (in 4.0) and removed the possibility to deploy additional hosted-engine hosts from the CLI; now the user can just add additional hosted-engine hosts from the engine (web UI or REST API), so we are sure that the spm_id will be consistent between the engine and the local configuration of the hosted-engine hosts.

(Originally by Simone Tiraboschi)

Comment 6 rhev-integ 2017-02-15 12:19:28 UTC
Simone, can we close this for the current release, given that we deprecated the CLI in 4.0 and this doesn't happen using the web UI?

(Originally by Sandro Bonazzola)

Comment 7 rhev-integ 2017-02-15 12:19:32 UTC
(In reply to Sandro Bonazzola from comment #5)
> Simone, can we close this for the current release, given that we deprecated the
> CLI in 4.0 and this doesn't happen using the web UI?

Hi Sandro,

As stated in comment #0, this can easily be hit in two scenarios:

A: Re-deploying HE (i.e. disaster recovery)
B: Moving from Standalone RHV-M to HE

And neither of them is related to deploying additional hosts via the Web UI.

Is there another BZ tracking the root cause of this (inappropriate spm id for the initial host)?

(Originally by Germano Veit Michel)

Comment 8 rhev-integ 2017-02-15 12:19:37 UTC
(In reply to Germano Veit Michel from comment #6)
> Is there another BZ tracking the root cause of this (inappropriate spm id
> for the initial host)?

No, we can work on this bug.

(Originally by Simone Tiraboschi)

Comment 10 Artyom 2017-02-28 07:26:23 UTC
Sandro, can you please check whether the fix really exists in build 4.0.7?
I can see that two patches attached to the bug do not support this:

http://gerrit.ovirt.org/71393 - not merged at all
http://gerrit.ovirt.org/72037 - merged only on the master branch

Comment 11 Sandro Bonazzola 2017-02-28 07:44:40 UTC
About https://gerrit.ovirt.org/#/c/72037/: it has Change-Id: I48e11a863b28b0c9a32f5a19b72392afbe717fe9 and it looks like it has not been merged, so I am moving the bug back to POST and moving the needinfo to Simone, who moved it to MODIFIED.

About http://gerrit.ovirt.org/71393: it has Change-Id: Ib7b10f57a9350adf3da73580c4a69e5ce317502e and, according to
https://gerrit.ovirt.org/#/q/Ib7b10f57a9350adf3da73580c4a69e5ce317502e,
it has been merged for 4.0.7.

Comment 12 Simone Tiraboschi 2017-02-28 10:45:03 UTC
Back to post

Comment 14 Artyom 2017-03-02 16:05:15 UTC
Verified on:
rhevm-4.0.7.4-0.1.el7ev.noarch
# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-3.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

1. Deploy HE environment

2. HE environment has two hosts, one HE and one non-HE
engine=# select vds_id, vds_name from vds_static;
                vds_id                |    vds_name     
--------------------------------------+-----------------
 1f6a216d-dff4-4b55-b5fb-a54cc2dca920 | alma06.qa.lab.tlv.redhat.com # non-HE host
 f3a839e1-1daf-4474-b4ac-5261e1818244 | hosted_engine_2 # HE host
(2 rows)

engine=# select * from vds_spm_id_map;
           storage_pool_id            | vds_spm_id |                vds_id                
--------------------------------------+------------+--------------------------------------
 00000001-0001-0001-0001-0000000001f7 |          1 | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 00000001-0001-0001-0001-0000000001f7 |          2 | f3a839e1-1daf-4474-b4ac-5261e1818244

3. Run engine-migrate-he.py script
# python engine-migrate-he.py 
Engine REST API url[https://nsednev-he-1.qa.lab.tlv.redhat.com/ovirt-engine/api]:
Engine REST API username[admin@internal]:
Engine REST API password:123456
Engine CA certificate file[/etc/pki/ovirt-engine/ca.pem]:
Putting host hosted_engine_1 to maintenance
Waiting for host to switch into Maintenance state
Waiting for host to switch into Maintenance state
Waiting for host to switch into Maintenance state
Host is in Maintenance state

5. Backup the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log

6. Copy the backup file from the HE VM to the host

7. Clean the HE host of the previous HE deployment (reprovisioning)

8. Run the HE deployment again

9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]? "
10. Log in to the HE VM and copy the backup file from the host to the HE VM

11. Run restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log  --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db

12. Run engine setup: # engine-setup --offline

13. Finish HE deployment process

The engine is UP, has the HE SD and the HE VM in the active state, and the DB tables look like:
engine=# select * from vds_spm_id_map;
           storage_pool_id            | vds_spm_id |                vds_id                
--------------------------------------+------------+--------------------------------------
 00000001-0001-0001-0001-0000000001f7 |          2 | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 00000001-0001-0001-0001-0000000001f7 |          1 | 563d36d0-dc48-4134-a350-23563f842131
(2 rows)

engine=# select vds_name, vds_id from vds_static;
           vds_name           |                vds_id                
------------------------------+--------------------------------------
 alma06.qa.lab.tlv.redhat.com | 1f6a216d-dff4-4b55-b5fb-a54cc2dca920
 hosted_engine_1              | 563d36d0-dc48-4134-a350-23563f842131

Comment 16 errata-xmlrpc 2017-03-16 15:33:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html

