Bug 1548891 - After HE restore DC is down due to bumped spm_id of non HE-host (current SPM)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.1.9
Hardware: x86_64  OS: Linux
Priority: unspecified  Severity: urgent
Target Milestone: ovirt-4.2.2
Target Release: ---
Assigned To: Simone Tiraboschi
QA Contact: Nikolai Sednev
Keywords: Triaged
Depends On:
Blocks:
 
Reported: 2018-02-25 18:26 EST by Germano Veit Michel
Modified: 2018-05-15 13:33 EDT
CC: 12 users

See Also:
Fixed In Version: ovirt-hosted-engine-setup-2.2.15
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 13:32:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3363521 None None None 2018-02-25 18:46 EST
oVirt gerrit 88466 master MERGED ansible: spmid: detect spmid from VDSM 2018-03-05 07:39 EST
oVirt gerrit 88476 ovirt-hosted-engine-setup-2.2 MERGED ansible: spmid: detect spmid from VDSM 2018-03-05 10:40 EST
oVirt gerrit 88556 master MERGED ansible: cli: skip uneffective questions 2018-03-09 09:28 EST
oVirt gerrit 88734 ovirt-hosted-engine-setup-2.2 MERGED ansible: cli: skip uneffective questions 2018-03-09 09:39 EST
Red Hat Product Errata RHBA-2018:1471 None None None 2018-05-15 13:33 EDT

Description Germano Veit Michel 2018-02-25 18:26:47 EST
Description of problem:

During HE restore, the patch below (backported to 4.1.4) bumps the spm_id of any host that has spm_id=1 in the newly restored DB (before engine-setup runs).

This assumes that a host will simply switch to the new ID once the new engine comes up and sends it a ConnectStoragePool command with the new host id. This is not true: the host will refuse to connect to the pool with the new id while it is already connected to the pool with the old id.

Use case:
1. Host X has spm_id=1; it is the current SPM and not an HE host.
2. Backup and restore HE (DR)
3. Host X is not removed from the DB, is kept up with VMs running, and has the SPM role.
4. The new engine is restored; Host X's spm_id in the DB is bumped to 2.
5. The new environment is down: Host X refuses to connect to the storage pool with id 2, since it currently holds id 1, with VMs running and the SDM lease acquired with id 1.

NOTE: non-HE hosts are not meant to be removed or re-installed during the HE restore process. Either the host must gracefully switch to the new ID, or this ID change cannot be done.
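
A quick way to see which host is exposed is to check the spm_id mapping in the engine DB before the restore. This is only a sketch: the DB name "engine" and the vds_spm_id_map/vds_static table and column names are assumptions based on a typical engine schema.

  # Sketch only: table/column names (vds_spm_id_map, vds_static) are assumed, verify first.
  su - postgres -c 'psql engine -c "
      SELECT s.vds_name, m.vds_spm_id, m.storage_pool_id
        FROM vds_spm_id_map m
        JOIN vds_static s ON s.vds_id = m.vds_id
       ORDER BY m.vds_spm_id;"'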

Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
100%

Steps to Reproduce:
1. Have a DB with a non HE host with SPM_ID 1
2. Run:
# engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db

Actual results:
Data-Center is down post HE Restore. Host now has SPM_ID 2.

Expected results:
Data-Center is up post HE Restore

Additional info:

he: Ensures that there will be no spm_id=1 host after restore.
    
    Restoring HE environment from backup could lead into
    two hosts, thinking they have spm_id==1 at same moment.
    One of those hosts will be newly deployed HE host with
    default spm_id==1, another one host will be old
    host from the database. This patch changes
    spm_id of the host with value '1' to some unused value.
    
    Bug-Url: https://bugzilla.redhat.com/1417518
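
For clarity, the bump described above amounts to something like the following against the restored DB. This is an illustration only, not a command to run: the real change is performed by the restore tooling, and vds_spm_id_map is an assumed table name.

  # Illustration only -- do not run. Shows the effective change; a real
  # statement would also scope by storage_pool_id.
  su - postgres -c 'psql engine -c "
      UPDATE vds_spm_id_map
         SET vds_spm_id = (SELECT max(vds_spm_id) + 1 FROM vds_spm_id_map)
       WHERE vds_spm_id = 1;"'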


2018-02-22 13:43:58,237+0100 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='9fd75b4b-e64c-4e09-a882-dbeb57dfc30f') Unexpected error (task:872)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 879, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 983, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1023, in _connectStoragePool
    masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 989, in _updateStoragePool
    "hostId=%s, newHostId=%s" % (pool.id, hostId))
StoragePoolConnected: Cannot perform action while storage pool is connected: ('hostId=1, newHostId=2',)
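
This refusal comes from a guard in VDSM's HSM layer. The following is a simplified Python sketch of the check implied by the traceback above, not the actual vdsm code:

  # Simplified sketch of the check implied by the traceback above -- not the
  # actual vdsm implementation. A pool already connected with one host id
  # cannot be re-connected with a different one without disconnecting first.
  class StoragePoolConnected(Exception):
      pass

  def update_storage_pool(pool, host_id):
      # pool, is_connected and host_id are stand-ins for vdsm's internal state
      if pool.is_connected and pool.host_id != host_id:
          raise StoragePoolConnected(
              "Cannot perform action while storage pool is connected: "
              "hostId=%s, newHostId=%s" % (pool.host_id, host_id))
      # otherwise the (unchanged) host id is simply reused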
Comment 1 Simone Tiraboschi 2018-02-26 08:42:40 EST
Unfortunately there is not much we can do about this on the technical side.
As you pointed out, we cannot change the SPM ID on a live host, so we would have to force it into maintenance mode, but the engine is down at that point.

The user could potentially append an answer file with something like 
  [environment:default]
  OVEHOSTED_STORAGE/hostID=int:N
to hosted-engine-setup to deploy the first host with a custom SPM ID, but N has to exactly match the SPM_ID that the engine will choose for that host, and that is not simple:
- we cannot ask the engine since there is no running engine at that stage
- we cannot easily check in the DB since we have just a backup that usually contains a DB dump in binary format

We already have a migration helper script that has to be run before migrating to hosted-engine. As of https://gerrit.ovirt.org/#/c/73238/ it sets the host with SPM_ID=1 into maintenance mode, but it requires the original bare-metal engine to still be up.
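
For reference, that workaround would look roughly like this. It is only a sketch: the answer-file path is made up, N still has to match the SPM_ID the engine will pick, and it assumes hosted-engine accepts the extra file via --config-append.

  # /root/spm-id.conf -- hypothetical extra answer file
  [environment:default]
  OVEHOSTED_STORAGE/hostID=int:N

  # assumes --config-append accepts an additional answer file
  hosted-engine --deploy --config-append=/root/spm-id.conf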
Comment 2 Germano Veit Michel 2018-03-04 18:00:10 EST
(In reply to Simone Tiraboschi from comment #1)
> Unfortunately there is not much we can do about this on the technical side.
> As you pointed out, we cannot change the SPM ID on a live host, so we would
> have to force it into maintenance mode, but the engine is down at that point.
> 
> The user could potentially append an answer file with something like 
>   [environment:default]
>   OVEHOSTED_STORAGE/hostID=int:N
> to hosted-engine-setup to deploy the first host with a custom SPM ID, but N
> has to exactly match the SPM_ID that the engine will choose for that host,
> and that is not simple:
> - we cannot ask the engine since there is no running engine at that stage
> - we cannot easily check in the DB since we have just a backup that usually
> contains a DB dump in binary format

Yes, the KCS Solution on how to do this used to contain some steps to pick ID=1 for redeploy, but we removed those steps as they were not needed anymore. Maybe we should put them back to reduce the chances of problems like this.

> We already have a migration helper script that has to be run before
> migrating to hosted-engine. As of https://gerrit.ovirt.org/#/c/73238/ it
> sets the host with SPM_ID=1 into maintenance mode, but it requires the
> original bare-metal engine to still be up.

Right, this could have been done in this exact case, but it is not useful for disaster recovery or other specific cases.

From what I can see, this whole host/SPM ID issue on HE restore is still very tricky. We need an easier way for customers to do this: it's too complicated, with too many corner cases that simply fail. Especially for DR, it must be something that brings the environment up quickly and reliably. Maybe use this bug to investigate a better solution for all these cases?
Comment 3 Simone Tiraboschi 2018-03-05 05:27:47 EST
(In reply to Germano Veit Michel from comment #2)
> Yes, the KCS Solution on how to do this used to contain some steps to pick
> ID=1 for redeploy, but we removed those steps as they were not needed
> anymore. Maybe we should put them back to reduce the chances of problems
> like this.

Yes, I agree.

> From what I can see, this whole host/SPM ID issue on HE restore is still
> very tricky. We need an easier way for customers to do this: it's too
> complicated, with too many corner cases that simply fail. Especially for DR,
> it must be something that brings the environment up quickly and reliably.
> Maybe use this bug to investigate a better solution for all these cases?

The new node-zero ansible flow will be much safer in this respect: we don't have to trigger direct vdsm actions with an explicit SPM_ID at setup time.
We still have to improve that flow so it can automatically inject a backup file to be restored on the local engine VM.
Comment 9 Nikolai Sednev 2018-05-03 10:20:02 EDT
Works for me on these components:
ovirt-engine-4.2.3.3-0.1.el7.noarch
rhvm-appliance-4.2-20180427.0.el7.noarch
ovirt-hosted-engine-setup-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Comment 12 errata-xmlrpc 2018-05-15 13:32:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1471
