Description of problem:

The patch below, backported to 4.1.4, bumps the spm_id of any host that has spm_id=1 in the newly restored DB during HE restore (before engine-setup runs). This assumes that the host will simply switch to the new ID once the new engine comes up and sends it a ConnectStoragePool command with the new host ID. This is not true: the host refuses to connect to the pool with the new ID while it is already connected to the pool with the old ID.

Use case:
1. Host X has spm_id=1, is the current SPM, and is not a HE host.
2. Backup and restore HE (DR).
3. Host X is not removed from the DB; it is kept up with VMs running and holds the SPM role.
4. The new engine is restored; the host's spm_id is bumped to 2.
5. The new environment is down: Host X refuses to connect to the storage pool with ID 2. It is currently holding ID 1, with VMs running and the SDM lease acquired with ID 1.

NOTE: non-HE hosts are not meant to be removed or re-installed during the HE restore process. Either the host must gracefully switch to the new ID, or this ID change cannot be done.

Version-Release number of selected component (if applicable):
4.1.9

How reproducible:
100%

Steps to Reproduce:
1. Have a DB with a non-HE host with SPM_ID 1
2. Run:
# engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db

Actual results:
Data-Center is down post HE restore. The host now has SPM_ID 2.

Expected results:
Data-Center is up post HE restore.

Additional info:

Commit message of the patch:

he: Ensures that there will be no spm_id=1 host after restore.

Restoring the HE environment from backup could lead to two hosts thinking
they have spm_id=1 at the same moment. One of them would be the newly
deployed HE host with the default spm_id=1, the other an old host already
present in the database. This patch changes the spm_id of the host with
value '1' to some unused value.

Bug-Url: https://bugzilla.redhat.com/1417518

vdsm log on the host refusing the new ID:

2018-02-22 13:43:58,237+0100 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='9fd75b4b-e64c-4e09-a882-dbeb57dfc30f') Unexpected error (task:872)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 879, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 983, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1023, in _connectStoragePool
    masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 989, in _updateStoragePool
    "hostId=%s, newHostId=%s" % (pool.id, hostId))
StoragePoolConnected: Cannot perform action while storage pool is connected: ('hostId=1, newHostId=2',)
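For clarity, here is a minimal sketch (not the actual vdsm source; names and structure are simplified) of the check that produces the StoragePoolConnected error in the traceback above: once a host is connected to the storage pool with one host ID, a connectStoragePool request carrying a different ID is rejected, because the pool lease was acquired with the old ID.

    # Simplified illustration of vdsm's behaviour, not its real code.
    class StoragePoolConnected(Exception):
        """Stand-in for vdsm's exception of the same name."""

    class ConnectedPool(object):
        def __init__(self, sp_uuid, host_id):
            self.spUUID = sp_uuid
            self.id = host_id      # host ID used when the pool was connected

    def update_storage_pool(pool, new_host_id):
        # Refusing to silently switch IDs on a live connection is the safe
        # choice: the host still holds leases taken with the old ID.
        if pool.id != new_host_id:
            raise StoragePoolConnected(
                "Cannot perform action while storage pool is connected: "
                "hostId=%s, newHostId=%s" % (pool.id, new_host_id))
        # otherwise the pool metadata is simply refreshed with the same ID

    # Host X is connected with host ID 1; the restored engine asks for ID 2.
    pool = ConnectedPool("<pool-uuid>", 1)
    update_storage_pool(pool, 2)   # raises StoragePoolConnected, as in the log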
Unfortunately there is not much we can do about this on the technical side. As you pointed out, we cannot change the SPM ID on a live host, so we would have to force it into maintenance mode, but the engine is down at that point.

The user could potentially append an answer file with something like

[environment:default]
OVEHOSTED_STORAGE/hostID=int:N

to hosted-engine-setup to deploy the first host with a custom SPM ID, but N has to exactly match the SPM_ID that the engine will choose for that host, and that is not simple:
- we cannot ask the engine, since there is no running engine at that stage
- we cannot easily check the DB, since we only have a backup, which usually contains a DB dump in binary format

We already have a migration helper script that has to be run before migrating to hosted-engine. As for https://gerrit.ovirt.org/#/c/73238/, it will set the host with SPM_ID=1 to maintenance mode, but it requires the original bare-metal engine to still be up.
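For illustration only, the workaround described above would look roughly like this; the file name and the value 2 are placeholders, and the hard part remains guessing the SPM_ID the engine will later assign to that host. A hypothetical extra answer file, e.g. /root/custom-spm-id.conf:

[environment:default]
OVEHOSTED_STORAGE/hostID=int:2

which would then be merged into the deployment, assuming the --config-append option of hosted-engine is used:

# hosted-engine --deploy --config-append=/root/custom-spm-id.conf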
(In reply to Simone Tiraboschi from comment #1)
> The user could potentially append an answer file with something like
>
> [environment:default]
> OVEHOSTED_STORAGE/hostID=int:N
>
> to hosted-engine-setup to deploy the first host with a custom SPM ID, but N
> has to exactly match the SPM_ID that the engine will choose for that host,
> and that is not simple:
> - we cannot ask the engine, since there is no running engine at that stage
> - we cannot easily check the DB, since we only have a backup, which usually
>   contains a DB dump in binary format

Yes, the KCS solution on how to do this used to contain steps to pick ID=1 for redeploy, but we removed those steps because they were no longer needed. Maybe we should put them back to reduce the chances of problems like this.

> We already have a migration helper script that has to be run before
> migrating to hosted-engine. As for https://gerrit.ovirt.org/#/c/73238/, it
> will set the host with SPM_ID=1 to maintenance mode, but it requires the
> original bare-metal engine to still be up.

Right, that could have been done in this exact case, but it does not help for disaster recovery or other specific cases.

From what I can see, this whole host/SPM ID issue on HE restore is still very tricky. We need an easier way for customers to do this: it is too complicated, with too many corner cases that simply fail. Especially for DR, it must be something that brings the environment up quickly and reliably. Maybe use this bug to investigate a better solution for all these cases?
(In reply to Germano Veit Michel from comment #2)
> Yes, the KCS solution on how to do this used to contain steps to pick ID=1
> for redeploy, but we removed those steps because they were no longer
> needed. Maybe we should put them back to reduce the chances of problems
> like this.

Yes, I agree.

> From what I can see, this whole host/SPM ID issue on HE restore is still
> very tricky. We need an easier way for customers to do this: it is too
> complicated, with too many corner cases that simply fail. Especially for
> DR, it must be something that brings the environment up quickly and
> reliably. Maybe use this bug to investigate a better solution for all
> these cases?

The new node-zero ansible flow will be much safer on this side: we no longer have to trigger direct vdsm actions with an explicit SPM_ID at setup time. We still have to improve that flow so that it can automatically inject a backup file to be restored on the local engine VM.
Works for me on these components:

ovirt-engine-4.2.3.3-0.1.el7.noarch
rhvm-appliance-4.2-20180427.0.el7.noarch
ovirt-hosted-engine-setup-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1471