Bug 1576923 - RFE: Ability to move master role to another domain without putting the domain to maintenance
Summary: RFE: Ability to move master role to another domain without putting the domain...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4-1
Target Release: 4.4.4
Assignee: shani
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1911597
Blocks:
 
Reported: 2018-05-10 17:57 UTC by Natalie Gavrielov
Modified: 2024-03-25 15:04 UTC
CC List: 25 users

Fixed In Version: ovirt-engine-4.4.4.7
Doc Type: Enhancement
Doc Text:
Previously, you could not migrate the master role to a newer domain without migrating the virtual machines from the old domain and putting it into maintenance mode. Additionally, you could not put a hosted_storage domain into maintenance mode.

With this release, you can use the REST API to move the master role to another storage domain without putting the domain into maintenance mode. For example, to set a storage domain with ID `456` as the master of a data center with ID `123`, send the following request:

----
POST /ovirt-engine/api/datacenters/123/setmaster
----

With a request body like this:

----
<action>
  <storage_domain id="456"/>
</action>
----

Alternatively, this example uses the name of the storage domain:

----
<action>
  <storage_domain>
    <name>my-nfs</name>
  </storage_domain>
</action>
----
Clone Of:
Environment:
Last Closed: 2021-02-02 14:00:17 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
switch master test cases logs and output (175.73 KB, application/zip), 2020-10-14 21:56 UTC, shani
Failed to verify 23.12.20 (183.53 KB, application/zip), 2020-12-23 11:29 UTC, Ilan Zuckerman
reconstruction of all SDs (144.63 KB, image/png), 2020-12-30 10:47 UTC, Shir Fishbain
Reconstructing master domain on Data Center error (22.96 KB, image/png), 2020-12-30 10:49 UTC, Shir Fishbain


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1620314 0 urgent CLOSED [downstream clone - 4.2.7] SHE disaster recovery is broken in new 4.2 deployments as hosted_storage is master 2022-03-13 16:19:07 UTC
Red Hat Knowledge Base (Solution) 34923 0 None None None 2020-05-15 01:46:19 UTC
Red Hat Product Errata RHSA-2021:0383 0 None None None 2021-02-02 14:00:27 UTC
oVirt gerrit 111225 0 master MERGED storage: Support manually setting a storage domain as a master 2021-02-15 06:58:02 UTC
oVirt gerrit 111228 0 master MERGED core: introduce SwitchMasterStorageDomainCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 111563 0 master MERGED API: Introduce switchMaster verb 2021-02-15 06:58:03 UTC
oVirt gerrit 112136 0 master MERGED core: introduce SwitchMasterStorageDomainVDSCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 112348 0 master MERGED core: add test for SwitchMasterStorageDomainCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 112495 0 master MERGED Upgrade to model 4.4.21 2021-02-15 06:58:02 UTC
oVirt gerrit 112823 0 master MERGED core: avoid switching master role for gluster storage domains 2021-02-15 06:58:04 UTC
oVirt gerrit 112854 0 master MERGED core: clear switchMaster end_successfully entities 2021-02-15 06:58:04 UTC
oVirt gerrit 112861 0 ovirt-engine-4.4.4.z MERGED core: avoid switching master role for gluster storage domains 2021-02-15 06:58:04 UTC

Internal Links: 1620314

Description Natalie Gavrielov 2018-05-10 17:57:46 UTC
Description of problem:
In order to move the master storage domain role to another storage domain, we need to put the current master into maintenance.
If the hosted engine is using that storage, the operation is blocked with the following message:

2018-05-05 04:00:23,859+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-56) [] Operation Failed: [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Right now this is blocking some of our automation tests (mainly reconstruct master).
It would help a lot if moving the master role did not require putting the domain into maintenance.

Comment 1 Yaniv Kaul 2018-05-10 18:58:12 UTC
What's the use case? Automation purely?

Comment 3 Natalie Gavrielov 2018-05-14 12:02:51 UTC
(In reply to Yaniv Kaul from comment #1)
> What's the use case? Automation purely?

Well, I don't think it's purely an automation issue, though I cannot think of a good real-world scenario.
It's just something that might pop up from the field in the near future.
If you have a storage domain that holds the hosted engine VM disks and it is also the master storage domain, you will never be able to switch the master role, because you can't put that storage domain into maintenance (it's running the hosted engine).

Comment 4 Yaniv Kaul 2018-05-14 12:41:25 UTC
I wonder what happens if you create (via the API) a domain with 'master' set.

Comment 5 Elad 2018-06-14 07:23:04 UTC
We've shifted our automation environments to HE (Ansible-deployed). Without the ability to change the master role of a storage domain without putting it into maintenance, reconstruct master cannot be tested. Therefore, marking as an AutomationBlocker.

Comment 7 Germano Veit Michel 2018-08-28 00:36:32 UTC
A use case for this is SHE-to-bare-metal migration, which is currently broken on fresh 4.2 SHE deployments because hosted_storage is the master.

It is also useful for SHE disaster recovery/backup+restore and for storage domain migration. Currently these are all broken if hosted_storage is the master.

Comment 8 Robert McSwain 2018-10-02 20:18:00 UTC
+1 to this. The HE storage domain is the master in almost all production instances of RHV at this point, so not being able to change the master role without putting the hosted_storage domain into maintenance causes issues whenever another domain needs to be chosen for one reason or another.

Comment 10 ladislav 2018-11-22 12:07:30 UTC
(In reply to Natalie Gavrielov from comment #3)
> (In reply to Yaniv Kaul from comment #1)
> > What's the use case? Automation purely?
> 
> Well, I don't think it's purely automation issue, though I cannot think of a
> good real-world scenario. 
> It's just something that might pop up from the field in the near future.
> In case you have a storage domain that holds the hosted engine VM disks and
> it's also the master storage domain you will never be able to switch this
> master role - because you can't put this storage domain to maintenance (it's
> running the hosted engine..)

Example of a real-world scenario:
We have 8 storage domains over 4 clusters (2 domains per cluster), with roughly 1000 VMs per cluster. The master domain is on the oldest storage (in fact this NFS cluster is small and not performing well). We would like to move the master role to bigger NFS hardware without impact, and without migrating 1000 VMs with over 24 TB of live data.

Comment 12 Miles Aronnax 2020-02-19 21:34:48 UTC
Like ladislav, I have a real-world scenario:

Our RHV/oVirt environment was set up with stand-alone iSCSI boxes. A few months ago, we added a Ceph cluster (also connected through iSCSI). Because the Ceph cluster has multiple physical boxes of redundancy, we want it to be the master instead of one of the stand-alone iSCSI boxes.

The 'put all other storage domains into maintenance' solution seems like a needless hassle. I would have to shut down (or migrate) around half of our 700+ VMs, which would involve stopping computation for dozens of researchers and bringing down various other services, and would likely hurt end-user perceptions of RHV/oVirt as an enterprise-class product.

Comment 14 Brian Sipos 2020-03-30 16:34:10 UTC
It would be valuable to have a workaround to manually choose the master domain, even if it is not exposed in the web UI and even if it requires running VDSM commands and/or engine DB manipulation.

I need to decommission an old replica-3 storage domain; I have several other replica-3 and replica-2+1 domains and would much prefer the master to be one of the replica-3 domains.

Comment 15 Marina Kalinin 2020-06-30 20:17:02 UTC
Switching component to downstream due to customer tickets attached.

Comment 16 Marina Kalinin 2020-06-30 20:20:14 UTC
Hi Shani,

I am wondering if this bug can be closed as a duplicate of bz#1836034?

Comment 17 shani 2020-07-01 08:49:51 UTC
It sounds like the same area but not the same issue.
IIUC, here you can't put an SD that is used by the hosted engine into maintenance (the operation is blocked),
while in bz#1836034, the DC moved into maintenance and the remaining SD wasn't picked as master even though it is active (not the result we'd expect).

Comment 18 Nir Soffer 2020-08-19 16:28:05 UTC
This is a complex and risky change, and it is unlikely to be ready for 4.4.3.

Comment 21 shani 2020-10-14 21:56:57 UTC
Created attachment 1721626 [details]
switch master test cases logs and output

Comment 24 Ilan Zuckerman 2020-12-23 11:27:47 UTC
This scenario could not be verified on rhv-4.4.4-6.

The engine fails to deactivate the master storage domain in a data center if the HE VM resides on the same SD.

2020-12-23 12:04:25,709+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed
: [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Attaching engine + vdsm logs from this Automated TC (TestCase6222)

Comment 25 Ilan Zuckerman 2020-12-23 11:29:17 UTC
Created attachment 1741520 [details]
Failed to verify 23.12.20

Comment 26 shani 2020-12-23 11:41:34 UTC
Hi Ilan,
The setmaster command doesn't deactivate the current master.
The switch master storage domain operation is executed while the domains are up and running.

It seems from the logs that the SwitchMasterStorageDomain command wasn't called at all.
Did you follow the steps for using the command as described here: https://gerrit.ovirt.org/#/c/ovirt-engine/+/111228/ ?
  POST /ovirt-engine/api/datacenters/123/setmaster

  With a request body like this:

  <action>
    <storage_domain id="456"/>
  </action>

I believe you encountered this one while trying to deactivate your storage domain:
https://bugzilla.redhat.com/show_bug.cgi?id=1402789
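
For illustration, a minimal Python sketch of sending the setmaster request described above with the requests library; the engine URL, credentials, CA path, and IDs are placeholders, not values from this bug:

# Minimal sketch of the setmaster call described above.
# Engine URL, credentials, CA path, and IDs below are placeholders.
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"          # data center whose master role is being switched
NEW_MASTER_ID = "456"  # storage domain that should become the new master

body = f'<action><storage_domain id="{NEW_MASTER_ID}"/></action>'

resp = requests.post(
    f"{ENGINE_API}/datacenters/{DC_ID}/setmaster",
    data=body,
    headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    auth=("admin@internal", "password"),    # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
)
resp.raise_for_status()
print(resp.text)  # XML describing the action result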

Comment 27 Evelina Shames 2020-12-27 15:41:54 UTC
Verified on rhv-4.4.4-6 with the following:
curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster
sd.xml:
<action>
    <storage_domain id="sd_id"/>
</action>


Master moved from old sd to new sd.
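
As a follow-up, a minimal Python sketch for checking which attached storage domain currently reports the master role; it assumes the attached-domain listing exposes a read-only <master> element in each storage domain, and the engine URL, credentials, and data center ID are placeholders:

# Hedged sketch: list storage domains attached to a data center and print the master flag.
# Assumes each <storage_domain> in the listing carries a read-only <master> element.
import xml.etree.ElementTree as ET
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"  # placeholder data center ID

resp = requests.get(
    f"{ENGINE_API}/datacenters/{DC_ID}/storagedomains",
    headers={"Accept": "application/xml"},
    auth=("admin@internal", "password"),    # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
)
resp.raise_for_status()

for sd in ET.fromstring(resp.content).findall("storage_domain"):
    print(sd.findtext("name"), "master =", sd.findtext("master"))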

Comment 28 Ilan Zuckerman 2020-12-28 14:47:19 UTC
After meeting with Evelina and Shani, I am removing the 'verified' flag and setting 'need_info' for the following reason:

After attempting to migrate the master to another SD of the same storage type (for example gluster1 to gluster2), the migration fails and the source SD is reconstructed.
At some point the SPM host has a 'stuck' task. Clearing it does not resolve the issue.

[root@storage-ge5-vdsm1 ~]# vdsm-client Host getAllTasksStatuses
{
    "6fb0c398-a64a-4b1c-b00c-fa328ab9696f": {
        "code": 100,
        "message": "value=(1, 0, b'', b'') abortedcode=100",
        "taskID": "6fb0c398-a64a-4b1c-b00c-fa328ab9696f",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }
}
[root@storage-ge5-vdsm1 ~]# 
[root@storage-ge5-vdsm1 ~]# 
[root@storage-ge5-vdsm1 ~]# vdsm-client Task clear taskID=6fb0c398-a64a-4b1c-b00c-fa328ab9696f



Steps to reproduce:
Issue the command to switch the master a few times (from one SD to another) on the same storage type.
This particular issue was discovered when using gluster SDs, and NFS.
On the second or third attempt you will hit this issue. Just tail the logs.
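
A minimal Python sketch of such a repro loop, alternating the master role between two domains of the same storage type; engine details and IDs are placeholders, and the sleep is only a crude way to let each switch settle while tailing the logs:

# Hedged repro sketch: alternate the master role between two storage domains a few times.
# Engine URL, credentials, CA path, and all IDs are placeholders.
import time
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"
SD_A, SD_B = "456", "789"  # two data domains of the same storage type

def set_master(sd_id):
    body = f'<action><storage_domain id="{sd_id}"/></action>'
    resp = requests.post(
        f"{ENGINE_API}/datacenters/{DC_ID}/setmaster",
        data=body,
        headers={"Content-Type": "application/xml", "Accept": "application/xml"},
        auth=("admin@internal", "password"),    # placeholder credentials
        verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
    )
    resp.raise_for_status()

for i in range(4):  # the failure above showed up on the second or third switch
    target = SD_A if i % 2 == 0 else SD_B
    set_master(target)
    time.sleep(120)  # crude wait for the switch to finish; tail engine/vdsm logs meanwhile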

Adding SPM vdsm log + engine log

Engine:

2020-12-28 14:25:49,669+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-12-28 14:25:49,678+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: value=(1, 0, b'', b'') abortedcode=100



SPM host:

2020-12-28 14:27:31,149+0200 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='4552b855-99a7-4c96-9584-b2c11f382bf0') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-33>", line 2, in connectStoragePool
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1058, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1098, in _connectStoragePool
    masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1078, in _updateStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1372, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1294, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1519, in setMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
vdsm.storage.exception.StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=a0f7d7f8-65bb-4185-9986-f68697ff4ad6, pool=dfad0b2f-c1e9-4a0d-9ede-30414f6bee36'

Comment 29 Ilan Zuckerman 2020-12-30 09:23:29 UTC
Opened a new bug about this issue described in comment #28
https://bugzilla.redhat.com/show_bug.cgi?id=1911597

Comment 30 Shir Fishbain 2020-12-30 10:44:11 UTC
The customers assigned to this RFE are using RHHI and Gluster; due to bug https://bugzilla.redhat.com/show_bug.cgi?id=1911597 we cannot confirm and verify it.
The functionality is broken when switching the master SD from one gluster SD to another gluster SD: stuck tasks in the engine DB and on the SPM host lead to an infinite (or very long) reconstruction of all SDs (screenshot attached).
A "Reconstructing master domain on Data Center" error appears when moving the master SD from gluster to gluster (screenshot attached).
We must revert the patches related to this RFE ASAP until bug 1911597 is fixed. Please do it.

Comment 31 Shir Fishbain 2020-12-30 10:47:20 UTC
Created attachment 1743181 [details]
reconstruction of all SDs

Comment 32 Shir Fishbain 2020-12-30 10:49:16 UTC
Created attachment 1743182 [details]
Reconstructing master domain on Data Center error

Comment 38 Evelina Shames 2021-01-11 09:07:31 UTC
Verified blocked operation for gluster on rhv-4.4.4-7 with the following:
curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster
sd.xml:
<action>
    <storage_domain id="sd_id"/>
</action>

Result:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
    <detail>[Cannot switch master storage domain. Switch master storage domain is not supported for gluster-based domains.]</detail>
    <reason>Operation Failed</reason>
</fault>

Moving to 'Verified'

Comment 44 errata-xmlrpc 2021-02-02 14:00:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV-M (ovirt-engine) 4.4.z security, bug fix, enhancement update [ovirt-4.4.4] 0-day) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0383

