Bug 1576923 - RFE: Ability to move master role to another domain without putting the domain to maintenance
Summary: RFE: Ability to move master role to another domain without putting the domain...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4-1
Target Release: 4.4.4
Assignee: shani
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1911597
Blocks:
 
Reported: 2018-05-10 17:57 UTC by Natalie Gavrielov
Modified: 2024-03-25 15:04 UTC
CC List: 25 users

Fixed In Version: ovirt-engine-4.4.4.7
Doc Type: Enhancement
Doc Text:
Previously, you could not migrate the master role to a newer domain without migrating the virtual machines from the old domain and putting it into maintenance mode. Additionally, you could not put a hosted_storage domain into maintenance mode.

With this release, you can use the REST API to move the master role to another storage domain without putting the domain into maintenance mode. For example, to set a storage domain with ID `456` as the master of a data center with ID `123`, send the following request:

----
POST /ovirt-engine/api/datacenters/123/setmaster
----

With a request body like this:

----
<action>
  <storage_domain id="456"/>
</action>
----

Alternatively, this example uses the name of the storage domain:

----
<action>
  <storage_domain>
    <name>my-nfs</name>
  </storage_domain>
</action>
----
Clone Of:
Environment:
Last Closed: 2021-02-02 14:00:17 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
switch master test cases logs and output (175.73 KB, application/zip), 2020-10-14 21:56 UTC, shani
Failed to verify 23.12.20 (183.53 KB, application/zip), 2020-12-23 11:29 UTC, Ilan Zuckerman
reconstruction of all SDs (144.63 KB, image/png), 2020-12-30 10:47 UTC, Shir Fishbain
Reconstructing master domain on Data Center error (22.96 KB, image/png), 2020-12-30 10:49 UTC, Shir Fishbain


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1620314 0 urgent CLOSED [downstream clone - 4.2.7] SHE disaster recovery is broken in new 4.2 deployments as hosted_storage is master 2022-03-13 16:19:07 UTC
Red Hat Knowledge Base (Solution) 34923 0 None None None 2020-05-15 01:46:19 UTC
Red Hat Product Errata RHSA-2021:0383 0 None None None 2021-02-02 14:00:27 UTC
oVirt gerrit 111225 0 master MERGED storage: Support manually setting a storage domain as a master 2021-02-15 06:58:02 UTC
oVirt gerrit 111228 0 master MERGED core: introduce SwitchMasterStorageDomainCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 111563 0 master MERGED API: Introduce switchMaster verb 2021-02-15 06:58:03 UTC
oVirt gerrit 112136 0 master MERGED core: introduce SwitchMasterStorageDomainVDSCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 112348 0 master MERGED core: add test for SwitchMasterStorageDomainCommand 2021-02-15 06:58:02 UTC
oVirt gerrit 112495 0 master MERGED Upgrade to model 4.4.21 2021-02-15 06:58:02 UTC
oVirt gerrit 112823 0 master MERGED core: avoid switching master role for gluster storage domains 2021-02-15 06:58:04 UTC
oVirt gerrit 112854 0 master MERGED core: clear switchMaster end_successfully entities 2021-02-15 06:58:04 UTC
oVirt gerrit 112861 0 ovirt-engine-4.4.4.z MERGED core: avoid switching master role for gluster storage domains 2021-02-15 06:58:04 UTC

Internal Links: 1620314

Description Natalie Gavrielov 2018-05-10 17:57:46 UTC
Description of problem:
In order to move the master storage domain role to another storage domain, we need to put the current master into maintenance.
If the hosted engine is using that storage, the operation is blocked with the following message:

2018-05-05 04:00:23,859+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-56) [] Operation Failed: [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Right now this is blocking some of our automation tests (mainly reconstruct master).
It would help a lot if moving the master role did not require putting the domain into maintenance.

Comment 1 Yaniv Kaul 2018-05-10 18:58:12 UTC
What's the use case? Automation purely?

Comment 3 Natalie Gavrielov 2018-05-14 12:02:51 UTC
(In reply to Yaniv Kaul from comment #1)
> What's the use case? Automation purely?

Well, I don't think it's purely an automation issue, though I cannot think of a good real-world scenario.
It's just something that might pop up from the field in the near future.
If you have a storage domain that holds the hosted engine VM disks and it is also the master storage domain, you will never be able to switch the master role, because you can't put that storage domain into maintenance (it's running the hosted engine).

Comment 4 Yaniv Kaul 2018-05-14 12:41:25 UTC
I wonder what happens if you create (via the API) a domain with 'master' set.

Comment 5 Elad 2018-06-14 07:23:04 UTC
We've shifted our automation environments to HE (Ansible-deployed). Without the ability to change the master role of a storage domain without putting it into maintenance, reconstruct master cannot be tested. Therefore, marking as an AutomationBlocker.

Comment 7 Germano Veit Michel 2018-08-28 00:36:32 UTC
A use case for this is SHE-to-bare-metal migration, which is currently broken on fresh 4.2 SHE deployments because hosted_storage is the master.

It is also useful for SHE disaster recovery/backup+restore and for storage domain migration. Currently these are all broken if hosted_storage is the master.

Comment 8 Robert McSwain 2018-10-02 20:18:00 UTC
+1 to this. The HE storage domain is the master in almost all production instances of RHV at this point, so not being able to change the master role without putting the hosted_storage domain into maintenance causes issues whenever another domain needs to be chosen for one reason or another.

Comment 10 ladislav 2018-11-22 12:07:30 UTC
(In reply to Natalie Gavrielov from comment #3)
> (In reply to Yaniv Kaul from comment #1)
> > What's the use case? Automation purely?
> 
> Well, I don't think it's purely automation issue, though I cannot think of a
> good real-world scenario. 
> It's just something that might pop up from the field in the near future.
> In case you have a storage domain that holds the hosted engine VM disks and
> it's also the master storage domain you will never be able to switch this
> master role - because you can't put this storage domain to maintenance (it's
> running the hosted engine..)

Example of a real-world scenario:
We have 8 storage domains over 4 clusters (2 domains per cluster), with roughly 1000 VMs per cluster. The master domain is on the oldest storage (in fact this NFS cluster is small and not performing well). We would like to move the master role to bigger NFS hardware without impact, and without migrating 1000 VMs with over 24 TB of live data.

Comment 12 Miles Aronnax 2020-02-19 21:34:48 UTC
Like ladislav, I have a real-world scenario:

Our RHV/oVirt environment was set up with stand-alone iSCSI boxes. A few months ago, we added a Ceph cluster (also connected through iSCSI). Because the Ceph cluster has multiple physical boxes of redundancy, we want it to be the master instead of one of the stand-alone iSCSI boxes.

The 'put all other storage domains into maintenance' solution seems like a needless hassle. I would have to shut down (or migrate) around half of our 700+ VMs, which would involve stopping computation for dozens of researchers and bringing down various other services, and would likely hurt end-user perceptions of RHV/oVirt as an enterprise-class product.

Comment 14 Brian Sipos 2020-03-30 16:34:10 UTC
It would be valuable to have a workaround to manually choose the master domain, even if it is not exposed in the web UI and even if it requires running VDSM commands and/or engine DB manipulation.

I need to decommission an old replica-3 storage domain; I have several other replica-3 and replica-2+1 domains and would much prefer the master to be one of the replica-3 domains.

Comment 15 Marina Kalinin 2020-06-30 20:17:02 UTC
Switching component to downstream due to customer tickets attached.

Comment 16 Marina Kalinin 2020-06-30 20:20:14 UTC
Hi Shani,

I am wondering if this bug can be closed as a duplicate of bz#1836034?

Comment 17 shani 2020-07-01 08:49:51 UTC
It sounds like the same area but not the same issue.
IIUC, here you can't put an SD that is used by the hosted engine into maintenance (the operation is blocked),
while in bz#1836034, the DC moved into maintenance and the remaining SD wasn't picked as master even though it is active (not the result we'd expect).

Comment 18 Nir Soffer 2020-08-19 16:28:05 UTC
This is a complex and risky change, and it is unlikely to be ready for 4.4.3.

Comment 21 shani 2020-10-14 21:56:57 UTC
Created attachment 1721626 [details]
switch master test cases logs and output

Comment 24 Ilan Zuckerman 2020-12-23 11:27:47 UTC
This scenario could not be verified on rhv-4.4.4-6.

The engine fails to deactivate the master storage domain in a data center if the HE VM resides on the same SD.

2020-12-23 12:04:25,709+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed
: [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Attaching engine + vdsm logs from this Automated TC (TestCase6222)

Comment 25 Ilan Zuckerman 2020-12-23 11:29:17 UTC
Created attachment 1741520 [details]
Failed to verify 23.12.20

Comment 26 shani 2020-12-23 11:41:34 UTC
Hi Ilan,
The setmaster command doesn't deactivate the current master.
The switch master storage domain operation is executed while the domains are up and running.

It seems from the logs that the SwitchMasterStorageDomain command wasn't called at all.
Did you follow the steps for using the command as described here: https://gerrit.ovirt.org/#/c/ovirt-engine/+/111228/ ?
  POST /ovirt-engine/api/datacenters/123/setmaster

  With a request body like this:

  <action>
    <storage_domain id="456"/>
  </action>

I believe you encountered this one while trying to deactivate your storage domain:
https://bugzilla.redhat.com/show_bug.cgi?id=1402789
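
For illustration, a minimal Python sketch of sending the setmaster request described above with the requests library; the engine URL, credentials, CA path, and IDs are placeholders, not values from this bug:

# Minimal sketch of the setmaster call described above.
# Engine URL, credentials, CA path, and IDs below are placeholders.
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"          # data center whose master role is being switched
NEW_MASTER_ID = "456"  # storage domain that should become the new master

body = f'<action><storage_domain id="{NEW_MASTER_ID}"/></action>'

resp = requests.post(
    f"{ENGINE_API}/datacenters/{DC_ID}/setmaster",
    data=body,
    headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    auth=("admin@internal", "password"),    # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
)
resp.raise_for_status()
print(resp.text)  # XML describing the action result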

Comment 27 Evelina Shames 2020-12-27 15:41:54 UTC
Verified on rhv-4.4.4-6 with the following:
curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster
sd.xml:
<action>
    <storage_domain id="sd_id"/>
</action>


Master moved from old sd to new sd.
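
As a follow-up, a minimal Python sketch for checking which attached storage domain currently reports the master role; it assumes the attached-domain listing exposes a read-only <master> element in each storage domain, and the engine URL, credentials, and data center ID are placeholders:

# Hedged sketch: list storage domains attached to a data center and print the master flag.
# Assumes each <storage_domain> in the listing carries a read-only <master> element.
import xml.etree.ElementTree as ET
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"  # placeholder data center ID

resp = requests.get(
    f"{ENGINE_API}/datacenters/{DC_ID}/storagedomains",
    headers={"Accept": "application/xml"},
    auth=("admin@internal", "password"),    # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
)
resp.raise_for_status()

for sd in ET.fromstring(resp.content).findall("storage_domain"):
    print(sd.findtext("name"), "master =", sd.findtext("master"))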

Comment 28 Ilan Zuckerman 2020-12-28 14:47:19 UTC
After meeting with Evelina and Shani, I am removing the 'verified' flag and setting 'need_info' for the following reason:

After attempting to migrate the master to another SD of the same storage type (for example gluster1 to gluster2), the migration fails and the source SD is reconstructed.
At some point the SPM host has a 'stuck' task. Clearing it does not resolve the issue.

[root@storage-ge5-vdsm1 ~]# vdsm-client Host getAllTasksStatuses
{
    "6fb0c398-a64a-4b1c-b00c-fa328ab9696f": {
        "code": 100,
        "message": "value=(1, 0, b'', b'') abortedcode=100",
        "taskID": "6fb0c398-a64a-4b1c-b00c-fa328ab9696f",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }
}
[root@storage-ge5-vdsm1 ~]# 
[root@storage-ge5-vdsm1 ~]# 
[root@storage-ge5-vdsm1 ~]# vdsm-client Task clear taskID=6fb0c398-a64a-4b1c-b00c-fa328ab9696f



Steps to reproduce:
Issue the command to switch the master a few times (from one SD to another) on the same storage type.
This particular issue was discovered when using gluster SDs, and NFS.
On the second or third attempt you will hit this issue. Just tail the logs.
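
A minimal Python sketch of such a repro loop, alternating the master role between two domains of the same storage type; engine details and IDs are placeholders, and the sleep is only a crude way to let each switch settle while tailing the logs:

# Hedged repro sketch: alternate the master role between two storage domains a few times.
# Engine URL, credentials, CA path, and all IDs are placeholders.
import time
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"
DC_ID = "123"
SD_A, SD_B = "456", "789"  # two data domains of the same storage type

def set_master(sd_id):
    body = f'<action><storage_domain id="{sd_id}"/></action>'
    resp = requests.post(
        f"{ENGINE_API}/datacenters/{DC_ID}/setmaster",
        data=body,
        headers={"Content-Type": "application/xml", "Accept": "application/xml"},
        auth=("admin@internal", "password"),    # placeholder credentials
        verify="/etc/pki/ovirt-engine/ca.pem",  # placeholder CA certificate path
    )
    resp.raise_for_status()

for i in range(4):  # the failure above showed up on the second or third switch
    target = SD_A if i % 2 == 0 else SD_B
    set_master(target)
    time.sleep(120)  # crude wait for the switch to finish; tail engine/vdsm logs meanwhile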

Adding SPM vdsm log + engine log

Engine:

2020-12-28 14:25:49,669+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-12-28 14:25:49,678+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: value=(1, 0, b'', b'') abortedcode=100



SPM host:

2020-12-28 14:27:31,149+0200 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='4552b855-99a7-4c96-9584-b2c11f382bf0') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-33>", line 2, in connectStoragePool
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1058, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1098, in _connectStoragePool
    masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1078, in _updateStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1372, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1294, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1519, in setMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
vdsm.storage.exception.StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=a0f7d7f8-65bb-4185-9986-f68697ff4ad6, pool=dfad0b2f-c1e9-4a0d-9ede-30414f6bee36'

Comment 29 Ilan Zuckerman 2020-12-30 09:23:29 UTC
Opened a new bug about this issue described in comment #28
https://bugzilla.redhat.com/show_bug.cgi?id=1911597

Comment 30 Shir Fishbain 2020-12-30 10:44:11 UTC
The customers assigned to this RFE are using RHHI and Gluster; due to bug https://bugzilla.redhat.com/show_bug.cgi?id=1911597 we cannot confirm and verify it.
The functionality is broken when switching the master SD from one gluster SD to another gluster SD: stuck tasks in the engine DB and on the SPM host lead to an infinite (or very long) reconstruction of all SDs (screenshot attached).
A "Reconstructing master domain on Data Center" error appears when moving the master SD from gluster to gluster (screenshot attached).
We must revert the patches related to this RFE ASAP until bug 1911597 is fixed. Please do it.

Comment 31 Shir Fishbain 2020-12-30 10:47:20 UTC
Created attachment 1743181 [details]
reconstruction of all SDs

Comment 32 Shir Fishbain 2020-12-30 10:49:16 UTC
Created attachment 1743182 [details]
Reconstructing master domain on Data Center error

Comment 38 Evelina Shames 2021-01-11 09:07:31 UTC
Verified blocked operation for gluster on rhv-4.4.4-7 with the following:
curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster
sd.xml:
<action>
    <storage_domain id="sd_id"/>
</action>

Result:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
    <detail>[Cannot switch master storage domain. Switch master storage domain is not supported for gluster-based domains.]</detail>
    <reason>Operation Failed</reason>
</fault>

Moving to 'Verified'

Comment 44 errata-xmlrpc 2021-02-02 14:00:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV-M (ovirt-engine) 4.4.z security, bug fix, enhancement update [ovirt-4.4.4] 0-day) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0383

