Description of problem:
In order to move the master storage domain role to another storage domain, we need to put the current master into maintenance. If the hosted engine is using that storage domain, the operation is blocked with the following message:

2018-05-05 04:00:23,859+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-56) [] Operation Failed: [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Right now this is blocking some of our automation tests (mainly reconstruct master). It would help a lot if moving the master role did not require maintenance.
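For reference, a minimal sketch of the deactivate call that hits this block, using the standard attached-storage-domain deactivate action (the engine FQDN, DC/SD IDs and the cacert path are placeholders):

# Attempt to deactivate the master SD that also hosts the HE VM; the engine
# rejects it with "Cannot deactivate Storage. The storage selected contains
# the self hosted engine."
curl -X POST \
     -H "Accept: application/xml" -H "Content-Type: application/xml" \
     -u admin@internal \
     --cacert pki-resource.cer \
     --data '<action/>' \
     'https://<engine-fqdn>/ovirt-engine/api/datacenters/<dc_id>/storagedomains/<master_sd_id>/deactivate'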
What's the use case? Automation purely?
(In reply to Yaniv Kaul from comment #1)
> What's the use case? Automation purely?

Well, I don't think it's purely an automation issue, though I cannot think of a good real-world scenario. It's just something that might pop up from the field in the near future. If you have a storage domain that holds the hosted engine VM disks and it is also the master storage domain, you will never be able to switch the master role away from it, because you can't put that storage domain into maintenance (it's running the hosted engine).
I wonder what happens if you create (via the API) a domain with 'master' set.
We've shifted our automation environments to HE (Ansible-deployed). Without the ability to change the master role of a storage domain without maintenance, reconstruct master cannot be tested. Therefore, marking as AutomationBlocker.
A use case for this is SHE -> bare-metal migration, which is currently broken on fresh 4.2 SHE deployments because hosted_storage is the master. It is also useful for SHE disaster recovery/backup+restore and storage domain migration. Currently these are all broken if hosted_storage is the master.
+1 to this. With the HE storage domain being the master in almost all production instances of RHV at this point, not having the ability to change the master role without putting the hosted_storage domain into maintenance causes issues if another domain needs to be chosen for one reason or another.
(In reply to Natalie Gavrielov from comment #3)
> (In reply to Yaniv Kaul from comment #1)
> > What's the use case? Automation purely?
>
> Well, I don't think it's purely an automation issue, though I cannot think of
> a good real-world scenario. It's just something that might pop up from the
> field in the near future. If you have a storage domain that holds the hosted
> engine VM disks and it is also the master storage domain, you will never be
> able to switch the master role away from it, because you can't put that
> storage domain into maintenance (it's running the hosted engine).

Example of a real-world scenario: we have 8 storage domains over 4 clusters (2 domains per cluster) with roughly 1,000 VMs per cluster. The master domain is running on the oldest storage (in fact this NFS cluster is small and not performing well). So we would like to move the master role to some bigger NFS hardware without impact, and without migrating 1,000 VMs with over 24 TB of live data.
Like Ladislav, I have a real-world scenario: our RHV/oVirt environment was set up with stand-alone iSCSI boxes. A few months ago, we added a Ceph cluster (also connected through iSCSI). Because the Ceph cluster has multiple physical boxes of redundancy, we want it to be the master instead of one of the stand-alone iSCSI boxes. The 'put all other storage domains into maintenance' solution seems like a needless hassle: I would have to shut down (or migrate) around half of our 700+ VMs. This would involve stopping computation for dozens of researchers and bringing down various other services, which would likely impact end-user perceptions of RHV/oVirt as an enterprise-class product.
It would be valuable to have a workaround to manually choose the master domain, even if not exposed on the web UI and even if it requires running VDSM commands and/or engine DB manipulation. I need to decommission an old replica-3 storage domain and have several other replica-3 and replica-2+1 domains and would much prefer the master being one of the replica-3 domains.
Switching component to downstream due to customer tickets attached.
Hi Shani, I am wondering whether this bug can be closed as a duplicate of bz#1836034?
Sounds like the same area but not the same issue. IIUC, here you can't put an SD that is used by the hosted engine into maintenance (the operation is blocked), while in bz#1836034 the DC moved into maintenance and the remaining SD wasn't picked as the master even though it's active (not the result we expected).
This is a complex and risky change, and it is unlikely to be ready for 4.4.3.
Created attachment 1721626 [details] switch master test cases logs and output
This scenario could not be verified on rhv-4.4.4-6. The engine fails to deactivate the (master) storage domain in a data center if the HE VM resides on the same SD:

2020-12-23 12:04:25,709+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed : [Cannot deactivate Storage. The storage selected contains the self hosted engine.]

Attaching engine + vdsm logs from this automated TC (TestCase6222).
Created attachment 1741520 [details] Failed to verify 23.12.20
Hi Ilan,

The set master command doesn't deactivate the current master; the switch master storage domain operation is executed while the domains are up and running. It seems from the logs that the SwitchMasterStorageDomain command wasn't called at all. Did you follow the step of using the command as described in https://gerrit.ovirt.org/#/c/ovirt-engine/+/111228/?

POST /ovirt-engine/api/datacenters/123/setmaster

With a request body like this:

<action>
  <storage_domain id="456"/>
</action>

I believe you encountered this one while trying to deactivate your storage domain: https://bugzilla.redhat.com/show_bug.cgi?id=1402789
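For completeness, a minimal curl sketch of that request (123/456 are the example IDs from the request body above; the engine FQDN and cacert path are placeholders):

curl -X POST \
     -H "Accept: application/xml" -H "Content-Type: application/xml" \
     -u admin@internal \
     --cacert pki-resource.cer \
     --data '<action><storage_domain id="456"/></action>' \
     'https://<engine-fqdn>/ovirt-engine/api/datacenters/123/setmaster'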
Verified on rhv-4.4.4-6 with the following:

curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster

sd.xml:
<action>
  <storage_domain id="sd_id"/>
</action>

Master moved from the old SD to the new SD.
After meeting with Evelina and Shani I am removing the 'verified' flag and putting this to 'need_info' for the following reason:

After attempting to migrate the master to another SD of the same storage type (for example gluster1 to gluster2), the migration fails and the source SD is being reconstructed. At some point the SPM host has a 'stuck' task; clearing it does not resolve the issue.

[root@storage-ge5-vdsm1 ~]# vdsm-client Host getAllTasksStatuses
{
    "6fb0c398-a64a-4b1c-b00c-fa328ab9696f": {
        "code": 100,
        "message": "value=(1, 0, b'', b'') abortedcode=100",
        "taskID": "6fb0c398-a64a-4b1c-b00c-fa328ab9696f",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }
}
[root@storage-ge5-vdsm1 ~]# vdsm-client Task clear taskID=6fb0c398-a64a-4b1c-b00c-fa328ab9696f

Steps to reproduce:
Issue the command to migrate the master SD a few times (from one SD to another) on the same storage type. This particular issue was discovered when using gluster SDs, and NFS. At the 2nd or 3rd attempt you will hit this issue. Just tail the logs (see the sketch after this comment).

Adding SPM vdsm log + engine log.

Engine:
2020-12-28 14:25:49,669+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-12-28 14:25:49,678+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: value=(1, 0, b'', b'') abortedcode=100

SPM host:
2020-12-28 14:27:31,149+0200 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='4552b855-99a7-4c96-9584-b2c11f382bf0') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-33>", line 2, in connectStoragePool
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1058, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1098, in _connectStoragePool
    masterVersion, domainsMap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1078, in _updateStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1372, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1294, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1519, in setMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
vdsm.storage.exception.StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=a0f7d7f8-65bb-4185-9986-f68697ff4ad6, pool=dfad0b2f-c1e9-4a0d-9ede-30414f6bee36'
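A rough shell sketch of the reproduction flow described above (DC/SD IDs, the engine FQDN, credentials, cacert path and timings are placeholders, not the exact values used in the automated run):

#!/bin/bash
# Alternate the master role between two SDs of the same type (gluster or NFS)
# a few times while tailing the SPM vdsm.log; on the 2nd-3rd switch the stuck
# task / wrong-master reconstruction described above shows up.
ENGINE="https://<engine-fqdn>/ovirt-engine/api"
DC="<dc_id>"
SD1="<first_sd_id>"
SD2="<second_sd_id>"

for sd in "$SD1" "$SD2" "$SD1" "$SD2"; do
    curl -X POST \
         -H "Accept: application/xml" -H "Content-Type: application/xml" \
         -u admin@internal --cacert pki-resource.cer \
         --data "<action><storage_domain id=\"$sd\"/></action>" \
         "$ENGINE/datacenters/$DC/setmaster"
    sleep 120   # give the SwitchMasterStorageDomain command time to complete
done

# Meanwhile, on the SPM host:
#   tail -f /var/log/vdsm/vdsm.log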
Opened a new bug about this issue described in comment #28 https://bugzilla.redhat.com/show_bug.cgi?id=1911597
The customers assigned to this RFE are using RHHI and gluster, and due to bug https://bugzilla.redhat.com/show_bug.cgi?id=1911597 we can't confirm and verify it! The functionality is broken when setting the master SD from one gluster SD to another gluster SD only: stuck tasks in the engine DB and on the SPM host lead to an infinite (or very long) reconstruction of all SDs (screenshot attached). A "Reconstructing master domain on Data Center" error appears when moving the master SD from gluster to gluster (screenshot attached). We must revert the patches related to this RFE ASAP until bug 1911597 is fixed. Please do it.
Created attachment 1743181 [details] reconstruction of all SDs
Created attachment 1743182 [details] Reconstructing master domain on Data Center error
Verified blocked operation for gluster on rhv-4.4.4-7 with the following:

curl -X POST -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal --cacert pki-resource.cer -T sd.xml <engine>/ovirt-engine/api/datacenters/<dc_id>/setmaster

sd.xml:
<action>
  <storage_domain id="sd_id"/>
</action>

Result:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
    <detail>[Cannot switch master storage domain. Switch master storage domain is not supported for gluster-based domains.]</detail>
    <reason>Operation Failed</reason>
</fault>

Moving to 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV-M (ovirt-engine) 4.4.z security, bug fix, enhancement update [ovirt-4.4.4] 0-day), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0383