Created attachment 1133480 [details]
engine and vdsm logs

Description of problem:
Since the hosted-engine storage domain is imported into the first initialized DC as a regular data domain, reconstruct scenarios are handled incorrectly while this domain is the only active data domain eligible to take the master role.

Version-Release number of selected component (if applicable):
rhevm-3.6.3.4-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch
vdsm-4.17.23-0.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. On a hosted-engine environment, with an initialized DC with one master domain and the hosted-engine storage domain, both active
2. Create an export domain and attach it to the DC
3. Put the master domain into maintenance

Actual results:
The master domain is moved to maintenance successfully. The DC moves to Maintenance while the hosted-engine storage domain and the export domain remain active.

Expected results:
Two issues:
1) If the hosted-engine storage domain cannot take the master role, putting the current master domain into maintenance, while there is no other data domain that can take the master role, should trigger a warning that the DC will move to Maintenance.
2) If there is an active export domain in the pool, putting the current master domain into maintenance, while there is no other data domain that can take the master role, should be blocked in CanDoAction.
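The two expected validations can be sketched as a minimal model (all names are hypothetical; the real engine implements this in Java inside the deactivate command's CanDoAction/validate, not in Python):

```python
# Minimal model of the expected DeactivateStorageDomain validation.
# Hypothetical names only - this is NOT ovirt-engine code, just an
# illustration of the two expected behaviors from this report.

def validate_deactivate_master(domains, deactivating_master):
    """Return (allowed, warning) for deactivating the current master domain."""
    others = [d for d in domains
              if d is not deactivating_master and d["status"] == "Active"]
    # Data domains eligible to take over the master role; the HE domain
    # is excluded because (per this bug) it cannot take master.
    candidates = [d for d in others
                  if d["type"] == "data" and not d["hosted_engine"]]
    exports = [d for d in others if d["type"] == "export"]

    if candidates:
        return True, None                        # another domain takes master
    if exports:
        return False, None                       # issue 2: block in CanDoAction
    return True, "DC will move to Maintenance"   # issue 1: warn the user
```

With an active export domain and only the HE domain left, the deactivation is blocked; with only the HE domain left, it is allowed but warned about.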
Additional info: engine and vdsm logs

2016-03-06 13:59:25,870 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-6) [191b3e9] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-0000000003df', ignoreFailoverLimit='false', storageDomainId='4b71eae6-ce01-48d5-950f-633a673ae722', masterDomainId='82afad4a-9a43-4b09-881e-e0467fc2a77a', masterVersion='7'}), log id: 537b1ea9
This bug was introduced by the fix to bug 1298697.

Frankly, none of this flow makes sense to me - if the HE domain is a data domain and part of the pool, there's no reason it should not be able to take the master role.
(In reply to Allon Mureinik from comment #1)
> This bug was introduced by the fix to bug 1298697.
>
> Frankly, none of this flow makes sense to me - if the HE domain is a data
> domain and part of the pool, there's no reason it should not be able to take
> the master.

See bug 1298697. In bootstrap mode, vdsm starts monitoring the HE domain before it is connected to the pool, so when the engine comes up, it will fail to connect the domain to the pool.

I think living without the HE domain as master is a sane compromise. Otherwise we would have more bugs around the master domain already being under monitoring AND not connected to the pool.
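The ordering problem described above can be illustrated with a toy model (purely illustrative; this is not vdsm code and the class/method names are invented):

```python
# Toy model of the bootstrap race described above (NOT vdsm code):
# hosted-engine-ha asks vdsm to monitor the HE domain before any pool
# exists, and a later attach-to-pool call then finds the domain busy.

class ToyVdsm:
    def __init__(self):
        self.monitored = set()      # domains under monitoring (no pool needed)
        self.pool_domains = set()   # domains connected to the pool

    def start_monitoring(self, sd_id):
        # Done by hosted-engine-ha at host boot, before the engine is up.
        self.monitored.add(sd_id)

    def attach_to_pool(self, sd_id):
        # Simplified stand-in for the failure mode described above:
        # the domain is already being monitored outside the pool.
        if sd_id in self.monitored:
            raise RuntimeError("domain already monitored, cannot attach")
        self.pool_domains.add(sd_id)
```

Once monitoring has started, the later attach fails, which is why the compromise of keeping the HE domain out of the master role was chosen.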
Domain monitoring and being the master have nothing to do with each other. IMHO, this hack solves one bug, but introduces a dozen others.
(In reply to Allon Mureinik from comment #3)
> Domain monitoring and being the master have nothing to do with each other.
> IMHO, this hack solves one bug, but introduces a dozen others.

What's the next step?
(In reply to Yaniv Kaul from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Domain monitoring and being the master have nothing to do with each other.
> > IMHO, this hack solves one bug, but introduces a dozen others.
>
> What's the next step?

Having the HE stakeholders decide what to do with the domain.

It's either a data domain in the engine that happens to have a special disk on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be upgraded, has OVF_STORES, whatever), or it's removed completely from the engine.

Whenever we try to have our cake and eat it too, it blows up in our collective faces.
(In reply to Allon Mureinik from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > (In reply to Allon Mureinik from comment #3)
> > > Domain monitoring and being the master have nothing to do with each other.
> > > IMHO, this hack solves one bug, but introduces a dozen others.
> >
> > What's the next step?
>
> Having the HE stakeholders decide what to do with the domain.
> It's either a data domain in the engine that happens to have a special disk
> on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be
> upgraded, has OVF_STORES, whatever), or it's removed completely from the
> engine.
> Whenever we try to have our cake and eat it too, it blows up in our
> collective faces.

Martin?
There are a couple of considerations here:

1) We have two disks on the domain that have to stay there and are important for synchronization (they must not be touched, deleted, moved, ...)
2) We mount the storage before the engine starts, and we have two sanlock leases on it (agent id and hosted engine disk)
3) Any attempt at disconnecting the storage kills the engine VM (this can happen during the maintenance flow, taking the SPM role, ...)
4) The engine VM disks are somewhat sensitive to high load on the storage device, but the user can probably take care of that if we document it properly
5) The current hosted-engine editing feature requires that the engine sees the domain, as the OVF writer can't push data to it otherwise

So the domain is special in how we use it, but it does not necessarily have to be special in what it contains. And it has to be visible from the engine, at least according to the current state of things.
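Consideration 3 is the most dangerous one in practice. A guard along the following lines (hypothetical names; not actual vdsm or ovirt-hosted-engine-ha code) illustrates the kind of check any disconnect flow would need before touching that storage:

```python
# Hypothetical guard illustrating consideration 3 above: disconnecting
# the storage that backs the engine VM disk kills the engine VM, so a
# disconnect flow would need a check along these lines. The constant
# below is an assumed placeholder id, not a real domain id.

HE_SD_ID = "he-sd"   # assumed id of the hosted-engine storage domain

def safe_to_disconnect(sd_id, engine_vm_running):
    """The HE domain may only be disconnected once the engine VM is down."""
    if sd_id == HE_SD_ID and engine_vm_running:
        return False
    return True
```

Any other storage domain can be disconnected regardless of the engine VM's state; only the HE domain needs this extra gate.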
(In reply to Martin Sivák from comment #7)
> There are a couple of considerations here:
>
> 1) We have two disks on the domain that have to stay there and are important
> for synchronization (must not be touched, deleted, moved, ...)
> 2) We mount the storage before the engine starts and we have two sanlock
> leases on it (agent id and hosted engine disk)
> 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> during maintenance flow, taking SPM role..)

Taking the SPM role disconnects a connected storage? Because it was not connected via the engine?

> 4) The engine VM disks are somewhat sensitive to high load of the storage
> device, but the user can probably take care of that if we document that
> properly
> 5) The current hosted engine editing feature requires that the engine sees
> the domain as the OVF writer can't push data to it otherwise
>
> So the domain is special in how we use it, but it does not necessarily have
> to be special in what it contains. And it has to be visible from the engine
> at least according to the current state of things.

Alon?
(In reply to Yaniv Kaul from comment #8)
> (In reply to Martin Sivák from comment #7)
> > There are a couple of considerations here:
> >
> > 1) We have two disks on the domain that have to stay there and are important
> > for synchronization (must not be touched, deleted, moved, ...)
> > 2) We mount the storage before the engine starts and we have two sanlock
> > leases on it (agent id and hosted engine disk)
> > 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> > during maintenance flow, taking SPM role..)
>
> Taking SPM role disconnects a connected storage? Because it was not
> connected via Engine?

What? If anything, it connects to the storage...

> > 4) The engine VM disks are somewhat sensitive to high load of the storage
> > device, but the user can probably take care of that if we document that
> > properly
> > 5) The current hosted engine editing feature requires that the engine sees
> > the domain as the OVF writer can't push data to it otherwise
> >
> > So the domain is special in how we use it, but it does not necessarily have
> > to be special in what it contains. And it has to be visible from the engine
> > at least according to the current state of things.
>
> Alon?

What's the question?
This bug had the requires_doc_text flag set, yet no documentation text was provided. Please add the documentation text and only then set this flag.
Can you reply to comment #9?

BTW, we have decided to start the process of turning the HE storage domain into a standard storage domain and removing its limitations in 4.2. If it is low risk, we can consider some steps in 4.1 already.
To what question?
Currently this is all related to the one master issue: does it have to be a special domain, and who (which part of the code, which team) owns it?

We have a tracker bug for all the hosted-engine related storage flows & API questions to answer those: https://bugzilla.redhat.com/show_bug.cgi?id=1393902

We can't decide this without a proper design done together with the storage team, as we do not know all the "hidden" storage flows that commonly happen with standard storage domains (SPM selection being one of them).
This should be solved with node zero deployment; please test with that.
The hosted-engine storage domain is now automatically added as master to the hosted engine's DC.

Used:
ovirt-hosted-engine-setup-2.2.9-1.el7ev.noarch
rhvm-4.2.1.5-0.1.el7.noarch
vdsm-4.20.17-1.el7ev.x86_64
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.