Bug 1315074 - [engine-backend] Hosted-engine storage domain is a data domain that cannot take master
Status: NEW
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.3.3
Hardware: x86_64 Unspecified
Priority: high Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Tal Nisan
QA Contact: Raz Tamir
Depends On:
Blocks: 1393902 1400127
 
Reported: 2016-03-06 07:34 EST by Elad
Modified: 2017-01-02 09:30 EST
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.2+


Attachments
engine and vdsm logs (4.01 MB, application/x-gzip)
2016-03-06 07:34 EST, Elad

Description Elad 2016-03-06 07:34:31 EST
Created attachment 1133480 [details]
engine and vdsm logs

Description of problem:
Since the hosted-engine storage domain is imported into the first initialized DC as a regular data domain, reconstruct scenarios are handled incorrectly when this domain is the only active data domain available to take the master role.

Version-Release number of selected component (if applicable):
rhevm-3.6.3.4-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch
vdsm-4.17.23-0.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Start from a hosted-engine environment with an initialized DC containing one master data domain and the hosted-engine storage domain, both active
2. Create an export domain and attach it to the DC
3. Put the master domain in maintenance


Actual results:
The master domain is moved to maintenance successfully. The DC moves to maintenance, while the hosted-engine storage domain and the export domain remain active.

Expected results:
2 issues:
1) If the hosted-engine storage domain cannot take master, putting the current master domain into maintenance while there is no other data domain able to take the master role (other than the hosted-engine one) should trigger a warning that the DC will move to maintenance.
2) If there is an active export domain in the pool, putting the current master domain into maintenance while there is no other data domain able to take the master role (other than the hosted-engine one) should be blocked in CanDoAction.
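
For illustration, the warn-vs-block decision requested in these two points could be sketched roughly as below. This is a minimal Python sketch with hypothetical names; the real check would be Java code in the engine's CanDoAction validation, not this function:

```python
def validate_deactivate_master(domains, deactivating):
    """Hypothetical sketch of the requested CanDoAction-style check.

    `domains` is a list of dicts with keys 'type', 'status' and
    'hosted_engine'; `deactivating` is the master domain being put
    into maintenance. Names and structure are illustrative only.
    """
    # Data domains, other than the hosted-engine one and the domain
    # being deactivated, that could take over the master role.
    candidates = [d for d in domains
                  if d is not deactivating
                  and d["type"] == "data"
                  and d["status"] == "active"
                  and not d["hosted_engine"]]
    if candidates:
        return ("ok", None)

    # No eligible master remains. An active export domain makes the
    # operation unsafe, so it should be blocked (issue 2); otherwise
    # only warn that the DC will move to maintenance (issue 1).
    has_export = any(d["type"] == "export" and d["status"] == "active"
                     for d in domains)
    if has_export:
        return ("blocked", "active export domain with no master candidate")
    return ("warn", "DC will move to maintenance")
```

With the reproduction setup above (hosted-engine domain, one master data domain, one active export domain), deactivating the master would return "blocked"; without the export domain it would return "warn".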


Additional info: engine and vdsm logs

2016-03-06 13:59:25,870 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-6) [191b3e9] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-0000000003df', ignoreFailoverLimit='false', storageDomainId='4b71eae6-ce01-48d5-950f-633a673ae722', masterDomainId='82afad4a-9a43-4b09-881e-e0467fc2a77a', masterVersion='7'}), log id: 537b1ea9
Comment 1 Allon Mureinik 2016-06-06 10:15:43 EDT
This bug was introduced by the fix to bug 1298697.

Frankly, none of this flow makes sense to me - if the HE domain is a data domain and part of the pool, there's no reason it should not be able to take the master.
Comment 2 Roy Golan 2016-06-07 04:56:40 EDT
(In reply to Allon Mureinik from comment #1)
> This bug was introduced by the fix to bug 1298697.
> 
> Frankly, none of this flow makes sense to me - if the HE domain is a data
> domain and part of the pool, there's no reason it should not be able to take
> the master.

See Bug 1298697. In bootstrap mode, vdsm starts monitoring the HE domain before it is connected to the pool, so when the engine comes up it will fail to connect it to the pool. I think leaving the HE domain out of the master role is a sane compromise. Otherwise we would have more bugs around the master domain already being under monitoring AND not connected to the pool.
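
The ordering conflict described in this comment can be illustrated with a toy model. This is purely illustrative Python, not vdsm code or its API; the class and method names are invented:

```python
class ToyVdsm:
    """Toy model of the hosted-engine bootstrap ordering conflict.

    Illustrative only: in real vdsm the interaction between standalone
    domain monitoring and pool membership is far more involved.
    """

    def __init__(self):
        self.monitored = set()      # domains monitored standalone
        self.pool_members = set()   # domains connected to the pool

    def start_monitoring(self, sd_id):
        # Hosted-engine bootstrap starts monitoring the HE domain
        # directly, before any engine or pool connection exists.
        self.monitored.add(sd_id)

    def connect_to_pool(self, sd_id):
        # Simplified stand-in for the conflict: a domain already under
        # standalone monitoring cannot later be attached to the pool.
        if sd_id in self.monitored:
            raise RuntimeError(
                "domain %s is already monitored outside the pool" % sd_id)
        self.pool_members.add(sd_id)
```

In this model, `start_monitoring("he-domain")` necessarily happens before the engine exists, so the engine's later `connect_to_pool("he-domain")` fails, which is the tension the compromise above works around.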
Comment 3 Allon Mureinik 2016-06-07 06:04:51 EDT
Domain monitoring and being the master have nothing to do with each other. IMHO, this hack solves one bug, but introduces a dozen others.
Comment 4 Yaniv Kaul 2016-11-30 03:48:13 EST
(In reply to Allon Mureinik from comment #3)
> Domain monitoring and being the master have nothing to do with each other.
> IMHO, this hack solves one bug, but introduces a dozen others.

What's the next step?
Comment 5 Allon Mureinik 2016-11-30 08:21:24 EST
(In reply to Yaniv Kaul from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Domain monitoring and being the master have nothing to do with each other.
> > IMHO, this hack solves one bug, but introduces a dozen others.
> 
> What's the next step?

Having the HE stakeholders decide what to do with the domain.
It's either a data domain in the engine that happens to have a special disk on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be upgraded, has OVF_STORES, whatever), or it's removed completely from the engine.
Whenever we try to have our cake and eat it too, it blows up in our collective faces.
Comment 6 Yaniv Kaul 2016-11-30 08:23:49 EST
(In reply to Allon Mureinik from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > (In reply to Allon Mureinik from comment #3)
> > > Domain monitoring and being the master have nothing to do with each other.
> > > IMHO, this hack solves one bug, but introduces a dozen others.
> > 
> > What's the next step?
> 
> Having the HE stakeholders decide what to do with the domain.
> It's either a data domain in the engine that happens to have a special disk
> on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be
> upgraded, has OVF_STORES, whatever), or it's removed completely from the
> engine.
> Whenever we try to have our cake and eat it too, it blows up in our
> collective faces.

Martin?
Comment 7 Martin Sivák 2016-11-30 08:45:58 EST
There are a couple of considerations here:

1) We have two disks on the domain that have to stay there and are important for synchronization (must not be touched, deleted, moved, ...)
2) We mount the storage before the engine starts and we have two sanlock leases on it (agent id and hosted engine disk)
3) Any attempt at disconnecting the storage kills the engine VM (this can happen during the maintenance flow, when taking the SPM role, etc.)
4) The engine VM disks are somewhat sensitive to high load of the storage device, but the user can probably take care of that if we document that properly
5) The current hosted engine editing feature requires that the engine sees the domain, as the OVF writer can't push data to it otherwise

So the domain is special in how we use it, but it does not necessarily have to be special in what it contains. And it has to be visible from the engine at least according to the current state of things.
Comment 8 Yaniv Kaul 2016-11-30 08:55:27 EST
(In reply to Martin Sivák from comment #7)
> There are a couple of considerations here:
> 
> 1) We have two disks on the domain that have to stay there and are important
> for synchronization (must not be touched, deleted, moved, ...)
> 2) We mount the storage before the engine starts and we have two sanlock
> leases on it (agent id and hosted engine disk)
> 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> during maintenance flow, taking SPM role..)

Taking SPM role disconnects a connected storage? Because it was not connected via Engine?

> 4) The engine VM disks are somewhat sensitive to high load of the storage
> device, but the user can probably take care of that if we document that
> properly
> 5) The current hosted engine editing feature requires that the engine sees
> the domain as the OVF writer can't push data to it otherwise
> 
> So the domain is special in how we use it, but it does not necessarily have
> to be special in what it contains. And it has to be visible from the engine
> at least according to the current state of things.

Alon?
Comment 9 Allon Mureinik 2016-11-30 09:33:16 EST
(In reply to Yaniv Kaul from comment #8)
> (In reply to Martin Sivák from comment #7)
> > There are a couple of considerations here:
> > 
> > 1) We have two disks on the domain that have to stay there and are important
> > for synchronization (must not be touched, deleted, moved, ...)
> > 2) We mount the storage before the engine starts and we have two sanlock
> > leases on it (agent id and hosted engine disk)
> > 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> > during maintenance flow, taking SPM role..)
> 
> Taking SPM role disconnects a connected storage? Because it was not
> connected via Engine?
What?
If anything, it connects to the storage...

> 
> > 4) The engine VM disks are somewhat sensitive to high load of the storage
> > device, but the user can probably take care of that if we document that
> > properly
> > 5) The current hosted engine editing feature requires that the engine sees
> > the domain as the OVF writer can't push data to it otherwise
> > 
> > So the domain is special in how we use it, but it does not necessarily have
> > to be special in what it contains. And it has to be visible from the engine
> > at least according to the current state of things.
> 
> Alon?
What's the question?
Comment 10 Yaniv Lavi (Dary) 2016-12-14 11:21:36 EST
This bug had requires_doc_text flag, yet no documentation text was provided. Please add the documentation text and only then set this flag.
Comment 11 Yaniv Lavi (Dary) 2016-12-28 11:18:31 EST
Can you reply to comment #9?

BTW, we have decided to start the process of making the HE storage into a standard storage domain and removing its limitations for 4.2, and if it is low risk we can consider some steps in 4.1 already.
Comment 12 Martin Sivák 2017-01-02 07:09:01 EST
To what question?
Comment 13 Martin Sivák 2017-01-02 07:53:40 EST
Currently this is all related to the one master issue: does it have to be a special domain, and who (which part of the code, which team) owns it?

We have a tracker bug for all the hosted engine related storage flows & API questions to answer those: https://bugzilla.redhat.com/show_bug.cgi?id=1393902

We can't decide this without proper design done together with the storage team as we do not know all the "hidden" storage flows that commonly happen with standard storage domains (SPM selection being one of them).
