1315074 – [engine-backend] Hosted-engine storage domain is a data domain that cannot take master

Bug 1315074 - [engine-backend] Hosted-engine storage domain is a data domain that cannot take master

Summary: [engine-backend] Hosted-engine storage domain is a data domain that cannot ta...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Storage
Sub Component:
Version:	3.6.3.3
Hardware:	x86_64
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.2.0
Target Release:	---
Assignee:	Tal Nisan
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:	1455169
Blocks:	1393902 1400127
TreeView+	depends on / blocked

Reported:	2016-03-06 12:34 UTC by Elad
Modified:	2018-02-12 10:11 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-02-12 10:11:21 UTC
oVirt Team:	Storage
Embargoed:
Dependent Products:
Flags:	ylavi: ovirt-4.2+

Attachments	(Terms of Use)
engine and vdsm logs (4.01 MB, application/x-gzip) 2016-03-06 12:34 UTC, Elad	no flags	Details
View All

Description Elad 2016-03-06 12:34:31 UTC

Created attachment 1133480 [details]
engine and vdsm logs

Description of problem:
Since the hosted-engine storage domain is imported to the first initialized DC as a regular data domain, reconstruct scenarios are being done wrongly while this domain is the only active data domain to take master. 

Version-Release number of selected component (if applicable):
rhevm-3.6.3.4-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch
vdsm-4.17.23-0.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. On a hosted-engine env., with an initialize DC with one master domain and the hosted-engine storage domain, both active 
2. Create an export domain and attach it to the DC
3. Put the master domain in maintenance


Actual results:
The master domain is moved to maintenance successfully. The DC turns to maintenance and the hosted-engine storage domain and the export domain remain active.

Expected results:
2 issues:
1) If the hosted-engine storage domain cannot take master, putting the current master domain to maintenance, while there are no other data domain that can take the master role, other than the hosted-engine one, should be trigger a warning that the DC will become maintenance.
2) If there is an active export domain in the pool, putting the current master domain to maintenance, while there are no other data domain that can take the master role, other than the hosted-engine one, should be blocked in CanDoAction.


Additional info: engine and vdsm logs

2016-03-06 13:59:25,870 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-6) [191b3e9] START, DeactivateStorageDomainVDSCommand( DeactivateStorageD
omainVDSCommandParameters:{runAsync='true', storagePoolId='00000002-0002-0002-0002-0000000003df', ignoreFailoverLimit='false', storageDomainId='4b71eae6-ce01-48d5-950f-633a673ae722', masterDomainId='82afad4a-9a43-
4b09-881e-e0467fc2a77a', masterVersion='7'}), log id: 537b1ea9

Comment 1 Allon Mureinik 2016-06-06 14:15:43 UTC

This bug was introduced by the fix to bug 1298697.

Frankly, none of this flow makes sense to me - if the HE domain is a data domain and part of the pool, there's no reason it should not be able to take the master.

Comment 2 Roy Golan 2016-06-07 08:56:40 UTC

(In reply to Allon Mureinik from comment #1)
> This bug was introduced by the fix to bug 1298697.
> 
> Frankly, none of this flow makes sense to me - if the HE domain is a data
> domain and part of the pool, there's no reason it should not be able to take
> the master.

See Bug 1298697. In bootstrap mode the HE domain will have vdsm to start monitoring it before its connected to pool So when the engine will be up, it will fail to connect it to pool. I think leaving without the HE domain as master is a sane compromise. Otherwise we would have more bugs around the master domain already being under monitoring AND not connected to pool.

Comment 3 Allon Mureinik 2016-06-07 10:04:51 UTC

Domain monitoring and being the master have nothing to do with each other. IMHO, this hack solves one bug, but introduces a dozen others.

Comment 4 Yaniv Kaul 2016-11-30 08:48:13 UTC

(In reply to Allon Mureinik from comment #3)
> Domain monitoring and being the master have nothing to do with each other.
> IMHO, this hack solves one bug, but introduces a dozen others.

What's the next step?

Comment 5 Allon Mureinik 2016-11-30 13:21:24 UTC

(In reply to Yaniv Kaul from comment #4)
> (In reply to Allon Mureinik from comment #3)
> > Domain monitoring and being the master have nothing to do with each other.
> > IMHO, this hack solves one bug, but introduces a dozen others.
> 
> What's the next step?

Having the HE stakeholers decide what to do with the domain.
It's either a data domain in the engine that happens to have a special disk on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be upgraded, has OVF_STORES, whatever), or it's removed completely from the engine.
Whenever we try to have the cake and eat it it blows up in our collective faces.

Comment 6 Yaniv Kaul 2016-11-30 13:23:49 UTC

(In reply to Allon Mureinik from comment #5)
> (In reply to Yaniv Kaul from comment #4)
> > (In reply to Allon Mureinik from comment #3)
> > > Domain monitoring and being the master have nothing to do with each other.
> > > IMHO, this hack solves one bug, but introduces a dozen others.
> > 
> > What's the next step?
> 
> Having the HE stakeholers decide what to do with the domain.
> It's either a data domain in the engine that happens to have a special disk
> on it AND NOTHING ELSE SPECIAL ABOUT IT (e.g., can be master, can be
> upgraded, has OVF_STORES, whatever), or it's removed completely from the
> engine.
> Whenever we try to have the cake and eat it it blows up in our collective
> faces.

Martin?

Comment 7 Martin Sivák 2016-11-30 13:45:58 UTC

There are couple of considerations here:

1) We have two disks on the domain that have to stay there and are important for synchronization (must not be touched, deleted, moved, ...)
2) We mount the storage before the engine starts and we have two sanlock leases on it (agent id and hosted engine disk)
3) Any attempt on disconnecting the storage kills the engine VM (can happen during maintenance flow, taking SPM role..)
4) The engine VM disks are somewhat sensitive to high load of the storage device, but the user can probably take care of that if we document that properly
5) The current hosted engine editing feature requires that the engine sees the domain as the OVF writer can't push data to it otherwise

So the domain is special in how we use it, but it does not necessarily have to be special in what it contains. And it has to be visible from the engine at least according to the current state of things.

Comment 8 Yaniv Kaul 2016-11-30 13:55:27 UTC

(In reply to Martin Sivák from comment #7)
> There are couple of considerations here:
> 
> 1) We have two disks on the domain that have to stay there and are important
> for synchronization (must not be touched, deleted, moved, ...)
> 2) We mount the storage before the engine starts and we have two sanlock
> leases on it (agent id and hosted engine disk)
> 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> during maintenance flow, taking SPM role..)

Taking SPM role disconnects a connected storage? Because it was not connected via Engine?

> 4) The engine VM disks are somewhat sensitive to high load of the storage
> device, but the user can probably take care of that if we document that
> properly
> 5) The current hosted engine editing feature requires that the engine sees
> the domain as the OVF writer can't push data to it otherwise
> 
> So the domain is special in how we use it, but it does not necessarily have
> to be special in what it contains. And it has to be visible from the engine
> at least according to the current state of things.

Alon?

Comment 9 Allon Mureinik 2016-11-30 14:33:16 UTC

(In reply to Yaniv Kaul from comment #8)
> (In reply to Martin Sivák from comment #7)
> > There are couple of considerations here:
> > 
> > 1) We have two disks on the domain that have to stay there and are important
> > for synchronization (must not be touched, deleted, moved, ...)
> > 2) We mount the storage before the engine starts and we have two sanlock
> > leases on it (agent id and hosted engine disk)
> > 3) Any attempt on disconnecting the storage kills the engine VM (can happen
> > during maintenance flow, taking SPM role..)
> 
> Taking SPM role disconnects a connected storage? Because it was not
> connected via Engine?
What?
If anything, it connects to the storage...

> 
> > 4) The engine VM disks are somewhat sensitive to high load of the storage
> > device, but the user can probably take care of that if we document that
> > properly
> > 5) The current hosted engine editing feature requires that the engine sees
> > the domain as the OVF writer can't push data to it otherwise
> > 
> > So the domain is special in how we use it, but it does not necessarily have
> > to be special in what it contains. And it has to be visible from the engine
> > at least according to the current state of things.
> 
> Alon?
what's the question?

Comment 10 Yaniv Lavi 2016-12-14 16:21:36 UTC

This bug had requires_doc_text flag, yet no documentation text was provided. Please add the documentation text and only then set this flag.

Comment 11 Yaniv Lavi 2016-12-28 16:18:31 UTC

Can you reply on comment #9?

BTW we have decided to start the process to make the HE storage into a standard storage domain and remove limitations for 4.2 and if it is low risk we can consider some steps in 4.1 already.

Comment 12 Martin Sivák 2017-01-02 12:09:01 UTC

To what question?

Comment 13 Martin Sivák 2017-01-02 12:53:40 UTC

Currently this is all related to the one master issue: Does it have to be a special domain and who (part of code, team) owns it?

We have a tracker bug for all the hosted engine related storage flows & API questions to answer those: https://bugzilla.redhat.com/show_bug.cgi?id=1393902

We can't decide this without proper design done together with the storage team as we do not know all the "hidden" storage flows that commonly happen with standard storage domains (SPM selection being one of them).

Comment 14 Yaniv Lavi 2017-12-27 13:13:41 UTC

Should be solved with zero node deployment please test with that.

Comment 15 Elad 2018-02-11 12:14:16 UTC

Hosted engine storage domain is being added as master automatically to the hosted engine's DC.

Used:
ovirt-hosted-engine-setup-2.2.9-1.el7ev.noarch
rhvm-4.2.1.5-0.1.el7.noarch
vdsm-4.20.17-1.el7ev.x86_64

Comment 16 Sandro Bonazzola 2018-02-12 10:11:21 UTC

This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Note You need to log in before you can comment on or make changes to this bug.