Bug 1093924

Summary: Connect to storage and refresh pool when a domain returns visible
Product: Red Hat Enterprise Virtualization Manager
Reporter: Federico Simoncelli <fsimonce>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: amureini, gklein, iheim, lpeer, ogofen, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: vt1.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1102782 (view as bug list)
Environment:
Last Closed: 2015-02-16 19:08:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1119852, 1121420    
Bug Blocks: 1102782, 1142923, 1156165    

Description Federico Simoncelli 2014-05-03 11:35:36 UTC
Description of problem:
When a data domain is inactive it is possible to activate new hosts even if they're not able to connect to the relevant storage.

When the storage becomes visible again, these new hosts will be moved to NonOperational (since they failed to connect at activation time).

Currently this is resolved after another 5 minutes by the autorecovery, but preventing this situation in the first place would be best.

The solution would be to try to reconnect to the storage and send a refreshStoragePool when the domain moves from Inactive to Active.
This approach would also fix bug 1086210 (and a few other similar ones).
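
For illustration only, here is a rough sketch of the proposed flow. The helper names (hosts_in_pool, connect_storage_server, refresh_storage_pool) are hypothetical stand-ins for the ConnectStorageServer / RefreshStoragePool verbs the engine sends to VDSM, not the actual engine code:

# Sketch only: what the engine could do when a data domain moves from
# Inactive to Active. Helper names are hypothetical stand-ins.

def hosts_in_pool(pool_id):
    # Stub: return the UP hosts of the pool.
    return ["host-1", "host-2"]

def connect_storage_server(host, connection):
    # Stand-in for ConnectStorageServer (e.g. mounting the NFS export on the host).
    print("%s: connectStorageServer(%s)" % (host, connection))
    return True

def refresh_storage_pool(host, pool_id):
    # Stand-in for RefreshStoragePool (rebuilds the pool/domain links on the host).
    print("%s: refreshStoragePool(%s)" % (host, pool_id))

def on_domain_recovered(pool_id, connection):
    # Re-issue the connect for every host, including hosts that were activated
    # while the domain was unreachable and therefore never mounted it, then
    # refresh the pool so the domain links are rebuilt.
    for host in hosts_in_pool(pool_id):
        if connect_storage_server(host, connection):
            refresh_storage_pool(host, pool_id)

if __name__ == "__main__":
    on_domain_recovered("pool-uuid", "server:/export/DomainB")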

Version-Release number of selected component (if applicable):
rhevm-backend-3.4.0-0.16.rc.el6ev.noarch.rpm

How reproducible:
100%

Steps to Reproduce:
I tested this using NFS storage domains, so I suggest starting to reproduce with those and then moving to block domains.

1. activate 1 host and 2 data storage domains: DomainA (master) DomainB (regular)
2. block connectivity to DomainB (no reconstructMaster) and wait for the domain to become Inactive (one way to block and restore connectivity is sketched after these steps)
3. activate a second host (it must not be able to reach DomainB as well)
4. restore connectivity to DomainB on both hosts
5. the second host is not connected to DomainB and within 5 minutes will be moved to NonOperational
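
A sketch of one way to block and restore connectivity to DomainB for steps 2 and 4, assuming an NFS domain and root access on the host; the server address is a placeholder:

import subprocess

# Placeholder for DomainB's NFS server address.
NFS_SERVER = "192.0.2.10"

def block_domain_b():
    # Drop outgoing traffic to the NFS server so the domain becomes unreachable.
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])

def restore_domain_b():
    # Remove the rule added above to restore connectivity.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])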

Actual results:
The second host is not connected to DomainB and within 5 minutes is moved to NonOperational.

Expected results:
The engine, as soon as DomainB is visible again, should make sure that the hosts are connected to DomainB and send a refreshStoragePool.

Additional info:

Comment 1 Liron Aravot 2014-05-04 10:25:00 UTC
In that scenario the host shouldn't move to NonOperational, as we still have domain monitoring on the unreachable domain (which will fail to produce it).
So the host will remain UP even after the domain connection returns in that scenario, as then we'll manage to produce the domain.

The issue is that we won't have a link to the domain, so operations related to it (like accessing disks on that domain from that specific host) should fail.

Fede, it seems to me the solution is to always create the links, which would solve that issue and would also put hosts on which the link was created before the domain became unreachable in the same situation as hosts that were connected to the pool later on. Calling refresh from the engine on each domain that returns, for all the hosts, just to rebuild the links seems unneeded.
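
For illustration, "always create the links" would roughly amount to something like the sketch below on the VDSM side; the function and its arguments are hypothetical, only the /rhev/data-center layout follows the usual VDSM one:

import os

def ensure_domain_link(pool_uuid, domain_uuid, mount_dir):
    # Hypothetical sketch: create the /rhev/data-center/<pool>/<domain> symlink
    # even when the domain cannot currently be produced, so all hosts end up
    # with the same layout. mount_dir is where the domain would be mounted,
    # e.g. /rhev/data-center/mnt/server:_export_DomainB (placeholder).
    pool_dir = os.path.join("/rhev/data-center", pool_uuid)
    os.makedirs(pool_dir, exist_ok=True)
    link = os.path.join(pool_dir, domain_uuid)
    if not os.path.islink(link):
        # The link may initially point at a not-yet-mounted path; it becomes
        # usable once the storage is reachable and mounted again.
        os.symlink(os.path.join(mount_dir, domain_uuid), link)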

Comment 2 Federico Simoncelli 2014-05-04 11:37:48 UTC
(In reply to Liron Aravot from comment #1)
> In that scenario the host shouldn't move to NonOperational, as we still
> have domain monitoring on the unreachable domain (which will fail to produce
> it).

I am not sure what "that" scenario is, but I assume it's the one this bug is referring to.

The domain monitoring is there, but the mountpoint is *not* mounted because the storage is not reachable when "mount" (connectStorageServer) is issued on the second host.

> So the host will remain UP even after the domain connection returns in that
> scenario, as then we'll manage to produce the domain.

We won't be able to produce the domain if connectStorageServer is not issued once again, since the host failed to mount it when it was unreachable.

> The issue is that we won't have a link to the domain, so operations related
> to it (like accessing disks on that domain from that specific host) should
> fail.

That's a different problem that is not worth solving because even if we have the links the share is not mounted.

> Fede, it seems to me the solution is to always create the links, which would
> solve that issue and would also put hosts on which the link was created
> before the domain became unreachable in the same situation as hosts that
> were connected to the pool later on. Calling refresh from the engine on each
> domain that returns, for all the hosts, just to rebuild the links seems
> unneeded.

Agreed, calling refreshStoragePool without connectStorageServer may solve bug 1086210 but not the one in this bz (unneeded).

That said, since we need to cover this bz's scenario (which includes bug 1086210), we may as well fix both at once.

Comment 6 Kevin Alon Goldblatt 2014-08-17 08:51:49 UTC
Ran the scenario from above. Both hosts connect successfully to the storage when it becomes available again. Moving to Verified.

Comment 7 Kevin Alon Goldblatt 2014-08-17 08:56:25 UTC
The GetStoragePool function now updates the status of the domain.

Comment 8 Allon Mureinik 2015-02-16 19:08:49 UTC
RHEV-M 3.5.0 has been released, closing this bug.