Bug 975018

Summary: engine: when we activate a domain which is not visible from spm only, we change spm but do not set host as nonOperational right away

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.2.0
Target Release: 3.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Reporter: Dafna Ron <dron>
Assignee: Nobody's working on this, feel free to take it <nobody>
CC: abaron, acanan, acathrow, amureini, iheim, jkt, laravot, lpeer, Rhev-m-bugs, yeylon
Keywords: Triaged
Whiteboard: storage
oVirt Team: Storage
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-07-08 07:09:52 UTC

Attachments: logs

Description Dafna Ron 2013-06-17 12:31:19 UTC
Created attachment 762016 [details]
logs

Description of problem:

I activated a non-master domain which was not visible from the spm only (i.e., only the spm could not see it).
The engine changed spm to the other host but did not set the old spm to nonOperational right away, even though the domain was reported in problem on it.

Version-Release number of selected component (if applicable):

sf18

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster, create two iscsi storage domains; each domain should be created from a lun located on a different storage server.
2. Put the non-master domain in maintenance.
3. Block connectivity to the non-master domain's storage from the spm only (see the example below).
4. Activate the non-master domain.
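
For step 3, one possible way (not necessarily how it was done here) is an iptables DROP rule on the spm host towards the storage server's address, e.g. "iptables -A OUTPUT -d 10.35.64.10 -j DROP" (10.35.64.10 is the storage server address that appears in the logs below).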

Actual results:

we change spm and activate the domain, but the host is only marked nonOperational after the monitoring timeout expires (about five minutes later in the logs below), not right away

Expected results:

if we know there is a problem with the host and we change the spm because of it, we should also mark the host as non-operational right away

Additional info: logs

2013-06-17 15:19:53,153 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. Before Connect all hosts to pool. Time:6/17/13 3:19 PM


2013-06-17 15:21:54,781 ERROR [org.ovirt.engine.core.bll.storage.ISCSIStorageHelper] (pool-4-thread-48) [3e8356b6] The connection with details 10.35.64.10 Dafna-32-03 (LUN 1Dafna-32-031366808) failed because of error code 465 and error message is: failed to setup iscsi subsystem


2013-06-17 15:21:58,361 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-46) [3e8356b6] START, SpmStopVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb), log id: 16f4ef2e

2013-06-17 15:22:06,188 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-46) [3e8356b6] FINISH, ActivateStorageDomainVDSCommand, log id: 55d80d91

At this point the domain is activated and both hosts are up.

Activate is sent again: 

2013-06-17 15:22:06,188 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. After Connect all hosts to pool. Time:6/17/13 3:22 PM


domain reported in problem: 

2013-06-17 15:22:24,569 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 in problem. vds: cougar01


2013-06-17 15:27:24,602 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) vds cougar01 reported domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 as in problem, moving the vds to status NonOperational

Comment 1 Liron Aravot 2013-07-07 17:22:30 UTC
Currently the logic is that when a DATA domain is reported in problem by a host, the engine "waits" X minutes before deciding how to proceed: if only some hosts reported the domain as problematic, those hosts are moved to NonOperational; if all hosts reported the domain as problematic, the domain moves to Inactive.
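
For clarity, here is a minimal sketch of that decision (hypothetical names; this is not the actual IrsBrokerCommand code):

import java.util.Set;

// Illustrative only: after the X-minute timer for a problematic DATA domain
// expires, decide between a host-level and a domain-level problem.
class DomainProblemPolicy {
    private final Set<String> monitoringHosts; // all hosts monitoring the domain

    DomainProblemPolicy(Set<String> monitoringHosts) {
        this.monitoringHosts = monitoringHosts;
    }

    // Called when the timer fires for a domain that was reported "in problem".
    void onTimerExpired(String domainId, Set<String> reportingHosts) {
        if (reportingHosts.containsAll(monitoringHosts)) {
            // Every monitoring host sees the domain as problematic -> domain issue.
            moveDomainToInactive(domainId);
        } else {
            // Only some hosts see a problem -> host issue.
            for (String host : reportingHosts) {
                moveHostToNonOperational(host);
            }
        }
    }

    private void moveDomainToInactive(String domainId) { /* engine state change */ }

    private void moveHostToNonOperational(String host) { /* engine state change */ }
}

The X-minute wait is what lets the engine tell a host-level failure apart from a domain-level one.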

In this case, the host should move to NonOp X minutes after the domain is activated: the other hosts will see the domain while this host won't.

The current behaviour seems fine to me - I'm not in favour of the optimization of moving the host to NonOp right away instead of after X minutes.

The timer is configurable through the StorageDomainFalureTimeoutInMinutes config value, so IMO it can be changed if wanted.
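
For example (assuming the standard engine-config tool on the engine host), it could be changed with "engine-config -s StorageDomainFalureTimeoutInMinutes=2" followed by a restart of the ovirt-engine service.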

Allon?

Comment 2 Ayal Baron 2013-07-08 07:09:52 UTC
Indeed, in order to determine whether the issue is host or domain related, we have a timer so we can see the status from all hosts. Failure to activate will migrate the spm to a different host; in the meantime, a domain in maintenance is not monitored anywhere, so we cannot determine whether the problem is at the host or the domain level, hence moving the host to non-op is wrong.
If, after migrating the SPM, the user successfully activates the domain, it will be monitored by all hosts, and hosts which don't have access to it will then turn non-op.
Note that what you're asking for here is to move a working host with running VMs to non-op because a domain which is in maintenance (and so has no effect on currently running VMs) failed to activate.