Bug 975018 - engine: when we activate a domain which is not visible from spm only, we change spm but do not set host as nonOperational right away
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
3.2.0
x86_64 Linux
unspecified Severity medium
: ---
: 3.3.0
Assigned To: Nobody's working on this, feel free to take it
storage
: Triaged
Depends On:
Blocks:
Reported: 2013-06-17 08:31 EDT by Dafna Ron
Modified: 2016-02-10 15:23 EST (History)
10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-08 03:09:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (1.73 MB, application/x-gzip)
2013-06-17 08:31 EDT, Dafna Ron

Description Dafna Ron 2013-06-17 08:31:19 EDT
Created attachment 762016 [details]
logs

Description of problem:

I activated a non-master domain that was not visible from the SPM host only.
The engine moved the SPM role to the other host but did not set the old SPM host to NonOperational right away, even though the domain is reported as in problem on the old SPM.

Version-Release number of selected component (if applicable):

sf18

How reproducible:

100%

Steps to Reproduce:
1. in a two hosts cluster, create two iscsi storage domains, each domain should be created from a lun located on a different storage server. 
2. put the non-master domain in maintenance 
3. block connectivity to the non-master domain from the spm only
4. activate the non-master domain
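Step 3 can be reproduced with an iptables rule on the SPM host. This is a hypothetical dry-run sketch: the storage server IP is taken from the failed connection in the logs below, and the `run` wrapper only echoes the commands so they can be reviewed; drop it to execute them as root on the SPM.

```shell
#!/bin/sh
# Dry-run sketch of step 3: block traffic from the SPM host only to the
# storage server backing the non-master domain (IP is illustrative,
# taken from the connection error in the attached logs).
STORAGE_IP=10.35.64.10
run() { echo "+ $*"; }    # echo instead of executing; remove to run for real

run iptables -A OUTPUT -d "$STORAGE_IP" -j DROP   # block traffic to the storage server
# ... perform step 4 (activate the non-master domain), then restore connectivity:
run iptables -D OUTPUT -d "$STORAGE_IP" -j DROP
```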

Actual results:

we change the SPM and activate the domain, but only mark the host as NonOperational after the monitoring timeout expires (about five minutes later, per the logs below), not right away

Expected results:

if we know that there is a problem with the host and we change the SPM, we should also mark the host as non-operational right away

Additional info: logs

2013-06-17 15:19:53,153 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. Before Connect all hosts to pool. Time:6/17/13 3:19 PM


2013-06-17 15:21:54,781 ERROR [org.ovirt.engine.core.bll.storage.ISCSIStorageHelper] (pool-4-thread-48) [3e8356b6] The connection with details 10.35.64.10 Dafna-32-03 (LUN 1Dafna-32-031366808) failed because of error code 465 and error message is: failed to setup iscsi subsystem


2013-06-17 15:21:58,361 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-46) [3e8356b6] START, SpmStopVDSCommand(HostName = cougar01, HostId =
 4497d431-7c5e-4924-96e0-3f9cdbf826e5, storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb), log id: 16f4ef2e

2013-06-17 15:22:06,188 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-46) [3e8356b6] FINISH, ActivateStorageDomainVDSCommand, log id: 55d80d91

At this point the domain is activated and both hosts are up.

Activate is sent again: 

2013-06-17 15:22:06,188 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. After Connect all hosts to pool. Time:6/17/13 3:22 PM


domain reported in problem: 

2013-06-17 15:22:24,569 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 in problem. vds: cougar01


2013-06-17 15:27:24,602 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) vds cougar01 reported domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 as in problem, moving the vds to status NonOperational
Comment 1 Liron Aravot 2013-07-07 13:22:30 EDT
Currently the logic is that when a DATA domain is reported in problem by a host, the engine "waits" X minutes before deciding how to proceed: if only some hosts reported the domain as problematic, those hosts are moved to NonOperational; if all the hosts reported the domain as problematic, the domain moves to Inactive.

In this case, the host should move to NonOperational X minutes after the domain is activated by the other hosts while it still cannot see it.

The current behaviour seems fine to me - I'm not in favour of having this optimization of putting the host to Non-op right away instead of after X minutes.  
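The timer-based decision described above can be sketched as follows. This is a hypothetical illustration of the rule, not the actual ovirt-engine code; the class, enum, and method names are invented for the example.

```java
import java.util.Set;

// Hypothetical sketch of the decision rule described in this comment:
// once the timeout elapses, a domain reported as problematic by only SOME
// hosts moves those hosts to NonOperational, while a domain reported as
// problematic by ALL hosts becomes Inactive.
public class DomainProblemTimer {
    public enum Outcome { HOSTS_NON_OPERATIONAL, DOMAIN_INACTIVE }

    // reportingHosts: hosts that flagged the domain as in problem
    // allHosts: every host monitoring the pool
    public static Outcome decide(Set<String> reportingHosts, Set<String> allHosts) {
        if (reportingHosts.containsAll(allHosts)) {
            return Outcome.DOMAIN_INACTIVE;       // every host sees a problem: blame the domain
        }
        return Outcome.HOSTS_NON_OPERATIONAL;     // only some hosts: blame those hosts
    }

    public static void main(String[] args) {
        Set<String> all = Set.of("cougar01", "cougar02");
        // Only the old SPM (cougar01) cannot see the domain -> that host goes NonOperational
        System.out.println(decide(Set.of("cougar01"), all));
        // Both hosts report the problem -> the domain goes Inactive
        System.out.println(decide(all, all));
    }
}
```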

The timeout is configurable through the StorageDomainFalureTimeoutInMinutes config value, so IMO it can be changed if wanted.
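Changing that timeout would typically go through the engine-config tool on the engine machine. A minimal sketch, assuming the standard get/set flags and that the new value (in minutes) shown here is only illustrative:

```shell
# Inspect the current monitoring timeout (value is in minutes)
engine-config -g StorageDomainFalureTimeoutInMinutes

# Set it to a shorter timeout, e.g. 2 minutes (illustrative value)
engine-config -s StorageDomainFalureTimeoutInMinutes=2

# A restart of the engine service is needed for the change to take effect
# (service name/command depends on the RHEV/oVirt version)
service ovirt-engine restart
```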

Allon?
Comment 2 Ayal Baron 2013-07-08 03:09:52 EDT
Indeed, in order to determine whether the issue is host-related or domain-related, we use a timer so we can see the status from all hosts. A failure to activate migrates the SPM to a different host; in the meantime, a domain in maintenance is not monitored anywhere, so we cannot determine whether the problem is at the host level or at the domain level. Hence moving the host to NonOperational is wrong.
If, after migrating the SPM, the user successfully activates the domain, it will be monitored by all hosts, and hosts which don't have access to it will then turn NonOperational.
Note that what you're asking for here is to move a working host with running VMs to NonOperational because a domain which is in maintenance (and has no effect on currently running VMs) failed to activate.
