Red Hat Bugzilla – Bug 975018
engine: when we activate a domain which is not visible from spm only, we change spm but do not set host as nonOperational right away
Last modified: 2016-02-10 15:23:26 EST
Created attachment 762016 [details]
Description of problem:
I activated a non-master domain which was not visible from the SPM only.
The engine changed the SPM to the other host, but did not set the old SPM to NonOperational right away, even though the domain is reported as in problem on it.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. In a two-host cluster, create two iSCSI storage domains, each created from a LUN located on a different storage server.
2. Put the non-master domain in maintenance.
3. Block connectivity to the non-master domain from the SPM only.
4. Activate the non-master domain.
Actual results: we change the SPM and activate the domain, but only mark the host as NonOperational after the timeout, not right away.
Expected results: if we know there is a problem with the host and we change the SPM because of it, we should also mark that host as NonOperational.
Additional info: logs
2013-06-17 15:19:53,153 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. Before Connect all hosts to pool. Time:6/17/13 3:19 PM
2013-06-17 15:21:54,781 ERROR [org.ovirt.engine.core.bll.storage.ISCSIStorageHelper] (pool-4-thread-48) [3e8356b6] The connection with details 10.35.64.10 Dafna-32-03 (LUN 1Dafna-32-031366808) failed because of error code 465 and error message is: failed to setup iscsi subsystem
2013-06-17 15:21:58,361 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (pool-4-thread-46) [3e8356b6] START, SpmStopVDSCommand(HostName = cougar01, HostId =
4497d431-7c5e-4924-96e0-3f9cdbf826e5, storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb), log id: 16f4ef2e
2013-06-17 15:22:06,188 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-46) [3e8356b6] FINISH, ActivateStorageDomainVDSCommand, log id: 55d80d91
During this time the domain is activated and both hosts are up.
Activate is sent again:
2013-06-17 15:22:06,188 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-46) [3e8356b6] ActivateStorage Domain. After Connect all hosts to pool. Time:6/17/13 3:22 PM
domain reported in problem:
2013-06-17 15:22:24,569 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 in problem. vds: cougar01
2013-06-17 15:27:24,602 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-46) vds cougar01 reported domain 38755249-4bb3-4841-bf5b-05f4a521514d:Dafna-32-03 as in problem, moving the vds to status NonOperational
Currently the logic is that when a DATA domain is reported as in problem by a host, the engine "waits" X minutes before deciding how to proceed: if only some hosts reported the domain as problematic, those hosts are moved to NonOperational; if all the hosts reported the domain as problematic, the domain moves to Inactive.
In this case, the host should move to NonOperational X minutes after the domain is activated by the other hosts while it still cannot see it.
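The decision described above can be sketched roughly as follows (hypothetical class and method names, not the actual engine code):

```java
import java.util.Set;

// Sketch of the engine's decision once the failure timer
// (StorageDomainFalureTimeoutInMinutes) expires: if only some hosts report
// the domain as problematic, those hosts go NonOperational; if every host
// reports it, the domain itself moves to Inactive.
public class DomainProblemDecision {
    enum Outcome { HOSTS_NON_OPERATIONAL, DOMAIN_INACTIVE, NO_ACTION }

    static Outcome decide(Set<String> allHosts, Set<String> reportingHosts) {
        if (reportingHosts.isEmpty()) {
            return Outcome.NO_ACTION;           // no host sees a problem
        }
        if (reportingHosts.containsAll(allHosts)) {
            return Outcome.DOMAIN_INACTIVE;     // every host reports the domain
        }
        return Outcome.HOSTS_NON_OPERATIONAL;   // only some hosts report it
    }
}
```

In the scenario from this bug, only cougar01 reports the domain as problematic, so cougar01 is the one moved to NonOperational once the timer fires.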
The current behaviour seems fine to me - I'm not in favour of this optimization of putting the host into NonOperational right away instead of after X minutes.
The timer is configurable through the StorageDomainFalureTimeoutInMinutes config value, so IMO it can be changed if wanted.
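For example, an admin who does want a faster reaction could lower the timer with the engine-config utility (the value below is illustrative):

```shell
# Illustrative: shorten the storage domain failure timer to 1 minute.
engine-config -s StorageDomainFalureTimeoutInMinutes=1
# Restart the engine so the new value takes effect.
service ovirt-engine restart
```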
Indeed, in order to determine whether the issue is host-related or domain-related, we have a timer so we can see the status from all hosts. A failure to activate will migrate the SPM to a different host; in the meantime, a domain in maintenance is not monitored by any host, so we cannot determine whether the problem is at the host or the domain level. Hence moving the host to NonOperational would be wrong.
If, after the SPM migrates, the user successfully activates the domain, it will be monitored by all hosts, and hosts which don't have access to it will then turn NonOperational.
Note that what you're asking for here is to move a working host with running VMs to NonOperational because a domain which is in maintenance (and has no effect on currently running VMs) failed to activate.