Bug 965197

Summary: engine: when trying to manually activate a domain when the storage is unknown and the storage is still unavailable the host becomes non-operational
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine Assignee: Liron Aravot <laravot>
Status: CLOSED NOTABUG QA Contact: Dafna Ron <dron>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: abaron, acanan, acathrow, amureini, dron, hateya, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, yeylon
Target Milestone: --- Keywords: Reopened, Triaged
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-09 15:53:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2013-05-20 16:41:49 UTC
Created attachment 750655 [details]
logs

Description of problem:

If you block the storage and then, once the domain becomes Unknown, activate it manually, the host becomes non-operational.

Version-Release number of selected component (if applicable):

sf17

How reproducible:

100%

Steps to Reproduce:
1. In a 3-host cluster with 2 iSCSI domains, block connectivity to both domains from all hosts.
2. When the storage becomes unknown, activate the master domain manually.

Actual results:

one of the hosts is set as non-operational

Expected results:

Hosts should always remain in the Up state unless only one of them cannot see the storage.

Additional info: logs

Comment 1 Ayal Baron 2013-05-21 05:57:47 UTC
what's the end result? i.e. what happens after 5 minutes?

Comment 3 Liron Aravot 2013-07-08 17:25:27 UTC
The behaviour here seems to be as expected.
As soon as the storage is blocked, the SPM should fence itself. During the InitVdsOnUp flow the domain status is still Locked, so the host moves to non-operational, as expected, after failing to connect to the pool. On the next run, once the domain status has been compensated back to Unknown, the host moves to status Up. That's the expected behaviour at the moment.

When we attempt to activate the domain its status is Unknown, and the activation locks it:

2013-05-20 19:31:31,623 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-4-thread-43) [3400d2de] START, ActivateStorageDomainVDSCommand( storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d), log id: 6e3a4d55

-------------------------------------------------
2013-05-20 19:31:47,243 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-39) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 4497d431-7c5e-4924-96e0-3f9cdbf826e5 : cougar01, VDS Network Error, continuing.
java.net.ConnectException: Connection refused

---------------------------------------------------

InitVdsOnUp fails: because the master domain is currently locked (by the activation), the host doesn't proceed with the flow.
----------------------------------------------------
2013-05-20 19:32:02,684 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-4-thread-44) START, ConnectStoragePoolVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, vds_spm_id = 1, masterDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d, masterVersion = 523), log id: 1fe6a986

2013-05-20 19:32:14,173 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-44) Could not connect host cougar01 to pool iSCSI
2013-05-20 19:32:14,187 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-47) [3662bbf2] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 4497d431-7c5e-4924-96e0-3f9cdbf826e5 Type: VDS
2013-05-20 19:32:14,190 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-47) [3662bbf2] START, SetVdsStatusVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: c6d68a1
2013-05-20 19:32:14,192 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-47) [3662bbf2] FINISH, SetVdsStatusVDSCommand, log id: 
----------------------------------------------------


After the activation fails, the domain status is compensated back to Unknown:
-----------------------------------------------------
2013-05-20 19:32:21,232 ERROR [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-43) [3400d2de] Command org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: Cannot allocate IRS server
2013-05-20 19:32:21,235 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-4-thread-43) [3400d2de] Command [id=10446986-610d-4d80-84f9-0e663b75ec7a]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot [id=storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, storageId = 38755249-4bb3-4841-bf5b-05f4a521514d, status=Unknown].
-----------------------------------------------------
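
The lock-then-compensate pattern visible in the two log lines above can be sketched roughly as follows. This is only an illustration, not the engine's code: `PoolDomainMap`, `locked_with_compensation` and `activate` are hypothetical names, and the only details taken from the log are the Unknown status snapshot and the "Cannot allocate IRS server" error.

```python
from contextlib import contextmanager
from enum import Enum


class DomainStatus(Enum):
    UNKNOWN = "Unknown"
    LOCKED = "Locked"
    ACTIVE = "Active"


class PoolDomainMap:
    """Hypothetical stand-in for the pool-to-domain status record."""

    def __init__(self, status):
        self.status = status


@contextmanager
def locked_with_compensation(entry):
    """Lock the entry; if the body fails, restore the snapshot and re-raise."""
    snapshot = entry.status            # status before the command started
    entry.status = DomainStatus.LOCKED
    try:
        yield entry
    except Exception:
        entry.status = snapshot        # compensate: roll back to the snapshot
        raise


def activate(entry, storage_reachable):
    """Hypothetical activation that fails while the storage is blocked."""
    with locked_with_compensation(entry):
        if not storage_reachable:
            raise RuntimeError("Cannot allocate IRS server")
        entry.status = DomainStatus.ACTIVE
```

With the storage blocked, `activate` raises and the entry ends up back at `DomainStatus.UNKNOWN`; in between, any other flow that reads the status sees `LOCKED`.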

On the next run the operation is not failed, although connecting to the pool still fails, because the domain status is now unknown/inactive:

-----------------------------------------------------
2013-05-20 19:35:02,686 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-4-thread-43) START, ConnectStoragePoolVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, vds_spm_id = 1, masterDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d, masterVersion = 523), log id: 59847605
2013-05-20 19:35:05,182 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-43) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb, msdUUID=38755249-4bb3-4841-bf5b-05f4a521514d']]
2013-05-20 19:35:05,182 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-43) HostName = cougar01
2013-05-20 19:35:05,182 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-43) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb, msdUUID=38755249-4bb3-4841-bf5b-05f4a521514d'
2013-05-20 19:35:05,182 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-4-thread-43) FINISH, ConnectStoragePoolVDSCommand, log id: 59847605
2013-05-20 19:35:05,182 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-43) Could not connect host cougar01 to pool iSCSI, as the master domain is in inactive/unknown status - not failing the operation
---------------------------------------------------------------------
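
To make the ordering of the quoted log lines explicit, here is a small sketch that checks which of the two failed pool connections falls inside the window where the domain was locked. The timestamps are copied from the excerpts above; treating the ActivateStorageDomainVDSCommand START line as the beginning of the locked window is an assumption, since the log does not show the exact moment the status changed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"


def ts(text):
    """Parse an engine.log style timestamp such as '2013-05-20 19:31:31,623'."""
    return datetime.strptime(text, FMT)


# Assumed start of the locked window: activation START line.
LOCK_START = ts("2013-05-20 19:31:31,623")
# End of the locked window: compensation back to Unknown.
LOCK_END = ts("2013-05-20 19:32:21,235")


def domain_locked_at(moment):
    """True if the moment falls inside the assumed locked window."""
    return LOCK_START <= moment < LOCK_END


# "Could not connect host cougar01 to pool iSCSI" (host set to NonOperational)
first_failure = ts("2013-05-20 19:32:14,173")
# "... master domain is in inactive/unknown status - not failing the operation"
second_failure = ts("2013-05-20 19:35:05,182")
```

Under that assumption `domain_locked_at(first_failure)` is true and `domain_locked_at(second_failure)` is false, which matches the two different outcomes in the log.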

Allon, seems to me like this one can be closed.

Comment 4 Allon Mureinik 2013-07-09 05:08:48 UTC
Closing based on comment 3.

QA guys - if I'm missing something, please reopen and elaborate.

Comment 5 Dafna Ron 2013-07-09 07:54:02 UTC
Hosts should stay active unless only one of them can't see the storage.
This is the behaviour that was decided by devel.
If manual activation by the user changes the flow, then the case in which a user activates a domain manually was not handled correctly.
Reopening this bug.

Comment 6 Ayal Baron 2013-07-09 09:59:53 UTC
(In reply to Dafna Ron from comment #5)
> hosts should stay active unless only one of them can't see the storage. 
> This is the behaviour that was decided by devel. 
> if the manual activation by the user changes the flow then a particular
> behaviour in which a user activates a domain manually was not handled
> correctly. 
> reopening this bug

This is not correct.
The host (cougar01) was SPM and lost its lease, so vdsm was killed:
2013-05-20 19:31:58+0300 117008 [6758]: s27 kill 29988 sig 9 count 41

When coming back up it failed to connect to the pool

Thread-17::ERROR::2013-05-20 19:32:18,452::task::850::TaskManager.Task::...
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb, msdUUID=38755249-4bb3-4841-bf5b-05f4a521514d'

Hence it moves to non-op until the domain changes state.
That is the correct behaviour.

Comment 7 Dafna Ron 2013-07-09 11:19:39 UTC
(In reply to Ayal Baron from comment #6)
> This is not correct.
> The host (cougar01) was spm and lost its lease so killed vdsm:
> 2013-05-20 19:31:58+0300 117008 [6758]: s27 kill 29988 sig 9 count 41
> 
> When coming back up it failed to connect to the pool
> 
> Thread-17::ERROR::2013-05-20 19:32:18,452::task::850::TaskManager.Task::...
> StoragePoolMasterNotFound: Cannot find master domain:
> 'spUUID=7fd33b43-a9f4-4eb7-a885-e9583a929ceb,
> msdUUID=38755249-4bb3-4841-bf5b-05f4a521514d'
> 
> Hence it moves to non-op until the domain changes state.
> That is the correct behaviour.

spm no longer needs to change state to nonop when it fails to connect to pool.
That was the whole flow change

Comment 8 Ayal Baron 2013-07-09 15:53:10 UTC
(In reply to Dafna Ron from comment #7)
> spm no longer needs to change state to nonop when it fails to connect to
> pool.
> That was the whole flow change

This has nothing to do with SPM.
vdsm starts up not connected to the pool. The engine runs InitVdsOnUp, which calls connectStoragePool; that call failed, so the host moves to non-op until the domain changes state to inactive.
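
The rule described in this thread can be written down as a minimal sketch. It assumes only what the comments and the log say: a failed pool connection is tolerated when the master domain is Unknown or Inactive, and otherwise the host goes non-operational. The names below are hypothetical and do not come from the engine source.

```python
from enum import Enum


class DomainStatus(Enum):
    UNKNOWN = "Unknown"
    INACTIVE = "Inactive"
    LOCKED = "Locked"
    ACTIVE = "Active"


class HostStatus(Enum):
    UP = "Up"
    NON_OPERATIONAL = "NonOperational"


# Master-domain states in which a failed pool connection is not held
# against the host ("not failing the operation" in the log).
TOLERATED = {DomainStatus.UNKNOWN, DomainStatus.INACTIVE}


def host_status_after_init(connected_to_pool, master_status):
    """Hypothetical outcome of the host-up flow for one host."""
    if connected_to_pool or master_status in TOLERATED:
        return HostStatus.UP
    return HostStatus.NON_OPERATIONAL
```

In this bug the first run saw a failed connection with the master domain Locked, hence non-op; the later run saw the same failure with the domain Unknown, hence Up.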