Description of problem:
When blocking the connection from the host to the NFS server and then moving the master storage domain (which is not NFS) to maintenance, an NFS storage domain becomes master and then gets stuck in Locked status due to the connectivity issues. The master does not initiate a failover to the other active storage domains in the environment. Nothing is left to do with the master storage domain except destroy it, and nothing can be performed in the DC (adding new storage, adding new hosts, re-initializing the DC).

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains are still active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
The master starts a failover to the NFS storage domain and gets stuck there, because after a few minutes all the NFS storage domains become inactive and the master cannot perform another failover to another storage domain.

Expected results:
Another active storage domain should become master and the NFS storage domains should become inactive.

Additional info:
I had one host and 4 storage domains - 1 Gluster (the original master storage domain), 1 iSCSI and 2 NFS.

engine.log:

2016-12-18 11:41:36,761+02 INFO [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler10) [1236ae77] Running command: DeactivateStorageDomainCommand internal: true. Entities affected : ID: 148f588c-544f-4346-bf70-8f0ee820e14b Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

2016-12-18 11:41:36,915+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311', ignoreFailoverLimit='false', storageDomainId='148f588c-544f-4346-bf70-8f0ee820e14b', masterDomainId='47147d3b-cb7e-4017-a26c-361c9a83fa3c', masterVersion='4'}), log id: 67572671

2016-12-18 11:41:43,342+02 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [] Setting new tasks map. The map contains now 3 tasks

2016-12-18 11:41:49,424+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' recovered from problem. vds: 'blond-vdsh'

2016-12-18 11:41:49,425+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.

2016-12-18 11:42:13,343+02 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler5) [39c2c635] Setting new tasks map. The map contains now 2 tasks

2016-12-18 11:42:37,322+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] Failed in 'DeactivateStorageDomainVDS' method

2016-12-18 11:42:37,336+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [1236ae77] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',)

2016-12-18 11:42:37,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler10) [1236ae77] IrsBroker::Failed::DeactivateStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to DeactivateStorageDomainVDS, error = Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',), code = 358

2016-12-18 11:42:37,355+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] FINISH, DeactivateStorageDomainVDSCommand, log id: 67572671
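For step 1 of the reproduction, the connectivity block can be sketched with iptables rules like the following. This is only a sketch: the NFS server address is a placeholder, not taken from the environment above.

```shell
#!/bin/sh
# Sketch of step 1: block traffic from the host to the NFS server.
# NFS_SERVER is a placeholder; substitute the real NFS server IP.
NFS_SERVER="10.35.0.100"

# Drop all outbound packets from this host to the NFS server
iptables -A OUTPUT -d "$NFS_SERVER" -j DROP

# Later, delete the same rule to restore connectivity
iptables -D OUTPUT -d "$NFS_SERVER" -j DROP
```

Dropping packets (rather than rejecting them) makes the NFS mount hang until VDSM's monitoring times out, which is what opens the window described in this bug.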
Created attachment 1233079 [details]
logs zip: engine.log, vdsm.log
Does it happen when the NFS is already non-operational as well, or only in the window of time when a failover happens but NFS is not yet detected as problematic?
In the UI the NFS storage domains are not yet detected as problematic; they are still in Active status.
(In reply to Lilach Zitnitski from comment #3)
> In the UI the nfs storage domains are not yet detected as problematic, they
> are still in an active mode.

But this is exactly my question - does this happen only in the brief window of time in which the NFS was not yet detected as problematic?
Yes, that's what I meant. It happens right after blocking the connection from the host to the NFS server, and before the storage domains become inactive in the UI.
*** Bug 1399477 has been marked as a duplicate of this bug. ***
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
--------------------------------------
Tested with the following code:
----------------------------------------
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains are still active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
The master storage domain fails over to another active storage domain; it does not move straight to the NFS SDs and does not get stuck there.

Moving to VERIFIED!
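For completeness, step 2 (moving the master storage domain to maintenance) can also be driven through the oVirt REST API instead of the UI, using the deactivate action on the attached storage domain. This is a hedged sketch: the engine host, credentials, and the storage-domain UUID reuse are placeholders, not values confirmed by this report.

```shell
#!/bin/sh
# Sketch only: deactivate (move to maintenance) an attached storage domain
# via the oVirt REST API. Host, credentials, and UUIDs are placeholders.
ENGINE="https://engine.example.com/ovirt-engine/api"
DC_ID="00000001-0001-0001-0001-000000000311"   # data center UUID
SD_ID="47147d3b-cb7e-4017-a26c-361c9a83fa3c"   # master storage domain UUID

curl -k -u 'admin@internal:password' \
     -H 'Content-Type: application/xml' \
     -X POST \
     -d '<action/>' \
     "$ENGINE/datacenters/$DC_ID/storagedomains/$SD_ID/deactivate"
```

Scripting the deactivation this way makes it easier to hit the brief window, discussed in the comments above, between blocking the NFS connection and the engine marking the NFS domains as problematic.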