+++ This bug was initially created as a clone of Bug #1405772 +++

Description of problem:
When blocking the connection from the host to the NFS server and then moving the master storage domain (which is not NFS) to maintenance, an NFS storage domain becomes the master and gets stuck in Locked status due to the connectivity issue. The master does not initiate a failover to the other active storage domains in the environment. Nothing is left to do with the master storage domain except destroying it, and nothing can be performed in the DC (adding new storage, adding new hosts, re-initializing the DC).

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables (see the iptables sketch after the comments below).
2. While the storage domains are still shown as active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
The master starts a failover to the NFS storage domain and gets stuck there, because after a few minutes all NFS storage domains become inactive and the master cannot perform another failover to another storage domain.

Expected results:
Another active storage domain should become the master, and the NFS storage domains should become inactive.

Additional info:
I had one host and 4 storage domains - 1 gluster (the original master storage domain), 1 iscsi and 2 nfs.

engine.log:

2016-12-18 11:41:36,761+02 INFO [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler10) [1236ae77] Running command: DeactivateStorageDomainCommand internal: true. Entities affected : ID: 148f588c-544f-4346-bf70-8f0ee820e14b Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-12-18 11:41:36,915+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311', ignoreFailoverLimit='false', storageDomainId='148f588c-544f-4346-bf70-8f0ee820e14b', masterDomainId='47147d3b-cb7e-4017-a26c-361c9a83fa3c', masterVersion='4'}), log id: 67572671
2016-12-18 11:41:43,342+02 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [] Setting new tasks map. The map contains now 3 tasks
2016-12-18 11:41:49,424+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' recovered from problem. vds: 'blond-vdsh'
2016-12-18 11:41:49,425+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-12-18 11:42:13,343+02 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler5) [39c2c635] Setting new tasks map. The map contains now 2 tasks
2016-12-18 11:42:37,322+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] Failed in 'DeactivateStorageDomainVDS' method
2016-12-18 11:42:37,336+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [1236ae77] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',)
2016-12-18 11:42:37,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler10) [1236ae77] IrsBroker::Failed::DeactivateStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to DeactivateStorageDomainVDS, error = Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',), code = 358
2016-12-18 11:42:37,355+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] FINISH, DeactivateStorageDomainVDSCommand, log id: 67572671

--- Additional comment from Lilach Zitnitski on 2016-12-18 12:27 IST ---

engine.log
vdsm.log

--- Additional comment from Yaniv Kaul on 2016-12-19 11:32:06 IST ---

Does it happen when the NFS is already non-operational as well, or only in the window of time when a failover happens but NFS is not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:34:06 IST ---

In the UI the nfs storage domains are not yet detected as problematic, they are still in an active mode.

--- Additional comment from Yaniv Kaul on 2016-12-19 11:49:49 IST ---

(In reply to Lilach Zitnitski from comment #3)
> In the UI the nfs storage domains are not yet detected as problematic, they
> are still in an active mode.

But this is exactly my question - does this happen only in the brief window of time where the NFS was not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:52:55 IST ---

Yes, that's what I meant. It happens right after blocking the connection from the host to the NFS server, and before the storage domains become inactive in the UI.

--- Additional comment from Liron Aravot on 2017-01-08 15:55:03 IST ---

--- Additional comment from Sandro Bonazzola on 2017-01-25 09:54:20 IST ---

4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
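For reference, step 1 of the steps to reproduce above can be done with a plain iptables drop rule on the host. A minimal sketch, where NFS_SERVER_IP is a placeholder for the actual NFS server address (not a value taken from this environment):

  # on the hypervisor host: drop all outgoing traffic to the NFS server
  iptables -A OUTPUT -d NFS_SERVER_IP -j DROP

  # once the test is done, remove the rule to restore connectivity
  iptables -D OUTPUT -d NFS_SERVER_IP -j DROP

Dropping all traffic to the server address is the simplest form; matching only the NFS port (TCP 2049) would work as well.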
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.0.7-0.1.el7ev.noarch
vdsm-4.18.22-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables (as in the iptables sketch above).
2. While the storage domains are still active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider; a REST API sketch of this step follows this comment).

Actual results:
A different, active storage domain becomes the master; after a few minutes the NFS storage domain becomes inactive.

Expected results:

Moving to VERIFIED!
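Step 2 of the scenario above was performed through the webadmin UI; the same move to maintenance can also be scripted, since it corresponds to the deactivate action on the attached storage domain in the engine REST API. A minimal sketch with curl, assuming basic authentication and placeholder ENGINE_FQDN, PASSWORD, DC_UUID and SD_UUID values:

  # deactivate (move to maintenance) the attached storage domain SD_UUID in data center DC_UUID
  curl -k -u 'admin@internal:PASSWORD' \
       -H 'Content-Type: application/xml' \
       -X POST -d '<action/>' \
       'https://ENGINE_FQDN/ovirt-engine/api/datacenters/DC_UUID/storagedomains/SD_UUID/deactivate'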
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html