Bug 1416342

Summary: Master failover fails and SD remains locked when blocking connection between host and nfs storage
Product: Red Hat Enterprise Virtualization Manager
Reporter: Tal Nisan <tnisan>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED ERRATA
QA Contact: Lilach Zitnitski <lzitnits>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0.3
CC: bugs, gklein, knarra, laravot, lsurette, lzitnits, ratamir, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul, ylavi
Target Milestone: ovirt-4.0.7
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1405772
Environment:
Last Closed: 2017-03-16 15:32:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1405772
Bug Blocks:

Description Tal Nisan 2017-01-25 10:33:34 UTC
+++ This bug was initially created as a clone of Bug #1405772 +++

Description of problem:
When the connection from the host to the NFS server is blocked and the master storage domain (which is not on NFS) is moved to maintenance, an NFS storage domain becomes the master and then gets stuck in Locked status due to the connectivity issue. The master role does not fail over to the other active storage domains in the environment.
Nothing can be done with the stuck master storage domain except destroying it, and no operation can be performed in the DC (adding new storage, adding new hosts, re-initializing the DC).

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables (see the sketch below).
2. While the storage domains still show as Active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).
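
A minimal sketch of step 1 (not from the original report), assuming it is run as root on the hypervisor host; NFS_SERVER is a placeholder for the real NFS server address:

# Minimal sketch, assuming root on the hypervisor host.
# NFS_SERVER is a placeholder; replace it with the address of the NFS
# server backing the NFS storage domains.
import subprocess

NFS_SERVER = '10.35.0.100'   # placeholder address

def block_nfs():
    # Drop all traffic from this host to the NFS server.
    subprocess.check_call(['iptables', '-I', 'OUTPUT', '-d', NFS_SERVER, '-j', 'DROP'])

def unblock_nfs():
    # Remove the blocking rule once the test is done.
    subprocess.check_call(['iptables', '-D', 'OUTPUT', '-d', NFS_SERVER, '-j', 'DROP'])

if __name__ == '__main__':
    block_nfs()

Step 2 is then performed from the Administration Portal while the domains still show as Active.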


Actual results:
The master role fails over to the NFS storage domain and gets stuck there: after a few minutes all NFS storage domains become inactive, and the master role cannot fail over again to another storage domain.

Expected results:
Another active storage domain should become the master, and the NFS storage domains should become inactive.

Additional info:
I had one host and 4 storage domains: 1 Gluster (the original master storage domain), 1 iSCSI, and 2 NFS.
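
For anyone reproducing this, a minimal monitoring sketch (not part of the original report) for watching which attached domain holds the master role and whether it gets stuck in Locked, assuming the oVirt Python SDK (ovirtsdk4) and placeholder engine URL, credentials and data center name:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='password',                                # placeholder credentials
    insecure=True,
)

dcs_service = connection.system_service().data_centers_service()
dc = dcs_service.list(search='name=Default')[0]          # placeholder DC name
attached_sds = dcs_service.data_center_service(dc.id).storage_domains_service()

# Poll the attached storage domains and report which one holds the master
# role and its status; a domain stuck in Locked is flagged.
for _ in range(30):
    for sd in attached_sds.list():
        flag = '  <-- Locked' if sd.status == types.StorageDomainStatus.LOCKED else ''
        print('%-20s master=%-5s %s%s' % (sd.name, sd.master, sd.status, flag))
    print('---')
    time.sleep(10)

connection.close()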

engine.log

2016-12-18 11:41:36,761+02 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler10) [1236ae77] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 148f588c-544f-4346-bf70-8f0ee820e14b Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-12-18 11:41:36,915+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311', ignoreFailoverLimit='false', storageDomainId='148f588c-544f-4346-bf70-8f0ee820e14b', masterDomainId='47147d3b-cb7e-4017-a26c-361c9a83fa3c', masterVersion='4'}), log id: 67572671
2016-12-18 11:41:43,342+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [] Setting new tasks map. The map contains now 3 tasks
2016-12-18 11:41:49,424+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' recovered from problem. vds: 'blond-vdsh'
2016-12-18 11:41:49,425+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-12-18 11:42:13,343+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler5) [39c2c635] Setting new tasks map. The map contains now 2 tasks
2016-12-18 11:42:37,322+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] Failed in 'DeactivateStorageDomainVDS' method
2016-12-18 11:42:37,336+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [1236ae77] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',)
2016-12-18 11:42:37,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler10) [1236ae77] IrsBroker::Failed::DeactivateStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to DeactivateStorageDomainVDS, error = Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',), code = 358
2016-12-18 11:42:37,355+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] FINISH, DeactivateStorageDomainVDSCommand, log id: 67572671

--- Additional comment from Lilach Zitnitski on 2016-12-18 12:27 IST ---

engine.log
vdsm.log

--- Additional comment from Yaniv Kaul on 2016-12-19 11:32:06 IST ---

Does it happen when the NFS is already non-operational as well, or only in the window of time when a failover happens but NFS is not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:34:06 IST ---

In the UI the nfs storage domains are not yet detected as problematic, they are still in an active mode.

--- Additional comment from Yaniv Kaul on 2016-12-19 11:49:49 IST ---

(In reply to Lilach Zitnitski from comment #3)
> In the UI the nfs storage domains are not yet detected as problematic, they
> are still in an active mode.

But this is exactly my question - does this happen only in the brief window of time where the NFS was not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:52:55 IST ---

Yes, that's what I meant.
It happens right after blocking the connection from the host to the NFS server, and before the storage domains become inactive in the UI.

--- Additional comment from Liron Aravot on 2017-01-08 15:55:03 IST ---



--- Additional comment from Sandro Bonazzola on 2017-01-25 09:54:20 IST ---

4.0.6 has been the last oVirt 4.0 release; please re-target this bug.

Comment 1 Lilach Zitnitski 2017-01-26 16:16:45 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.0.7-0.1.el7ev.noarch
vdsm-4.18.22-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains still show as Active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
A different, active storage domain becomes the master; after a few minutes the NFS storage domain becomes inactive.

Expected results:

Moving to VERIFIED!
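
An optional sanity check along the same lines (again only a sketch with placeholder connection details, not part of the verification above): confirm that the master role now sits on an Active, non-NFS domain.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='password',                                # placeholder credentials
    insecure=True,
)

dcs_service = connection.system_service().data_centers_service()
top_sds_service = connection.system_service().storage_domains_service()
dc = dcs_service.list(search='name=Default')[0]          # placeholder DC name

for sd in dcs_service.data_center_service(dc.id).storage_domains_service().list():
    if sd.master:
        # The storage type (nfs/iscsi/glusterfs/...) comes from the top-level
        # storage domain representation.
        storage_type = top_sds_service.storage_domain_service(sd.id).get().storage.type
        print('master domain: %s, status=%s, storage=%s' % (sd.name, sd.status, storage_type))
        assert sd.status == types.StorageDomainStatus.ACTIVE
        assert storage_type != types.StorageType.NFS

connection.close()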

Comment 3 errata-xmlrpc 2017-03-16 15:32:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html