Bug 1416342 - Master failover fails and SD remains locked when blocking connection between host and nfs storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.3
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Liron Aravot
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On: 1405772
Blocks:
 
Reported: 2017-01-25 10:33 UTC by Tal Nisan
Modified: 2017-03-16 15:32 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1405772
Environment:
Last Closed: 2017-03-16 15:32:00 UTC
oVirt Team: Storage
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0542 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 68746 None None None 2017-01-25 10:33:33 UTC
oVirt gerrit 68932 None None None 2017-01-25 10:33:33 UTC
oVirt gerrit 68965 None None None 2017-01-25 10:33:33 UTC

Description Tal Nisan 2017-01-25 10:33:34 UTC
+++ This bug was initially created as a clone of Bug #1405772 +++

Description of problem:
When the connection from the host to the NFS server is blocked and the master storage domain (which is not NFS) is moved to maintenance, an NFS storage domain becomes master and then gets stuck in Locked status due to the connectivity issues. The engine does not initiate a failover to the other active storage domains in the environment.
Nothing can be done with the new master storage domain except destroying it, and no operations can be performed in the DC (adding new storage, adding new hosts, re-initializing the DC).

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains still appear active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).
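
The iptables step above can be sketched as follows. This is a minimal, hypothetical example: the NFS server address (10.0.0.42) is a placeholder for the one in your setup, and a real reproduction may drop only NFS-related ports rather than all traffic. The commands are printed as a dry run; apply them as root on the host.

```shell
# Hypothetical NFS server address -- substitute the one in your setup.
NFS_SERVER=10.0.0.42

# Build iptables rules that drop all traffic to/from the NFS server.
# Printed as a dry run here; run them as root on the host to apply them,
# and swap -I for -D afterwards to restore connectivity.
BLOCK_OUT="iptables -I OUTPUT -d $NFS_SERVER -j DROP"
BLOCK_IN="iptables -I INPUT -s $NFS_SERVER -j DROP"

echo "$BLOCK_OUT"
echo "$BLOCK_IN"
```

After applying the rules, the host can no longer reach the NFS export, while the engine UI still shows the NFS storage domains as Active for a short window, which is the window in which step 2 must be performed.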


Actual results:
The master role fails over to the NFS storage domain and gets stuck there: after a few minutes all NFS storage domains become inactive, and the master cannot fail over again to another storage domain.

Expected results:
Another active storage domain should become master, and the NFS storage domains should become inactive.

Additional info:
I had one host and 4 storage domains: 1 Gluster (the original master storage domain), 1 iSCSI and 2 NFS.

engine.log

2016-12-18 11:41:36,761+02 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler10) [1236ae77] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 148f588c-544f-4346-bf70-8f0ee820e14b Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-12-18 11:41:36,915+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311', ignoreFailoverLimit='false', storageDomainId='148f588c-544f-4346-bf70-8f0ee820e14b', masterDomainId='47147d3b-cb7e-4017-a26c-361c9a83fa3c', masterVersion='4'}), log id: 67572671
2016-12-18 11:41:43,342+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [] Setting new tasks map. The map contains now 3 tasks
2016-12-18 11:41:49,424+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' recovered from problem. vds: 'blond-vdsh'
2016-12-18 11:41:49,425+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-12-18 11:42:13,343+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler5) [39c2c635] Setting new tasks map. The map contains now 2 tasks
2016-12-18 11:42:37,322+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] Failed in 'DeactivateStorageDomainVDS' method
2016-12-18 11:42:37,336+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [1236ae77] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',)
2016-12-18 11:42:37,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler10) [1236ae77] IrsBroker::Failed::DeactivateStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to DeactivateStorageDomainVDS, error = Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',), code = 358
2016-12-18 11:42:37,355+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] FINISH, DeactivateStorageDomainVDSCommand, log id: 67572671

--- Additional comment from Lilach Zitnitski on 2016-12-18 12:27 IST ---

engine.log
vdsm.log

--- Additional comment from Yaniv Kaul on 2016-12-19 11:32:06 IST ---

Does it happen when the NFS is already non-operational as well, or only in the window of time when a failover happens but NFS is not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:34:06 IST ---

In the UI the nfs storage domains are not yet detected as problematic, they are still in an active mode.

--- Additional comment from Yaniv Kaul on 2016-12-19 11:49:49 IST ---

(In reply to Lilach Zitnitski from comment #3)
> In the UI the nfs storage domains are not yet detected as problematic, they
> are still in an active mode.

But this is exactly my question - does this happen only in the brief window of time where the NFS was not yet detected as problematic?

--- Additional comment from Lilach Zitnitski on 2016-12-19 11:52:55 IST ---

Yes, that's what I meant.
It happens right after blocking the connection from the host to the NFS server, and before the storage domains become inactive in the UI.

--- Additional comment from Liron Aravot on 2017-01-08 15:55:03 IST ---



--- Additional comment from Sandro Bonazzola on 2017-01-25 09:54:20 IST ---

4.0.6 was the last oVirt 4.0 release; please re-target this bug.

Comment 1 Lilach Zitnitski 2017-01-26 16:16:45 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.0.7-0.1.el7ev.noarch
vdsm-4.18.22-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains still appear active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
A different, active storage domain becomes master; after a few minutes the NFS storage domain becomes inactive.

Expected results:

Moving to VERIFIED!

Comment 3 errata-xmlrpc 2017-03-16 15:32:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html

