Bug 1405772 - Master failover fails and SD remains locked when blocking connection between host and nfs storage
Summary: Master failover fails and SD remains locked when blocking connection between host and nfs storage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Assignee: Liron Aravot
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Duplicates: 1399477 (view as bug list)
Depends On:
Blocks: 1416342
 
Reported: 2016-12-18 10:26 UTC by Lilach Zitnitski
Modified: 2017-02-15 14:47 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1416342 (view as bug list)
Environment:
Last Closed: 2017-02-15 14:47:43 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+
rule-engine: planning_ack+
rule-engine: devel_ack+
ratamir: testing_ack+


Attachments
logs zip (127.48 KB, application/zip)
2016-12-18 10:27 UTC, Lilach Zitnitski


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 68746 0 master MERGED core: DeactivateStorageDomainWithOvfUpdate - context passing 2016-12-19 17:27:02 UTC
oVirt gerrit 68932 0 ovirt-engine-4.1 MERGED core: DeactivateStorageDomainWithOvfUpdate - context passing 2016-12-22 13:08:54 UTC
oVirt gerrit 68965 0 ovirt-engine-4.0 MERGED core: DeactivateStorageDomainWithOvfUpdate - context passing 2016-12-22 13:09:56 UTC

Description Lilach Zitnitski 2016-12-18 10:26:47 UTC
Description of problem:
When the connection from the host to the NFS server is blocked and the master storage domain (which is not on NFS) is moved to maintenance, an NFS storage domain becomes the master and then gets stuck in Locked status due to the connectivity issues. The master role does not fail over to the other active storage domains in the environment.
Nothing can be done with the master storage domain except destroying it, and no operation can be performed in the DC (adding new storage, adding new hosts, re-initializing the DC).

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains still appear as Active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider); see the sketch below.
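
For reference, a minimal reproduction sketch of these two steps, assuming the Python ovirt-engine-sdk (ovirtsdk4) is available. The engine URL, credentials, NFS server address and the data center / storage domain names are placeholders, not values from this environment; the original reproduction was done with iptables on the host and the UI.

# Hypothetical reproduction sketch - all addresses, names and credentials are placeholders.
import subprocess

import ovirtsdk4 as sdk

NFS_SERVER = '10.0.0.50'          # placeholder: NFS server IP
MASTER_SD_NAME = 'data_gluster'   # placeholder: current master SD (not on NFS)
DC_NAME = 'Default'               # placeholder: data center name

# Step 1 (run as root on the host): drop all traffic from the host to the NFS server.
subprocess.run(['iptables', '-I', 'OUTPUT', '-d', NFS_SERVER, '-j', 'DROP'], check=True)

# Step 2: while the domains still show as Active, move the master SD to maintenance
# through the engine REST API.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    dcs_service = connection.system_service().data_centers_service()
    dc = dcs_service.list(search='name=%s' % DC_NAME)[0]
    attached_sds = dcs_service.data_center_service(dc.id).storage_domains_service()
    master_sd = next(sd for sd in attached_sds.list() if sd.name == MASTER_SD_NAME)
    attached_sds.storage_domain_service(master_sd.id).deactivate()
finally:
    connection.close()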


Actual results:
The master role starts failing over to the NFS storage domain and gets stuck there, because after a few minutes all NFS storage domains become inactive and the master cannot fail over again to another storage domain.

Expected results:
Another active storage domain should become the master, and the NFS storage domains should become inactive.

Additional info:
The environment had one host and 4 storage domains: 1 Gluster (the original master storage domain), 1 iSCSI and 2 NFS.

engine.log

2016-12-18 11:41:36,761+02 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler10) [1236ae77] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 148f588c-544f-4346-bf70-8f0ee820e14b Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-12-18 11:41:36,915+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311', ignoreFailoverLimit='false', storageDomainId='148f588c-544f-4346-bf70-8f0ee820e14b', masterDomainId='47147d3b-cb7e-4017-a26c-361c9a83fa3c', masterVersion='4'}), log id: 67572671
2016-12-18 11:41:43,342+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler2) [] Setting new tasks map. The map contains now 3 tasks
2016-12-18 11:41:49,424+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' recovered from problem. vds: 'blond-vdsh'
2016-12-18 11:41:49,425+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-35) [] Domain '47147d3b-cb7e-4017-a26c-361c9a83fa3c:data_nfs1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-12-18 11:42:13,343+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler5) [39c2c635] Setting new tasks map. The map contains now 2 tasks
2016-12-18 11:42:37,322+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] Failed in 'DeactivateStorageDomainVDS' method
2016-12-18 11:42:37,336+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [1236ae77] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',)
2016-12-18 11:42:37,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler10) [1236ae77] IrsBroker::Failed::DeactivateStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to DeactivateStorageDomainVDS, error = Storage domain does not exist: (u'47147d3b-cb7e-4017-a26c-361c9a83fa3c',), code = 358
2016-12-18 11:42:37,355+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (DefaultQuartzScheduler10) [1236ae77] FINISH, DeactivateStorageDomainVDSCommand, log id: 67572671

Comment 1 Lilach Zitnitski 2016-12-18 10:27:23 UTC
Created attachment 1233079 [details]
logs zip

engine.log
vdsm.log

Comment 2 Yaniv Kaul 2016-12-19 09:32:06 UTC
Does it happen when the NFS is already non-operational as well, or only in the window of time when a failover happens but NFS is not yet detected as problematic?

Comment 3 Lilach Zitnitski 2016-12-19 09:34:06 UTC
In the UI the NFS storage domains are not yet detected as problematic; they are still in Active status.

Comment 4 Yaniv Kaul 2016-12-19 09:49:49 UTC
(In reply to Lilach Zitnitski from comment #3)
> In the UI the nfs storage domains are not yet detected as problematic, they
> are still in an active mode.

But this is exactly my question - does this happen only in the brief window of time where the NFS was not yet detected as problematic?

Comment 5 Lilach Zitnitski 2016-12-19 09:52:55 UTC
Yes, that's what I meant. 
It happens right after blocking the connection from the host to the NFS server, and before the storage domains become inactive in the UI.

Comment 6 Liron Aravot 2017-01-08 13:55:03 UTC
*** Bug 1399477 has been marked as a duplicate of this bug. ***

Comment 7 Sandro Bonazzola 2017-01-25 07:54:20 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 8 Lilach Zitnitski 2017-02-01 14:45:42 UTC
--------------------------------------
Tested with the following code:
--------------------------------------
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. Block the connection from the host to the NFS server using iptables.
2. While the storage domains still appear as Active in the UI, move the master storage domain to maintenance (it has to be on a different storage provider).

Actual results:
The master role fails over to another active storage domain instead of moving straight to the NFS SDs and getting stuck there.
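
One possible way to confirm where the master role ends up after the failover, again just a sketch assuming ovirtsdk4, with placeholder connection details and data center name:

# Hypothetical check - list attached storage domains with their status and master flag.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    dcs_service = connection.system_service().data_centers_service()
    dc = dcs_service.list(search='name=Default')[0]   # placeholder DC name
    attached_sds = dcs_service.data_center_service(dc.id).storage_domains_service()
    for sd in attached_sds.list():
        print('%s: status=%s master=%s' % (sd.name, sd.status, sd.master))
finally:
    connection.close()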

Moving to VERIFIED!

