Bug 1394687 - DC gets non-responding when detaching inactive ISO domain
Summary: DC gets non-responding when detaching inactive ISO domain
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.19.5
Assignee: Liron Aravot
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On:
Blocks: 1418020
 
Reported: 2016-11-14 09:04 UTC by Roman Hodain
Modified: 2017-04-21 09:37 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Failure to detach a storage domain from the data center leads to SPM failover (assignment of the SPM role to a different host). Consequence: The SPM role is assigned to a different host, which delays execution of the operations it performs. Fix: Failure to detach a storage domain no longer causes SPM failover. Result: No SPM failover is performed.
Clone Of:
: 1418020
Environment:
Last Closed: 2017-04-21 09:37:31 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
rule-engine: devel_ack+
ratamir: testing_ack+


Attachments
vdsm-SPM logs (1.37 MB, text/plain)
2016-11-14 09:04 UTC, Roman Hodain
no flags Details
Engine logs (13.27 MB, text/plain)
2016-11-14 09:05 UTC, Roman Hodain
no flags Details
vdsm-SPM logs (1.05 MB, application/x-xz)
2016-11-14 09:10 UTC, Roman Hodain
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 71479 0 master MERGED vdsbroker: no failover on detachSD 2017-02-05 09:17:09 UTC
oVirt gerrit 71664 0 ovirt-engine-4.1 MERGED vdsbroker: no failover on detachSD 2017-02-05 16:10:13 UTC

Description Roman Hodain 2016-11-14 09:04:02 UTC
Created attachment 1220302 [details]
vdsm-SPM logs

Description of problem:
When an ISO domain is marked as inactive because it is not visible to any host (e.g. after an NFS server failure), the attempt to detach the ISO SD causes the DC to become non-responsive and reinitialize.

Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an ISO SD
2. Stop the nfs server
3. Wait for a while
4. Start the nfs server
5. Click detach on the ISO domain

Actual results:
DC gets reinitialized

Expected results:
SD is detached 

Additional info:
When trying to access the mount point on the hypervisor, the following is reported:
# cd /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01
-bash: cd: /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01: Stale file handle
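The "Stale file handle" error above can also be detected programmatically. A minimal sketch (illustrative only, not vdsm's actual monitoring code) that distinguishes ESTALE from other access errors:

```python
import errno
import os

def is_mount_stale(path):
    """Return True only when accessing `path` fails with ESTALE,
    as in the shell error above; other failures (ENOENT, EACCES)
    are not treated as a stale mount."""
    try:
        os.stat(path)
        return False
    except OSError as e:
        return e.errno == errno.ESTALE
```

On the reproducer above, calling this on the /rhev/data-center/mnt/... mount point would return True once the NFS export goes stale.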

Comment 1 Roman Hodain 2016-11-14 09:05:17 UTC
Created attachment 1220303 [details]
Engine logs

Comment 3 Roman Hodain 2016-11-14 09:10:54 UTC
Created attachment 1220305 [details]
vdsm-SPM logs

Comment 4 Liron Aravot 2016-11-21 12:54:52 UTC
A storage domain may reach Inactive status even if it is still accessible (based on its reported stats), so oVirt lets the user attempt to detach it.
The "regular" detach operation requires access to the domain because it modifies the domain's metadata, so it fails when the domain isn't available.
If the domain is unavailable, the detach operation fails and a failover (assignment of the SPM role to a different host to retry the operation) occurs.

In order to solve the issue:
1. We can disable the failover for the detach operation; failover in that flow rarely helps.

2. Additionally, looking forward it might be useful to save the reason for the domain being Inactive.
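Option 1 could be sketched as a per-verb failover policy. This is a hypothetical illustration in Python (the actual fix landed in the engine's Java vdsbroker, per the linked gerrit patches); all names here are invented:

```python
# Hypothetical sketch of option 1: decide per SPM verb whether a failure
# should trigger SPM failover. The verb names and policy table are
# illustrative, not oVirt's actual code.

# Detach is excluded: retrying it from another SPM host fails the same
# way when the domain itself is unreachable, so failover only adds delay.
NO_FAILOVER_VERBS = {"detachStorageDomain"}

def should_failover(verb):
    """Return True if a failure of `verb` should reassign the SPM role."""
    return verb not in NO_FAILOVER_VERBS
```

With this policy, a failed detach leaves the SPM role where it is, while other verbs keep the existing failover behaviour.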

Comment 5 Roman Hodain 2016-12-02 08:39:24 UTC
(In reply to Liron Aravot from comment #4)
> A storage domain may get to Inactive status even if it's still accessible
> (because of its reported stats), so ovirt lets the user to attempt and
> detach it.
> The "regular" detach operation requires access to the domain as it modifies
> its metadata - which will cause us to fail in case the domain isn't
> available.
> If the domain is unavailable, the detach operation fails and a failover
> (assignment of the SPM role to a different host to perform retry) occurs.
> 
> In order to solve the issue:
> 1. We can disable the failover for the detach operation, failover on that
> flow should help rarely.
> 
> 2. Additionally, looking forward it might be useful to save the reason for
> the domain being Inactive.

That sounds reasonable.

Thanks.

Comment 6 Liron Aravot 2017-01-29 16:47:18 UTC
In addition to the fix to the detach operation as described in comment 4 (avoiding failover), I'll use this BZ to make changes to the detach flow.
oVirt should block detaching a domain until it's deactivated, as the hosts in the DC may still access it (even via the domain monitoring).
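The proposed validation could look like the following sketch; the status strings and helper name are assumptions for illustration, not oVirt's real enum or API:

```python
# Illustrative sketch of the comment 6 proposal: refuse to detach a
# storage domain unless it was deactivated first, since hosts in the DC
# may still be accessing it (e.g. through domain monitoring). The status
# values below are assumptions, not oVirt's actual status names.

DETACHABLE_STATUSES = {"Maintenance", "Inactive"}

def validate_detach(domain_status):
    """Raise if the domain is still potentially in use by hosts."""
    if domain_status not in DETACHABLE_STATUSES:
        raise ValueError(
            "cannot detach domain in status %r; deactivate it first"
            % domain_status)
```

Inactive is kept in the allowed set because, as this bug shows, a domain that became Inactive due to storage failure must still be detachable.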

Comment 7 Lilach Zitnitski 2017-02-14 08:31:13 UTC
The steps to reproduce are not very clear. By "stop the nfs server", do you mean making the domain inactive by moving it to maintenance, or blocking the connection using iptables?

Comment 8 Roman Hodain 2017-02-23 10:44:28 UTC
Yes, basically:

1. Create an ISO domain and make it active in the env.
2. Use iptables on the NFS share to DROP all traffic coming from the hypervisors.
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

If the DC remains up and does not go to a down state at all, then it is working.

Comment 9 Lilach Zitnitski 2017-02-23 13:22:12 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
vdsm-4.19.6-1.el7ev.x86_64
ovirt-engine-4.1.1.2-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. Create an ISO domain and make it active in the env.
2. Use iptables on the NFS share to DROP all traffic coming from the hypervisors.
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

Actual results:
DC remains up

Expected results:

Moving to VERIFIED!

