Bug 1394687

Summary: DC becomes non-responsive when detaching an inactive ISO domain
Product: [oVirt] vdsm
Reporter: Roman Hodain <rhodain>
Component: Core
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Lilach Zitnitski <lzitnits>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.18.13
CC: amureini, bugs, rhodain, tnisan
Target Milestone: ovirt-4.1.1
Flags: rule-engine: ovirt-4.1+
       rule-engine: planning_ack+
       rule-engine: devel_ack+
       ratamir: testing_ack+
Target Release: 4.19.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A failure to detach a storage domain from the data center triggers an SPM failover (assignment of the SPM role to a different host). Consequence: The SPM role moves to a different host, which delays the operations that the SPM performs. Fix: A failure to detach a storage domain no longer causes an SPM failover, regardless of the error. Result: No SPM failover is performed.
Story Points: ---
Clone Of:
Clones: 1418020 (view as bug list)
Environment:
Last Closed: 2017-04-21 09:37:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1418020    
Attachments:
vdsm-SPM logs (flags: none)
Engine logs (flags: none)
vdsm-SPM logs (flags: none)

Description Roman Hodain 2016-11-14 09:04:02 UTC
Created attachment 1220302 [details]
vdsm-SPM logs

Description of problem:
When an ISO domain is marked as inactive because it is not visible to any host (NFS server failure), attempting to detach the ISO SD causes the DC to become non-responsive and reinitialize.

Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an ISO SD
2. Stop the nfs server
3. Wait for a while
4. Start the nfs server
5. Click detach on the ISO domain

Actual results:
DC gets reinitialized

Expected results:
SD is detached 

Additional info:
When trying to access the mount point on the hypervisor, the following is reported:
# cd /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01
-bash: cd: /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01: Stale file handle
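
For reference, a stale handle like this can be detected programmatically. A minimal sketch in Python (not vdsm's actual code; only the mount path is taken from the report above):

import errno
import os

# Mount path from the transcript above.
MOUNT = "/rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01"

def is_mount_stale(path):
    """Return True if accessing the path fails with ESTALE, which is what
    an NFS client sees after the server invalidated its file handles."""
    try:
        os.stat(path)
    except OSError as e:
        return e.errno == errno.ESTALE
    return False

if is_mount_stale(MOUNT):
    print("stale file handle, remount needed:", MOUNT)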

Comment 1 Roman Hodain 2016-11-14 09:05:17 UTC
Created attachment 1220303 [details]
Engine logs

Comment 3 Roman Hodain 2016-11-14 09:10:54 UTC
Created attachment 1220305 [details]
vdsm-SPM logs

Comment 4 Liron Aravot 2016-11-21 12:54:52 UTC
A storage domain may reach Inactive status even while it is still accessible (the status is derived from its reported stats), so oVirt lets the user attempt to detach it.
The "regular" detach operation requires access to the domain because it modifies the domain's metadata, so it fails when the domain is not available.
If the domain is unavailable, the detach operation fails and a failover occurs (the SPM role is assigned to a different host to retry the operation).

In order to solve the issue:
1. We can disable the failover for the detach operation; a failover in that flow rarely helps (see the sketch after this list).

2. Additionally, looking forward it might be useful to record the reason the domain became Inactive.
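
A minimal Python sketch of point 1, with hypothetical names (Operation, handle_spm_task_failure, perform_spm_failover are illustrative, not the actual oVirt/vdsm API):

from enum import Enum, auto

class Operation(Enum):
    DETACH_STORAGE_DOMAIN = auto()
    OTHER = auto()

# Failures of these operations should not move the SPM role to another
# host: a retry on a new SPM would hit the same inaccessible domain.
NO_FAILOVER_OPERATIONS = {Operation.DETACH_STORAGE_DOMAIN}

def handle_spm_task_failure(operation, error):
    if operation in NO_FAILOVER_OPERATIONS:
        # Surface the failure to the caller, keep the current SPM.
        print("operation failed, keeping current SPM:", error)
        return
    perform_spm_failover()  # reassign the SPM role, then retry elsewhere

def perform_spm_failover():
    print("SPM failover: electing a new SPM host")

# Example: a failed detach no longer triggers a failover.
handle_spm_task_failure(Operation.DETACH_STORAGE_DOMAIN,
                        OSError("Stale file handle"))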

Comment 5 Roman Hodain 2016-12-02 08:39:24 UTC
(In reply to Liron Aravot from comment #4)
> A storage domain may reach Inactive status even while it is still
> accessible (the status is derived from its reported stats), so oVirt
> lets the user attempt to detach it.
> The "regular" detach operation requires access to the domain because
> it modifies the domain's metadata, so it fails when the domain is not
> available.
> If the domain is unavailable, the detach operation fails and a
> failover occurs (the SPM role is assigned to a different host to retry
> the operation).
> 
> In order to solve the issue:
> 1. We can disable the failover for the detach operation; a failover in
> that flow rarely helps.
> 
> 2. Additionally, looking forward it might be useful to record the
> reason the domain became Inactive.

That sounds reasonable.

Thanks.

Comment 6 Liron Aravot 2017-01-29 16:47:18 UTC
In addition to the fix to the detach operation described in comment 4 (avoiding the failover), I'll use this BZ to make changes to the detach flow.
oVirt should block detaching a domain until it has been deactivated, as the hosts in the DC may still access it (even if only through domain monitoring); a sketch of such a check follows.
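
A hedged Python sketch of that validation (SdStatus and DetachNotAllowed are hypothetical names in the spirit of this comment, not the actual engine code):

from enum import Enum, auto

class SdStatus(Enum):
    ACTIVE = auto()
    INACTIVE = auto()
    MAINTENANCE = auto()

class DetachNotAllowed(Exception):
    pass

def validate_detach(domain_status):
    """Refuse to detach a domain that the DC may still be using."""
    if domain_status == SdStatus.ACTIVE:
        raise DetachNotAllowed(
            "deactivate the storage domain before detaching it")

validate_detach(SdStatus.INACTIVE)  # ok: the domain is already deactivated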

Comment 7 Lilach Zitnitski 2017-02-14 08:31:13 UTC
The steps to reproduce are not very clear - by "stop the nfs server", do you mean making the domain inactive by moving it to maintenance, or blocking the connection using iptables?

Comment 8 Roman Hodain 2017-02-23 10:44:28 UTC
Yes, basically:

1. Create an ISO domain and make it active in the environment.
2. Use iptables on the NFS server to DROP all traffic coming from the hypervisors (a helper that automates this is sketched below).
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

If the DC remains up and does not go to the Down state at all, then it is working.
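
A small Python helper along these lines can automate step 2 (the hypervisor addresses are examples; run on the NFS server as root):

import subprocess

HYPERVISORS = ["192.0.2.10", "192.0.2.11"]  # example addresses

def set_nfs_block(enabled):
    """Insert (or remove) a DROP rule for each hypervisor address."""
    action = "-I" if enabled else "-D"
    for ip in HYPERVISORS:
        subprocess.run(
            ["iptables", action, "INPUT", "-s", ip, "-j", "DROP"],
            check=True)

# set_nfs_block(True)   # step 2: the domain goes Inactive after a while
# set_nfs_block(False)  # restore connectivity once the test is done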

Comment 9 Lilach Zitnitski 2017-02-23 13:22:12 UTC
--------------------------------------
Tested with the following code:
--------------------------------------
vdsm-4.19.6-1.el7ev.x86_64
ovirt-engine-4.1.1.2-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. Create an ISO domain and make it active in the environment.
2. Use iptables on the NFS server to DROP all traffic coming from the hypervisors.
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

Actual results:
DC remains up

Expected results:
DC remains up

Moving to VERIFIED!