Bug 1394687
Summary: | DC gets non-responding when detaching inactive ISO domain
---|---
Product: | [oVirt] vdsm
Reporter: | Roman Hodain <rhodain>
Component: | Core
Assignee: | Liron Aravot <laravot>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Lilach Zitnitski <lzitnits>
Severity: | high
Docs Contact: |
Priority: | unspecified
Version: | 4.18.13
CC: | amureini, bugs, rhodain, tnisan
Target Milestone: | ovirt-4.1.1
Flags: | rule-engine: ovirt-4.1+, rule-engine: planning_ack+, rule-engine: devel_ack+, ratamir: testing_ack+
Target Release: | 4.19.5
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix

Doc Text:

- Cause: A failure to detach a storage domain from the data center triggered an SPM failover (reassignment of the SPM role to a different host).
- Consequence: The SPM role was reassigned to a different host, which delayed the operations that the SPM performs.
- Fix: A failure to detach a storage domain no longer causes an SPM failover, regardless of the error.
- Result: No SPM failover is performed when detaching a storage domain fails.

Story Points: | ---
---|---
Clone Of: |
Clones: | 1418020 (view as bug list)
Environment: |
Last Closed: | 2017-04-21 09:37:31 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | Storage
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1418020
Attachments: |

Created attachment 1220303 [details]
Engine logs

Created attachment 1220305 [details]
vdsm-SPM logs

A storage domain may reach Inactive status even while it is still accessible (because of its reported stats), so oVirt lets the user attempt to detach it. The "regular" detach operation requires access to the domain because it modifies the domain's metadata, so it fails when the domain is unavailable. When the detach fails, a failover occurs (the SPM role is assigned to a different host in order to retry).

In order to solve the issue:

1. We can disable the failover for the detach operation; a failover on that flow should rarely help.
2. Additionally, looking forward, it might be useful to record the reason for the domain becoming Inactive.

(In reply to Liron Aravot from comment #4)
> In order to solve the issue:
> 1. We can disable the failover for the detach operation [...]

That sounds reasonable. Thanks.

In addition to the fix to the detach operation described in comment 4 (avoiding the failover), I'll use this BZ to make changes to the detach flow: oVirt should block detaching a domain until it is deactivated, since the hosts in the DC may still access it (even just through domain monitoring). A sketch of that ordering follows this thread.

The steps to reproduce are not very clear - by "stop the NFS server", do you mean making the domain inactive by moving it to maintenance, or blocking the connection using iptables?

Yes, basically:

1. Create an ISO domain and make it in use in the environment.
2. Use iptables on the NFS share to DROP all traffic coming from the hypervisors (see the iptables sketch below).
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

If the DC remains up and does not go to the Down state at all, then it is working.

Tested with the following code:
vdsm-4.19.6-1.el7ev.x86_64
ovirt-engine-4.1.1.2-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. Create an ISO domain and make it in use in the environment.
2. Use iptables on the NFS share to DROP all traffic coming from the hypervisors.
3. Wait until the domain is marked as inactive.
4. Try to detach the ISO domain.

Actual results:
DC remains up

Expected results:

Moving to VERIFIED!
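For illustration only: the deactivate-before-detach ordering proposed above can be exercised through the engine REST API roughly as below. This is a minimal sketch, not the actual fix; the engine host, the credentials, and the DC_ID/SD_ID placeholders are assumptions about a test setup, not values from this bug.

```sh
# Sketch: move the storage domain to maintenance first, then detach,
# so the detach never runs while hosts still monitor the domain.
# engine.example.com, the credentials and DC_ID/SD_ID are placeholders.

# Deactivate (move to maintenance):
curl -s -k -u admin@internal:password -X POST \
  -H 'Content-Type: application/xml' -d '<action/>' \
  'https://engine.example.com/ovirt-engine/api/datacenters/DC_ID/storagedomains/SD_ID/deactivate'

# Detach from the data center:
curl -s -k -u admin@internal:password -X DELETE \
  'https://engine.example.com/ovirt-engine/api/datacenters/DC_ID/storagedomains/SD_ID'
```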
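Step 2 of the reproduction (dropping the hypervisors' traffic on the NFS server) can be done along these lines; the addresses are placeholder hypervisor IPs, not ones from this environment.

```sh
# On the NFS server exporting the ISO domain: drop all traffic from
# the hypervisors so domain monitoring starts to fail.
iptables -I INPUT -s 192.0.2.11 -j DROP
iptables -I INPUT -s 192.0.2.12 -j DROP

# After the domain goes Inactive and the detach has been attempted,
# remove the rules to restore connectivity:
iptables -D INPUT -s 192.0.2.11 -j DROP
iptables -D INPUT -s 192.0.2.12 -j DROP
```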
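To watch the verified behavior ("DC remains up") without the UI, the data center status can be polled through the engine REST API while the detach is attempted. A small sketch, assuming an engine at engine.example.com, default admin credentials, and a data center named Default:

```sh
# Poll the data center status during the detach attempt; the fix
# holds if the reported status never leaves "up".
while true; do
    curl -s -k -u admin@internal:password \
        'https://engine.example.com/ovirt-engine/api/datacenters?search=name%3DDefault' \
        | grep -o '<status>[a-z_]*</status>'
    sleep 5
done
```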
Created attachment 1220302 [details]
vdsm-SPM logs

Description of problem:
When an ISO domain is marked as inactive because it is not visible to any host (NFS server failure), the attempt to detach the ISO storage domain causes the DC to become non-responding and reinitialize.

Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an ISO SD.
2. Stop the NFS server (see the sketch after this description).
3. Wait for a while.
4. Start the NFS server.
5. Click detach on the ISO domain.

Actual results:
DC gets reinitialized

Expected results:
SD is detached

Additional info:
When trying to access the mount point on the hypervisor, the following is reported:

```
# cd /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01
-bash: cd: /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01: Stale file handle
```
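Steps 2-4 of the description boil down to an NFS outage that outlives the domain-monitoring timeout. On an EL7 NFS server this could look like the following, assuming the stock nfs-server unit:

```sh
# Simulate the NFS server failure from steps 2-4. "nfs-server" is the
# usual unit name on EL7; the wait time is arbitrary.
systemctl stop nfs-server
sleep 300    # wait until the engine marks the ISO domain Inactive
systemctl start nfs-server
```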
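The stale handle shown above survives the NFS server's return. As a diagnostic workaround only (vdsm normally manages these mounts itself), a lazy unmount clears it:

```sh
# Lazily unmount the dead NFS mount to clear the stale file handle;
# the path is the one from the report.
umount -l /rhev/data-center/mnt/sbr-virt-rhv-nested:_exports_rhviso01
```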