Bug 1324076 - SD metadata indicates that it is attached to a DC while it is not, preventing removal and format of the storage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.1
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-05 13:09 UTC by Nelly Credi
Modified: 2016-07-19 06:26 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 06:26:11 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: planning_ack+
rule-engine: devel_ack+
acanan: testing_ack+


Attachments
engine and hosts logs (1.68 MB, application/x-gzip), 2016-04-05 13:09 UTC, Nelly Credi
engine and vdsm logs (1.23 MB, application/x-gzip), 2016-04-27 10:40 UTC, Raz Tamir


Links
System ID Status Summary Last Updated
oVirt gerrit 59338 (master) MERGED core: DomainPoolMap race - domain in 'detaching' as attached 2016-06-19 14:04:18 UTC
oVirt gerrit 59440 (ovirt-engine-4.0) MERGED core: DomainPoolMap race - domain in 'detaching' as attached 2016-06-20 07:20:21 UTC

Description Nelly Credi 2016-04-05 13:09:11 UTC
Created attachment 1143836 [details]
engine and hosts logs

Description of problem:
The storage domain cannot be formatted because its metadata indicates that it is attached to a DC while it is not.


Version-Release number of selected component (if applicable):


How reproducible:
50%

Steps to Reproduce:
1. Assuming an environment with a DC, a cluster, a host, and multiple SDs (I never saw it happen on the master SD)
2. Move the SD to maintenance
3. Detach the SD
4. Remove the SD with the 'format' option checked (a scripted version of these steps is sketched below)
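
For reference, a minimal scripted sketch of the steps above using the ovirt-engine-sdk4 Python bindings. The engine URL, credentials, DC selection, host name, and the omitted status polling are assumptions for illustration, not details from this report:

    import ovirtsdk4 as sdk

    # Assumed engine endpoint and credentials.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,
    )
    system = connection.system_service()

    # Step 1: locate a non-master SD, e.g. nfs_2.
    sd = system.storage_domains_service().list(search='name=nfs_2')[0]
    dc = system.data_centers_service().list()[0]  # assumed single DC

    # Steps 2-3: deactivate (maintenance) and detach via the DC.
    attached = (system.data_centers_service()
                      .data_center_service(dc.id)
                      .storage_domains_service()
                      .storage_domain_service(sd.id))
    attached.deactivate()  # move the SD to maintenance
    # (polling for 'maintenance' status between steps omitted for brevity)
    attached.remove()      # detach the SD from the DC

    # Step 4: remove with format; this is the call that fails ~50% of
    # the time with the "metadata indicates that it is attached" error.
    system.storage_domains_service().storage_domain_service(sd.id).remove(
        format=True,
        host='host_name',  # assumed host name; formatting needs a host
    )
    connection.close()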

Actual results:
The following error is shown:
Error while executing action.... The storage domain metadata indicates that it is attached to a data center hence cannot be formatted.... 

Expected results:
It should be possible to remove the SD with the 'format' option.

Additional info:
It is usually resolved by reattaching the SD and repeating the removal steps.

It happened at around 15:30; the SD name was nfs_2.

Comment 1 Liron Aravot 2016-04-06 08:25:21 UTC
The detach operation failed because of a network error that caused the host to be detected as non-responsive. When the detach command was performed, the host (SPM) wasn't connected to the storage server, which led to a failure; the subsequent remove SD then failed because the domain hadn't been detached (as expected).

The test should verify that the domain was detached before attempting to remove it.
Closing as NOTABUG.

Comment 2 Nelly Credi 2016-04-06 08:39:18 UTC
Please look at the screenshot.
The SD is indicated as detached in the UI, so I'm assuming it is.
If the detach failed, the SD should not appear as detached; in that case the remove button would also have been disabled.
So if the problem here is that the SD is indicated as detached when it's not, that should be fixed.

Comment 3 Liron Aravot 2016-04-06 08:55:12 UTC
OK, I didn't get that the SD appears as detached.

The cause here is a race condition: because the host went non-responsive, a connectStoragePool() call with the domain map is sent to it as part of the InitVdsOnUp flow. The connect is performed during the detach operation, which fails; since the domain map that was sent doesn't contain the domain, the engine automatically detaches it.
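
To illustrate the race, a schematic sketch in Python (the actual engine code is Java; build_domain_pool_map and the status sets here are invented for illustration):

    # Schematic sketch of the race; not actual ovirt-engine code.
    domain_statuses = {'nfs_2': 'Detaching', 'master_sd': 'Active'}

    def build_domain_pool_map(statuses):
        # Hypothetical helper: only domains considered attached are
        # included, so a domain mid-detach ('Detaching') is left out.
        attached = {'Active', 'Maintenance', 'Inactive'}
        return {sd: st for sd, st in statuses.items() if st in attached}

    # Host goes non-responsive -> InitVdsOnUp sends connectStoragePool()
    # with this map while the detach is still in flight:
    pool_map = build_domain_pool_map(domain_statuses)  # {'master_sd': 'Active'}

    # The detach then fails, but since 'nfs_2' was missing from the map
    # the engine treats it as detached anyway: the UI shows it detached
    # while the SD metadata on storage still references the pool, so a
    # later "remove with format" is rejected. The merged fix (gerrit
    # 59338/59440) treats a domain in 'detaching' as attached when
    # building the map.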

Targeting to v4, as the chances of encountering it are very slim; the relevant code is very sensitive, and there's a solution.

thanks,
Liron.

Comment 4 Red Hat Bugzilla Rules Engine 2016-04-06 08:55:17 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 5 Tal Nisan 2016-04-06 08:57:08 UTC
Nelly, is it blocking the automation?
If not, I'll leave it in 4.1, as it seems like a corner case and has a workaround.

Comment 6 Nelly Credi 2016-04-06 12:51:25 UTC
This is failing the golden environment cleaner test quite often. It's not an automation blocker, but we may miss other bugs if the cleaner is not fully executed. Can we maybe target it to 4.0?

Comment 7 Yaniv Kaul 2016-04-26 10:32:42 UTC
Tal - please move back to 4.0 and have someone look at it promptly (based on comment 6 above - it fails GE).

Comment 8 Liron Aravot 2016-04-27 07:35:52 UTC
Nelly, as the race happens only when the host becomes non-responsive, can you try to see if there's something wrong with the env cleaning?

If the host didn't become non-responsive before/during the detach (which shouldn't occur often), we wouldn't encounter this bug.

thanks,
Liron.

Comment 9 Liron Aravot 2016-04-27 07:36:36 UTC
If the host doesn't usually become non-responsive during the clean phase, perhaps there is another relevant scenario as well, which should be handled in a separate bug.

Comment 10 Nelly Credi 2016-04-27 10:20:00 UTC
Well, I didn't see that the host became non-responsive, and I encountered this issue manually as well.
Maybe ratamir can add more info, as he saw it in his tests too.

Also keep in mind that the clean phase works well in 3.5, so I don't believe there is anything wrong with the test flow.

I don't know if it's related, but there was also a bug (closed as WONTFIX, if I remember correctly) where detaching a storage domain moves the other SDs to 'unknown' state and leaves the DC in a bad state as well, so maybe it affects the hosts too?

Comment 11 Raz Tamir 2016-04-27 10:40:20 UTC
Created attachment 1151286 [details]
engine and vdsm logs

I also see this issue from time to time. There is no specific flow I can think of that causes it, because we see it randomly in different test plans.
I'm attaching logs from today, when it happened during manual testing.

Comment 12 Allon Mureinik 2016-06-20 10:52:47 UTC
This is solved for the next 4.0.z milestone.

Comment 13 Raz Tamir 2016-07-03 19:01:33 UTC
Nelly,
Since we don't have specific steps to reproduce, I suggest we watch whether this reproduces in the next few days; if not, I will move it to VERIFIED.
Let me know if you see this issue again.

Thanks

Comment 14 Raz Tamir 2016-07-10 14:09:45 UTC
Verified on rhevm-4.0.2-0.2.rc1.el7ev.noarch

Comment 15 Nelly Credi 2016-07-11 11:17:04 UTC
As Raz noted, it looks good in the automation: we cleaned all SDs a few times and it worked well (it used to fail every time).

Comment 16 Allon Mureinik 2016-07-11 11:49:23 UTC
Excellent. Thanks Raz and Nelly.

Comment 17 Sandro Bonazzola 2016-07-19 06:26:11 UTC
Since the problem described in this bug report should be resolved in oVirt 4.0.1, released on July 19th 2016, it has been closed with a resolution of CURRENT RELEASE.

For information on the release, and how to update to this release, follow the link below.

If the solution does not work for you, open a new bug report.

http://www.ovirt.org/release/4.0.1/

