Created attachment 1143836 [details]
engine and hosts logs

Description of problem:
Storage cannot be formatted because the engine indicates it is attached to a DC while it is not.

Version-Release number of selected component (if applicable):

How reproducible:
50%

Steps to Reproduce:
1. Assuming an env with: DC, cluster, host & multiple SDs (I never saw it happening on the master SD)
2. Move the SD to maintenance
3. Detach the SD
4. Remove the SD (check the 'format' option)

Actual results:
Getting an error: "Error while executing action.... The storage domain metadata indicates that it is attached to a data center hence cannot be formatted...."

Expected results:
Should be able to remove with the format option.

Additional info:
It is usually resolved by reattaching the SD and repeating the removal steps. It happened at around 15:30; the SD name was nfs_2.
The detach operation failed because of a network error that caused the host to be detected as non-responsive. Therefore, when the detach command was performed, the host (SPM) wasn't connected to the storage server, which led to a failure; the later remove SD then failed because the domain wasn't detached (as expected). The test should verify that the domain was detached before attempting to remove. Closing as NOTABUG.
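The guard suggested above can be sketched as a small polling helper. This is only an illustration: `wait_for_detached` and the `get_sd_status` callback are hypothetical names, not part of the actual test suite or the oVirt REST API; a real test would back the callback with an engine API query.

```python
# Hedged sketch of the suggested test guard: before removing a storage
# domain, poll its status until it actually reports unattached.
# get_sd_status is a hypothetical callable standing in for an engine query.
import time

def wait_for_detached(get_sd_status, timeout=60.0, interval=0.01):
    """Return True once the SD reports 'unattached', False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_sd_status() == "unattached":
            return True
        time.sleep(interval)
    return False

# Canned status sequence simulating a detach that completes after 3 polls.
statuses = iter(["maintenance", "detaching", "unattached"])
assert wait_for_detached(lambda: next(statuses, "unattached"))
```

With such a guard in place, the remove step would time out loudly instead of racing a detach that silently failed.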
Please look at the screenshot: the SD is indicated as detached in the UI, so I'm assuming it is. If the detach failed, it should not appear as detached; also, in that case the remove button would have been disabled. So if the problem here is that the SD is indicated as detached when it's not, that should be fixed.
Ok, I didn't get that the SD appears as detached. The cause here is a race condition: because the host went non-responsive, a connectStoragePool() call with the domain map is sent to it as part of the InitVdsOnUp flow. The connect is performed during the detach operation, which then fails; since the sent domain map doesn't contain the domain, the engine automatically detaches it. Targeting to V4 as there are very slim chances of encountering it, the relevant code is very sensitive, and there's a solution. Thanks, Liron.
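To make the race described above concrete, here is a minimal toy model (hypothetical class and method names, not actual engine/VDSM code) of how a connectStoragePool() carrying a domain map that omits the detaching domain makes the host drop it without the metadata cleanup a real detach performs:

```python
# Toy model (hypothetical, not oVirt code) of the race: while the detach
# is in flight, InitVdsOnUp sends connectStoragePool() with a domain map
# that no longer lists the domain, so the host drops it locally and the
# on-disk metadata still says "attached to a data center".

class HostSim:
    """Simplified host-side view of the pool's domain map."""
    def __init__(self, domain_map):
        self.domain_map = dict(domain_map)

    def connect_storage_pool(self, engine_domain_map):
        # The host reconciles with the map the engine sent; domains
        # missing from it are silently dropped, skipping the metadata
        # wipe that a real detach would perform.
        dropped = set(self.domain_map) - set(engine_domain_map)
        self.domain_map = dict(engine_domain_map)
        return dropped

    def detach_domain(self, sd_id):
        # A proper detach clears the domain's pool reference in metadata.
        if sd_id not in self.domain_map:
            raise RuntimeError(f"{sd_id}: not attached (already dropped)")
        del self.domain_map[sd_id]
        return "metadata cleared"

host = HostSim({"master_sd": "active", "nfs_2": "active"})

# InitVdsOnUp wins the race: the engine's map already omits nfs_2,
# so the host drops it without clearing its metadata.
dropped = host.connect_storage_pool({"master_sd": "active"})
print(dropped)  # {'nfs_2'}

# The in-flight detach then fails, leaving metadata claiming attachment.
try:
    host.detach_domain("nfs_2")
except RuntimeError as e:
    print(e)  # nfs_2: not attached (already dropped)
```

This matches the observed symptom: the UI shows the domain as detached (it left the map), yet the later format/remove fails on stale metadata.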
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Nelly, is it blocking the automation? If not, I'll leave it in 4.1, as it seems like a corner case and has a workaround.
This is failing the golden environment cleaner test quite often. It's not an automation blocker, but we may miss other bugs if the cleaner is not fully executed. Can we maybe move it to 4.0?
Tal - please move back to 4.0 and have someone look at it promptly (based on comment 6 above - it fails GE).
Nelly, as the race happens only when the host becomes non-responsive, can you try and see if there's something wrong with the env cleaning? If the host doesn't become non-responsive before/during the detach (which shouldn't occur often), we wouldn't encounter this bug. Thanks, Liron.
If the host doesn't usually become non-responsive during the clean phase, perhaps there is another relevant scenario as well that should be handled in a separate bug.
Well, I didn't see that the host became non-responsive, and I encountered this issue manually as well. Maybe ratamir can add more info, as he saw it in his tests too. Also keep in mind that the clean phase works well in 3.5, so I don't believe there is anything wrong with the test flow. I don't know if it's related, but there was also a bug (closed as WONTFIX, if I remember correctly) where, when detaching a storage domain, the other SDs move to 'unknown' state and the DC ends up in a bad state as well, so maybe it affects the hosts too?
Created attachment 1151286 [details]
engine and vdsm logs

I also see this issue from time to time. There is no specific flow I can think of that causes it, because we see it randomly in different test plans. I'm attaching logs from today, where it happened manually.
This is solved for the next 4.0.z milestone.
Nelly, since we don't have specific steps to reproduce, I suggest we see whether this reproduces in the next few days; if not, I will move it to VERIFIED. Let me know if you see this issue again. Thanks
Verified on rhevm-4.0.2-0.2.rc1.el7ev.noarch
As Raz updated, it looks good in the automation - we cleaned all the SDs a few times and it worked well (it used to fail every time).
Excellent. Thanks Raz and Nelly.
Since the problem described in this bug report should be resolved in oVirt 4.0.1 released on July 19th 2016, it has been closed with a resolution of CURRENT RELEASE. For information on the release, and how to update to this release, follow the link below. If the solution does not work for you, open a new bug report. http://www.ovirt.org/release/4.0.1/