Bug 1254936

Summary: Deactivate storage domain sometimes fails with error "Storage domain does not exist"
Product: [oVirt] ovirt-engine Reporter: Raz Tamir <ratamir>
Component: General    Assignee: Liron Aravot <laravot>
Status: CLOSED DUPLICATE QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: ---    CC: amureini, bugs, gklein, laravot, lsurette, mlipchuk, ncredi, nsoffer, ratamir, rbalakri, srevivo, tnisan, ykaul, ylavi
Target Milestone: ovirt-3.6.6    Keywords: Automation, AutomationBlocker
Target Release: ---    Flags: amureini: ovirt-3.6.z?
amureini: ovirt-4.0.0?
rule-engine: blocker?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-27 11:46:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm and engine logs
none
engine and vdsm logs
none
automation full log
none
script for reproduce bug
none
manually reproduced - engine logs
none
manually reproduced - hosts logs1
none
manually reproduced - hosts logs2
none
manually reproduced - hosts logs3
none
manually reproduced - hosts logs4
none

Description Raz Tamir 2015-08-19 09:32:57 UTC
Created attachment 1064718 [details]
vdsm and engine logs

Description of problem:
Sometimes, when trying to deactivate a storage domain, the operation fails but no error message appears.

The storage domain status changes to 'PREPARING_FOR_MAINTENANCE' and then switches back to 'ACTIVE'.


Version-Release number of selected component (if applicable):
ovirt-engine-3.6.0-0.0.master.20150804111407.git122a3a0.el6.noarch
vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deactivate storage domain

Actual results:


Expected results:


Additional info:

Comment 1 Raz Tamir 2015-10-08 19:08:30 UTC
Update: 90% of the time, the affected domain is an ISO domain

Comment 2 Vered Volansky 2015-10-18 11:10:13 UTC
There doesn't seem to be any error message/exception or compensation in the engine log. The vdsm log is dated differently, which raises the question of whether it's in fact relevant. In any case, there is nothing helpful there either.
The flows for Data and ISO domains are different.
An ISO domain will never be master, and a Data domain can be either master or not.
Please reproduce each of the three options above with logs, since each has a different flow, and specify how to reproduce, since I can't.
If there is no error message/exception in the log, please explicitly specify the message that begins the flawed flow (assuming there are good flows in the same log).

Comment 3 Raz Tamir 2015-10-18 16:07:10 UTC
Hi Vered,
I intentionally wrote in the summary "fails without any warning" because I couldn't find anything useful either.

The flow: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-11576

Comment 4 Raz Tamir 2015-10-18 17:22:34 UTC
The flow:
1. Create a new ISO domain.
2. Upload iso file to it
3. Attach the iso to a VM
4. Detach the iso file from the VM
5. Detach the ISO domain from DC

Comment 5 Red Hat Bugzilla Rules Engine 2015-10-19 11:01:19 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 Maor 2015-10-25 14:29:47 UTC
I could not reproduce this bug on my env.
I talked with Raz about adding the full relevant logs once this bug is reproduced in the automated tests.
Meanwhile, I am adding a needinfo on Raz for when this is reproduced in the automated tests.

Comment 7 Raz Tamir 2015-10-27 17:05:43 UTC
Created attachment 1086929 [details]
engine and vdsm logs

This reproduced again, this time not on an ISO domain.
Attaching logs.

Comment 8 Maor 2015-10-28 09:15:01 UTC
What is the id/name of the Storage Domain?

Comment 9 Raz Tamir 2015-10-28 09:48:24 UTC
The id: fd75a6b4-f107-4a29-9438-6230223c5a5f . Storage domain name: iscsi_0

I'm attaching our full automation run log so you will be able to see that at '2015-10-27 18:58:10,540' - Status of element iscsi_0 is 'LOCKED' expected statuses are ['MAINTENANCE']

and at '2015-10-27 18:58:30,822' - Status of element iscsi_0 is 'ACTIVE' expected statuses are ['MAINTENANCE']

Comment 10 Raz Tamir 2015-10-28 09:49:05 UTC
Created attachment 1087192 [details]
automation full log

Comment 11 Yaniv Lavi 2015-10-29 12:30:24 UTC
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 12 Raz Tamir 2015-11-30 08:52:17 UTC
Created attachment 1100389 [details]
script for reproduce bug

Comment 13 Raz Tamir 2015-11-30 08:59:45 UTC
This still reproduces and causes failures in our automation.
I'm attaching a script to make it easier to reproduce.

Comment 14 Maor 2015-11-30 13:31:11 UTC
Thanks for the script,

I will run this on my env and hopefully reproduce it.

Comment 15 Raz Tamir 2015-12-06 09:21:23 UTC
Entire features are blocked due to this bug.
Marking this as an AutomationBlocker.

Comment 16 Yaniv Lavi 2015-12-06 14:08:43 UTC
Any updates?

Comment 17 Maor 2015-12-06 19:04:01 UTC
no relevant update yet.
I've managed to reproduce this, currently looks like a race issue, need more time to investigate.

Comment 18 Maor 2015-12-06 21:50:55 UTC
(In reply to Maor from comment #17)
> no relevant update yet.
> I've managed to reproduce this, currently looks like a race issue, need more
> time to investigate.

After digging into it a bit more, it looks like the Python script uses subprocess.Popen, which spawns a new process; the problem is that the activate operation has not finished yet when we already call deactivate. That is why the script fails.

If this script reflects the automated tests, then this might be the problem: if the script spawns processes, the automated test should check the status of the storage domain and only deactivate the Storage Domain again once it is active.

Raz, can you confirm this is also how the automated test behaves?
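
To illustrate the race described above, here is a minimal sketch in Python; the command names and the query_sd_status() helper are hypothetical stand-ins, not code taken from the attached script:

import subprocess
import time

# Racy pattern: Popen returns immediately, so the second command can be issued
# while the activate operation triggered by the first is still in progress.
subprocess.Popen(["activate_sd", "iscsi_0"])      # hypothetical command, returns at once
subprocess.Popen(["deactivate_sd", "iscsi_0"])    # may run before activation has finished

# Safer pattern: block on the first command, then poll the domain status and
# only deactivate once the domain reports 'active' again.
subprocess.check_call(["activate_sd", "iscsi_0"])
while query_sd_status("iscsi_0") != "active":     # query_sd_status() is a hypothetical helper
    time.sleep(5)
subprocess.check_call(["deactivate_sd", "iscsi_0"])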

Comment 19 Raz Tamir 2015-12-07 09:32:46 UTC
Maor,
This is not how our automation works.
In general, every time we deactivate a storage domain, we first wait for any running async tasks to finish.

Comment 20 Maor 2015-12-07 09:41:47 UTC
So that is the problem. If the script uses spawned processes, add a while loop that waits until the Storage Domain is active again; the same should be done when deactivating a Storage Domain.

Comment 21 Raz Tamir 2015-12-07 09:53:26 UTC
It seems this wasn't clear, so to restate: every time we deactivate a storage domain, we first wait for any running async tasks to finish and for the storage domain status to be active.

Comment 22 Red Hat Bugzilla Rules Engine 2015-12-08 11:59:24 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 23 Aharon Canan 2015-12-08 15:20:26 UTC
Although it probably won't affect customers the way it affects us, this is blocking many of our automation cases and we need a fix for it.

Also, it seems like Maor saw the problem, so I would prefer to get a fix for this bug ASAP.

Comment 24 Allon Mureinik 2015-12-08 16:23:55 UTC
Guys, this is not a regression - this is the same behavior RHEV has had for several versions.

The automation scenario is simply wrong, as I already advised via email:

"""
OK, I think this is the issue, then.

Deactivating a domain, while it may take a while, is not an async task (similar, e.g., to running a VM).

Instead of polling for async-tasks, could you just poll the domain until it changes to the right status (with a timeout, probably, if it isn't done in, let's say, 3 minutes)?
"""

Comment 25 Raz Tamir 2015-12-08 17:56:04 UTC
Hi Allon,
I replied in the email:
"I wait for any async tasks before I deactivate the storage domain, to make sure there is no background operation running, for example uploading an OVF file.
After I'm sure the domain is active and no async tasks are running, I deactivate the storage domain and wait until its status changes to maintenance.
When doing so, the status changes to 'preparing_for_maintenance' and after a few seconds changes back to 'active' instead of 'maintenance', and this is what makes our test fail, because we expect 'maintenance'."

After digging a bit with Maor, we discovered:
From engine.log:
2015-12-06 11:13:49,555 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-12) [] domain '7aedf01a-60f1-462d-9cbf-9fe905f92862:test_4818_default' in problem. vds: 'host_mixed_2'

From vdsm.log:
Thread-1980::ERROR::2015-12-06 11:13:42,969::sdc::145::Storage.StorageDomainCache::(_findDomain) domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)
Thread-1980::ERROR::2015-12-06 11:13:42,970::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 7aedf01a-60f1-462d-9cbf-9fe905f92862
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 767, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)
Thread-1980::INFO::2015-12-06 11:13:42,971::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 became INVALID

Comment 26 Nelly Credi 2015-12-10 09:35:14 UTC
Created attachment 1104276 [details]
manually reproduced - engine logs

I saw this issue manually a few times.
On the 2nd or 3rd try the SD finally moved to maintenance.
At this point I detached it, and when I tried to remove it, I got an error saying that this storage is connected to a DC.
This was resolved by attaching the SD (approving that it is OK to reconnect it, because it is connected to another DC), detaching again, and removing it.

Attached is the engine log.
I will also attach the host logs.

Comment 27 Nelly Credi 2015-12-10 09:37:29 UTC
Created attachment 1104277 [details]
manually reproduced - hosts logs1

Comment 28 Nelly Credi 2015-12-10 09:38:59 UTC
Created attachment 1104279 [details]
manually reproduced - hosts logs2

Comment 29 Nelly Credi 2015-12-10 09:44:00 UTC
Created attachment 1104281 [details]
manually reproduced - hosts logs3

Comment 30 Nelly Credi 2015-12-10 09:45:34 UTC
Created attachment 1104285 [details]
manually reproduced - hosts logs4

Comment 31 Maor 2015-12-28 09:35:09 UTC
After digging into the VDSM code, it looks like a race in the monitoring process.
When deactivating a Storage Domain, the storage domain should be unmounted and the monitor on it should be stopped.
It looks like the stop-monitoring method is used as a SamplingMethod, and the monitoring process might start monitoring the Storage Domain after it is already unmounted but before the monitor process stops.

We can try to fix the issue in VDSM, or maybe consider doing a workaround in the engine's monitoring process.
Either way, the fix might be a nasty one.
Any suggestions?
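
As a simplified, self-contained illustration of that race (not vdsm code): a monitoring thread keeps checking a "domain" directory while the deactivation path removes it before telling the monitor to stop, so the monitor can still observe a missing domain in between:

import os
import tempfile
import threading
import time

domain_path = tempfile.mkdtemp(prefix="fake_sd_")   # stands in for the mounted domain
stop_monitor = threading.Event()

def monitor_loop():
    # Keeps checking the domain until told to stop, like the domain monitor.
    while not stop_monitor.is_set():
        if not os.path.isdir(domain_path):
            print("monitor: storage domain does not exist -> domain reported INVALID")
        time.sleep(0.1)

monitor = threading.Thread(target=monitor_loop)
monitor.start()

# Deactivation path: the "umount" (here: removing the directory) happens before
# the monitor is stopped, leaving a window in which the monitor sees no domain.
os.rmdir(domain_path)
time.sleep(0.3)
stop_monitor.set()
monitor.join()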

Comment 32 Allon Mureinik 2016-01-04 10:25:49 UTC
(In reply to Maor from comment #31)
> After digging into the VDSM code, it looks like a race in the monitoring
> process.
> When deactivating a Storage Domain the storage domain should be umounted and
> the monitor on it should be stopped.
> It looks like the stop monitoring method is used as a SamplingMethod and the
> monitoring process might start monitor Storage Domain after it is already
> umounted but before the monitor process stops.
> 
> We can try to fix the issue in VDSM or maybe consider to do a workaround in
> the engine at the monitoring process.
> either way the fix might be a nasty one.
> Any suggestions?
This should be fixed in VDSM. Having a valid deactivate fail arbitrarily is not something we're willing to live with.

Comment 33 Yaniv Kaul 2016-04-26 10:28:10 UTC
Raising severity to High, as it affects QE automation.

Comment 34 Allon Mureinik 2016-04-26 10:58:06 UTC
Liron, isn't this a subset of a fix you're already working on?

Comment 35 Liron Aravot 2016-04-27 11:46:57 UTC
This bug is a duplicate of bug 1311121, marking as such.

*** This bug has been marked as a duplicate of bug 1311121 ***