Bug 1254936 - Deactivate storage domain sometimes fails with error "Storage domain does not exist"
Status: CLOSED DUPLICATE of bug 1311121
Product: ovirt-engine
Classification: oVirt
Component: General
---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.6
Target Release: ---
Assigned To: Liron Aravot
QA Contact: Aharon Canan
Keywords: Automation, AutomationBlocker
Depends On:
Blocks:
Reported: 2015-08-19 05:32 EDT by Raz Tamir
Modified: 2016-04-27 07:53 EDT
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-27 07:46:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amureini: ovirt-3.6.z?
amureini: ovirt-4.0.0?
rule-engine: blocker?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?


Attachments
vdsm and engine logs (1.38 MB, application/x-gzip)
2015-08-19 05:32 EDT, Raz Tamir
no flags Details
engine and vdsm logs (668.28 KB, application/x-bzip)
2015-10-27 13:05 EDT, Raz Tamir
no flags Details
automation full log (1.81 MB, text/plain)
2015-10-28 05:49 EDT, Raz Tamir
no flags Details
script for reproduce bug (1.78 KB, text/x-python)
2015-11-30 03:52 EST, Raz Tamir
no flags Details
manually reproduced - engine logs (11.81 MB, application/x-gzip)
2015-12-10 04:35 EST, Nelly Credi
no flags Details
manually reproduced - hosts logs1 (15.23 MB, application/x-xz)
2015-12-10 04:37 EST, Nelly Credi
no flags Details
manually reproduced - hosts logs2 (10.89 MB, application/x-xz)
2015-12-10 04:38 EST, Nelly Credi
no flags Details
manually reproduced - hosts logs3 (13.57 MB, application/x-xz)
2015-12-10 04:44 EST, Nelly Credi
no flags Details
manually reproduced - hosts logs4 (14.52 MB, application/x-xz)
2015-12-10 04:45 EST, Nelly Credi
no flags Details

Description Raz Tamir 2015-08-19 05:32:57 EDT
Created attachment 1064718 [details]
vdsm and engine logs

Description of problem:
Sometimes when trying to deactivate a storage domain, the operation fails but no message appears.

The storage domain status changes to 'PREPARING_FOR_MAINTENANCE' and then switches back to 'ACTIVE'.


Version-Release number of selected component (if applicable):
ovirt-engine-3.6.0-0.0.master.20150804111407.git122a3a0.el6.noarch
vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deactivate storage domain
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 Raz Tamir 2015-10-08 15:08:30 EDT
Update: 90% of the time, the domain type is an ISO domain
Comment 2 Vered Volansky 2015-10-18 07:10:13 EDT
There doesn't seem to be any error message/exception or compensation in the engine log. The vdsm log is dated differently, which raises the question of whether it is in fact relevant. In any case, there is nothing helpful there either.
The flows for Data and ISO domains are different.
ISO will never be master and Data can be either master or not.
Please reproduce each of the three options above with logs, since each has a different flow, and specify how to reproduce, since I can't.
If there is no error message/exception in the log, please explicitly specify the message that begins the flawed flow (assuming there are also good flows in the same log).
Comment 3 Raz Tamir 2015-10-18 12:07:10 EDT
Hi Vered,
I intentionally wrote in the summary "fails without any warning" because I couldn't find anything useful either.

The flow: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-11576
Comment 4 Raz Tamir 2015-10-18 13:22:34 EDT
The flow:
1. Create a new ISO domain.
2. Upload iso file to it
3. Attach the iso to a VM
4. Detach the iso file from the VM
5. Detach the ISO domain from DC
Comment 5 Red Hat Bugzilla Rules Engine 2015-10-19 07:01:19 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 6 Maor 2015-10-25 10:29:47 EDT
This bug could not be reproduced on my env.
Talked with Raz, who will add the full relevant logs once this bug is reproduced again in the automatic tests.
Meanwhile, adding a needinfo on Raz for when this is reproduced with the automatic tests.
Comment 7 Raz Tamir 2015-10-27 13:05 EDT
Created attachment 1086929 [details]
engine and vdsm logs

This reproduced again, this time not on an ISO domain.
Attaching logs.
Comment 8 Maor 2015-10-28 05:15:01 EDT
What is the id/name of the Storage Domain?
Comment 9 Raz Tamir 2015-10-28 05:48:24 EDT
The id: fd75a6b4-f107-4a29-9438-6230223c5a5f . Storage domain name: iscsi_0

I'm attaching our automation full run log so you will be able to see that at '2015-10-27 18:58:10,540' - Status of element iscsi_0 is 'LOCKED' expected statuses are ['MAINTENANCE']

and at '2015-10-27 18:58:30,822' - Status of element iscsi_0 is 'ACTIVE' expected statuses are ['MAINTENANCE']
Comment 10 Raz Tamir 2015-10-28 05:49 EDT
Created attachment 1087192 [details]
automation full log
Comment 11 Yaniv Lavi (Dary) 2015-10-29 08:30:24 EDT
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Comment 12 Raz Tamir 2015-11-30 03:52 EST
Created attachment 1100389 [details]
script for reproduce bug
Comment 13 Raz Tamir 2015-11-30 03:59:45 EST
This still reproduces and causes failures in our automation.
I'm attaching a script to make it easier to reproduce.
Comment 14 Maor 2015-11-30 08:31:11 EST
Thanks for the script,

I will try to run this on my env and hopefully reproduce it.
Comment 15 Raz Tamir 2015-12-06 04:21:23 EST
Entire features are blocked due to this bug.
Marking this as an AutomationBlocker.
Comment 16 Yaniv Lavi (Dary) 2015-12-06 09:08:43 EST
Any updates?
Comment 17 Maor 2015-12-06 14:04:01 EST
no relevant update yet.
I've managed to reproduce this, currently looks like a race issue, need more time to investigate.
Comment 18 Maor 2015-12-06 16:50:55 EST
(In reply to Maor from comment #17)
> no relevant update yet.
> I've managed to reproduce this, currently looks like a race issue, need more
> time to investigate.

After digging into it a bit more, it looks like the Python script uses subprocess.Popen, which spawns a new process; the problem is that the activate operation has not finished yet and we already call deactivate. That is why the script fails.

If this script reflects the automatic tests, then this might be the problem: if the script spawns processes, the automatic test should check the status of the storage domain and only deactivate it again once it is active.

Raz, can you confirm this is also how the automatic test behaves?
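
(Illustrative sketch only; this is not the attached script or the automation code, and the status getter is a hypothetical stand-in.) The pre-check described here, waiting for the domain to be active again before calling deactivate, amounts to something like:

import time

def wait_until_active(get_status, timeout=180, interval=5):
    # Poll a caller-supplied status getter (hypothetical) until the
    # storage domain reports 'active', or give up after the timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status().lower() == 'active':
            return True
        time.sleep(interval)
    return False

# Usage with hypothetical helpers:
#   if wait_until_active(lambda: get_domain_status(sd_id)):
#       deactivate_domain(sd_id)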
Comment 19 Raz Tamir 2015-12-07 04:32:46 EST
Maor,
This is not how our automation works.
In general, every time we deactivate a storage domain, we first wait for any running async tasks to finish.
Comment 20 Maor 2015-12-07 04:41:47 EST
So that is the problem: if the script uses spawned processes, add a while loop that waits until the Storage Domain is active again; the same should be done when deactivating a Storage Domain.
Comment 21 Raz Tamir 2015-12-07 04:53:26 EST
It seems that wasn't clear, so: every time we deactivate a storage domain, we first wait for any running async tasks to finish and for the storage domain status to be active.
Comment 22 Red Hat Bugzilla Rules Engine 2015-12-08 06:59:24 EST
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Comment 23 Aharon Canan 2015-12-08 10:20:26 EST
Although it probably won't affect customers the way it affects us, this is blocking many of our automation cases and we need a fix for it.

Also, it seems Maor saw the problem, so I would prefer to get a fix for this bug ASAP.
Comment 24 Allon Mureinik 2015-12-08 11:23:55 EST
Guys, this is not a regression - this is the same behavior RHEV has had for several versions.

The automation scenario is simply wrong, as I already advised via email:

"""
OK, I think this is the issue, then.

Deactivating a domain, while it may take a while, is not an async task (similar, e.g., to running a VM).

Instead of polling for async-tasks, could you just poll the domain until it changes to the right status (with a timeout, probably, if it isn't done in, let's say, 3 minutes)?
"""
Comment 25 Raz Tamir 2015-12-08 12:56:04 EST
Hi Allon,
I replied in the mail:
"I wait for any async tasks to finish before I deactivate the storage domain, to make sure there is no background operation running, for example uploading an OVF file.
After I'm sure the domain is active and no async tasks are running, I deactivate the storage domain and wait until its status changes to maintenance.
When doing so, the status changes to 'preparing_for_maintenance' and after a few seconds changes back to 'active' instead of 'maintenance', and this is what makes our test fail, because we expect 'maintenance'."

After digging a bit with Maor, we discovered:
From engine.log:
2015-12-06 11:13:49,555 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-12) [] domain '7aedf01a-60f1-462d-9cbf-9fe905f92862:test_4818_default' in problem. vds: 'host_mixed_2'

From vdsm.log:
Thread-1980::ERROR::2015-12-06 11:13:42,969::sdc::145::Storage.StorageDomainCache::(_findDomain) domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)
Thread-1980::ERROR::2015-12-06 11:13:42,970::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 7aedf01a-60f1-462d-9cbf-9fe905f92862
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 767, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)
Thread-1980::INFO::2015-12-06 11:13:42,971::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 became INVALID
Comment 26 Nelly Credi 2015-12-10 04:35 EST
Created attachment 1104276 [details]
manually reproduced - engine logs

I saw this issue manually a few times.
On the 2nd or 3rd try the SD finally moved to maintenance.
At that point I detached it, and when I tried to remove it, I got an error saying that this storage is connected to a DC.
This was resolved by attaching the SD again (approving that it's OK to reconnect it because it's connected to another DC), detaching it again, and removing it.

Attached is the engine log.
I will also attach the hosts' logs.
Comment 27 Nelly Credi 2015-12-10 04:37 EST
Created attachment 1104277 [details]
manually reproduced - hosts logs1
Comment 28 Nelly Credi 2015-12-10 04:38 EST
Created attachment 1104279 [details]
manually reproduced - hosts logs2
Comment 29 Nelly Credi 2015-12-10 04:44 EST
Created attachment 1104281 [details]
manually reproduced - hosts logs3
Comment 30 Nelly Credi 2015-12-10 04:45 EST
Created attachment 1104285 [details]
manually reproduced - hosts logs4
Comment 31 Maor 2015-12-28 04:35:09 EST
After digging into the VDSM code, it looks like a race in the monitoring process.
When deactivating a Storage Domain, the storage domain should be unmounted and the monitor on it should be stopped.
It looks like the stop-monitoring method is used as a SamplingMethod, and the monitoring process might start monitoring the Storage Domain after it is already unmounted but before the monitor process stops.

We can try to fix the issue in VDSM, or maybe consider a workaround in the engine's monitoring process.
Either way the fix might be a nasty one.
Any suggestions?
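
(Toy illustration of this class of race, not actual VDSM code.) A sampling-style monitor can observe the "domain does not exist" state in the window between the teardown of the domain and the moment it notices the stop request:

import os
import shutil
import tempfile
import threading
import time

def monitor(path, stop_event):
    # Simplified stand-in for the domain monitor: keep sampling the path
    # until asked to stop.
    while not stop_event.is_set():
        if not os.path.exists(path):
            print("monitor: domain path %s does not exist" % path)  # the race window
        time.sleep(0.1)

domain_path = tempfile.mkdtemp()
stop = threading.Event()
t = threading.Thread(target=monitor, args=(domain_path, stop))
t.start()

# Deactivation analogue: the "domain" is torn down first, the monitor is stopped second.
shutil.rmtree(domain_path)   # the path is gone...
time.sleep(0.3)              # ...but the monitor may still sample it before stop is seen
stop.set()
t.join()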
Comment 32 Allon Mureinik 2016-01-04 05:25:49 EST
(In reply to Maor from comment #31)
> After digging into the VDSM code, it looks like a race in the monitoring
> process.
> When deactivating a Storage Domain, the storage domain should be unmounted
> and the monitor on it should be stopped.
> It looks like the stop-monitoring method is used as a SamplingMethod, and
> the monitoring process might start monitoring the Storage Domain after it
> is already unmounted but before the monitor process stops.
> 
> We can try to fix the issue in VDSM, or maybe consider a workaround in the
> engine's monitoring process.
> Either way the fix might be a nasty one.
> Any suggestions?
This should be fixed in VDSM. Having a valid deactivate fail arbitrarily is not something we're willing to live with.
Comment 33 Yaniv Kaul 2016-04-26 06:28:10 EDT
Raising severity to High, as it affects QE automation.
Comment 34 Allon Mureinik 2016-04-26 06:58:06 EDT
Liron, isn't this a subset of a fix you're already working on?
Comment 35 Liron Aravot 2016-04-27 07:46:57 EDT
This bug is a duplicate of bug 1311121, marking as such.

*** This bug has been marked as a duplicate of bug 1311121 ***
