Bug 1254936
Summary: Deactivate storage domain sometimes fails with error "Storage domain does not exist"

Product: [oVirt] ovirt-engine
Component: General
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Version: ---
Target Milestone: ovirt-3.6.6
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Raz Tamir <ratamir>
Assignee: Liron Aravot <laravot>
QA Contact: Aharon Canan <acanan>
CC: amureini, bugs, gklein, laravot, lsurette, mlipchuk, ncredi, nsoffer, ratamir, rbalakri, srevivo, tnisan, ykaul, ylavi
Keywords: Automation, AutomationBlocker
Flags: amureini: ovirt-3.6.z?, amureini: ovirt-4.0.0?, rule-engine: blocker?, ylavi: planning_ack?, ylavi: devel_ack?, ylavi: testing_ack?
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2016-04-27 11:46:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Update: 90% of the time, the domain type is an ISO domain.

There doesn't seem to be any error message/exception or compensation in the engine log. The vdsm log is dated differently, which raises the question of whether it is in fact relevant; in any case, nothing helpful there either.

The flows for Data and ISO domains are different: an ISO domain will never be master, while a Data domain can be either master or not. Please reproduce with logs for the three options above, since each has a different flow, and specify how to reproduce, since I can't. If there is no error message/exception in the log, please specify explicitly the message beginning the flawed flow (assuming there are good flows in the same log).

Hi Vered,
I intentionally wrote in the summary "fails without any warning" because I couldn't find anything useful either.
The flow: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-11576
The flow:
1. Create a new ISO domain.
2. Upload an ISO file to it.
3. Attach the ISO to a VM.
4. Detach the ISO file from the VM.
5. Detach the ISO domain from the DC.

Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

This bug could not be reproduced on my env. I talked with Raz about adding the full relevant logs once this bug is reproduced in the automatic tests. Meanwhile, adding a needinfo on Raz for when this is reproduced by the automatic tests.

Created attachment 1086929 [details]
engine and vdsm logs
This reproduced again, this time not on an ISO domain.
Attaching logs.
What is the id/name of the Storage Domain?

The id: fd75a6b4-f107-4a29-9438-6230223c5a5f. Storage domain name: iscsi_0.
I'm attaching our automation full run log so you will be able to see that at '2015-10-27 18:58:10,540' the status of element iscsi_0 is 'LOCKED' while the expected statuses are ['MAINTENANCE'], and at '2015-10-27 18:58:30,822' the status of element iscsi_0 is 'ACTIVE' while the expected statuses are ['MAINTENANCE'].

Created attachment 1087192 [details]
automation full log
In oVirt, testing is done on a single release by default; therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Created attachment 1100389 [details]
script to reproduce the bug
Still reproduced, and causing failures in our automation. I'm attaching a script to make it easier to reproduce.

Thanks for the script; I will try to run this on my env and hopefully reproduce this.

Complete features are blocked due to this bug. Marking this as AutomationBlocker.

Any updates?

No relevant update yet. I've managed to reproduce this; it currently looks like a race issue, and I need more time to investigate.

(In reply to Maor from comment #17)
> no relevant update yet.
> I've managed to reproduce this, currently looks like a race issue, need more
> time to investigate.

After digging into it a bit more, it looks like the python script uses subprocess.Popen, which spawns a new process; the problem is that the activate operation has not finished yet when we already call deactivate. That is why the script fails. If this script reflects the automatic tests, then this might be the problem: if the script uses spawned processes, the automatic test should check the status of the storage domain, and only once it is active should the engine deactivate the Storage Domain again.
Raz, can you confirm this is also how the automatic test behaves?

Maor,
This is not how our automation works. In general, every time we deactivate a storage domain, we first wait for any running async tasks.

So that is the problem: if the script uses spawned processes, add a while loop until the Storage Domain is active again, the same way it is done when deactivating a Storage Domain.

It wasn't clear, so to restate: every time we deactivate a storage domain, we first wait for any running async tasks and for the storage domain status to be active.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Although it probably won't affect customers the way it affects us, this is blocking many of our automation cases and we need a fix for it.
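The "wait until the domain is active again" loop discussed above can be sketched as a small polling helper. This is a minimal sketch, not oVirt SDK code: `get_status` stands in for whatever REST/SDK lookup the automation performs, and is injected so the helper can be exercised without an engine.

```python
import time

def wait_for_status(get_status, expected, timeout=180, interval=2,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `expected` or `timeout` elapses.

    `get_status` is a caller-supplied callable (e.g. a storage-domain
    status lookup); `sleep` and `clock` are injectable for testing.
    Returns True if the expected status was seen, False on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if get_status() == expected:
            return True
        sleep(interval)
    return False

# Example with a fake status source that becomes 'active' on the third poll:
statuses = iter(['locked', 'locked', 'active'])
ok = wait_for_status(lambda: next(statuses), 'active',
                     timeout=10, interval=0, sleep=lambda s: None)
```

Only after such a wait returns True would the automation issue the deactivate call; on False it should fail the test with the last observed status rather than proceed.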
Also, it seems Maor saw the problem, so I'd prefer to get a fix for this bug ASAP.

Guys, this is not a regression - this is the same behavior RHEV has had for several versions. The automation scenario is simply wrong, as I already advised via email:
"""
OK, I think this is the issue, then. Deactivating a domain, while it may take a while, is not an async task (similar, e.g., to running a VM). Instead of polling for async-tasks, could you just poll the domain until it changes to the right status (with a timeout, probably, if it isn't done in, let's say, 3 minutes)?
"""

Hi Allon,
I replied in the mail:
"I wait for any async tasks before I deactivate the storage domain, to make sure there is no background operation running, for example uploading an OVF file. After I'm sure the domain is active and no async tasks are running, I deactivate the storage domain and wait until its status changes to maintenance. When doing so, the status changes to 'preparing_for_maintenance' and after a few seconds changes back to 'active' instead of 'maintenance', and this is what makes our test fail, because we expect 'maintenance'."

After digging a bit with Maor, we discovered the following.

From engine.log:

2015-12-06 11:13:49,555 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-12) [] domain '7aedf01a-60f1-462d-9cbf-9fe905f92862:test_4818_default' in problem.
vds: 'host_mixed_2'

From vdsm.log:

Thread-1980::ERROR::2015-12-06 11:13:42,969::sdc::145::Storage.StorageDomainCache::(_findDomain) domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)

Thread-1980::ERROR::2015-12-06 11:13:42,970::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 7aedf01a-60f1-462d-9cbf-9fe905f92862
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 767, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'7aedf01a-60f1-462d-9cbf-9fe905f92862',)

Thread-1980::INFO::2015-12-06 11:13:42,971::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain 7aedf01a-60f1-462d-9cbf-9fe905f92862 became INVALID

Created attachment 1104276 [details]
manually reproduced - engine logs
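The findDomainPath frames in the traceback above boil down to a path lookup that raises when the mount is gone, which is exactly what the monitor hits when it polls after the unmount. A simplified stand-in (not the actual VDSM implementation; the real code globs the mount tree for the UUID):

```python
import os
import tempfile

class StorageDomainDoesNotExist(Exception):
    pass

def find_domain_path(mount_root, sd_uuid):
    """Simplified stand-in for nfsSD.findDomainPath: resolve the domain
    directory under the mount root, raising when it is not there."""
    path = os.path.join(mount_root, sd_uuid)
    if not os.path.isdir(path):
        raise StorageDomainDoesNotExist(sd_uuid)
    return path

root = tempfile.mkdtemp()
sd = "7aedf01a-60f1-462d-9cbf-9fe905f92862"
os.mkdir(os.path.join(root, sd))
p = find_domain_path(root, sd)        # succeeds while "mounted"
os.rmdir(os.path.join(root, sd))      # simulate the unmount
try:
    find_domain_path(root, sd)
    raised = False
except StorageDomainDoesNotExist:
    raised = True
```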
I saw this issue manually a few times.
On the 2nd or 3rd try the SD finally moved to maintenance.
At that point I detached it, and when I tried to remove it, I got an error saying that this storage is connected to a DC.
This got resolved by attaching the SD again (approving that it's ok to reconnect it, because it's connected to another DC), then detaching again and removing.
Attached is the engine log; I will also attach the host logs.
Created attachment 1104277 [details]
manually reproduced - hosts logs1
Created attachment 1104279 [details]
manually reproduced - hosts logs2
Created attachment 1104281 [details]
manually reproduced - hosts logs3
Created attachment 1104285 [details]
manually reproduced - hosts logs4
After digging into the VDSM code, it looks like a race in the monitoring process. When deactivating a Storage Domain, the storage domain should be unmounted and the monitor on it should be stopped. It looks like the stop-monitoring method is used as a SamplingMethod, and the monitoring process might start monitoring a Storage Domain after it is already unmounted but before the monitor process stops.

We can try to fix the issue in VDSM, or maybe consider a workaround in the engine's monitoring process. Either way the fix might be a nasty one. Any suggestions?

(In reply to Maor from comment #31)
> After digging into the VDSM code, it looks like a race in the monitoring
> process.
> When deactivating a Storage Domain the storage domain should be umounted and
> the monitor on it should be stopped.
> It looks like the stop monitoring method is used as a SamplingMethod and the
> monitoring process might start monitor Storage Domain after it is already
> umounted but before the monitor process stops.
>
> We can try to fix the issue in VDSM or maybe consider to do a workaround in
> the engine at the monitoring process.
> either way the fix might be a nasty one.
> Any suggestions?

This should be fixed in VDSM. Having a valid deactivate fail arbitrarily is not something we're willing to live with.

Raising severity to High, as it affects QE automation. Liron, isn't this a subset of a fix you're already working on?

This bug is a duplicate of bug 1311121; marking as such.

*** This bug has been marked as a duplicate of bug 1311121 ***
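To make the race concrete, here is a toy model (not VDSM code; all names are illustrative) of the ordering that closes the window: the monitor thread is signalled and joined before the domain is "unmounted", so no late poll can touch the vanished path. The race described above is the reverse ordering, where the unmount happens while a poll may still be in flight.

```python
import threading
import time

class DomainMonitor:
    """Toy monitor: polls a domain via `check` until stopped."""

    def __init__(self, check):
        self._check = check                # callable poking the domain path
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._check()
            self._stop.wait(0.01)          # monitoring interval

    def stop(self):
        self._stop.set()
        self._thread.join()                # returns only once no poll can run

mounted = True
errors = []

def check():
    if not mounted:
        errors.append("StorageDomainDoesNotExist")

monitor = DomainMonitor(check)
monitor.start()
time.sleep(0.05)
monitor.stop()     # stop and join the monitor first...
mounted = False    # ...only then is it safe to "unmount"
```

If the last two lines were swapped, a poll could observe `mounted == False`, which is the toy analogue of the StorageDomainDoesNotExist traceback in the logs above.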
Created attachment 1064718 [details]
vdsm and engine logs

Description of problem:
Sometimes when trying to deactivate a storage domain, the operation fails but no message appears. The storage domain status turns to 'PREPARING_FOR_MAINTENANCE' and then switches back to 'ACTIVE'.

Version-Release number of selected component (if applicable):
ovirt-engine-3.6.0-0.0.master.20150804111407.git122a3a0.el6.noarch
vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Deactivate a storage domain

Actual results:

Expected results:

Additional info: