Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 875870

Summary: engine: remove of master storage domain after destroy of pool fails (threads interleaving - disconnectStoragePool and formatStorageDomain collides)
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engineAssignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.0CC: dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, thildred, yeylon, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: SF6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2012-11-12 17:53:09 UTC
Created attachment 643649 [details]
logs

Description of problem:

I removed DC and once the domain became unattached I removed it. 
we fail on format domain from vdsm on domain still connected to pool. 
it seems that there are interliving threads for formatStroageDomain, disconnectStroagePool and disconnectStorageServer. 

Version-Release number of selected component (if applicable):

si24.1
vdsm-4.9.6-42.0.el6_3.x86_64


How reproducible:


Steps to Reproduce:
1. have a DC with several domains -> deactivate, detach and remove the domains
2. put master domain in maintenance -> remove DC
3. once master domain is unattached -> remove it 
  
Actual results:

we fail on format storage with error from vdsm: annot format storage domain in connected pool

Expected results:

since pool is not really connected and it seems that engine is sending interliving threads we should wait for disconnectStroagePool to finish before sending format

Additional info: logs attached

hread-1926::ERROR::2012-11-12 17:39:55,143::task::853::TaskManager.Task::(_setError) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2355, in formatStorageDomain
    raise se.CannotFormatStorageDomainInConnectedPool(sdUUID)
CannotFormatStorageDomainInConnectedPool: Cannot format storage domain in connected pool: ('02ee6a99-54e6-42fd-8199-9b14c127fe96',)
Thread-1926::DEBUG::2012-11-12 17:39:55,143::task::872::TaskManager.Task::(_run) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::Task._run: 1609469b-bf72-47c9-8982-e52e71dd3366 ('02ee6a99-54e6-42fd-8199-9b14c127fe96', False) {} failed - st
opping task
Thread-1926::DEBUG::2012-11-12 17:39:55,144::task::1199::TaskManager.Task::(stop) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::stopping in state preparing (force False)

Comment 1 Haim 2012-11-12 18:20:26 UTC
this is kind of interesting, the reason operation failed is due to inter leaving of several actions which runs in parallel:

formatStorageDomain:

Thread-1926::INFO::2012-11-12 17:39:49,052::logUtils::37::dispatcher::(wrapper) Run and protect: formatStorageDomain


disconnectStoragePool:

Thread-1928::INFO::2012-11-12 17:39:51,812::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool

format storage domain operation fails as it claims as if host already connected to pool but it doesn't, and never (in the time being) was, from somereason, engine decides to send disconnectStoragePool during formatStorageDomain which is unclear to me, and it fails on the following:


2350         vars.task.getExclusiveLock(STORAGE, sdUUID)
2351         for p in self.pools.values():
2352             # Avoid format if domain part of connected pool
2353             domDict = p.getDomains()
2354             if sdUUID in domDict.keys():
2355                 raise se.CannotFormatStorageDomainInConnectedPool(sdUUID)

i'm afraid it somehow fails to acquire the exclusive lock as disconnectStoragePool takes it before and not releasing.

Comment 2 Ayal Baron 2012-11-21 10:47:45 UTC
After a couple of minutes does the remove work?

Comment 3 Dafna Ron 2012-12-26 10:27:47 UTC
(In reply to comment #2)
> After a couple of minutes does the remove work?

if by a couple of minutes you mean that we wait a couple of minutes between removing the pool and removing the domain than the answer is yes. 

if you mean can we remove the domain after the failure than the answer is no - after we fail on format we will not be able to remove the domain without destroy.

Comment 4 Liron Aravot 2013-01-07 17:46:33 UTC
what happend here is that RemoveStorageDomain was executed before we ran DisconnectStoragePool for that host:

2012-11-12 17:39:47,382 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [323e94db] START, FormatStorageDomainVDSComm
and(HostName = gold-vdsc, HostId = 24ef29b0-2c20-11e2-90a4-001a4a169741, storageDomainId=02ee6a99-54e6-42fd-8199-9b14c127fe96), log id: 5900f377


2012-11-12 17:39:53,034 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-4-thread-19) [98660e9] START, DisconnectStoragePoolVDSCommand
(HostName = gold-vdsc, HostId = 24ef29b0-2c20-11e2-90a4-001a4a169741, storagePoolId = c3354d0a-caec-4236-888a-f58d62dddbbf, vds_spm_id = 1), log id: 7e4a7087

2012-11-12 17:39:56,413 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Failed in FormatStorageDomainVDS method
2012-11-12 17:39:56,414 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Error code CannotFormatStorageDomainInConnect
edPool and error message VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Cannot format storage domain in connected pool: ('02ee6a99-54e6-42
fd-8199-9b14c127fe96',)
2012-11-12 17:39:56,419 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Command org.ovirt.engine.core.vdsbroker.vdsbr
oker.FormatStorageDomainVDSCommand return value 


Generally, from looking at the code, this shouldn't happen - we remove the storage pool (= setting the domain status to unattaced) only after we complete the disconnectStoragePool for the hosts - (we see that the 'correct' disconnect pool operations were initiated here, just didn't complete).

therefore, i suspect the the bug might have occured because of possibly wrong calculated status of the domain that allowed you to remove it.

a while ago a patch was merged that refactored that whole 'calculate domain status' query and possibly solved that issue.
http://gerrit.ovirt.org/#/c/9511/

Regardless, doesn't reproduce at the moment.

Comment 6 Elad 2013-02-25 16:11:35 UTC
the bug is not reproducing, moving to verified.

Comment 7 Itamar Heim 2013-06-11 09:45:14 UTC
3.2 has been released

Comment 8 Itamar Heim 2013-06-11 09:45:16 UTC
3.2 has been released

Comment 9 Itamar Heim 2013-06-11 09:55:53 UTC
3.2 has been released