Bug 875870
| Summary: | engine: remove of master storage domain after destroy of pool fails (threads interleaving - disconnectStoragePool and formatStorageDomain collides) | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Dafna Ron <dron> | ||||
| Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 3.1.0 | CC: | dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, thildred, yeylon, ykaul | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.2.0 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | storage | ||||||
| Fixed In Version: | SF6 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | Type: | Bug | |||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
this is kind of interesting, the reason operation failed is due to inter leaving of several actions which runs in parallel: formatStorageDomain: Thread-1926::INFO::2012-11-12 17:39:49,052::logUtils::37::dispatcher::(wrapper) Run and protect: formatStorageDomain disconnectStoragePool: Thread-1928::INFO::2012-11-12 17:39:51,812::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool format storage domain operation fails as it claims as if host already connected to pool but it doesn't, and never (in the time being) was, from somereason, engine decides to send disconnectStoragePool during formatStorageDomain which is unclear to me, and it fails on the following: 2350 vars.task.getExclusiveLock(STORAGE, sdUUID) 2351 for p in self.pools.values(): 2352 # Avoid format if domain part of connected pool 2353 domDict = p.getDomains() 2354 if sdUUID in domDict.keys(): 2355 raise se.CannotFormatStorageDomainInConnectedPool(sdUUID) i'm afraid it somehow fails to acquire the exclusive lock as disconnectStoragePool takes it before and not releasing. After a couple of minutes does the remove work? (In reply to comment #2) > After a couple of minutes does the remove work? if by a couple of minutes you mean that we wait a couple of minutes between removing the pool and removing the domain than the answer is yes. if you mean can we remove the domain after the failure than the answer is no - after we fail on format we will not be able to remove the domain without destroy. what happend here is that RemoveStorageDomain was executed before we ran DisconnectStoragePool for that host:
2012-11-12 17:39:47,382 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-11) [323e94db] START, FormatStorageDomainVDSComm
and(HostName = gold-vdsc, HostId = 24ef29b0-2c20-11e2-90a4-001a4a169741, storageDomainId=02ee6a99-54e6-42fd-8199-9b14c127fe96), log id: 5900f377
2012-11-12 17:39:53,034 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-4-thread-19) [98660e9] START, DisconnectStoragePoolVDSCommand
(HostName = gold-vdsc, HostId = 24ef29b0-2c20-11e2-90a4-001a4a169741, storagePoolId = c3354d0a-caec-4236-888a-f58d62dddbbf, vds_spm_id = 1), log id: 7e4a7087
2012-11-12 17:39:56,413 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Failed in FormatStorageDomainVDS method
2012-11-12 17:39:56,414 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Error code CannotFormatStorageDomainInConnect
edPool and error message VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Cannot format storage domain in connected pool: ('02ee6a99-54e6-42
fd-8199-9b14c127fe96',)
2012-11-12 17:39:56,419 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-11) [323e94db] Command org.ovirt.engine.core.vdsbroker.vdsbr
oker.FormatStorageDomainVDSCommand return value
Generally, from looking at the code, this shouldn't happen - we remove the storage pool (= setting the domain status to unattaced) only after we complete the disconnectStoragePool for the hosts - (we see that the 'correct' disconnect pool operations were initiated here, just didn't complete).
therefore, i suspect the the bug might have occured because of possibly wrong calculated status of the domain that allowed you to remove it.
a while ago a patch was merged that refactored that whole 'calculate domain status' query and possibly solved that issue.
http://gerrit.ovirt.org/#/c/9511/
Regardless, doesn't reproduce at the moment.
the bug is not reproducing, moving to verified. 3.2 has been released 3.2 has been released 3.2 has been released |
Created attachment 643649 [details] logs Description of problem: I removed DC and once the domain became unattached I removed it. we fail on format domain from vdsm on domain still connected to pool. it seems that there are interliving threads for formatStroageDomain, disconnectStroagePool and disconnectStorageServer. Version-Release number of selected component (if applicable): si24.1 vdsm-4.9.6-42.0.el6_3.x86_64 How reproducible: Steps to Reproduce: 1. have a DC with several domains -> deactivate, detach and remove the domains 2. put master domain in maintenance -> remove DC 3. once master domain is unattached -> remove it Actual results: we fail on format storage with error from vdsm: annot format storage domain in connected pool Expected results: since pool is not really connected and it seems that engine is sending interliving threads we should wait for disconnectStroagePool to finish before sending format Additional info: logs attached hread-1926::ERROR::2012-11-12 17:39:55,143::task::853::TaskManager.Task::(_setError) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 861, in _run return fn(*args, **kargs) File "/usr/share/vdsm/logUtils.py", line 38, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 2355, in formatStorageDomain raise se.CannotFormatStorageDomainInConnectedPool(sdUUID) CannotFormatStorageDomainInConnectedPool: Cannot format storage domain in connected pool: ('02ee6a99-54e6-42fd-8199-9b14c127fe96',) Thread-1926::DEBUG::2012-11-12 17:39:55,143::task::872::TaskManager.Task::(_run) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::Task._run: 1609469b-bf72-47c9-8982-e52e71dd3366 ('02ee6a99-54e6-42fd-8199-9b14c127fe96', False) {} failed - st opping task Thread-1926::DEBUG::2012-11-12 17:39:55,144::task::1199::TaskManager.Task::(stop) Task=`1609469b-bf72-47c9-8982-e52e71dd3366`::stopping in state preparing (force False)