Bug 806952

Summary: ovirt-engine-backend: we allow to remove DC when there is more than one domain attached
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: ovirt-engine
Assignee: Laszlo Hornyak <lhornyak>
Status: CLOSED CURRENTRELEASE
QA Contact: Dafna Ron <dron>
Severity: high
Priority: high
Version: 3.1.0
Target Milestone: ---
Target Release: 3.1.0
Keywords: Regression, Reopened
Hardware: x86_64
OS: Linux
Whiteboard: storage
oVirt Team: Storage
Fixed In Version: si20
Doc Type: Bug Fix
CC: abaron, amureini, dfediuck, dyasny, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul
Last Closed: 2012-12-04 20:02:49 UTC

Attachments:
  logs (flags: none)
  reproduced again (flags: none)

Description Dafna Ron 2012-03-26 15:14:23 UTC
Description of problem:

Removing a DC should only be possible while just the master domain is attached (the master cannot be detached without removing the pool).
We can currently remove the DC while more than one domain is attached.
I hit several races in which the remove fails in the backend but not in VDSM, which causes a "domain not in pool" error on the VDS side.

However, this bug is set to urgent because once I created a new DC and attached the previously removed domains, the HSM host got a "wrong master or version" error and became non-operational (yes, the HSM, not the SPM).

The SPM remains active and connected to the pool and to all domains.
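
For clarity, the missing guard amounts to something like the following minimal Python sketch (the real check belongs in the Java engine backend; all names below are hypothetical, not engine code):

# Minimal sketch of the validation this bug requests (hypothetical
# names; the actual check lives in the engine's remove-DC command).

class ValidationError(Exception):
    pass

class StorageDomain(object):
    def __init__(self, uuid, is_master=False):
        self.uuid = uuid
        self.is_master = is_master

def can_remove_data_center(attached_domains):
    # Only the master domain may remain attached, since the master
    # cannot be detached without removing the pool itself.
    non_master = [d.uuid for d in attached_domains if not d.is_master]
    if non_master:
        raise ValidationError(
            "detach these domains before removing the DC: "
            + ", ".join(non_master))
    return True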


Version-Release number of selected component (if applicable):

ovirt-engine-backend-3.1.0_0001-3.el6.x86_64
vdsm-4.9.6-4.5.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create and attach several iSCSI domains to a two-host cluster.
2. Put all domains in maintenance and remove the DC.
3. Create a second DC and reattach all the domains to the new DC.
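
The same steps can also be scripted against the REST API; a rough sketch follows (endpoint paths and payloads are written from memory and should be treated as assumptions, not as the verified 3.1 API):

# Rough sketch of scripting the reproduction over the oVirt REST API.
# Endpoint paths and the <action/> payload are assumptions from memory.
import requests

BASE = "https://engine.example.com/api"    # hypothetical engine URL
AUTH = ("admin@internal", "password")

def put_in_maintenance_and_remove(dc_id, sd_ids):
    for sd_id in sd_ids:
        # Deactivate (maintenance) each attached storage domain.
        requests.post(
            "%s/datacenters/%s/storagedomains/%s/deactivate"
            % (BASE, dc_id, sd_id),
            auth=AUTH, verify=False,
            headers={"Content-Type": "application/xml"},
            data="<action/>")
    # Remove the DC itself. The bug: this succeeds even though several
    # non-master domains are still attached.
    requests.delete("%s/datacenters/%s" % (BASE, dc_id),
                    auth=AUTH, verify=False)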
  
Actual results:

We get a "wrong master domain" error from the HSM host only, and the host becomes non-operational.

Expected results:

We should not be allowed to remove the DC while anything more than the master domain is attached.

Additional info: full logs attached. 

2012-03-26 16:55:02,486 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-22) [4c105e7c] Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         324
mMessage                      Wrong Master domain or its version: 'SD=4b1984e6-feb6-4629-9e13-affda9c9ca41, pool=e4269e26-f1ac-4dfb-8407-8e191d736a05'


Thread-942::ERROR::2012-03-26 16:47:50,783::task::853::TaskManager.Task::(_setError) Task=`7345bbab-5a07-4072-8069-c5df986ed3d9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 813, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 855, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 641, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1107, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1451, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=4b1984e6-feb6-4629-9e13-affda9c9ca41, pool=e4269e26-f1ac-4dfb-8407-8e191d736a05'
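
The error is raised by vdsm's master-domain validation during connectStoragePool. Conceptually it is a comparison like the sketch below (simplified, with illustrative names rather than vdsm's actual data structures; the real logic is in sp.py's getMasterDomain): after the old DC was removed while extra domains were attached, the reattached domains still carry stale pool/master metadata, so the check fails on the HSM host.

# Simplified sketch of the check behind StoragePoolWrongMaster
# (illustrative names, not vdsm's actual code).
from collections import namedtuple

Domain = namedtuple("Domain", "uuid master_version")

class StoragePoolWrongMaster(Exception):
    pass

def get_master_domain(pool_uuid, domain, msd_uuid, master_version):
    # The engine tells the host which domain should be master and at
    # which version; the domain's own metadata must agree.
    if domain.uuid != msd_uuid or domain.master_version != master_version:
        raise StoragePoolWrongMaster(
            "Wrong Master domain or its version: 'SD=%s, pool=%s'"
            % (msd_uuid, pool_uuid))
    return domain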

Comment 1 Dafna Ron 2012-03-26 15:39:04 UTC
Created attachment 572782 [details]
logs

Comment 4 Laszlo Hornyak 2012-07-12 11:07:57 UTC
"works for me"

I could not reproduce the issue with the latest git versions of both vdsm and engine. I used two iSCSI storage domains. (How many is "several", exactly?)

Comment 5 Laszlo Hornyak 2012-07-17 11:23:07 UTC
Dafna, I could not reproduce this issue so far, can you help? Thanks.

Comment 6 Dafna Ron 2012-08-02 12:34:12 UTC
Created attachment 601939 [details]
reproduced again

I reproduced this on si12 again.
* Make sure you are working with the correct packages and with RHEL hosts only.
* Make sure that you have iSCSI domains.
* Make sure that you have no objects under any of the domains (lean setup).
* Make sure that you have only one host in the Up state.

Comment 7 Laszlo Hornyak 2012-08-22 13:29:31 UTC
Trying to reproduce: after attempting to remove the DC, it failed to detach the master domain and the DC went back to the Up state.

Comment 8 Laszlo Hornyak 2012-08-24 14:34:21 UTC
- The DC could not be removed with a simple 'remove': the master domain got stuck in the Locked state and the DC went back to the 'Up' state, with all domains deactivated and only the master Locked. This might be a bug in itself.
- With force remove, removing the DC worked fine; attaching the storage afterwards generated some errors, but in the end it managed to start up.

2012-08-24 16:22:23,671 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (http--0.0.0.0-8080-2) [38c457c] starting spm on vds dev-164, storage pool iscsi02, prevId -1, LVER -1
2012-08-24 16:22:23,673 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (http--0.0.0.0-8080-2) [38c457c] START, SpmStartVDSCommand(vdsId = 73261da6-edf4-11e1-955d-b3bda0fd89ee, storagePoolId = 8fa4ef1a-c514-47be-bfe1-004e09cdec23, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 6fa7ed3
2012-08-24 16:22:23,681 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (http--0.0.0.0-8080-2) [38c457c] spmStart polling started: taskId = 064081b6-0603-44c6-b4dd-44b23eda8ba9
2012-08-24 16:22:24,693 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http--0.0.0.0-8080-2) [38c457c] Failed in HSMGetTaskStatusVDS method
2012-08-24 16:22:24,695 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http--0.0.0.0-8080-2) [38c457c] Error code GeneralException and error message VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = 'wait' is an invalid keyword argument for this function
2012-08-24 16:22:24,697 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (http--0.0.0.0-8080-2) [38c457c] spmStart polling ended: taskId = 064081b6-0603-44c6-b4dd-44b23eda8ba9 task status = finished
2012-08-24 16:22:24,698 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (http--0.0.0.0-8080-2) [38c457c] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = 'wait' is an invalid keyword argument for this function
2012-08-24 16:22:24,707 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (http--0.0.0.0-8080-2) [38c457c] spmStart polling ended. spm status: Free
2012-08-24 16:22:24,710 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (http--0.0.0.0-8080-2) [38c457c] START, HSMClearTaskVDSCommand(vdsId = 73261da6-edf4-11e1-955d-b3bda0fd89ee, taskId=064081b6-0603-44c6-b4dd-44b23eda8ba9), log id: 5eb97f68

Comment 9 Laszlo Hornyak 2012-08-24 14:59:48 UTC
related vdsm log:
Thread-2101::DEBUG::2012-08-24 16:52:03,767::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 100, 'message': u"'wait' is an invalid keyword argument for this function", 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': 'e481fe99-0a11-4281-b316-a65ccae29528'}
Thread-2101::INFO::2012-08-24 16:52:03,767::logUtils::39::dispatcher::(wrapper) Run and protect: getTaskStatus, Return response: {'taskStatus': {'code': 100, 'message': u"'wait' is an invalid keyword argument for this function", 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': 'e481fe99-0a11-4281-b316-a65ccae29528'}}
Thread-2101::DEBUG::2012-08-24 16:52:03,767::task::1151::TaskManager.Task::(prepare) Task=`de9dbdf4-8105-418c-a094-b1446e1f4508`::finished: {'taskStatus': {'code': 100, 'message': u"'wait' is an invalid keyword argument for this function", 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': 'e481fe99-0a11-4281-b316-a65ccae29528'}}
Thread-2101::DEBUG::2012-08-24 16:52:03,767::task::568::TaskManager.Task::(_updateState) Task=`de9dbdf4-8105-418c-a094-b1446e1f4508`::moving from state preparing -> state finished
Thread-2101::DEBUG::2012-08-24 16:52:03,767::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-2101::DEBUG::2012-08-24 16:52:03,767::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}

Comment 10 Laszlo Hornyak 2012-08-24 15:41:20 UTC
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 307, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 250, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 427, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/safelease.py", line 170, in acquireHostId
    self._sdUUID, hostId, self._idsPath, wait=True):
TypeError: 'wait' is an invalid keyword argument for this function
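
This TypeError is a signature mismatch: safelease.py's acquireHostId passes wait=True to a callee whose signature does not declare that parameter. A minimal standalone reproduction of the same failure mode (note that the wording in the log, "is an invalid keyword argument for this function", is what CPython emits when the callee is a C-extension function, which suggests the lease binding here is implemented in C):

# Minimal reproduction of the failure mode: calling a function with a
# keyword argument its signature does not declare.

def acquire_host_id(sd_uuid, host_id, ids_path):  # note: no 'wait' param
    return True

# The caller (as safelease.py does) assumes a newer signature:
acquire_host_id("sd-uuid", 1, "/path/to/ids", wait=True)
# TypeError: acquire_host_id() got an unexpected keyword argument 'wait'
# (for a C-extension callee, CPython instead says:
#  "'wait' is an invalid keyword argument for this function")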

Comment 11 Laszlo Hornyak 2012-08-27 15:34:16 UTC
http://gerrit.ovirt.org/7509

Comment 15 Dafna Ron 2012-10-14 12:52:16 UTC
verified on si20