Bug 818632

Summary: [vdsm-rhevm] vdsm: Cluster lock not locked for domain when running reconstructMaster
Product: Red Hat Enterprise Linux 6 Reporter: Dafna Ron <dron>
Component: vdsm Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED DUPLICATE QA Contact: Haim <hateya>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.3 CC: abaron, acathrow, bazulay, danken, dyasny, fsimonce, iheim, mgoldboi, smizrahi, yeylon, ykaul
Target Milestone: rc Keywords: Regression, TestBlocker
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 818577 Environment:
Last Closed: 2012-05-09 09:52:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 818577    
Bug Blocks:    

Description Dafna Ron 2012-05-03 15:25:00 UTC
This was also reproduced with vdsm-4.9.6-9.el6.x86_64 on build si3.

A patch already exists and has been tested.



+++ This bug was initially created as a clone of Bug #818577 +++

Created attachment 581846
log

Description of problem:

While testing a patch, I put the master domain into maintenance and got a "wrong master domain" error on the SPM.
Looking at the logs, we see the following:

Exception: Cluster lock not locked for domain `79d5c099-dac2-44a6-9fb5-9a2f70da5388`, cannot release


Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:
1. Migrate the master domain to a second domain.
  
Actual results:

We get a "wrong master domain" error.

Expected results:

We should not get a "wrong master domain" error.

Additional info:

Thread-31::ERROR::2012-05-03 14:44:37,147::task::853::TaskManager.Task::(_setError) Task=`7c9afe3e-b3fe-4f61-9d22-374a6977d3ba`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1471, in reconstructMaster
    return pool.reconstructMaster(poolName, masterDom, domDict, masterVersion, safeLease)
  File "/usr/share/vdsm/storage/sp.py", line 772, in reconstructMaster
    self._releaseTemporaryClusterLock(msdUUID)
  File "/usr/share/vdsm/storage/sp.py", line 523, in _releaseTemporaryClusterLock
    msd.releaseClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 418, in releaseClusterLock
    self._clusterLock.release()
  File "/usr/share/vdsm/storage/safelease.py", line 111, in release
    raise Exception("Cluster lock not locked for domain `%s`, cannot release" % self._sdUUID)
Exception: Cluster lock not locked for domain `79d5c099-dac2-44a6-9fb5-9a2f70da5388`, cannot release
Thread-31::DEBUG::2012-05-03 14:44:37,148::task::872::TaskManager.Task::(_run) Task=`7c9afe3e-b3fe-4f61-9d22-374a6977d3ba`::Task._run: 7c9afe3e-b3fe-4f61-9d22-374a6977d3ba ('63dbb4ef-10e8-43ce-98bd-acfb46fccfbf', 'rhevm-iscsi', '79d5c099-dac2-44a6-9fb5-9a2f70da5388', {'10641fc4-3ad2-4538-b25b-6aa6e5c431b7': 'active', '79d5c099-dac2-44a6-9fb5-9a2f70da5388': 'active'}, 5, None, 5, 60, 10, 3) {} failed - stopping task
Thread-31::DEBUG::2012-05-03 14:44:37,149::task::1199::TaskManager.Task::(stop) Task=`7c9afe3e-b3fe-4f61-9d22-374a6977d3ba`::stopping in state preparing (force False)
Thread-31::DEBUG::2012-05-03 14:44:37,149::task::978::TaskManager.Task::(_decref) Task=`7c9afe3e-b3fe-4f61-9d22-374a6977d3ba`::ref 1 aborting True
Thread-31::INFO::2012-05-03 14:44:37,150::task::1157::TaskManager.Task::(prepare) Task=`7c9afe3e-b3fe-4f61-9d22-374a6977d3ba`::aborting: Task is aborted: u'Cluster lock not locked for domain `79d5c099-dac2-44a6-9fb5-9a2f70da5388`, cannot release' - code 100

--- Additional comment from dron on 2012-05-03 08:29:21 EDT ---

I was testing a custom vdsm for Eduardo's patch:
http://gerrit.ovirt.org/#change,4085

--- Additional comment from fsimonce on 2012-05-03 08:34:51 EDT ---

In practice, the change introduced with patch 4085 refreshes the domain objects more frequently (now also within the same method, even if we're holding a weakref). Therefore we strictly need a fix that makes the clusterlock (safelease) stateless:

http://gerrit.ovirt.org/3497
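
The failure mode described above can be illustrated with a minimal Python sketch. This is not the code from either Gerrit change, and it does not reproduce vdsm's real classes: `StatefulLease`, `StatelessLease`, `Domain`, `produce` and `reconstruct` are hypothetical names, and the class-level set only stands in for lease state kept outside the object (for example, on storage). The sketch assumes the lock state was tracked on the lease instance, so a refreshed domain object would start out unlocked and the release would fail as in the traceback.

```python
SD_UUID = "79d5c099-dac2-44a6-9fb5-9a2f70da5388"  # domain UUID taken from the log


class StatefulLease:
    """Hypothetical lease that remembers its lock state on the instance."""

    def __init__(self, sd_uuid):
        self._sd_uuid = sd_uuid
        self._locked = False

    def acquire(self):
        self._locked = True

    def release(self):
        if not self._locked:
            raise Exception(
                "Cluster lock not locked for domain `%s`, cannot release"
                % self._sd_uuid
            )
        self._locked = False


class StatelessLease:
    """Hypothetical lease with no per-instance state."""

    _held = set()  # assumption: stands in for lease state stored externally

    def __init__(self, sd_uuid):
        self._sd_uuid = sd_uuid

    def acquire(self):
        StatelessLease._held.add(self._sd_uuid)

    def release(self):
        # Releasing never depends on which object performed the acquire.
        StatelessLease._held.discard(self._sd_uuid)


class Domain:
    """Hypothetical storage domain object owning one lease object."""

    def __init__(self, sd_uuid, lease_cls):
        self.sd_uuid = sd_uuid
        self._lease = lease_cls(sd_uuid)

    def acquire_cluster_lock(self):
        self._lease.acquire()

    def release_cluster_lock(self):
        self._lease.release()


def produce(sd_uuid, lease_cls):
    """Simulate a refresh: every call returns a brand-new Domain object."""
    return Domain(sd_uuid, lease_cls)


def reconstruct(lease_cls):
    """Acquire on one domain object, then release on a refreshed one."""
    produce(SD_UUID, lease_cls).acquire_cluster_lock()
    try:
        produce(SD_UUID, lease_cls).release_cluster_lock()
    except Exception as exc:
        return "failed: %s" % exc
    return "released"
```

In this sketch, `reconstruct(StatefulLease)` fails because the refreshed object never saw the acquire, while `reconstruct(StatelessLease)` succeeds because no state lives on the object being replaced.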

Comment 2 Dan Kenigsberg 2012-05-09 09:52:48 UTC

*** This bug has been marked as a duplicate of bug 818577 ***