Bug 882958

Summary: vdsm: sanlock cannot acquire cluster lock after wrong master domain or its version - pool cannot recover (posix)
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: vdsm Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE QA Contact: Leonid Natapov <lnatapov>
Severity: urgent Docs Contact:
Priority: high    
Version: 3.2.0 CC: abaron, amureini, bazulay, fsimonce, hateya, iheim, knesenko, lpeer, scohen, sgrinber, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: vdsm-4.10.2-11.0.el6ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 922807    
Attachments:
Description Flags
logs none

Description Dafna Ron 2012-12-03 13:34:38 UTC
Created attachment 656630
logs

Description of problem:

With two domains in the pool, I put the master domain into maintenance and sent refreshStoragePool to the SPM.
After the "wrong master domain or its version" error, sanlock cannot obtain the lock and fails with the following error:

AcquireLockFailure: Cannot obtain lock: "id=e4d412b7-25f5-4948-bf24-45cab8de5816, rc=17, out=Cannot acquire cluster lock, err=(17, 'Sanlock resource not acquired', 'File exists')"
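
For readers unfamiliar with the return code, the following is a minimal sketch (not vdsm code) that pulls `rc` out of the message above and maps it to its symbolic errno name. The regular expression and variable names are illustrative assumptions; only the message text comes from this report.

```python
import errno
import os
import re

# Error text copied from this report.
message = ('AcquireLockFailure: Cannot obtain lock: '
           '"id=e4d412b7-25f5-4948-bf24-45cab8de5816, rc=17, '
           'out=Cannot acquire cluster lock, '
           "err=(17, 'Sanlock resource not acquired', 'File exists')\"")

# Extract the numeric return code (assumed to appear as "rc=<number>").
match = re.search(r"\brc=(\d+)", message)
rc = int(match.group(1))

# Translate the number into its symbolic errno name.
name = errno.errorcode.get(rc, "unknown")
print(rc, name, os.strerror(rc))
```

On Linux, 17 is EEXIST, which matches the 'File exists' text in the error and suggests the resource was still held when the acquire was attempted.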


2012-12-03 15:26:19+0200 535121 [2184]: s38:r102 resource e4d412b7-25f5-4948-bf24-45cab8de5816:SDM:/rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Daffi/e4d412b7-25f5-4948-bf24-45cab8de5816/dom_md/leases:1048576 for 3,12,21005
2012-12-03 15:26:19+0200 535121 [2184]: r102 acquire_token resource exists
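
The resource string in the sanlock log line above appears to follow a lockspace:resource:path:offset layout. The sketch below splits it on that assumption; `parse_resource` is a hypothetical helper written for this comment, not a sanlock or vdsm function. Because the NFS mount path itself contains a colon, the first two fields are split from the left and the offset from the right.

```python
def parse_resource(spec):
    """Split 'lockspace:resource:path:offset' (assumed layout) into parts."""
    # The first two fields never contain a colon, so split them off the left.
    lockspace, resource, rest = spec.split(":", 2)
    # The path may contain colons (NFS 'server:_export'), so take the
    # offset from the right-hand side.
    path, offset = rest.rsplit(":", 1)
    return {
        "lockspace": lockspace,
        "resource": resource,
        "path": path,
        "offset": int(offset),
    }


# Log text copied from this report (timestamp prefix omitted).
line = ("s38:r102 resource e4d412b7-25f5-4948-bf24-45cab8de5816:SDM:"
        "/rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Daffi/"
        "e4d412b7-25f5-4948-bf24-45cab8de5816/dom_md/leases:1048576 "
        "for 3,12,21005")

# Keep only the text between " resource " and " for ".
spec = line.split(" resource ", 1)[1].split(" for ", 1)[0]
parsed = parse_resource(spec)
print(parsed)
```

Under that reading, the failing request is for the SDM resource in the lockspace named after the old master domain UUID, stored in that domain's dom_md/leases file.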


Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create a pool with two NFS POSIX domains.
2. Put the master domain into maintenance.
3. Manually run refreshStoragePool with the old master domain UUID.
  
Actual results:

After the "wrong master domain or its version" error, sanlock cannot obtain the lock.

Expected results:

We should send reconstruct master and recover.

Additional info: logs

Comment 1 Allon Mureinik 2012-12-05 10:39:54 UTC
Fede, I understood from Haim that you already looked into this.

Can you add a comment here with your findings?

Comment 2 Allon Mureinik 2012-12-05 10:41:31 UTC
Dafna, do you have a reproducer that does not involve vdsClient? Can you add a comment describing it, please?

Thanks.

Comment 3 Dafna Ron 2012-12-05 10:54:06 UTC
It was a race that we noticed when attaching/detaching export/ISO domains while putting a host into maintenance.
But since it's a race that only happened twice, I found a better way to reproduce it 100% of the time.

Comment 4 RHEL Program Management 2012-12-14 07:52:13 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 5 Federico Simoncelli 2012-12-21 16:54:06 UTC
Sending a refreshStoragePool with incorrect parameters (wrong master domain or its version) triggers a pool disconnection that jumps some initial validations (for example the validateNotSPM check) and goes directly to the deactivation without releasing the cluster lock.
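
To make the sequence described above concrete, here is a toy model of it. Everything in it (`FakeLockManager`, `FakePool`, the method names, the "dom-A"/"dom-B" identifiers) is hypothetical and only mimics the described behaviour; it is not vdsm's or sanlock's actual code.

```python
import errno


class FakeLockManager:
    """Stand-in for sanlock: tracks which resources are currently held."""

    def __init__(self):
        self.held = set()

    def acquire(self, resource):
        if resource in self.held:
            # Mirrors rc=17 ("File exists") from the report.
            raise OSError(errno.EEXIST, "Sanlock resource not acquired")
        self.held.add(resource)

    def release(self, resource):
        self.held.discard(resource)


class FakePool:
    """Hypothetical pool model; not vdsm's real pool class."""

    def __init__(self, locks, master, version):
        self.locks = locks
        self.master = master
        self.version = version
        self.is_spm = False

    def start_spm(self):
        self.locks.acquire((self.master, "SDM"))
        self.is_spm = True

    def disconnect(self, validate=True):
        if validate and self.is_spm:
            raise RuntimeError("operation not allowed while SPM is active")
        # Deactivation forgets the SPM role but never releases the lock.
        self.is_spm = False

    def refresh(self, master, version):
        if (master, version) != (self.master, self.version):
            # Faulty path: skips the validation and deactivates directly.
            self.disconnect(validate=False)
            raise ValueError("Wrong Master domain or its version")


locks = FakeLockManager()
pool = FakePool(locks, master="dom-A", version=1)
pool.start_spm()

try:
    pool.refresh("dom-B", 1)  # wrong master triggers the faulty disconnect
except ValueError:
    pass

try:
    pool.start_spm()  # attempt to recover by taking the SPM role again
    recovered = True
except OSError as exc:
    recovered = False
    leaked_errno = exc.errno
```

In this model the second `start_spm()` fails with EEXIST because the lock taken by the first one was never released, which is the same shape as the failure in the description.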

Receiving a refreshStoragePool with a wrong master or version is quite problematic on the SPM, we should decide if vdsm should bail out releasing the spm and disconnecting from the storage pool, or if it should just report a big warning in the logs.

Anyway I'm still thinking if refreshStoragePool has any meaning used on the SPM, probably it's just a way to refresh the iscsi connections and clear the cache. For sure sending it (also to the SPM) during the master migration is quite problematic (even more if the old master is used).

Comment 6 Ayal Baron 2012-12-23 14:20:28 UTC
(In reply to comment #5)
> Sending a refreshStoragePool with incorrect parameters (wrong master domain
> or its version) triggers a pool disconnection that jumps some initial
> validations (for example the validateNotSPM check) and goes directly to the
> deactivation without releasing the cluster lock.

I think you mean connectStoragePool, as refreshStoragePool is called after reconstructMaster, which changes the master domain by definition.

> 
> Receiving a refreshStoragePool with a wrong master or version is quite
> problematic on the SPM, we should decide if vdsm should bail out releasing
> the spm and disconnecting from the storage pool, or if it should just report
> a big warning in the logs.
> 
> Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> SPM, probably it's just a way to refresh the iscsi connections and clear the
> cache. For sure sending it (also to the SPM) during the master migration is
> quite problematic (even more if the old master is used).

Probably the loop that calls refresh on all hosts did not exclude the SPM in the engine.

Comment 7 Ayal Baron 2012-12-23 14:29:10 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Sending a refreshStoragePool with incorrect parameters (wrong master domain
> > or its version) triggers a pool disconnection that jumps some initial
> > validations (for example the validateNotSPM check) and goes directly to the
> > deactivation without releasing the cluster lock.
> 
> I think you mean connectStoragePool as refreshStoragePool is called after
> reconstructMaster which changes the master domain by definition.

Scratch that, I thought you meant a valid master, which is different. Never mind.

> 
> > 
> > Receiving a refreshStoragePool with a wrong master or version is quite
> > problematic on the SPM, we should decide if vdsm should bail out releasing
> > the spm and disconnecting from the storage pool, or if it should just report
> > a big warning in the logs.
> > 
> > Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> > SPM, probably it's just a way to refresh the iscsi connections and clear the
> > cache. For sure sending it (also to the SPM) during the master migration is
> > quite problematic (even more if the old master is used).
> 
> Probably the loop that calls refresh on all hosts did not exclude the SPM in
> the engine.

refreshStoragePool has no meaning on the SPM whatsoever.

Comment 8 Federico Simoncelli 2013-01-02 15:38:42 UTC
commit 671b0bca4d9a671f108e31916469df5943e5db4e
Author: Federico Simoncelli <fsimonce>
Date:   Fri Dec 28 10:08:07 2012 -0500

    pool: ignore refreshStoragePool calls on the SPM
    
    The refreshStoragePool command is an HSM command and should not be
    issued (and executed) on the SPM. At the moment we just ignore it
    for legacy reasons but in the future vdsm could raise an exception.

http://gerrit.ovirt.org/#/c/10450/
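
For illustration only, the idea in the commit message can be sketched as an early return when the host holds the SPM role. `PoolSketch` and its attributes are hypothetical names for this comment; the actual change is the one in the gerrit link above and may be structured differently.

```python
import logging

log = logging.getLogger("pool-sketch")


class PoolSketch:
    """Hypothetical model of a connected pool; not the patched vdsm class."""

    def __init__(self, master, version, is_spm):
        self.master = master
        self.version = version
        self.is_spm = is_spm
        self.connected = True

    def refresh(self, master, version):
        """Return True if the refresh ran, False if it was ignored."""
        if self.is_spm:
            # refreshStoragePool is an HSM command: ignore it on the SPM
            # instead of reaching the disconnect path.
            log.warning("refreshStoragePool ignored on the SPM")
            return False
        if (master, version) != (self.master, self.version):
            self.connected = False
            raise ValueError("Wrong Master domain or its version")
        return True


# On the SPM, even wrong parameters leave the pool connected.
spm_pool = PoolSketch("dom-A", 1, is_spm=True)
ignored = not spm_pool.refresh("dom-B", 7)

# On a regular host (HSM), a matching refresh still runs.
hsm_pool = PoolSketch("dom-A", 1, is_spm=False)
ran = hsm_pool.refresh("dom-A", 1)
```

Returning early keeps the SPM connected and leaves its cluster lock untouched, so the wrong-master path that leaked the lock is never reached on the SPM.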

Comment 9 Leonid Natapov 2013-03-14 12:37:38 UTC
Verified on vdsm-4.10.2-11.0.el6ev.x86_64. Tested according to the described scenario. The refreshStoragePool command wasn't sent.

Comment 10 Itamar Heim 2013-06-11 09:32:29 UTC
3.2 has been released
