882958 – vdsm: sanlock cannot acquire cluster lock after wrong master domain or its version - pool cannot recover (posix)



Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.2.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	3.2.0
Assignee:	Federico Simoncelli
QA Contact:	Leonid Natapov
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:	922807

Reported:	2012-12-03 13:34 UTC by Dafna Ron
Modified:	2018-12-01 14:34 UTC
CC List:	11 users
Fixed In Version:	vdsm-4.10.2-11.0.el6ev
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments
logs (710.85 KB, application/x-gzip) 2012-12-03 13:34 UTC, Dafna Ron

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	10450	0	None	None	None	Never
oVirt gerrit	10452	0	None	ABANDONED	[wip] core: avoid RefreshStoragePool on the SPM	Never

Description Dafna Ron 2012-12-03 13:34:38 UTC

Created attachment 656630
logs

Description of problem:

With two domains in the pool, I put the master domain into maintenance and sent refreshStoragePool to the SPM.
After the "wrong master domain or its version" error, sanlock cannot obtain the lock and fails with the following error:

AcquireLockFailure: Cannot obtain lock: "id=e4d412b7-25f5-4948-bf24-45cab8de5816, rc=17, out=Cannot acquire cluster lock, err=(17, 'Sanlock resource not acquired', 'File exists')"


2012-12-03 15:26:19+0200 535121 [2184]: s38:r102 resource e4d412b7-25f5-4948-bf24-45cab8de5816:SDM:/rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Daffi/e4d412b7-25f5-4948-bf24-45cab8de5816/dom_md/leases:1048576 for 3,12,21005
2012-12-03 15:26:19+0200 535121 [2184]: r102 acquire_token resource exists
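
For reference, the rc=17 in the error above is the errno value EEXIST, which matches the 'File exists' text and the "resource exists" message in the sanlock log. A minimal sketch for decoding such a return code with the Python standard library follows; the helper name describe_rc is hypothetical and is not part of vdsm or sanlock.

```python
import errno
import os


def describe_rc(rc):
    """Return (symbolic name, message) for a positive errno value.

    Hypothetical helper for reading log lines; not a vdsm or sanlock API.
    """
    return errno.errorcode.get(rc, "UNKNOWN"), os.strerror(rc)


# Decode the return code reported in the AcquireLockFailure message.
print(describe_rc(17))
```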


Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create a pool with two NFS POSIX domains.
2. Put the master domain into maintenance.
3. Manually run refreshStoragePool with the old master domain UUID.
  
Actual results:

After the "wrong master domain or its version" error, sanlock cannot obtain the lock.

Expected results:

We should send reconstructMaster and recover.

Additional info: logs

Comment 1 Allon Mureinik 2012-12-05 10:39:54 UTC

Fede, I understood from Haim that you already looked into this.

Can you add a comment here with your findings?

Comment 2 Allon Mureinik 2012-12-05 10:41:31 UTC

Dafna, do you have a reproducer that does not involve vdsClient? Can you add a comment describing it, please?

Thanks.

Comment 3 Dafna Ron 2012-12-05 10:54:06 UTC

It was a race that we noticed when attaching/detaching export/ISO domains while putting a host in maintenance,
but since it's a race that only happened twice, I found a better way to reproduce it 100% of the time.

Comment 4 RHEL Program Management 2012-12-14 07:52:13 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 5 Federico Simoncelli 2012-12-21 16:54:06 UTC

Sending a refreshStoragePool with incorrect parameters (wrong master domain or its version) triggers a pool disconnection that skips some initial validations (for example the validateNotSPM check) and goes directly to the deactivation without releasing the cluster lock.

Receiving a refreshStoragePool with a wrong master or version is quite problematic on the SPM. We should decide whether vdsm should bail out, releasing the SPM role and disconnecting from the storage pool, or whether it should just report a big warning in the logs.

Anyway, I'm still wondering whether refreshStoragePool has any meaning when used on the SPM; probably it's just a way to refresh the iSCSI connections and clear the cache. Sending it (to the SPM as well) during the master migration is certainly quite problematic (even more so if the old master is used).
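
The failure mode described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyLease and ToyPool are hypothetical stand-ins, not vdsm classes or the sanlock API, and the real code paths are more involved. It shows a refresh with a stale master disconnecting the pool while the lease stays held, so the next acquire fails with EEXIST.

```python
import errno


class ToyLease:
    """Hypothetical stand-in for the cluster lock (not the sanlock API)."""

    def __init__(self):
        self.held = False

    def acquire(self):
        if self.held:
            # Mirrors the rc=17 / 'File exists' failure seen in the report.
            raise OSError(errno.EEXIST, "Sanlock resource not acquired")
        self.held = True

    def release(self):
        self.held = False


class ToyPool:
    """Hypothetical pool model; names and logic are assumptions."""

    def __init__(self, master_uuid, master_ver, lease):
        self.master_uuid = master_uuid
        self.master_ver = master_ver
        self.lease = lease
        self.connected = True
        self.is_spm = False

    def start_spm(self):
        self.lease.acquire()
        self.is_spm = True

    def refresh(self, master_uuid, master_ver):
        if (master_uuid, master_ver) != (self.master_uuid, self.master_ver):
            # Buggy path: disconnect without a validateNotSPM-style check
            # and without releasing the lease.
            self.connected = False
            self.is_spm = False
            raise ValueError("Wrong master domain or its version")


lease = ToyLease()
pool = ToyPool("new-master", 2, lease)
pool.start_spm()
try:
    pool.refresh("old-master", 1)  # stale master parameters
except ValueError:
    pass

# Recovery attempt: the lease was never released, so acquiring it fails.
recovered = ToyPool("new-master", 2, lease)
failure_errno = None
try:
    recovered.start_spm()
except OSError as exc:
    failure_errno = exc.errno
print(failure_errno == errno.EEXIST)
```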

Comment 6 Ayal Baron 2012-12-23 14:20:28 UTC

(In reply to comment #5)
> Sending a refreshStoragePool with incorrect parameters (wrong master domain
> or its version) triggers a pool disconnection that jumps some initial
> validations (for example the validateNotSPM check) and goes directly to the
> deactivation without releasing the cluster lock.

I think you mean connectStoragePool as refreshStoragePool is called after reconstructMaster which changes the master domain by definition.

> 
> Receiving a refreshStoragePool with a wrong master or version is quite
> problematic on the SPM, we should decide if vdsm should bail out releasing
> the spm and disconnecting from the storage pool, or if it should just report
> a big warning in the logs.
> 
> Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> SPM, probably it's just a way to refresh the iscsi connections and clear the
> cache. For sure sending it (also to the SPM) during the master migration is
> quite problematic (even more if the old master is used).

Probably the loop that calls refresh on all hosts did not exclude the SPM in the engine.

Comment 7 Ayal Baron 2012-12-23 14:29:10 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Sending a refreshStoragePool with incorrect parameters (wrong master domain
> > or its version) triggers a pool disconnection that jumps some initial
> > validations (for example the validateNotSPM check) and goes directly to the
> > deactivation without releasing the cluster lock.
> 
> I think you mean connectStoragePool as refreshStoragePool is called after
> reconstructMaster which changes the master domain by definition.

Scratch that, I thought you meant a valid master, which is different. Never mind.

> 
> > 
> > Receiving a refreshStoragePool with a wrong master or version is quite
> > problematic on the SPM, we should decide if vdsm should bail out releasing
> > the spm and disconnecting from the storage pool, or if it should just report
> > a big warning in the logs.
> > 
> > Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> > SPM, probably it's just a way to refresh the iscsi connections and clear the
> > cache. For sure sending it (also to the SPM) during the master migration is
> > quite problematic (even more if the old master is used).
> 
> Probably the loop that calls refresh on all hosts did not exclude the SPM in
> the engine.

refreshStoragePool has no meaning on the SPM whatsoever.

Comment 8 Federico Simoncelli 2013-01-02 15:38:42 UTC

commit 671b0bca4d9a671f108e31916469df5943e5db4e
Author: Federico Simoncelli <fsimonce>
Date:   Fri Dec 28 10:08:07 2012 -0500

    pool: ignore refreshStoragePool calls on the SPM
    
    The refreshStoragePool command is an HSM command and should not be
    issued (and executed) on the SPM. At the moment we just ignore it
    for legacy reasons but in the future vdsm could raise an exception.

http://gerrit.ovirt.org/#/c/10450/
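
The behaviour described in the commit message can be sketched roughly as follows. This is not the actual patch: ToyPool, its attributes, and the logger name are hypothetical, and the sketch only assumes that the pool knows whether the host currently holds the SPM role. The refresh request is logged and ignored on the SPM rather than being allowed to trigger a disconnection.

```python
import logging

log = logging.getLogger("toy.pool")  # hypothetical logger name


class ToyPool:
    """Hypothetical pool model; not vdsm's StoragePool class."""

    def __init__(self, is_spm):
        self.is_spm = is_spm
        self.refresh_count = 0

    def refresh(self, master_uuid, master_ver):
        """Return True if the refresh ran, False if it was ignored."""
        if self.is_spm:
            # refreshStoragePool is an HSM command: ignore it on the SPM
            # instead of validating the master and disconnecting.
            log.warning("refreshStoragePool issued on the SPM, ignoring")
            return False
        self.refresh_count += 1
        return True


spm_pool = ToyPool(is_spm=True)
hsm_pool = ToyPool(is_spm=False)
spm_result = spm_pool.refresh("old-master", 1)
hsm_result = hsm_pool.refresh("new-master", 2)
print(spm_result, hsm_result)
```

Ignoring the call keeps older engines working; as the commit message notes, a later version could raise an exception instead.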

Comment 9 Leonid Natapov 2013-03-14 12:37:38 UTC

vdsm-4.10.2-11.0.el6ev.x86_64. Tested according to the described scenario. The refreshStoragePool command wasn't sent.

Comment 10 Itamar Heim 2013-06-11 09:32:29 UTC

3.2 has been released

Comment 11 Itamar Heim 2013-06-11 09:33:19 UTC

3.2 has been released

Comment 12 Itamar Heim 2013-06-11 09:49:09 UTC

3.2 has been released
