Created attachment 656630
logs

Description of problem:
With two domains under the pool, I put the master domain in maintenance and sent refreshStoragePool to the SPM. After the "wrong master domain or its version" error, sanlock cannot obtain the lock and fails with the following error:

AcquireLockFailure: Cannot obtain lock: "id=e4d412b7-25f5-4948-bf24-45cab8de5816, rc=17, out=Cannot acquire cluster lock, err=(17, 'Sanlock resource not acquired', 'File exists')"

2012-12-03 15:26:19+0200 535121 [2184]: s38:r102 resource e4d412b7-25f5-4948-bf24-45cab8de5816:SDM:/rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Daffi/e4d412b7-25f5-4948-bf24-45cab8de5816/dom_md/leases:1048576 for 3,12,21005
2012-12-03 15:26:19+0200 535121 [2184]: r102 acquire_token resource exists

Version-Release number of selected component (if applicable):
vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a pool with two NFS POSIX domains.
2. Put the master domain in maintenance.
3. Manually run refreshStoragePool with the old master domain UUID.

Actual results:
After the "wrong master domain or its version" error, sanlock cannot obtain the lock.

Expected results:
We should send reconstructMaster and recover.

Additional info:
logs
Fede, I understood from Haim that you already looked into this. Can you add a comment here with your findings?
Dafna, do you have a reproducer that does not involve vdsClient? Can you add a comment describing it please? Thanks.
It was a race that we noticed when attaching/detaching export/ISO domains while putting a host in maintenance, but since it's a race that only happened twice, I found a better way to reproduce it 100% of the time.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Sending a refreshStoragePool with incorrect parameters (wrong master domain or its version) triggers a pool disconnection that skips some initial validations (for example the validateNotSPM check) and goes directly to the deactivation without releasing the cluster lock.

Receiving a refreshStoragePool with a wrong master or version is quite problematic on the SPM. We should decide whether vdsm should bail out, releasing the SPM role and disconnecting from the storage pool, or whether it should just report a big warning in the logs.

Anyway, I'm still wondering whether refreshStoragePool has any meaning when used on the SPM; probably it's just a way to refresh the iSCSI connections and clear the cache. For sure, sending it (also to the SPM) during the master migration is quite problematic (even more so if the old master is used).
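To make the failure mode above concrete, here is a toy model of it. This is only a sketch under stated assumptions: `ToyClusterLock`, `ToyPool`, the `release_on_mismatch` flag and `demo` are hypothetical names invented for illustration, and none of this is vdsm or sanlock code. It only shows how a disconnect path that forgets the SPM role without releasing the lock could make the next acquire fail with "File exists".

```python
class AcquireLockFailure(Exception):
    """Toy stand-in for the error seen in the report."""


class ToyClusterLock:
    """Hypothetical in-memory lock: a second acquire fails while it is held."""

    def __init__(self):
        self.held = False

    def acquire(self):
        if self.held:
            # Mirrors the 'resource exists' / 'File exists' symptom.
            raise AcquireLockFailure("rc=17, File exists")
        self.held = True

    def release(self):
        self.held = False


class ToyPool:
    """Hypothetical pool that holds the cluster lock while it is SPM."""

    def __init__(self, master_uuid, master_version, lock):
        self.master_uuid = master_uuid
        self.master_version = master_version
        self.lock = lock
        self.is_spm = False

    def start_spm(self):
        self.lock.acquire()
        self.is_spm = True

    def refresh(self, master_uuid, master_version, release_on_mismatch):
        expected = (self.master_uuid, self.master_version)
        if (master_uuid, master_version) == expected:
            return "refreshed"
        # Wrong master or version: the pool is disconnected.
        if release_on_mismatch and self.is_spm:
            self.lock.release()
        # The SPM role is forgotten either way (assumption of this model).
        self.is_spm = False
        return "disconnected"


def demo(release_on_mismatch):
    """Become SPM, send a refresh with a stale master, try to become SPM again."""
    pool = ToyPool("new-master", 2, ToyClusterLock())
    pool.start_spm()
    pool.refresh("old-master", 1, release_on_mismatch)
    try:
        pool.start_spm()
        return "acquired"
    except AcquireLockFailure as exc:
        return "failed: %s" % exc
```

In this model `demo(False)` ends in the lock failure, while `demo(True)` recovers because the lock is released before the role is dropped.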
(In reply to comment #5)
> Sending a refreshStoragePool with incorrect parameters (wrong master domain
> or its version) triggers a pool disconnection that jumps some initial
> validations (for example the validateNotSPM check) and goes directly to the
> deactivation without releasing the cluster lock.

I think you mean connectStoragePool as refreshStoragePool is called after reconstructMaster which changes the master domain by definition.

>
> Receiving a refreshStoragePool with a wrong master or version is quite
> problematic on the SPM, we should decide if vdsm should bail out releasing
> the spm and disconnecting from the storage pool, or if it should just report
> a big warning in the logs.
>
> Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> SPM, probably it's just a way to refresh the iscsi connections and clear the
> cache. For sure sending it (also to the SPM) during the master migration is
> quite problematic (even more if the old master is used).

Probably the loop that calls refresh on all hosts did not exclude the SPM in the engine.
(In reply to comment #6)
> (In reply to comment #5)
> > Sending a refreshStoragePool with incorrect parameters (wrong master domain
> > or its version) triggers a pool disconnection that jumps some initial
> > validations (for example the validateNotSPM check) and goes directly to the
> > deactivation without releasing the cluster lock.
>
> I think you mean connectStoragePool as refreshStoragePool is called after
> reconstructMaster which changes the master domain by definition.

scratch that, I thought you meant valid master which is different. nm.

> >
> > Receiving a refreshStoragePool with a wrong master or version is quite
> > problematic on the SPM, we should decide if vdsm should bail out releasing
> > the spm and disconnecting from the storage pool, or if it should just report
> > a big warning in the logs.
> >
> > Anyway I'm still thinking if refreshStoragePool has any meaning used on the
> > SPM, probably it's just a way to refresh the iscsi connections and clear the
> > cache. For sure sending it (also to the SPM) during the master migration is
> > quite problematic (even more if the old master is used).
>
> Probably the loop that calls refresh on all hosts did not exclude the SPM in
> the engine.

refreshStoragePool has no meaning on the SPM whatsoever.
commit 671b0bca4d9a671f108e31916469df5943e5db4e
Author: Federico Simoncelli <fsimonce>
Date:   Fri Dec 28 10:08:07 2012 -0500

    pool: ignore refreshStoragePool calls on the SPM

    The refreshStoragePool command is an HSM command and should not be issued
    (and executed) on the SPM. At the moment we just ignore it for legacy
    reasons but in the future vdsm could raise an exception.

http://gerrit.ovirt.org/#/c/10450/
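The behaviour described in the commit message can be sketched roughly as follows. This is not the actual patch (see the gerrit link for that); `ToyHsmPool`, its `is_spm` attribute, and the `refresh` signature are hypothetical names used only to illustrate the guard: ignore the call on the SPM, log it, and proceed only on a regular HSM host.

```python
import logging

log = logging.getLogger("toy.pool")


class ToyHsmPool:
    """Hypothetical pool object; 'is_spm' marks whether this host is the SPM."""

    def __init__(self, is_spm):
        self.is_spm = is_spm
        self.refresh_count = 0

    def refresh(self, master_uuid, master_version):
        """Return True if the refresh ran, False if it was ignored."""
        if self.is_spm:
            # refreshStoragePool is an HSM command: ignore it on the SPM
            # instead of entering the disconnect/reconnect path.
            log.warning(
                "refreshStoragePool ignored on the SPM (master=%s, version=%s)",
                master_uuid,
                master_version,
            )
            return False
        # Placeholder for the real refresh work on an HSM host.
        self.refresh_count += 1
        return True
```

Returning early keeps the SPM's cluster lock untouched; as the commit message notes, a stricter variant could raise an exception instead of silently ignoring the call.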
vdsm-4.10.2-11.0.el6ev.x86_64. Tested according to the described scenario. The refreshStoragePool command wasn't sent.
3.2 has been released