Bug 968891

Summary: Attaching NFS storage domain failed with 'Sanlock lockspace add failure'
Product: Red Hat Enterprise Virtualization Manager
Reporter: Katarzyna Jachim <kjachim>
Component: ovirt-engine
Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Version: 3.2.0
CC: acathrow, amureini, dron, dyasny, iheim, lpeer, ncredi, Rhev-m-bugs, yeylon, ykaul
Target Milestone: ---
Target Release: 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2013-05-30 12:32:02 UTC
Type: Bug
Regression: ---
oVirt Team: Storage
Attachments:
logs (vdsm, sanlock, engine, server, /var/log/messages...)

Description Katarzyna Jachim 2013-05-30 08:33:02 UTC
Created attachment 754716 [details]
logs (vdsm, sanlock, engine, server, /var/log/messages...)

Description of problem:

Attaching an NFS storage domain failed with the following error in vdsm.log:

Thread-111::ERROR::2013-05-30 01:20:18,469::task::850::TaskManager.Task::(_setError) Task=`98ba799f-078e-418b-832c-cb03181f328e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1098, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 445, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 188, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('795c691b-0aaf-4068-bb29-8ad969bc936a', SanlockException(-202, 'Sanlock lockspace add failure', 'Sanlock exception'))
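
For context, the failing frame in clusterlock.py is where vdsm joins the domain's sanlock lockspace and translates a sanlock error into AcquireHostIdFailure. A minimal sketch of that pattern, assuming the python sanlock binding and using made-up UUID/path values (not taken from this bug):

import sanlock

# Hypothetical values for illustration only.
SD_UUID = "00000000-0000-0000-0000-000000000000"
IDS_PATH = "/rhev/data-center/mnt/server:_export/domain/dom_md/ids"
HOST_ID = 1

class AcquireHostIdFailure(Exception):
    """Stand-in for vdsm's AcquireHostIdFailure storage exception."""

def acquire_host_id(sd_uuid, host_id, ids_path):
    try:
        # Joining the lockspace reads/writes the delta lease in the ids file;
        # when that I/O does not complete in time on the NFS mount, sanlock
        # returns -202 ('Sanlock lockspace add failure').
        sanlock.add_lockspace(sd_uuid, host_id, ids_path)
    except sanlock.SanlockException as e:
        # Re-raised as the AcquireHostIdFailure seen at the top of the
        # traceback above.
        raise AcquireHostIdFailure(sd_uuid, e)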


Version-Release number of selected component (if applicable):
sf17.3
vdsm-4.10.2-22.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.5.x86_64

How reproducible:
happened once in automated tests: http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.2/job/3.2-storage_nfs_options_sanity-nfs-rest/6/


Steps to Reproduce:
1. add, attach & activate master NFS storage domain 
2. add another NFS storage domain (NFS v3) 
3. try to attach the storage domain

I don't know whether it matters that this is a non-master domain or that it uses NFSv3. The failure is definitely non-deterministic; a re-run of the test passed.
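
For reference, a rough sketch of these steps using the oVirt 3.x Python SDK (ovirtsdk); the hostnames, entity names and credentials below are placeholders I made up, not the ones used by the Jenkins job:

from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://rhevm.example.com/api',
          username='admin@internal', password='secret', insecure=True)

# Step 2: add another NFS data domain (the master domain was added, attached
# and activated earlier in the same way).
api.storagedomains.add(params.StorageDomain(
    name='nfs_domain_2',
    type_='data',
    host=params.Host(name='host_1'),
    storage=params.Storage(type_='nfs',
                           address='nfs.example.com',
                           path='/export/nfs_domain_2')))

# Step 3: attach the new domain to the data center; this is the call that
# failed with AcquireHostIdFailure / 'Sanlock lockspace add failure'.
dc = api.datacenters.get(name='dc_1')
dc.storagedomains.add(api.storagedomains.get(name='nfs_domain_2'))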

Actual results:
Attach failed with sanlock exception

Expected results:
Storage domain should be attached

Additional info:

Comment 1 Dafna Ron 2013-05-30 12:32:02 UTC
It's not a bug.
Sanlock is doing what it's supposed to do.
We have a race here:
the create and connect of the pool was still holding the lock on the master domain,
so when the second domain is added, sanlock cannot acquire the lock and adding the domain fails.
Closing as not a bug.
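
Since this is a transient race rather than a persistent error, retrying the lockspace join once the earlier lock is released should succeed (the re-run of the test passed). A minimal sketch of such a retry (my illustration, not vdsm code), assuming the python sanlock binding:

import time
import sanlock

def add_lockspace_with_retry(sd_uuid, host_id, ids_path, retries=3, delay=10):
    # The -202 'Sanlock lockspace add failure' seen here is transient: the
    # earlier create/connect of the pool still holds the lock on the master
    # domain, and the join goes through once that lock is released.
    for attempt in range(retries):
        try:
            sanlock.add_lockspace(sd_uuid, host_id, ids_path)
            return
        except sanlock.SanlockException:
            if attempt == retries - 1:
                raise
            time.sleep(delay)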

Below is the thread from the sanlock log showing the race on the locks, where the first domain is 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc and the second domain is 795c691b-0aaf-4068-bb29-8ad969bc936a.
 

2013-05-30 01:16:50+0300 22 [2592]: sanlock daemon started 2.6 host 104b6509-03c1-43c3-8d80-488158a1809e.puma19.scl
2013-05-30 01:18:42+0300 133 [2597]: s1 lockspace 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:250:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/ids:0
2013-05-30 01:19:03+0300 154 [2598]: s1:r1 resource 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:SDM:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/leases:1048576 for 3,12,3729
2013-05-30 01:19:10+0300 161 [2598]: s2 lockspace 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:1:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/ids:0
2013-05-30 01:19:31+0300 182 [2597]: s2:r2 resource 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:SDM:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/leases:1048576 for 3,12,3729
2013-05-30 01:19:43+0300 194 [2597]: s3 lockspace 795c691b-0aaf-4068-bb29-8ad969bc936a:1:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__37__925681/795c691b-0aaf-4068-bb29-8ad969bc936a/dom_md/ids:0
2013-05-30 01:20:13+0300 224 [6557]: 795c691b aio timeout 0 0x7f33e80008c0:0x7f33e80008d0:0x7f33e8101000 ioto 10 to_count 1
2013-05-30 01:20:13+0300 224 [6557]: read_sectors delta_leader offset 0 rv -202 /rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__37__925681/795c691b-0aaf-4068-bb29-8ad969bc936a/dom_md/ids
2013-05-30 01:20:18+0300 229 [6557]: 795c691b aio collect 0x7f33e80008c0:0x7f33e80008d0:0x7f33e8101000 result 512:0 close free
2013-05-30 01:20:18+0300 229 [2597]: s3 add_lockspace fail result -202
2013-05-30 01:20:43+0300 254 [2598]: s4 lockspace 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:1:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/ids:0
2013-05-30 01:21:05+0300 276 [2597]: s4:r3 resource 28a3f1a9-7e8d-4b4b-8c69-1410f769aedc:SDM:/rhev/data-center/mnt/10.35.64.106:_fastpass_jenkins-vm-01__nfs__2013__05__30__1__12__36__484978/28a3f1a9-7e8d-4b4b-8c69-1410f769aedc/dom_md/leases:1048576 for 3,12,3729