Bug 1359659 - [vdsm] err=(-202, 'Sanlock resource not acquired', 'Sanlock exception') for storage domain attachment to storage pool
Summary: [vdsm] err=(-202, 'Sanlock resource not acquired', 'Sanlock exception') for storage domain attachment to storage pool
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.18.8
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.4
Target Release: ---
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-25 08:59 UTC by Elad
Modified: 2016-08-15 09:03 UTC
CC List: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-08-15 09:03:40 UTC
oVirt Team: Storage
Embargoed:
gklein: ovirt-4.0.z?
ebenahar: planning_ack?
ebenahar: devel_ack?
ebenahar: testing_ack?


Attachments
logs and output from sanlock status commands for 2 hosts (2.85 MB, application/x-gzip)
2016-07-25 09:02 UTC, Elad

Description Elad 2016-07-25 08:59:23 UTC
Description of problem:
We're facing an issue while creating storage domains that reside on a specific storage server (NetApp) from specific hosts (mainly ones located in a different lab than the storage).

 
We get a Sanlock exception during storage domain attachment to DC.

"Error while executing action Attach Storage Domain: Could not obtain lock"


Thread-12::DEBUG::2016-07-24 16:43:51,918::check::327::storage.check::(_check_completed) FINISH check '/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmpQvA0pI/8ea9e23b-da07-4662-8366-e10b67ded636/dom_md/metadata' rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n517 bytes (517 B) copied, 0.000121961 s, 4.2 MB/s\n') elapsed=0.04
jsonrpc.Executor/2::DEBUG::2016-07-24 16:43:54,439::clusterlock::269::Storage.SANLock::(releaseHostId) Host id for domain 4c07f161-388a-44fb-a2dd-6f36582be748 released successfully (id: 250)
jsonrpc.Executor/2::ERROR::2016-07-24 16:43:54,439::sp::626::Storage.StoragePool::(create) Create pool hosted_datacenter canceled
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 623, in create
    self.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 950, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 682, in acquireClusterLock
    self._manifest.acquireDomainLock(hostID)
  File "/usr/share/vdsm/storage/sd.py", line 422, in acquireDomainLock
    self._domainLock.acquire(hostID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 317, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: "id=4c07f161-388a-44fb-a2dd-6f36582be748, rc=-202, out=Cannot acquire cluster lock, err=(-202, 'Sanlock resource not acquired', 'Sanlock exception')"




Version-Release number of selected component (if applicable):
vdsm-4.18.8-1.el7ev.x86_64
package libvirt is not installed
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-3.el7_2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.19.x86_64
selinux-policy-3.13.1-60.el7_2.7.noarch


How reproducible:
With a NetApp storage server and hypervisors that are geographically remote from the storage.
Occurs with both file and block storage.


Steps to Reproduce:
1. Create a storage domain (NFS or iSCSI) with the specifications described above (storage server and hypervisor locations) and attach it to a data center


Actual results:
Storage domain attachment to storage pool fails with the mentioned Sanlock exception.


===============================================================================

sanlock.log:

2016-07-24 16:56:14+0300 614 [702]: s3 lockspace 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb:1:/rhev/data-center/mnt/10.35.118.113:_nas01_artyom__test/1dfb9fa1-d4d5-4d50-9cea-655c030de6bb/dom_md/ids:0
2016-07-24 16:56:35+0300 635 [4361]: s3:r2 resource 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb:SDM:/rhev/data-center/mnt/10.35.118.113:_nas01_artyom__test/1dfb9fa1-d4d5-4d50-9cea-655c030de6bb/dom_md/leases:1048576 for 2,9,2608
2016-07-24 16:56:42+0300 642 [697]: s3 host 1 1 614 b4a23c98-6742-4a3e-af45-2146b4f7bb34.blond-vdsh
2016-07-24 16:56:52+0300 652 [4361]: worker2 aio timeout 0 0x7fe4580008c0:0x7fe4580008d0:0x7fe469cf7000 ioto 10 to_count 1
2016-07-24 16:56:52+0300 652 [4361]: r2 acquire_token disk error -202
2016-07-24 16:56:52+0300 652 [4361]: r2 cmd_acquire 2,9,2608 acquire_token -202 
2016-07-24 16:56:54+0300 654 [4361]: worker2 aio collect 0 0x7fe4580008c0:0x7fe4580008d0:0x7fe469cf7000 result 1048576:0 other free

===============================================================================



Expected results:
Storage domain attachment to storage pool should succeed.

Additional info:

The issue doesn't happen with hosts that are located in the same lab as the storage server.

vdsm.log, sanlock.log and the output of the sanlock status commands (host and client) are attached for 2 hosts: one located in a different lab than the storage server, where storage domain attachment to the pool fails, and one located in the same lab, where attachment succeeds.
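
The sanlock.log excerpt above appears to show sanlock's asynchronous disk I/O against the leases file exceeding its 10-second I/O timeout ("worker2 aio timeout ... ioto 10") just before the acquire fails with -202. A rough way to check whether storage latency from the remote lab approaches that timeout is to time a direct read of the 1 MiB SDM lease area. This is only a minimal sketch, loosely modeled on vdsm's own dd-based storage checks visible in the log above; the leases path and the 1048576 offset are taken from the sanlock.log lines and should be adjusted for the domain under test:

# Minimal latency probe (sketch). Requires coreutils dd on the host.
import subprocess
import time

LEASES = ("/rhev/data-center/mnt/10.35.118.113:_nas01_artyom__test/"
          "1dfb9fa1-d4d5-4d50-9cea-655c030de6bb/dom_md/leases")

start = time.time()
# Read 1 MiB of the lease area at offset 1048576 with direct I/O so the
# page cache does not hide the real storage latency.
subprocess.check_call([
    "dd", "if=" + LEASES, "of=/dev/null",
    "bs=1M", "count=1", "skip=1", "iflag=direct",
])
print("read 1 MiB of the lease area in %.2f seconds "
      "(sanlock's I/O timeout here is 10 seconds)" % (time.time() - start))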

Comment 1 Elad 2016-07-25 09:02:36 UTC
Created attachment 1183641 [details]
logs and output from sanlock status commands for 2 hosts

Comment 2 Nir Soffer 2016-07-25 09:43:46 UTC
This looks like a timeout in sanlock:

1. We try to attach the domain

jsonrpc.Executor/6::INFO::2016-07-24 16:56:14,173::sp::906::Storage.StoragePool::(attachSD) sdUUID=1dfb9fa1-d4d5-4d50-9cea-655c030de6bb spUUID=00000001-0001-0001-0001-000000000005


2. Acquiring host id on this domain (synchronous)

jsonrpc.Executor/6::INFO::2016-07-24 16:56:14,174::clusterlock::236::Storage.SANLock::(acquireHostId) Acquiring host id for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb (id: 1)


3. Acquiring succeeded

jsonrpc.Executor/6::DEBUG::2016-07-24 16:56:35,176::clusterlock::254::Storage.SANLock::(acquireHostId) Host id for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb successfully acquired (id: 1)


4. Trying to acquire the domain lock

jsonrpc.Executor/6::INFO::2016-07-24 16:56:35,177::clusterlock::298::Storage.SANLock::(acquire) Acquiring cluster lock for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb (id: 1)


5. Timed out after 19 seconds 

jsonrpc.Executor/6::ERROR::2016-07-24 16:56:54,212::task::868::Storage.TaskManager.Task::(_setError) Task=`19e02374-68ff-4293-8e68-253689784ff2`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1162, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 682, in acquireClusterLock
    self._manifest.acquireDomainLock(hostID)
  File "/usr/share/vdsm/storage/sd.py", line 422, in acquireDomainLock
    self._domainLock.acquire(hostID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 317, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: u"id=1dfb9fa1-d4d5-4d50-9cea-655c030de6bb, rc=-202, out=Cannot acquire cluster lock, err=(-202, 'Sanlock resource not acquired', 'Sanlock exception')"
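
For orientation, the acquire() frame at the bottom of this traceback wraps the sanlock Python bindings roughly as follows. This is a simplified sketch written from memory of vdsm's clusterlock.py of that era, not the actual code; the constant names and keyword arguments are assumptions and may differ between sanlock/vdsm versions.

# Simplified sketch of the failing call path (not verbatim vdsm code).
# "SDM" and the 1 MiB offset mirror the resource seen in sanlock.log;
# the exact constants and keyword names are assumptions.
import sanlock

SDM_LEASE_NAME = "SDM"
SDM_LEASE_OFFSET = 1024 * 1024

def acquire_domain_lock(sd_uuid, host_id, ids_path, leases_path):
    # Steps 2-3 above: join the domain lockspace (acquireHostId).
    # This part succeeded in the log, after about 21 seconds.
    sanlock.add_lockspace(sd_uuid, host_id, ids_path)

    # Steps 4-5 above: take the exclusive SDM resource lease. This is the
    # step that failed here: sanlock's disk I/O against the leases file
    # timed out and the call failed with -202 ('Sanlock resource not acquired').
    fd = sanlock.register()
    sanlock.acquire(sd_uuid, SDM_LEASE_NAME,
                    [(leases_path, SDM_LEASE_OFFSET)],
                    slkfd=fd)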


Elad, is this reproducible? How many times did you try? How many times did it fail?

Comment 3 Nir Soffer 2016-07-25 09:44:30 UTC
David, can you suggest how to continue debugging this issue?

Comment 4 Elad 2016-07-25 10:25:02 UTC
It reproduces every time with hosts that are located in a different lab than the storage.
Artyom has tried it many times with such hosts for hosted-engine automation tests.

Comment 5 David Teigland 2016-07-25 14:46:27 UTC
I'd suggest debugging as close to the source of the problem as possible, which I suppose means analyzing I/O at the NFS level, or possibly the network level. I'm afraid I don't have any suggestions for exactly what to look at or trace, or what tools to use.
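
One concrete way to follow that suggestion is to capture the NFS traffic between the host and the NetApp server during an attach attempt and inspect per-request latency afterwards (e.g. in wireshark). A minimal sketch only: the storage address is taken from the mount path in the logs above, while the output file and the 60-second capture window are arbitrary choices.

# Capture NFS (port 2049) traffic to/from the storage server while an
# attach attempt runs in parallel, for later latency analysis.
import subprocess
import time

STORAGE = "10.35.118.113"  # taken from the mount path in the logs above

capture = subprocess.Popen([
    "tcpdump", "-i", "any", "-s", "0",
    "-w", "/tmp/nfs-attach.pcap",
    "host", STORAGE, "and", "port", "2049",
])
try:
    time.sleep(60)  # run the storage domain attach / HE deploy meanwhile
finally:
    capture.terminate()
    capture.wait()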

Comment 6 Yaniv Kaul 2016-07-27 07:57:00 UTC
(In reply to Elad from comment #4)
> It's reproduced every time with hosts which are located in different lab
> than the storage. 
> Artyom has tried it many times with such hosts for hosted-engine automation
> tests.

Elad - what is the RTT between the hosts and the storage?
Is it a slow connection, slow storage, or both?

Comment 7 Artyom 2016-07-28 06:31:32 UTC
I checked with Dotan what the reason for the packet loss was: it was a DDoS-prevention mechanism. After disabling it, the packet loss disappeared, but I still have the same issue with the HE deployment.

[root@rose05 ~]# ping -A 10.35.118.118
PING 10.35.118.118 (10.35.118.118) 56(84) bytes of data.
64 bytes from 10.35.118.118: icmp_seq=1 ttl=254 time=0.083 ms
64 bytes from 10.35.118.118: icmp_seq=2 ttl=254 time=0.092 ms
64 bytes from 10.35.118.118: icmp_seq=3 ttl=254 time=0.097 ms
64 bytes from 10.35.118.118: icmp_seq=4 ttl=254 time=0.092 ms
....
64 bytes from 10.35.118.118: icmp_seq=14023 ttl=254 time=0.077 ms
64 bytes from 10.35.118.118: icmp_seq=14024 ttl=254 time=0.078 ms
64 bytes from 10.35.118.118: icmp_seq=14025 ttl=254 time=0.073 ms
64 bytes from 10.35.118.118: icmp_seq=14026 ttl=254 time=0.079 ms
64 bytes from 10.35.118.118: icmp_seq=14027 ttl=254 time=0.070 ms
^C
--- 10.35.118.118 ping statistics ---
14028 packets transmitted, 14027 received, 0% packet loss, time 1396ms
rtt min/avg/max/mdev = 0.052/0.082/0.254/0.011 ms, ipg/ewma 0.099/0.080 ms

If someone can take a look, I can provide the host.

Comment 8 Yaniv Kaul 2016-08-07 05:46:00 UTC
(In reply to Artyom from comment #7)
> I checked with Dotan what the reason for the packet loss was: it was a
> DDoS-prevention mechanism. After disabling it, the packet loss disappeared,
> but I still have the same issue with the HE deployment.
> 
> [root@rose05 ~]# ping -A 10.35.118.118
> PING 10.35.118.118 (10.35.118.118) 56(84) bytes of data.
> 64 bytes from 10.35.118.118: icmp_seq=1 ttl=254 time=0.083 ms
> 64 bytes from 10.35.118.118: icmp_seq=2 ttl=254 time=0.092 ms
> 64 bytes from 10.35.118.118: icmp_seq=3 ttl=254 time=0.097 ms
> 64 bytes from 10.35.118.118: icmp_seq=4 ttl=254 time=0.092 ms
> ....
> 64 bytes from 10.35.118.118: icmp_seq=14023 ttl=254 time=0.077 ms
> 64 bytes from 10.35.118.118: icmp_seq=14024 ttl=254 time=0.078 ms
> 64 bytes from 10.35.118.118: icmp_seq=14025 ttl=254 time=0.073 ms
> 64 bytes from 10.35.118.118: icmp_seq=14026 ttl=254 time=0.079 ms
> 64 bytes from 10.35.118.118: icmp_seq=14027 ttl=254 time=0.070 ms
> ^C
> --- 10.35.118.118 ping statistics ---
> 14028 packets transmitted, 14027 received, 0% packet loss, time 1396ms
> rtt min/avg/max/mdev = 0.052/0.082/0.254/0.011 ms, ipg/ewma 0.099/0.080 ms
> 
> If someone can take a look I can provide the host.

Logs?

Comment 9 Artyom 2016-08-10 08:39:53 UTC
What logs do you need?

Comment 10 Allon Mureinik 2016-08-10 15:07:10 UTC
(In reply to Artyom from comment #9)
> What logs do you need?

First of all, vdsm's.
/var/log/messages will also be useful.

Comment 11 Artyom 2016-08-15 08:42:27 UTC
I checked manually and ran the automation tests, and it looks like everything works fine now (HE deploy passes on NFS and iSCSI NetApp storage). I hope it was only a temporary problem. I suggest closing the bug; if I encounter the same behaviour again, I will reopen it.

Comment 12 Allon Mureinik 2016-08-15 09:03:40 UTC
Agreed. Closing until we have a reliable reproducer.

