Description of problem:
We're facing an issue while creating storage domains residing on a specific storage server (NetApp) from specific hosts (mainly ones located in a different lab than the storage). We get a sanlock exception during storage domain attachment to the DC:

"Error while executing action Attach Storage Domain: Could not obtain lock"

Thread-12::DEBUG::2016-07-24 16:43:51,918::check::327::storage.check::(_check_completed) FINISH check '/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmpQvA0pI/8ea9e23b-da07-4662-8366-e10b67ded636/dom_md/metadata' rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n517 bytes (517 B) copied, 0.000121961 s, 4.2 MB/s\n') elapsed=0.04
jsonrpc.Executor/2::DEBUG::2016-07-24 16:43:54,439::clusterlock::269::Storage.SANLock::(releaseHostId) Host id for domain 4c07f161-388a-44fb-a2dd-6f36582be748 released successfully (id: 250)
jsonrpc.Executor/2::ERROR::2016-07-24 16:43:54,439::sp::626::Storage.StoragePool::(create) Create pool hosted_datacenter canceled
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 623, in create
    self.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 950, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 682, in acquireClusterLock
    self._manifest.acquireDomainLock(hostID)
  File "/usr/share/vdsm/storage/sd.py", line 422, in acquireDomainLock
    self._domainLock.acquire(hostID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 317, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: "id=4c07f161-388a-44fb-a2dd-6f36582be748, rc=-202, out=Cannot acquire cluster lock, err=(-202, 'Sanlock resource not acquired', 'Sanlock exception')"

Version-Release number of selected component (if applicable):
vdsm-4.18.8-1.el7ev.x86_64
package libvirt is not installed
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-3.el7_2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.19.x86_64
selinux-policy-3.13.1-60.el7_2.7.noarch

How reproducible:
Always, with a NetApp storage server and hypervisors that are geographically remote from it. Occurs with both file and block storage.

Steps to Reproduce:
1. Create a storage domain (NFS or iSCSI) with the setup described above (storage server and hypervisor locations).

Actual results:
Storage domain attachment to the storage pool fails with the sanlock exception above.
===============================================================================
sanlock.log:

2016-07-24 16:56:14+0300 614 [702]: s3 lockspace 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb:1:/rhev/data-center/mnt/10.35.118.113:_nas01_artyom__test/1dfb9fa1-d4d5-4d50-9cea-655c030de6bb/dom_md/ids:0
2016-07-24 16:56:35+0300 635 [4361]: s3:r2 resource 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb:SDM:/rhev/data-center/mnt/10.35.118.113:_nas01_artyom__test/1dfb9fa1-d4d5-4d50-9cea-655c030de6bb/dom_md/leases:1048576 for 2,9,2608
2016-07-24 16:56:42+0300 642 [697]: s3 host 1 1 614 b4a23c98-6742-4a3e-af45-2146b4f7bb34.blond-vdsh
2016-07-24 16:56:52+0300 652 [4361]: worker2 aio timeout 0 0x7fe4580008c0:0x7fe4580008d0:0x7fe469cf7000 ioto 10 to_count 1
2016-07-24 16:56:52+0300 652 [4361]: r2 acquire_token disk error -202
2016-07-24 16:56:52+0300 652 [4361]: r2 cmd_acquire 2,9,2608 acquire_token -202
2016-07-24 16:56:54+0300 654 [4361]: worker2 aio collect 0 0x7fe4580008c0:0x7fe4580008d0:0x7fe469cf7000 result 1048576:0 other free
===============================================================================

Expected results:
Storage domain attachment to the storage pool should succeed.

Additional info:
The issue doesn't happen with hosts located in the same lab as the storage server.
vdsm.log, sanlock.log and the output of the sanlock host/client status commands are attached for two hosts: one located in a different lab than the storage server, where storage domain attachment to the pool fails, and a second located in the same lab, where the attachment succeeds.
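Note that in the sanlock.log excerpt the acquire fails with disk error -202 right after "worker2 aio timeout ... ioto 10", and the 1 MiB lease read only completes two seconds later ("aio collect ... result 1048576:0"), i.e. the lease I/O took longer than sanlock's 10-second I/O timeout. A quick way to sanity-check the latency from an affected host is to time a direct 1 MiB read of the leases file. A minimal Python 3 sketch (the mount path is a placeholder, and this only approximates sanlock's own aio-based reads):

import mmap
import os
import time

# Placeholder path - substitute the real mount point and domain UUID.
LEASES = "/rhev/data-center/mnt/<server>:<export>/<sdUUID>/dom_md/leases"
SIZE = 1024 * 1024  # sanlock handles the paxos lease area in 1 MiB chunks

# O_DIRECT bypasses the page cache so the timing reflects the real storage
# round trip; it needs an aligned buffer, which an anonymous mmap provides.
# Requires Python 3 (os.readv) and root.
buf = mmap.mmap(-1, SIZE)
fd = os.open(LEASES, os.O_RDONLY | os.O_DIRECT)
try:
    start = time.time()
    os.readv(fd, [buf])
    print("1 MiB direct read took %.3f s (sanlock's io timeout here is 10 s)"
          % (time.time() - start))
finally:
    os.close(fd)
    buf.close()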
Created attachment 1183641 [details] logs and output from sanlock status commands for 2 hosts
This looks like a timeout in sanlock:

1. We try to attach the domain
jsonrpc.Executor/6::INFO::2016-07-24 16:56:14,173::sp::906::Storage.StoragePool::(attachSD) sdUUID=1dfb9fa1-d4d5-4d50-9cea-655c030de6bb spUUID=00000001-0001-0001-0001-000000000005

2. Acquiring host id on this domain (synchronous)
jsonrpc.Executor/6::INFO::2016-07-24 16:56:14,174::clusterlock::236::Storage.SANLock::(acquireHostId) Acquiring host id for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb (id: 1)

3. Acquiring succeeded
jsonrpc.Executor/6::DEBUG::2016-07-24 16:56:35,176::clusterlock::254::Storage.SANLock::(acquireHostId) Host id for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb successfully acquired (id: 1)

4. Trying to acquire the domain lock
jsonrpc.Executor/6::INFO::2016-07-24 16:56:35,177::clusterlock::298::Storage.SANLock::(acquire) Acquiring cluster lock for domain 1dfb9fa1-d4d5-4d50-9cea-655c030de6bb (id: 1)

5. Timed out after 19 seconds
jsonrpc.Executor/6::ERROR::2016-07-24 16:56:54,212::task::868::Storage.TaskManager.Task::(_setError) Task=`19e02374-68ff-4293-8e68-253689784ff2`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1162, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 682, in acquireClusterLock
    self._manifest.acquireDomainLock(hostID)
  File "/usr/share/vdsm/storage/sd.py", line 422, in acquireDomainLock
    self._domainLock.acquire(hostID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 317, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: u"id=1dfb9fa1-d4d5-4d50-9cea-655c030de6bb, rc=-202, out=Cannot acquire cluster lock, err=(-202, 'Sanlock resource not acquired', 'Sanlock exception')"

Elad, is this reproducible? How many times did you try, and how many times did it fail?
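For reference, steps 2-4 above boil down to two calls into sanlock's Python bindings. A standalone sketch, roughly mirroring what vdsm's clusterlock does but heavily simplified (the UUID, host id and lease offset are taken from the log above; the mount path is a placeholder):

import sanlock

LOCKSPACE = "1dfb9fa1-d4d5-4d50-9cea-655c030de6bb"  # storage domain UUID
MNT = "/rhev/data-center/mnt/<server>:<export>/" + LOCKSPACE
IDS = MNT + "/dom_md/ids"
LEASES = MNT + "/dom_md/leases"
HOST_ID = 1
SDM_LEASE_OFFSET = 1048576  # offset of the SDM resource seen in sanlock.log

# Steps 2-3: join the lockspace (delta lease on the ids file).
# In the log this succeeded, taking about 21 seconds.
sanlock.add_lockspace(LOCKSPACE, HOST_ID, IDS)

# Step 4: acquire the SDM paxos lease on the leases file. In the log this is
# the call that failed with -202 right after the "aio timeout ... ioto 10"
# line, i.e. the 1 MiB lease I/O did not finish within sanlock's 10 s timeout.
fd = sanlock.register()
try:
    sanlock.acquire(LOCKSPACE, "SDM", [(LEASES, SDM_LEASE_OFFSET)], slkfd=fd)
except sanlock.SanlockException as e:
    print("acquire failed: %s" % e)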
David, can you suggest how to continue debugging this issue?
It's reproduced every time with hosts located in a different lab than the storage. Artyom has tried it many times with such hosts for the hosted-engine automation tests.
I'd suggest debugging as close to the source of the problem as possible, which I suppose means analyzing I/O at the NFS level, or possibly the network level. I'm afraid I don't have any suggestions for exactly what to look at or trace or what tools to use.
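For what it's worth, one low-effort starting point at the NFS level is the per-op RTT counters the client keeps in /proc/self/mountstats (the same data nfsiostat reports). A small sketch, assuming the usual per-op field layout (ops, transmissions, timeouts, bytes sent/received, queue, RTT and execute times in ms, all cumulative since mount):

# Per-op NFS latency summary from /proc/self/mountstats.
def nfs_op_latency(ops_of_interest=("READ", "WRITE")):
    device = None
    in_per_op = False
    with open("/proc/self/mountstats") as f:
        for line in f:
            line = line.strip()
            if line.startswith("device ") and "fstype nfs" in line:
                device = line.split()[1]
                in_per_op = False
            elif line == "per-op statistics":
                in_per_op = True
            elif in_per_op and ":" in line:
                op, rest = line.split(":", 1)
                if op in ops_of_interest:
                    fields = [int(x) for x in rest.split()]
                    ops, rtt_ms = fields[0], fields[6]
                    if ops:
                        print("%s %s: %d ops, avg RTT %.1f ms"
                              % (device, op, ops, rtt_ms / float(ops)))

if __name__ == "__main__":
    nfs_op_latency()

Averages well below sanlock's 10-second I/O timeout but occasional multi-second READ RTTs would be consistent with the intermittent acquire failures seen here.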
(In reply to Elad from comment #4)
> It's reproduced every time with hosts located in a different lab than the
> storage. Artyom has tried it many times with such hosts for the
> hosted-engine automation tests.

Elad - what is the RTT between the hosts and the storage? Is it a slow connection, slow storage, or both?
I checked with Dotan what the reason for the packet loss was; it turned out to be a DDoS-prevention mechanism. After disabling it the packet loss disappeared, but I still have the same issue with the hosted-engine deploy.

[root@rose05 ~]# ping -A 10.35.118.118
PING 10.35.118.118 (10.35.118.118) 56(84) bytes of data.
64 bytes from 10.35.118.118: icmp_seq=1 ttl=254 time=0.083 ms
64 bytes from 10.35.118.118: icmp_seq=2 ttl=254 time=0.092 ms
64 bytes from 10.35.118.118: icmp_seq=3 ttl=254 time=0.097 ms
64 bytes from 10.35.118.118: icmp_seq=4 ttl=254 time=0.092 ms
....
64 bytes from 10.35.118.118: icmp_seq=14023 ttl=254 time=0.077 ms
64 bytes from 10.35.118.118: icmp_seq=14024 ttl=254 time=0.078 ms
64 bytes from 10.35.118.118: icmp_seq=14025 ttl=254 time=0.073 ms
64 bytes from 10.35.118.118: icmp_seq=14026 ttl=254 time=0.079 ms
64 bytes from 10.35.118.118: icmp_seq=14027 ttl=254 time=0.070 ms
^C
--- 10.35.118.118 ping statistics ---
14028 packets transmitted, 14027 received, 0% packet loss, time 1396ms
rtt min/avg/max/mdev = 0.052/0.082/0.254/0.011 ms, ipg/ewma 0.099/0.080 ms

If someone can take a look, I can provide the host.
(In reply to Artyom from comment #7)
> I checked with Dotan what the reason for the packet loss was; it turned out
> to be a DDoS-prevention mechanism. After disabling it the packet loss
> disappeared, but I still have the same issue with the hosted-engine deploy.
>
> [root@rose05 ~]# ping -A 10.35.118.118
> PING 10.35.118.118 (10.35.118.118) 56(84) bytes of data.
> 64 bytes from 10.35.118.118: icmp_seq=1 ttl=254 time=0.083 ms
> 64 bytes from 10.35.118.118: icmp_seq=2 ttl=254 time=0.092 ms
> 64 bytes from 10.35.118.118: icmp_seq=3 ttl=254 time=0.097 ms
> 64 bytes from 10.35.118.118: icmp_seq=4 ttl=254 time=0.092 ms
> ....
> 64 bytes from 10.35.118.118: icmp_seq=14023 ttl=254 time=0.077 ms
> 64 bytes from 10.35.118.118: icmp_seq=14024 ttl=254 time=0.078 ms
> 64 bytes from 10.35.118.118: icmp_seq=14025 ttl=254 time=0.073 ms
> 64 bytes from 10.35.118.118: icmp_seq=14026 ttl=254 time=0.079 ms
> 64 bytes from 10.35.118.118: icmp_seq=14027 ttl=254 time=0.070 ms
> ^C
> --- 10.35.118.118 ping statistics ---
> 14028 packets transmitted, 14027 received, 0% packet loss, time 1396ms
> rtt min/avg/max/mdev = 0.052/0.082/0.254/0.011 ms, ipg/ewma 0.099/0.080 ms
>
> If someone can take a look, I can provide the host.

Logs?
What logs do you need?
(In reply to Artyom from comment #9)
> What logs do you need?

First of all, vdsm's. /var/log/messages will also be useful.
I checked manually and ran the automation tests, and everything looks fine now (hosted-engine deploy passes on NFS and iSCSI NetApp storage). I hope it was only a temporary problem. I suggest closing the bug; if I encounter the same behaviour again, I will reopen it.
Agreed. Closing until we have a reliable reproducer.