Created attachment 841434 [details] engine, vdsm, sanlock and libvirt logs Description of problem: Manage storage connection: After changing storage connection path on POSIXFS storage domain and activating the domain (master), vdsm tries to perform spmStart. The operation fails with the following error message: 4c2bf930-f92d-4d8d-8aec-a508c8ee18c5::ERROR::2013-12-25 10:35:30,543::task::850::TaskManager.Task::(_setError) Task=`4c2bf930-f92d-4d8d-8aec-a508c8ee18c5`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 857, in _run return fn(*args, **kargs) File "/usr/share/vdsm/storage/task.py", line 318, in run return self.cmd(*self.argslist, **self.argsdict) File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm self.masterDomain.acquireHostId(self.id) File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId self._clusterLock.acquireHostId(hostId, async) File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId raise se.AcquireHostIdFailure(self._sdUUID, e) AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(19, 'Sanlock lockspace add failure', 'No such device')) Version-Release number of selected component (if applicable): is28.1 vdsm-4.13.2-0.3.el6ev.x86_64 libvirt-0.10.2-29.el6.x86_64 sanlock-2.8-1.el6.x86_64 rhevm-3.3.0-0.41.el6ev.noarch How reproducible: Always Steps to Reproduce: On POSIXFS DC with 1 host 1. create 1 POSIX data domain and specify 'nfs' under VFS Type 2. create a new mount point and copy all the data from the domain's mount point to the new one 3. put the domain in maintenance and edit its storage connection path. Change it to the new mount point 3. activate the domain Actual results: spmStart fails. When changing the storage connection path back to the old one, spmStart succeed. Expected results: spmStart should woek Additional info: engine, vdsm, sanlock and libvirt logs
Elad: does this work with other storage domains?
Allon: is this operation (copying domain data manually from one domain to another) supported?
Federico: what can cause this sanlock failure?
(In reply to Nir Soffer from comment #2) > Allon: is this operation (copying domain data manually from one domain to > another) supported? It works fine with nfs and iscsi
(In reply to Nir Soffer from comment #1) > Elad: does this work with other storage domains? (In reply to Nir Soffer from comment #2) > Allon: is this operation (copying domain data manually from one domain to > another) supported? Yup - that's the point of the edit connection feature.
Created attachment 843839 [details] logs2 Reproduced again: f964988f-5e26-41f0-a701-d794d7fb087d::ERROR::2013-12-31 17:37:43,735::task::850::TaskManager.Task::(_setError) Task=`f964988f-5e26-41f0-a701-d794d7fb087d`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 857, in _run return fn(*args, **kargs) File "/usr/share/vdsm/storage/task.py", line 318, in run return self.cmd(*self.argslist, **self.argsdict) File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm self.masterDomain.acquireHostId(self.id) File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId self._clusterLock.acquireHostId(hostId, async) File "/usr/share/vdsm/storage/clusterlock.py", line 185, in acquireHostId raise se.AcquireHostIdFailure(self._sdUUID, e) AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress')) attaching logs (logs2)
(In reply to Elad from comment #6) > Created attachment 843839 [details] > logs2 > > Reproduced again: > AcquireHostIdFailure: Cannot acquire host id: > ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock > lockspace add failure', 'Operation now in progress')) This is different error then the previous error. This seems to be a random failure which is very hard to reproduce. Elad failed to reproduce the original error after trying many time, and could reproduce this new error only once after many tries. Federico, what could cause this random failures to acuire the host id?
(In reply to Nir Soffer from comment #7) > (In reply to Elad from comment #6) > > Created attachment 843839 [details] > > logs2 > > > > Reproduced again: > > AcquireHostIdFailure: Cannot acquire host id: > > ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock > > lockspace add failure', 'Operation now in progress')) > > This is different error then the previous error. > > This seems to be a random failure which is very hard to reproduce. Elad > failed to reproduce the original error after trying many time, and could > reproduce this new error only once after many tries. > > Federico, what could cause this random failures to acuire the host id? If we're syncing the domain WHILE ITS ACTIVE, including the leases, who knows what garbage state they'll end up in on the target domain, especially if the copying is done by a trivial cp operation. Federico - would it make any sense to clear all the leases when we activate a domain?
(In reply to Elad from comment #6) > Reproduced again: Elad, did you copy the domain data while the domain was active when this error was reproduced, or after the domain was deactivated?
(In reply to Nir Soffer from comment #9) > (In reply to Elad from comment #6) > > Reproduced again: > > Elad, did you copy the domain data while the domain was active when this > error was reproduced, or after the domain was deactivated? I can't tell you for sure
(In reply to Elad from comment #10) > (In reply to Nir Soffer from comment #9) > > (In reply to Elad from comment #6) > > > Reproduced again: > > > > Elad, did you copy the domain data while the domain was active when this > > error was reproduced, or after the domain was deactivated? > > I can't tell you for sure So I assume that this reproduction is invalid. Please try to reproduce again by: 1. Moving domain into maintanace mode 2. Create new direcotory on the nfs server 3. Copy data from old domain to new directory 4. Edit the path in the domain to point to new directory 5. Activate the new directory
I'll update if I'll encounter the issue again. Meanwhile lets try to understand the root of this failure.
Nir, do we have enough info here to try and find the root cause?
(In reply to Ayal Baron from comment #14) > Nir, do we have enough info here to try and find the root cause? I did not find one in the logs. I suggest to close this.
Please reopen if this happens again and attach updated logs. Thanks.