Bug 1046430 - [vdsm] spmStart fails with AcquireHostIdFailure after changing storage connection path on POSIXFS DC
Summary: [vdsm] spmStart fails with AcquireHostIdFailure after changing storage connection path on POSIXFS DC
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.4.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.4beta 1142926
 
Reported: 2013-12-25 09:15 UTC by Elad
Modified: 2016-02-10 16:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-04 12:03:28 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments (Terms of Use)
engine, vdsm, sanlock and libvirt logs (4.03 MB, application/x-gzip)
2013-12-25 09:15 UTC, Elad
logs2 (2.36 MB, application/x-gzip)
2013-12-31 16:18 UTC, Elad

Description Elad 2013-12-25 09:15:05 UTC
Created attachment 841434 [details]
engine, vdsm, sanlock and libvirt logs

Description of problem:
Manage storage connection:
After changing storage connection path on POSIXFS storage domain and activating the domain (master), vdsm tries to perform spmStart. The operation fails with the following error message:

4c2bf930-f92d-4d8d-8aec-a508c8ee18c5::ERROR::2013-12-25 10:35:30,543::task::850::TaskManager.Task::(_setError) Task=`4c2bf930-f92d-4d8d-8aec-a508c8ee18c5`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
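For reference, the numeric codes carried in these SanlockException tuples (19 here, and 115 in the later reproduction) are ordinary Linux errno values, so they can be decoded with the Python standard library. This is a minimal sketch for reading the logs, not vdsm code:

```python
import errno
import os

# The numeric codes in SanlockException are plain Linux errno values;
# decoding them recovers the messages seen in the log.
decoded = {code: (errno.errorcode[code], os.strerror(code))
           for code in (19, 115)}
print(decoded[19])   # ('ENODEV', 'No such device')
print(decoded[115])  # ('EINPROGRESS', 'Operation now in progress')
```

ENODEV matches the "No such device" text in this traceback, and EINPROGRESS matches the "Operation now in progress" failure in comment 6.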

Version-Release number of selected component (if applicable):
is28.1
vdsm-4.13.2-0.3.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
sanlock-2.8-1.el6.x86_64
rhevm-3.3.0-0.41.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
On POSIXFS DC with 1 host
1. create 1 POSIX data domain and specify 'nfs' under VFS Type
2. create a new mount point and copy all the data from the domain's mount point to the new one
3. put the domain in maintenance and edit its storage connection path, pointing it to the new mount point
4. activate the domain
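As an illustration only, the copy in step 2 can be scripted roughly as below. The mount points and domain layout here are hypothetical local stand-ins (the real copy happens between two NFS mount points on the host):

```python
import os
import shutil
import tempfile

# Hypothetical local stand-ins for the old and new mount points.
old_mnt = tempfile.mkdtemp(prefix="posixfs_old_")
new_mnt = os.path.join(tempfile.mkdtemp(prefix="posixfs_new_"), "data")

# Fake a minimal domain layout just to exercise the copy; the UUID is
# the one from this bug's logs, used only as an example name.
sd = "d136bc03-5cb4-4dce-a2c2-e439d6453867"
os.makedirs(os.path.join(old_mnt, sd, "dom_md"))
open(os.path.join(old_mnt, sd, "dom_md", "ids"), "w").close()

# Copy everything, including the sanlock lease files under dom_md.
shutil.copytree(old_mnt, new_mnt)
print(sorted(os.listdir(os.path.join(new_mnt, sd, "dom_md"))))
```

Note that the copy must happen while the domain is in maintenance; comment 8 below discusses what can go wrong otherwise.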

Actual results:
spmStart fails. When changing the storage connection path back to the old one, spmStart succeeds.

Expected results:
spmStart should work

Additional info: 
engine, vdsm, sanlock and libvirt logs

Comment 1 Nir Soffer 2013-12-26 18:04:26 UTC
Elad: does this work with other storage domains?

Comment 2 Nir Soffer 2013-12-26 18:07:18 UTC
Allon: is this operation (copying domain data manually from one domain to another) supported?

Comment 3 Nir Soffer 2013-12-26 18:09:50 UTC
Federico: what can cause this sanlock failure?

Comment 4 Elad 2013-12-29 07:59:10 UTC
(In reply to Nir Soffer from comment #2)
> Allon: is this operation (copying domain data manually from one domain to
> another) supported?

It works fine with nfs and iscsi

Comment 5 Allon Mureinik 2013-12-29 15:09:30 UTC
(In reply to Nir Soffer from comment #1)
> Elad: does this work with other storage domains?

(In reply to Nir Soffer from comment #2)
> Allon: is this operation (copying domain data manually from one domain to
> another) supported?
Yup - that's the point of the edit connection feature.

Comment 6 Elad 2013-12-31 16:18:10 UTC
Created attachment 843839 [details]
logs2

Reproduced again:

f964988f-5e26-41f0-a701-d794d7fb087d::ERROR::2013-12-31 17:37:43,735::task::850::TaskManager.Task::(_setError) Task=`f964988f-5e26-41f0-a701-d794d7fb087d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 185, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress'))


attaching logs (logs2)

Comment 7 Nir Soffer 2014-01-01 08:43:48 UTC
(In reply to Elad from comment #6)
> Created attachment 843839 [details]
> logs2
> 
> Reproduced again:
> AcquireHostIdFailure: Cannot acquire host id:
> ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock
> lockspace add failure', 'Operation now in progress'))

This is a different error than the previous one.

This seems to be a random failure which is very hard to reproduce. Elad failed to reproduce the original error after trying many times, and could reproduce this new error only once after many tries.

Federico, what could cause these random failures to acquire the host id?

Comment 8 Allon Mureinik 2014-01-01 09:02:12 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Elad from comment #6)
> > Created attachment 843839 [details]
> > logs2
> > 
> > Reproduced again:
> > AcquireHostIdFailure: Cannot acquire host id:
> > ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock
> > lockspace add failure', 'Operation now in progress'))
> 
> This is a different error than the previous one.
> 
> This seems to be a random failure which is very hard to reproduce. Elad
> failed to reproduce the original error after trying many times, and could
> reproduce this new error only once after many tries.
> 
> Federico, what could cause these random failures to acquire the host id?

If we're syncing the domain WHILE IT'S ACTIVE, including the leases, who knows what garbage state they'll end up in on the target domain, especially if the copying is done by a trivial cp operation.

Federico - would it make any sense to clear all the leases when we activate a domain?
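The torn-copy scenario described above can be illustrated with a toy model. This is pure simulation (it does not use sanlock's actual on-disk lease format): a fixed-size record is being rewritten by the active host while a plain copy reads it, so the copy captures a mix of old and new bytes:

```python
# Toy model: a "lease" is a fixed-size record periodically rewritten by
# the host holding it. Copying while the writer is mid-update can
# capture a mix of old and new bytes (a torn read) -- the kind of
# garbage state a plain cp of an ACTIVE domain can produce.
old = bytes(b"A" * 512)   # record contents before the update
new = bytes(b"B" * 512)   # record contents the writer is storing

# The writer has flushed only the first half when the copy happens:
torn = new[:256] + old[256:]
assert torn != old and torn != new   # neither valid state survives
print(torn[:4], torn[-4:])           # b'BBBB' b'AAAA'
```

This is why the copy is only safe once the domain is in maintenance and nothing is writing the leases.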

Comment 9 Nir Soffer 2014-01-01 09:08:39 UTC
(In reply to Elad from comment #6)
> Reproduced again:

Elad, did you copy the domain data while the domain was active when this error was reproduced, or after the domain was deactivated?

Comment 10 Elad 2014-01-01 10:33:44 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Elad from comment #6)
> > Reproduced again:
> 
> Elad, did you copy the domain data while the domain was active when this
> error was reproduced, or after the domain was deactivated?

I can't tell you for sure

Comment 11 Nir Soffer 2014-01-01 10:50:12 UTC
(In reply to Elad from comment #10)
> (In reply to Nir Soffer from comment #9)
> > (In reply to Elad from comment #6)
> > > Reproduced again:
> > 
> > Elad, did you copy the domain data while the domain was active when this
> > error was reproduced, or after the domain was deactivated?
> 
> I can't tell you for sure

So I assume that this reproduction is invalid. Please try to reproduce again by:
1. Move the domain into maintenance mode
2. Create a new directory on the nfs server
3. Copy the data from the old domain to the new directory
4. Edit the path in the domain to point to the new directory
5. Activate the domain

Comment 13 Elad 2014-01-06 14:27:54 UTC
I'll update if I encounter the issue again. Meanwhile, let's try to understand the root cause of this failure.

Comment 14 Ayal Baron 2014-01-09 18:57:05 UTC
Nir, do we have enough info here to try and find the root cause?

Comment 15 Nir Soffer 2014-02-04 11:12:14 UTC
(In reply to Ayal Baron from comment #14)
> Nir, do we have enough info here to try and find the root cause?

I did not find a root cause in the logs. I suggest closing this.

Comment 16 Ayal Baron 2014-02-04 12:03:28 UTC
Please reopen if this happens again and attach updated logs.
Thanks.

