Bug 1046430

Summary: [vdsm] spmStart fails with AcquireHostIdFailure after changing storage connection path on POSIXFS DC
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Aharon Canan <acanan>
Severity: medium
Priority: unspecified
Version: 3.3.0
CC: abaron, amureini, bazulay, ebenahar, fsimonce, iheim, lpeer, nsoffer, scohen, yeylon
Target Milestone: ---
Flags: amureini: Triaged+
Target Release: 3.4.0
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Last Closed: 2014-02-04 12:03:28 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 1078909, 1142926
Attachments:
- engine, vdsm, sanlock and libvirt logs (attachment 841434)
- logs2 (attachment 843839)

Description Elad 2013-12-25 09:15:05 UTC
Created attachment 841434 [details]
engine, vdsm, sanlock and libvirt logs

Description of problem:
Manage storage connection:
After changing the storage connection path of a POSIXFS storage domain (the master domain) and activating it, vdsm tries to perform spmStart. The operation fails with the following error:

4c2bf930-f92d-4d8d-8aec-a508c8ee18c5::ERROR::2013-12-25 10:35:30,543::task::850::TaskManager.Task::(_setError) Task=`4c2bf930-f92d-4d8d-8aec-a508c8ee18c5`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
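
For context, the clusterlock.py frame in the traceback is vdsm's wrapper around sanlock's add_lockspace call for the domain's ids file, which converts a SanlockException into AcquireHostIdFailure. Below is a minimal sketch of that path, assuming the sanlock Python binding's add_lockspace(lockspace, host_id, path) signature and a simplified stand-in for vdsm's exception class; it is not the exact vdsm source, only an illustration of where the errors above come from:

    import errno
    import sanlock  # Python binding shipped with sanlock


    class AcquireHostIdFailure(Exception):
        """Simplified stand-in for vdsm's se.AcquireHostIdFailure."""


    def acquire_host_id(sd_uuid, host_id, ids_path):
        """Join the domain's sanlock lockspace, as startSpm does via
        masterDomain.acquireHostId() in the traceback above."""
        try:
            # The lockspace is backed by the 'ids' file under the domain's
            # dom_md directory on the (new) mount point.
            sanlock.add_lockspace(sd_uuid, host_id, ids_path)
        except sanlock.SanlockException as e:
            if e.errno == errno.EEXIST:
                return  # host id already acquired, nothing to do
            # Simplified: the real vdsm code handles more cases, but errors
            # like the errno 19 seen here end up wrapped and re-raised,
            # which aborts spmStart.
            raise AcquireHostIdFailure(sd_uuid, e)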

Version-Release number of selected component (if applicable):
is28.1
vdsm-4.13.2-0.3.el6ev.x86_64
libvirt-0.10.2-29.el6.x86_64
sanlock-2.8-1.el6.x86_64
rhevm-3.3.0-0.41.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
On a POSIXFS DC with 1 host:
1. Create 1 POSIX data domain and specify 'nfs' under VFS Type
2. Create a new mount point and copy all the data from the domain's mount point to the new one
3. Put the domain in maintenance and edit its storage connection path, changing it to the new mount point
4. Activate the domain

Actual results:
spmStart fails. When the storage connection path is changed back to the old one, spmStart succeeds.

Expected results:
spmStart should work.

Additional info: 
engine, vdsm, sanlock and libvirt logs

Comment 1 Nir Soffer 2013-12-26 18:04:26 UTC
Elad: does this work with other storage domains?

Comment 2 Nir Soffer 2013-12-26 18:07:18 UTC
Allon: is this operation (copying domain data manually from one domain to another) supported?

Comment 3 Nir Soffer 2013-12-26 18:09:50 UTC
Federico: what can cause this sanlock failure?

Comment 4 Elad 2013-12-29 07:59:10 UTC
(In reply to Nir Soffer from comment #2)
> Allon: is this operation (copying domain data manually from one domain to
> another) supported?

It works fine with NFS and iSCSI.

Comment 5 Allon Mureinik 2013-12-29 15:09:30 UTC
(In reply to Nir Soffer from comment #1)
> Elad: does this work with other storage domains?

(In reply to Nir Soffer from comment #2)
> Allon: is this operation (copying domain data manually from one domain to
> another) supported?
Yup - that's the point of the edit connection feature.

Comment 6 Elad 2013-12-31 16:18:10 UTC
Created attachment 843839 [details]
logs2

Reproduced again:

f964988f-5e26-41f0-a701-d794d7fb087d::ERROR::2013-12-31 17:37:43,735::task::850::TaskManager.Task::(_setError) Task=`f964988f-5e26-41f0-a701-d794d7fb087d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 266, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 185, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress'))


attaching logs (logs2)

Comment 7 Nir Soffer 2014-01-01 08:43:48 UTC
(In reply to Elad from comment #6)
> Created attachment 843839 [details]
> logs2
> 
> Reproduced again:
> AcquireHostIdFailure: Cannot acquire host id:
> ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock
> lockspace add failure', 'Operation now in progress'))

This is a different error than the previous one.

This seems to be a random failure that is very hard to reproduce. Elad failed to reproduce the original error after trying many times, and could reproduce this new error only once after many tries.

Federico, what could cause these random failures to acquire the host id?
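
For reference, the sanlock error codes seen in the two reproductions map to standard Linux errno values; a quick check with Python's errno and os modules (nothing vdsm-specific here):

    import errno
    import os

    # Error codes carried by the SanlockException in the two tracebacks
    for code in (19, 115):
        print(code, errno.errorcode[code], os.strerror(code))

    # On Linux this prints:
    # 19 ENODEV No such device                  (first reproduction)
    # 115 EINPROGRESS Operation now in progress (second reproduction)

So the first failure reports that the lockspace's backing file could not be found, while the second reports that the add_lockspace operation was still in progress.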

Comment 8 Allon Mureinik 2014-01-01 09:02:12 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Elad from comment #6)
> > Created attachment 843839 [details]
> > logs2
> > 
> > Reproduced again:
> > AcquireHostIdFailure: Cannot acquire host id:
> > ('d136bc03-5cb4-4dce-a2c2-e439d6453867', SanlockException(115, 'Sanlock
> > lockspace add failure', 'Operation now in progress'))
> 
> This is a different error than the previous one.
> 
> This seems to be a random failure that is very hard to reproduce. Elad
> failed to reproduce the original error after trying many times, and could
> reproduce this new error only once after many tries.
> 
> Federico, what could cause these random failures to acquire the host id?

If we're syncing the domain WHILE IT'S ACTIVE, including the leases, who knows what garbage state they'll end up in on the target domain, especially if the copying is done by a trivial cp operation.

Federico - would it make any sense to clear all the leases when we activate a domain?

Comment 9 Nir Soffer 2014-01-01 09:08:39 UTC
(In reply to Elad from comment #6)
> Reproduced again:

Elad, did you copy the domain data while the domain was active when this error was reproduced, or after the domain was deactivated?

Comment 10 Elad 2014-01-01 10:33:44 UTC
(In reply to Nir Soffer from comment #9)
> (In reply to Elad from comment #6)
> > Reproduced again:
> 
> Elad, did you copy the domain data while the domain was active when this
> error was reproduced, or after the domain was deactivated?

I can't tell you for sure

Comment 11 Nir Soffer 2014-01-01 10:50:12 UTC
(In reply to Elad from comment #10)
> (In reply to Nir Soffer from comment #9)
> > (In reply to Elad from comment #6)
> > > Reproduced again:
> > 
> > Elad, did you copy the domain data while the domain was active when this
> > error was reproduced, or after the domain was deactivated?
> 
> I can't tell you for sure

So I assume this reproduction is invalid. Please try to reproduce again by:
1. Moving the domain into maintenance mode
2. Creating a new directory on the NFS server
3. Copying the data from the old domain to the new directory (see the sketch below)
4. Editing the domain's path to point to the new directory
5. Activating the domain
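
A minimal sketch of step 3, assuming hypothetical mount points /mnt/old-domain and /mnt/new-domain for the old and new exports (the paths are illustrative, not taken from the logs); the key point is that the copy happens only after step 1, while the domain is in maintenance and sanlock is no longer writing to the lease files:

    import shutil

    # Hypothetical mount points of the old and new exports on the host
    OLD_MOUNT = "/mnt/old-domain"
    NEW_MOUNT = "/mnt/new-domain"

    # Copy the whole domain tree, including dom_md/ with the sanlock
    # 'ids' and 'leases' files, only after the domain is in maintenance.
    # (dirs_exist_ok requires Python >= 3.8; a plain 'cp -a' works as well.)
    shutil.copytree(OLD_MOUNT, NEW_MOUNT, dirs_exist_ok=True)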

Comment 13 Elad 2014-01-06 14:27:54 UTC
I'll update if I'll encounter the issue again. Meanwhile lets try to understand the root of this failure.

Comment 14 Ayal Baron 2014-01-09 18:57:05 UTC
Nir, do we have enough info here to try and find the root cause?

Comment 15 Nir Soffer 2014-02-04 11:12:14 UTC
(In reply to Ayal Baron from comment #14)
> Nir, do we have enough info here to try and find the root cause?

I did not find the root cause in the logs. I suggest closing this bug.

Comment 16 Ayal Baron 2014-02-04 12:03:28 UTC
Please reopen if this happens again and attach updated logs.
Thanks.