Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1107585

Summary: Failed to activate Storage Domain: SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error')
Product: [Retired] oVirt
Component: vdsm
Version: 3.3
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Whiteboard: gluster
oVirt Team: Gluster
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: 3.5.0
Reporter: lily <junli.li>
Assignee: Dan Kenigsberg <danken>
QA Contact: Gil Klein <gklein>
CC: amureini, bazulay, bugs, gklein, iheim, junli.li, mgoldboi, pedro.aleluia, rbalakri, sabose, yeylon
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-10-01 14:26:27 UTC
Attachments:
logs for this bug

Description lily 2014-06-10 09:35:12 UTC
Description of problem:
The storage domain's type is GlusterFS. Today its status changed from Active to Unknown. We then tried to activate it manually, but the activation failed.
vdsm.log:
9a3d82da-cc52-4f1a-bf2d-ad43c44555ed::ERROR::2014-06-10 11:03:02,071::task::850::TaskManager.Task::(_setError) Task=`9a3d82da-cc52-4f1a-bf2d-ad43c44555ed`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('c13bf545-2d42-4ee8-8ca1-a1413e01595e', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))

Version-Release number of selected component (if applicable):


How reproducible:
Rare

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
The storage domain activates successfully.

Additional info:

Comment 1 Allon Mureinik 2014-06-11 06:28:08 UTC
Lily, can you please add oVirt Engine, VDSM and sanlock logs?

Comment 2 lily 2014-06-11 07:15:21 UTC
Created attachment 907500 [details]
logs for this bug

including engine.log, vdsm.log and sanlock.log

Comment 3 Federico Simoncelli 2014-08-06 09:23:11 UTC
This is the typical behavior when the gluster mount either doesn't have the quorum or there's a split-brain.

2014-06-10 11:00:55+0800 1627 [6981]: sanlock daemon started 2.8 host ead0bc03-69f7-487f-8460-3865bd9f8637.su04
2014-06-10 11:02:55+0800 1747 [6986]: s1 lockspace c13bf545-2d42-4ee8-8ca1-a1413e01595e:1:/rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids:0
2014-06-10 11:02:55+0800 1747 [7419]: c13bf545 aio collect 0 0x7f90840008c0:0x7f90840008d0:0x7f9084001000 result -5:0 match res
2014-06-10 11:02:55+0800 1747 [7419]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids
2014-06-10 11:02:56+0800 1748 [6986]: s1 add_lockspace fail result -5
2014-06-10 11:03:01+0800 1753 [6985]: s2 lockspace c13bf545-2d42-4ee8-8ca1-a1413e01595e:1:/rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids:0
2014-06-10 11:03:01+0800 1753 [7436]: c13bf545 aio collect 0 0x7f90840008c0:0x7f90840008d0:0x7f9084101000 result -5:0 match res
2014-06-10 11:03:01+0800 1753 [7436]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids
2014-06-10 11:03:02+0800 1754 [6985]: s2 add_lockspace fail result -5

fail result -5 == EIO (Input/output error)
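This mapping can be sanity-checked from any Linux shell; a minimal sketch (sanlock logs the negated errno, so result -5 corresponds to errno 5, EIO):

```shell
# sanlock logs 'add_lockspace fail result -5'; the value is -errno,
# and errno 5 is EIO ('Input/output error'), matching the
# SanlockException(5, ...) raised by vdsm in the traceback above.
python3 -c 'import errno, os; assert errno.EIO == 5; print(os.strerror(errno.EIO))'
```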

Anyway we're not able to read/write from/to the gluster mount, there's nothing much that we can do (it's just as any other unreachable storage domain)

If you add more logs and information about gluster maybe someone from the gluster team can help you.

Comment 4 Allon Mureinik 2014-08-06 10:22:57 UTC
(In reply to Federico Simoncelli from comment #3)
> Anyway we're not able to read/write from/to the gluster mount, there's
> nothing much that we can do (it's just as any other unreachable storage
> domain)
> 
> If you add more logs and information about gluster maybe someone from the
> gluster team can help you.
Moving whiteboard to gluster to take charge of this issue.

Comment 5 Pedro Aleluia 2014-09-07 13:18:25 UTC
Same error here, with an NFS Data (Master) domain:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 981, in createStoragePool
    leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 580, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 512, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 477, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 237, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'69a52632-5037-400a-8761-276e624743fa', SanlockException(-203, 'Sanlock lockspace add failure', 'Sanlock exception'))
Thread-91::DEBUG::2014-09-07 13:48:40,289::task::885::Storage.TaskManager.Task::(_run) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::Task._run: aea3b71a-e312-4748-862a-28282e940cd6 (None, 'abb1264c-5e4d-48ab-b64b-253f01a0dd7d', 'Branch_Offices', '69a52632-5037-400a-8761-276e624743fa', ['69a52632-5037-400a-8761-276e624743fa'], 19, None, 5, 60, 10, 3) {} failed - stopping task
Thread-91::DEBUG::2014-09-07 13:48:40,290::task::1217::Storage.TaskManager.Task::(stop) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::stopping in state preparing (force False)
Thread-91::DEBUG::2014-09-07 13:48:40,290::task::993::Storage.TaskManager.Task::(_decref) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::ref 1 aborting True
Thread-91::INFO::2014-09-07 13:48:40,291::task::1171::Storage.TaskManager.Task::(prepare) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-91::DEBUG::2014-09-07 13:48:40,291::task::1176::Storage.TaskManager.Task::(prepare) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::Prepare: aborted: Cannot acquire host id
Thread-91::DEBUG::2014-09-07 13:48:40,291::task::993::Storage.TaskManager.Task::(_decref) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::ref 0 aborting True
Thread-91::DEBUG::2014-09-07 13:48:40,291::task::928::Storage.TaskManager.Task::(_doAbort) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::Task._doAbort: force False
Thread-91::DEBUG::2014-09-07 13:48:40,292::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-91::DEBUG::2014-09-07 13:48:40,292::task::595::Storage.TaskManager.Task::(_updateState) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::moving from state preparing -> state aborting
Thread-91::DEBUG::2014-09-07 13:48:40,292::task::550::Storage.TaskManager.Task::(__state_aborting) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::_aborting: recover policy none
Thread-91::DEBUG::2014-09-07 13:48:40,293::task::595::Storage.TaskManager.Task::(_updateState) Task=`aea3b71a-e312-4748-862a-28282e940cd6`::moving from state aborting -> state failed
Thread-91::DEBUG::2014-09-07 13:48:40,293::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d': < ResourceRef 'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d', isValid: 'True' obj: 'None'>, 'Storage.69a52632-5037-400a-8761-276e624743fa': < ResourceRef 'Storage.69a52632-5037-400a-8761-276e624743fa', isValid: 'True' obj: 'None'>}
Thread-91::DEBUG::2014-09-07 13:48:40,293::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-91::DEBUG::2014-09-07 13:48:40,294::resourceManager::616::Storage.ResourceManager::(releaseResource) Trying to release resource 'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d'
Thread-91::DEBUG::2014-09-07 13:48:40,294::resourceManager::635::Storage.ResourceManager::(releaseResource) Released resource 'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d' (0 active users)
Thread-91::DEBUG::2014-09-07 13:48:40,295::resourceManager::641::Storage.ResourceManager::(releaseResource) Resource 'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d' is free, finding out if anyone is waiting for it.
Thread-91::DEBUG::2014-09-07 13:48:40,295::resourceManager::649::Storage.ResourceManager::(releaseResource) No one is waiting for resource 'Storage.abb1264c-5e4d-48ab-b64b-253f01a0dd7d', Clearing records.
Thread-91::DEBUG::2014-09-07 13:48:40,296::resourceManager::616::Storage.ResourceManager::(releaseResource) Trying to release resource 'Storage.69a52632-5037-400a-8761-276e624743fa'
Thread-91::DEBUG::2014-09-07 13:48:40,296::resourceManager::635::Storage.ResourceManager::(releaseResource) Released resource 'Storage.69a52632-5037-400a-8761-276e624743fa' (0 active users)
Thread-91::DEBUG::2014-09-07 13:48:40,296::resourceManager::641::Storage.ResourceManager::(releaseResource) Resource 'Storage.69a52632-5037-400a-8761-276e624743fa' is free, finding out if anyone is waiting for it.
Thread-91::DEBUG::2014-09-07 13:48:40,297::resourceManager::649::Storage.ResourceManager::(releaseResource) No one is waiting for resource 'Storage.69a52632-5037-400a-8761-276e624743fa', Clearing records.
Thread-91::ERROR::2014-09-07 13:48:40,297::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Cannot acquire host id: (u'69a52632-5037-400a-8761-276e624743fa', SanlockException(-203, 'Sanlock lockspace add failure', 'Sanlock exception'))", 'code': 661}}
Thread-91::DEBUG::2014-09-07 13:48:43,047::BindingXMLRPC::1132::vds::(wrapper) client [10.100.207.8]::call volumesList with () {}
Thread-91::DEBUG::2014-09-07 13:48:43,065::BindingXMLRPC::1139::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {}}
Thread-91::DEBUG::2014-09-07 13:48:48,576::BindingXMLRPC::1132::vds::(wrapper) client [10.100.207.8]::call volumesList with () {}
Thread-91::DEBUG::2014-09-07 13:48:48,594::BindingXMLRPC::1139::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {}}
Thread-91::DEBUG::2014-09-07 13:48:52,298::task::595::Storage.TaskManager.Task::(_updateState) Task=`1cd58055-d4a8-425b-9b12-86355736dbf1`::moving from state init -> state preparing
Thread-91::INFO::2014-09-07 13:48:52,301::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-91::INFO::2014-09-07 13:48:52,303::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {}
Thread-91::DEBUG::2014-09-07 13:48:52,305::task::1191::Storage.TaskManager.Task::(prepare) Task=`1cd58055-d4a8-425b-9b12-86355736dbf1`::finished: {}
Thread-91::DEBUG::2014-09-07 13:48:52,307::task::595::Storage.TaskManager.Task::(_updateState) Task=`1cd58055-d4a8-425b-9b12-86355736dbf1`::moving from state preparing -> state finished
Thread-91::DEBUG::2014-09-07 13:48:52,309::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-91::DEBUG::2014-09-07 13:48:52,311::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-91::DEBUG::2014-09-07 13:48:52,314::task::993::Storage.TaskManager.Task::(_decref) Task=`1cd58055-d4a8-425b-9b12-86355736dbf1`::ref 0 aborting False
Thread-91::DEBUG::2014-09-07 13:48:54,119::BindingXMLRPC::1132::vds::(wrapper) client [10.100.207.8]::call volumesList with () {}
Thread-91::DEBUG::2014-09-07 13:48:54,138::BindingXMLRPC::1139::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {}}
Thread-91::DEBUG::2014-09-07 13:48:59,658::BindingXMLRPC::1132::vds::(wrapper) client [10.100.207.8]::call volumesList with () {} flowID [69ac36bb]
Thread-91::DEBUG::2014-09-07 13:48:59,677::BindingXMLRPC::1139::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {}}
Thread-91::DEBUG::2014-09-07 13:49:05,228::BindingXMLRPC::1132::vds::(wrapper) client [10.100.207.8]::call volumesList with () {}

Comment 6 lily 2014-09-16 08:33:07 UTC
We finally found that the file /rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids was in split-brain. After resolving the split-brain, this bug disappeared.
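For anyone hitting the same symptom, the usual diagnosis can be sketched with the gluster CLI. This is a hedged sketch, not the exact commands used here: the volume name neofs-images is inferred from the mount path in the logs, and the resolution step (discarding the stale replica) depends on which brick holds the good copy.

```shell
# Assumed volume name, inferred from the mount path su04:neofs-images.
VOLNAME=neofs-images

# List files the self-heal daemon considers split-brained; the sanlock
# 'ids' file should appear here if this is the same failure mode.
gluster volume heal "$VOLNAME" info split-brain

# Reproduce the EIO from the client side, mirroring sanlock's direct
# read of the first sector of dom_md/ids (path taken from the logs):
dd if=/rhev/data-center/mnt/glusterSD/su04:neofs-images/c13bf545-2d42-4ee8-8ca1-a1413e01595e/dom_md/ids \
   of=/dev/null bs=512 count=1 iflag=direct

# After resolving the split-brain (e.g. removing the stale copy and its
# .glusterfs gfid hardlink on one brick), trigger a heal and re-check:
gluster volume heal "$VOLNAME"
gluster volume heal "$VOLNAME" info
```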

Comment 7 Sahina Bose 2014-10-01 14:26:27 UTC
Closing this bug as per comment 6.