Description of problem:
Had an arbiter volume in my cluster and, while I/O was going on, brought down the first data brick, brought it back up, and then brought the second one down. After bringing all the bricks back up, I see the issues below:
1) Hypervisors flip between Normal and Contending, but never acquire SPM.
2) The primary storage will not activate, and all the storage domains are in Unknown state.
3) The following event message is seen in the UI: "VDSM <host_name> command HSMGetTaskStatusVDS failed: Cannot acquire host id"

The following traceback is seen in the vdsm.log file:
================================================
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 306, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 790, in acquireHostId
    self._manifest.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/sd.py", line 449, in acquireHostId
    self._domainLock.acquireHostId(hostId, async)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 297, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'7b189c07-19f5-419a-badf-15f0cd519169', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))

Version-Release number of selected component (if applicable):
vdsm-4.19.10.1-1.el7ev.x86_64

How reproducible:
Hit it once

Steps to Reproduce:
1. Install HC.
2. Start running I/O using the fio tool.
3. Bring down the first data brick in the volume.
4. After some time, bring the first data brick back up.
5. Bring down the second data brick.
6. Bring the second data brick back up.
Actual results:
A SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error') is seen in the vdsm logs, none of the storage domains are active, and the hypervisor flips between Contending and Normal but never becomes SPM.

Expected results:
There should not be any exceptions in the logs, the user should be able to activate the storage domains, and the hypervisors should not flip between Normal and Contending.

Additional info:
Copied the engine, vdsm, and sanlock logs from all the hosts in the cluster to the link below: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/
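The traceback in the description shows vdsm wrapping the sanlock error in an AcquireHostIdFailure at clusterlock.py line 297. A minimal sketch of that error-translation pattern (the exception classes and the add_lockspace stub below are simplified stand-ins, not the real sanlock/vdsm code):

```python
import errno

# Simplified stand-ins for the real sanlock.SanlockException and
# vdsm's se.AcquireHostIdFailure (assumptions based on the traceback).
class SanlockException(Exception):
    pass

class AcquireHostIdFailure(Exception):
    pass

def add_lockspace(sd_uuid, host_id, ids_path):
    # The real sanlock call does direct I/O against dom_md/ids; here we
    # simulate the EIO (errno 5) failure seen in the logs.
    raise SanlockException(errno.EIO, 'Sanlock lockspace add failure',
                           'Input/output error')

def acquire_host_id(sd_uuid, host_id, ids_path):
    # Mirrors the pattern at clusterlock.py:297 in the traceback: any
    # sanlock failure is re-raised as a storage-layer exception that
    # carries the storage-domain UUID and the original error.
    try:
        add_lockspace(sd_uuid, host_id, ids_path)
    except SanlockException as e:
        raise AcquireHostIdFailure(sd_uuid, e)
```

This is why the engine-side events report "Cannot acquire host id" with the SanlockException tuple embedded: the underlying I/O error from the mount is carried through unchanged.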
Just ran into this on RHV 4.1.2. Logs are uploaded here: ftp://10.15.108.31/pub/sosreport-LogCollector-20170615124703.tar.xz
For comment #2 - the time at which the issue occurred:

2017-06-15 12:33:29,561-05 INFO [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-36) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2017-06-15 12:33:29,605-05 INFO [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-60) [ca57a15] Running command: CreateUserSessionCommand internal: false.
2017-06-15 12:33:29,655-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-22) [4f29dd5b] FINISH, ConnectStorageServerVDSCommand, return: {5b50f056-83f5-4193-a391-b1b26f4d8ed2=0}, log id: 5821782
2017-06-15 12:33:29,754-05 INFO [org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-60) [7b1172de] Running command: LogoutSessionCommand internal: false.
2017-06-15 12:33:29,759-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-60) [7b1172de] EVENT_ID: USER_VDC_LOGOUT(31), Correlation ID: 7b1172de, Call Stack: null, Custom Event ID: -1, Message: User admin@internal-authz logged out.
2017-06-15 12:33:30,656-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-12) [4d14b49f] FINISH, ConnectStorageServerVDSCommand, return: {5b50f056-83f5-4193-a391-b1b26f4d8ed2=0}, log id: 16cccdb4
2017-06-15 12:33:30,659-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] START, HSMGetStorageDomainInfoVDSCommand(HostName = rhev5.lab.msp.redhat.com, HSMGetStorageDomainInfoVDSCommandParameters:{runAsync='true', hostId='7ae9bfa3-bafd-43b9-87fb-ffe8bef6b9e8', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b'}), log id: 55b48583
2017-06-15 12:33:31,558-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='tatooine-test2', id='91a58987-63f0-4c12-8a2b-f13693e2993b'}, 58c0d0a3-0242-0193-00d9-00000000007e>, log id: 55b48583
2017-06-15 12:33:31,559-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] START, DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', ignoreFailoverLimit='false', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'}), log id: 1bcf9c7e
2017-06-15 12:33:33,476-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Failed in 'DetachStorageDomainVDS' method
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command DetachStorageDomainVDS failed: Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command 'DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', ignoreFailoverLimit='false', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'})' execution failed: IRSGenericException: IRSErrorException: Failed to DetachStorageDomainVDS, error = Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device')), code = 661
2017-06-15 12:33:33,483-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] FINISH, DetachStorageDomainVDSCommand, log id: 1bcf9c7e
2017-06-15 12:33:33,483-05 WARN [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Detaching Storage Domain '91a58987-63f0-4c12-8a2b-f13693e2993b' from it's previous storage pool '00000000-0000-0000-0000-000000000000' has failed. The meta data of the Storage Domain might still indicate that it is attached to a different Storage Pool.
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command 'org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Failed to DetachStorageDomainVDS, error = Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device')), code = 661 (Failed with error AcquireHostIdFailure and code 661)
2017-06-15 12:33:33,485-05 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command [id=13b659cf-ba2a-499e-acda-1c3dd194f321]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: StoragePoolIsoMapId:{storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', storageId='91a58987-63f0-4c12-8a2b-f13693e2993b'}.
2017-06-15 12:33:33,491-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] EVENT_ID: USER_ATTACH_STORAGE_DOMAIN_TO_POOL_FAILED(963), Correlation ID: 88b90f60-4704-41b1-9ddc-73c2274cacbd, Job ID: 39f45055-4b21-4514-94d0-fd6a706da95c, Call Stack: null, Custom Event ID: -1, Message: Failed to attach Storage Domain tatooine-test2 to Data Center DEV. (User: admin@internal-authz)
2017-06-15 12:33:33,495-05 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Lock freed to object 'EngineLock:{exclusiveLocks='[91a58987-63f0-4c12-8a2b-f13693e2993b=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
4.1.4 is planned as a minimal, fast, z-stream version to fix any open issues we may have in supporting the upcoming EL 7.4. Pushing out anything unrelated, although if there's a minimal/trivial, SAFE fix that's ready on time, we can consider introducing it in 4.1.4.
David hi,

Can you explain the meaning of these sanlock errors?

From the sanlock log http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/sanlock.log

2017-04-12 18:04:40+0530 615782 [31431]: s12 renewal error -5 delta_length 0 last_success 615761
2017-04-12 18:04:40+0530 615782 [31431]: 7b189c07 aio collect RD 0x7fb9c00008c0:0x7fb9c00008d0:0x7fb9d44ab000 result -5:0 match res
2017-04-12 18:04:40+0530 615782 [31431]: s12 delta_renew read rv -5 offset 0 /rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids
2017-04-12 18:25:17+0530 617019 [26942]: s211 add_lockspace fail result -5
2017-04-12 18:25:22+0530 617024 [26942]: s212 lockspace 7b189c07-19f5-419a-badf-15f0cd519169:1:/rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids:0
2017-04-12 18:25:22+0530 617024 [773]: 7b189c07 aio collect RD 0x7fb9c00008c0:0x7fb9c00008d0:0x7fb9c0101000 result -5:0 match res
2017-04-12 18:25:22+0530 617024 [773]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids
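The negative result codes in the sanlock log are negated kernel errno values, so the -5 here and the error numbers in the exceptions can be decoded with the standard errno module:

```python
import errno
import os

# sanlock logs negated errno values; -5 appears throughout the log
# excerpt above, and 5/19 appear in the SanlockException tuples.
for rv in (-5, -19):
    code = -rv
    # errno 5 is EIO ("Input/output error") and errno 19 is ENODEV
    # ("No such device"), matching the exception messages in the logs.
    print(code, errno.errorcode[code], os.strerror(code))
```

So "renewal error -5" and "add_lockspace fail result -5" both mean sanlock's reads of the dom_md/ids file on the gluster mount are failing with EIO.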
sanlock is getting i/o errors from gluster. You'll have to ask gluster people if gluster is supposed to transparently tolerate the sort of test you are running, and if so why it seems to not work for you.
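One quick way to check whether the mount itself is returning I/O errors, independent of sanlock, is to attempt a small read of the lease file and inspect the errno. This is a hypothetical helper, not part of vdsm; the path to probe would be the one from the sanlock log (/rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids):

```python
import errno
import os

def probe_read(path, length=512):
    """Try to read the first bytes of a file; return (ok, errno).

    On a gluster mount that is returning EIO (as in the sanlock log),
    this surfaces the same error that sanlock's delta_renew read hits:
    probe_read(...) would return (False, errno.EIO), i.e. (False, 5).
    """
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, length)
        finally:
            os.close(fd)
        return True, 0
    except OSError as e:
        return False, e.errno
```

If this read fails with EIO while all bricks are reportedly up, that points at the gluster layer rather than sanlock or vdsm.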
Denis, do you know who from the gluster team can take a look at this issue? Thanks.
Kasturi, who reported that BZ, may help.
Hi Denis, can you elaborate on what specific info you are looking for from me? Thanks, kasturi
Hi Kasturi, Fred Rolland requested some help with gluster.
Sahina hi, Is this scenario supposed to work ? It seems that sanlock is unable to read from the GlusterFS after this flow.
(In reply to Fred Rolland from comment #11)
> Sahina hi,
>
> Is this scenario supposed to work ?
> It seems that sanlock is unable to read from the GlusterFS after this flow.

Yes, with one brick down the gluster volume is still available and the reads should succeed. Is there a minimum time within which the reads are expected to be serviced?

Kasturi, was the test run on the "data" volume? I see errors in the mount logs for the data volume under /sosreport-rhsqa-grafton5.lab.eng.blr.redhat.com.1441697-20170413112939/var/log/glusterfs/.
sahina, yes. volume was "data"
Sahina, according to the sanlock log in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/sanlock.log it fails to access the storage from 2017-04-12 18:04 until at least 2017-04-12 19:21. It seems that access to the file system was not restored successfully.
I cannot find any issue on the oVirt side for now. Sanlock is failing to get access to the file system. Moving to gluster.
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.
Please re-open if still relevant.