Description of problem:
Had an arbiter volume in my cluster and, while I/O was going on, brought down the first data brick, brought it back up, and then brought the second one down. After bringing all the bricks back up, I see the issues below:
1) Hypervisors flip between Normal and Contending, but never acquire SPM.
2) The primary storage will not activate, and all the storage domains are in Unknown state.
3) The following event message is seen in the UI: "VDSM <host_name> command HSMGetTaskStatusVDS failed: Cannot acquire host id"

The following traceback is seen in the vdsm.log file:
================================================
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 306, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 790, in acquireHostId
    self._manifest.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/sd.py", line 449, in acquireHostId
    self._domainLock.acquireHostId(hostId, async)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 297, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'7b189c07-19f5-419a-badf-15f0cd519169', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))

Version-Release number of selected component (if applicable):
vdsm-4.19.10.1-1.el7ev.x86_64

How reproducible:
Hit it once

Steps to Reproduce:
1. Install HC.
2. Start running I/O using the fio tool.
3. Bring down the first data brick in the volume.
4. After some time, bring the first data brick back up.
5. Bring down the second data brick.
6. Bring the second data brick back up.
Actual results:
A SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error') is seen in the vdsm logs, none of the storage domains are active, and the hypervisor flips between Contending and Normal but never becomes SPM.

Expected results:
There should not be any exceptions in the logs, the user should be able to activate the storage domains, and the hypervisors should not flip between Normal and Contending.

Additional info:
Copied the engine, vdsm, and sanlock logs from all the hosts in the cluster to the link below: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/
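The traceback in the description shows vdsm wrapping the sanlock error in an AcquireHostIdFailure at clusterlock.py line 297. A minimal sketch of that error-translation pattern (the exception classes and the add_lockspace stub below are simplified stand-ins, not the real sanlock/vdsm code):

```python
import errno

# Simplified stand-ins for the real sanlock.SanlockException and
# vdsm's se.AcquireHostIdFailure (assumptions based on the traceback).
class SanlockException(Exception):
    pass

class AcquireHostIdFailure(Exception):
    pass

def add_lockspace(sd_uuid, host_id, ids_path):
    # The real sanlock call does direct I/O against dom_md/ids; here we
    # simulate the EIO (errno 5) failure seen in the logs.
    raise SanlockException(errno.EIO, 'Sanlock lockspace add failure',
                           'Input/output error')

def acquire_host_id(sd_uuid, host_id, ids_path):
    # Mirrors the pattern at clusterlock.py:297 in the traceback: any
    # sanlock failure is re-raised as a storage-layer exception that
    # carries the storage-domain UUID and the original error.
    try:
        add_lockspace(sd_uuid, host_id, ids_path)
    except SanlockException as e:
        raise AcquireHostIdFailure(sd_uuid, e)
```

This is why the engine-side events report "Cannot acquire host id" with the SanlockException tuple embedded: the underlying I/O error from the mount is carried through unchanged.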
Just ran into this on RHV 4.1.2. Logs are uploaded here: ftp://10.15.108.31/pub/sosreport-LogCollector-20170615124703.tar.xz
For comment #2 - the time at which the issue occurred:

2017-06-15 12:33:29,561-05 INFO [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-36) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2017-06-15 12:33:29,605-05 INFO [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-60) [ca57a15] Running command: CreateUserSessionCommand internal: false.
2017-06-15 12:33:29,655-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-22) [4f29dd5b] FINISH, ConnectStorageServerVDSCommand, return: {5b50f056-83f5-4193-a391-b1b26f4d8ed2=0}, log id: 5821782
2017-06-15 12:33:29,754-05 INFO [org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-60) [7b1172de] Running command: LogoutSessionCommand internal: false.
2017-06-15 12:33:29,759-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-60) [7b1172de] EVENT_ID: USER_VDC_LOGOUT(31), Correlation ID: 7b1172de, Call Stack: null, Custom Event ID: -1, Message: User admin@internal-authz logged out.
2017-06-15 12:33:30,656-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-6-thread-12) [4d14b49f] FINISH, ConnectStorageServerVDSCommand, return: {5b50f056-83f5-4193-a391-b1b26f4d8ed2=0}, log id: 16cccdb4
2017-06-15 12:33:30,659-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] START, HSMGetStorageDomainInfoVDSCommand(HostName = rhev5.lab.msp.redhat.com, HSMGetStorageDomainInfoVDSCommandParameters:{runAsync='true', hostId='7ae9bfa3-bafd-43b9-87fb-ffe8bef6b9e8', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b'}), log id: 55b48583
2017-06-15 12:33:31,558-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='tatooine-test2', id='91a58987-63f0-4c12-8a2b-f13693e2993b'}, 58c0d0a3-0242-0193-00d9-00000000007e>, log id: 55b48583
2017-06-15 12:33:31,559-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] START, DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', ignoreFailoverLimit='false', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'}), log id: 1bcf9c7e
2017-06-15 12:33:33,476-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Failed in 'DetachStorageDomainVDS' method
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command DetachStorageDomainVDS failed: Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command 'DetachStorageDomainVDSCommand( DetachStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', ignoreFailoverLimit='false', storageDomainId='91a58987-63f0-4c12-8a2b-f13693e2993b', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='0', force='true'})' execution failed: IRSGenericException: IRSErrorException: Failed to DetachStorageDomainVDS, error = Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device')), code = 661
2017-06-15 12:33:33,483-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] FINISH, DetachStorageDomainVDSCommand, log id: 1bcf9c7e
2017-06-15 12:33:33,483-05 WARN [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Detaching Storage Domain '91a58987-63f0-4c12-8a2b-f13693e2993b' from it's previous storage pool '00000000-0000-0000-0000-000000000000' has failed. The meta data of the Storage Domain might still indicate that it is attached to a different Storage Pool.
2017-06-15 12:33:33,483-05 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command 'org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Failed to DetachStorageDomainVDS, error = Cannot acquire host id: (u'91a58987-63f0-4c12-8a2b-f13693e2993b', SanlockException(19, 'Sanlock lockspace add failure', 'No such device')), code = 661 (Failed with error AcquireHostIdFailure and code 661)
2017-06-15 12:33:33,485-05 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Command [id=13b659cf-ba2a-499e-acda-1c3dd194f321]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: StoragePoolIsoMapId:{storagePoolId='58c0d0a3-0242-0193-00d9-00000000007e', storageId='91a58987-63f0-4c12-8a2b-f13693e2993b'}.
2017-06-15 12:33:33,491-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] EVENT_ID: USER_ATTACH_STORAGE_DOMAIN_TO_POOL_FAILED(963), Correlation ID: 88b90f60-4704-41b1-9ddc-73c2274cacbd, Job ID: 39f45055-4b21-4514-94d0-fd6a706da95c, Call Stack: null, Custom Event ID: -1, Message: Failed to attach Storage Domain tatooine-test2 to Data Center DEV. (User: admin@internal-authz)
2017-06-15 12:33:33,495-05 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-45) [88b90f60-4704-41b1-9ddc-73c2274cacbd] Lock freed to object 'EngineLock:{exclusiveLocks='[91a58987-63f0-4c12-8a2b-f13693e2993b=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
4.1.4 is planned as a minimal, fast, z-stream version to fix any open issues we may have in supporting the upcoming EL 7.4. Pushing out anything unrelated, although if there's a minimal/trivial, SAFE fix that's ready on time, we can consider introducing it in 4.1.4.
David hi,

Can you explain the meaning of these sanlock errors?

From the sanlock log http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/sanlock.log

2017-04-12 18:04:40+0530 615782 [31431]: s12 renewal error -5 delta_length 0 last_success 615761
2017-04-12 18:04:40+0530 615782 [31431]: 7b189c07 aio collect RD 0x7fb9c00008c0:0x7fb9c00008d0:0x7fb9d44ab000 result -5:0 match res
2017-04-12 18:04:40+0530 615782 [31431]: s12 delta_renew read rv -5 offset 0 /rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids
2017-04-12 18:25:17+0530 617019 [26942]: s211 add_lockspace fail result -5
2017-04-12 18:25:22+0530 617024 [26942]: s212 lockspace 7b189c07-19f5-419a-badf-15f0cd519169:1:/rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids:0
2017-04-12 18:25:22+0530 617024 [773]: 7b189c07 aio collect RD 0x7fb9c00008c0:0x7fb9c00008d0:0x7fb9c0101000 result -5:0 match res
2017-04-12 18:25:22+0530 617024 [773]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids
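The negative result codes in the sanlock log are negated kernel errno values, so the -5 here and the error numbers in the exceptions can be decoded with the standard errno module:

```python
import errno
import os

# sanlock logs negated errno values; -5 appears throughout the log
# excerpt above, and 5/19 appear in the SanlockException tuples.
for rv in (-5, -19):
    code = -rv
    # errno 5 is EIO ("Input/output error") and errno 19 is ENODEV
    # ("No such device"), matching the exception messages in the logs.
    print(code, errno.errorcode[code], os.strerror(code))
```

So "renewal error -5" and "add_lockspace fail result -5" both mean sanlock's reads of the dom_md/ids file on the gluster mount are failing with EIO.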
sanlock is getting i/o errors from gluster. You'll have to ask gluster people if gluster is supposed to transparently tolerate the sort of test you are running, and if so why it seems to not work for you.
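One quick way to check whether the mount itself is returning I/O errors, independent of sanlock, is to attempt a small read of the lease file and inspect the errno. This is a hypothetical helper, not part of vdsm; the path to probe would be the one from the sanlock log (/rhev/data-center/mnt/glusterSD/10.70.36.82:_data/7b189c07-19f5-419a-badf-15f0cd519169/dom_md/ids):

```python
import errno
import os

def probe_read(path, length=512):
    """Try to read the first bytes of a file; return (ok, errno).

    On a gluster mount that is returning EIO (as in the sanlock log),
    this surfaces the same error that sanlock's delta_renew read hits:
    probe_read(...) would return (False, errno.EIO), i.e. (False, 5).
    """
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, length)
        finally:
            os.close(fd)
        return True, 0
    except OSError as e:
        return False, e.errno
```

If this read fails with EIO while all bricks are reportedly up, that points at the gluster layer rather than sanlock or vdsm.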
Denis, do you know who from the gluster team can take a look at this issue? Thanks.
Kasturi, who reported that BZ, may help.
Hi Denis, can you elaborate on what specific info you are looking for from me? Thanks, kasturi
Hi Kasturi, Fred Rolland requested some help with gluster.
Sahina hi, Is this scenario supposed to work ? It seems that sanlock is unable to read from the GlusterFS after this flow.
(In reply to Fred Rolland from comment #11)
> Sahina hi,
>
> Is this scenario supposed to work ?
> It seems that sanlock is unable to read from the GlusterFS after this flow.

Yes, with one brick down the gluster volume is still available and the reads should succeed. Is there a minimum time within which the reads are expected to be serviced?

Kasturi, was the test run on the "data" volume? I see errors in the mount logs for the data volume under /sosreport-rhsqa-grafton5.lab.eng.blr.redhat.com.1441697-20170413112939/var/log/glusterfs/.
sahina, yes. volume was "data"
Sahina, according to the sanlock log in http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1441697/sanlock.log it fails to access the storage from 2017-04-12 18:04 until at least 2017-04-12 19:21. It seems that access to the file system was not restored successfully.
I cannot find any issue on the oVirt side for now. Sanlock is failing to get access to the file system. Moving to gluster.
This bug is getting closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.
Please re-open if still relevant.