Created attachment 1183790 [details]
logs from hypervisor and engine

Description of problem:
A failure to attach the first storage domain to a storage pool (that is, during storage pool creation) leaves the domain in status locked. This happens when BZ #1359659 is reproduced for the first storage domain in the DC.

Version-Release number of selected component (if applicable):
rhevm-4.0.2-0.1.rc.el7ev.noarch
vdsm-4.18.8-1.el7ev.x86_64
sanlock-3.2.4-3.el7_2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.19.x86_64

How reproducible:
Occurs when BZ #1359659 is reproduced for the first storage domain in the DC.

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain
2. Cause a failure in AttachStorageDomain (in our case, caused by BZ #1359659)

Actual results:
The storage domain remains in locked state after storage pool creation fails. The failure was triggered by a Sanlock timeout that made attach storage domain fail.

Expected results:
The storage domain should become unattached.

Additional info:

Sanlock failure:

sanlock.log:

2016-07-25 14:44:13+0300 79094 [702]: worker1 aio timeout 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 ioto 10 to_count 3
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 dblock read2 error -202
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 retract error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 paxos_acquire 1 ballot error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 acquire_token disk error -210 RETRACT_PAXOS
2016-07-25 14:44:15+0300 79095 [702]: worker1 aio collect 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 result 1048576:0 other free
2016-07-25 14:44:21+0300 79101 [7005]: c01302bc aio timeout 0 0x7fe4500008c0:0x7fe4500008d0:0x7fe47c5ba000 ioto 10 to_count 1
2016-07-25 14:44:21+0300 79101 [7005]: s8 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/ids
2016-07-25 14:44:21+0300 79101 [7005]: s8 renewal error -202 delta_length 10 last_success 79071
2016-07-25 14:44:25+0300 79105 [702]: worker1 aio timeout 1 0x7fe46c000910:0x7fe46c000920:0x7fe46c002000 ioto 10 to_count 4
2016-07-25 14:44:25+0300 79105 [702]: write_sectors dblock offset 1177088 rv -202 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/leases
2016-07-25 14:44:25+0300 79105 [702]: r5 release_token erase_dblock error -202 r_flags 80

=============================================================================

vdsm.log:

jsonrpc.Executor/7::ERROR::2016-07-25 14:44:38,012::dispatcher::77::Storage.Dispatcher::(wrapper) {'status': {'message': 'Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, \'Sanlock resource not acquired\', \'Sanlock exception\')"', 'code': 651}}

=============================================================================

Storage pool creation failure:

engine.log:

2016-07-25 14:44:39,045 ERROR [org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand] (default task-6) [b1532eb] Command 'org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, 'Sanlock resource not acquired', 'Sanlock exception')", code = 651 (Failed with error AcquireLockFailure and code 651)

=============================================================================

Storage domain in status locked:

                  id                  | storage_name | status
--------------------------------------+--------------+--------
 c01302bc-f723-446a-bf02-10a18e682975 | test2        |      5
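
The locked-status check above can also be scripted when re-testing. The following is only a minimal sketch: it assumes the query result is available as psql-style text in the three-column layout shown in this report, and that status 5 means locked, as observed here. The function name find_locked_domains and the SAMPLE text are hypothetical and are not part of the engine or vdsm.

```python
# Assumption: status code 5 denotes "locked", as observed in this report.
LOCKED_STATUS = 5


def find_locked_domains(psql_output):
    """Return (id, storage_name) pairs for rows whose status is LOCKED_STATUS."""
    locked = []
    for line in psql_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header row, the separator row, and anything malformed.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        domain_id, name, status = parts
        if int(status) == LOCKED_STATUS:
            locked.append((domain_id, name))
    return locked


# Hypothetical sample text mirroring the table in this report.
SAMPLE = """\
                  id                  | storage_name | status
--------------------------------------+--------------+--------
 c01302bc-f723-446a-bf02-10a18e682975 | test2        |      5
"""

if __name__ == "__main__":
    for domain_id, name in find_locked_domains(SAMPLE):
        print(f"{name} ({domain_id}) is still locked")
```

An empty result after a failed storage pool creation would match the expected behaviour described above, since the domain should no longer be locked.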
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Tested with the following code:
----------------------------------------
rhevm-4.0.4-0.1.el7ev.noarch
vdsm-4.18.12-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain
2. Cause a failure in AttachStorageDomain by blocking access between the host and the storage server
>>>>>> The AddDomain process fails and the domain is reported as unattached. It does not remain locked.

Actual results:
The AddDomain process fails and the domain is reported as unattached. It does not remain locked.

Expected results:

Moving to VERIFIED!