Bug 1359788 - [engine-backend] Storage domain gets stuck in locked during storage pool initialization after a CreateStoragePool failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.0.2
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.4
Target Release: 4.0.4
Assignee: Liron Aravot
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1364804
 
Reported: 2016-07-25 12:51 UTC by Elad
Modified: 2016-09-26 12:42 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1364804
Environment:
Last Closed: 2016-09-26 12:42:01 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
ylavi: planning_ack+
rule-engine: devel_ack+
acanan: testing_ack+


Attachments
logs from hypervisor and engine (2.85 MB, application/x-gzip)
2016-07-25 12:51 UTC, Elad


Links
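System       | ID    | Private | Priority         | Status | Summary                                                                 | Last Updated
-------------+-------+---------+------------------+--------+-------------------------------------------------------------------------+------------------------
oVirt gerrit | 62035 | 0       | master           | MERGED | core: AttachStorageDomainToPoolCommand - dont pass compensation context | 2020-05-25 01:34:14 UTC
oVirt gerrit | 62041 | 0       | ovirt-engine-4.0 | MERGED | core: AttachStorageDomainToPoolCommand - dont pass compensation context | 2020-05-25 01:34:14 UTC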

Description Elad 2016-07-25 12:51:59 UTC
Created attachment 1183790
logs from hypervisor and engine

Description of problem:
When attaching the first storage domain to a storage pool fails (that is, during storage pool creation), the domain remains in status locked.

Happens after BZ #1359659 is reproduced for the first storage domain in the DC. 


Version-Release number of selected component (if applicable):
rhevm-4.0.2-0.1.rc.el7ev.noarch
vdsm-4.18.8-1.el7ev.x86_64
sanlock-3.2.4-3.el7_2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.19.x86_64


How reproducible:
Occurs while BZ #1359659 is reproduced for first storage domain in the DC.

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain 
2. Cause a failure in AttachStorageDomain (in our case, caused due to BZ #1359659) 


Actual results:
The storage domain remains in locked state after storage pool creation fails. In this case, a Sanlock timeout caused the attach storage domain operation to fail.


Expected results:
The storage domain should become unattached.
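
The expected behaviour can be illustrated with a small sketch. This is not the engine's code: `DomainStatus`, `AttachError` and `attach_first_domain` are hypothetical names used only to show the status being restored when the attach step fails, instead of staying locked.

```python
from enum import Enum


class DomainStatus(Enum):
    # Hypothetical status values, for illustration only.
    UNATTACHED = "unattached"
    LOCKED = "locked"
    ACTIVE = "active"


class AttachError(Exception):
    """Stand-in for a failed attach, e.g. a lock that could not be acquired."""


def attach_first_domain(domain, attach):
    """Lock the domain, try to attach it, and restore its status on failure."""
    previous = domain["status"]
    domain["status"] = DomainStatus.LOCKED
    try:
        attach(domain)
    except AttachError:
        # Expected result from this report: the domain must not stay locked.
        domain["status"] = previous
        raise
    domain["status"] = DomainStatus.ACTIVE


def failing_attach(domain):
    # Simulates the failure seen in the logs.
    raise AttachError("Cannot acquire cluster lock")


domain = {"name": "test2", "status": DomainStatus.UNATTACHED}
try:
    attach_first_domain(domain, failing_attach)
except AttachError:
    pass
print(domain["status"].name)  # UNATTACHED
```

The actual result in this report corresponds to the `except` branch never restoring the status, which leaves the domain locked.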


Additional info:

Sanlock failure:

sanlock.log:

2016-07-25 14:44:13+0300 79094 [702]: worker1 aio timeout 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 ioto 10 to_count 3
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 dblock read2 error -202
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 retract error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 paxos_acquire 1 ballot error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 acquire_token disk error -210 RETRACT_PAXOS
2016-07-25 14:44:15+0300 79095 [702]: worker1 aio collect 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 result 1048576:0 other free
2016-07-25 14:44:21+0300 79101 [7005]: c01302bc aio timeout 0 0x7fe4500008c0:0x7fe4500008d0:0x7fe47c5ba000 ioto 10 to_count 1
2016-07-25 14:44:21+0300 79101 [7005]: s8 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/ids
2016-07-25 14:44:21+0300 79101 [7005]: s8 renewal error -202 delta_length 10 last_success 79071
2016-07-25 14:44:25+0300 79105 [702]: worker1 aio timeout 1 0x7fe46c000910:0x7fe46c000920:0x7fe46c002000 ioto 10 to_count 4
2016-07-25 14:44:25+0300 79105 [702]: write_sectors dblock offset 1177088 rv -202 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/leases
2016-07-25 14:44:25+0300 79105 [702]: r5 release_token erase_dblock error -202 r_flags 80
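
For triage, the error lines in an excerpt like the one above can be pulled out with a short script. This is a minimal sketch: the line layout (timestamp, numeric field, bracketed id, message) is inferred from this excerpt only, and `parse_errors` is a hypothetical helper, not part of sanlock.

```python
import re

# Sample lines copied from the sanlock.log excerpt in this report.
SAMPLE = """\
2016-07-25 14:44:13+0300 79094 [702]: worker1 aio timeout 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 ioto 10 to_count 3
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 dblock read2 error -202
2016-07-25 14:44:13+0300 79094 [702]: r5 paxos_acquire 1 ballot error -210
2016-07-25 14:44:21+0300 79101 [7005]: s8 renewal error -202 delta_length 10 last_success 79071
"""

# Layout assumed from the excerpt: "<date> <time> <number> [<id>]: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) (\d+) \[(\d+)\]: (.*)$")
# A negative return code following the word "error" or "rv".
CODE_RE = re.compile(r"\b(?:error|rv) (-\d+)\b")


def parse_errors(text):
    """Return (timestamp, number, codes, message) for lines carrying error codes."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp, number, _ident, message = match.groups()
        codes = [int(code) for code in CODE_RE.findall(message)]
        if codes:
            results.append((timestamp, int(number), codes, message))
    return results


for timestamp, _number, codes, message in parse_errors(SAMPLE):
    print(timestamp, codes, message)
```

Lines without an `error` or `rv` code, such as the `aio timeout` line, are skipped by this filter.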

=============================================================================

vdsm.log:

jsonrpc.Executor/7::ERROR::2016-07-25 14:44:38,012::dispatcher::77::Storage.Dispatcher::(wrapper) {'status': {'message': 'Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, \'Sanlock resource not acquired\', \'Sanlock exception\')"', 'code': 651}}


=============================================================================


Storage pool creation failure:

engine.log:

2016-07-25 14:44:39,045 ERROR [org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand] (default task-6) [b1532eb] Command 'org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, 'Sanlock resource not acquired', 'Sanlock exception')", code = 651 (Failed with error AcquireLockFailure and code 651)
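
To correlate this engine error with the vdsm and sanlock logs, the domain id, return code and error code can be extracted from the message. The sketch below assumes the message format seen in this report; `parse_lock_failure` is a hypothetical helper and the pattern may not match other engine or vdsm versions.

```python
import re

# Error text copied from the engine.log entry in this report.
MESSAGE = (
    'Failed to CreateStoragePoolVDS, error = Cannot obtain lock: '
    'u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, '
    "out=Cannot acquire cluster lock, err=(-210, 'Sanlock resource not acquired', "
    "'Sanlock exception')\", code = 651"
)

# Pattern derived from this one message; treat it as an assumption.
LOCK_RE = re.compile(
    r"id=(?P<id>[0-9a-f-]{36}), rc=(?P<rc>-?\d+), out=(?P<out>[^,]+), "
    r".*code = (?P<code>\d+)"
)


def parse_lock_failure(message):
    """Return the id, rc, out and code fields, or None if the text does not match."""
    match = LOCK_RE.search(message)
    if match is None:
        return None
    return {
        "id": match.group("id"),
        "rc": int(match.group("rc")),
        "out": match.group("out"),
        "code": int(match.group("code")),
    }


print(parse_lock_failure(MESSAGE))
```

The extracted id is the same storage domain UUID that appears in the sanlock lease paths and in the database row for the locked domain.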

=============================================================================

Storage domain in status locked:



                  id                  | storage_name | status 
--------------------------------------+--------------+--------
 c01302bc-f723-446a-bf02-10a18e682975 | test2        |      5
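
The numeric status in the query output above can be turned into a readable label with a short script. In this sketch, the mapping of 5 to "Locked" follows this report; the mapping of 2 to "Unattached" is an assumption about the engine's status numbering and should be checked against the engine source. `parse_rows` and `describe` are hypothetical helpers.

```python
# Query output copied from this report.
TABLE = """\
                  id                  | storage_name | status
--------------------------------------+--------------+--------
 c01302bc-f723-446a-bf02-10a18e682975 | test2        |      5
"""

STATUS_NAMES = {
    5: "Locked",      # stated in this report
    2: "Unattached",  # assumption, not confirmed by this report
}


def parse_rows(text):
    """Parse psql-style output (header, separator, rows) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the header and the dashed separator
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def describe(row):
    """Format one row with a readable status label."""
    code = int(row["status"])
    label = STATUS_NAMES.get(code, "unknown ({})".format(code))
    return "{} ({}): {}".format(row["storage_name"], row["id"], label)


for row in parse_rows(TABLE):
    print(describe(row))
```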

Comment 1 Red Hat Bugzilla Rules Engine 2016-07-25 15:09:28 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Kevin Alon Goldblatt 2016-09-08 15:12:54 UTC
Tested with the following code:
----------------------------------------
rhevm-4.0.4-0.1.el7ev.noarch
vdsm-4.18.12-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain 
2. Cause a failure in AttachStorageDomain by blocking access between the host and the storage server >>>>>> The AddDomain process fails and the domain is reported as unattached. It does not remain locked.



Actual results:
The AddDomain process fails and the domain is reported as unattached. It does not remain locked.

Expected results:



Moving to VERIFIED!

