Bug 1359788

Summary: [engine-backend] Storage domain gets stuck in locked during storage pool initialization after a CreateStoragePool failure
Product: [oVirt] ovirt-engine
Reporter: Elad <ebenahar>
Component: BLL.Storage
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0.2
CC: acanan, amureini, bugs, talayan, ylavi
Target Milestone: ovirt-4.0.4
Keywords: Automation, Regression
Target Release: 4.0.4
Flags: rule-engine: ovirt-4.0.z+
       rule-engine: blocker+
       ylavi: planning_ack+
       rule-engine: devel_ack+
       acanan: testing_ack+
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1364804
Environment:
Last Closed: 2016-09-26 12:42:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1364804
Attachments:
Description Flags
logs from hypervisor and engine none

Description Elad 2016-07-25 12:51:59 UTC
Created attachment 1183790
logs from hypervisor and engine

Description of problem:
When attaching the first storage domain to a storage pool fails (that is, during storage pool creation), the domain remains in status Locked.

Happens after BZ #1359659 is reproduced for the first storage domain in the DC. 


Version-Release number of selected component (if applicable):
rhevm-4.0.2-0.1.rc.el7ev.noarch
vdsm-4.18.8-1.el7ev.x86_64
sanlock-3.2.4-3.el7_2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.19.x86_64


How reproducible:
Occurs while BZ #1359659 is reproduced for first storage domain in the DC.

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain 
2. Cause a failure in AttachStorageDomain (in our case, caused by BZ #1359659)


Actual results:
A Sanlock timeout causes attaching the storage domain to fail, storage pool creation fails as a result, and the storage domain remains in Locked state.


Expected results:
The storage domain should become Unattached.
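
The expected behavior amounts to a compensation step: if pool creation fails after the domain was locked, the previous status is restored. The following is a minimal sketch of that idea only, not the ovirt-engine implementation. All names (DomainStatus, create_pool_with_first_domain, failing_create) are hypothetical, and setting the status to ACTIVE on success is an assumption made for illustration.

```python
from enum import Enum


class DomainStatus(Enum):
    # Hypothetical status names for this sketch only.
    UNATTACHED = "unattached"
    LOCKED = "locked"
    ACTIVE = "active"


class AcquireLockFailure(Exception):
    """Stand-in for the lock failure reported by the host."""


def create_pool_with_first_domain(domain, create_storage_pool):
    """Lock the domain, try to create the pool, restore the status on failure."""
    previous = domain["status"]
    domain["status"] = DomainStatus.LOCKED
    try:
        create_storage_pool(domain["id"])
    except Exception:
        # Compensation: without this step the domain would stay LOCKED.
        domain["status"] = previous
        raise
    # Assumption for illustration: a successful creation activates the domain.
    domain["status"] = DomainStatus.ACTIVE


def failing_create(domain_id):
    # Simulates the failure seen in this report.
    raise AcquireLockFailure(f"Cannot obtain lock: id={domain_id}")


domain = {
    "id": "c01302bc-f723-446a-bf02-10a18e682975",
    "status": DomainStatus.UNATTACHED,
}
try:
    create_pool_with_first_domain(domain, failing_create)
except AcquireLockFailure:
    pass
print(domain["status"].name)
```

The original exception is re-raised after the status is restored, so the caller still sees the failure while the domain no longer stays locked.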


Additional info:

Sanlock failure:

sanlock.log:

2016-07-25 14:44:13+0300 79094 [702]: worker1 aio timeout 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 ioto 10 to_count 3
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 dblock read2 error -202
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 retract error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 paxos_acquire 1 ballot error -210
2016-07-25 14:44:13+0300 79094 [702]: r5 acquire_token disk error -210 RETRACT_PAXOS
2016-07-25 14:44:15+0300 79095 [702]: worker1 aio collect 0 0x7fe46c0008c0:0x7fe46c0008d0:0x7fe4781aa000 result 1048576:0 other free
2016-07-25 14:44:21+0300 79101 [7005]: c01302bc aio timeout 0 0x7fe4500008c0:0x7fe4500008d0:0x7fe47c5ba000 ioto 10 to_count 1
2016-07-25 14:44:21+0300 79101 [7005]: s8 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/ids
2016-07-25 14:44:21+0300 79101 [7005]: s8 renewal error -202 delta_length 10 last_success 79071
2016-07-25 14:44:25+0300 79105 [702]: worker1 aio timeout 1 0x7fe46c000910:0x7fe46c000920:0x7fe46c002000 ioto 10 to_count 4
2016-07-25 14:44:25+0300 79105 [702]: write_sectors dblock offset 1177088 rv -202 /rhev/data-center/mnt/10.35.118.113:_nas01_ge__6__nfs__3/c01302bc-f723-446a-bf02-10a18e682975/dom_md/leases
2016-07-25 14:44:25+0300 79105 [702]: r5 release_token erase_dblock error -202 r_flags 80
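
When triaging an excerpt like the one above, it can help to tally the negative return codes per line. The snippet below is a rough sketch that assumes the "error <code>" and "rv <code>" wording seen in these lines. The helper name count_error_codes is hypothetical and the sample holds three lines copied from the excerpt.

```python
import re
from collections import Counter

# Three lines copied verbatim from the sanlock.log excerpt in this report.
SAMPLE = """\
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 dblock read2 error -202
2016-07-25 14:44:13+0300 79094 [702]: r5 ballot 1 retract error -210
2016-07-25 14:44:21+0300 79101 [7005]: s8 renewal error -202 delta_length 10 last_success 79071
"""

# Assumption: codes follow the literal words "error" or "rv", as in the excerpt.
CODE_RE = re.compile(r"\b(?:error|rv) (-\d+)\b")


def count_error_codes(text):
    """Return a Counter mapping each negative return code to its occurrences."""
    return Counter(int(m.group(1)) for m in CODE_RE.finditer(text))


for code, hits in sorted(count_error_codes(SAMPLE).items()):
    print(code, hits)
```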

=============================================================================

vdsm.log:

jsonrpc.Executor/7::ERROR::2016-07-25 14:44:38,012::dispatcher::77::Storage.Dispatcher::(wrapper) {'status': {'message': 'Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, \'Sanlock resource not acquired\', \'Sanlock exception\')"', 'code': 651}}


=============================================================================


Storage pool creation failure:

engine.log:

2016-07-25 14:44:39,045 ERROR [org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand] (default task-6) [b1532eb] Command 'org.ovirt.engine.core.bll.storage.pool.AddStoragePoolWithStoragesCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CreateStoragePoolVDS, error = Cannot obtain lock: u"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, out=Cannot acquire cluster lock, err=(-210, 'Sanlock resource not acquired', 'Sanlock exception')", code = 651 (Failed with error AcquireLockFailure and code 651)
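
For correlating this engine error with the sanlock and vdsm entries, the domain id, sanlock return code, and engine error code can be pulled out of the message. This is a sketch that assumes the exact message layout shown in this report. The pattern and the helper name parse_lock_failure are hypothetical and may not match other engine versions.

```python
import re

# Tail of the engine.log message from this report.
MESSAGE = (
    "Failed to CreateStoragePoolVDS, error = Cannot obtain lock: "
    "u\"id=c01302bc-f723-446a-bf02-10a18e682975, rc=-210, "
    "out=Cannot acquire cluster lock, "
    "err=(-210, 'Sanlock resource not acquired', 'Sanlock exception')\", "
    "code = 651 (Failed with error AcquireLockFailure and code 651)"
)

# Assumption: the fields appear in this order, as in the message above.
PATTERN = re.compile(
    r"id=(?P<domain_id>[0-9a-f-]{36}), rc=(?P<rc>-?\d+),.*?"
    r"code = (?P<code>\d+) \(Failed with error (?P<name>\w+) and code \d+\)"
)


def parse_lock_failure(message):
    """Return the domain id, sanlock rc, engine code and error name, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return {
        "domain_id": match.group("domain_id"),
        "rc": int(match.group("rc")),
        "code": int(match.group("code")),
        "name": match.group("name"),
    }


print(parse_lock_failure(MESSAGE))
```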

=============================================================================

Storage domain in status locked:



                  id                  | storage_name | status 
--------------------------------------+--------------+--------
 c01302bc-f723-446a-bf02-10a18e682975 | test2        |      5

Comment 1 Red Hat Bugzilla Rules Engine 2016-07-25 15:09:28 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Kevin Alon Goldblatt 2016-09-08 15:12:54 UTC
Tested with the following code:
----------------------------------------
rhevm-4.0.4-0.1.el7ev.noarch
vdsm-4.18.12-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. In an uninitialized DC: Create first storage domain 
2. Cause a failure in AttachStorageDomain by blocking access between the host and the storage server >>>>>> The AddDomain process fails and the domain is reported as unattached. It does not remain locked.



Actual results:
The AddDomain process fails and the domain is reported as unattached. It does not remain locked.

Expected results:



Moving to VERIFIED!