Bug 1404727

Summary: Storage domain remains locked after engine restart while attachment is in progress due to NPE in the compensation infrastructure
Product: [oVirt] ovirt-engine
Reporter: Lilach Zitnitski <lzitnits>
Component: BLL.Storage
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE
QA Contact: Lilach Zitnitski <lzitnits>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: amureini, bugs, ratamir, tnisan
Target Milestone: ovirt-4.1.0-beta
Flags: rule-engine: ovirt-4.1+
       rule-engine: blocker+
       rule-engine: planning_ack+
       amureini: devel_ack+
       ratamir: testing_ack+
Target Release: 4.1.0.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:50:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs zip none

Description Lilach Zitnitski 2016-12-14 13:56:34 UTC
Description of problem:
When attaching a storage domain to a DC and restarting the engine service while the attachment is in progress, the storage domain appears Locked in the UI and no actions can be performed on it except destroy.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1138.git6c51957.el7.centos.x86_64

How reproducible:
Tried with 2 SDs, reproduced on both. 

Steps to Reproduce:
1. attach storage domain to a dc
2. while the process is still running, restart the ovirt-engine service
3. wait for the UI to come back and check the storage domains' status 

Actual results:
Storage domain appears Locked and no actions can be performed on it (except destroy)

Expected results:
Storage domain should be unattached and the user should be able to attach it to the dc

Additional info:

vdsm.log

2016-12-14 14:23:36,308 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStoragePool(spUUID=u'cb93d507-6f32-4eda-b916-c99ff6a7afe1', hostID=1, msdUUID=u'bd0c9dd0-ca22-4ce5-bb47-3c903409baec', masterVersion=12, domainsMap={u'bd0c9dd0-ca22-4ce5-bb47-3c903409baec': u'active', u'8c14efe4-c881-47e7-a5b8-0fa8d3179e07': u'active', u'67944510-99f4-4746-88a0-ba5c6aeaf21d': u'active', u'ed6f577a-2d9c-4c31-ac08-720edf376940': u'active'}, options=None) (logUtils:49)

engine.log

2016-12-14 14:21:48,529+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-6-thread-6) [ed406bd5-ba7c-401e-a444-4b6be6b17010] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='unattached_sd2', id='67944510-99f4-4746-88a0-ba5c6aeaf21d'}, null>, log id: 5dc0c956
2016-12-14 14:21:48,533+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-6) [ed406bd5-ba7c-401e-a444-4b6be6b17010] START, AttachStorageDomainVDSCommand( AttachStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='cb93d507-6f32-4eda-b916-c99ff6a7afe1', ignoreFailoverLimit='false', storageDomainId='67944510-99f4-4746-88a0-ba5c6aeaf21d'}), log id: 5a796712
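
When triaging logs like the engine.log excerpt above, it can help to list commands that logged START but never logged FINISH. The excerpt ends on such an entry (the AttachStorageDomainVDSCommand START), though that may only be because the excerpt is cut there. The following is a minimal sketch, not part of oVirt. It assumes that a START line and its FINISH line carry the same "log id" value, which the excerpt itself does not confirm. The function name and the abbreviated sample lines are hypothetical.

```python
import re
from typing import Dict, List

# Matches "START, <Command>... log id: <hex>" or "FINISH, <Command>... log id: <hex>".
# Assumption: START and FINISH entries of one invocation share the same log id.
LINE_RE = re.compile(r"\b(START|FINISH), (\w+).*log id: ([0-9a-f]+)")


def unfinished_commands(lines: List[str]) -> Dict[str, str]:
    """Return {log id: command name} for START entries with no FINISH."""
    started: Dict[str, str] = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a START/FINISH line
        phase, command, log_id = match.groups()
        if phase == "START":
            started[log_id] = command
        else:
            # FINISH: drop the matching START if one was seen
            started.pop(log_id, None)
    return started


# Sample lines abbreviated from the excerpt above ("..." marks removed text).
sample = [
    "2016-12-14 14:21:48,529+02 INFO  [...] FINISH, "
    "HSMGetStorageDomainInfoVDSCommand, return: <...>, log id: 5dc0c956",
    "2016-12-14 14:21:48,533+02 INFO  [...] START, "
    "AttachStorageDomainVDSCommand( ... ), log id: 5a796712",
]
pending = unfinished_commands(sample)
```

Running this over the full engine.log from the attachment, rather than over the two sample lines, would be the realistic use. A FINISH whose START falls outside the scanned window is simply ignored.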

Comment 1 Lilach Zitnitski 2016-12-14 14:15:17 UTC
Created attachment 1231753
logs zip

engine.log
vdsm.log

Comment 2 Allon Mureinik 2016-12-18 09:09:21 UTC
Looking through the patch attached to the BZ is a bit unsettling.

While it should indeed solve the bug described here, the issue is deeper than just this flow. The bug occurs in the compensation infrastructure, and would, in theory, affect all the flows that use it if the engine is restarted in the middle of them.

Raz - at the very least I think we should wait with engine-restart tests till QA has a build with this fix. Do you want to track this here, or open a separate BZ(s) for it?
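
To illustrate why a failure in compensation is not specific to one flow, here is a minimal sketch of the general idea. It is not the ovirt-engine code, and this report does not say where the NPE is actually thrown. The sketch assumes that a command records an entity's previous status before changing it, and that after a restart the engine restores those statuses. `Snapshot`, `compensate`, and the status strings are hypothetical names; `None` stands in for the missing data that would trigger a null dereference in Java.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical record of an entity's status before a command changed it."""
    entity_id: str
    previous_status: Optional[str]  # None stands in for incomplete snapshot data


def compensate(statuses: Dict[str, str], snapshots: List[Snapshot]) -> List[str]:
    """Restore snapshotted statuses; return IDs that could not be restored."""
    failed: List[str] = []
    for snap in snapshots:
        if snap.previous_status is None:
            # Without this guard, one bad snapshot would raise and abort the
            # loop, leaving every later entity in its transient status.
            failed.append(snap.entity_id)
            continue
        statuses[snap.entity_id] = snap.previous_status
    return failed


# Simulated state after a restart interrupted two commands.
statuses = {"sd-1": "Locked", "sd-2": "Locked"}
snapshots = [Snapshot("sd-1", None), Snapshot("sd-2", "Unattached")]
failed = compensate(statuses, snapshots)
```

In this sketch the entity with the incomplete snapshot stays Locked and is reported back, while the other one is restored. If the loop raised instead, both would stay Locked, which is the kind of cross-flow impact described above.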

Comment 3 Raz Tamir 2016-12-19 09:06:58 UTC
Allon,
We can track it here

Comment 4 Lilach Zitnitski 2017-01-01 15:11:57 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.1.0-0.3.beta2.el7.noarch
vdsm-4.19.1-1.el7ev.x86_64


Tested with the following scenario:

Steps to Reproduce:
1. attach storage domain to a dc
2. while the process is still running, restart the ovirt-engine service
3. wait for the UI to come back and check the storage domains' status 

Actual results:
After ovirt-engine restart, the attached storage domain appears unattached and can be attached again. 

Moving to VERIFIED!