Bug 968294

Summary: engine: create of thin copy vm fails because of "Storage domain does not exist" error from vdsm during GetImageInfoVDSCommand and vm get stuck in image locked
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: ovirt-engine
Assignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: abaron, acathrow, dron, iheim, jkt, lpeer, Rhev-m-bugs, scohen, yeylon
Target Milestone: ---
Keywords: Regression
Target Release: 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 22:19:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: logs (flags: none)

Description Dafna Ron 2013-05-29 12:00:40 UTC
Created attachment 754318: logs

Description of problem:

I tried creating a thin copy vm while one of the domains holding a copy of the template was inactive. GetImageInfoVDSCommand fails in vdsm with a "Storage domain does not exist" error, and the vm gets stuck in image locked.

Version-Release number of selected component (if applicable):

sf7.2
vdsm-4.10.2-22.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create two iSCSI storage domains located on two different storage servers.
2. Create a template and copy the template to both domains.
3. Block connectivity to the non-master domain using iptables from all hosts.
4. Once the domain becomes inactive, try to create a new thin copy vm on the active domain.

***use vdsm-4.10.2-22.0.el6ev.x86_64***
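For step 3, the report does not say which iptables rules were used. One possible approach, sketched here as an assumption, is to drop all traffic to and from the storage server that backs the non-master domain. The helper names and the 192.0.2.10 address are placeholders, and the script only builds the command lines; it does not run them.

```python
def build_block_commands(storage_server_ip):
    """Return iptables argument lists that drop traffic to and from one address.

    Assumption: dropping on OUTPUT and INPUT is enough to make the domain
    unreachable. The original report does not list the actual rules.
    """
    return [
        ["iptables", "-A", "OUTPUT", "-d", storage_server_ip, "-j", "DROP"],
        ["iptables", "-A", "INPUT", "-s", storage_server_ip, "-j", "DROP"],
    ]


def build_unblock_commands(storage_server_ip):
    """Return the matching delete (-D) rules that restore connectivity."""
    return [
        ["-D" if arg == "-A" else arg for arg in cmd]
        for cmd in build_block_commands(storage_server_ip)
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    for cmd in build_block_commands("192.0.2.10"):
        print(" ".join(cmd))
```

The printed commands would have to be run as root on every host in the data center, and the unblock commands run afterwards to bring the domain back.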

Actual results:

We get an error from vdsm during GetImageInfoVDSCommand and the vm gets stuck in image locked.

Expected results:

Even if there is a failure in vdsm, the engine should still release the lock and remove the vm (since the vm is based on a template, there is no reason to keep it).
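The expected behaviour can be illustrated with a small model. This is not engine code: the dictionaries, the status strings, and the function names are all hypothetical, and the sketch only shows the idea of undoing the VM and disk records when the storage call fails.

```python
class StorageError(Exception):
    """Stands in for a vdsm failure such as StorageDomainDoesNotExist."""


def create_thin_vm(vms, disks, vm_name, get_image_info):
    """Add a VM and its disk, undoing both if the storage call fails.

    vms and disks are plain dicts used as a hypothetical stand-in for the
    engine database; get_image_info is any callable that may raise
    StorageError.
    """
    vms[vm_name] = "ImageLocked"
    disks[vm_name] = "LOCKED"
    try:
        get_image_info()
    except StorageError:
        # Compensation: the VM is based on a template and holds no data
        # of its own yet, so remove it instead of leaving it locked.
        del disks[vm_name]
        del vms[vm_name]
        return False
    vms[vm_name] = "Down"
    disks[vm_name] = "OK"
    return True
```

With a callable that raises StorageError, the function returns False and leaves both dictionaries without the new entry, which is the outcome requested above.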

Additional info: logs

2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] IrsBroker::getImageInfo::Failed getting image info imageId = f82a0d58-0791-4137-b1e6-22a8794acd2a does not exist on domainName = tiger-01 , domainId = 7414f930-bbdb-4ec6-8132-4640cbb3c722,  error code: StorageDomainDoesNotExist, message: Storage domain does not exist: ()
2013-05-29 14:19:02,617 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand return value 
 
OneImageInfoReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=358, mMessage=Storage domain does not exist: ()]]

2013-05-29 14:19:02,617 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] FINISH, GetImageInfoVDSCommand, log id: 662f6e39
2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand throw Vdc Bll exception. With error message VdcBLLException:
2013-05-29 14:19:02,620 ERROR [org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand] (pool-4-thread-48) [619f7ff3] Transaction rolled-back for command: org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand.
2013-05-29 14:19:02,620 ERROR [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.bll.AddVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VM_SNAPSHOT_MISSMATCH
2013-05-29 14:19:02,621 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.AddVmCommand

Comment 2 Ayal Baron 2013-06-11 00:08:21 UTC
Dafna, why is this a regression?

Allon, the problem is that the addVm calls are indeed made with the active master domain:
masterDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d
yet they call getVolumeInfo with the faulty domain.

vdsm fails correctly, since the GetInfo command is sent with a domain that is reported as faulty: 7414f930-bbdb-4ec6-8132-4640cbb3c722

2013-05-29 12:19:51,583 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-47) Domain 7414f930-bbdb-4ec6-8132-4640cbb3c722:tiger-01 was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.
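To make the mismatch concrete: the template image has a copy on both domains, so the image lookup could be directed at a copy on a domain that is still active. The following is only a sketch of that selection, not the engine's logic. The function name and the "Active"/"Inactive" status strings are made up for illustration; the two domain IDs are the ones from the logs.

```python
ACTIVE = "Active"
INACTIVE = "Inactive"


def pick_template_domain(template_domain_ids, domain_status, preferred_id=None):
    """Return an active domain that holds the template image, or None.

    template_domain_ids: domains that hold a copy of the template image.
    domain_status: mapping of domain ID to a (hypothetical) status string.
    preferred_id: the domain requested for the new VM, used if it qualifies.
    """
    active = [d for d in template_domain_ids if domain_status.get(d) == ACTIVE]
    if preferred_id in active:
        return preferred_id
    return active[0] if active else None


# Domain IDs taken from the logs in this report.
MASTER = "38755249-4bb3-4841-bf5b-05f4a521514d"
FAULTY = "7414f930-bbdb-4ec6-8132-4640cbb3c722"

status = {MASTER: ACTIVE, FAULTY: INACTIVE}
chosen = pick_template_domain([FAULTY, MASTER], status, preferred_id=MASTER)
```

Here `chosen` is the master domain ID, so the faulty domain would never be passed to vdsm.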

AddVm is called 3 times by user (several minutes apart) and keeps failing on the same thing:

2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] IrsBroker::getImageInfo::Failed getting image info imageId = f82a0d58-0791-4137-b1e6-22a8794acd2a does not exist on domainName = tiger-01 , domainId = 7414f930-bbdb-4ec6-8132-4640cbb3c722,  error code: StorageDomainDoesNotExist, message: Storage domain does not exist: ()


2013-05-29 14:05:40,916 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [72c2c7cb] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage

...

2013-05-29 14:06:44,352 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-44) [53bee9fa] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage

...

2013-05-29 14:19:02,204 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [619f7ff3] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage

Comment 3 Dafna Ron 2013-06-11 08:31:17 UTC
It's a regression because I remember testing this scenario on 3.1, when the multiple domains feature came out, and we were able to create the vm on the active domain.

Comment 5 Maor 2013-07-09 11:32:27 UTC
1. A new CDA (canDoAction) check should be added for validating the storage domain (Bug https://bugzilla.redhat.com/show_bug.cgi?id=975053)

2. getImageInfo has been removed in commit 2575a223515a4f984157e8017e272cdd5ac98db0
and a new compensation has been added for the disk in commit 32783a9f41c150b07c1146c1336fd87bd122956c

It may be that this no longer reproduces now that the changes in item 2 have been merged.
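As a rough illustration of item 1, a canDoAction-style check would reject the request before any storage command is sent. What bug 975053 actually validates is not described here, so the rule below (the domain must exist and be active) is an assumption, and the function name, status strings, and messages are hypothetical.

```python
def can_do_action(domain_id, domain_status):
    """Return (ok, messages) for a request that needs the given domain.

    domain_status maps domain IDs to (hypothetical) status strings. The
    check only reads that mapping and never contacts storage.
    """
    messages = []
    status = domain_status.get(domain_id)
    if status is None:
        messages.append("storage domain %s does not exist" % domain_id)
    elif status != "Active":
        messages.append(
            "storage domain %s is not active (status: %s)" % (domain_id, status)
        )
    return (not messages, messages)
```

A caller would run this first and return the messages to the user on failure, so no VM or disk record is created and nothing needs to be unlocked afterwards.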

Comment 6 Maor 2013-07-10 07:10:26 UTC
The image should not stay in image locked once the commits described in comment 5 have been merged.

Comment 7 Elad 2013-07-18 07:22:40 UTC
After a failed attempt to create a vm from a template (thin) while the data domain that contains the image is blocked, the image is deleted from the system and there are no disks in 'LOCKED' state.

Verified on RHEVM3.3 - IS5:
rhevm-3.3.0-0.7.master.el6ev.noarch

Comment 8 Itamar Heim 2014-01-21 22:19:21 UTC
Closing - RHEV 3.3 Released

Comment 9 Itamar Heim 2014-01-21 22:25:31 UTC
Closing - RHEV 3.3 Released