Bug 968294 - engine: create of thin copy vm fails because of "Storage domain does not exist" error from vdsm during GetImageInfoVDSCommand and vm get stuck in image locked
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Maor
QA Contact: Elad
Whiteboard: storage
Keywords: Regression
Depends On:
Blocks:
 
Reported: 2013-05-29 08:00 EDT by Dafna Ron
Modified: 2016-02-10 15:24 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 17:19:21 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (587.58 KB, application/x-gzip)
2013-05-29 08:00 EDT, Dafna Ron
Description Dafna Ron 2013-05-29 08:00:40 EDT
Created attachment 754318
logs

Description of problem:

I tried creating a thin copy VM while one of the domains holding a copy of the template was inactive. GetImageInfoVDSCommand fails in vdsm with a "Storage domain does not exist" error, and the VM gets stuck in image locked.

Version-Release number of selected component (if applicable):

sf7.2
vdsm-4.10.2-22.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create two iSCSI storage domains located on two different storage servers.
2. Create a template and copy the template to both domains.
3. Block connectivity to the non-master domain using iptables from all hosts.
4. Once the domain becomes inactive, try to create a new thin copy VM on the active domain.

***use vdsm-4.10.2-22.0.el6ev.x86_64***

Actual results:

We get an error from vdsm during GetImageInfoVDSCommand, and the VM gets stuck in image locked.

Expected results:

Even if there is a failure in vdsm, the engine should still release the lock and remove the VM (since the VM is based on a template, there is no reason to keep it).
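
The expected behaviour can be pictured as a small compensation step around the disk-creation call. This is only a sketch under stated assumptions, not engine code: `add_thin_vm`, `VdsmError`, the dict used as a database, and the status strings are all hypothetical names made up for illustration.

```python
class VdsmError(Exception):
    """Stands in for a failure reported by vdsm (hypothetical)."""


def add_thin_vm(db, vm_id, create_snapshot_from_template):
    """Create a VM record, lock it, and undo both if the disk step fails."""
    db[vm_id] = {"status": "ImageLocked"}     # VM record exists and is locked
    try:
        create_snapshot_from_template(vm_id)  # may raise VdsmError
    except VdsmError:
        # Compensation: a thin VM depends on its template, so nothing is
        # worth keeping. Dropping the record also releases the lock.
        db.pop(vm_id, None)
        return False
    db[vm_id]["status"] = "Down"              # success: release the lock
    return True


def failing_step(vm_id):
    """Simulates the disk step failing the way this report describes."""
    raise VdsmError("Storage domain does not exist")


db = {}
created = add_thin_vm(db, "vm-1", failing_step)
print(created, db)
```

With the failing step, no record is left behind in `db`, which is the outcome asked for above.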

Additional info: logs

2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] IrsBroker::getImageInfo::Failed getting image info imageId = f82a0d58-0791-4137-b1e6-22a8794acd2a does not exist on domainName = tiger-01 , domainId = 7414f930-bbdb-4ec6-8132-4640cbb3c722,  error code: StorageDomainDoesNotExist, message: Storage domain does not exist: ()
2013-05-29 14:19:02,617 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand return value 
 
OneImageInfoReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=358, mMessage=Storage domain does not exist: ()]]

2013-05-29 14:19:02,617 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] FINISH, GetImageInfoVDSCommand, log id: 662f6e39
2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand throw Vdc Bll exception. With error message VdcBLLException:
2013-05-29 14:19:02,620 ERROR [org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand] (pool-4-thread-48) [619f7ff3] Transaction rolled-back for command: org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand.
2013-05-29 14:19:02,620 ERROR [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [619f7ff3] Command org.ovirt.engine.core.bll.AddVmCommand throw Vdc Bll exception. With error message VdcBLLException: RESOURCE_MANAGER_VM_SNAPSHOT_MISSMATCH
2013-05-29 14:19:02,621 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.AddVmCommand
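
For anyone triaging the attached logs, the failing image and domain can be pulled out of the GetImageInfoVDSCommand error line with a regular expression. The sketch below is an assumption about the line layout based only on the single error line quoted in this report; `parse_failure` and the group names are hypothetical.

```python
import re

# The error line quoted in this report, rejoined into a single string.
LINE = (
    "2013-05-29 14:19:02,617 ERROR "
    "[org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] "
    "(pool-4-thread-48) [619f7ff3] "
    "IrsBroker::getImageInfo::Failed getting image info "
    "imageId = f82a0d58-0791-4137-b1e6-22a8794acd2a does not exist on "
    "domainName = tiger-01 , domainId = 7414f930-bbdb-4ec6-8132-4640cbb3c722,  "
    "error code: StorageDomainDoesNotExist, message: Storage domain does not exist: ()"
)

# Layout inferred from the line above; other engine versions may differ.
PATTERN = re.compile(
    r"\[(?P<correlation>[0-9a-f]{8})\] .*?"
    r"imageId = (?P<image>[0-9a-f-]{36}) .*?"
    r"domainName = (?P<domain_name>\S+) , "
    r"domainId = (?P<domain>[0-9a-f-]{36}),\s+"
    r"error code: (?P<code>\w+),"
)


def parse_failure(line):
    """Return the fields of a getImageInfo failure line, or None if no match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


failure = parse_failure(LINE)
print(failure)
```

The correlation ID (`619f7ff3` here) is the value to search for when looking for the matching AddVmCommand lines.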
Comment 2 Ayal Baron 2013-06-10 20:08:21 EDT
Dafna, why is this a regression?

Allon, the problem is that the AddVm calls are indeed issued with the active master domain:
masterDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d
yet getVolumeInfo is called with the faulty domain.

vdsm fails correctly since the GetInfo command is sent with a domain which is reported as faulty: 7414f930-bbdb-4ec6-8132-4640cbb3c722

2013-05-29 12:19:51,583 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-47) Domain 7414f930-bbdb-4ec6-8132-4640cbb3c722:tiger-01 was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.

AddVm is called 3 times by user (several minutes apart) and keeps failing on the same thing:

2013-05-29 14:19:02,617 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (pool-4-thread-48) [619f7ff3] IrsBroker::getImageInfo::Failed getting image info imageId = f82a0d58-0791-4137-b1e6-22a8794acd2a does not exist on domainName = tiger-01 , domainId = 7414f930-bbdb-4ec6-8132-4640cbb3c722,  error code: StorageDomainDoesNotExist, message: Storage domain does not exist: ()


2013-05-29 14:05:40,916 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [72c2c7cb] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage

...

2013-05-29 14:06:44,352 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-44) [53bee9fa] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage

...

2013-05-29 14:19:02,204 INFO  [org.ovirt.engine.core.bll.AddVmCommand] (pool-4-thread-48) [619f7ff3] Running command: AddVmCommand internal: false. Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage
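
The mismatch described in this comment can be checked mechanically by comparing the Storage entity in an AddVmCommand line with the domain ID named in the getImageInfo failure. A minimal sketch, assuming the "Entities affected" layout seen in the lines quoted here; `entities_affected` is a hypothetical helper.

```python
import re

# Matches each "ID: <uuid> Type: <name>" pair in an "Entities affected" list.
ENTITY = re.compile(r"ID: (?P<id>[0-9a-f-]{36}) Type: (?P<type>\w+)")

# Third AddVmCommand attempt quoted in this comment, as a single string.
ADD_VM_LINE = (
    "2013-05-29 14:19:02,204 INFO  [org.ovirt.engine.core.bll.AddVmCommand] "
    "(pool-4-thread-48) [619f7ff3] Running command: AddVmCommand internal: false. "
    "Entities affected :  ID: 066a4468-2023-4baa-b7a4-625c4d9a5ba0 Type: VdsGroups,  "
    "ID: 8241801a-fd55-480c-b92f-3926eb935368 Type: VmTemplate,  "
    "ID: 38755249-4bb3-4841-bf5b-05f4a521514d Type: Storage"
)

# Domain ID named in the getImageInfo failure for the same correlation ID.
QUERIED_DOMAIN = "7414f930-bbdb-4ec6-8132-4640cbb3c722"


def entities_affected(line):
    """Map entity type to ID for one 'Entities affected' log line."""
    return {m["type"]: m["id"] for m in ENTITY.finditer(line)}


entities = entities_affected(ADD_VM_LINE)
mismatch = entities["Storage"] != QUERIED_DOMAIN
print(entities["Storage"], QUERIED_DOMAIN, mismatch)
```

For these lines the two IDs differ: the command targets the active domain while the image query names the faulty one.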
Comment 3 Dafna Ron 2013-06-11 04:31:17 EDT
It's a regression because I remember testing this scenario on 3.1, when the multiple domains feature came out, and we were able to create the VM on the active domain.
Comment 5 Maor 2013-07-09 07:32:27 EDT
1. A new CanDoAction (CDA) check should be added to validate the storage domain (Bug https://bugzilla.redhat.com/show_bug.cgi?id=975053)

2. getImageInfo has been removed in commit 2575a223515a4f984157e8017e272cdd5ac98db0
and a new compensation has been added to disk at 32783a9f41c150b07c1146c1336fd87bd122956c

It may be that this no longer reproduces once the commits in item 2 have been merged.
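
The validation proposed in item 1 amounts to rejecting the request before any lock is taken. The sketch below only illustrates that idea and is not the check from the linked bug: `can_do_action`, the status strings, and the dict of domain statuses are hypothetical.

```python
def can_do_action(storage_domain_id, domain_status):
    """Return (ok, reasons); meant to run before any record is created or locked."""
    reasons = []
    status = domain_status.get(storage_domain_id)
    if status is None:
        reasons.append("storage domain does not exist")
    elif status != "Active":
        reasons.append(f"storage domain is {status}, expected Active")
    return (not reasons, reasons)


# Domain IDs taken from this report; the status labels are assumptions.
statuses = {
    "38755249-4bb3-4841-bf5b-05f4a521514d": "Active",
    "7414f930-bbdb-4ec6-8132-4640cbb3c722": "Inactive",
}

print(can_do_action("38755249-4bb3-4841-bf5b-05f4a521514d", statuses))
print(can_do_action("7414f930-bbdb-4ec6-8132-4640cbb3c722", statuses))
```

Failing at this stage means there is nothing to compensate for afterwards, since no VM record or image lock exists yet.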
Comment 6 Maor 2013-07-10 03:10:26 EDT
The image should not stay in image locked after the commits described in comment 5 have been merged.
Comment 7 Elad 2013-07-18 03:22:40 EDT
After a failure to create a VM from a template (thin) with a blocked data domain that contains the image, the image is deleted from the system and there are no disks in 'LOCKED' state.

Verified on RHEVM3.3 - IS5:
rhevm-3.3.0-0.7.master.el6ev.noarch
Comment 8 Itamar Heim 2014-01-21 17:19:21 EST
Closing - RHEV 3.3 Released
Comment 9 Itamar Heim 2014-01-21 17:25:31 EST
Closing - RHEV 3.3 Released
