1264677 – Can't create a template with Cinder disks

Bug 1264677 - Can't create a template with Cinder disks

Summary: Can't create a template with Cinder disks

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-3.6.0-rc3
Target Release:	3.6.0
Assignee:	Daniel Erez
QA Contact:	Ori Gofen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1135132 1252958 1253215
TreeView+	depends on / blocked

Reported:	2015-09-20 14:31 UTC by Ori Gofen
Modified:	2019-10-10 10:16 UTC (History)
CC List:	14 users (show)
Fixed In Version:	ovirt-engine-3.6.0.1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1264691 (view as bug list)
Environment:
Last Closed:	2016-03-09 21:13:31 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
vdsm.log (879.14 KB, text/plain) 2015-09-20 17:52 UTC, Daniel Erez	no flags	Details
engine.log (2.89 MB, text/plain) 2015-09-20 17:52 UTC, Daniel Erez	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:0376	normal	SHIPPED_LIVE	Red Hat Enterprise Virtualization Manager 3.6.0	2016-03-10 01:20:52 UTC
oVirt gerrit	46438	master	MERGED	core: remove addVmTemplateCinderDisks from transaction	Never
oVirt gerrit	47015	ovirt-engine-3.6	MERGED	core: remove addVmTemplateCinderDisks from transaction	Never
oVirt gerrit	47124	ovirt-engine-3.6.0	MERGED	core: remove addVmTemplateCinderDisks from transaction	Never

Description Ori Gofen 2015-09-20 14:31:35 UTC

Description of problem:
This bug was initially found during cinder/ceph manual testing and was opened to storage as a cinder bug (bz #1252958) but didn't reproduce on ovirt-master-3.6 build on fedora21.

The best way to reproduce this bug is to create a template from a vm with a cinder disk(ceph backend) on a rhevm engine (over rhel6.6/7), the operation gets stuck and then fails on timed out.

This happens only on RHEVM builds on rhel6.6/6.7 (jboss eap 6.4.z), rhevm's code is exactly the same as oVirt and the behaviour differs when the code reaches:

"                buildCinderChildCommandParameters(cinderDisks, getVmSnapshotId()),
                cloneContextAndDetachFromParent(),
                CINDERStorageHelper.getStorageEntities(cinderDisks));
        try {
>>>>>>>>>>> VdcReturnValueBase vdcReturnValueBase = future.get();
            if (vdcReturnValueBase.getSucceeded()) {
                Map<Guid, Guid> diskImageMap = vdcReturnValueBase.getActionReturnValue();
                srcDeviceIdToTargetDeviceIdMapping.putAll(diskImageMap);
            } else {"

the part when it calls future.get() java concurrent interface.

Version-Release number of selected component (if applicable):
rhevm 3.6-13

How reproducible:
100%

Steps to Reproduce:
1.create a template from a vm with a cinder disk

Actual results:
operation fails after it gets timed out

Expected results:
operation successful

Additional info:
* attachments are found at bz #1252958

Comment 1 Daniel Erez 2015-09-20 14:40:53 UTC

Hi Ravi,

The mentioned code is located at 'AddVmTemplateCommand -> addVmTemplateCinderDisks()', and is invoked from within a transaction. Perhaps there's an issue with Future's get() and transactions on JBoss EAP?

Thanks!

Comment 2 Piotr Kliczewski 2015-09-20 16:41:40 UTC

Please provide engine log with debug level and corresponding vdsm log.

Comment 3 Oved Ourfali 2015-09-20 16:59:45 UTC

Also, please describe the flow, as I see future objects being used there both to get a return value from running async engine commands, and probably also VDSM ones.

Also, please verify you run using the exact same version (engine+vdsm), both in jboss EAP and wildfly versions, as I find it hard to believe it is related to that.

Comment 4 Daniel Erez 2015-09-20 17:13:33 UTC

(In reply to Oved Ourfali from comment #3)
> Also, please describe the flow, as I see future objects being used there
> both to get a return value from running async engine commands, and probably
> also VDSM ones.
> 

The relevant flow consists of a VM with a single Cinder disk (i.e. no VDSM tasks, only CoCo). The flow is as follows: AddVmTemplateCommand -> executeCommand() -> addVmTemplateCinderDisks method (inside a transaction) -> Future 'VdcActionType.CloneCinderDisks' -> stuck on future.get().

> Also, please verify you run using the exact same version (engine+vdsm), both
> in jboss EAP and wildfly versions, as I find it hard to believe it is
> related to that.

It was just a guess.. As the same scenario works correctly in oVirt stable (and not in RHEV 3.6 beta), so it could be any other difference (DB version/jboss/java/etc).

@Ori - can you please attach the logs with debug level.. (https://bugzilla.redhat.com/show_bug.cgi?id=1264677#c2)

Comment 5 Oved Ourfali 2015-09-20 17:22:20 UTC

But there was no difference between latest upstream and downstream builds?

Also, we will also need a cinder instance to work with. Can you send its details offline?

Comment 6 Moti Asayag 2015-09-20 17:40:36 UTC

By examining the log, it is shown that the duration of the command exceeds 10 minutes:

Start:
  2015-08-30 15:54:41,331 INFO  [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) [4566b62d] Running command: AddVmTemplateCommand internal: false. Entities affected :  ID: 00000001-0001-0001-0001-00000000033a Type: StoragePoolAction group CREATE_TEMPLATE with role type USER


End:

 2015-08-30 16:04:42,758 ERROR [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) [] Command 'org.ovirt.engine.core.bll.AddVmTemplateCommand' failed: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
   ...
   Caused by: javax.resource.ResourceException: IJ000459: Transaction is not active:
 

And this makes sense since the default timeout of engine's transaction is 600 seconds:
  <coordinator-environment default-timeout="600"/>

I think it makes perfect sense to re-shape that command and exclude the template/disks creation part from the transaction in AddVmTemplateCommand.executeCommand().

It might be that on the environment where the scenario worked - the command didn't face the transaction timeout (better connectivity/smaller disk or other reason for not hitting timeout).

Comment 7 Daniel Erez 2015-09-20 17:52:12 UTC

Created attachment 1075328 [details]
vdsm.log

Comment 8 Daniel Erez 2015-09-20 17:52:48 UTC

Created attachment 1075329 [details]
engine.log

Comment 9 Daniel Erez 2015-09-20 17:53:37 UTC

Attached are logs from my env.

Comment 10 Daniel Erez 2015-09-20 17:57:36 UTC

(In reply to Moti Asayag from comment #6)
> By examining the log, it is shown that the duration of the command exceeds
> 10 minutes:
> 
> Start:
>   2015-08-30 15:54:41,331 INFO 
> [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9)
> [4566b62d] Running command: AddVmTemplateCommand internal: false. Entities
> affected :  ID: 00000001-0001-0001-0001-00000000033a Type: StoragePoolAction
> group CREATE_TEMPLATE with role type USER
> 
> 
> End:
> 
>  2015-08-30 16:04:42,758 ERROR
> [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) []
> Command 'org.ovirt.engine.core.bll.AddVmTemplateCommand' failed: Could not
> get JDBC Connection; nested exception is java.sql.SQLException:
> javax.resource.ResourceException: IJ000460: Error checking for a transaction
>    ...
>    Caused by: javax.resource.ResourceException: IJ000459: Transaction is not
> active:
>  
> 
> And this makes sense since the default timeout of engine's transaction is
> 600 seconds:
>   <coordinator-environment default-timeout="600"/>
> 
> I think it makes perfect sense to re-shape that command and exclude the
> template/disks creation part from the transaction in
> AddVmTemplateCommand.executeCommand().
> 
> It might be that on the environment where the scenario worked - the command
> didn't face the transaction timeout (better connectivity/smaller disk or
> other reason for not hitting timeout).

Yeah, thought about that as well. However, the timeout repeats constantly on multiple rhev environments. Whereas, works constantly on multiple oVirt envs...

Comment 11 Daniel Erez 2015-09-20 18:05:06 UTC

(In reply to Oved Ourfali from comment #5)
> But there was no difference between latest upstream and downstream builds?
> 
> Also, we will also need a cinder instance to work with. Can you send its
> details offline?

Sure, I'll send it offline by request. Ori has one as well if needed.

Comment 12 Oved Ourfali 2015-09-20 18:15:51 UTC

Based on Moti's comment, and offline discussion, moving to storage to consider applying Moti's suggestion, and check whether it solves the issue.

Comment 13 Ori Gofen 2015-09-21 09:36:34 UTC

Returning needinfo from comment #1

Comment 14 Ravi Nori 2015-09-21 13:06:42 UTC

The Future.get operation can block only if the child command's execute method is not completing. I would check to see if CloneSingleCinderDiskCommand.executeCommand method is completing, If not I would think there is an underlying issue there.

Comment 18 Ori Gofen 2015-10-19 12:22:16 UTC

Verified on rhevm-3.6.0.1-0.1.el6.noarch

Comment 20 errata-xmlrpc 2016-03-09 21:13:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html

Note You need to log in before you can comment on or make changes to this bug.