Bug 1264691 - Future interface of java.util.concurrent get stuck and prevent cinder operations
Future interface of java.util.concurrent get stuck and prevent cinder operations
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.6.0
Unspecified Unspecified
unspecified Severity urgent
: ovirt-3.6.0-rc3
: 3.6.0
Assigned To: Moti Asayag
Ori Gofen
:
Depends On:
Blocks: 1255221 1266250 1271672
  Show dependency treegraph
 
Reported: 2015-09-20 14:57 EDT by Daniel Erez
Modified: 2016-04-19 21:29 EDT (History)
17 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1264677
Environment:
Last Closed: 2016-04-19 21:29:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 47408 master MERGED engine: Update command associated entities instantly Never
oVirt gerrit 47450 ovirt-engine-3.6 MERGED engine: Update command associated entities instantly Never
oVirt gerrit 47457 ovirt-engine-3.6.0 MERGED engine: Update command associated entities instantly Never

  None (edit)
Description Daniel Erez 2015-09-20 14:57:12 EDT
+++ This bug was initially created as a clone of Bug #1264677 +++

Description of problem:
This bug was initially found during cinder/ceph manual testing and was opened to storage as a cinder bug (bz #1252958) but didn't reproduce on ovirt-master-3.6 build on fedora21.

The best way to reproduce this bug is to create a template from a vm with a cinder disk(ceph backend) on a rhevm engine (over rhel6.6/7), the operation gets stuck and then fails on timed out.

This happens only on RHEVM builds on rhel6.6/6.7 (jboss eap 6.4.z), rhevm's code is exactly the same as oVirt and the behaviour differs when the code reaches:

"                buildCinderChildCommandParameters(cinderDisks, getVmSnapshotId()),
                cloneContextAndDetachFromParent(),
                CINDERStorageHelper.getStorageEntities(cinderDisks));
        try {
>>>>>>>>>>> VdcReturnValueBase vdcReturnValueBase = future.get();
            if (vdcReturnValueBase.getSucceeded()) {
                Map<Guid, Guid> diskImageMap = vdcReturnValueBase.getActionReturnValue();
                srcDeviceIdToTargetDeviceIdMapping.putAll(diskImageMap);
            } else {"

the part when it calls future.get() java concurrent interface.

Version-Release number of selected component (if applicable):
rhevm 3.6-13

How reproducible:
100%

Steps to Reproduce:
1.create a template from a vm with a cinder disk

Actual results:
operation fails after it gets timed out

Expected results:
operation successful

Additional info:
* attachments are found at bz #1252958

--- Additional comment from Daniel Erez on 2015-09-20 10:40:53 EDT ---

Hi Ravi,

The mentioned code is located at 'AddVmTemplateCommand -> addVmTemplateCinderDisks()', and is invoked from within a transaction. Perhaps there's an issue with Future's get() and transactions on JBoss EAP?

Thanks!

--- Additional comment from Piotr Kliczewski on 2015-09-20 12:41:40 EDT ---

Please provide engine log with debug level and corresponding vdsm log.

--- Additional comment from Oved Ourfali on 2015-09-20 12:59:45 EDT ---

Also, please describe the flow, as I see future objects being used there both to get a return value from running async engine commands, and probably also VDSM ones.

Also, please verify you run using the exact same version (engine+vdsm), both in jboss EAP and wildfly versions, as I find it hard to believe it is related to that.

--- Additional comment from Daniel Erez on 2015-09-20 13:13:33 EDT ---

(In reply to Oved Ourfali from comment #3)
> Also, please describe the flow, as I see future objects being used there
> both to get a return value from running async engine commands, and probably
> also VDSM ones.
> 

The relevant flow consists of a VM with a single Cinder disk (i.e. no VDSM tasks, only CoCo). The flow is as follows: AddVmTemplateCommand -> executeCommand() -> addVmTemplateCinderDisks method (inside a transaction) -> Future 'VdcActionType.CloneCinderDisks' -> stuck on future.get().

> Also, please verify you run using the exact same version (engine+vdsm), both
> in jboss EAP and wildfly versions, as I find it hard to believe it is
> related to that.

It was just a guess.. As the same scenario works correctly in oVirt stable (and not in RHEV 3.6 beta), so it could be any other difference (DB version/jboss/java/etc).

@Ori - can you please attach the logs with debug level.. (https://bugzilla.redhat.com/show_bug.cgi?id=1264677#c2)

--- Additional comment from Oved Ourfali on 2015-09-20 13:22:20 EDT ---

But there was no difference between latest upstream and downstream builds?

Also, we will also need a cinder instance to work with. Can you send its details offline?

--- Additional comment from Moti Asayag on 2015-09-20 13:40:36 EDT ---

By examining the log, it is shown that the duration of the command exceeds 10 minutes:

Start:
  2015-08-30 15:54:41,331 INFO  [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) [4566b62d] Running command: AddVmTemplateCommand internal: false. Entities affected :  ID: 00000001-0001-0001-0001-00000000033a Type: StoragePoolAction group CREATE_TEMPLATE with role type USER


End:

 2015-08-30 16:04:42,758 ERROR [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) [] Command 'org.ovirt.engine.core.bll.AddVmTemplateCommand' failed: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
   ...
   Caused by: javax.resource.ResourceException: IJ000459: Transaction is not active:
 

And this makes sense since the default timeout of engine's transaction is 600 seconds:
  <coordinator-environment default-timeout="600"/>

I think it makes perfect sense to re-shape that command and exclude the template/disks creation part from the transaction in AddVmTemplateCommand.executeCommand().

It might be that on the environment where the scenario worked - the command didn't face the transaction timeout (better connectivity/smaller disk or other reason for not hitting timeout).

--- Additional comment from Daniel Erez on 2015-09-20 13:52 EDT ---



--- Additional comment from Daniel Erez on 2015-09-20 13:52 EDT ---



--- Additional comment from Daniel Erez on 2015-09-20 13:53:37 EDT ---

Attached are logs from my env.

--- Additional comment from Daniel Erez on 2015-09-20 13:57:36 EDT ---

(In reply to Moti Asayag from comment #6)
> By examining the log, it is shown that the duration of the command exceeds
> 10 minutes:
> 
> Start:
>   2015-08-30 15:54:41,331 INFO 
> [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9)
> [4566b62d] Running command: AddVmTemplateCommand internal: false. Entities
> affected :  ID: 00000001-0001-0001-0001-00000000033a Type: StoragePoolAction
> group CREATE_TEMPLATE with role type USER
> 
> 
> End:
> 
>  2015-08-30 16:04:42,758 ERROR
> [org.ovirt.engine.core.bll.AddVmTemplateCommand] (ajp-/127.0.0.1:8702-9) []
> Command 'org.ovirt.engine.core.bll.AddVmTemplateCommand' failed: Could not
> get JDBC Connection; nested exception is java.sql.SQLException:
> javax.resource.ResourceException: IJ000460: Error checking for a transaction
>    ...
>    Caused by: javax.resource.ResourceException: IJ000459: Transaction is not
> active:
>  
> 
> And this makes sense since the default timeout of engine's transaction is
> 600 seconds:
>   <coordinator-environment default-timeout="600"/>
> 
> I think it makes perfect sense to re-shape that command and exclude the
> template/disks creation part from the transaction in
> AddVmTemplateCommand.executeCommand().
> 
> It might be that on the environment where the scenario worked - the command
> didn't face the transaction timeout (better connectivity/smaller disk or
> other reason for not hitting timeout).

Yeah, thought about that as well. However, the timeout repeats constantly on multiple rhev environments. Whereas, works constantly on multiple oVirt envs...

--- Additional comment from Daniel Erez on 2015-09-20 14:05:06 EDT ---

(In reply to Oved Ourfali from comment #5)
> But there was no difference between latest upstream and downstream builds?
> 
> Also, we will also need a cinder instance to work with. Can you send its
> details offline?

Sure, I'll send it offline by request. Ori has one as well if needed.

--- Additional comment from Oved Ourfali on 2015-09-20 14:15:51 EDT ---

Based on Moti's comment, and offline discussion, moving to storage to consider applying Moti's suggestion, and check whether it solves the issue.
Comment 1 Oved Ourfali 2015-09-21 01:49:11 EDT
Daniel - please run your development environment with wildfly, and then with EAP, and report back on your findings.
Comment 2 Daniel Erez 2015-09-21 03:53:45 EDT
(In reply to Oved Ourfali from comment #1)
> Daniel - please run your development environment with wildfly, and then with
> EAP, and report back on your findings.

Same issue in oVirt (with wildfly) on el6. So the problem is somewhere else.
Comment 3 Moti Asayag 2015-09-22 06:33:55 EDT
The described scenario was debugged and raised a lock on db access as a result of the following:

Thread A ---- Transaction 1 open
                   |
                Thread B ------ Transaction 2 open

Thread A (AddVmTemplate) opens new transaction. Within that transaction, it submit an async command (by coco):
1. Suspend current transaction
2. Save command (internally attempts to suspend transaction again)
3. Resume transaction
4. Submit command to Thread B
5. Block on Thread B

Thread B is locked on DB (on command_entities table). The lock ends while the open transaction of Thread A is timed-out.

By excluding the submission of thread B from the transaction - the issue was solved.
Comment 4 Ori Gofen 2015-10-25 09:52:23 EDT
Verified on rhevm-3.6.0.2

Note You need to log in before you can comment on or make changes to this bug.