Bug 1275736 - CreateSnapshotFromTemplateCommand executions may end with timeout when the SPM is overloaded.
Summary: CreateSnapshotFromTemplateCommand executions may end with timeout when the SP...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.3
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ovirt-4.2.2
: ---
Assignee: Fred Rolland
QA Contact: eberman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-10-27 15:25 UTC by Tim Speetjens
Modified: 2023-09-14 23:58 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-05 09:42:19 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Thread dump (filtered) (28.00 KB, text/plain)
2015-10-27 15:39 UTC, Tim Speetjens
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-45168 0 None None None 2022-03-13 14:08:18 UTC

Description Tim Speetjens 2015-10-27 15:25:17 UTC
Description of problem:
When multiple API or GUI calls require a command to be run at the SPM, they get serialized (running in a synchronized block inside executeVDSCommand in IrsBrokerCommand)
When commands enter this method, a transaction is already started. This could lead to aborted transactions, when the command finishes on the SPM

Version-Release number of selected component (if applicable):
rhevm-3.5.3.1-1.4.el6ev.noarch

How reproducible:
Only when rhev manager is under load
Can be demonstrated by using thread dumps

Steps to Reproduce:
Run multiple storage commands, in parallel


Actual results:
Thread dumps show that a transaction was started, while the thread is blocked on the object (executeVDSCommand)

Expected results:
Transaction should only be started inside the synchronized block

Additional info:
The following can be seen in a thread dump, illustrating the issue:
"org.ovirt.thread.pool-6-thread-37" prio=10 tid=0x00007fe37408a000 nid=0x2341 waiting for monitor entry [0x00007fe26af6c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:156)
        - waiting to lock <0x0000000629e56ee8> (a java.lang.Object)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418)
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33)
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2060)
        at org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.runVdsCommand(StorageHandlingCommandBase.java:852)
        at org.ovirt.engine.core.bll.BaseImagesCommand.endSuccessfully(BaseImagesCommand.java:337)
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:635)
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:578)
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1963)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:210)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:149)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:118)
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:507)
        at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:460)
        at sun.reflect.GeneratedMethodAccessor1029.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Comment 1 Tim Speetjens 2015-10-27 15:39:09 UTC
Created attachment 1086906 [details]
Thread dump (filtered)

Attachment contains relevant threads filtered out of a real thread dump.

One thread holds the lock, while two others are waiting on it

Comment 2 Tal Nisan 2015-10-27 15:49:26 UTC
Tim, can you please attach the engine log?

Comment 3 Tim Speetjens 2015-10-27 15:55:39 UTC
Tal,

I can't as I don't have it. I attached the rest of the stack traces (leading up to vm add command)

Comment 4 Yaniv Lavi 2016-03-27 09:27:59 UTC
What is the scale that you are aiming for?
What is the current limit of spm actions?

Comment 5 Tim Speetjens 2016-03-29 07:20:17 UTC
Unsure if I understand your request correctly.

I'm not aiming for a number of possible requests per unit of time. This bug was filed to point to this flaw, which may be an implementation issue or a problem with the architecture.

Within a database transaction, two things happen which may take up longer than anticipated: Waiting for the request to get sent to the SPM/SDM (these are serealized); storage request completion on the SPM/SDM itself.

Comment 6 Yaniv Kaul 2016-11-29 21:12:18 UTC
I hope with more work being actually done by the HSMs in 4.1, the overall responsiveness is improved. Other than that, I don't see any work going into 4.1 for this - please defer.

Comment 7 Yaniv Lavi 2016-12-28 16:28:20 UTC
Can we please make sure the HSM RFE includes this aspect of testing as well?

Comment 8 Piotr Kliczewski 2017-01-02 08:09:40 UTC
This BZ requires changes how we send storage commands to vdsm. I would like to suggest improvements in this area. We use dedicated network connection for SPM I think that now when xmlrpc was removed we do not need it any more.

Comment 9 Liron Aravot 2017-03-27 16:46:02 UTC
I've reviewed the test cases -
This bug should be about changing CreateSnapshotFromTemplateCommand to be non transactive on the engine side.
CreateSnapshotFromTemplateCommand (case https://access.redhat.com/support/cases/#/case/01581944)

Additionally, we may change the vdsm verb to not attempt to access the storage in the sycnronous part of the request.

Comment 10 Tim Speetjens 2017-03-28 07:17:34 UTC
Hi Liron,

The scope of the issue I wanted to report is bigger than just this single SPM command.

Running the following in a logcollector, I received from my customer, did return at least one more:

$ zgrep 'ERROR.*nested exception is.*SQLException.*IJ000460' var/log/ovirt-engine/engine.log*gz

For a related sosreport, these SPM commands/verbs are seen:
CreateSnapshotVDSCommand
CopyImageVDSCommand

This is rather unsurprising, as this customer creates and destroys lots of VMs. Still, I think opportunity exists this happens with other commands. I'll download some more LogCollectors, and see if I can find others.

The issue surfaces as a database issue (java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction) related to the transaction timeout (which in ovirt is set to 600s). The transaction that was in use by that thread was reaped, as it was open for too long.

Looking at the thread dumps, I understood that the database transaction is started before the code that limits the requests to the SPM to be at most one at the same time. This means that the transaction may be open for quite a while longer than foreseen, leading up to the transaction timeout.

I believe this mechanism should be reviewed, as Piotr suggests, too.

Tim

Comment 11 Liron Aravot 2017-03-28 07:36:25 UTC
Thanks Tim,
we generally have two different issues -

1. VDSM verbs synchronous execution is limited to 3 minutes by the engine, if it takes more than 3 minutes the request will timeout in the engine - new verbs are added in such fashion so that their syncornous part won't perform any operation that may take a long time (like accessing the shared storage).

2. Transaction timeouts (which relates to #1 as well, as issue #1 increase the chances to encounter it) - as you said, we currently limit the transaction time to 10 minutes. New flows added to the engine are non transactive (which means that you won't encounter that timeout) while there are old verbs that are still transactive.
Looking forward, we aim to convert those flows to be all non transactive, but we give priority to flows that actually affect users - that's the reason I've changed this bug to handle that concrete flow - It'll be great if you could open a separate bug for each problematic flow you encounter.

Comment 12 Yaniv Kaul 2017-03-28 07:49:55 UTC
(In reply to Tim Speetjens from comment #10)
> Hi Liron,
> 
> The scope of the issue I wanted to report is bigger than just this single
> SPM command.
> 
> Running the following in a logcollector, I received from my customer, did
> return at least one more:
> 
> $ zgrep 'ERROR.*nested exception is.*SQLException.*IJ000460'
> var/log/ovirt-engine/engine.log*gz
> 
> For a related sosreport, these SPM commands/verbs are seen:
> CreateSnapshotVDSCommand
> CopyImageVDSCommand

Tim, what action are they performing here? Especially the 2nd one? In 4.1 we've moved SOME commands to use HSMs and not the SPM, thus reducing the load on the SPM. For example, CopyVolumeDataVDSCommand is a new HSM command and is currently used in both cold move flow and when cloning a VM from a template - are they performing the latter?

Comment 13 Tim Speetjens 2017-03-28 10:20:59 UTC
Yaniv,

Is this what you're looking for?

2015-10-14 10:54:41,959 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp-/127.0.0.1:8702-37) [2f0fc07e] FINISH, CopyImageVDSCommand, log id: 278d2741
2015-10-14 10:54:41,959 ERROR [org.ovirt.engine.core.bll.CreateImageTemplateCommand] (ajp-/127.0.0.1:8702-37) [2f0fc07e] Command org.ovirt.engine.core.bll.CreateImageTemplateCommand throw Vdc Bll exception. With error message VdcBLLException: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction (Failed with error ENGINE and code 5001)


Tim

Comment 14 Yaniv Kaul 2017-03-28 10:22:05 UTC
Allon - is the above improved in 4.1?

Comment 15 Allon Mureinik 2017-03-29 08:14:51 UTC
Creating a template from a VM requires copy-collapsing the images, which wasn't implemented it.
It's on our radar for 4.2.

Comment 16 Allon Mureinik 2017-11-15 07:41:57 UTC
(In reply to Allon Mureinik from comment #15)
> Creating a template from a VM requires copy-collapsing the images, which
> wasn't implemented it.
> It's on our radar for 4.2.

We won't be able to fit this in 4.2, unfortunately (although we did introduce several scaling improvements to take load off the SPM in general, which should help).

Comment 17 Yaniv Lavi 2018-02-05 09:42:19 UTC
This should be much improved in RHV 4.1 with moving of create VM from template to HSM. Please retest and open a new bug if you still see these issues.

Comment 18 Franta Kust 2019-05-16 12:55:14 UTC
BZ<2>Jira re-sync

Comment 19 Red Hat Bugzilla 2023-09-14 23:58:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.