Description of problem: When multiple API or GUI calls require a command to be run at the SPM, they get serialized (running in a synchronized block inside executeVDSCommand in IrsBrokerCommand) When commands enter this method, a transaction is already started. This could lead to aborted transactions, when the command finishes on the SPM Version-Release number of selected component (if applicable): rhevm-3.5.3.1-1.4.el6ev.noarch How reproducible: Only when rhev manager is under load Can be demonstrated by using thread dumps Steps to Reproduce: Run multiple storage commands, in parallel Actual results: Thread dumps show that a transaction was started, while the thread is blocked on the object (executeVDSCommand) Expected results: Transaction should only be started inside the synchronized block Additional info: The following can be seen in a thread dump, illustrating the issue: "org.ovirt.thread.pool-6-thread-37" prio=10 tid=0x00007fe37408a000 nid=0x2341 waiting for monitor entry [0x00007fe26af6c000] java.lang.Thread.State: BLOCKED (on object monitor) at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:156) - waiting to lock <0x0000000629e56ee8> (a java.lang.Object) at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418) at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2060) at org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.runVdsCommand(StorageHandlingCommandBase.java:852) at org.ovirt.engine.core.bll.BaseImagesCommand.endSuccessfully(BaseImagesCommand.java:337) at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:635) at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:578) at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1963) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:210) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:149) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:118) at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:507) at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:460) at sun.reflect.GeneratedMethodAccessor1029.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Created attachment 1086906 [details] Thread dump (filtered) Attachment contains relevant threads filtered out of a real thread dump. One thread holds the lock, while two others are waiting on it
Tim, can you please attach the engine log?
Tal, I can't as I don't have it. I attached the rest of the stack traces (leading up to vm add command)
What is the scale that you are aiming for? What is the current limit of spm actions?
Unsure if I understand your request correctly. I'm not aiming for a number of possible requests per unit of time. This bug was filed to point to this flaw, which may be an implementation issue or a problem with the architecture. Within a database transaction, two things happen which may take up longer than anticipated: Waiting for the request to get sent to the SPM/SDM (these are serealized); storage request completion on the SPM/SDM itself.
I hope with more work being actually done by the HSMs in 4.1, the overall responsiveness is improved. Other than that, I don't see any work going into 4.1 for this - please defer.
Can we please make sure the HSM RFE includes this aspect of testing as well?
This BZ requires changes how we send storage commands to vdsm. I would like to suggest improvements in this area. We use dedicated network connection for SPM I think that now when xmlrpc was removed we do not need it any more.
I've reviewed the test cases - This bug should be about changing CreateSnapshotFromTemplateCommand to be non transactive on the engine side. CreateSnapshotFromTemplateCommand (case https://access.redhat.com/support/cases/#/case/01581944) Additionally, we may change the vdsm verb to not attempt to access the storage in the sycnronous part of the request.
Hi Liron, The scope of the issue I wanted to report is bigger than just this single SPM command. Running the following in a logcollector, I received from my customer, did return at least one more: $ zgrep 'ERROR.*nested exception is.*SQLException.*IJ000460' var/log/ovirt-engine/engine.log*gz For a related sosreport, these SPM commands/verbs are seen: CreateSnapshotVDSCommand CopyImageVDSCommand This is rather unsurprising, as this customer creates and destroys lots of VMs. Still, I think opportunity exists this happens with other commands. I'll download some more LogCollectors, and see if I can find others. The issue surfaces as a database issue (java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction) related to the transaction timeout (which in ovirt is set to 600s). The transaction that was in use by that thread was reaped, as it was open for too long. Looking at the thread dumps, I understood that the database transaction is started before the code that limits the requests to the SPM to be at most one at the same time. This means that the transaction may be open for quite a while longer than foreseen, leading up to the transaction timeout. I believe this mechanism should be reviewed, as Piotr suggests, too. Tim
Thanks Tim, we generally have two different issues - 1. VDSM verbs synchronous execution is limited to 3 minutes by the engine, if it takes more than 3 minutes the request will timeout in the engine - new verbs are added in such fashion so that their syncornous part won't perform any operation that may take a long time (like accessing the shared storage). 2. Transaction timeouts (which relates to #1 as well, as issue #1 increase the chances to encounter it) - as you said, we currently limit the transaction time to 10 minutes. New flows added to the engine are non transactive (which means that you won't encounter that timeout) while there are old verbs that are still transactive. Looking forward, we aim to convert those flows to be all non transactive, but we give priority to flows that actually affect users - that's the reason I've changed this bug to handle that concrete flow - It'll be great if you could open a separate bug for each problematic flow you encounter.
(In reply to Tim Speetjens from comment #10) > Hi Liron, > > The scope of the issue I wanted to report is bigger than just this single > SPM command. > > Running the following in a logcollector, I received from my customer, did > return at least one more: > > $ zgrep 'ERROR.*nested exception is.*SQLException.*IJ000460' > var/log/ovirt-engine/engine.log*gz > > For a related sosreport, these SPM commands/verbs are seen: > CreateSnapshotVDSCommand > CopyImageVDSCommand Tim, what action are they performing here? Especially the 2nd one? In 4.1 we've moved SOME commands to use HSMs and not the SPM, thus reducing the load on the SPM. For example, CopyVolumeDataVDSCommand is a new HSM command and is currently used in both cold move flow and when cloning a VM from a template - are they performing the latter?
Yaniv, Is this what you're looking for? 2015-10-14 10:54:41,959 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (ajp-/127.0.0.1:8702-37) [2f0fc07e] FINISH, CopyImageVDSCommand, log id: 278d2741 2015-10-14 10:54:41,959 ERROR [org.ovirt.engine.core.bll.CreateImageTemplateCommand] (ajp-/127.0.0.1:8702-37) [2f0fc07e] Command org.ovirt.engine.core.bll.CreateImageTemplateCommand throw Vdc Bll exception. With error message VdcBLLException: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction (Failed with error ENGINE and code 5001) Tim
Allon - is the above improved in 4.1?
Creating a template from a VM requires copy-collapsing the images, which wasn't implemented it. It's on our radar for 4.2.
(In reply to Allon Mureinik from comment #15) > Creating a template from a VM requires copy-collapsing the images, which > wasn't implemented it. > It's on our radar for 4.2. We won't be able to fit this in 4.2, unfortunately (although we did introduce several scaling improvements to take load off the SPM in general, which should help).
This should be much improved in RHV 4.1 with moving of create VM from template to HSM. Please retest and open a new bug if you still see these issues.
BZ<2>Jira re-sync
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days