Bug 1277057 - SPM commands are not sent from the manager to the hypervisor on a FIFO basis
Summary: SPM commands are not sent from the manager to the hypervisor on a FIFO basis
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.0.0-rc
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Carlos Mestre González
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-02 08:44 UTC by Tim Speetjens
Modified: 2022-03-13 14:09 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 20:30:23 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-45169 0 None None None 2022-03-13 14:09:08 UTC
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC
oVirt gerrit 48245 0 master MERGED infra/storage: send commands to the SPM on a FIFO basis 2016-06-05 12:42:47 UTC
oVirt gerrit 58667 0 ovirt-engine-4.0 MERGED infra/storage: send commands to the SPM on a FIFO basis 2016-06-06 16:25:54 UTC

Description Tim Speetjens 2015-11-02 08:44:26 UTC
Description of problem:
When requests are made that need to act on the SPM, the commands are serialized on the manager (the manager sends only one SPM command at a time). The mechanism used to enforce this is a synchronized object, which does not guarantee any fair ordering of requests.

When bursts of storage commands are sent to the manager through the API and the SPM cannot handle them quickly enough, tasks have to wait for other SPM tasks to complete. This waiting mechanism does not use a fair queue, so multiple issues may arise (a sketch illustrating the unfairness follows the list):
- Commands may be sent to the SPM only after severe delays, which is confusing (starvation).
- Database transactions may be aborted if the delays exceed the JBoss transaction timeout (for at least some commands, the transaction includes the wait on this synchronized object), leading to inconsistencies in the database that require manual cleanup.
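
The following is a minimal standalone sketch (not ovirt-engine code; all names are hypothetical) of why a synchronized object misorders requests: each thread records its arrival order before blocking on the monitor, and its acquisition order inside it. Because synchronized makes no fairness guarantee, the two orders can diverge under contention:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class UnfairMonitorDemo {
    private static final Object spmLock = new Object();
    private static final List<Integer> arrived = new CopyOnWriteArrayList<>();
    private static final List<Integer> acquired = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[20];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                arrived.add(id);              // order the requests were made
                synchronized (spmLock) {      // no FIFO guarantee here
                    acquired.add(id);         // order the requests were served
                    try {
                        Thread.sleep(10);     // simulate a slow SPM call
                    } catch (InterruptedException ignored) {
                    }
                }
            });
            workers[i].start();
            Thread.sleep(1);                  // stagger arrivals slightly
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("arrived:  " + arrived);
        System.out.println("acquired: " + acquired);
    }
}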

Version-Release number of selected component (if applicable):
rhevm-3.5.3.1-1.4.el6ev.noarch

How reproducible:
Using API calls/scripting and/or a slowed down filesystem
Demonstrating that the commands arrive out of order on the SPM is easier.

Steps to demonstrate commands are handled out of order:
1. Create a number of disks
2. Delete the disks one by one, without waiting for each API command to complete (for example, use a script that deletes a single disk via the API, run it in the background, then launch another in the background, and so on, leaving a small delay between requests); a hypothetical reproducer sketch follows these steps
3. Compare the order of the deletes on the SPM with the order in which the deletes were requested.
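
A hypothetical reproducer along those lines (the engine URL, credentials, and the REST disk-delete endpoint are assumptions, not from the report; TLS certificate trust setup is omitted). It fires one DELETE per disk ID in the background, 0.1 seconds apart, without waiting for completion:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BurstDelete {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the base URL and credentials for your setup;
        // disk IDs (created in step 1) are passed as program arguments.
        String base = "https://engine.example.com/ovirt-engine/api/disks/";
        String auth = "Basic " + Base64.getEncoder().encodeToString(
                "admin@internal:password".getBytes(StandardCharsets.UTF_8));
        for (String diskId : args) {
            new Thread(() -> {
                try {
                    HttpURLConnection c = (HttpURLConnection)
                            new URL(base + diskId).openConnection();
                    c.setRequestMethod("DELETE");
                    c.setRequestProperty("Authorization", auth);
                    System.out.println(diskId + " -> " + c.getResponseCode());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();           // fire and forget; do not wait for completion
            Thread.sleep(100);    // small delay between requests
        }
    }
}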

Actual results:
Commands arrive on the SPM out of order. Depending on the setup, transactions may even be aborted, with related errors visible in engine.log.

Expected results:
Commands should be sent from the engine to the SPM using a fair mechanism, to avoid starvation

Additional info:

Comment 1 Tim Speetjens 2015-11-02 08:57:56 UTC
A synchronized object is used in multiple places; it should be replaced by a 'fair' java.util.concurrent.locks.ReentrantLock, surrounding the code with lock() and unlock() in a try-finally block (see the sketch below).
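
A minimal sketch of that pattern (the class and method names are hypothetical, not from the ovirt-engine source):

import java.util.concurrent.locks.ReentrantLock;

public class SpmCommandGate {
    // 'true' requests fair mode: the longest-waiting thread acquires next,
    // so commands are sent to the SPM in FIFO order.
    private final ReentrantLock spmLock = new ReentrantLock(true);

    public void runSpmCommand(Runnable command) {
        spmLock.lock();
        try {
            command.run();        // only one SPM command in flight at a time
        } finally {
            spmLock.unlock();     // always release, even if the command throws
        }
    }
}

Note that fair mode trades some throughput for ordering: a fair ReentrantLock hands the lock to the longest-waiting thread instead of letting newly arriving threads barge in, which is exactly the behavior requested here.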

Comment 3 Piotr Kliczewski 2015-11-04 09:12:42 UTC
There was already an attempt [1] to fix this issue. The storage team decided to take over this change, so I am changing the whiteboard.


[1] https://gerrit.ovirt.org/#/c/37947/

Comment 4 Mike McCune 2016-03-28 22:55:34 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune with any questions.

Comment 5 Yaniv Lavi 2016-05-09 10:58:13 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 8 Allon Mureinik 2016-06-06 11:56:07 UTC
Liron, can you please backport this patch to the 4.0 branch too?

Comment 9 Tal Nisan 2016-06-06 12:25:53 UTC
I already did, Liron - please verify

Comment 10 Carlos Mestre González 2016-06-17 16:54:27 UTC
Verified this with a script: created 10 disks, then sent 10 delete requests 0.1 seconds apart, and checked the commands sent from the engine to the SPM; all were sent in the order they arrived (FIFO).

I ran this test multiple times and it worked every time.

version: rhevm-4.0.0.4-0.1.el7ev.noarch

Comment 12 errata-xmlrpc 2016-08-23 20:30:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

