Bug 1401585 - The pendingOperation task processing will lose tasks in case an exception is thrown
Summary: The pendingOperation task processing will lose tasks in case an exception is thrown
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: 1.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.6
Target Release: 1.2.10
Assignee: Piotr Kliczewski
QA Contact: Aleksei Slaikovskii
URL:
Whiteboard:
Duplicates: 1401896, 1402597
Depends On:
Blocks:
 
Reported: 2016-12-05 15:44 UTC by Roy Golan
Modified: 2017-01-18 07:28 UTC
CC: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-01-18 07:28:46 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+


Attachments
Engine log with debug prints (4.42 MB, application/x-gzip), 2016-12-07 09:21 UTC, Fred Rolland


Links
oVirt gerrit 67391: 2016-12-05 16:04:05 UTC
oVirt gerrit 67840 (ovirt-4.0, MERGED): Safe publication of the queue to avoid job loss, 2016-12-06 09:22:46 UTC
oVirt gerrit 67872 (master, MERGED): jsonrpc: version bump, 2016-12-08 11:30:32 UTC
oVirt gerrit 67878 (ovirt-4.0, MERGED): Version bump, 2016-12-08 10:21:32 UTC
oVirt gerrit 67883 (ovirt-engine-4.0, MERGED): jsonrpc: version bump, 2016-12-08 11:54:46 UTC
oVirt gerrit 67901 (master, MERGED): Poll the q instead of iterator, 2016-12-07 13:12:20 UTC
oVirt gerrit 67976 (ovirt-4.0, MERGED): Poll the q instead of iterator, 2016-12-08 10:21:37 UTC
oVirt gerrit 68024 (ovirt-engine-4.0.6, MERGED): jsonrpc: version bump, 2016-12-08 20:02:06 UTC

Description Roy Golan 2016-12-05 15:44:02 UTC
Description of problem:

Task queue processing is done on a local copy of the queue. While waiting for a response, an exception aborts processing of that local copy and the rest of the jobs inside it are lost; meanwhile the main queue no longer holds those tasks. The result is lost jobs, i.e. lost API calls, and in the worst case a lost connect request, which can leave the VdsManager lock held indefinitely while the rest of the engine-wide threads start reporting network exceptions and contend on the same VdsManager lock. This in turn exhausts the thread pool task queue, and the engine daemons (Quartz scheduled tasks) stop working.

The attached thread dump and engine log show how the "Reactor Client" thread waits forever on connectTask.get() without making any progress, while the rest of the threads wait for the VdsManager lock that they can never acquire.
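To make the failure mode concrete, here is a minimal, self-contained sketch; the class and method names below are hypothetical stand-ins, not the actual vdsm-jsonrpc-java code. The scheduler swaps the pending queue for a fresh one and iterates over the detached copy, so an exception thrown by one task silently drops every task still left in that copy.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LossySchedulerSketch {

    // Main queue; callers add tasks here (hypothetical stand-in for the pending operations queue).
    private volatile Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

    public void queue(Runnable task) {
        pendingOperations.add(task);
    }

    // Runs on the reactor thread: it detaches the current queue and iterates
    // over that local copy. The main queue no longer holds these tasks, so an
    // exception escaping the loop loses everything left in the local copy.
    public void performPendingOperations() {
        Queue<Runnable> local = pendingOperations;
        pendingOperations = new ConcurrentLinkedQueue<>();
        for (Runnable task : local) {
            task.run(); // an I/O failure surfacing here aborts the loop
        }
    }

    public static void main(String[] args) {
        LossySchedulerSketch scheduler = new LossySchedulerSketch();
        scheduler.queue(() -> { throw new RuntimeException("simulated network failure"); });
        scheduler.queue(() -> System.out.println("never runs: this task is lost"));
        try {
            scheduler.performPendingOperations();
        } catch (RuntimeException e) {
            System.out.println("processing aborted: " + e.getMessage());
        }
    }
}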



Version-Release number of selected component (if applicable):
1.2.9; also affects ovirt-engine 4.0 and the 4.1 master branch

How reproducible:
About 50%, since it depends on timing.

Steps to Reproduce:
1. Have 3 hosts and some VMs and disks (so there is monitoring activity)
2. Disconnect the network (pull the plug, drop packets, or switch networks)
3. Wait until a SocketException or some other I/O exception is thrown

Actual results:
Tasks that are in the queue get lost and are never processed. This may render the system unusable.

Expected results:
Task processing failures are isolated, the rest of the tasks get processed, and the system does not contend on the VdsManager lock (VdsManager.handleNetworkException -> needs an engine fix as well).

Additional info:
Happened both in a production environment and in a development environment.
See ReactorScheduler#performPendingOperations and ReactorClient#connect, especially connectTask.get().
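For illustration, a tiny standalone sketch of why this looks like a permanent hang (hypothetical names, not the real ReactorClient API): if the task that would complete the connect FutureTask is dropped, a get() on that future never returns; a timeout is used below only so the example terminates.

import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DroppedConnectTaskSketch {
    public static void main(String[] args) throws Exception {
        // The connect request is wrapped in a FutureTask and handed to the
        // reactor thread; in the buggy scenario that task is lost before it runs.
        FutureTask<Void> connectTask = new FutureTask<Void>(() -> null);

        try {
            // The real code waits with no timeout (connectTask.get()), so a
            // dropped task means waiting forever while holding the VdsManager
            // lock. A timeout is used here only so this sketch terminates.
            connectTask.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("connectTask never ran; a plain get() would block forever");
        }
    }
}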

Comment 1 sefi litmanovich 2016-12-06 11:47:48 UTC
*** Bug 1401896 has been marked as a duplicate of this bug. ***

Comment 2 Roy Golan 2016-12-06 13:53:47 UTC
underlying issue found -

ConcurrentHashMap has a weakly consistent iterator, so while iterating we might not see the latest update. This can cause tasks to disappear.

Comment 3 Piotr Kliczewski 2016-12-06 14:12:16 UTC
You surely mean ConcurrentLinkedQueue, which has the same property. It is important to note that we iterate only over the previous collection. Once we hold the reference, we use it to run the tasks, and adding to that specific instance is no longer possible, so the weakly consistent iterator does not seem to be the issue here; the real concern is making sure that the new reference is visible to all threads.
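For context, here is a minimal sketch of the direction suggested by the gerrit summaries above ("Safe publication of the queue to avoid job loss", "Poll the q instead of iterator"). It is an assumption-based illustration, not the literal patch: the processing loop polls the shared queue directly and isolates each task's failure, so neither a stale reference nor a single failing task can drop the remaining tasks.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PollingSchedulerSketch {

    // One shared queue for the lifetime of the scheduler: nothing is republished,
    // so there is no visibility problem with a freshly swapped-in instance.
    private final Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

    public void queue(Runnable task) {
        pendingOperations.add(task);
    }

    // Poll the queue instead of iterating over a detached snapshot: each task
    // is removed exactly once, and a failure is contained per task so the
    // remaining tasks are still processed.
    public void performPendingOperations() {
        Runnable task;
        while ((task = pendingOperations.poll()) != null) {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.out.println("task failed in isolation: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        PollingSchedulerSketch scheduler = new PollingSchedulerSketch();
        scheduler.queue(() -> { throw new RuntimeException("simulated network failure"); });
        scheduler.queue(() -> System.out.println("still runs: later tasks are not lost"));
        scheduler.performPendingOperations();
    }
}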

Comment 4 Fred Rolland 2016-12-07 09:21:13 UTC
Created attachment 1228943 [details]
Engine log with debug prints

Comment 5 Fred Rolland 2016-12-07 09:25:09 UTC
I reproduced the issue with the master version (without the timeout fix).

See attached the end of the log and the debug prints I added to the code.

The following task is inserted but never executed:
inserting future task java.util.concurrent.FutureTask@c3aaf7

Also, in the thread dump we can see that it is stuck, waiting for the FutureTask.

Comment 6 Piotr Kliczewski 2016-12-07 09:40:07 UTC
Without the timeout fix the probability of this issue occurring is lower, but it still needs to be fixed. You just proved that the HotSpot javadocs do not describe what OpenJDK is actually doing.

Comment 7 Roy Golan 2016-12-08 07:45:12 UTC
*** Bug 1402597 has been marked as a duplicate of this bug. ***

Comment 8 Roy Golan 2016-12-08 07:59:46 UTC
Need to cherry-pick 5e3829d to 4.0. Should we create a clone of this bug?

Comment 9 Oved Ourfali 2016-12-08 08:04:53 UTC
I think we should just leave this one.

Comment 10 Aleksei Slaikovskii 2016-12-14 14:36:48 UTC
This won't be automated, as it's a code change only and it is not reproducible anymore; a basic sanity run will be sufficient for the verification process.

