Bug 1401585

Summary: The pendingOperation task processing will lose tasks in case an exception is thrown
Product: [oVirt] vdsm-jsonrpc-java    Reporter: Roy Golan <rgolan>
Component: Core    Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE    QA Contact: Aleksei Slaikovskii <aslaikov>
Severity: urgent    Docs Contact:
Priority: unspecified
Version: 1.2.6    CC: bugs, emarcian, frolland, gklein, mperina, oourfali, slitmano
Target Milestone: ovirt-4.0.6    Keywords: CodeChange, Performance, ZStream
Target Release: 1.2.10    Flags: rule-engine: ovirt-4.0.z+
rule-engine: blocker+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-18 07:28:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
    Engine log with debug prints (flags: none)

Description Roy Golan 2016-12-05 15:44:02 UTC
Description of problem:

The task queue processing is done on a local copy of the queue. While waiting for a response, an exception aborts processing of that local copy, and the remaining jobs inside it are lost; the main queue no longer holds those tasks either. The result is lost jobs, i.e. lost API calls, and in the worst case a lost connect request, which can leave the VdsManager lock held indefinitely while the rest of the engine-wide threads report network exceptions and contend on the same VdsManager lock. This also exhausts the thread pool's task queue, and the engine daemons (Quartz scheduled tasks) stop working.

The attached dump and engine log show how the "Reactor Client" thread waits forever on connectTask.get() without any progress, while the rest of the threads wait for the VdsManager lock that they can never acquire.
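For illustration, here is a minimal Java sketch of the failure mode described above. This is a simplified model, not the actual ReactorScheduler code; the class and field names are made up:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Simplified model of the bug: the scheduler swaps out the shared queue
    // and processes the old instance locally. If one task throws, the loop
    // aborts and every task still left in the local copy is silently lost,
    // because the shared queue no longer references them.
    public class LossySchedulerSketch {
        private volatile Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

        public void queueOperation(Runnable task) {
            pendingOperations.add(task);
        }

        public void performPendingOperations() {
            Queue<Runnable> local = pendingOperations;         // take the current queue
            pendingOperations = new ConcurrentLinkedQueue<>();  // new tasks go to a fresh queue

            for (Runnable task : local) {
                task.run(); // an exception here drops the rest of 'local'
            }
        }
    }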



Version-Release number of selected component (if applicable):
1.2.9; also affects ovirt-engine 4.0 and the 4.1 master branch

How reproducible:
About 50%, since it depends on timing.

Steps to Reproduce:
1. Have 3 hosts and some VMs and disks (so there is monitoring activity)
2. Disconnect the network (pull the plug, drop packets, or switch networks)
3. Wait until a SocketException or some other IO exception is thrown

Actual results:
Tasks that are in the queue get lost and are never processed. This may render the system unusable.

Expected results:
Each task's failure is isolated, the rest of the tasks get processed, and the system does not end up contending on the VdsManager lock (VdsManager.handleNetworkException; this needs an engine fix as well). A sketch of such isolation follows.
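As a sketch of what "fail in isolation" could look like in the processing loop, assuming a per-task catch is acceptable (this is not necessarily the actual patch):

    import java.util.Queue;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Sketch only: wrap each task so one failure cannot drop the others.
    public final class IsolatedTaskRunner {
        private static final Logger LOG = Logger.getLogger(IsolatedTaskRunner.class.getName());

        public static void runAll(Queue<Runnable> local) {
            for (Runnable task : local) {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // Log and continue with the remaining tasks.
                    LOG.log(Level.SEVERE, "Pending operation failed", e);
                }
            }
        }
    }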

Additional info:
Happened both in a production environment and in a development environment.
See ReactorScheduler#performPendingOperations and ReactorClient#connect, especially connectTask.get().
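A bounded wait on the connect future would at least surface a lost task instead of blocking the caller (and the VdsManager lock) forever. Rough sketch with an arbitrary timeout value, not the actual fix:

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public final class ConnectWaitSketch {
        // Sketch: replace an unbounded connectTask.get() with a bounded wait so
        // a dropped task turns into a TimeoutException instead of a hang.
        public static <T> T awaitConnect(Future<T> connectTask)
                throws InterruptedException, ExecutionException, TimeoutException {
            return connectTask.get(30, TimeUnit.SECONDS); // timeout value is an assumption
        }
    }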

Comment 1 sefi litmanovich 2016-12-06 11:47:48 UTC
*** Bug 1401896 has been marked as a duplicate of this bug. ***

Comment 2 Roy Golan 2016-12-06 13:53:47 UTC
underlying issue found -

ConcurrentHashMap has a weakly consistent iterator, so while iterating we might not see the latest update. This can cause tasks to disappear.

Comment 3 Piotr Kliczewski 2016-12-06 14:12:16 UTC
You surely mean ConcurrentLinkedQueue, which has the same property. It is important to note that we iterate only over the previous collection. Once we hold the reference, we use it to run the tasks, and adding to that specific instance is no longer possible, so the weakly consistent iterator does not seem to be the issue here; the real question is making sure the new reference is visible to all the threads.
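For reference, one way to drain a ConcurrentLinkedQueue that sidesteps both concerns (iterator consistency and visibility of a swapped reference) is to poll the shared instance directly rather than replace it. Illustration only, not necessarily what the fix does:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public final class PollDrainSketch {
        private final Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

        // poll() removes each task atomically from the shared queue, so no task
        // can be stranded in a local snapshot and no reference swap is needed.
        public void performPendingOperations() {
            Runnable task;
            while ((task = pendingOperations.poll()) != null) {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // keep draining even if one task fails
                }
            }
        }
    }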

Comment 4 Fred Rolland 2016-12-07 09:21:13 UTC
Created attachment 1228943 [details]
Engine log with debug prints

Comment 5 Fred Rolland 2016-12-07 09:25:09 UTC
I reproduced the issue with the master version (without the timeout fix).

See the attached end of the log, along with the debug prints that I added in the code.

The following task is inserted but never executed:
inserting future task java.util.concurrent.FutureTask@c3aaf7

Also, in the thread dump, we can see that it is stuck waiting for the FutureTask.

Comment 6 Piotr Kliczewski 2016-12-07 09:40:07 UTC
Without the timeout fix the probability of this issue occurring is lower, but it still needs to be fixed. You just proved that the HotSpot javadocs do not describe what OpenJDK is actually doing.

Comment 7 Roy Golan 2016-12-08 07:45:12 UTC
*** Bug 1402597 has been marked as a duplicate of this bug. ***

Comment 8 Roy Golan 2016-12-08 07:59:46 UTC
We need to cherry-pick 5e3829d to 4.0. Should we create a clone of this bug?

Comment 9 Oved Ourfali 2016-12-08 08:04:53 UTC
I think we should just leave this one.

Comment 10 Aleksei Slaikovskii 2016-12-14 14:36:48 UTC
This won't be automated, as it is a code change only and is no longer reproducible; a basic sanity run will be sufficient for the verification process.