Bug 1401585 - The pendingOperation task processing will lose tasks in case an exception is thrown
Summary: The pendingOperation task processing will lose tasks in case an exception is thrown
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: 1.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.6
Target Release: 1.2.10
Assignee: Piotr Kliczewski
QA Contact: Aleksei Slaikovskii
URL:
Whiteboard:
Duplicates: 1401896, 1402597
Depends On:
Blocks:
 
Reported: 2016-12-05 15:44 UTC by Roy Golan
Modified: 2017-01-18 07:28 UTC
CC: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-01-18 07:28:46 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+


Attachments
Engine log with debug prints (4.42 MB, application/x-gzip), 2016-12-07 09:21 UTC, Fred Rolland


Links
oVirt gerrit 67391: 2016-12-05 16:04:05 UTC
oVirt gerrit 67840 (ovirt-4.0, MERGED): Safe publication of the queue to avoid job loss, 2016-12-06 09:22:46 UTC
oVirt gerrit 67872 (master, MERGED): jsonrpc: version bump, 2016-12-08 11:30:32 UTC
oVirt gerrit 67878 (ovirt-4.0, MERGED): Version bump, 2016-12-08 10:21:32 UTC
oVirt gerrit 67883 (ovirt-engine-4.0, MERGED): jsonrpc: version bump, 2016-12-08 11:54:46 UTC
oVirt gerrit 67901 (master, MERGED): Poll the q instead of iterator, 2016-12-07 13:12:20 UTC
oVirt gerrit 67976 (ovirt-4.0, MERGED): Poll the q instead of iterator, 2016-12-08 10:21:37 UTC
oVirt gerrit 68024 (ovirt-engine-4.0.6, MERGED): jsonrpc: version bump, 2016-12-08 20:02:06 UTC

Description Roy Golan 2016-12-05 15:44:02 UTC
Description of problem:

Task queue processing is done on a local copy of the queue. While waiting for a response, an exception aborts processing of that local copy and the rest of the jobs inside it are lost; meanwhile the main queue no longer holds those tasks. The result is lost jobs, i.e. lost API calls, and in the worst case a lost connect request, which can leave the VdsManager lock held indefinitely while the rest of the engine-wide threads start reporting network exceptions and contend on the same VdsManager lock. This in turn exhausts the thread pool task queue, and the engine daemons (Quartz scheduled tasks) stop working.

The attached thread dump and engine log show how the "Reactor Client" thread waits forever on connectTask.get() without making any progress, while the rest of the threads wait for the VdsManager lock that they can never acquire.
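To make the failure mode concrete, here is a minimal, self-contained sketch; the class and method names below are hypothetical stand-ins, not the actual vdsm-jsonrpc-java code. The scheduler swaps the pending queue for a fresh one and iterates over the detached copy, so an exception thrown by one task silently drops every task still left in that copy.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LossySchedulerSketch {

    // Main queue; callers add tasks here (hypothetical stand-in for the pending operations queue).
    private volatile Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

    public void queue(Runnable task) {
        pendingOperations.add(task);
    }

    // Runs on the reactor thread: it detaches the current queue and iterates
    // over that local copy. The main queue no longer holds these tasks, so an
    // exception escaping the loop loses everything left in the local copy.
    public void performPendingOperations() {
        Queue<Runnable> local = pendingOperations;
        pendingOperations = new ConcurrentLinkedQueue<>();
        for (Runnable task : local) {
            task.run(); // an I/O failure surfacing here aborts the loop
        }
    }

    public static void main(String[] args) {
        LossySchedulerSketch scheduler = new LossySchedulerSketch();
        scheduler.queue(() -> { throw new RuntimeException("simulated network failure"); });
        scheduler.queue(() -> System.out.println("never runs: this task is lost"));
        try {
            scheduler.performPendingOperations();
        } catch (RuntimeException e) {
            System.out.println("processing aborted: " + e.getMessage());
        }
    }
}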



Version-Release number of selected component (if applicable):
1.2.9; also affects ovirt-engine 4.0 and the 4.1 master branch

How reproducible:
About 50%, since it depends on timing.

Steps to Reproduce:
1. Have 3 hosts and some VMs and disks (so there is monitoring activity)
2. Disconnect the network (pull the plug, drop packets, or switch networks)
3. Wait until a SocketException or some other I/O exception is thrown

Actual results:
Tasks that are in the queue get lost and are never processed. This may render the system unusable.

Expected results:
Task processing failures are isolated, the rest of the tasks get processed, and the system does not contend on the VdsManager lock (VdsManager.handleNetworkException -> needs an engine fix as well).

Additional info:
Happened both in a production environment and in a development environment.
See ReactorScheduler#performPendingOperations and ReactorClient#connect, especially connectTask.get().
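For illustration, a tiny standalone sketch of why this looks like a permanent hang (hypothetical names, not the real ReactorClient API): if the task that would complete the connect FutureTask is dropped, a get() on that future never returns; a timeout is used below only so the example terminates.

import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DroppedConnectTaskSketch {
    public static void main(String[] args) throws Exception {
        // The connect request is wrapped in a FutureTask and handed to the
        // reactor thread; in the buggy scenario that task is lost before it runs.
        FutureTask<Void> connectTask = new FutureTask<Void>(() -> null);

        try {
            // The real code waits with no timeout (connectTask.get()), so a
            // dropped task means waiting forever while holding the VdsManager
            // lock. A timeout is used here only so this sketch terminates.
            connectTask.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("connectTask never ran; a plain get() would block forever");
        }
    }
}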

Comment 1 sefi litmanovich 2016-12-06 11:47:48 UTC
*** Bug 1401896 has been marked as a duplicate of this bug. ***

Comment 2 Roy Golan 2016-12-06 13:53:47 UTC
underlying issue found -

ConcurrentHashMap has a weakly consistent iterator, so while iterating we might not see the latest update. This can cause tasks to disappear.

Comment 3 Piotr Kliczewski 2016-12-06 14:12:16 UTC
You surely mean ConcurrentLinkedQueue, which has the same property. It is important to note that we iterate only over the previous collection. Once we hold the reference, we use it to run the tasks, and adding to that specific instance is no longer possible, so the weakly consistent iterator does not seem to be the issue here; the real concern is making sure that the new reference is visible to all threads.
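For context, here is a minimal sketch of the direction suggested by the gerrit summaries above ("Safe publication of the queue to avoid job loss", "Poll the q instead of iterator"). It is an assumption-based illustration, not the literal patch: the processing loop polls the shared queue directly and isolates each task's failure, so neither a stale reference nor a single failing task can drop the remaining tasks.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PollingSchedulerSketch {

    // One shared queue for the lifetime of the scheduler: nothing is republished,
    // so there is no visibility problem with a freshly swapped-in instance.
    private final Queue<Runnable> pendingOperations = new ConcurrentLinkedQueue<>();

    public void queue(Runnable task) {
        pendingOperations.add(task);
    }

    // Poll the queue instead of iterating over a detached snapshot: each task
    // is removed exactly once, and a failure is contained per task so the
    // remaining tasks are still processed.
    public void performPendingOperations() {
        Runnable task;
        while ((task = pendingOperations.poll()) != null) {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.out.println("task failed in isolation: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        PollingSchedulerSketch scheduler = new PollingSchedulerSketch();
        scheduler.queue(() -> { throw new RuntimeException("simulated network failure"); });
        scheduler.queue(() -> System.out.println("still runs: later tasks are not lost"));
        scheduler.performPendingOperations();
    }
}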

Comment 4 Fred Rolland 2016-12-07 09:21:13 UTC
Created attachment 1228943 [details]
Engine log with debug prints

Comment 5 Fred Rolland 2016-12-07 09:25:09 UTC
I reproduced the issue with the master version (without the timeout fix).

See attached the end of the log and the debug prints I added to the code.

The following task is inserted but never executed:
inserting future task java.util.concurrent.FutureTask@c3aaf7

Also, in the thread dump we can see that it is stuck, waiting for the FutureTask.

Comment 6 Piotr Kliczewski 2016-12-07 09:40:07 UTC
Without the timeout fix the probability of this issue occurring is lower, but it still needs to be fixed. You just proved that the HotSpot javadocs do not describe what OpenJDK is actually doing.

Comment 7 Roy Golan 2016-12-08 07:45:12 UTC
*** Bug 1402597 has been marked as a duplicate of this bug. ***

Comment 8 Roy Golan 2016-12-08 07:59:46 UTC
Need to cherry-pick 5e3829d to 4.0. Should we create a clone of this bug?

Comment 9 Oved Ourfali 2016-12-08 08:04:53 UTC
I think we should just leave this one.

Comment 10 Aleksei Slaikovskii 2016-12-14 14:36:48 UTC
This won't be automated, as it's a code change only and it is not reproducible anymore; a basic sanity run will be sufficient for the verification process.

