Description of problem:
Task queue processing is done on a local copy of the queue. While waiting for a response, an exception aborts processing of the local copy, and the remaining jobs inside that local copy are lost. The main queue, meanwhile, no longer holds those tasks. This results in lost jobs, i.e. lost API calls and, in the worst case, a lost connect request, which can cause the VdsManager lock to be held indefinitely while the rest of the engine-wide threads start reporting network exceptions and contend on the same VdsManager lock. This also exhausts the thread pool task queue, and the engine daemons (Quartz scheduled tasks) stop working. The attached dump and engine log show how the "Reactor Client" waits forever on connectTask.get() without any progress while the rest of the threads wait for a VdsManager lock that they can never obtain.

Version-Release number of selected component (if applicable):
1.2.9; affects ovirt-engine 4.0 and 4.1 master

How reproducible:
50%, because it depends on timing.

Steps to Reproduce:
1. Have 3 hosts and some VMs and disks (so there is monitoring activity)
2. Disconnect the network (pull the plug, drop packets, or switch networks)
3. Wait until the SocketException or some other I/O exception is thrown

Actual results:
Tasks that are in the queue get lost and are never processed. This may render the system unusable.

Expected results:
Task processing fails in isolation, the rest of the tasks get processed, and the system does not contend on the VdsManager lock (VdsManager.handleNetworkException; this needs an engine fix as well).

Additional info:
Happened in a production environment and in a development environment.
See ReactorScheduler#performPendingOperations and ReactorClient#connect, especially connectTask.get().
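The expected behaviour above (a failing task must not take the remaining tasks down with it) can be sketched roughly as follows. This is only an illustration, not the actual ReactorScheduler code: the class name IsolatedTaskDrain and the method drain are hypothetical, and the sketch assumes the pending tasks are plain Runnable instances held in a ConcurrentLinkedQueue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration; not the vdsm-jsonrpc-java implementation.
public class IsolatedTaskDrain {

    /**
     * Runs every queued task. A task that throws is recorded and skipped,
     * so the tasks queued behind it still run.
     */
    public static List<RuntimeException> drain(Queue<Runnable> pending) {
        List<RuntimeException> failures = new ArrayList<>();
        Runnable task;
        // poll() removes one task at a time from the shared queue, so no
        // task ever lives only in a local copy that an exception could drop.
        while ((task = pending.poll()) != null) {
            try {
                task.run();
            } catch (RuntimeException e) {
                // Isolate the failure: remember it and continue with the rest.
                failures.add(e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
        pending.add(() -> System.out.println("task 1"));
        pending.add(() -> { throw new IllegalStateException("simulated I/O failure"); });
        pending.add(() -> System.out.println("task 3"));
        System.out.println("failures: " + drain(pending).size());
    }
}
```

With this shape, an exception in the second task still lets the third task run, and the caller can decide what to do with the collected failures.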
*** Bug 1401896 has been marked as a duplicate of this bug. ***
Underlying issue found: ConcurrentHashMap has a weakly consistent iterator, so while iterating we might not see the latest update. This will cause tasks to disappear.
For sure you mean ConcurrentLinkedQueue, which has the same property. It is important to note that we iterate only over the previous collection. Once we hold that reference, we use it to run the tasks, and adding to that specific instance is no longer possible. So the weakly consistent iterator does not seem to be the cause here; the issue is rather making sure that the new reference is visible to all the threads.
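To illustrate why visibility of the swapped reference matters, here is a single-threaded replay of one interleaving that two threads could produce if a producer still holds the old queue reference. Everything in it is an assumption made for illustration: the class SwapRaceSketch, the field pending, and the swap-then-drain shape are hypothetical, and this thread does not establish that the real scheduler permits this exact ordering.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of a swap-and-drain race; not the real scheduler.
public class SwapRaceSketch {

    // Stand-in for the scheduler's pending-operations field.
    private volatile Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

    /**
     * Replays, on one thread, the steps of a producer and a consumer in an
     * unlucky order. Returns how many tasks end up in the discarded queue.
     */
    public int replayLostTask() {
        Queue<Runnable> seenByProducer = pending;   // producer reads the field
        Queue<Runnable> snapshot = pending;         // consumer takes the old queue
        pending = new ConcurrentLinkedQueue<>();    // consumer publishes a fresh queue

        Runnable task;
        while ((task = snapshot.poll()) != null) {  // consumer drains the old queue
            task.run();
        }

        // The producer now adds to the instance it read earlier, which the
        // consumer has already finished with and will never look at again.
        seenByProducer.add(() -> { });
        return seenByProducer.size();
    }

    /** Number of tasks the consumer will see on its next pass. */
    public int visibleTaskCount() {
        return pending.size();
    }
}
```

In this replay the task sits in the old queue while the queue that the consumer will read next is empty, which matches the symptom of a task being inserted but never executed.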
Created attachment 1228943 [details] Engine log with debug prints
I reproduced the issue with the master version (without the timeout fix). See the attached end of the log and the debug prints that I added to the code. The following task is inserted but never executed:
inserting future task java.util.concurrent.FutureTask@c3aaf7
Also, in the thread dump we can see that it is stuck, waiting for the FutureTask.
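The stuck thread in the dump follows from how java.util.concurrent.FutureTask behaves: get() without a timeout blocks until some thread calls run(), so a task that was dropped from the queue blocks its waiter forever. A minimal sketch of a bounded wait is shown below. The class BoundedWait and the method finishedWithin are hypothetical names, and this is not claimed to be the timeout fix mentioned in this thread.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; shows a bounded wait on a Future.
public class BoundedWait {

    /**
     * Returns true if the task finished (normally or with an exception)
     * within the timeout, false if it did not finish in time or the
     * waiting thread was interrupted.
     */
    public static boolean finishedWithin(Future<?> task, long timeout, TimeUnit unit) {
        try {
            task.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Nobody ran the task in time; the caller can release locks and retry.
            return false;
        } catch (ExecutionException e) {
            // The task did run, but its body threw.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

A FutureTask that is created but never run makes finishedWithin return false after the timeout instead of blocking indefinitely. Note that a bounded wait only limits the damage for the waiter; it does not recover the lost task.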
Without the timeout fix, the probability of hitting this issue is lower. Still, it needs to be fixed. You just proved that the HotSpot javadocs do not describe what OpenJDK is doing.
*** Bug 1402597 has been marked as a duplicate of this bug. ***
We need to cherry-pick 5e3829d to 4.0. Should we create a clone of this bug?
I think we should just leave this one.
This won't be automated, as it is a code change only and is no longer reproducible; basic sanity testing will be sufficient for verification.