Bug 1208429
Summary: Thread pools are depleted by ClusterTopologyManagerImpl.waitForView(), causing deadlock
| Field | Value |
|---|---|
| Product | [JBoss] JBoss Data Grid 6 |
| Component | Infinispan |
| Status | CLOSED CURRENTRELEASE |
| Severity | unspecified |
| Priority | high |
| Version | 6.4.0 |
| Target Milestone | ER4 |
| Target Release | 6.5.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Osamu Nagano <onagano> |
| Assignee | Dan Berindei <dberinde> |
| QA Contact | Martin Gencur <mgencur> |
| CC | afield, jdg-bugs, pzapataf, slaskawi, ttarrant |
| Type | Bug |
| Doc Type | Bug Fix |
| Last Closed | 2015-06-23 12:24:14 UTC |
Description
Osamu Nagano
2015-04-02 08:57:40 UTC
Created attachment 1010069 [details]
p2.coord-deadlock.waitForView.zip
Copied from JIRA:

The fix for bug 1217380 (ISPN-5106) already fixed the waitForView() problem partially, but it wasn't enough. When the coordinator installed two views in quick succession, the thread updating the cache members list for the first view would block waiting for the CacheTopologyControlCommand(POLICY_GET_STATUS) responses from the other members. Then, because the other members got the newer view before sending their join requests, all the remote-executor and OOB threads would block in waitForView(), and there was no way to receive the POLICY_GET_STATUS responses, since processing a response also needs an OOB thread (a minimal sketch of this pool-exhaustion pattern appears after this description).

The solution was to update the cache members asynchronously (also sketched below). Testing with a limited number of OOB/remote-executor threads exposed some other deadlocks as well, and the pull request tries to plug as many of them as possible. However, because the caches will not start in the same order on every node, there is always the possibility of two nodes sending state transfer requests to each other (for different caches) and not being able to process the responses, because the OOB threads are all blocked waiting for those very responses. So a deadlock is still possible if remote-executor.max-threads + OOB.max_threads < number of caches.

The remote executor configuration is ignored in JDG server (bug 1219417); the pull request includes a fix for that as well. The attached configuration didn't specify anything for the remote executor, so I would suggest defining a blocking-queueless-thread-pool:

```xml
<transport ... remote-command-executor="infinispan-remote"/>
...
<blocking-queueless-thread-pool name="infinispan-remote">
    <max-threads count="400"/>
    <keepalive-time time="0" unit="milliseconds"/>
    <thread-factory name="infinispan-factory"/>
</blocking-queueless-thread-pool>
```

Ran several tests using the following scenario and JDG 6.5.0 CR1:

1) Start 4 servers hosting 1000 caches.
2) Repeat the following 20 times:
   a) Start writing to the caches using 3 clients.
   b) Kill all 4 servers.
   c) Restart all 4 servers and wait until there are no topology changes or cache rehashes.

According to Dan, this scenario should be sufficient to verify this issue.
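The core failure mode above is generic thread-pool exhaustion: every pool thread blocks waiting for a message that only another thread from the same pool can deliver. Below is a minimal, self-contained Java sketch of that pattern; the names are illustrative only (not Infinispan code), and the fixed pool stands in for the bounded OOB pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the deadlock pattern: every worker in a bounded pool blocks
// waiting for a "response" that can only be delivered by another worker from
// the same pool. With all workers blocked, the response task sits in the
// queue forever. Hypothetical names, not Infinispan code.
public class PoolExhaustionDemo {
    public static void main(String[] args) throws InterruptedException {
        int maxThreads = 4; // stands in for OOB.max_threads
        ExecutorService oobPool = Executors.newFixedThreadPool(maxThreads);

        CountDownLatch response = new CountDownLatch(1);

        // Submit as many blocking "join requests" as there are pool threads.
        for (int i = 0; i < maxThreads; i++) {
            oobPool.submit(() -> {
                try {
                    response.await(); // each worker blocks, like waitForView()
                } catch (InterruptedException ignored) {
                }
            });
        }

        // The "response" that would unblock the workers also needs a pool
        // thread, but none is free: this task is queued and never runs.
        oobPool.submit(response::countDown);

        oobPool.shutdown();
        boolean drained = oobPool.awaitTermination(2, TimeUnit.SECONDS);
        System.out.println("pool drained: " + drained); // prints "pool drained: false"
        oobPool.shutdownNow(); // interrupt the stuck workers so the JVM can exit
    }
}
```

Running this prints `pool drained: false`: the queued "response" never executes because all four workers are parked, which is exactly the state the report describes once every OOB thread is stuck in waitForView().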
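The "update the cache members asynchronously" fix can be pictured as moving the blocking status poll off the thread that handles new views. The sketch below shows the shape of that change under hypothetical names (handleNewView, pollMemberStatus, updateCacheMembers); it is not the actual ClusterTopologyManagerImpl code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the asynchronous fix: instead of having the thread that installs
// a new view block on POLICY_GET_STATUS responses (tying up the pool those
// responses need), the blocking work is shifted onto a separate executor and
// the view handler returns immediately. All names are hypothetical stand-ins.
public class AsyncViewUpdateSketch {
    private final ExecutorService asyncExecutor = Executors.newCachedThreadPool();

    // Called from the view-handler thread; must not block.
    public void handleNewView(List<String> newMembers) {
        CompletableFuture
            .supplyAsync(() -> pollMemberStatus(newMembers), asyncExecutor)
            .thenAccept(this::updateCacheMembers)
            .exceptionally(t -> {
                System.err.println("status poll failed: " + t);
                return null;
            });
        // Returning here frees the caller's thread to process the next view
        // (and the remote responses the poll itself is waiting for).
    }

    private List<String> pollMemberStatus(List<String> members) {
        return members; // stand-in for the blocking POLICY_GET_STATUS round-trip
    }

    private void updateCacheMembers(List<String> statuses) {
        // stand-in for rebuilding the cache members list from the responses
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncViewUpdateSketch sketch = new AsyncViewUpdateSketch();
        sketch.handleNewView(List.of("node-a", "node-b"));
        Thread.sleep(200); // let the async update run before exiting
        sketch.asyncExecutor.shutdown();
    }
}
```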
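As for the suggested blocking-queueless-thread-pool: the value of a queueless pool here is that a submitted task can never sit in a queue behind workers that are blocked; it either gets a thread immediately (up to max-threads) or fails fast. A rough java.util.concurrent analogue of that behaviour, assuming the semantics implied by the element name (this is not the JDG server implementation):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Rough analogue of a queueless pool: a SynchronousQueue has zero capacity,
// so each submitted task either starts on a free or newly created thread
// (up to maxThreads) or is rejected -- it never waits behind blocked workers.
public class QueuelessPoolSketch {
    public static void main(String[] args) {
        ThreadPoolExecutor remoteExecutor = new ThreadPoolExecutor(
            0, 400,                    // mirrors <max-threads count="400"/>
            0L, TimeUnit.MILLISECONDS, // mirrors <keepalive-time time="0"/>
            new SynchronousQueue<>()); // no queue: run now or reject

        remoteExecutor.execute(() ->
            System.out.println("ran on " + Thread.currentThread().getName()));
        remoteExecutor.shutdown();
    }
}
```

Because the queue has no capacity, the executor spawns a new thread for each submission up to the 400-thread cap, which mirrors why the suggested configuration pairs a large max-threads count with no queue at all.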