Bug 1109758
| Summary: | Management handler threads start to hang when multiple requests come in concurrently (thread pool starvation?) | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Ladislav Thon <lthon> |
| Component: | Domain Management | Assignee: | Brian Stansberry <brian.stansberry> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ladislav Thon <lthon> |
| Severity: | high | Docs Contact: | Russell Dickenson <rdickens> |
| Priority: | unspecified | CC: | emuckenh, jdoyle, jkudrnac, myarboro |
| Version: | 6.3.0 | Target Milestone: | ER8 |
| Target Release: | EAP 6.3.0 | Hardware: | Unspecified |
| OS: | Unspecified | Doc Type: | Bug Fix |
| Last Closed: | 2014-06-28 15:25:00 UTC | Type: | Bug |
| Bug Blocks: | 1043589 | Attachments: | stack trace of the master host controller (attachment 909075) |
High-level overview: When multiple clients are doing management operations concurrently (and "multiple" can be as low as 4), the management interface gets stuck and never recovers. This prevents all subsequent management operations from executing. My hypothesis: all processing threads are waiting indefinitely for a new processing thread to become available, but the thread pool is exhausted.

This was thread starvation, happening because https://issues.jboss.org/browse/WFLY-3225 wasn't backported. I've backported that, but am still working to sort out an issue that pops up on JDK 6 with javax.security.auth.Policy being able to load the com.sun.security.auth.PolicyFile class. A class visibility problem that I know how to fix but need to fully understand.

PM bot is funny... Brian, is there an EAP pull request that I could look at? In the meantime, I took a quick look at ModelControllerClientOperationHandler in WildFly and it looks like my concern in "Additional info" above is addressed too. So that's good.

Ladislav, a WIP fix is at https://github.com/bstansberry/jboss-eap/compare/BZ1109758. No PR yet, as a classloading issue loading the JDK class com.sun.security.auth.PolicyFile surfaces with it and I'm still sorting that out. With what's on the branch, the soak test passes, though. That branch doesn't include a fix for your try/finally comment, but it will before a PR goes out.

Wow, looks like WildFly has moved a _lot_. Backporting this must have been fun... I just kicked off another "full" run (100 concurrent clients, 1000 iterations, approx. 3 hours) of the RBAC soak test from the EAP_6.3.0.ER8-dev-2 tag so that we can have feedback a bit sooner. I'm sure it's gonna be OK, though :-) (And we'll of course run the test during the test cycle too. This is just to get feedback sooner.)

OK, running the "full" RBAC soak on the EAP_6.3.0.ER8-dev-2 tag went fine, no issues present. Of course, this is preliminary feedback, not a definitive QA statement. We will verify this issue during the normal ER8 testing cycle.

Thanks, Ladislav. Yes, WildFly is moving rapidly. This stuff was meant to be backported as part of the work that led to this regression; I'd just done it long enough before the rest that when it came time to backport, I forgot about it. So thanks for catching this. Marking this as a blocker for 1043589 to help ensure that if this gets backported to 6.x, the fix gets backported too.

Verified with EAP 6.3.0.ER8.
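The starvation hypothesis above can be reproduced in miniature. The following is a self-contained sketch, not EAP code (all class and method names are mine): every worker in a fixed-size pool submits its "response" step back to the same pool and then blocks on a latch until that step runs. Once every pool thread is blocked this way, the queued response tasks can never be scheduled, so nothing ever completes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StarvationSketch {

    /** Returns true if the pool deadlocks within the timeout (the expected outcome). */
    static boolean demonstratesDeadlock() throws InterruptedException {
        int poolSize = 2;
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allBlocked = new CountDownLatch(poolSize);

        for (int i = 0; i < poolSize; i++) {
            pool.submit(() -> {
                CountDownLatch responseSent = new CountDownLatch(1);
                // The "response" task needs a free pool thread to run...
                pool.submit(responseSent::countDown);
                allBlocked.countDown();
                try {
                    // ...but this thread keeps holding its pool slot while waiting.
                    responseSent.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        allBlocked.await();
        // Both workers are parked on their latches; the queued "response"
        // tasks can never be dequeued, so the pool never drains.
        boolean stuck = !pool.awaitTermination(1, TimeUnit.SECONDS);
        pool.shutdownNow();
        return stuck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demonstratesDeadlock() ? "deadlocked" : "finished");
    }
}
```

With a pool of size N, N concurrent requests suffice to wedge it, which matches the observation that a fairly small number of concurrent clients triggers the hang.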
Created attachment 909075 [details]: stack trace of the master host controller

Description of problem: When doing multiple concurrent requests against the master host controller, the processing threads start to hang indefinitely in ModelControllerClientOperationHandler$CompletedCallback.sendResponse, waiting on the latch. Stack trace of one of the threads on the master:

```
"management-handler-thread - 4" prio=10 tid=0x00007fc984017800 nid=0x4c8 waiting on condition [0x00007fc970259000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00000000f5d05a10> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$CompletedCallback.sendResponse(ModelControllerClientOperationHandler.java:275)
	- locked <0x00000000f5611940> (a org.jboss.as.controller.remote.ModelControllerClientOperationHandler$CompletedCallback)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler.doExecute(ModelControllerClientOperationHandler.java:194)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler.access$300(ModelControllerClientOperationHandler.java:108)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1$1.run(ModelControllerClientOperationHandler.java:129)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1$1.run(ModelControllerClientOperationHandler.java:125)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:94)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1.execute(ModelControllerClientOperationHandler.java:125)
	at org.jboss.as.protocol.mgmt.AbstractMessageHandler$2$1.doExecute(AbstractMessageHandler.java:298)
	at org.jboss.as.protocol.mgmt.AbstractMessageHandler$AsyncTaskRunner.run(AbstractMessageHandler.java:537)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
	at org.jboss.threads.JBossThread.run(JBossThread.java:122)

   Locked ownable synchronizers:
	- <0x00000000ff6657f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
```

In the full stack trace (see the attachment), I can't find any thread that would correspond to the anonymous class in ModelControllerClientOperationHandler$CompletedCallback.sendResponse that actually sends the response. This, combined with the fact that the problem only starts to appear above a certain number of concurrent clients (see "Steps to Reproduce" below), suggests that this might be a thread pool starvation problem: all threads from the management handler pool are busy handling a request, and none is left to send a response, which is what the handlers are waiting for. My investigation is not conclusive, but this is the best explanation I have.
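As an aside, the kind of evidence in the attached jstack output can also be gathered in-process via the `java.lang.management` API. A hedged, self-contained sketch (class and method names are mine, not from EAP): it lists threads currently parked on a CountDownLatch, which is exactly the state the management-handler threads were stuck in above.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LatchWaiterScan {

    /** Names of all threads currently blocked waiting on a CountDownLatch. */
    static List<String> threadsParkedOnLatch() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> parked = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            LockInfo lock = info.getLockInfo();
            // A thread parked in CountDownLatch.await() reports its lock as
            // java.util.concurrent.CountDownLatch$Sync, as in the jstack output.
            if (lock != null && lock.getClassName().contains("CountDownLatch")) {
                parked.add(info.getThreadName());
            }
        }
        return parked;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try { latch.await(); } catch (InterruptedException ignored) { }
        }, "demo-waiter");
        t.start();
        Thread.sleep(500); // give the thread time to park on the latch
        System.out.println(threadsParkedOnLatch());
        latch.countDown();
        t.join();
    }
}
```

A monitoring hook like this could flag the "all handler threads parked on a latch, nobody left to count it down" signature automatically instead of requiring a manual thread dump.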
I tracked the problem down via `git bisect` to this commit:

```
commit b61f8eacef050f2d3c0793ad0977a40f4ca4508d
Author: Brian Stansberry <brian.stansberry>
Date:   Wed Mar 5 18:04:20 2014 -0600

    [WFLY-3090] Prevent mgmt op cancellation resulting in closing intra-process connections
    [WFLY-3091] Fix propagation of mgmt op cancellation to server update tasks

    Tests of the above and of WFLY-3058

    Conflicts:
        controller/src/main/java/org/jboss/as/controller/logging/ControllerLogger.java
        controller/src/main/java/org/jboss/as/controller/remote/ModelControllerClientOperationHandler.java
        controller/src/main/java/org/jboss/as/controller/remote/RemoteProxyController.java
        controller/src/main/java/org/jboss/as/controller/remote/TransactionalProtocolOperationHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/logging/DomainControllerLogger.java
        host-controller/src/main/java/org/jboss/as/domain/controller/operations/coordination/DomainRolloutStepHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/operations/coordination/DomainSlaveHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/plan/ConcurrentServerGroupUpdateTask.java
        protocol/src/main/java/org/jboss/as/protocol/logging/ProtocolLogger.java

    Conflicts:
        controller/src/main/java/org/jboss/as/controller/remote/TransactionalProtocolClientImpl.java
```

It is reproducible using the RBAC soak test, see below.

Version-Release number of selected component (if applicable): EAP 6.3.0.ER7

How reproducible: always

Steps to Reproduce:
1. git checkout EAP_6.3.0.ER7-dev
2. ./build.sh clean install -DskipTests
3. ./integration-tests.sh test -Dts.noSmoke -Dts.domain -Prbac-soak -Djboss.test.rbac.soak.clients=10 -Djboss.test.rbac.soak.iterations=10

Note that in this setting the test uses only 10 concurrent clients, so the load required to reproduce the issue is quite low.
I tried lower counts of concurrent clients too, and the threads started to hang with 4 of them (1, 2 and 3 clients were OK). Also, the productized tag EAP_6.3.0.ER7 works too, but requires using the testsuite local Maven repository (jboss-eap-6.3.0.ER7-testsuite-local-repository.zip) for both build.sh and integration-tests.sh (-Dmaven.repo.local=...).

Actual results: hangs indefinitely

Expected results: the test finishes in approx. 1 minute

Additional info: Looking at ModelControllerClientOperationHandler$CompletedCallback.sendResponse, and inside the anonymous class that actually sends the response, the nested try{try/finally}catch looks suspicious. Is it OK that in the case of an IOException, the latch is never decremented? Anyway, this doesn't seem to be the cause of the problem described above, as there should be no IOException during the test.
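To illustrate the try/finally concern: the sketch below is a hypothetical simplification, not the actual CompletedCallback code. If countDown() sits only on the success path inside the nested try, an IOException leaves whoever is blocked in sendResponse waiting on the latch forever; moving countDown() into a finally block releases the waiter either way.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

public class LatchOnResponse {

    // Simplified version of the suspected pattern: the latch is only
    // decremented if writeResponse() succeeds; an IOException skips it.
    static void sendResponseSuspect(CountDownLatch latch, boolean fail) {
        try {
            try {
                writeResponse(fail);
                latch.countDown(); // never reached if writeResponse throws
            } finally {
                // original code would close streams here, but does not count down
            }
        } catch (IOException e) {
            // swallowed: the waiter on the latch is stuck forever
        }
    }

    // Robust variant: countDown() in finally guarantees the waiter is released
    // regardless of whether writing the response succeeded.
    static void sendResponseRobust(CountDownLatch latch, boolean fail) {
        try {
            writeResponse(fail);
        } catch (IOException e) {
            // log and fall through
        } finally {
            latch.countDown(); // always release the waiter
        }
    }

    // Stand-in for actually writing the response over the wire.
    private static void writeResponse(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated write failure");
    }
}
```

As noted above, this particular path should not fire during the soak test (no IOException occurs there), so it is a latent robustness issue rather than the root cause of this bug.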