Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1109758

Summary: Management handler threads start to hang when multiple requests come in concurrently (thread pool starvation?)
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Ladislav Thon <lthon>
Component: Domain ManagementAssignee: Brian Stansberry <brian.stansberry>
Status: CLOSED CURRENTRELEASE QA Contact: Ladislav Thon <lthon>
Severity: high Docs Contact: Russell Dickenson <rdickens>
Priority: unspecified    
Version: 6.3.0CC: emuckenh, jdoyle, jkudrnac, myarboro
Target Milestone: ER8   
Target Release: EAP 6.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-28 15:25:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1043589    
Attachments:
Description Flags
stack trace of the master host controller none

Description Ladislav Thon 2014-06-16 10:23:58 UTC
Created attachment 909075 [details]
stack trace of the master host controller

Description of problem:

When doing multiple concurrent requests against master host controller, the processing threads start to hang indefinitely in ModelControllerClientOperationHandler$CompletedCallback.sendResponse, waiting on the latch. Stack trace of one of the threads on the master:

"management-handler-thread - 4" prio=10 tid=0x00007fc984017800 nid=0x4c8 waiting on condition [0x00007fc970259000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000f5d05a10> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$CompletedCallback.sendResponse(ModelControllerClientOperationHandler.java:275)
	- locked <0x00000000f5611940> (a org.jboss.as.controller.remote.ModelControllerClientOperationHandler$CompletedCallback)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler.doExecute(ModelControllerClientOperationHandler.java:194)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler.access$300(ModelControllerClientOperationHandler.java:108)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1$1.run(ModelControllerClientOperationHandler.java:129)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1$1.run(ModelControllerClientOperationHandler.java:125)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:94)
	at org.jboss.as.controller.remote.ModelControllerClientOperationHandler$ExecuteRequestHandler$1.execute(ModelControllerClientOperationHandler.java:125)
	at org.jboss.as.protocol.mgmt.AbstractMessageHandler$2$1.doExecute(AbstractMessageHandler.java:298)
	at org.jboss.as.protocol.mgmt.AbstractMessageHandler$AsyncTaskRunner.run(AbstractMessageHandler.java:537)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
	at org.jboss.threads.JBossThread.run(JBossThread.java:122)

   Locked ownable synchronizers:
	- <0x00000000ff6657f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

In the full stack trace (see the attachment), I can't find any thread that would correspond to the ModelControllerClientOperationHandler$CompletedCallback.sendResponse's anonymous class sending the response. This, combined with the fact that the problem starts to appear with a certain number of concurrent clients (see "Steps to Reproduce" below), suggests that this might be a thread pool starvation problem: all threads from the management handlers pool are used to handle a request and there is none left to send a response, which is what the handlers are waiting for. My investigation is not conclusive, but this is the best explanation I have.

I tracked the problem down via `git bisect` to this commit:

commit b61f8eacef050f2d3c0793ad0977a40f4ca4508d
Author: Brian Stansberry <brian.stansberry>
Date:   Wed Mar 5 18:04:20 2014 -0600

    [WFLY-3090] Prevent mgmt op cancellation resulting in closing intra-process connections
    [WFLY-3091] Fix propagation of mgmt op cancellation to server update tasks
    Tests of the above and of WFLY-3058
    
    Conflicts:
        controller/src/main/java/org/jboss/as/controller/logging/ControllerLogger.java
        controller/src/main/java/org/jboss/as/controller/remote/ModelControllerClientOperationHandler.java
        controller/src/main/java/org/jboss/as/controller/remote/RemoteProxyController.java
        controller/src/main/java/org/jboss/as/controller/remote/TransactionalProtocolOperationHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/logging/DomainControllerLogger.java
        host-controller/src/main/java/org/jboss/as/domain/controller/operations/coordination/DomainRolloutStepHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/operations/coordination/DomainSlaveHandler.java
        host-controller/src/main/java/org/jboss/as/domain/controller/plan/ConcurrentServerGroupUpdateTask.java
        protocol/src/main/java/org/jboss/as/protocol/logging/ProtocolLogger.java
    
    Conflicts:
        controller/src/main/java/org/jboss/as/controller/remote/TransactionalProtocolClientImpl.java

It is reproducible using the RBAC soak test, see below.

Version-Release number of selected component (if applicable):

EAP 6.3.0.ER7

How reproducible:

always

Steps to Reproduce:
1. git checkout EAP_6.3.0.ER7-dev
2. ./build.sh clean install -DskipTests
3. ./integration-tests.sh test -Dts.noSmoke -Dts.domain -Prbac-soak -Djboss.test.rbac.soak.clients=10 -Djboss.test.rbac.soak.iterations=10

You will note that in this setting, the test uses 10 concurrent clients, so the load required to reproduce the issue is pretty low. I tried lower count of concurrent clients too and the threads started to hang with 4 of them (1, 2 and 3 clients were OK).

Also, the productized tag EAP_6.3.0.ER7 works too, but requires using testsuite local Maven repository (jboss-eap-6.3.0.ER7-testsuite-local-repository.zip) for both build.sh and integration-tests.sh (-Dmaven.repo.local=...).

Actual results:

hangs indefinitely

Expected results:

test finishes in cca 1 minute

Additional info:

Looking at ModelControllerClientOperationHandler$CompletedCallback.sendResponse and inside the anonymous class that actually sends the response, the nested try{try/finally}catch looks suspicious. Is it OK that in case of an IOException, the latch is never decremented? Anyway, this doesn't seem to be the cause of the problem described above, as there should be no IOException during the test.

Comment 1 Ladislav Thon 2014-06-16 10:43:36 UTC
High-level overview:

When multiple clients are doing management operations concurrently (and multiple can be as low as 4), the management interface gets stuck and will never heal. This will prevent all subsequent management operations from executing.

My hypothesis: all processing threads are waiting indefinitely for a new processing thread to appear, but the thread pool is exhausted.

Comment 4 Brian Stansberry 2014-06-17 11:12:21 UTC
This was thread starvation, happening because https://issues.jboss.org/browse/WFLY-3225 wasn't backported.

I've backported that, but am still working to sort an issue that pops up on JDK 6 with javax.security.auth.Policy being able to load the com.sun.security.auth.PolicyFile class. A class visiblity problem that I know how to fix but need to fully understand.

Comment 5 Ladislav Thon 2014-06-17 11:49:20 UTC
PM bot is funny...

Brian, is there an EAP pull request that I could look at? In the mean time, I took a quick look on ModelControllerClientOperationHandler in WildFly and it looks like my concern in "Additional info" above is addressed too. So that's good.

Comment 6 Brian Stansberry 2014-06-17 13:13:37 UTC
Ladislav,

A WIP fix is at https://github.com/bstansberry/jboss-eap/compare/BZ1109758. No PR yet, as a classloading issue loading a JDK class com.sun.security.auth.PolicyFile surfaces with it and I'm still sorting that. With what's on the branch the soak test passes though.

That branch doesn't include a fix for your try/finally comment, but it will before a PR goes out.

Comment 8 Ladislav Thon 2014-06-18 07:10:44 UTC
Wow, looks like WildFly has moved a _lot_. Backporting this must have been fun...

I just kicked off another "full" run (100 concurrent clients, 1000 iterations, cca 3 hours) of the RBAC soak test from the EAP_6.3.0.ER8-dev-2 tag so that we can have feedback a bit sooner. I'm sure it's gonna be OK, though :-)

(And we'll of course run the test during the test cycle too. This is just to get feedback sooner.)

Comment 9 Ladislav Thon 2014-06-18 10:33:36 UTC
OK, running the "full" RBAC soak on the EAP_6.3.0.ER8-dev-2 tag went fine, no issues present.

Of course -- this is a preliminary feedback, not a definitive QA statement. We will verify this issue during normal ER8 testing cycle.

Comment 10 Brian Stansberry 2014-06-18 13:50:24 UTC
Thanks, Ladislav.

Yes, WildFly is moving rapidly. This stuff was meant to be backported though as part of the work that led to this regression; I'd just done it long enough before the rest that when it came time to backport I forgot about it. So thanks for catching this.

Comment 11 Brian Stansberry 2014-06-18 14:03:02 UTC
Marking this as a blocker for 1043589 to help ensure that if this gets backported to 6.x that the gets backported too.

Comment 12 Ladislav Thon 2014-06-25 07:10:30 UTC
Verified with EAP 6.3.0.ER8.