Bug 1023053 - Services calling XAResourceRecoveryRegistry.removeXAResourceRecovery in start/stop must use MSC async API
Status: CLOSED CURRENTRELEASE
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Server
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: CR1
Target Release: EAP 6.2.0
Assigned To: Paul Ferraro
QA Contact: Ondrej Chaloupka
Depends On:
Blocks: 1022054
Reported: 2013-10-24 09:58 EDT by Brian Stansberry
Modified: 2013-12-15 11:18 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-12-15 11:18:03 EST
Type: Bug


Attachments:
standalone-full-ha.xml (27.92 KB, text/xml)
2013-11-05 09:33 EST, Ondrej Chaloupka

Description Brian Stansberry 2013-10-24 09:58:19 EDT
XAResourceRecoveryRegistry.removeXAResourceRecovery is a blocking API, so MSC services that call it in start()/stop() must use Start|StopContext.asynchronous().

Services must not do blocking tasks with the MSC thread that calls the lifecycle methods, as discussed in the class javadoc at http://grepcode.com/file/repository.jboss.org/nexus/content/repositories/releases/org.jboss.msc/jboss-msc/1.0.0.Beta8/org/jboss/msc/service/Service.java. This is because the MSC thread pool is a very limited resource (4 threads), and blocking those threads has the potential to lock up the entire service container.
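
To illustrate why blocking a small pool is dangerous, here is a minimal, generic sketch. It does not use MSC at all: a plain java.util.concurrent fixed thread pool stands in for the MSC pool, and the names PoolStarvationSketch and blockersFinish are made up for this example. Each "blocker" task waits for a signal that can only come from a task queued behind it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic illustration only: a plain ExecutorService stands in for the MSC thread pool.
class PoolStarvationSketch {

    /**
     * Submits `blockers` tasks that each block until a later "release" task runs,
     * then submits the release task. Returns true if every blocker finished
     * within the timeout, false if the pool was starved.
     */
    static boolean blockersFinish(int threads, int blockers, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(blockers);
        try {
            for (int i = 0; i < blockers; i++) {
                pool.execute(() -> {
                    try {
                        release.await();   // occupies a pool thread while waiting
                        done.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // This task would unblock the others, but it needs a free pool thread to run.
            pool.execute(release::countDown);
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt any stuck workers so the JVM can exit
        }
    }
}
```

With 4 threads and 4 blockers, the release task never gets a thread, so the call is expected to time out and return false. With a fifth thread available it should return true.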

The issue discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1020209 shows that XAResourceRecoveryRegistry.removeXAResourceRecovery is a blocking call. From discussions with Tom Jenkinson it seems some sort of blocking behavior existed all the way back to 6.0.

The fix is to use the MSC asynchronous APIs in the relevant methods. Backport of fix for https://issues.jboss.org/browse/WFLY-2350.
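
A rough sketch of the asynchronous pattern follows. To keep it self-contained, StopContext and RecoveryRegistry here are simplified stand-ins declared inside the example, not the real org.jboss.msc or transaction interfaces, and RecoveryService is a hypothetical service, not code from the actual pull requests. The idea is that stop() tells the container the work will finish later, hands the blocking call to a separate executor, and reports completion from there.

```java
import java.util.concurrent.ExecutorService;

class AsyncStopSketch {

    /** Simplified stand-in for the relevant part of MSC's StopContext (not the real interface). */
    interface StopContext {
        void asynchronous(); // stop() will return before the work is done
        void complete();     // the deferred work has finished
    }

    /** Stand-in for the blocking registry call discussed in this bug. */
    interface RecoveryRegistry {
        void removeXAResourceRecovery(Object recovery);
    }

    /** Hypothetical service that unregisters its recovery object when stopped. */
    static final class RecoveryService {
        private final RecoveryRegistry registry;
        private final ExecutorService executor; // assumed to be separate from the lifecycle thread pool
        private final Object recovery = new Object();

        RecoveryService(RecoveryRegistry registry, ExecutorService executor) {
            this.registry = registry;
            this.executor = executor;
        }

        void stop(final StopContext context) {
            // Mark the stop as asynchronous so the calling thread is freed immediately.
            context.asynchronous();
            executor.execute(() -> {
                try {
                    registry.removeXAResourceRecovery(recovery); // may block
                } finally {
                    // Always signal completion, even if the call throws.
                    context.complete();
                }
            });
        }
    }
}
```

The same shape would apply to start(), with the start context's failure callback used to report errors instead of completing normally.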
Comment 1 Brian Stansberry 2013-10-25 18:15:35 EDT
https://github.com/jbossas/jboss-eap/pull/621 completes most of this, but https://github.com/wildfly/wildfly/pull/5330 needs to be backported.
Comment 2 Paul Ferraro 2013-10-28 14:59:28 EDT
https://github.com/jbossas/jboss-eap/pull/641
Comment 3 Ondrej Chaloupka 2013-11-05 09:33:20 EST
The fix appears to be in the ER7 release (lthon compiled the sources, and I've checked the decompiled jar files), but the server gets stuck at startup.

I'm attaching a standalone-full-ha.xml config file that reproduces the problem.
Comment 4 Ondrej Chaloupka 2013-11-05 09:33:41 EST
Created attachment 819801
standalone-full-ha.xml
Comment 5 Ladislav Thon 2013-11-05 09:48:20 EST
Let me comment on this a bit too, as I was helping Ondra with the [failed] attempt to reproduce & verify.

Adding the following bits to the "infinispan" subsystem configuration in the XML makes EAP 6.2.0.ER7 hang during startup:

<cache-container name="aaa" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>
<cache-container name="bbb" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>
<cache-container name="ccc" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>
<cache-container name="eee" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>
<cache-container name="fff" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>
<cache-container name="ggg" default-cache="default" start="EAGER">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC" batching="true">
        <locking isolation="REPEATABLE_READ"/>
        <transaction mode="FULL_XA"/>
    </replicated-cache>
</cache-container>

I'm not sure how much this is/isn't related to this particular issue, but it's definitely not OK.
Comment 6 Dimitris Andreadis 2013-11-05 12:57:11 EST
Paul, are you looking at it? This is the final week to get fixes for CR, which should hopefully be the last build.
Comment 8 Paul Ferraro 2013-11-07 14:51:06 EST
Deadlocking on startup is indeed an issue, but a separate issue nonetheless.
While cache service startup is asynchronous, the cache configuration service startup is synchronous. The cache configuration can depend on the TransactionManager or TransactionSynchronizationRegistry, which are probably blocking, causing the deadlock. I will see if addressing this fixes the issue.
Comment 9 Paul Ferraro 2013-11-07 15:23:06 EST
https://github.com/jbossas/jboss-eap/pull/678
Comment 10 Ladislav Thon 2013-11-08 04:36:16 EST
I built the tip of EAP branch and can confirm that the startup issue is fixed. Thanks Paul :-)
Comment 11 Ondrej Chaloupka 2013-11-11 10:26:05 EST
OK, the server no longer gets stuck. It seems fine to me.
Thanks
