Bug 1020209 - EAP 6.2.0.ER6 takes very long time to shut down
Summary: EAP 6.2.0.ER6 takes very long time to shut down
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Transaction Manager
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ER7
: EAP 6.2.0
Assignee: tom.jenkinson
QA Contact: Ondrej Chaloupka
Keywords: Regression
Depends On: 1022054
TreeView+ depends on / blocked
Reported: 2013-10-17 09:44 UTC by Jan Martiska
Modified: 2017-10-10 00:18 UTC (History)
5 users (show)

Clone Of:
Last Closed: 2013-12-15 16:55:22 UTC

Attachments (Terms of Use)
thread dump during shutdown (64.35 KB, text/plain)
2013-10-17 09:49 UTC, Jan Martiska
no flags Details

Description Jan Martiska 2013-10-17 09:44:33 UTC
In previous builds, EAP usually takes less than 100 miliseconds to shut down (clean run, standalone.xml profile). In ER6, it usually takes 6 seconds or more. This affects all profiles.

This issue seems to be related to the update of JBoss Transactions to 4.17.12.Final. When I replaced the org.jboss.jts module with the files from ER5 (jbossjts-jacorb-4.17.10.Final-redhat-2.jar), the issue disappeared.

Comment 1 Jan Martiska 2013-10-17 09:49:00 UTC
Created attachment 813253 [details]
thread dump during shutdown

There are several suspicious threads related to XA recovery in the thread dump.

Comment 2 Ladislav Thon 2013-10-17 11:11:01 UTC
After bisecting https://github.com/jbossas/jboss-eap between EAP_6.2.0.ER5 and EAP_6.2.0.ER6, I can confirm that the problem in EAP lies in the upgrade of JBoss Transactions to 4.17.12. Specifically, it's the commit https://github.com/jbossas/jboss-eap/commit/c8f35c4d28b9faaa641b9b9b56632e2ddb3bdf82

After bisecting https://github.com/jbosstm/narayana/ between 4.17.10.Final and 4.17.12.Final, I believe that the problematic commit in Narayana is https://github.com/jbosstm/narayana/commit/67f7f1fc42232f8332b9b9ea65f685937a122207 I was able to confirm just by reverting this particular commit from 4.17.12.Final.

Comment 3 tom.jenkinson 2013-10-17 15:08:53 UTC

There was a bug in previous versions of JBTM which allowed a resource manager to deregister from the recovery manager between scans which can cause issues with recovery.

This was resolved in https://issues.jboss.org/browse/JBTM-1700 (4.17.11) and a bug that was introduced was squished in https://issues.jboss.org/browse/JBTM-1959 (4.17.12)

It can mean that shutdown takes up to 10 seconds (the interval between scans) longer. An enhancement could be to trigger the second phase of recovery directly when a resource tries to deregister but the impact of that change at this stage could be unforeseen.


Comment 6 tom.jenkinson 2013-10-21 09:50:40 UTC
Hi guys,

The regression is probably in 6.1 basically its a bug that resource managers can deregister from the recovery manager between scans.

Remember that delay will only be observed if the user happens to shutdown in the 10 second interval every 120 seconds. (Also note that the recovery manager fires immediately after starting up so if you startup and shutdown immediately that is why you will consistently see this in test).

I can add a line of info if you like, e.g.:

"Recovery manager between scans, this may pause shutdown for up to 10 seconds" when I detect that the shutdown has appeared in that interval

This is an example of if you try to stop immediately:
10:47:16,717 INFO  [org.jboss.as] (MSC service thread 1-6) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 8519ms

This is a stop after giving recovery manager chance to perform the initial scan:
10:46:32,040 INFO  [org.jboss.as] (MSC service thread 1-4) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 33ms

Hope that is OK?

Comment 7 Ladislav Thon 2013-10-21 10:34:08 UTC
Just noting that this can also manifest in CLI:

[standalone@localhost:9999 /] :reload
Communication error: java.net.ConnectException: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out
[disconnected /] 

(At least I believe that this is a manifestation of this thing.) To a user that is not aware of this issue, this looks pretty suspicious.

Comment 8 tom.jenkinson 2013-10-21 10:46:09 UTC
Hi Ladislav, Do you have a thread dump when your issue arose?

Comment 9 tom.jenkinson 2013-10-21 10:53:08 UTC
Hi Ladislav,

The only way I managed to replicate what you have seen is to call :reload while the server is in the process of shutting down. This is normal as the management endpoint should be shutdown at that point.


Comment 10 Ladislav Thon 2013-10-21 11:04:03 UTC
Tom, this is exactly the case. Given that server shutdown used to take a few millis, calling :reload multiple times in a row in a short time interval used to be A-OK. Now it shows this message. Agreed that it probably isn't a common situation, just wanted to point out that this situation might arise.

Comment 11 tom.jenkinson 2013-10-21 11:18:06 UTC
I see what you mean. It depends how short the time interval was between you calling :reload before, if you were very fast to reissue the command (<33ms on my machine - very unlikely) or multiple connections were open and performing operations on the same server at the same time then it would have the same behaviour. I verified the later case on both 6.2.0.Alpha1 and ER5 just now.

Comment 12 tom.jenkinson 2013-10-21 17:07:29 UTC
I can see a way to minimize the likelihood of the delay to the scenario where there is only an outstanding indoubt branch for the resource manager. I will do that fix.

Comment 15 Brian Stansberry 2013-10-23 02:59:12 UTC
Since the recovery manager behavior discussed in WildFly for the Narayana M6 update (see https://github.com/wildfly/wildfly/pull/5262) is present in EAP as well, the fix for https://issues.jboss.org/browse/WFLY-2350 is going to have to be backported as well.

Comment 16 tom.jenkinson 2013-10-24 12:30:55 UTC
I marked it as post as the component upgrade should alleviate the slow down, but clearly the backport of https://github.com/wildfly/wildfly/pull/5262 hasn't been performed so I am not entirely certain this was correct. I will move it back to assigned in case we want to wait for the backport

Comment 17 Jan Martiska 2013-11-01 11:15:29 UTC
ER7 stops in less than 50 milliseconds. Verified, thanks.

Note You need to log in before you can comment on or make changes to this bug.