Bug 1020209

Summary: EAP 6.2.0.ER6 takes very long time to shut down
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Jan Martiska <jmartisk>
Component: Transaction ManagerAssignee: tom.jenkinson
Status: CLOSED CURRENTRELEASE QA Contact: Ondrej Chaloupka <ochaloup>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 6.2.0CC: brian.stansberry, jmartisk, lthon, myarboro, tom.jenkinson
Target Milestone: ER7Keywords: Regression
Target Release: EAP 6.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-12-15 16:55:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1022054    
Bug Blocks:    
Attachments:
Description Flags
thread dump during shutdown none

Description Jan Martiska 2013-10-17 09:44:33 UTC
In previous builds, EAP usually takes less than 100 miliseconds to shut down (clean run, standalone.xml profile). In ER6, it usually takes 6 seconds or more. This affects all profiles.

This issue seems to be related to the update of JBoss Transactions to 4.17.12.Final. When I replaced the org.jboss.jts module with the files from ER5 (jbossjts-jacorb-4.17.10.Final-redhat-2.jar), the issue disappeared.

Comment 1 Jan Martiska 2013-10-17 09:49:00 UTC
Created attachment 813253 [details]
thread dump during shutdown

There are several suspicious threads related to XA recovery in the thread dump.

Comment 2 Ladislav Thon 2013-10-17 11:11:01 UTC
After bisecting https://github.com/jbossas/jboss-eap between EAP_6.2.0.ER5 and EAP_6.2.0.ER6, I can confirm that the problem in EAP lies in the upgrade of JBoss Transactions to 4.17.12. Specifically, it's the commit https://github.com/jbossas/jboss-eap/commit/c8f35c4d28b9faaa641b9b9b56632e2ddb3bdf82

After bisecting https://github.com/jbosstm/narayana/ between 4.17.10.Final and 4.17.12.Final, I believe that the problematic commit in Narayana is https://github.com/jbosstm/narayana/commit/67f7f1fc42232f8332b9b9ea65f685937a122207 I was able to confirm just by reverting this particular commit from 4.17.12.Final.

Comment 3 tom.jenkinson 2013-10-17 15:08:53 UTC
Hi,

There was a bug in previous versions of JBTM which allowed a resource manager to deregister from the recovery manager between scans which can cause issues with recovery.

This was resolved in https://issues.jboss.org/browse/JBTM-1700 (4.17.11) and a bug that was introduced was squished in https://issues.jboss.org/browse/JBTM-1959 (4.17.12)

It can mean that shutdown takes up to 10 seconds (the interval between scans) longer. An enhancement could be to trigger the second phase of recovery directly when a resource tries to deregister but the impact of that change at this stage could be unforeseen.

Tom

Comment 6 tom.jenkinson 2013-10-21 09:50:40 UTC
Hi guys,

The regression is probably in 6.1 basically its a bug that resource managers can deregister from the recovery manager between scans.

Remember that delay will only be observed if the user happens to shutdown in the 10 second interval every 120 seconds. (Also note that the recovery manager fires immediately after starting up so if you startup and shutdown immediately that is why you will consistently see this in test).

I can add a line of info if you like, e.g.:

"Recovery manager between scans, this may pause shutdown for up to 10 seconds" when I detect that the shutdown has appeared in that interval

This is an example of if you try to stop immediately:
10:47:16,717 INFO  [org.jboss.as] (MSC service thread 1-6) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 8519ms

This is a stop after giving recovery manager chance to perform the initial scan:
10:46:32,040 INFO  [org.jboss.as] (MSC service thread 1-4) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 33ms

Hope that is OK?
Tom

Comment 7 Ladislav Thon 2013-10-21 10:34:08 UTC
Just noting that this can also manifest in CLI:

[standalone@localhost:9999 /] :reload
Communication error: java.net.ConnectException: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out
[disconnected /] 

(At least I believe that this is a manifestation of this thing.) To a user that is not aware of this issue, this looks pretty suspicious.

Comment 8 tom.jenkinson 2013-10-21 10:46:09 UTC
Hi Ladislav, Do you have a thread dump when your issue arose?

Comment 9 tom.jenkinson 2013-10-21 10:53:08 UTC
Hi Ladislav,

The only way I managed to replicate what you have seen is to call :reload while the server is in the process of shutting down. This is normal as the management endpoint should be shutdown at that point.

Tom

Comment 10 Ladislav Thon 2013-10-21 11:04:03 UTC
Tom, this is exactly the case. Given that server shutdown used to take a few millis, calling :reload multiple times in a row in a short time interval used to be A-OK. Now it shows this message. Agreed that it probably isn't a common situation, just wanted to point out that this situation might arise.

Comment 11 tom.jenkinson 2013-10-21 11:18:06 UTC
I see what you mean. It depends how short the time interval was between you calling :reload before, if you were very fast to reissue the command (<33ms on my machine - very unlikely) or multiple connections were open and performing operations on the same server at the same time then it would have the same behaviour. I verified the later case on both 6.2.0.Alpha1 and ER5 just now.

Comment 12 tom.jenkinson 2013-10-21 17:07:29 UTC
I can see a way to minimize the likelihood of the delay to the scenario where there is only an outstanding indoubt branch for the resource manager. I will do that fix.

Comment 15 Brian Stansberry 2013-10-23 02:59:12 UTC
Since the recovery manager behavior discussed in WildFly for the Narayana M6 update (see https://github.com/wildfly/wildfly/pull/5262) is present in EAP as well, the fix for https://issues.jboss.org/browse/WFLY-2350 is going to have to be backported as well.

Comment 16 tom.jenkinson 2013-10-24 12:30:55 UTC
I marked it as post as the component upgrade should alleviate the slow down, but clearly the backport of https://github.com/wildfly/wildfly/pull/5262 hasn't been performed so I am not entirely certain this was correct. I will move it back to assigned in case we want to wait for the backport

Comment 17 Jan Martiska 2013-11-01 11:15:29 UTC
ER7 stops in less than 50 milliseconds. Verified, thanks.