In previous builds, EAP usually takes less than 100 miliseconds to shut down (clean run, standalone.xml profile). In ER6, it usually takes 6 seconds or more. This affects all profiles.
This issue seems to be related to the update of JBoss Transactions to 4.17.12.Final. When I replaced the org.jboss.jts module with the files from ER5 (jbossjts-jacorb-4.17.10.Final-redhat-2.jar), the issue disappeared.
Created attachment 813253 [details]
thread dump during shutdown
There are several suspicious threads related to XA recovery in the thread dump.
After bisecting https://github.com/jbossas/jboss-eap between EAP_6.2.0.ER5 and EAP_6.2.0.ER6, I can confirm that the problem in EAP lies in the upgrade of JBoss Transactions to 4.17.12. Specifically, it's the commit https://github.com/jbossas/jboss-eap/commit/c8f35c4d28b9faaa641b9b9b56632e2ddb3bdf82
After bisecting https://github.com/jbosstm/narayana/ between 4.17.10.Final and 4.17.12.Final, I believe that the problematic commit in Narayana is https://github.com/jbosstm/narayana/commit/67f7f1fc42232f8332b9b9ea65f685937a122207 I was able to confirm just by reverting this particular commit from 4.17.12.Final.
There was a bug in previous versions of JBTM which allowed a resource manager to deregister from the recovery manager between scans which can cause issues with recovery.
This was resolved in https://issues.jboss.org/browse/JBTM-1700 (4.17.11) and a bug that was introduced was squished in https://issues.jboss.org/browse/JBTM-1959 (4.17.12)
It can mean that shutdown takes up to 10 seconds (the interval between scans) longer. An enhancement could be to trigger the second phase of recovery directly when a resource tries to deregister but the impact of that change at this stage could be unforeseen.
The regression is probably in 6.1 basically its a bug that resource managers can deregister from the recovery manager between scans.
Remember that delay will only be observed if the user happens to shutdown in the 10 second interval every 120 seconds. (Also note that the recovery manager fires immediately after starting up so if you startup and shutdown immediately that is why you will consistently see this in test).
I can add a line of info if you like, e.g.:
"Recovery manager between scans, this may pause shutdown for up to 10 seconds" when I detect that the shutdown has appeared in that interval
This is an example of if you try to stop immediately:
10:47:16,717 INFO [org.jboss.as] (MSC service thread 1-6) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 8519ms
This is a stop after giving recovery manager chance to perform the initial scan:
10:46:32,040 INFO [org.jboss.as] (MSC service thread 1-4) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 33ms
Hope that is OK?
Just noting that this can also manifest in CLI:
[standalone@localhost:9999 /] :reload
Communication error: java.net.ConnectException: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out
(At least I believe that this is a manifestation of this thing.) To a user that is not aware of this issue, this looks pretty suspicious.
Hi Ladislav, Do you have a thread dump when your issue arose?
The only way I managed to replicate what you have seen is to call :reload while the server is in the process of shutting down. This is normal as the management endpoint should be shutdown at that point.
Tom, this is exactly the case. Given that server shutdown used to take a few millis, calling :reload multiple times in a row in a short time interval used to be A-OK. Now it shows this message. Agreed that it probably isn't a common situation, just wanted to point out that this situation might arise.
I see what you mean. It depends how short the time interval was between you calling :reload before, if you were very fast to reissue the command (<33ms on my machine - very unlikely) or multiple connections were open and performing operations on the same server at the same time then it would have the same behaviour. I verified the later case on both 6.2.0.Alpha1 and ER5 just now.
I can see a way to minimize the likelihood of the delay to the scenario where there is only an outstanding indoubt branch for the resource manager. I will do that fix.
Since the recovery manager behavior discussed in WildFly for the Narayana M6 update (see https://github.com/wildfly/wildfly/pull/5262) is present in EAP as well, the fix for https://issues.jboss.org/browse/WFLY-2350 is going to have to be backported as well.
I marked it as post as the component upgrade should alleviate the slow down, but clearly the backport of https://github.com/wildfly/wildfly/pull/5262 hasn't been performed so I am not entirely certain this was correct. I will move it back to assigned in case we want to wait for the backport
ER7 stops in less than 50 milliseconds. Verified, thanks.