Bug 1020209
Summary: | EAP 6.2.0.ER6 takes very long time to shut down | ||||||
---|---|---|---|---|---|---|---|
Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Jan Martiska <jmartisk> | ||||
Component: | Transaction Manager | Assignee: | tom.jenkinson | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ondrej Chaloupka <ochaloup> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | 6.2.0 | CC: | brian.stansberry, jmartisk, lthon, myarboro, tom.jenkinson | ||||
Target Milestone: | ER7 | Keywords: | Regression | ||||
Target Release: | EAP 6.2.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2013-12-15 16:55:22 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1022054 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
Description
Jan Martiska
2013-10-17 09:44:33 UTC
Created attachment 813253 [details]
thread dump during shutdown
There are several suspicious threads related to XA recovery in the thread dump.
After bisecting https://github.com/jbossas/jboss-eap between EAP_6.2.0.ER5 and EAP_6.2.0.ER6, I can confirm that the problem in EAP lies in the upgrade of JBoss Transactions to 4.17.12. Specifically, it's the commit https://github.com/jbossas/jboss-eap/commit/c8f35c4d28b9faaa641b9b9b56632e2ddb3bdf82 After bisecting https://github.com/jbosstm/narayana/ between 4.17.10.Final and 4.17.12.Final, I believe that the problematic commit in Narayana is https://github.com/jbosstm/narayana/commit/67f7f1fc42232f8332b9b9ea65f685937a122207 I was able to confirm just by reverting this particular commit from 4.17.12.Final. Hi, There was a bug in previous versions of JBTM which allowed a resource manager to deregister from the recovery manager between scans which can cause issues with recovery. This was resolved in https://issues.jboss.org/browse/JBTM-1700 (4.17.11) and a bug that was introduced was squished in https://issues.jboss.org/browse/JBTM-1959 (4.17.12) It can mean that shutdown takes up to 10 seconds (the interval between scans) longer. An enhancement could be to trigger the second phase of recovery directly when a resource tries to deregister but the impact of that change at this stage could be unforeseen. Tom Hi guys, The regression is probably in 6.1 basically its a bug that resource managers can deregister from the recovery manager between scans. Remember that delay will only be observed if the user happens to shutdown in the 10 second interval every 120 seconds. (Also note that the recovery manager fires immediately after starting up so if you startup and shutdown immediately that is why you will consistently see this in test). I can add a line of info if you like, e.g.: "Recovery manager between scans, this may pause shutdown for up to 10 seconds" when I detect that the shutdown has appeared in that interval This is an example of if you try to stop immediately: 10:47:16,717 INFO [org.jboss.as] (MSC service thread 1-6) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 8519ms This is a stop after giving recovery manager chance to perform the initial scan: 10:46:32,040 INFO [org.jboss.as] (MSC service thread 1-4) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-9) stopped in 33ms Hope that is OK? Tom Just noting that this can also manifest in CLI: [standalone@localhost:9999 /] :reload Communication error: java.net.ConnectException: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out: JBAS012144: Could not connect to remote://localhost:9999. The connection timed out [disconnected /] (At least I believe that this is a manifestation of this thing.) To a user that is not aware of this issue, this looks pretty suspicious. Hi Ladislav, Do you have a thread dump when your issue arose? Hi Ladislav, The only way I managed to replicate what you have seen is to call :reload while the server is in the process of shutting down. This is normal as the management endpoint should be shutdown at that point. Tom Tom, this is exactly the case. Given that server shutdown used to take a few millis, calling :reload multiple times in a row in a short time interval used to be A-OK. Now it shows this message. Agreed that it probably isn't a common situation, just wanted to point out that this situation might arise. I see what you mean. It depends how short the time interval was between you calling :reload before, if you were very fast to reissue the command (<33ms on my machine - very unlikely) or multiple connections were open and performing operations on the same server at the same time then it would have the same behaviour. I verified the later case on both 6.2.0.Alpha1 and ER5 just now. I can see a way to minimize the likelihood of the delay to the scenario where there is only an outstanding indoubt branch for the resource manager. I will do that fix. Since the recovery manager behavior discussed in WildFly for the Narayana M6 update (see https://github.com/wildfly/wildfly/pull/5262) is present in EAP as well, the fix for https://issues.jboss.org/browse/WFLY-2350 is going to have to be backported as well. I marked it as post as the component upgrade should alleviate the slow down, but clearly the backport of https://github.com/wildfly/wildfly/pull/5262 hasn't been performed so I am not entirely certain this was correct. I will move it back to assigned in case we want to wait for the backport ER7 stops in less than 50 milliseconds. Verified, thanks. |