Description of problem:

There is a deadlock during clean shutdown of the backup. The issue was hit in the teardown phase of one failover test for the HornetQ resource adapter. EAP 6.1.0.ER6 was patched with bits from bz#958776.

From the thread dump:

Found one Java-level deadlock:
=============================
"HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c":
  waiting to lock monitor 0x00007fc48c01ddb8 (object 0x00000000d0383620, a org.hornetq.jms.server.impl.JMSServerManagerImpl),
  which is held by "MSC service thread 1-13"
"MSC service thread 1-13":
  waiting to lock monitor 0x00007fc4c0005180 (object 0x00000000d0232ea8, a org.hornetq.core.server.impl.HornetQServerImpl),
  which is held by "HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c"

Java stack information for the threads listed above:
===================================================
"HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c":
    at org.hornetq.jms.server.impl.JMSServerManagerImpl.activated(JMSServerManagerImpl.java:216)
    - waiting to lock <0x00000000d0383620> (a org.hornetq.jms.server.impl.JMSServerManagerImpl)
    at org.hornetq.core.server.impl.HornetQServerImpl.callActivateCallbacks(HornetQServerImpl.java:1368)
    at org.hornetq.core.server.impl.HornetQServerImpl.initialisePart2(HornetQServerImpl.java:1591)
    - locked <0x00000000d0232ea8> (a org.hornetq.core.server.impl.HornetQServerImpl)
    at org.hornetq.core.server.impl.HornetQServerImpl.access$1400(HornetQServerImpl.java:169)
    at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreBackupActivation.run(HornetQServerImpl.java:2128)
    at java.lang.Thread.run(Thread.java:662)
"MSC service thread 1-13":
    at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:558)
    - waiting to lock <0x00000000d0232ea8> (a org.hornetq.core.server.impl.HornetQServerImpl)
    at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:538)
    at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:505)
    at org.hornetq.jms.server.impl.JMSServerManagerImpl.stop(JMSServerManagerImpl.java:502)
    - locked <0x00000000d0383620> (a org.hornetq.jms.server.impl.JMSServerManagerImpl)
    at org.jboss.as.messaging.jms.JMSService.stop(JMSService.java:124)
    - locked <0x00000000d0561b38> (a org.jboss.as.messaging.jms.JMSService)
    at org.jboss.msc.service.ServiceControllerImpl$StopTask.stopService(ServiceControllerImpl.java:1911)
    at org.jboss.msc.service.ServiceControllerImpl$StopTask.run(ServiceControllerImpl.java:1874)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

Found 1 deadlock.

Full thread dump attached (threadump.txt)
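The dump shows a classic lock-order inversion: the activation thread holds the HornetQServerImpl monitor and wants the JMSServerManagerImpl monitor, while the MSC stop thread holds them the other way round. A minimal, self-contained sketch of that pattern follows. It does not use any HornetQ code; the class name, the two plain lock objects and the thread names are hypothetical stand-ins chosen only to mirror the dump, and detection uses the JDK's ThreadMXBean.findMonitorDeadlockedThreads().

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlockDemo {

    /**
     * Starts two daemon threads that take two monitors in opposite order and
     * returns how many monitor-deadlocked threads the JVM reports (0 on timeout).
     */
    public static int provokeAndDetect() throws InterruptedException {
        // Hypothetical stand-ins for the two monitors seen in the dump.
        final Object serverLock = new Object();     // plays HornetQServerImpl
        final Object jmsManagerLock = new Object(); // plays JMSServerManagerImpl
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Mirrors the activation thread: server lock first, then JMS manager lock.
        Thread activation = new Thread(() -> {
            synchronized (serverLock) {
                awaitOther(bothHoldFirstLock);
                synchronized (jmsManagerLock) { }
            }
        }, "demo-activation");

        // Mirrors the MSC stop thread: JMS manager lock first, then server lock.
        Thread stopper = new Thread(() -> {
            synchronized (jmsManagerLock) {
                awaitOther(bothHoldFirstLock);
                synchronized (serverLock) { }
            }
        }, "demo-stop");

        // Daemon threads so the stuck pair does not keep the JVM alive.
        activation.setDaemon(true);
        stopper.setDaemon(true);
        activation.start();
        stopper.start();

        // Poll the JVM's monitor deadlock detector for up to about five seconds.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(50);
        }
        return 0;
    }

    // Ensures each thread already holds its first lock before trying the second.
    private static void awaitOther(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + provokeAndDetect());
    }
}
```

The latch only makes the interleaving deterministic for the demo; in the real report the same interleaving depends on stop() racing with activation.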
Created attachment 743465 threadump.txt
Not sure if this is a blocker for the release (knowing where we are), but I want the triage team to review this one.
Is there any user-visible consequence of this deadlock?
Only when using a backup and issuing a clean shutdown. And even then you can kill -9 the server.

I don't think this should be a blocker because:
- it won't affect production.
- it affects only a few users.
- the few who use this feature and eventually issue a clean shutdown can do a kill -9.

And mainly:
- it's a risky change to make under pressure.

I don't think we should do it now.
A customer will need to use "kill -9 ..." to kill the server where HQ is configured as backup when this is hit. So far I have seen this deadlock only on the backup and with EAP 6.1.0.ER6, but it's hard to say whether it's a regression. This issue is problematic for our failover tests: I'll have to kill the server when the deadlock is hit during clean shutdown (teardown phase) so the test won't hang. It's not a test blocker, but it would help to have it fixed.
I agree with Clebert, but this should be fixed immediately after 6.1 goes out.
I agree, not a blocker.
It seems this issue will only happen if you have MDBs on the backup. Most backup users do it remotely. Maybe for the tests you could do something like removing the pooled connection from the standalone.
Documented as Known Issue for EAP 6.1.0
I am fixing this issue on master and 2.3.x. It turns out to be an easy fix, and I have replicated it with a byteman test. If you guys want to I can make a new release with this fix here.
https://github.com/hornetq/hornetq/pull/1045
After investigating this issue: it will only happen if you shut down the server while it is activating. A workaround for QE would be to sleep a few seconds before shutting down the server. A workaround for customers is not to shut down during activation (shutdown during activation would be a really rare event anyway). This will be fixed in the next release just in case, but it's definitely not a big deal.
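For test teardown, a fixed sleep could be replaced by waiting on an explicit "activation finished" signal. The sketch below only illustrates that idea under stated assumptions: ActivationGate, markActivated and awaitActivation are hypothetical names, not HornetQ or EAP APIs, and this is not the change made in the pull request above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ActivationGate {

    private final CountDownLatch activated = new CountDownLatch(1);

    /** Called by the (hypothetical) activation thread once activation is complete. */
    public void markActivated() {
        activated.countDown();
    }

    /** Waits until activation has finished; returns false if the timeout elapses first. */
    public boolean awaitActivation(long timeout, TimeUnit unit) throws InterruptedException {
        return activated.await(timeout, unit);
    }

    /** Simulates an activation that takes 200 ms, followed by a teardown that waits for it. */
    public static boolean demo() throws InterruptedException {
        ActivationGate gate = new ActivationGate();

        Thread activation = new Thread(() -> {
            try {
                Thread.sleep(200); // stand-in for the real activation work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            gate.markActivated();
        }, "demo-activation");
        activation.start();

        // Teardown side: only proceed to shutdown once activation is done.
        boolean ready = gate.awaitActivation(5, TimeUnit.SECONDS);
        activation.join();
        return ready;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("safe to shut down: " + demo());
    }
}
```

Compared with a fixed sleep, the timeout bounds the wait without guessing how long activation takes; how a real test would learn that activation has finished is left open here.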
BZ is in an incorrect state. The fix for this issue is present in EAP 6.1.1.ER7 (HQ 2.3.5.Final). Verified in EAP 6.1.1.ER7: I can no longer hit the issue. Great work, Clebert!