Bug 959616 - Deadlock during clean shutdown of backup during activation
Summary: Deadlock during clean shutdown of backup during activation
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER7
Target Release: EAP 6.1.1
Assignee: Clebert Suconic
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-05-04 07:09 UTC by Miroslav Novak
Modified: 2014-05-27 01:27 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
A JBoss Enterprise Application Platform 6 server that is configured as a HornetQ backup server will fail to shut down in the following situation:
* Message-Driven Beans are locally deployed on the server, and
* the shutdown command is performed during the server's activation.
In this scenario the shutdown results in a deadlock that prevents the shutdown process from completing. Once deadlocked, the server must be forcibly terminated. On Red Hat Enterprise Linux 6 this can be done using the `kill -9` command. To avoid this situation, only attempt to shut down the server before or after activation, and not during the journal loading process.
Clone Of:
Environment:
Last Closed: 2013-09-16 20:20:56 UTC
Type: Bug
Embargoed:


Attachments:
threadump.txt (47.17 KB, text/plain)
2013-05-04 07:10 UTC, Miroslav Novak

Description Miroslav Novak 2013-05-04 07:09:46 UTC
Description of problem:
There is a deadlock during clean shutdown of the backup. The issue was hit in the teardown phase of a failover test for the HornetQ resource adapter. EAP 6.1.0.ER6 was patched with bits from bz#958776.

From thread dump:
Found one Java-level deadlock:
=============================
"HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c":
  waiting to lock monitor 0x00007fc48c01ddb8 (object 0x00000000d0383620, a org.hornetq.jms.server.impl.JMSServerManagerImpl),
  which is held by "MSC service thread 1-13"
"MSC service thread 1-13":
  waiting to lock monitor 0x00007fc4c0005180 (object 0x00000000d0232ea8, a org.hornetq.core.server.impl.HornetQServerImpl),
  which is held by "HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c"

Java stack information for the threads listed above:
===================================================
"HQ119000: Activation for server HornetQServerImpl::serverUUID=5f834973-b47c-11e2-80f3-ad3533d90f8c":
        at org.hornetq.jms.server.impl.JMSServerManagerImpl.activated(JMSServerManagerImpl.java:216)
        - waiting to lock <0x00000000d0383620> (a org.hornetq.jms.server.impl.JMSServerManagerImpl)
        at org.hornetq.core.server.impl.HornetQServerImpl.callActivateCallbacks(HornetQServerImpl.java:1368)
        at org.hornetq.core.server.impl.HornetQServerImpl.initialisePart2(HornetQServerImpl.java:1591)
        - locked <0x00000000d0232ea8> (a org.hornetq.core.server.impl.HornetQServerImpl)
        at org.hornetq.core.server.impl.HornetQServerImpl.access$1400(HornetQServerImpl.java:169)
        at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreBackupActivation.run(HornetQServerImpl.java:2128)
        at java.lang.Thread.run(Thread.java:662)
"MSC service thread 1-13":
        at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:558)
        - waiting to lock <0x00000000d0232ea8> (a org.hornetq.core.server.impl.HornetQServerImpl)
        at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:538)
        at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:505)
        at org.hornetq.jms.server.impl.JMSServerManagerImpl.stop(JMSServerManagerImpl.java:502)
        - locked <0x00000000d0383620> (a org.hornetq.jms.server.impl.JMSServerManagerImpl)
        at org.jboss.as.messaging.jms.JMSService.stop(JMSService.java:124)
        - locked <0x00000000d0561b38> (a org.jboss.as.messaging.jms.JMSService)
        at org.jboss.msc.service.ServiceControllerImpl$StopTask.stopService(ServiceControllerImpl.java:1911)
        at org.jboss.msc.service.ServiceControllerImpl$StopTask.run(ServiceControllerImpl.java:1874)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

Found 1 deadlock.

Full thread dump attached (threadump.txt)
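
The dump above shows a classic lock-order inversion: the activation thread holds the HornetQServerImpl monitor and wants the JMSServerManagerImpl monitor, while the MSC stop thread holds them the other way round. The following is a minimal sketch of that pattern, not HornetQ code. The class name, thread names and the two plain lock objects are hypothetical stand-ins, and it assumes Java 8 or later (the dump itself comes from an older JVM). It uses the standard ThreadMXBean API to confirm the cycle in-process, much as the JVM does when it prints "Found one Java-level deadlock".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlockDemo {

    /**
     * Starts two daemon threads that take two monitors in opposite order and
     * returns how many threads the JVM reports as monitor-deadlocked.
     */
    public static int countDeadlockedThreads() throws InterruptedException {
        // Hypothetical stand-ins for the two monitors named in the thread dump.
        final Object serverLock = new Object();
        final Object jmsManagerLock = new Object();
        // Ensures each thread holds its first lock before either asks for the second.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread activation = new Thread(() -> {
            synchronized (serverLock) {              // like initialisePart2()
                awaitBoth(bothHoldFirstLock);
                synchronized (jmsManagerLock) {      // like activated()
                    // never reached once the cycle forms
                }
            }
        }, "demo-activation");

        Thread stop = new Thread(() -> {
            synchronized (jmsManagerLock) {          // like JMSServerManagerImpl.stop()
                awaitBoth(bothHoldFirstLock);
                synchronized (serverLock) {          // like HornetQServerImpl.stop()
                    // never reached once the cycle forms
                }
            }
        }, "demo-stop");

        // Daemon threads, so the deadlock does not keep this demo JVM alive.
        activation.setDaemon(true);
        stop.setDaemon(true);
        activation.start();
        stop.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 50; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(100);
        }
        return 0;
    }

    private static void awaitBoth(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Threads blocked on intrinsic monitors cannot be interrupted out of the wait, which is consistent with the deadlocked server needing an external kill.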

Comment 1 Miroslav Novak 2013-05-04 07:10:23 UTC
Created attachment 743465
threadump.txt

Comment 2 Rostislav Svoboda 2013-05-04 11:12:06 UTC
Not sure if this is a blocker for the release (knowing where we are), but I want the triage team to review this one.

Comment 3 John Doyle 2013-05-05 13:40:49 UTC
Is there any user visible consequence of this deadlock?

Comment 4 Clebert Suconic 2013-05-05 14:13:26 UTC
Only when using a backup and issuing a clean shutdown. And even then you can kill -9.

I don't think this should be a blocker because:

- it won't affect production. 
- it affects only a few users
- the few who use this feature and eventually issue a clean shutdown can do a kill -9

And mainly:

- that's a risky change to make under pressure. I don't think we should do it now.

Comment 5 Miroslav Novak 2013-05-05 14:15:47 UTC
Customers will need to use "kill -9 ..." to kill the server where HQ is configured as backup when this is hit.

So far I have seen this deadlock only on the backup and with EAP 6.1.0.ER6. But it's hard to say whether it's a regression.

This issue is problematic for our failover tests. I'll have to kill the server when the deadlock is hit during clean shutdown (teardown phase) so the test won't hang. It's not a test blocker, but it would help to have it fixed.
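
A teardown fallback of the kind described above could look roughly like this. It is a sketch only: the class and method names are hypothetical, it assumes the test harness holds a java.lang.Process handle for the server and has already requested the clean shutdown, and it relies on the Java 8 Process API (waitFor with a timeout, destroyForcibly).

```java
import java.util.concurrent.TimeUnit;

public class ShutdownWatchdog {

    /**
     * Waits for a server process that has already been asked to shut down.
     * Returns true if it exited on its own within the timeout; otherwise
     * kills it forcibly (the in-JVM counterpart of "kill -9") and returns false.
     */
    public static boolean stopOrKill(Process server, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (server.waitFor(timeout, unit)) {
            return true;            // clean shutdown completed
        }
        server.destroyForcibly();   // shutdown is presumed hung, e.g. deadlocked
        server.waitFor();           // reap the process so the test can continue
        return false;
    }
}
```

The boolean result lets the test log that the shutdown hung instead of silently masking the deadlock.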

Comment 6 Mark Little 2013-05-05 14:18:27 UTC
I agree with Clebert, but this should be fixed immediately after 6.1 goes out.

Comment 7 John Doyle 2013-05-05 14:25:35 UTC
I agree, not a blocker.

Comment 8 Clebert Suconic 2013-05-05 14:32:25 UTC
It seems this issue will only happen if you have MDBs on the backup. Most backup users deploy their MDBs remotely.

Maybe for the tests you could do something like removing the pooled connection from the standalone.

Comment 10 Dana Mison 2013-05-07 05:05:10 UTC
Documented as Known Issue for EAP 6.1.0

Comment 11 Clebert Suconic 2013-05-08 22:24:14 UTC
I am fixing this issue on master and 2.3.x. 

It turns out to be an easy fix, and I have replicated it with a byteman test.


If you guys want to I can make a new release with this fix here.

Comment 12 Clebert Suconic 2013-05-08 23:03:28 UTC
https://github.com/hornetq/hornetq/pull/1045

Comment 14 Clebert Suconic 2013-05-08 23:44:59 UTC
After investigating this issue: it will only happen if you shut down the server while it is activating. A workaround for QE would be to sleep a few seconds before shutting down the server. A workaround for customers is not to shut down during activation (i.e. a shutdown during activation would be a really rare event).

This will be fixed in the next release just in case, but it's definitely not a big deal.
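
For the QE workaround, a bounded wait on an "activation finished" signal may be more robust than a fixed sleep. The sketch below is generic and makes no claim about HornetQ's API: the class name and the condition passed in are hypothetical, and how a test would actually detect that activation has completed is left as an assumption.

```java
import java.util.function.BooleanSupplier;

public class ActivationGate {

    /**
     * Polls the given condition until it holds or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise.
     */
    public static boolean awaitCondition(BooleanSupplier condition,
                                         long timeoutMillis,
                                         long pollMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;   // timed out; the caller decides what to do next
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A test would call awaitCondition with its own activation check before requesting shutdown, and fall back to a plain sleep if no such check is available.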

Comment 15 Miroslav Novak 2013-08-22 09:22:41 UTC
BZ is in an incorrect state. The fix for this issue is present in EAP 6.1.1.ER7 (HQ 2.3.5.Final).

Verified in EAP 6.1.1.ER7. I can no longer hit the issue. Great work, Clebert!

