Bug 901137 (JBPAPP6-1273)

Summary: Server cannot be shut down gracefully when reconnect-attempts is set to -1
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Miroslav Novak <mnovak>
Component: HornetQ
Assignee: Francisco Borges <francisco.borges>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 6.1.0
CC: anmiller, atangrin, cdewolf, csuconic, dandread, francisco.borges, jawilson, mnovak, myarboro, nziakova, pslavice, sappleto
Target Milestone: ER6
Keywords: TestBlocker
Target Release: EAP 6.1.0
Hardware: Unspecified
OS: Unspecified
URL: http://jira.jboss.org/jira/browse/JBPAPP6-1273
Doc Type: Bug Fix
Type: Bug
Bug Depends On: 928911
Attachments (Description / Flags):
reproducer-shutdown.zip / none
mdb-server-console.log / none
reproducer.zip / none
thread dump from mdb server (EAP 6.1.0.DR4) / none
threaddump-master.txt / none
threaddump_hq230cr2.txt / none

Description Miroslav Novak 2012-10-31 12:59:02 UTC
project_key: JBPAPP6

Test scenario:

1. Start two EAP 6 servers - mdb and jms server. mdb server is configured to connect through remote JCA to jms server.
2. Stop (ctrl-c) jms server
3. Try to stop (ctrl-c) mdb server

Result:
Mdb server hangs (I waited more than 10 minutes) - attached mdb-server-console.log (with thread dump during hang)

Attached reproducer-shutdown.zip:
1. Download and unzip reproducer-shutdown.zip
2. Prepare servers - "sh prepare.sh"
3. Start jms server - "sh start-server1.sh localhost"
4. Start mdb server - "sh start-server2.sh <some_other_ip>"
5. Shut down jms server
6. Try to shut down mdb server

Issue seems to be caused by setting <reconnect-attempts>-1</reconnect-attempts> in "hornetq-ra" connection factory:
{code}
<pooled-connection-factory name="hornetq-ra">
    <transaction mode="xa"/>
    <reconnect-attempts>-1</reconnect-attempts>
    <connectors>
        <connector-ref connector-name="netty-remote"/>
    </connectors>
    <entries>
        <entry name="java:/JmsXA"/>
    </entries>
</pooled-connection-factory>
{code}

When reconnect-attempts is set to 0, the server shuts down immediately.
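To illustrate why -1 and 0 can behave so differently at shutdown, here is a minimal sketch of a retry loop, assuming -1 means "retry forever" and 0 means "do not retry". The class ReconnectSketch and its methods are hypothetical names invented for this note; this is not HornetQ's actual code, and the retry interval is left out. The point is only that an unlimited loop needs some close signal it checks on every pass, otherwise nothing ever ends it once the jms server is gone.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; not taken from HornetQ. */
final class ReconnectSketch {
    /** Set by the shutdown path to ask the retry loop to give up. */
    private final AtomicBoolean closing = new AtomicBoolean(false);

    void requestClose() {
        closing.set(true);
    }

    /**
     * Retries tryConnect until it succeeds, the attempts run out, or close is requested.
     * Assumed meaning of reconnectAttempts: -1 = unlimited, 0 = no retries, n = at most n.
     * Returns the number of attempts actually made.
     */
    int reconnect(int reconnectAttempts, BooleanSupplier tryConnect) {
        int attempts = 0;
        while (!closing.get()) {
            boolean unlimited = reconnectAttempts == -1;
            if (!unlimited && attempts >= reconnectAttempts) {
                break; // attempt budget exhausted
            }
            attempts++;
            if (tryConnect.getAsBoolean()) {
                break; // connected again
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        ReconnectSketch sketch = new ReconnectSketch();
        int[] calls = {0};
        // The simulated server never comes back; "shutdown" arrives during the 5th attempt.
        int made = sketch.reconnect(-1, () -> {
            if (++calls[0] == 5) {
                sketch.requestClose();
            }
            return false;
        });
        System.out.println("attempts made before close was honored: " + made);
    }
}
```

Without the closing check in the loop condition, the call with -1 in main would never return, which is the shape of the hang described above.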

Comment 1 Miroslav Novak 2012-10-31 12:59:35 UTC
Attachment: Added: reproducer-shutdown.zip


Comment 2 Miroslav Novak 2012-10-31 13:16:05 UTC
Attachment: Added: mdb-server-console.log


Comment 3 Anne-Louise Tangring 2012-11-01 17:41:28 UTC
This issue is listed as Major or below and as such is not targeted for the EAP 6.0.1 release, now that we are in Blocker or Critical issue only mode. Should this be reconsidered, please contact the EAP PM team.

Comment 4 Anne-Louise Tangring 2012-11-13 20:53:13 UTC
Docs QE Status: Removed: NEW 


Comment 5 Miroslav Novak 2012-11-26 11:47:33 UTC
Link: Added: This issue Cloned to JBPAPP6-1654


Comment 6 Miroslav Novak 2013-02-19 18:34:49 UTC
This issue is still valid with a slightly different scenario (step 5 is new).

Steps to reproduce:
1. Download and unzip reproducer.zip from the attachment. Execute the next steps in the unzipped "reproducer" directory
2. run "sh prepare.sh"
  - downloads EAP 6.1.0.DR4
  - creates two directories server1 and server2
  - copies directory jboss-eap-6.1 to server1 and server2
  - copies configuration standalone-full-ha-jms.xml to server1
  - copies configuration standalone-full-ha-mdb.xml to server2
  - copies mdb1.jar to server2's deployments directory
3. start first (jms) server by "sh start-server1.sh localhost"
4. start second (mdb) server by "sh start-server2.sh <some_other_ip>"
5. start jms producer by "sh start-producer.sh localhost 1000"
6. shut down first (jms) server by ctrl-c
7. try to shut down second (mdb) server -> server hangs (threadump.txt attached)

Comment 7 Miroslav Novak 2013-02-19 18:35:44 UTC
Created attachment 699585 [details]
reproducer.zip

Comment 8 Miroslav Novak 2013-02-19 18:36:43 UTC
Created attachment 699586 [details]
thread dump from mdb server (EAP 6.1.0.DR4)

Comment 9 Clebert Suconic 2013-02-19 19:34:23 UTC
Can you try replacing the Jars from trunk? I believe this is fixed.

Comment 10 Miroslav Novak 2013-02-20 08:49:00 UTC
Server still hangs with trunk/master.

Check what I did, please:
- switched to master branch in git in HornetQ project: "git checkout master; git pull"
- built hornetq jars by: "mvn -Prelease package"
- copied built ./hornetq-ra/target/hornetq-ra-2.3.0.CR1.jar to ./server2/jboss-eap-6.1/modules/system/layers/base/org/hornetq/ra/main/hornetq-ra-2.3.0.CR1.jar
- tried last test scenario (from comment 2013-02-19 13:34:49 EST)

Comment 11 Miroslav Novak 2013-02-20 08:54:06 UTC
Created attachment 699886 [details]
threaddump-master.txt

Comment 12 Clebert Suconic 2013-03-28 15:24:40 UTC
Can you try with the latest CR2?

Comment 13 Paul Gier 2013-03-28 15:39:31 UTC
PR for the hornetq CR2 upgrade: https://github.com/jbossas/jboss-eap/pull/79

Comment 14 Miroslav Novak 2013-04-03 11:21:04 UTC
I can still hit this problem with HornetQ 2.3.0.CR2. Thread dump from mdb server attached (threaddump_hq230cr2.txt)

Comment 15 Miroslav Novak 2013-04-03 11:21:45 UTC
Created attachment 731118 [details]
threaddump_hq230cr2.txt

Comment 16 Clebert Suconic 2013-04-03 21:20:51 UTC
@Miroslav: I'm not sure we should fix this...

First, the use case is really an edge case: you first shut down one server, then the remote server. It's not even a developer's case.

Second, fixing this edge case would break other cases that are more important.

So, I would say this is a won't fix. You could even document the case if you wanted, but it is also somewhat obvious. I think we should just close this as won't fix.

The risk of breaking other cases is too great. The proper fix here would be to change session.close() so that it is ignored when a failover is in progress, and this could break other scenarios that are not as much of an edge case as this one.
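One possible reading of "ignore session.close() during failover" is sketched below, purely for illustration. SessionCloseSketch, setFailoverInProgress and the boolean return value are hypothetical and do not come from HornetQ; the blocking round trip to the server is represented only by the return value. In this reading the session is still marked closed locally, but the step that would wait on an unreachable server is skipped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration only; not the change that was eventually merged. */
final class SessionCloseSketch {
    /** Assumed to be toggled by whatever code drives failover/reconnection. */
    private final AtomicBoolean failoverInProgress = new AtomicBoolean(false);
    private boolean closed;

    void setFailoverInProgress(boolean inProgress) {
        failoverInProgress.set(inProgress);
    }

    synchronized boolean isClosed() {
        return closed;
    }

    /**
     * Closes the session locally.
     * Returns true if a (simulated) blocking round trip to the server would be made,
     * false if it is skipped because the session was already closed or failover is running.
     */
    synchronized boolean close() {
        if (closed) {
            return false; // repeated close is a no-op
        }
        closed = true;
        if (failoverInProgress.get()) {
            return false; // server unreachable: release local state only, do not wait
        }
        return true; // normal path: a real client would send a close request and wait here
    }

    public static void main(String[] args) {
        SessionCloseSketch session = new SessionCloseSketch();
        session.setFailoverInProgress(true);
        boolean roundTrip = session.close();
        System.out.println("round trip made: " + roundTrip + ", closed: " + session.isClosed());
    }
}
```

The regression risk mentioned in the comment shows up here too: any caller that relies on close() having been acknowledged by the server would silently lose that guarantee while failover is running.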

Comment 17 Miroslav Novak 2013-04-04 09:11:59 UTC
I'm also afraid of regressions. The problem is that this is not as much of an edge case as it appears to be.
We're testing this scenario because there were support tickets for it from our customers. Check comments in related jira from Jimmy Wilson and Shaun Appleton [1]. There is a high probability that we'll have to fix it anyway. 

[1] https://issues.jboss.org/browse/JBPAPP-10450?focusedCommentId=12737770&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12737770

Comment 18 Miroslav Novak 2013-04-10 08:55:19 UTC
This issue is related to:
https://bugzilla.redhat.com/show_bug.cgi?id=877277

Comment 19 Francisco Borges 2013-04-11 15:25:26 UTC
Fixed on https://github.com/FranciscoBorges/hornetq/tree/shutdownOnReconnect but I still need to verify it (although the fix is so simple that I am calling it "fixed")

Comment 20 Clebert Suconic 2013-04-11 18:11:07 UTC
@Francisco: I looked at your fix... Do we really need to still close those sessions? AFAIK a connection.close() will close any session. (Maybe I am missing something on the Resource Adapter?)

Comment 21 Clebert Suconic 2013-04-11 18:11:52 UTC
the fix looks good BTW: simple change! which is great! thanks man!

Comment 22 Francisco Borges 2013-04-12 14:09:11 UTC
No, we do not need to close those sessions. I left the calls there for safety's sake until I figure out how to reproduce and verify this case.

Fwiw, I just tried to verify and we are now hanging somewhere else, assuming I did everything correctly.

Comment 23 Francisco Borges 2013-04-12 15:29:46 UTC
OK, Miroslav Novak confirmed: that change got us further, but the server still did not exit.

I made a second change and pushed it; now the server exits after a while. At least it did here for me. On Monday we will try to do some more thorough verification.

Comment 24 Francisco Borges 2013-04-18 11:00:14 UTC
A fix was merged. The commit is this one https://github.com/hornetq/hornetq/commit/6eb89a7288fc1f9a569641ee9058df004db24257

Comment 25 Miroslav Novak 2013-05-03 09:22:17 UTC
Cannot hit the problem with EAP 6.1.0.ER6 (HQ 2.3.0.Final). Great work, Francisco!