Bug 989520 - Replicated journal - Failback fails with acknowledge mode set to SESSION_TRANSACTED
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER5
Target Release: EAP 6.1.1
Assignee: Clebert Suconic
QA Contact: Martin Svehla
Docs Contact: Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-07-29 12:33 UTC by Martin Svehla
Modified: 2013-09-16 20:24 UTC (History)
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-09-16 20:24:52 UTC
Type: Bug
Embargoed:


Attachments
live server configuration (26.69 KB, application/xml)
2013-07-29 12:34 UTC, Martin Svehla
backup server configuration (26.78 KB, application/xml)
2013-07-29 12:34 UTC, Martin Svehla
client (6.76 KB, application/zip)
2013-07-29 12:35 UTC, Martin Svehla
live server log (3.94 MB, text/x-log)
2013-07-29 12:36 UTC, Martin Svehla
backup server log (3.87 MB, text/x-log)
2013-07-29 12:37 UTC, Martin Svehla
thread dump from the live server after the error (33.77 KB, text/plain)
2013-07-29 14:35 UTC, Martin Svehla
live server log from the run where I generated the thread dump above (2.29 MB, text/x-log)
2013-07-29 14:52 UTC, Martin Svehla
Shell commands used to start up live (313 bytes, text/plain)
2013-07-29 16:08 UTC, Francisco Borges
application-users and application-roles property files for both servers (1.33 KB, application/zip)
2013-07-30 09:02 UTC, Martin Svehla

Description Martin Svehla 2013-07-29 12:33:14 UTC
Description of problem:

I have 2 servers with a replicated journal. When the live server crashes, failover works correctly. However, sometimes (in around 30-40% of runs) the whole cluster hangs when I try to start the live server again. From the live server log it seems that the journal sync is done correctly, but then some issue causes the live server to crash. (The backup is shut down in the meantime.)

This seems to affect only clients with acknowledge mode set to SESSION_TRANSACTED.
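For context on what is different about such clients, here is a toy model of transacted acknowledgement semantics in plain Java (invented class and method names, not the JMS or HornetQ API): acknowledgements are buffered client-side until commit(), so a failover that lands before the commit rolls the whole batch back and forces redelivery.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a SESSION_TRANSACTED consumer (invented names, not the JMS
// API): acknowledgements are buffered until commit(), so a failover that
// happens before the commit discards the whole batch and the messages are
// redelivered.
public class TransactedSession {
    private final List<String> pendingAcks = new ArrayList<String>();
    private final List<String> acknowledged = new ArrayList<String>();
    private boolean failedOver;

    public void acknowledge(String messageId) {
        pendingAcks.add(messageId); // buffered, not yet effective
    }

    public void failover() {
        failedOver = true; // e.g. the live server killed with kill -9
    }

    // Applies the buffered acks atomically. After a failover the batch is
    // rolled back instead, mirroring JMS's TransactionRolledBackException.
    public boolean commit() {
        if (failedOver) {
            pendingAcks.clear();
            failedOver = false;
            return false; // rolled back; messages will be redelivered
        }
        acknowledged.addAll(pendingAcks);
        pendingAcks.clear();
        return true;
    }

    public List<String> acknowledged() {
        return acknowledged;
    }
}
```

In a real JMS client the equivalent failure typically surfaces as a TransactionRolledBackException from Session.commit(); a client using per-message acknowledgement has no such in-flight batch.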


Version-Release number of selected component (if applicable):
EAP 6.1.1.ER3, HornetQ 2.3.2.Final


How reproducible:

See below; you might need to run it a few times to hit the error. I'll post my server logs too.


Steps to Reproduce:
1. Start live server with the configuration attached below.
2. Start backup server with the configuration attached below.
3. Run maven project with client with mvn compile exec:java -Dexec.args="[hostname] [port]"
4. Kill live server with kill -9. Client will properly failover to backup server.
5. Start live server again. Both live and backup servers end up in a hung state.


Additional info:

See live server log for error.

Also please review the server configuration files; this might be a configuration issue.

Comment 1 Martin Svehla 2013-07-29 12:34:05 UTC
Created attachment 779851 [details]
live server configuration

Comment 2 Martin Svehla 2013-07-29 12:34:32 UTC
Created attachment 779852 [details]
backup server configuration

Comment 3 Martin Svehla 2013-07-29 12:35:14 UTC
Created attachment 779853 [details]
client

Comment 4 Martin Svehla 2013-07-29 12:36:39 UTC
Created attachment 779861 [details]
live server log

Comment 5 Martin Svehla 2013-07-29 12:37:31 UTC
Created attachment 779866 [details]
backup server log

Comment 6 Francisco Borges 2013-07-29 13:58:07 UTC
After a first look, this is what I can tell:

1. In server-backup.log, at time-stamp "14:17:17,054", we see a warning message with code HQ222163. That message shows the original-backup-now-live thought it was in sync with the original-live-trying-to-failback and exited.

2. In server-live.log we see that a page operation goes badly wrong at timestamp "14:17:17,051", warning code HQ222010. That is a critical IOError and it should shut down the original-live-trying-to-failback.

[...]

Questions to be answered are:

A. What caused the Critical IOError?

B. Why did the original-live-trying-to-failback not shut down, and why did it seem to hang? (Martin Svehla reports waiting 10 minutes for it to exit, without success.)

Comment 7 Martin Svehla 2013-07-29 14:35:40 UTC
Created attachment 779935 [details]
thread dump from the live server after the error

Comment 8 Francisco Borges 2013-07-29 14:48:58 UTC
OK, it turns out that the HQ server referred to as original-live-trying-to-failback did indeed exit (last org.hornet entry in its logs). As expected, the EAP server keeps running.

The question now is what caused the critical IO exception.

Comment 9 Martin Svehla 2013-07-29 14:52:05 UTC
Created attachment 779972 [details]
live server log from the run where i generated thread dump above

Comment 10 Francisco Borges 2013-07-29 16:07:39 UTC
OK, either something is missing in the set-up, or I am doing something wrong:

1. ssh'ed into messaging-09.jbm.lab.bos.redhat.com
  
mkdir ~/BZ_989520
cd ~/BZ_989520
unzip /home/hudson/STATIC/eap/6.1.1.ER3/jboss-eap-6.1.1.ER3.zip
cp standalone-full-ha-live.xml ~/BZ_989520/jboss-eap-6.1/standalone/configuration
cd jboss-eap-6.1 && ./bin/standalone.sh -c standalone-full-ha-live.xml

Everything starts up. If I then run the replication-io-error client code (on the same host, without even bothering to have a backup in position), I get:

%  mvn compile exec:java -Dexec.args="localhost 5445"

[...]
javax.naming.NamingException: Failed to connect to any server. Servers tried: [remote://localhost:5445]
[...]


In the EAP server logs I get:

16:44:47,846 ERROR [org.hornetq.core.client] (Old I/O server worker (parentId: -777577764, [id: 0xd1a71adc, /127.0.0.1:5445])) HQ214013: Failed to decode packet: java.lang.IllegalArgumentException: HQ119032: Invalid type: -1
	at org.hornetq.core.protocol.core.impl.PacketDecoder.decode(PacketDecoder.java:427) [hornetq-core-client-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1]
	at org.hornetq.core.protocol.ServerPacketDecoder.decode(ServerPacketDecoder.java:180) [hornetq-server-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1]
	at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:485) [hornetq-core-client-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1]

@"Martin Svehla", can you spot what is going wrong? Or produce a shell script that reproduces the issue in a clean environment? (See the shell script I will attach showing what I did to start the live server.)

Comment 11 Francisco Borges 2013-07-29 16:08:44 UTC
Created attachment 780010 [details]
Shell commands used to start up live

Comment 12 Martin Svehla 2013-07-30 06:28:57 UTC
@Francisco The client is talking to EAP through its remoting protocol, so you need to use the remoting port (4447). Please try that. (Sorry, I should have specified that.)

Comment 13 Martin Svehla 2013-07-30 09:02:34 UTC
Created attachment 780472 [details]
application-users and application-roles property files for both servers

Comment 14 Francisco Borges 2013-07-30 10:14:08 UTC
It is not clear what is going on, but the bug is (as far as I can tell) valid.

The original-live gets a "Critical IO Error" while trying to fail back, when opening a page. What is weird is that (as far as I know) this is not an issue when it is a backup doing exactly the same thing (and the code involved is the same).

Comment 15 Francisco Borges 2013-07-30 14:45:17 UTC
After a full day of debugging, the cause of this bug was found. There are multiple interacting pieces that need addressing, but the core is this:

In an orderly shutdown:

1. We send a STOP_CALLED packet from the original-backup (now playing "live") to the original-live. This keeps the server from failing over too fast and makes sure we replicate everything we need to replicate.

2. We are supposed to send a FAILOVER order from the original-backup to the original-live. This never gets there; we have a timeout for failing over just in case, but it appears that something goes wrong with the topology and the server crashes.

Other than understanding what goes wrong, a fix needs:

1. a lower timeout in case the failover order never gets there.
2. making sure we actually send the packet and do not terminate the ReplicationManager so fast.
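Fix item 1 amounts to a bounded wait for the FAILOVER order. A minimal sketch of that wait (hypothetical classes, not HornetQ's actual code; the packet's arrival is modeled with a CountDownLatch):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the failback wait described above: after STOP_CALLED,
// the original-live waits for the FAILOVER order from the original-backup, but
// falls back to a timeout in case the packet never arrives (as in this bug).
public class FailbackWait {
    private final CountDownLatch failoverOrder = new CountDownLatch(1);

    // Called when the FAILOVER packet arrives from the original-backup.
    public void onFailoverPacket() {
        failoverOrder.countDown();
    }

    // Returns true if failback was triggered by the packet, false if we had
    // to rely on the safety-net timeout (the path this bug forces).
    public boolean awaitFailover(long timeoutMillis) throws InterruptedException {
        return failoverOrder.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

awaitFailover returning false is the safety-net path; lowering the timeout only shortens the stall, while fix item 2 addresses the missing packet itself.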

Comment 16 Francisco Borges 2013-08-01 09:07:38 UTC
OK, from a discussion with Clebert in the HQ forum, we concluded that the live closing its channel with the backup would trigger such an interrupt signal. If the backup is still busy doing any IO, this would trigger the java.nio.channels.ClosedByInterruptException we observe in our logs.
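The mechanism is easy to reproduce in isolation: per the java.nio.channels.InterruptibleChannel contract, a thread whose interrupt status is set when it invokes a blocking operation on an interruptible channel has the channel closed and receives a ClosedByInterruptException. A standalone demo (not HornetQ code; a temp file stands in for the journal):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptedJournalWrite {
    public static void main(String[] args) throws IOException {
        Path journal = Files.createTempFile("journal", ".dat");
        try (FileChannel ch = FileChannel.open(journal, StandardOpenOption.WRITE)) {
            // Simulate the live closing the replication channel while the
            // backup thread is mid-IO: invoking a blocking operation on an
            // interruptible channel with the interrupt status already set
            // closes the channel and throws ClosedByInterruptException.
            Thread.currentThread().interrupt();
            ch.write(ByteBuffer.wrap(new byte[]{1, 2, 3}));
            System.out.println("write completed (unexpected)");
        } catch (ClosedByInterruptException e) {
            System.out.println("caught ClosedByInterruptException");
        } finally {
            Thread.interrupted(); // clear interrupt status before cleanup
            Files.deleteIfExists(journal);
        }
    }
}
```

Running this prints "caught ClosedByInterruptException": the write never completes because the interrupt closes the channel first, which matches the IO failure pattern seen in the backup's logs.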

Since we understand this situation, it will be tracked in a JIRA (https://issues.jboss.org/browse/HORNETQ-1241) and reviewed later. I.e., we concluded that there is no urgency to work on it now.

Comment 21 Clebert Suconic 2013-08-05 15:38:59 UTC
PR for this was sent: https://github.com/hornetq/hornetq/pull/1186

I'm not sure what the target milestone for that is. As soon as I either figure it out or someone changes it for me, this issue can be set as Posted. Just waiting for the release and the PR of a new HornetQ.

Comment 22 Martin Svehla 2013-08-16 11:08:50 UTC
Tested with EAP-6.1.1.ER6 / HornetQ 2.3.4.Final

I wasn't able to hit the error again after about 20 runs of the use case.

