Bug 989520
| Summary: | Replicated journal - Failback fails with acknowledge mode set to SESSION_TRANSACTED | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Martin Svehla <msvehla> |
| Component: | HornetQ | Assignee: | Clebert Suconic <csuconic> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Svehla <msvehla> |
| Severity: | high | Docs Contact: | Russell Dickenson <rdickens> |
| Priority: | unspecified | ||
| Version: | 6.1.1 | CC: | anmiller, cdewolf, csuconic, msvehla, myarboro |
| Target Milestone: | ER5 | ||
| Target Release: | EAP 6.1.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-09-16 20:24:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Martin Svehla
2013-07-29 12:33:14 UTC
Created attachment 779851 [details]
live server configuration
Created attachment 779852 [details]
backup server configuration
Created attachment 779853 [details]
client
Created attachment 779861 [details]
live server log
Created attachment 779866 [details]
backup server log
After a first look, this is what I can tell: 1. at server-backup.log, time-stamp "14:17:17,054" we see warning message with code HQ222163. That message shows the original-backup-now-live thought it was in sync with the original-live-trying-to-failback and exited. 2. at server-live.log we see that a page operation goes dearly wrong at timestamp "14:17:17,051" warning code HQ222010. That is a Critical IOError and it should shutdown the original-live-trying-to-failback. [...] Questions to be answered are: A. What caused the Critical IOError? B. Why the original-live-trying-to-failback did not shutdown and seemed to hang. (Martin Svehla reports waiting for it to exit for 10 minutes without success.) Created attachment 779935 [details]
thread dump from the live server after the error
Ok, turns out that the HQ server refered to as original-live-trying-to-failback did indeed exit (last org.hornet in its logs). As expected the EAP server keeps running. The question now is whether the critical IO exception is the cause of the critical IO exception. Created attachment 779972 [details]
live server log from the run where i generated thread dump above
ok, something missing in the set-up, or I am doing something wrong: 1. ssh'ed into messaging-09.jbm.lab.bos.redhat.com mkdir ~/BZ_989520 cd ~/BZ_989520 unzip /home/hudson/STATIC/eap/6.1.1.ER3/jboss-eap-6.1.1.ER3.zip cp standalone-full-ha-live.xml ~/BZ_989520/jboss-eap-6.1/standalone/configuration cd jboss-eap-6.1 && ./bin/standalone.sh -c standalone-full-ha-live.xml Everything starts up. If I try to run the code replication-io-error code (on the same host) (without even bothering to have a backup in position), I get: % mvn compile exec:java -Dexec.args="localhost 5445" [...] javax.naming.NamingException: Failed to connect to any server. Servers tried: [remote://localhost:5445] [...] At the EAP server logs I get: 16:44:47,846 ERROR [org.hornetq.core.client] (Old I/O server worker (parentId: -777577764, [id: 0xd1a71adc, /127.0.0.1:5445])) HQ214013: Failed to decode packet: java.lang.IllegalArgumentException: HQ119032: Invalid type: -1 at org.hornetq.core.protocol.core.impl.PacketDecoder.decode(PacketDecoder.java:427) [hornetq-core-client-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1] at org.hornetq.core.protocol.ServerPacketDecoder.decode(ServerPacketDecoder.java:180) [hornetq-server-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1] at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:485) [hornetq-core-client-2.3.2.Final-redhat-1.jar:2.3.2.Final-redhat-1] @"Martin Svehla", can you spot what is going wrong? Or produce a shell-script that reproduces the issue in a clean environment? (see the shell script I will attach to verify what I did to start the live) Created attachment 780010 [details]
Shell commands used to start up live
@Francisco The client is talking to EAP through its remoting protocol, so you need to use remoting port (4447). Please try that. (Sorry, I should have specified that.) Created attachment 780472 [details]
application-users and application-roles property files for both servers
It is not clear what is going on, but the bug is (for all I can tell) valid. The original-live gets a "Critical IO Error" while trying to fail-back when opening a page. What is weird is that (for all I know) this is not an issue when it is a backup doing the exactly same thing (and the code involved is all the same). After a full day of debugging the cause of this bug was found. There are multiple interacting pieces that need addressing but the core is that: In an orderly shutdown: 1. we send a STOP_CALLED from original-backup (now playing "live") to the original-live. This keeps the server from failing-over too fast and makes sure we replicate everything we need to replicate. 2. we (are supposed to) send a FAILOVER order from the original-backup to the original-live. This never gets there, we have a time out for failing over just in case, but it appears that something goes wrong with the topology and the server crashes. Other than understanding what goes wrong, a fix needs: 1. a lower timeout if the failover order never gets there. 2. make sure we actually send the package and do not terminate the replicationManager so fast. Ok, from a discussion with Clebert in the HQ forum we concluded that the live closing its channel with the backup would trigger such an interrupt signal. If the backup is still busy doing any IO this would trigger a the java.nio.channels.ClosedByInterruptException we observe in our logs. Since we understand this situation, this will be placed in a JIRA (https://issues.jboss.org/browse/HORNETQ-1241) and will be reviewed later. I.e. we concluded that there is no urgency to work on it now. PR for this was sent: https://github.com/hornetq/hornetq/pull/1186 I'm not sure what's the Target milestone for that.. as soon as either I figure out or someone change it for me, this issue can get set as Posted.. just waiting the release and the PR of a new HornetQ Tested with EAP-6.1.1.ER6 / HornetQ 2.3.4.Final Wasn't able to hit the error again after about 20 runs of the use case. |