Created attachment 962151 [details]
server-logs.zip

Description of problem:
Sometimes a "Critical IO Error" appears in the log of the backup server after failover from live to backup.

Test scenario:
1. Start 2 EAP 6.4.0.DR11 servers in a collocated HA topology with shared store (journal-type NIO)
2. Start a producer and a consumer with a transacted session connected to the 1st server
3. Kill the 1st EAP server so the clients fail over to the collocated backup on the 2nd server
4. Wait for the clients to finish

Result:
Sometimes after step 3 the collocated backup on the 2nd EAP server does not activate and logs the warning "HQ222010: Critical IO Error, shutting down the server..." (see the full stack trace below). We saw a similar problem in the past in bz#989520.

How reproducible:
We don't have a reproducer at this moment.

Expected results:
The backup should start without problems.

Additional info:
I'm attaching trace logs from the EAP servers.

Full stack trace from backup:
08:59:39,083 WARN  [org.hornetq.core.server] (Old I/O server worker (parentId: 1934013939, [id: 0x7346b5f3, /127.0.0.1:7446])) HQ222010: Critical IO Error, shutting down the server.
file=NIOSequentialFile /mnt/jbm/hudson/hornetq-journal-A-colocated/large-messages/2147484245.msg, message=null: HornetQException[errorType=IO_ERROR message=null]
	at org.hornetq.core.journal.impl.NIOSequentialFile.open(NIOSequentialFile.java:111) [hornetq-journal-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.journal.impl.NIOSequentialFile.open(NIOSequentialFile.java:96) [hornetq-journal-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.persistence.impl.journal.LargeServerMessageImpl.openFile(LargeServerMessageImpl.java:434) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.persistence.impl.journal.LargeServerMessageImpl.validateFile(LargeServerMessageImpl.java:406) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.persistence.impl.journal.LargeServerMessageImpl.addBytes(LargeServerMessageImpl.java:111) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.server.impl.ServerSessionImpl.sendContinuations(ServerSessionImpl.java:1370) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:472) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:641) [hornetq-core-client-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:557) [hornetq-core-client-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:533) [hornetq-core-client-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:587) [hornetq-server-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:72) [hornetq-core-client-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:281) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:122) [hornetq-core-client-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.socket.oio.OioWorker.process(OioWorker.java:71) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.socket.oio.AbstractOioWorker.run(AbstractOioWorker.java:73) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:51) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:175) [netty-3.6.10.Final-redhat-1.jar:3.6.10.Final-redhat-1]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_45]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_45]
Caused by: java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) [rt.jar:1.6.0_45]
	at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:304) [rt.jar:1.6.0_45]
	at org.hornetq.core.journal.impl.NIOSequentialFile.open(NIOSequentialFile.java:107) [hornetq-journal-2.3.24.Final-redhat-1.jar:2.3.24.Final-redhat-1]
	... 30 more
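The "Caused by" section points at standard JDK behavior: when a thread whose interrupt status is set performs an operation on an interruptible NIO channel (here FileChannelImpl.size()), the JVM closes the channel and throws ClosedByInterruptException. A minimal sketch of that mechanism, unrelated to HornetQ itself (class and file names are hypothetical):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;

public class InterruptClose {
    /**
     * Shows the JDK behavior behind the stack trace above: calling an
     * interruptible FileChannel operation (size(), as in the trace's
     * FileChannelImpl.size frame) while the thread's interrupt flag is
     * set closes the channel and raises ClosedByInterruptException.
     */
    public static boolean demo() throws Exception {
        File tmp = File.createTempFile("large-message", ".msg");
        tmp.deleteOnExit();
        FileChannel ch = new RandomAccessFile(tmp, "rw").getChannel();
        Thread.currentThread().interrupt(); // simulate a pending interrupt
        boolean thrown = false;
        try {
            ch.size(); // interruptible NIO call
        } catch (ClosedByInterruptException e) {
            thrown = true;
        } finally {
            Thread.interrupted(); // clear the interrupt flag again
        }
        // The channel is closed as a side effect and cannot be reused.
        return thrown && !ch.isOpen();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("exception thrown and channel closed: " + demo());
    }
}
```

This is why a stray interrupt delivered to a thread that is touching the journal or large-message files is fatal for the server: the file channel it was using is silently closed, which HornetQ surfaces as a critical IO error.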
This relates to JIRA issue HORNETQ-1241.
I can see this issue in other failover tests as well, so I am increasing the severity.
Did you wait for the server to back up its files before failback?
This scenario uses a shared store, so there is no need to wait for synchronization.
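For context, shared-store HA in EAP 6 is declared in the messaging subsystem of both the live and backup server profiles; a sketch of the relevant elements (directory paths and the server name are illustrative, not taken from the attached logs):

```xml
<hornetq-server name="backup-server">
    <!-- backup pair sharing one journal instead of replicating it -->
    <backup>true</backup>
    <shared-store>true</shared-store>
    <journal-type>NIO</journal-type>
    <!-- live and backup must point at the same directories on shared storage -->
    <journal-directory path="/mnt/shared/journal"/>
    <large-messages-directory path="/mnt/shared/large-messages"/>
</hornetq-server>
```

With `<shared-store>true</shared-store>` there is no journal replication phase: the backup activates by acquiring the live server's lock on the shared journal, which is why no synchronization wait applies here.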
I believe this would be BZ-1193085
The PR for this issue, https://github.com/hornetq/hornetq/pull/2105, has already been merged -> moving to CP 13.
7.1 and the WildFly branch are fixed by ARTEMIS-831. 7.0 does not contain the issue (confirmed by Tom Ross and Clebert).
Verified with EAP 6.4.13.CP.CR2
Released with EAP 6.4.13 on Feb 02 2017.