Description of problem: We see an unexpected exception and bad behaviour of the JMS client in the following scenario:

1) Start 2 EAP 6.4.0.DR6 servers (HQ 2.3.21.Final) with HornetQ in a dedicated topology with shared store and the deployed topic "jms/topic/InTopic0"
2) Subscribe and start 1 subscriber with a "client acknowledge" session on the topic
3) Start 3 publishers which send messages to the topic
4) Kill the "live" server
5) Check that all clients failed over to the backup
6) Stop the publishers and wait for the subscriber to receive all messages. Check that the numbers of sent and received messages are equal.

An unexpected exception occurred in the backup's log after step 4:

10:19:21,593 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
10:19:21,595 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
10:19:21,597 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
10:19:21,599 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at java.lang.Thread.run(Thread.java:745)

This happened after the EAP 6.4.0.DR6 live server with HornetQ was killed by Byteman. It looks like an incomplete connector was broadcast by the "live" server and blew up the discovery group on the backup by throwing an InternalError (see the sketch at the end of this comment).

There is another problem which occurred during failover and might be related to this error. This is a failover scenario with 3 publishers and 1 subscriber on a topic. When the live server was killed, the publishers failed over but the subscriber did not. From the thread dump it hangs in consumer.receive() (1 min timeout) for more than 2 minutes and does not receive any new messages after the live server is killed:

Stack trace of thread: Thread[Thread-344,5,main]
---java.lang.Object.wait(Native Method)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:259)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:401)
---org.hornetq.jms.client.HornetQMessageConsumer.getMessage(HornetQMessageConsumer.java:220)
---org.hornetq.jms.client.HornetQMessageConsumer.receive(HornetQMessageConsumer.java:129)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.receiveMessage(SubscriberClientAck.java:278)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.run(SubscriberClientAck.java:122)

How reproducible: We're not able to reproduce this without the proper Byteman rule and with the same test. We saw this just once.

Expected results: java.lang.InternalError should not be thrown to the error output and destroy the discovery group on the backup. The subscriber should fail over without problems.

Note: I'm attaching trace logs from the servers and from the test (contains trace logs from the clients) - logs.zip. I'm not sure when to trigger the Byteman rule so we could reproduce the problem. Any suggestions?
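Note on the mechanism: java.lang.InternalError extends Error, not Exception, so a catch (Exception e) inside the discovery loop cannot stop it from killing the thread. A minimal stand-alone sketch of that behaviour (loop body and names are illustrative, not the actual DiscoveryGroup$DiscoveryRunnable code):

// Illustrative only: shows why an InternalError escapes catch (Exception)
// and terminates the thread, producing the "Exception in thread ..." output
// seen in the log above.
public class ErrorEscapesDemo {
    public static void main(String[] args) throws Exception {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    decodeDatagram(); // stand-in for reading a broadcast packet
                } catch (Exception e) {
                    // Never reached: InternalError extends Error, not Exception.
                    System.err.println("handled: " + e);
                }
            }
        }, "hornetq-discovery-group-thread-dg-group1");
        t.start();
        t.join(); // the thread dies and prints the InternalError to stderr
    }

    static void decodeDatagram() {
        // What UTF8Util.readUTF reports when it hits a truncated/zeroed buffer:
        throw new InternalError("unhandled utf8 byte 0");
    }
}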
Created attachment 952144: logs.zip
Link to failed test in Jenkins: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-ha-failover-dedicated/226/testReport/org.jboss.qa.hornetq.test.failover/DedicatedFailoverTestCase/testFailoverClientAckTopic/
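For context on what the failing test's subscriber does: it is a client-acknowledge receive loop along the lines of the sketch below (JNDI names and client ID are assumed for illustration; this is not the actual SubscriberClientAck source). The consumer.receive(60000) call is where the thread dump above shows the hang:

import javax.jms.*;
import javax.naming.InitialContext;

public class SubscriberSketch {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext(); // assumes jndi.properties points at the EAP server
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory");
        Topic topic = (Topic) ctx.lookup("jms/topic/InTopic0");

        Connection connection = cf.createConnection();
        connection.setClientID("subscriber-1");
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createDurableSubscriber(topic, "sub-1");
        connection.start();
        try {
            Message message;
            // 1 min timeout; this is the receive that never returned after the kill
            while ((message = consumer.receive(60000)) != null) {
                message.acknowledge(); // acks this and all previous unacked messages in the session
            }
        } finally {
            connection.close();
        }
    }
}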
We hit this issue again during EAP 6.4.0.DR9 testing, but in a slightly different scenario:

1. Start 2 EAP 6.4.0.DR9 servers in a dedicated topology with shared store and the deployed queues InQueue and OutQueue
2. Send 2000 messages to InQueue on the 1st EAP server (live)
3. Start a 3rd EAP 6.4.0.DR9 server with a deployed MDB. The MDB reads messages through remote JCA from InQueue and sends them to OutQueue (in an XA transaction); a sketch of such an MDB follows at the end of this comment.
4. While the MDB is processing messages, cleanly shut down the 1st server (live)
5. The MDB fails over to the 2nd server (backup)
6. Wait for the MDB to finish processing and read all messages from OutQueue
7. Check the numbers of sent and received messages

The problem occurred in step 5: the MDB did not receive any new messages from the backup after failover. I can see the following errors and warnings in the log of the 2nd EAP server (backup):

14:13:31,320 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
14:13:31,329 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
14:13:31,330 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
14:13:31,331 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) at java.lang.Thread.run(Thread.java:745)
...
14:13:32,771 WARN [org.hornetq.core.server] (Thread-20 (HornetQ-server-HornetQServerImpl::serverUUID=bf00dd22-69d6-11e4-8bd5-8513819ff1c5-266116125)) HQ222015: Internal error! Delivery logic has identified a non delivery and still handled a consumer!

Attaching logs-mdb-failover.zip with info and trace logs from the servers. Because this issue breaks HA, I'm increasing the severity and setting the blocker? flag.
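For reference, the MDB in step 3 is conceptually like the sketch below (class name, annotations and JNDI bindings are illustrative and assume the standard java:/JmsXA pooled connection factory; it is not the actual test MDB, which is additionally activated through the remote resource adapter). The point is that the receive from InQueue and the send to OutQueue are enlisted in the same container-managed XA transaction:

import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.*;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/queue/InQueue")
})
public class InToOutMdb implements MessageListener {

    @Resource(mappedName = "java:/JmsXA") // pooled, XA-enlisted connection factory
    private ConnectionFactory connectionFactory;

    @Resource(mappedName = "java:/jms/queue/OutQueue")
    private Queue outQueue;

    public void onMessage(Message message) {
        Connection connection = null;
        try {
            connection = connectionFactory.createConnection();
            // The session flags are ignored inside the container-managed XA
            // transaction; the send joins the same transaction as the receive.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            session.createProducer(outQueue).send(message);
        } catch (JMSException e) {
            throw new RuntimeException(e); // forces rollback and redelivery
        } finally {
            if (connection != null) {
                try { connection.close(); } catch (JMSException ignored) { }
            }
        }
    }
}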
Created attachment 956655: logs-mdb-failover.zip
Couldn't this be caused by the fact that we compiled the latest release with Java 8?
We should be more resilient to failures in the DiscoveryRunnable. Any exception would interrupt the loop, as identified by Miroslav. Miro was spot on about the issue. The fix here will be simple:

while (started)
{
   try
   {
      // receive and process the broadcast datagram
   }
   catch (Throwable e)
   {
      // do the exception treatment inside the while loop; don't interrupt
      // the loop if any exception happens -- just log it
   }
}

I'm not sure how to test this though... we would need a Byteman test interrupting the send and running it in a loop.
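A stand-alone version of that pattern (class and method names are made up for the demo, not taken from HornetQ):

// Illustrative resilient receive loop: catch Throwable, not Exception, so a
// malformed datagram whose decoding throws an Error (e.g. InternalError:
// "unhandled utf8 byte 0") cannot kill the discovery thread.
public class ResilientLoopDemo {

    private volatile boolean started = true;

    public void run() {
        while (started) {
            try {
                receiveAndDecodeDatagram();
            } catch (Throwable t) {
                // log and keep looping
                System.err.println("Ignoring broken broadcast datagram: " + t);
            }
        }
    }

    public void stop() {
        started = false;
    }

    private void receiveAndDecodeDatagram() {
        // stand-in for the real socket receive + connector decoding
    }
}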
I have a fix for this one already... leave it with me.
PR Sent: https://github.com/hornetq/hornetq/pull/1958
Solved by HQ 2.3.24.Final upgrade
During the EAP 6.4.0.DR11 testing cycle this issue was not hit. Still, I will not set this as verified yet; I'll check again with DR12 to gain some confidence that the issue is gone and that there are no further problems.
Moving to DR12 to have it in the priority filter.
No related issue was found during EAP 6.4.0.DR12 testing. Setting as verified.