Bug 1158920 - InternalError exception thrown after fail-over
Summary: InternalError exception thrown after fail-over
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: DR12
Target Release: EAP 6.4.0
Assignee: Clebert Suconic
QA Contact: Miroslav Novak
URL:
Whiteboard: dev_best_effort, dev_no_blocker
Depends On:
Blocks: 1166735
 
Reported: 2014-10-30 14:02 UTC by Miroslav Novak
Modified: 2019-08-19 12:43 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-08-19 12:43:46 UTC
Type: Bug
Embargoed:


Attachments
logs.zip (11.27 MB, application/zip), uploaded 2014-10-30 14:05 UTC by Miroslav Novak
logs-mdb-failover.zip (3.30 MB, application/zip), uploaded 2014-11-12 09:33 UTC by Miroslav Novak


Links
Apache JIRA ACTIVEMQ6-44: no summary, last updated never
Red Hat Issue Tracker HORNETQ-1446 (Major, Open): Error parsing UDP after interrupted process, last updated 2018-10-02 15:14:27 UTC

Description Miroslav Novak 2014-10-30 14:02:32 UTC
Description of problem:
We can see an unexpected exception and bad behaviour of the JMS client in the following scenario:
1) Start 2 EAP 6.4.0.DR6 servers (HQ 2.3.21.Final) with HornetQ in a dedicated topology with shared store and the topic "jms/topic/InTopic0" deployed
2) Subscribe and start 1 subscriber with a "client acknowledge" session to the topic (see the sketch after this list)
3) Start 3 publishers which send messages to the topic
4) Kill the "live" server
5) Check that all clients failed over to the backup
6) Stop the publishers and wait for the subscriber to receive all messages. Check that the number of sent and received messages is equal.
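
A minimal sketch of the subscriber from step 2, assuming the standard JMS 1.1 API; the connection factory JNDI name is illustrative, not taken from the test suite:

import javax.jms.*;
import javax.naming.InitialContext;

public class TopicSubscriberSketch
{
    public static void main(String[] args) throws Exception
    {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // illustrative JNDI name
        Topic topic = (Topic) ctx.lookup("jms/topic/InTopic0");

        Connection connection = cf.createConnection();
        try
        {
            // "client acknowledge" session - the subscriber acknowledges each message explicitly
            Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
            MessageConsumer subscriber = session.createConsumer(topic);
            connection.start();

            Message message = subscriber.receive(60000); // wait up to 1 minute
            if (message != null)
            {
                message.acknowledge();
            }
        }
        finally
        {
            connection.close();
        }
    }
}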

An unexpected exception occurred in the backup's log after step 4:

10:19:21,593 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
10:19:21,595 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
10:19:21,597 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
10:19:21,599 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at java.lang.Thread.run(Thread.java:745)

This happened after the EAP 6.4.0.DR6 live server with HornetQ was killed by Byteman. It looks like an incomplete connector was broadcast by the "live" server and blew up the discovery group on the backup by throwing InternalError.
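
For context, a rough and simplified sketch (this is not HornetQ's actual UTF8Util code) of why a truncated or zero-filled datagram can surface as "unhandled utf8 byte 0": a modified-UTF-8 decoder branches on the leading bits of each byte, and a raw 0x00 byte, which a half-written broadcast packet can easily contain, matches none of the expected patterns:

class Utf8DecodeSketch
{
    // Simplified modified-UTF-8 decoder; in this encoding NUL is written as 0xC0 0x80,
    // so a raw 0x00 byte is never a valid lead byte and falls through to the error branch.
    static String decodeModifiedUtf8(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < bytes.length)
        {
            int b = bytes[i] & 0xFF;
            if (b >= 0x01 && b <= 0x7F) // one-byte character
            {
                sb.append((char) b);
                i += 1;
            }
            else if ((b & 0xE0) == 0xC0) // two-byte character
            {
                sb.append((char) (((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F)));
                i += 2;
            }
            else if ((b & 0xF0) == 0xE0) // three-byte character
            {
                sb.append((char) (((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F)));
                i += 3;
            }
            else
            {
                // a 0x00 byte from an incomplete broadcast ends up here
                throw new InternalError("unhandled utf8 byte " + b);
            }
        }
        return sb.toString();
    }
}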

There is another problem which occurred during failover and might be related to this error. This is a failover scenario with 3 publishers and 1 subscriber on a topic. When the live server was killed the publishers failed over but the subscriber did not. According to the thread dump it hangs in consumer.receive(1 min); for more than 2 minutes and does not receive any new messages after the live server is killed (a sketch of such a receive loop follows the stack trace):
Stack trace of thread: Thread[Thread-344,5,main]
---java.lang.Object.wait(Native Method)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:259)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:401)
---org.hornetq.jms.client.HornetQMessageConsumer.getMessage(HornetQMessageConsumer.java:220)
---org.hornetq.jms.client.HornetQMessageConsumer.receive(HornetQMessageConsumer.java:129)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.receiveMessage(SubscriberClientAck.java:278)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.run(SubscriberClientAck.java:122)
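
A minimal sketch of the receive loop the test client performs (this is not the actual SubscriberClientAck code), assuming the standard JMS 1.1 API: receive with a 60-second timeout and treat a null return as "no message yet". After the kill, the call above blocked for more than 2 minutes instead of returning:

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;

class ReceiveLoopSketch
{
    // consumer.receive(60000) should return null after roughly 60 seconds if no message
    // arrives, allowing the caller to retry or give up instead of blocking indefinitely.
    static Message receiveWithRetry(MessageConsumer consumer, int maxRetries) throws JMSException
    {
        for (int attempt = 0; attempt < maxRetries; attempt++)
        {
            Message message = consumer.receive(60000);
            if (message != null)
            {
                return message;
            }
        }
        return null; // give up after maxRetries empty timeouts
    }
}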

How reproducible:
We are not able to reproduce this with the same test without a proper Byteman rule. We saw this just once.

Expected results:
java.lang.InternalError should not be thrown to the error output and should not destroy the discovery group on the backup.

The subscriber should fail over without problems.

Note:
I'm attaching trace logs from the servers and the test (which contains trace logs from the clients) - logs.zip

I'm not sure when to trigger the Byteman rule so we could reproduce the problem. Any suggestions?

Comment 1 Miroslav Novak 2014-10-30 14:05:59 UTC
Created attachment 952144 [details]
logs.zip

Comment 3 Miroslav Novak 2014-11-12 09:30:12 UTC
We hit this issue again during EAP 6.4.0.DR9 testing but in a slightly different scenario:

1. Start 2 EAP 6.4.0.DR9 servers in a dedicated topology with shared store and the queues InQueue and OutQueue deployed
2. Send 2000 messages to InQueue on the 1st EAP server (live)
3. Start a 3rd EAP 6.4.0.DR9 server with a deployed MDB. The MDB reads messages through remote JCA from InQueue and sends them to OutQueue (in an XA transaction) - see the sketch after this list
4. When the MDB is processing messages, cleanly shut down the 1st server (live)
5. The MDB fails over to the 2nd server (backup)
6. Wait for the MDB to finish processing and read all messages from OutQueue.
7. Check the number of sent and received messages.
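
A minimal sketch of the MDB from step 3, assuming the standard EJB 3 and JMS 1.1 APIs; the pooled connection factory and queue JNDI bindings are assumptions, not taken from the actual test deployment:

import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.*;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/queue/InQueue")
})
public class InToOutMdb implements MessageListener
{
    @Resource(mappedName = "java:/JmsXA") // pooled, XA-enabled connection factory (assumed binding)
    private ConnectionFactory connectionFactory;

    @Resource(mappedName = "java:/jms/queue/OutQueue") // assumed binding
    private Queue outQueue;

    @Override
    public void onMessage(Message message)
    {
        Connection connection = null;
        try
        {
            connection = connectionFactory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(outQueue);
            producer.send(message); // runs inside the container-managed XA transaction
        }
        catch (JMSException e)
        {
            throw new RuntimeException(e); // force the XA transaction to roll back
        }
        finally
        {
            if (connection != null)
            {
                try { connection.close(); } catch (JMSException ignored) { }
            }
        }
    }
}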

The problem occurred in step 5. The MDB did not receive any new messages from the backup after failover. I can see the following warnings in the log of the 2nd EAP (backup) server:

14:13:31,320 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
14:13:31,329 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
14:13:31,330 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
14:13:31,331 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at java.lang.Thread.run(Thread.java:745)

...

14:13:32,771 WARN  [org.hornetq.core.server] (Thread-20 (HornetQ-server-HornetQServerImpl::serverUUID=bf00dd22-69d6-11e4-8bd5-8513819ff1c5-266116125)) HQ222015: Internal error! Delivery logic has identified a non delivery and still handled a consumer!

Attaching logs-mdb-failover.zip with info and trace logs from the servers.

Because this issue breaks HA, I'm increasing the severity and setting the blocker? flag.

Comment 4 Miroslav Novak 2014-11-12 09:33:53 UTC
Created attachment 956655 [details]
logs-mdb-failover.zip

Comment 5 Clebert Suconic 2014-11-20 14:35:31 UTC
Couldn't this be caused by the fact that we compiled the latest release with Java 8?

Comment 6 Clebert Suconic 2014-11-20 14:45:02 UTC
We should be more resilient to failures in the DiscoveryRunnable. Any exception would interrupt the loop, as identified by Miroslav. Miro was spot on about the issue.


The fix here will be simple:


while (started)
{
   try
   {
      // receive and process the next broadcast packet
   }
   catch (Throwable e) // do the exception treatment inside the while, don't interrupt it if any exception happened
   {
        // logging
   }
}



I'm not sure how to test this though... we would need a Byteman test interrupting the send and running it in a loop (a possible shape for such a rule is sketched below).
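
For what it's worth, a possible shape for such a Byteman rule; the target class and method names here are assumptions (not taken from an existing test), and the hard part is timing the kill so the datagram goes out only partially written:

RULE kill live server while broadcasting connectors
CLASS org.hornetq.core.server.cluster.impl.BroadcastGroupImpl
METHOD broadcastConnectors
AT ENTRY
IF TRUE
DO killJVM()
ENDRULE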

Comment 8 Clebert Suconic 2014-11-20 19:22:07 UTC
I have a fix for this one already... leave it with me.

Comment 9 Clebert Suconic 2014-11-20 20:27:41 UTC
PR Sent: https://github.com/hornetq/hornetq/pull/1958

Comment 10 Kabir Khan 2014-11-24 12:52:05 UTC
Solved by HQ 2.3.24.Final upgrade

Comment 11 Miroslav Novak 2014-11-28 12:12:46 UTC
During the EAP 6.4.0.DR11 testing cycle this issue was not hit.

Still, I will not set this as verified yet; I'll check again with DR12 to have some confidence that the issue is gone and there are no more problems.

Comment 12 Rostislav Svoboda 2014-12-02 12:00:35 UTC
Moving to DR12 to have it in the priority filter.

Comment 13 Miroslav Novak 2014-12-05 11:09:51 UTC
No related issue was found during EAP 6.4.0.DR12 testing. Setting as verified.

