Bug 1158920

Summary:

InternalError exception thrown after fail-over

Product:

[JBoss] JBoss Enterprise Application Platform 6

Reporter:

Miroslav Novak <mnovak>

Component:

HornetQ

Assignee:

Clebert Suconic <csuconic>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Miroslav Novak <mnovak>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

6.4.0

CC:

ataylor, cdewolf, csuconic, dandread, jbertram, kkhan, msvehla, rsvoboda

Target Milestone:

DR12

Target Release:

EAP 6.4.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

dev_best_effort, dev_no_blocker

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-08-19 12:43:46 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1166735

Attachments:

Description	Flags
logs.zip	none
logs-mdb-failover.zip	none

Description Miroslav Novak 2014-10-30 14:02:32 UTC

Description of problem:
We can see unexpected exception and bad behaviour of JMS client in following scenario:
1) Start 2 EAP 6.4.0.DR6 (HQ 2.3.21.Final) with HornetQ in dedicated topology with shared store and deployed topic "jms/topic/InTopic0"
2) Subscribe and start 1 subscriber with "client acknowledge" session to topic
3) Start 3 publishers which sends messages to topic
4) Kill "live" server
5) Check that all clients failovered to backup
6) Stop publishers and wait for subscriber to receive all messages. Check that number of sent and received messages is equal.

Unexpected exception occured on backup's log after step 4.:

10:19:21,593 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
10:19:21,595 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
10:19:21,597 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
10:19:21,599 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
10:19:21,601 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at java.lang.Thread.run(Thread.java:745)

This happened after EAP 6.4.0.DR6 with HornetQ live was killed by Byteman. It looks like that incomplete connector was broadcasted by the "live" server and blew up discovery group on backup by throwing InternalError.

There is other problem which occurred during failover and might be related to this error. This is failover scenario with 3 publishers and 1 subscriber on topic. When live server was killed publishers did failover but subscriber did not. From the thread dump it hangs in consumer.receive(1 min); for more than 2 minutes and does not receive any new messages after live is killed:
Stack trace of thread: Thread[Thread-344,5,main]
---java.lang.Object.wait(Native Method)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:259)
---org.hornetq.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:401)
---org.hornetq.jms.client.HornetQMessageConsumer.getMessage(HornetQMessageConsumer.java:220)
---org.hornetq.jms.client.HornetQMessageConsumer.receive(HornetQMessageConsumer.java:129)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.receiveMessage(SubscriberClientAck.java:278)
---org.jboss.qa.hornetq.apps.clients.SubscriberClientAck.run(SubscriberClientAck.java:122)

How reproducible:
We're not able to reproduce this without proper byteman rule and with the same test. We saw this just once.

Expected results:
java.lang.InternalError should not be thrown to error output and destroy discovery group on backup. 

Subscriber should failover without problem. 

Note:
I'm attaching trace logs from servers and test (contains trace logs from clients) - logs.zip

I'm not sure when to trigger byteman rule so we could reproduce the problem. Any suggestions?

Comment 1 Miroslav Novak 2014-10-30 14:05:59 UTC

Created attachment 952144 [details]
logs.zip

Comment 2 Miroslav Novak 2014-10-30 14:07:04 UTC

Link to failed test in Jenkins: 
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-ha-failover-dedicated/226/testReport/org.jboss.qa.hornetq.test.failover/DedicatedFailoverTestCase/testFailoverClientAckTopic/

Comment 3 Miroslav Novak 2014-11-12 09:30:12 UTC

We hit this issue again during EAP 6.4.0.DR9 testing but in slightly different scenario:

1. Start 2 EAP 6.4.0.DR9 servers in dedicated topology with shared store and deploye queue InQueue and OutQueue
2. Send 2000 messages to InQueue to 1st EAP server (live)
3. Start 3rd EAP 6.4.0.DR9 with deployed MDB. MDB reads messages through remote JCA from InQueue and sends to OutQueue (in XA transaction)
4. When MDB is processing messages then cleanly shutdown 1st server (live)
5. MDB failovers to 2nd server (backup)
6. Wait for MDB to finish processing and read all messages from OutQueue. 
7. Check number of send and received messages. 

Problem occurred in step 5. MDB did not receive any new messages from backup after failover. I can see following warnings in log of 2nd EAP (backup) server:

14:13:31,320 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1) Exception in thread "hornetq-discovery-group-thread-dg-group1" java.lang.InternalError: unhandled utf8 byte 0
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.utils.UTF8Util.readUTF(UTF8Util.java:164)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readUTF(ChannelBufferWrapper.java:105)
14:13:31,323 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readStringInternal(ChannelBufferWrapper.java:95)
14:13:31,329 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.buffers.impl.ChannelBufferWrapper.readString(ChannelBufferWrapper.java:77)
14:13:31,330 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at org.hornetq.core.cluster.DiscoveryGroup$DiscoveryRunnable.run(DiscoveryGroup.java:303)
14:13:31,331 ERROR [stderr] (hornetq-discovery-group-thread-dg-group1)  at java.lang.Thread.run(Thread.java:745)

...

14:13:32,771 WARN  [org.hornetq.core.server] (Thread-20 (HornetQ-server-HornetQServerImpl::serverUUID=bf00dd22-69d6-11e4-8bd5-8513819ff1c5-266116125)) HQ222015: Internal error! Delivery logic has identified a non delivery and still handled a consumer!

Attaching logs-mdb-failover.zip with info and trace logs from servers. 

Because this issue breaks HA, I'm increasing severity and setting blocker ?.

Comment 4 Miroslav Novak 2014-11-12 09:33:53 UTC

Created attachment 956655 [details]
logs-mdb-failover.zip

Comment 5 Clebert Suconic 2014-11-20 14:35:31 UTC

Couldn't this being caused by the fact we compiled the latest release with Java8?

Comment 6 Clebert Suconic 2014-11-20 14:45:02 UTC

We should be more resilient to failures on the DiscvoeryRunnable. Any exceptions would interrupt the loop as identified by Miroslav. Miro was spot on the issue.


The fix here will be simple:


while (started)
{
   try
   {
   }
   catch (Thrwoable e) // do the exception treatment inside the while, don't interrupt it if any exception happened
   {
        // logging
   }
}



I'm not sure how to test this though.. .we would need a ByteMan test interrupting the send and running it in a loop

Comment 8 Clebert Suconic 2014-11-20 19:22:07 UTC

I have a fix for this one already.. leave it with me

Comment 9 Clebert Suconic 2014-11-20 20:27:41 UTC

PR Sent: https://github.com/hornetq/hornetq/pull/1958

Comment 10 Kabir Khan 2014-11-24 12:52:05 UTC

Solved by HQ 2.3.24.Final upgrade

Comment 11 Miroslav Novak 2014-11-28 12:12:46 UTC

During EAP 6.4.0.DR11 testing cycle this issue was not hit. 

Still I'll not set this as verified and check again with DR12 to have some confidence that issue is gone and there are no more problems.

Comment 12 Rostislav Svoboda 2014-12-02 12:00:35 UTC

Moving to DR12 to have it in priority filter

Comment 13 Miroslav Novak 2014-12-05 11:09:51 UTC

No related issue was found during EAP 6.4.0.DR12 testing. Setting as verified.