Bug 1169418

Summary:	[QA](6.4.z) Calling last session.commit() does not get a response and throws "javax.jms.JMSException: HQ119014: Timed out waiting for response when sending packet 43" to client
Product:	[JBoss] JBoss Enterprise Application Platform 6	Reporter:	Miroslav Novak <mnovak>
Component:	HornetQ	Assignee:	Yong Hao Gao <hgao>
Status:	CLOSED WONTFIX	QA Contact:	Miroslav Novak <mnovak>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.4.0	CC:	ataylor, bmaxwell, hgao, jbertram, mnovak, msvehla
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-03-01 12:29:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Miroslav Novak 2014-12-01 15:47:44 UTC

Description of problem:

In some failover tests standalone JMS client which is consuming messages from queue does not get a response for session.commit() when last message from queue was received. It happens with dedicated HA topology with replicated journal. 

How reproducible:
- Start 2 EAP 6.4.0.DR11 servers in dedicated HA topology with replicated journal and queue testQueue0
- Start 3 producers and 1 consumer on testQueue0
- Kill 1st EAP (live) server and wait for clients to failover to 2nd server (backup)
- Start 1st EAP server again and wait for clients to failback
- Stop producers
- Wait for consumer to receive all messages
- Compare number of sent and received messages

Problem occurs when consumer gets null for consumer.receive(...) and commits session. For this commit() call it does not get a response and JMSException is thrown:

15:52:31,287 main INFO  [org.jboss.qa.hornetq.apps.clients.QueueClientsTransAck:160] Consumer: Thread[Thread-13,5,main] is still alive.
javax.jms.JMSException: HQ119014: Timed out waiting for response when sending packet 43
	at org.hornetq.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:390)
	at org.hornetq.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:570)
	at org.hornetq.core.client.impl.DelegatingSession.commit(DelegatingSession.java:156)
	at org.hornetq.jms.client.HornetQSession.commit(HornetQSession.java:229)
	at org.jboss.qa.hornetq.apps.clients.ProducerTransAck.commitSession(ProducerTransAck.java:206)
	at org.jboss.qa.hornetq.apps.clients.ProducerTransAck.run(ProducerTransAck.java:99)
Caused by: HornetQException[errorType=CONNECTION_TIMEDOUT message=HQ119014: Timed out waiting for response when sending packet 43]
	... 6 more



We don't have 100% reproducer for problem but can by following test from HQ integration test suite:

Clone our testsuite from git:
git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git

Go to eap-tests-hornetq/scripts and run groovy script PrepareServers.groovy with -DEAP_VERSION=6.4.0.DR11 parameter:
groovy -DEAP_VERSION=6.4.0.DR11 PrepareServers.groovy

(Script will prepare 4 servers - server1..4 in the directory where are you currently standing.)

Export these paths and also mcast addresse.:
export JBOSS_HOME_1=$PWD/server1/jboss-eap
export JBOSS_HOME_2=$PWD/server2/jboss-eap
export JBOSS_HOME_3=$PWD/server3/jboss-eap
export JBOSS_HOME_4=$PWD/server4/jboss-eap
export MCAST_ADDR=235.3.4.5

Go to jboss-hornetq-testsuite/ in our testsuite and run
mvn install -Dtest=ReplicatedColocatedClusterFailoverTestCase#testFailbackTransAckQueue

Actual results:
IndexOutOfBoundsException is thrown in consumer

Expected results:
IndexOutOfBoundsException should not be thrown.

Comment 1 Clebert Suconic 2014-12-01 17:01:10 UTC

Just like with failover or failback, you are expected to treat for errors and retry... right?

Comment 2 Miroslav Novak 2014-12-02 07:41:09 UTC

Clients retry only for JMSException. IndexOutOfBoundsException is not subtype of it, so this exception ends the client.

Comment 3 Miroslav Novak 2015-01-05 11:05:56 UTC

There should be JMSException in Actual and Expected result instead of IndexOutOfBoundsException. I've accidently copied it from bz#1169348.

Problem here is with the last commit which timeouts and JMSException is thrown for every retry.

Comment 4 Justin Bertram 2015-01-06 22:45:53 UTC

I'm confused a bit on this issue.  Clebert and you both seem to agree that the client should retry on a JMSException, and that is the kind of exception which is thrown - a fact confused by a cut/paste mistake. So, is the client retrying when this JMSException is thrown?  If not, why not?  Please clarify.

Comment 5 Miroslav Novak 2015-01-07 08:03:09 UTC

Client is retrying commit() if JMSException is thrown. There are ~5 retries before giving up. However None of those retries is successful.

Comment 6 Miroslav Novak 2015-01-28 11:48:42 UTC

Hi Howard,

I can see the same exception when backup is synchronizing with live. The test scenario is:

1. Start 2 EAP 6.4.0.ER1 servers in dedicated topology with replicated journal
2. Start producer which sends large messages to queue
3. Wait 4 minutes (this is because we need syncing with backup to take a long time > 30s)
4. Start backup server which starts syncing with live (it must take longer call-timeout=30s)

You will see that Producer gets JMSException:

javax.jms.JMSException: HQ119014: Timed out waiting for response when sending packet 43
	at org.hornetq.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:390)
	at org.hornetq.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:570)
	at org.hornetq.core.client.impl.DelegatingSession.commit(DelegatingSession.java:156)
	at org.hornetq.jms.client.HornetQSession.commit(HornetQSession.java:229)
	at org.jboss.qa.hornetq.apps.clients.ProducerTransAck.commitSession(ProducerTransAck.java:210)
	at org.jboss.qa.hornetq.apps.clients.ProducerTransAck.run(ProducerTransAck.java:100)
Caused by: HornetQException[errorType=CONNECTION_TIMEDOUT message=HQ119014: Timed out waiting for response when sending packet 43]

Synchronization timeouts all blocking client calls. When backup is synced and announced then everything returns to normal. 

Is it possible to send packet to client with something like - "wait, i'm syncing with backup now" so client does not throw JMSException.

Comment 7 Miroslav Novak 2015-01-28 11:52:40 UTC

This is possible to hit with test from HQ upstream test suite:

BackupSyncJournalTest#testReserveFileIdValuesOnBackup

when it's run with low call-timeout - add below to setUp() method:

locator.setCallTimeout(5000);


Here it fails when producer tries to send more messages during synchronization. (block-on-durable-send=true)

Comment 8 Yong Hao Gao 2015-01-28 13:00:39 UTC

Hi Mirek,

I'm still debugging it. So far as I observed that something strange happens during the replication. At some time the replication request take more than 30 sec to get back response, and the 30 sec seems not have spent in replication. I'm tracing the netty layer code now and see if I can find more.

Howard

Comment 9 Miroslav Novak 2015-01-28 15:24:16 UTC

Hi Howard,

BackupSyncJournalTest#testReserveFileIdValuesOnBackup is using BackupSyncDelay class to slow down replication between live and backup. Maybe it's not well implemented. Anyway I can see the same behaviour without it. (by the way in step 1. should be started only live server not both)

Thanks,
Mirek