Bug 1340465 - [GSS](6.4.z) Duplicate messages in replicated HA topology
Summary: [GSS](6.4.z) Duplicate messages in replicated HA topology
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: CR1
Target Release: EAP 6.4.10
Assignee: Clebert Suconic
QA Contact: Peter Mackay
URL:
Whiteboard:
Depends On:
Blocks: eap6410-payload 1344476 1348237
 
Reported: 2016-05-27 13:19 UTC by Miroslav Novak
Modified: 2017-01-17 13:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-17 13:02:17 UTC
Type: Bug




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | JBEAP-4742 | 0 | Critical | Verified | (7.0.z) Duplicate messages in replicated HA topology when backup is shutdowned | 2017-10-16 19:54:22 UTC
Red Hat Knowledge Base (Article) | 2490121 | 0 | None | None | None | 2016-08-09 14:32:05 UTC

Description Miroslav Novak 2016-05-27 13:19:32 UTC
Description of problem:
Duplicate messages appear in the following test scenario:
    - Start 2 live/backup pairs in a dedicated topology with a replicated journal
      (named Live1, Live2, Backup1 and Backup2 below)
    - Deploy queue testQueue0 to all servers
    - Start 2 producers sending to testQueue0: the first to Live1, the second to Live2
    - Start 2 consumers reading from testQueue0: the first from Live1, the second from Live2
    - Stop Backup1 and Backup2
    - Stop Live1 and Live2
    - Start Live1 and Live2
    - Start Backup1 and Backup2
    - Stop the producers and wait for the consumers to receive all messages

Result: Clients received duplicated messages.
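
A minimal sketch of how duplicates could be detected on the consumer side, assuming each message carries a unique ID set by the producer. The class DuplicateCheck and the ID values are hypothetical; the check actually used by the test suite is not shown in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of eap-tests-hornetq.
class DuplicateCheck {

    // Returns every ID that was received more than once, each listed a single
    // time, in the order in which its first repeat was seen.
    static List<String> findDuplicates(List<String> receivedIds) {
        Set<String> seen = new HashSet<>();
        Set<String> reported = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String id : receivedIds) {
            // Set.add() returns false when the element is already present.
            if (!seen.add(id) && reported.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up IDs: "msg-2" arrives twice.
        List<String> received = Arrays.asList("msg-1", "msg-2", "msg-3", "msg-2");
        System.out.println("Duplicates: " + findDuplicates(received));
    }
}
```

An empty result means every received ID was unique.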

Version-Release number of selected component (if applicable):
This issue affects HornetQ in EAP 6.4.7.CP.

How reproducible:
# Download the patched EAP 6.4.7.CP:
scp jbossqa@10.40.4.81:/home/jbossqa/tmp/jboss-eap-6.4.7-patched.zip . #password: jbossqa

# Download test suite and run the test
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
git checkout refactoring_modules
# This Groovy script unzips the EAP 6.4.x zip into 4 directories and prepares an improved "default" configuration; change the path to the EAP zip for your machine
groovy -DEAP_ZIP_URL=file:///<provide_path_to_downloaded_eap_zip> PrepareServers.groovy

export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
cd ../jboss-hornetq-testsuite/

mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testStopLiveAndBackupStartBackupAndLiveInCluster -DfailIfNoTests=false  | tee log


Expected results:
There should be no duplicated messages.

Comment 1 Clebert Suconic 2016-05-31 17:24:45 UTC
Are you sure this is EAP6? Shouldn't this be raised against EAP7?

Comment 2 Clebert Suconic 2016-05-31 22:24:18 UTC
I reproduced this once using EAP7 at http://messaging-ci-01.mw.lab.eng.bos.redhat.com:8080/view/replication-qa-tests/job/Replication-qa-tests/


I am considering the possibility of an issue in your test, but I haven't ruled out a bug yet.

The messages reported as duplicates were received only once according to the traces on the ClientConsumer, so the client never received them twice. I'm not sure how they could be reported as duplicates in this situation.


I am still investigating after I added more tracing:


https://github.com/apache/activemq-artemis/pull/547

Comment 3 Clebert Suconic 2016-05-31 22:24:48 UTC
I still need to know if you really meant EAP6 in this report. It seems it should have been EAP7.

Comment 4 Clebert Suconic 2016-06-01 00:24:35 UTC
What I see happening is the following:


When the messages are sent, a commit is performed.

The backup is shutting down at that moment. The live server holds its response to the client until it gets the replication response from the backup, which does not arrive in time because the backup is being shut down. As a result, the response to the client is never delivered.


The commit fails with a timeout; however, it has already taken effect in the journal.


13:44:55,877 Thread-34 ERROR [org.jboss.qa.hornetq.apps.clients.Producer11:93] Producer got exception for commit(). Producer counter: 140
javax.jms.JMSException: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
        at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:398)
        at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:304)
        at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.simpleCommit(ActiveMQSessionContext.java:295)



The test client treats this as an error and resends the messages, which were already recorded.


If we had implemented retries, this would be solved. You could also solve this by using XA (a bit tricky, though, since the Transaction Manager doesn't currently support failover).



I can look further into whether we could avoid the timeout in certain cases, but from what I see this is working as expected.
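
The sequence described in this comment can be sketched with a toy model: a commit is written to the journal, the reply is lost because the backup never answers, and the client resends the same batch. Everything here is hypothetical (ToyBroker, CommitTimedOut, the message IDs); it is not HornetQ code. The optional ID check is only one general way to make a resend harmless, and it is not presented as the fix applied for this bug.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: class and method names are hypothetical, not HornetQ APIs.
class CommitTimeoutDemo {

    static class CommitTimedOut extends Exception {
        CommitTimedOut(String message) { super(message); }
    }

    static class ToyBroker {
        final List<String> journal = new ArrayList<>();
        private final Set<String> knownIds = new HashSet<>();
        private final boolean detectDuplicates;
        boolean replicationStalled = false;

        ToyBroker(boolean detectDuplicates) { this.detectDuplicates = detectDuplicates; }

        void commit(List<String> batch) throws CommitTimedOut {
            for (String id : batch) {
                // With detection on, an ID that is already journaled is skipped.
                if (!detectDuplicates || knownIds.add(id)) {
                    journal.add(id);
                }
            }
            // The journal write above has already happened; only the reply is lost.
            if (replicationStalled) {
                throw new CommitTimedOut("no replication response from backup");
            }
        }
    }

    // Commits one batch while the backup is going down, then resends the same
    // batch once after the timeout, as the test client in this report does.
    static List<String> sendWithRetry(ToyBroker broker, List<String> batch) {
        broker.replicationStalled = true;       // backup is shutting down
        try {
            broker.commit(batch);
        } catch (CommitTimedOut firstFailure) {
            broker.replicationStalled = false;  // live now runs without the backup
            try {
                broker.commit(batch);           // client resends the same messages
            } catch (CommitTimedOut secondFailure) {
                throw new IllegalStateException(secondFailure);
            }
        }
        return broker.journal;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("m1", "m2");
        System.out.println("without ID check: " + sendWithRetry(new ToyBroker(false), batch));
        System.out.println("with ID check:    " + sendWithRetry(new ToyBroker(true), batch));
    }
}
```

Without the ID check the journal ends up holding the batch twice, which matches the duplicates seen by the consumers. With the check, the resend is absorbed.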

Comment 5 Clebert Suconic 2016-06-01 00:25:49 UTC
Also, be aware that your client is not working properly with snapshots. I had to use -Deap=7x-dev


@Mnovak please take some time to read these messages thoroughly. Ping me on IRC whenever you want.

Comment 6 Miroslav Novak 2016-06-01 06:06:53 UTC
This issue was reported against EAP 7 first - https://issues.jboss.org/browse/JBEAP-4742

This scenario simulates how an administrator would update all servers: stop all servers in the cluster, update the configuration (which need not be related to messaging), and start all servers again so that the new configuration takes effect.

The problem is that the administrator cannot shut down the live servers first, because failover would occur and the backups would then hold the most up-to-date journal. If a backup is started afterwards, it does not activate and waits for its live server. But when the live server is started, it replicates its old journal to the backup, and the backup moves its up-to-date journal aside. The old journal, from before the live server was shut down, would then be used. So the only way to do this is to shut down the backups first and then the live servers, so that no failover occurs.

This is a normal admin operation.

I'll experiment with raising the default call-timeout for clients. This should help avoid the situation where the producer times out before the connection between live and backup is considered dead.

I understand that XA could solve the problem, but this is not supported, as we do not state which Transaction Manager we support it with.

As this is a normal customer scenario, I believe it should be possible to do that without XA.
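
The call-timeout idea in this comment comes down to a comparison of two durations, sketched below. The helper name and the 60,000 ms detection time are assumptions made for illustration; only the 30,000 ms figure comes from the AMQ119014 message in comment 4.

```java
// Hypothetical helper for illustration; not a HornetQ API.
class CallTimeoutCheck {

    // True when the producer's blocking call would give up before the live
    // server stops waiting for the backup.
    static boolean producerTimesOutFirst(long callTimeoutMs, long backupDeadAfterMs) {
        return callTimeoutMs < backupDeadAfterMs;
    }

    public static void main(String[] args) {
        long backupDeadAfterMs = 60_000; // ASSUMPTION: illustrative value only
        long defaultCallTimeoutMs = 30_000; // timeout reported in comment 4
        long raisedCallTimeoutMs = 90_000;  // ASSUMPTION: an example of a higher setting

        System.out.println("default call-timeout fails first: "
                + producerTimesOutFirst(defaultCallTimeoutMs, backupDeadAfterMs));
        System.out.println("raised call-timeout fails first:  "
                + producerTimesOutFirst(raisedCallTimeoutMs, backupDeadAfterMs));
    }
}
```

Under these assumed numbers, the default setting lets the producer time out while the live server is still waiting for the backup, and the raised one does not.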

Comment 7 JBoss JIRA Server 2016-07-18 08:53:53 UTC
Bartosz Baranowski <bbaranow@redhat.com> updated the status of jira JBEAP-4742 to Resolved

Comment 8 Peter Mackay 2016-08-24 14:27:00 UTC
Verified with EAP 6.4.10.CP.CR2

Comment 9 Petr Penicka 2017-01-17 13:02:17 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 10 Petr Penicka 2017-01-17 13:03:01 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

