Bug 1340465

Summary: [GSS](6.4.z) Duplicate messages in replicated HA topology
Product: [JBoss] JBoss Enterprise Application Platform 6
Component: HornetQ
Version: 6.4.7
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Miroslav Novak <mnovak>
Assignee: Clebert Suconic <csuconic>
QA Contact: Peter Mackay <pmackay>
CC: bmaxwell, csuconic, fgavrilo, jtruhlar, mnovak, msochure, msvehla, pmackay, tom.ross, toross
Target Milestone: CR1
Target Release: EAP 6.4.10
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-01-17 13:02:17 UTC
Bug Blocks: 1339868, 1344476, 1348237

Description Miroslav Novak 2016-05-27 13:19:32 UTC
Description of problem:
Duplicate messages appear in the following test scenario:
    - Start 2 live/backup pairs in dedicated topology with replicated journal
      (I'll name them Live1, Live2, Backup1 and Backup2)
    - Deploy queue testQueue0 to all servers
    - Start 2 producers sending to testQueue0, the first to Live1 and the second to Live2
    - Start 2 consumers reading from testQueue0, the first from Live1 and the second from Live2
    - Stop Backup1 and Backup2
    - Stop Live1 and Live2
    - Start Live1 and Live2
    - Start Backup1 and Backup2
    - Stop the producers and wait for the consumers to receive all messages

Result: Clients received duplicated messages.

Version-Release number of selected component (if applicable):
This issue affects HornetQ in EAP 6.4.7.CP.

How reproducible:
# Download the patched EAP 6.4.7.CP:
scp jbossqa.4.81:/home/jbossqa/tmp/jboss-eap-6.4.7-patched.zip . #password: jbossqa

# Download test suite and run the test
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
git checkout refactoring_modules
# This groovy script unzips the EAP 6.4.x zip into 4 directories and prepares a better "default" config; change the path to the EAP zip for your machine
groovy -DEAP_ZIP_URL=file:///<provide_path_to_downloaded_eap_zip> PrepareServers.groovy

export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
cd ../jboss-hornetq-testsuite/

mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testStopLiveAndBackupStartBackupAndLiveInCluster -DfailIfNoTests=false  | tee log


Expected results:
There should be no duplicated messages.

Comment 1 Clebert Suconic 2016-05-31 17:24:45 UTC
Are you sure this is EAP6? Shouldn't this be raised on EAP7?

Comment 2 Clebert Suconic 2016-05-31 22:24:18 UTC
I replicated this using EAP7 once at http://messaging-ci-01.mw.lab.eng.bos.redhat.com:8080/view/replication-qa-tests/job/Replication-qa-tests/


I am considering the possibility of an issue in your test, but I haven't ruled out a bug yet.

The messages showing as duplicated were only received once according to the traces on the ClientConsumer, so the client never received them twice. I'm not sure how they could be reported as duplicated in this situation.


I am still investigating after I added more tracing:


https://github.com/apache/activemq-artemis/pull/547

Comment 3 Clebert Suconic 2016-05-31 22:24:48 UTC
I still need to know if you really meant EAP6 on this report. It seems it should have been EAP7.

Comment 4 Clebert Suconic 2016-06-01 00:24:35 UTC
What I see happening is the following:


When the message is sent, a commit is performed.

The backup is shutting down. The server holds its response to the client until it gets the replication response, which does not arrive in time because the backup is being shut down, so the response to the client is never delivered.

The commit fails with a timeout; however, it has already taken effect on the journal.


13:44:55,877 Thread-34 ERROR [org.jboss.qa.hornetq.apps.clients.Producer11:93] Producer got exception for commit(). Producer counter: 140
javax.jms.JMSException: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
        at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:398)
        at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:304)
        at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.simpleCommit(ActiveMQSessionContext.java:295)



The test client treats this as an error and resends the message, which was already recorded.

If we had implemented retries, this would be solved. You could also solve this by using XA (a bit tricky, though, since the Transaction Manager doesn't currently support failover).

I can look further into whether we could avoid the timeout in certain cases, but from what I see this is working as expected.
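
The sequence described in this comment can be sketched as a small simulation. This is an illustrative toy model, not HornetQ code: `Broker`, `produce`, `CommitTimeout`, and the `dedup` flag are hypothetical names, and message 140 is borrowed from the producer counter in the log above. The commit is journaled before the response is lost, so a producer that resends on timeout stores the message twice. The `dedup` variant shows one way a resend could be recognised if the server remembered message IDs; whether the actual fix works that way is not stated in this report.

```python
class CommitTimeout(Exception):
    """Raised when the client gives up waiting for the commit response."""


class Broker:
    """Toy live server: journals the commit, then may lose the response."""

    def __init__(self, lose_response_for=(), dedup=False):
        self.journal = []                       # messages durably committed
        self.lose_response_for = set(lose_response_for)
        self.dedup = dedup                      # hypothetical duplicate-ID cache switch
        self.seen_ids = set()

    def commit(self, msg_id, attempt):
        if self.dedup and msg_id in self.seen_ids:
            return                              # resend recognised and ignored
        self.journal.append(msg_id)             # the commit takes effect here...
        self.seen_ids.add(msg_id)
        if (msg_id, attempt) in self.lose_response_for:
            raise CommitTimeout(msg_id)         # ...but the client never hears about it


def produce(broker, msg_ids, max_attempts=3):
    """Producer that resends a message whenever commit() times out."""
    for msg_id in msg_ids:
        for attempt in range(1, max_attempts + 1):
            try:
                broker.commit(msg_id, attempt)
                break
            except CommitTimeout:
                continue                        # producer assumes nothing was stored


# The response to the first commit of message 140 is lost.
plain = Broker(lose_response_for={(140, 1)})
produce(plain, [139, 140, 141])
print(plain.journal)      # [139, 140, 140, 141]

guarded = Broker(lose_response_for={(140, 1)}, dedup=True)
produce(guarded, [139, 140, 141])
print(guarded.journal)    # [139, 140, 141]
```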

Comment 5 Clebert Suconic 2016-06-01 00:25:49 UTC
Also, be aware that your client is not working properly with snapshots. I had to use -Deap=7x-dev


@Mnovak please take some time to read these messages thoroughly. Ping me on IRC whenever you want.

Comment 6 Miroslav Novak 2016-06-01 06:06:53 UTC
This issue was reported against EAP 7 first - https://issues.jboss.org/browse/JBEAP-4742

This scenario simulates how an administrator should update all servers. He needs to stop all servers in the cluster, update the configuration (which does not have to be related to messaging) and start all servers again so the new configuration takes effect.

The problem is that he cannot shut down the live servers first, because failover would occur and the backups would then have the most up-to-date journal. If a backup is then started, it does not activate and waits for its live. But when the live is started, it replicates its old journal to the backup, and the backup moves its up-to-date journal aside. The old journal from before the shutdown of the live would be used. So the only way to do this is to shut down the backups first and then the lives, so that no failover occurs.

This is a normal admin operation.
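
The ordering argument can be made concrete with a toy model of one live/backup pair. It is a sketch under simplifying assumptions: journals are plain lists, `full_restart` and the message names are hypothetical, and the only behaviour modelled is what this comment describes (failover when the live stops first, and the restarted live replicating its own journal over the backup's).

```python
def full_restart(stop_order):
    """Return (journal in use after restart, messages left in the moved-aside journal).

    stop_order is "backup-first" or "live-first". Values are illustrative.
    """
    live = ["m1", "m2"]
    backup = list(live)             # replicated copy while both servers are up

    if stop_order == "live-first":
        # Stopping the live triggers failover: the backup activates and
        # accepts new messages, so its journal moves ahead of the live's.
        backup.append("m3")
    elif stop_order != "backup-first":
        raise ValueError(stop_order)
    # With "backup-first" no failover happens and the live stops while
    # holding the latest journal.

    # On restart the live activates with its own journal and replicates it;
    # the backup moves its previous journal aside.
    moved_aside = backup
    backup = list(live)
    lost = [m for m in moved_aside if m not in live]
    return backup, lost


print(full_restart("backup-first"))   # (['m1', 'm2'], [])
print(full_restart("live-first"))     # (['m1', 'm2'], ['m3'])
```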

I'll try raising the default call-timeout for clients. This should help avoid the situation where the producer times out before the connection between live and backup is considered dead.
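
The timing relationship behind this idea can be written down as a tiny sketch. `commit_outcome` and `backup_declared_dead_after_ms` are hypothetical names; the 30,000 ms figure comes from the timeout in the log above, while the 60,000 ms detection time is a made-up value for illustration only.

```python
def commit_outcome(call_timeout_ms, backup_declared_dead_after_ms):
    """What a blocked commit() sees while the backup is shutting down.

    The live server holds the commit response until replication answers or
    until it decides the backup connection is dead (assumed model).
    """
    if call_timeout_ms <= backup_declared_dead_after_ms:
        return "timeout"       # client gives up first and may resend
    return "committed"         # live drops the backup first and replies


# 60_000 ms is an invented detection time, used only for illustration.
print(commit_outcome(30_000, 60_000))   # timeout
print(commit_outcome(90_000, 60_000))   # committed
```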

I understand that XA could solve the problem, but this is not supported, as we do not state which Transaction Manager we support it with.

As this is a normal customer scenario, I believe it should be possible to do this without XA.

Comment 7 JBoss JIRA Server 2016-07-18 08:53:53 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-4742 to Resolved

Comment 8 Peter Mackay 2016-08-24 14:27:00 UTC
Verified with EAP 6.4.10.CP.CR2

Comment 9 Petr Penicka 2017-01-17 13:02:17 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 10 Petr Penicka 2017-01-17 13:03:01 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.