Description of problem:
Clients receive duplicated messages in the following test scenario:
- Start 2 live/backup pairs in dedicated topology with replicated journal
--I'll name them Live1, Live2, Backup1 and Backup2
- Deploy queue testQueue0 to all
- Start 2 producers: the first sends messages to testQueue0 on Live1, the second to testQueue0 on Live2
- Start 2 consumers: the first consumes messages from testQueue0 on Live1, the second from testQueue0 on Live2
- Stop Backup1 and Backup2
- Stop Live1 and Live2
- Start Live1 and Live2
- Start Backup1 and Backup2
- Stop the producers and wait for the consumers to receive all messages
Result: Clients received duplicated messages.
Version-Release number of selected component (if applicable):
This issue affects HornetQ in EAP 6.4.7.CP.
# Download patched EAP 6.4.7.CP:
scp jbossqa.4.81:/home/jbossqa/tmp/jboss-eap-6.4.7-patched.zip . #password: jbossqa
# Download test suite and run the test
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
git checkout refactoring_modules
# This Groovy script takes an EAP 6.4.x zip, unzips it into 4 directories and prepares a better "default" config. Change the path to the EAP zip to match your machine.
groovy -DEAP_ZIP_URL=file:///<provide_path_to_downloaded_eap_zip> PrepareServers.groovy
mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testStopLiveAndBackupStartBackupAndLiveInCluster -DfailIfNoTests=false | tee log
Expected results:
There should be no duplicated messages.
Are you sure this is EAP6? Shouldn't this be raised against EAP7?
I reproduced this once using EAP7 at http://messaging-ci-01.mw.lab.eng.bos.redhat.com:8080/view/replication-qa-tests/job/Replication-qa-tests/
I am considering the possibility of an issue in your test, but I haven't ruled out a bug yet.
According to the traces on the ClientConsumer, the messages reported as duplicated were received only once, so the client never received a duplicate. I'm not sure how they could be reported as duplicated in this situation.
I am still investigating, now with more tracing added.
I still need to know whether you really meant EAP6 in this report. It seems it should have been EAP7.
What I see happening is the following:
While the messages are being sent, the producer performs a commit.
The backup is shutting down at that moment. The live server holds its response to the client until it gets the replication response from the backup, which does not arrive in time because the backup is being shut down, so the response never reaches the client.
The commit fails on the client with a timeout, although it has already taken effect in the journal.
13:44:55,877 Thread-34 ERROR [org.jboss.qa.hornetq.apps.clients.Producer11:93] Producer got exception for commit(). Producer counter: 140
javax.jms.JMSException: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
The test client treats this as an error and resends the messages, which had already been recorded.
If we had implemented retries, this would be solved. You could also solve this by using XA (a bit tricky, though, since the Transaction Manager doesn't currently support failover).
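The sequence above can be sketched with a small toy model. This is not HornetQ code: ToyBroker, sendWithRetry and the commit id are hypothetical names made up for illustration, and the deduplicating variant is only one possible reading of how a retried commit could be made safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model, not HornetQ code: every name here is hypothetical.
class CommitTimeoutSketch {

    /** Stand-in for a live server whose journal write succeeds
     *  while the response to the client may be lost. */
    static class ToyBroker {
        final List<String> journal = new ArrayList<>();
        final Set<String> seenCommitIds = new HashSet<>();
        final boolean deduplicate;
        int responsesToDrop;

        ToyBroker(boolean deduplicate, int responsesToDrop) {
            this.deduplicate = deduplicate;
            this.responsesToDrop = responsesToDrop;
        }

        /** Returns true if the client sees a response, false on a "timeout". */
        boolean commit(String commitId, List<String> batch) {
            // With deduplication on, a batch whose id was already journaled is skipped.
            if (!deduplicate || seenCommitIds.add(commitId)) {
                journal.addAll(batch);
            }
            if (responsesToDrop > 0) {
                responsesToDrop--;
                return false; // the journal was written, but the client never hears about it
            }
            return true;
        }
    }

    /** Producer that resends the same batch until it sees a response. */
    static void sendWithRetry(ToyBroker broker, String commitId, List<String> batch) {
        while (!broker.commit(commitId, batch)) {
            // Timeout: the client cannot tell whether the commit was applied, so it resends.
        }
    }

    /** Number of journal entries that are repeats of an earlier entry. */
    static long duplicates(List<String> journal) {
        return journal.size() - new HashSet<>(journal).size();
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("msg-140", "msg-141");

        ToyBroker plain = new ToyBroker(false, 1);
        sendWithRetry(plain, "commit-1", batch);
        System.out.println("without dedup: " + duplicates(plain.journal) + " duplicates");

        ToyBroker dedup = new ToyBroker(true, 1);
        sendWithRetry(dedup, "commit-1", batch);
        System.out.println("with dedup: " + duplicates(dedup.journal) + " duplicates");
    }
}
```

In this model a single lost response is enough for the resent batch to be journaled twice, while the variant that remembers commit ids journals it once.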
I can further look if we could avoid the timeout in certain cases, but from what I see this is working as expected.
Also, be aware that your client is not working properly with snapshots. I had to use -Deap=7x-dev
@Mnovak please take some time to read these messages thoroughly. Ping me on IRC whenever you want.
This issue was reported against EAP 7 first - https://issues.jboss.org/browse/JBEAP-4742
This scenario simulates how an administrator would update all servers: stop all servers in the cluster, update the configuration (which need not be related to messaging), and start all servers again so that the new configuration takes effect.
The problem is that the live servers cannot be shut down first, because failover would occur and the backups would end up with the most up-to-date journal. If a backup is then started, it does not activate and waits for its live. But when the live is started, it replicates its old journal to the backup, and the backup moves its up-to-date journal aside. The old journal from before the live was shut down would then be used. So the only way to do this is to shut down the backups first and then the lives, so that no failover occurs.
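The effect of the shutdown order can be illustrated with a toy model of a single live/backup pair. It only encodes the behaviour described in this comment; it is not HornetQ code, the Pair class and its methods are hypothetical, and journals are reduced to message counters.

```java
// Toy model of one live/backup pair. Not HornetQ code; every name is hypothetical.
class ShutdownOrderSketch {

    static class Pair {
        int liveJournal = 0;          // messages recorded in the live's journal
        int backupJournal = 0;        // messages recorded in the backup's journal
        int accepted = 0;             // messages any server accepted from producers
        boolean liveUp = true;
        boolean backupUp = true;
        boolean backupActive = false; // true once the backup has failed over

        void send() {
            if (liveUp) {
                liveJournal++;
                if (backupUp) backupJournal++; // replicated to the backup
            } else if (backupUp && backupActive) {
                backupJournal++;               // only the backup sees this message
            } else {
                throw new IllegalStateException("no active server");
            }
            accepted++;
        }

        void stopLive() {
            liveUp = false;
            if (backupUp) backupActive = true; // failover
        }

        void stopBackup() {
            backupUp = false;
            backupActive = false;
        }

        void startBackup() {
            backupUp = true;                   // does not activate, waits for its live
            if (liveUp) backupJournal = liveJournal;
        }

        void startLive() {
            liveUp = true;
            backupActive = false;
            // The live replicates its own journal; the backup's copy is moved aside.
            if (backupUp) backupJournal = liveJournal;
        }
    }

    /** Messages accepted before the restart that are missing from the journal used afterwards. */
    static int lostMessages(boolean stopBackupFirst) {
        Pair p = new Pair();
        p.send(); p.send(); p.send();
        if (stopBackupFirst) {
            p.stopBackup();
            p.send();      // the live keeps serving producers
            p.stopLive();
        } else {
            p.stopLive();  // failover to the backup
            p.send();      // the backup serves producers
            p.stopBackup();
        }
        p.startBackup();
        p.startLive();
        return p.accepted - p.liveJournal;
    }

    public static void main(String[] args) {
        System.out.println("backup stopped first, lost: " + lostMessages(true));
        System.out.println("live stopped first, lost: " + lostMessages(false));
    }
}
```

In this model, stopping the backup first loses nothing, whereas stopping the live first leaves the message accepted after failover only in the backup's journal, which is moved aside on restart.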
This is normal admin operation.
I'll experiment with raising the clients' default call-timeout. This should help avoid the situation where the producer times out before the connection between live and backup is considered dead.
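The reasoning behind a higher call-timeout can be written down as a one-line comparison. This is a sketch under a stated assumption: the live answers the commit as soon as it gives up on the stopped backup. The 30,000 ms value comes from the log above; the 60,000 ms detection time is purely hypothetical, as are all names in the code.

```java
// Toy timing model, not HornetQ code; all names and the detection time are hypothetical.
class CallTimeoutSketch {

    enum Outcome { ACKNOWLEDGED, CLIENT_TIMEOUT }

    /**
     * @param callTimeoutMs             how long the client waits for the commit response
     * @param backupDeclaredDeadAfterMs assumed time the live needs to give up on the
     *                                  stopped backup and answer the client
     */
    static Outcome commit(long callTimeoutMs, long backupDeclaredDeadAfterMs) {
        return backupDeclaredDeadAfterMs < callTimeoutMs
                ? Outcome.ACKNOWLEDGED
                : Outcome.CLIENT_TIMEOUT;
    }

    public static void main(String[] args) {
        long assumedDetectionMs = 60_000; // hypothetical value, for illustration only
        System.out.println("30 s call timeout:  " + commit(30_000, assumedDetectionMs));
        System.out.println("120 s call timeout: " + commit(120_000, assumedDetectionMs));
    }
}
```

Under that assumption, only a call-timeout longer than the detection time lets the producer see the commit acknowledged instead of timing out and resending.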
I understand that XA could solve the problem, but this is not supported, as we do not state which Transaction Manager we support it with.
As this is a normal customer scenario, I believe it should be possible to do this without XA.
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-4742 to Resolved
Verified with EAP 6.4.10.CP.CR2
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.