Bug 1340465
Summary: | [GSS](6.4.z) Duplicate messages in replicated HA topology | ||
---|---|---|---|
Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Miroslav Novak <mnovak> |
Component: | HornetQ | Assignee: | Clebert Suconic <csuconic> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Peter Mackay <pmackay> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.4.7 | CC: | bmaxwell, csuconic, fgavrilo, jtruhlar, mnovak, msochure, msvehla, pmackay, tom.ross, toross |
Target Milestone: | CR1 | ||
Target Release: | EAP 6.4.10 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2017-01-17 13:02:17 UTC | Type: | Bug |
Bug Blocks: | 1339868, 1344476, 1348237 |
Description
Miroslav Novak
2016-05-27 13:19:32 UTC
Are you sure this is EAP6? Shouldn't this be raised on EAP7?

I replicated this using EAP7 once at http://messaging-ci-01.mw.lab.eng.bos.redhat.com:8080/view/replication-qa-tests/job/Replication-qa-tests/

I am considering the possibility of an issue in your test, but I haven't ruled out bugs yet.

The messages reported as duplicated were received only once according to the traces on the ClientConsumer, so the client never received them duplicated. I'm not sure how they could be reported as duplicated in this situation.

I am still investigating after adding more tracing: https://github.com/apache/activemq-artemis/pull/547

I still need to know if you really meant EAP6 in this report. It seems it should have been EAP7.

What I see happening is the following:

On message sending, a commit is being done. The backup is shutting down, and the server holds its response to the client until the replication response arrives. That response will not arrive in time because the backup is being shut down, which breaks the response towards the client. The commit fails with a timeout, but it has already taken effect on the journal.

13:44:55,877 Thread-34 ERROR [org.jboss.qa.hornetq.apps.clients.Producer11:93] Producer got exception for commit(). Producer counter: 140
javax.jms.JMSException: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
	at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:398)
	at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:304)
	at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.simpleCommit(ActiveMQSessionContext.java:295)

The test client treats this as an error and retries sending the message that was already recorded. If we had implemented retries this would be solved. You could also solve this by using XA (a bit tricky though, since the Transaction Manager doesn't currently support failover).
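The failure mode described above can be illustrated with a small toy model. This is only a sketch, not HornetQ or Artemis code: `ToyBroker`, `commit`, `dropNextResponse` and `run` are hypothetical names invented for the illustration. It assumes the commit is written to the journal before the response is sent, and that the client resends whenever it does not get a response. The second run deduplicates on a message ID as one generic way to make the resend harmless; that is an assumption of the sketch, not something proposed in this report.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not HornetQ/Artemis code) of a commit that is durable but whose response is lost. */
public class CommitTimeoutSketch {

    /** Hypothetical stand-in for the live server and its journal. */
    static class ToyBroker {
        final List<String> journal = new ArrayList<>();
        final Set<String> seenIds = new HashSet<>();
        final boolean detectDuplicates;
        boolean dropNextResponse;

        ToyBroker(boolean detectDuplicates) {
            this.detectDuplicates = detectDuplicates;
        }

        /** Returns true if the commit was acknowledged; false models the client-side call timeout. */
        boolean commit(String messageId) {
            // The write reaches the journal before any response is sent.
            if (!detectDuplicates || seenIds.add(messageId)) {
                journal.add(messageId);
            }
            if (dropNextResponse) {
                dropNextResponse = false;
                return false; // client sees a timeout although the journal already has the message
            }
            return true;
        }
    }

    /** Producer that resends on a failed commit, like the test client in this report. */
    static List<String> run(boolean detectDuplicates) {
        ToyBroker broker = new ToyBroker(detectDuplicates);
        for (int i = 0; i < 3; i++) {
            String id = "msg-" + i;
            if (i == 1) {
                broker.dropNextResponse = true; // models the backup shutting down during this commit
            }
            while (!broker.commit(id)) {
                // The client cannot tell whether the commit took effect, so it retries.
            }
        }
        return broker.journal;
    }

    public static void main(String[] args) {
        System.out.println("no duplicate detection:   " + run(false));
        System.out.println("with duplicate detection: " + run(true));
    }
}
```

Without deduplication the journal ends up with `msg-1` twice, which matches the duplicate seen by the test; with it, the resend is ignored.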
I can look further into whether we could avoid the timeout in certain cases, but from what I see this is working as expected.

Also, be aware that your client is not working properly with snapshots. I had to use -Deap=7x-dev

@Mnovak please take some time to read these messages thoroughly. Ping me on IRC whenever you want.

This issue was reported against EAP 7 first - https://issues.jboss.org/browse/JBEAP-4742

This scenario simulates how an administrator should update all servers. He needs to stop all servers in the cluster, update the configuration (which does not have to be related to messaging), and start all servers again so the new configuration takes effect. The problem is that he cannot shut down the live servers first, because failover would occur and the backups would have the most up-to-date journal. If a backup is then started, it does not activate and waits for its live. But when the live is started, it replicates its old journal to the backup, and the backup moves aside its up-to-date journal. The old journal from before the shutdown of the live would be used. So the only way to do this is to shut down the backup first and then the live, so that no failover occurs. This is a normal admin operation.

I'll try changing the default call-timeout for clients to a higher value. This should help avoid the situation where the producer times out before the connection between live and backup is considered dead.

I understand that XA could solve the problem, but this is not supported as we do not say with which Transaction Manager we support it. As this is a normal customer scenario, I believe it should be possible to do this without XA.

Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-4742 to Resolved

Verified with EAP 6.4.10.CP.CR2

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.