Red Hat Bugzilla – Bug 519476
Invalid accept data sent by Java client after failover.
Last modified: 2010-10-14 12:01:19 EDT
A program intended to reproduce bug 516501 turned up a new bug, possibly a client-side bug in Java failover. It appears that there is a race condition where ack data from an old, disconnected connection is incorrectly sent on a new failed-over connection. The symptom is an error of the form "connfirmed N but sent 0".

The reproducer code is at https://bugzilla.redhat.com/attachment.cgi?id=357364. Here is the description from bug 516501:

Comment #5 From Rajith Attapattu (rattapat@redhat.com) 2009-08-13 15:28:17 EDT

Created an attachment (id=357364)
Reproducer

The attachment contains a JMS-based reproducer. Just untar the package and run the scramble_brokers.sh script. It starts a JMS producer and a JMS consumer that uses ** sync_ack ** in the background, and then rapidly changes the membership of the 4-node cluster to force failover.

I tried with a 2-node cluster to keep things simple, but the probability of the error happening was pretty low. Also, in this case it was hitting a known issue in the JMS client's FailoverExchangeMethod.

The script runs the Java clients with the log level at WARN. You can easily change that in the script to debug, etc. You could also get the brokers to log to a file.

Feel free to modify the tests as you see fit. Please ping me if you make any improvements to the test script so that I can incorporate those changes into my nightly runs.
I am currently unable to reproduce this issue with the latest package set. I even tried with a broker prior to r794736. I have done a fair amount of testing and have yet to see this issue.
Any progress? Is there any known reproducer?
Not that I know of. This issue seems to be fixed, but sadly there is no way of verifying it.
Tested: the bug does not appear on -2, nor on 1.2. We (Rajith and I) were not able to reproduce it anymore. It was probably fixed on the broker side, but nobody knows when. Discussed with Rajith and Alan, and both proposed marking it as verified.

Validated on packages:

# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u
openais-0.80.6-16.el5
openais-debuginfo-0.80.6-16.el5
python-qpid-0.7.917557-4.el5
qpid-cpp-client-0.7.916826-2.el5
qpid-cpp-client-devel-0.7.916826-2.el5
qpid-cpp-client-rdma-0.7.916826-2.el5
qpid-cpp-client-ssl-0.7.916826-2.el5
qpid-cpp-mrg-debuginfo-0.7.916826-2.el5
qpid-cpp-server-0.7.916826-2.el5
qpid-cpp-server-cluster-0.7.916826-2.el5
qpid-cpp-server-devel-0.7.916826-2.el5
qpid-cpp-server-rdma-0.7.916826-2.el5
qpid-cpp-server-ssl-0.7.916826-2.el5
qpid-cpp-server-store-0.7.916826-2.el5
qpid-cpp-server-xml-0.7.916826-2.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.918215-1.el5
qpid-java-common-0.7.918215-1.el5
qpid-tools-0.7.917557-4.el5

->VERIFIED
Tested on RHEL 5.5 i386 / x86_64 and RHEL 4.8 i386 / x86_64.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, the Java client sent invalid accept data after a failover. This was caused by a race condition where data from an old disconnected connection was incorrectly sent to a new failed-over connection. With this update, the Java client no longer sends invalid data after a failover.
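For illustration only, here is a minimal Java sketch of one common way to guard against this kind of race: tag each pending accept with the generation of the connection it was produced on, and discard anything from an earlier generation at failover and at send time. This is not the actual Qpid client fix, which this report does not describe. The names FailoverAckSketch, Session, PendingAccept, queueAccept, failover and flush are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from the Qpid Java client.
public class FailoverAckSketch {

    /** An accept (ack) record tagged with the connection generation it was produced on. */
    static final class PendingAccept {
        final long generation;
        final int commandId;

        PendingAccept(long generation, int commandId) {
            this.generation = generation;
            this.commandId = commandId;
        }
    }

    /** Hypothetical session that refuses to send accepts from an earlier connection. */
    static final class Session {
        private long generation = 0;
        private final List<PendingAccept> pending = new ArrayList<>();

        synchronized long currentGeneration() {
            return generation;
        }

        /** Queue an accept, recording which connection generation produced it. */
        synchronized void queueAccept(long producedOnGeneration, int commandId) {
            pending.add(new PendingAccept(producedOnGeneration, commandId));
        }

        /** Called on failover: bump the generation and drop accepts for the old connection. */
        synchronized void failover() {
            generation++;
            pending.removeIf(a -> a.generation != generation);
        }

        /** Return the command ids that would be sent; stale entries are filtered out again here. */
        synchronized List<Integer> flush() {
            List<Integer> sent = new ArrayList<>();
            for (PendingAccept a : pending) {
                if (a.generation == generation) {
                    sent.add(a.commandId);
                }
            }
            pending.clear();
            return sent;
        }
    }

    public static void main(String[] args) {
        Session session = new Session();
        long oldGeneration = session.currentGeneration();

        session.queueAccept(oldGeneration, 1);               // queued before failover
        session.failover();                                  // connection replaced
        session.queueAccept(oldGeneration, 2);               // late arrival from the old connection
        session.queueAccept(session.currentGeneration(), 3); // produced on the new connection

        System.out.println(session.flush()); // prints [3]
    }
}
```

The second check in flush() covers the case where a thread that was still processing data from the old connection queues an accept after failover has already run, which is the window the race condition above describes.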
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html